Issue with find and replace apostrophe( ' ) in a Word Docx using OpenXML and Regex

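This happens because they are different characters: the straight apostrophe (') in your pattern is not the same character as the curly apostrophe (’) that Word stores.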

Word automatically replaces some punctuation characters as you type them (for example, turning a straight apostrophe into a curly one) to give them the right slant and improve presentation.

I ran into the very same issue before and used this regular expression: [\u2018\u2019\u201A\u201B\u2032']

So essentially modify your code to:

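// No backslash before '[': an escaped bracket is matched literally instead of opening a character class.
Regex apostropheReplace = new Regex("s[\u2018\u2019\u201A\u201B\u2032']s");
docText = apostropheReplace.Replace(docText, "s'");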

I found these to be the five most common types of curly single quotes and apostrophes in use.
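
As a rough illustration of the difference, the sketch below runs a straight-only pattern and the character-class pattern over the same sample string. The sample text and variable names are made up for the example; only the character class comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern that only knows the straight (typewriter) apostrophe.
var straightOnly = new Regex("s's");
// Pattern covering the straight apostrophe plus the variants listed above.
var anyApostrophe = new Regex("s['\u2018\u2019\u201A\u201B\u2032]s");

// Hypothetical sample: one curly apostrophe (U+2019) and one straight apostrophe.
string sample = "James\u2019s book and Chris's pen";

string straightResult = straightOnly.Replace(sample, "s'");
string classResult = anyApostrophe.Replace(sample, "s'");

Console.WriteLine(straightResult); // the curly apostrophe in "James’s" is left untouched
Console.WriteLine(classResult);    // both possessives are rewritten
```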

And in case you come across the same issue with double quotes, here is what you can use: [\u201C\u201D\u201E\u201F\u2033\u2036\"]
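
A minimal sketch of how that double-quote class might be used to normalize text to the straight double quote. The sample string and variable names are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Curly, low and reversed double quotes plus the double-prime variants listed above.
var anyDoubleQuote = new Regex("[\u201C\u201D\u201E\u201F\u2033\u2036]");

// Hypothetical sample wrapped in curly double quotes (U+201C, U+201D).
string sample = "\u201CHello,\u201D she said";

// Replace every variant with the straight double quote (U+0022).
string normalized = anyDoubleQuote.Replace(sample, "\"");
Console.WriteLine(normalized);
```

If docText holds raw document XML rather than plain text, a straight double quote that lands inside an attribute value would break the markup, so this is safest when applied to text content only.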

Answering the question:

Is there a way to do it so that both characters work?

If you want one Regex to be able to handle both scenarios, this is perhaps a simple and readable solution:

Regex apostropheReplace = new Regex("s['’]s");
docText = apostropheReplace.Replace(docText, "s'");

This has the added benefit of making it obvious to other developers that you are deliberately covering both apostrophe cases. This benefit gets at the other part of your question:

If using the copied character from Word is the proper way of doing this?

That depends on what you mean by "proper". If you mean "most understandable to other developers," I'd say yes, because there would be the least amount of look-up needed to know exactly what your Regex is looking for. If you mean "most performant", that should not be an issue with this straightforward Regex search (some nice Regex performance tips can be found here).

If you mean "most versatile/robust single quote Regex", then as @Leonardo-Seccia points out, there are other quote characters that might cause trouble. (Some of the common Microsoft Word ones are listed here.) Such a solution might look like this:

Regex apostropheReplace =
    new Regex("s['\u2018\u2019\u201A\u201B]s");
docText = apostropheReplace.Replace(docText, "s'");

But you can certainly add other characters as needed. A more complete list of quote characters and their code points can be found here. To add one to the above Regex, replace the "U+" prefix of its code point with "\u" and append it inside the square brackets. For example, to add the "prime" symbol (′ or U+2032) to the list above, change the Regex string from

Regex("s['\u2018\u2019\u201A\u201B]s")

to

Regex("s['\u2018\u2019\u201A\u201B\u2032]s")
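
If the list of code points grows, the same "U+" conversion could be done in code instead of by hand. The following is only a sketch: `BuildCharClass` is a hypothetical helper, not part of any library, and it assumes every code point fits in a single `char` and is not a character that has special meaning inside a character class.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Code points written in the "U+XXXX" notation used by Unicode charts.
string[] codePoints = { "U+0027", "U+2018", "U+2019", "U+201A", "U+201B", "U+2032" };

// Hypothetical helper: parses the hex digits after "U+" and wraps the characters in [ ].
// Assumes BMP code points that are not ']', '^', '-' or '\'.
string BuildCharClass(string[] points) =>
    "[" + new string(points
        .Select(p => (char)int.Parse(p.Substring(2), NumberStyles.HexNumber, CultureInfo.InvariantCulture))
        .ToArray()) + "]";

var apostropheReplace = new Regex("s" + BuildCharClass(codePoints) + "s");

// Hypothetical sample using the prime symbol (U+2032) as an apostrophe.
string result = apostropheReplace.Replace("the boss\u2032s desk", "s'");
Console.WriteLine(result);
```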

Ultimately, you are the judge of which characters are most "proper" to include in your Regex, based on your use cases.
