Character classes and CultureInfo

  • Question

  • Some of the character classes for regular expressions, such as \w ( \W ), \s ( \S ), and \d ( \D ), appear to depend on the culture. For instance, a digit as represented in one culture could be different from a digit as represented in another culture. Ditto for the characters representing a word or a space.

    Yet there seems to be no way to make the character classes of regular expressions depend on .NET CultureInfo. How do regular expressions resolve the culture from which to decide what these character classes entail?
    Monday, May 4, 2009 8:56 PM

All replies

  • I can't answer your question about CultureInfo, but here is a link on the Character Classes for Regex.

    http://msdn.microsoft.com/en-us/library/20bw873z.aspx

    You will see that \d translates to \p{Nd} for Unicode and [0-9] for non-Unicode (ECMAScript) behavior.  So Regex appears not to be culturally aware, but it is Unicode aware.  The Unicode standard provides properties that indicate whether a character is numeric, whitespace, and so forth.  I would expect CultureInfo to determine which code page to use, and Regex to then interpret based upon the code page.
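
    As a rough illustration of that distinction, the sketch below runs the same strings against \d with and without RegexOptions.ECMAScript. The names DigitClassDemo and AllDigits are made up for this example, and the Arabic-Indic digits (U+0660 to U+0669) are chosen only because they belong to the Unicode Nd category.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class DigitClassDemo
    {
        // True if the whole input consists of characters matched by \d
        // under the given options. (Hypothetical helper for this sketch.)
        public static bool AllDigits(string input, RegexOptions options)
        {
            return Regex.IsMatch(input, @"^\d+$", options);
        }

        public static void Main()
        {
            string ascii = "2009";
            string arabicIndic = "\u0662\u0660\u0660\u0669"; // ARABIC-INDIC DIGITS 2, 0, 0, 9

            // Default behavior: \d is the Unicode category Nd.
            Console.WriteLine(AllDigits(ascii, RegexOptions.None));             // True
            Console.WriteLine(AllDigits(arabicIndic, RegexOptions.None));       // True

            // ECMAScript behavior: \d is restricted to [0-9].
            Console.WriteLine(AllDigits(arabicIndic, RegexOptions.ECMAScript)); // False
        }
    }
    ```

    If only the ASCII digits are wanted, writing [0-9] explicitly in the pattern has the same effect without changing the other character classes.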
    Les Potter, Xalnix Corporation, Yet Another C# Blog
    Tuesday, May 5, 2009 12:58 PM
  • I am aware that the character class uses Unicode definitions of what digits, whitespace, letters etc. are.

    The problem is that I do not see how the CultureInfo is established for Regex in order to choose the code page. It is not in the constructor for the Regex class nor in any of the Regex member functions. Therefore it appears to be hardcoded, probably to the current thread's CurrentCulture value. This is not a very flexible system for regular expressions. If I need to use Regex against a particular CultureInfo, I cannot do so.

    But perhaps there is something about Regex I have missed. Was it really Microsoft's intention to have regular expressions work only against the current thread's current culture?
    Tuesday, May 5, 2009 1:48 PM
  • I am going to backtrack here from my previous reply.

    If, as the documentation explains, Regex strictly uses Unicode to determine what is a digit, space, alphabetic character, etc., then clearly Regex does not take CultureInfo into account, other than to say that any culture's notion of a digit, space, or alphabetic character matches Regex's idea, in the character classes, of what a digit, space, and word character are. While this makes Regex culture-neutral, it does not allow regular expression pattern matching for a particular culture ( CultureInfo ).

    As an example, take the digits ( 0 - 9 ). These may correspond not only to the '0' - '9' characters in the ASCII range but also to Unicode characters from any language that has a different set of characters representing the digits. Suppose I am searching for a digit using the \d notation, and my string contains a character that represents the digit 0 in some other language, let's say Swahili. That language has nothing to do with the current culture of the current thread, let's say US English, in which the character would not correspond to a '0' in any way. Regex will still report a match for a digit, because the Unicode database designates that character as a digit in the code ranges that encompass Swahili. So even though my current culture has no notion of that character as a '0', Regex determines it to be a digit just because it is marked as one in the Unicode database.

    This does not seem as if it can be right, but perhaps that is how Microsoft designed regular expressions in .NET. Is there a flaw in my thinking, and does CultureInfo actually play a part in Regex character class determination?
    Tuesday, May 5, 2009 4:07 PM
  • If you have a document in English and a Regex pattern that finds certain things in that document, it would be incorrect behavior to get different results on the same document simply because the Current CultureInfo changed.  The practice I've heard suggested is to use localization techniques for your Regex patterns.  I.e., have a different pattern (if needed) for different cultures.

    But, I haven't any real world experience with Regex in internationalized applications, so please keep searching for the answer and post back.  I'd be very interested to hear how this is done and the recommended best practices.


    Les Potter, Xalnix Corporation, Yet Another C# Blog
    Wednesday, May 6, 2009 10:52 AM
  • If Regex had the ability to set the CultureInfo I want to use for character classes, there would be much less of a problem.

    Character classes should allow one to use the same Regex without having to have a different pattern for every culture. But the way that Microsoft appears to have designed character classes for .NET regular expressions, to depend strictly on Unicode properties, makes character classes almost completely useless, since no cultural information is being used at all and the notion of a character class appears to cover every culture in the world within the Unicode properties database.

    I do not think that design for regular expression character classes can possibly be right.
    Wednesday, May 6, 2009 7:37 PM
  • Can you propose a specific example where you expect a culture-specific result from Regex?

    The notion of "word" character set and word boundaries can be quite vague, implicit, or idiomatic in some languages/writing systems, if you're using the intuitive sense of "word". For regex, there is a uniform definition for "word" and "word boundary", but they do not necessarily work for some languages. IIRC Thai is a good example: determining actual word boundaries (such as for word wrap) requires a dictionary, which of course the regex engine does not have.

    Is this what you're getting at?

    The RE engine uses the static Unicode character properties for classifications, and it is clearly documented. This makes the behavior uniform, predictable, and consistent.  I for one would not want the meaning of a regex to change depending on the whim of the current thread's locale.
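
    One way to check this for yourself is the sketch below, which runs the same \d test while the current thread's culture is switched. The names CultureIndependenceDemo and ContainsDigitUnder are invented for the example, and it assumes the listed cultures are available on the machine running it.

    ```csharp
    using System;
    using System.Globalization;
    using System.Text.RegularExpressions;
    using System.Threading;

    public static class CultureIndependenceDemo
    {
        // Runs the \d test with the current thread's culture temporarily
        // set to cultureName, then restores the previous culture.
        public static bool ContainsDigitUnder(string cultureName, string input)
        {
            CultureInfo saved = Thread.CurrentThread.CurrentCulture;
            try
            {
                Thread.CurrentThread.CurrentCulture = new CultureInfo(cultureName);
                return Regex.IsMatch(input, @"\d");
            }
            finally
            {
                Thread.CurrentThread.CurrentCulture = saved;
            }
        }

        public static void Main()
        {
            string input = "Room \u0663"; // ends with ARABIC-INDIC DIGIT THREE

            // The character class is resolved from Unicode categories,
            // so every culture in the list reports the same result.
            foreach (string name in new[] { "en-US", "ar-SA", "th-TH" })
            {
                Console.WriteLine(name + ": " + ContainsDigitUnder(name, input)); // True each time
            }
        }
    }
    ```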

    I don't know why you wouldn't match Arabic digits if you want digits, even if the thread locale isn't an Arabic one. After all, we use Arabic digits here in the US. We (English speakers) took the Latin/Roman script for our text, but not the Roman digits -- we use the superior Arabic ones.

    The topic of national digits is an interesting one. Whether or not a match for some type of digit is correct for your usage depends entirely on your application and its intended usage. In one sense a number is a number, no matter what script it's written in, or the culture of the context it appears in. A digit is a digit, whatever the script. A number written in Swahili is still a number. It doesn't mean something else just because I can only read English (or my thread locale is en-US).

    It depends on what you're trying to express in your regex.
    Wednesday, May 27, 2009 6:28 AM
  • If I am searching for digits in a Unicode document written in English, and the Swahili notion of any digit occurs somehow in that document, is it correct that the Swahili notion of a digit be found when I am using character classes? I do not think so. Now you may say "What is the Swahili digit sequence doing in a Unicode document written in English?" and you have a point as to its likelihood being very small in my hypothetical case, but I still feel as if I should be able to limit the notion of "digit" specified by using character classes to a particular culture if I choose to do so.
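
    Regex has no CultureInfo parameter for character classes, but one possible workaround is to build the digit pattern from the culture yourself. The sketch below is only an illustration of that idea: CultureDigits and DigitPattern are hypothetical names, and it relies on NumberFormatInfo.NativeDigits, whose contents come from the culture data of the runtime and operating system.

    ```csharp
    using System;
    using System.Globalization;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class CultureDigits
    {
        // Builds a pattern that matches exactly one of the native digits
        // listed for the given culture. (Hypothetical helper for this sketch.)
        public static string DigitPattern(CultureInfo culture)
        {
            string[] digits = culture.NumberFormat.NativeDigits;
            // Alternation is used instead of [...] in case a digit is more
            // than one UTF-16 code unit long.
            return "(?:" + string.Join("|", digits.Select(d => Regex.Escape(d))) + ")";
        }

        public static void Main()
        {
            // The invariant culture lists "0" through "9" as its native digits.
            string pattern = DigitPattern(CultureInfo.InvariantCulture);

            Console.WriteLine(Regex.IsMatch("3", pattern));      // True
            Console.WriteLine(Regex.IsMatch("\u0663", pattern)); // False: not a native digit here
            Console.WriteLine(Regex.IsMatch("\u0663", @"\d"));   // True: \d covers all of Nd
        }
    }
    ```

    Passing a different CultureInfo changes the pattern to whatever native digits that culture's data lists, so the same calling code can be restricted to one culture's notion of a digit.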
    Wednesday, June 10, 2009 1:58 PM