Looking for regular expression to parse a string (extract number without leading zero)

  • Question

  • User1428246847 posted

    I need to parse a simple string that ends with '_nn' (nn being 00 to 99, with a leading zero for numbers below 10), e.g. abc_01 or abc_56. My current regular expression is:

    ^([A-Za-z0-9]+)_0([0-9]{1})$|^([A-Za-z0-9]+)_([0-9]{2})$

    The problem is that I either get results in the first two groups (and the last two groups are empty) or in the last two groups (and the first two groups are empty). Is there a way to create a regexp that only returns two groups regardless of which part matches?

    //edit: updated title

    Sunday, December 7, 2014 12:50 PM

Answers

  • User1428246847 posted

    Sorry people for not being clear.

    I need to split strings into two groups; the string consists of at least one character followed by an underscore followed by a two digit number (that includes leading zeroes if necessary).

    When I split abc_01, I want to get back abc and 1. When I split abcd123efgh_97, I want to get back abcd123efgh and 97. I use regular expressions with grouping to extract the data. The regular expression given earlier results in:

    |in      | grp1 | grp2 | grp3 | grp4 |
    +--------+------+------+------+------+
    |abc_01  |'abc' | '1'  | ''   | ''   |
    +--------+------+------+------+------+
    |xyz_97  |''    | ''   | 'xyz'| '97' |
    +--------+------+------+------+------+

    Where grp1..4 are the groups that the earlier regular expression returns (actually it also returns a grp0 for the complete match). The single quotes are not part of the result and just for display purposes.

    What I'm asking for is a regexp that always returns the result in grp1 and grp2 independent of the number at the end.

    I have found a regexp that nearly does it, and I'm happy to use it, so from that perspective the thread is solved. The only disadvantage is that it does not limit the number to two digits (so wim_1234567 is accepted although it should be rejected); however, I can work around that. If you can come up with a better one, you're welcome.

    ([a-zA-Z0-9]+)_0*([1-9][0-9]*|0)
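
    One possible way to also enforce the two-digit limit, offered as a sketch rather than as the poster's own solution: a lookahead checks that exactly two digits follow the underscore, and an optional 0 outside the capture group drops the leading zero. The class and method names below (SuffixParser, Split) are made up for illustration.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class SuffixParser
    {
        // (?=[0-9]{2}$) asserts that exactly two digits follow the underscore;
        // 0? then skips one leading zero, so only two capture groups exist.
        private static readonly Regex Pattern =
            new Regex(@"^([A-Za-z0-9]+)_(?=[0-9]{2}$)0?([0-9]+)$");

        // Returns { prefix, number } or null when the input does not match.
        public static string[] Split(string input)
        {
            Match m = Pattern.Match(input);
            return m.Success
                ? new[] { m.Groups[1].Value, m.Groups[2].Value }
                : null;
        }
    }
    ```

    With this pattern, abc_01 gives abc and 1, abcd123efgh_97 gives abcd123efgh and 97, abc_00 gives abc and 0, and wim_1234567 does not match at all.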


    ====================

    I'd like to explain why I use regular expressions.

    I spend a big part of my life writing small tools to process (fixed-width, CSV and XML) text files; processing usually consists of re-formatting fields, recombining (parts of) fields into other fields and moving fields around. I'm also crazy about flexibility and hate hard-coding; therefore the regular expression is stored in a configuration file. If the format of the input string ever changes, it's a matter of changing the regular expression to let the program do what it needs to do instead of modifying the program.

    In the project that I'm working on, a (simplified) configuration entry looks like this:

      <FieldDefinitions>
        <FieldDefinition>
          <FieldName>The first field</FieldName>
          <StartPosition>123</StartPosition>
          <Length>12</Length>
          <inputvalidation></inputvalidation>
          <inputformat>([a-zA-Z0-9]+)_0*([1-9][0-9]*|0)</inputformat>
          <outputformat>{0}@S{1}</outputformat>
          <IsDate>false</IsDate>
        </FieldDefinition>
      </FieldDefinitions>
    

    In this case, the program will read a field in a record (position 123, 12 characters max), parse the information into an array using the inputformat (the regular expression) and write a modified output defined in outputformat. The outputformat is used as the format string in C#'s String.Format, while the array resulting from the parsing is used as the object array passed to String.Format. If the content needs to be organised differently, simply change the outputformat; if the input format changes, change the inputformat (sometimes not so simply, hence the question).
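
    The flow described above might look roughly like the following. This is a minimal sketch, not the actual program: the names FieldFormatter and Reformat are hypothetical, and it assumes that the array passed to String.Format holds only the capture groups (grp1, grp2, ...) without grp0, so that {0} refers to the first group.

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class FieldFormatter
    {
        // inputFormat and outputFormat stand in for the <inputformat> and
        // <outputformat> values read from the configuration file.
        public static string Reformat(string field, string inputFormat, string outputFormat)
        {
            Match m = Regex.Match(field, inputFormat);
            if (!m.Success)
                return null; // assumption: the caller decides how to handle bad input

            // Skip group 0 (the whole match) so that {0} maps to the first capture group.
            object[] parts = m.Groups.Cast<Group>()
                                     .Skip(1)
                                     .Select(g => (object)g.Value)
                                     .ToArray();
            return string.Format(outputFormat, parts);
        }
    }
    ```

    Under those assumptions, calling Reformat with the field abc_01 and the two configuration values shown above yields abc@S1.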

    I hope this explains why I use regular expressions.

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Monday, December 8, 2014 12:07 AM

All replies

  • User-434868552 posted

    @wim sturkenb...    Wim, for me, this is unclear because you need to show at least one example with more text so that we can get a better sense of the data that you are parsing.

    paraphrasing Paul Linton, if you need to use a regular expression to solve a problem, you have two problems.

    i do not know whether this applies to your case, often rather than regex, i choose to use .String methods. http://msdn.microsoft.com/en-us/library/system.string_methods(v=vs.110).aspx TIMTOWTDI

    one primitive approach could be to search for the underscore, inspect the character after this underscore for any of 0...9 and, if found, inspect the following character likewise; if that passes, then determine whether the three characters before the underscore are alphabetic ... if true, you have found a string of the form xxx_nn.
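
    A rough sketch of that primitive approach, under the assumptions that only the first underscore is examined and that "alphabetic" means the ASCII letters a-z and A-Z; the names UnderscoreScan and ContainsPattern are invented for this example:

    ```csharp
    using System;

    public static class UnderscoreScan
    {
        private static bool IsDigit(char c)
        {
            return c >= '0' && c <= '9';
        }

        private static bool IsLetter(char c)
        {
            return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        }

        // True when the first underscore has three letters directly before it
        // and two digits directly after it (the form xxx_nn).
        public static bool ContainsPattern(string text)
        {
            int i = text.IndexOf('_');
            if (i < 3 || i + 2 >= text.Length)
                return false; // no underscore, or not enough characters around it

            return IsDigit(text[i + 1]) && IsDigit(text[i + 2])
                && IsLetter(text[i - 1]) && IsLetter(text[i - 2]) && IsLetter(text[i - 3]);
        }
    }
    ```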

    again, it's hard to suggest an appropriate solution using regex, .String methods, or a combination of both because your data sample is just too small.  FWIW

    http://weblogs.asp.net/gerrylowry/clarity-is-important-both-in-question-and-in-answer 

    Sunday, December 7, 2014 3:06 PM
  • User-821857111 posted

    If all you want to do is extract the number at the end of the string (assumed from the title of your post), the following will do it:

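    var input = "abc_56";
    // Assumes exactly two digits follow the first underscore; Substring or Convert.ToInt32 throws otherwise.
    var number = Convert.ToInt32(input.Substring(input.IndexOf('_') + 1, 2));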

    Sunday, December 7, 2014 4:48 PM
  • User-434868552 posted

    @Mikesdotnett...

    Mike, the O.P. is expecting "two groups", whatever that means:

     "... that only returns two groups regardless of which part matches"

    FWIW, unless the O.P. can guarantee that only 00..99 will be at the end of a string, then if Wim requires an Int32, Int32.TryParse needs to be used.  http://msdn.microsoft.com/en-us/library/f02979c7(v=vs.110).aspx "Int32.TryParse Method (String, Int32)"
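
    As an illustration of the Int32.TryParse suggestion, here is a small sketch. The names TrailingNumber and Parse are hypothetical, and taking the text after the last underscore and rejecting values above 99 are assumptions based on the 00..99 range mentioned in the question.

    ```csharp
    using System;
    using System.Globalization;

    public static class TrailingNumber
    {
        // Returns the number after the last underscore, or null when that text
        // is missing, is not made up of digits only, or exceeds 99.
        public static int? Parse(string input)
        {
            int i = input.LastIndexOf('_');
            if (i < 0)
                return null;

            int value;
            // NumberStyles.None accepts digits only: no sign, no whitespace.
            bool ok = int.TryParse(input.Substring(i + 1), NumberStyles.None,
                                   CultureInfo.InvariantCulture, out value);
            return ok && value <= 99 ? value : (int?)null;
        }
    }
    ```

    Unlike Convert.ToInt32, this does not throw on input such as abc_xy; it simply returns null.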

    However, Wim did not explicitly state that the numeric part of the string is to be converted to Int32.

    For that reason, Wim needs to show at least one example with more text so that we can get a better sense of the data to be parsed imho; if Wim intends to convert the 00..99 part to an integer, imho Wim should also mention that fact.

    Sunday, December 7, 2014 5:39 PM