[Help!] Why can't my regular expression match the text?

  • Question

  • The regular expression:
    (?:(?<=[ ]|^)([-+]?(?:[\d]+(?:\.[\d]*)?|\.[\d]+)(?:[eE][-+]?[\d]+)?)(?=[ ]|$)){4}

    The text:
    94.2848E12    167.7924      0.0000      90.000

    The target text contains several numbers, separated by one or more spaces.

    Without the suffix {4}, my regular expression successfully matches each number. However, with the suffix added, nothing matches.
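
The behaviour described above can be reproduced with a minimal Python sketch. One assumption to note: Python's re module only accepts fixed-width lookbehind, so (?<=[ ]|^) is rewritten here as (?:^|(?<=[ ])), which is intended to mean the same thing. The names NUM, ONE and FOUR are illustrative only.

```python
import re

# Number part of the pattern from the question ([\d] simplified to \d).
NUM = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"

# Assumption: (?<=[ ]|^) is written as (?:^|(?<=[ ])) because Python's re
# rejects lookbehind alternatives of different widths.
ONE = rf"(?:^|(?<=[ ]))({NUM})(?=[ ]|$)"
FOUR = rf"(?:{ONE}){{4}}"

text = "94.2848E12    167.7924      0.0000      90.000"

numbers = re.findall(ONE, text)    # each number is found on its own
repeated = re.search(FOUR, text)   # four back-to-back repetitions are not found

print(numbers)
print(repeated)
```

The repeated form fails because every repetition after the first would have to start directly after the previous number, where the next character is a space that nothing in the group consumes.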

    My goal is to find the lines that contain 4 numbers (some lines may contain 5 numbers, others 3 or none, and some lines also contain words). I have tried for two weeks but still cannot figure out a way.

    Please help me!

    Monday, April 1, 2013 12:23 PM

All replies

  • So if I understand correctly, you want to match a line containing exactly 4 floats separated by one or more spaces. Looking at your expression, it seems you're checking for the start/end of the string inside the repeated group and requiring that to happen 4 times, which wouldn't work.

    I believe this will match floating point values: [+-]?\d+(\.\d+)?(E[+-]?\d+)?

    Combining that with spaces between the numbers, requiring 4 and being the entire line gives you this:

    ^([+-]?\d+(\.\d+)?(E[+-]?\d+)?\s*){4}$

    Note that you should set the option to ignore case so that you match E10 and e10.

    Michael Taylor - 4/1/2013
    http://msmvps.com/blogs/p3net

    Monday, April 1, 2013 1:56 PM
  • Thank you very much for your reply, CoolDadTx

    I think I didn't make the question clear enough.

    Your regular expression requires that the line contain 4 and only 4 float numbers. The first one must be at the start of the line, and the last one must be at the end of the line or followed only by spaces. Also, because of the \s*, this expression cannot ensure there is at least 1 space between the numbers.

    However, my text line may look like the one below, which your expression cannot match:
    94.28480       163.75000      0.00000        90.00000       TOP       PLACED        
    Besides, the line may not start with a number, as below:
    HOLE138P       35.00000       255.70000      0.00000        0.00000
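
A small Python sketch of these objections, using the suggested expression unchanged with case-insensitive matching. The sample "1234" is an added assumption, chosen only to show that \s* allows zero spaces between "numbers".

```python
import re

# The anchored expression suggested in the previous reply.
SUGGESTED = re.compile(r"^([+-]?\d+(\.\d+)?(E[+-]?\d+)?\s*){4}$", re.IGNORECASE)

samples = {
    "plain": "94.2848E12    167.7924      0.0000      90.000",
    "trailing_text": "94.28480       163.75000      0.00000        90.00000       TOP       PLACED        ",
    "leading_text": "HOLE138P       35.00000       255.70000      0.00000        0.00000",
    "no_spaces": "1234",  # hypothetical input: four digits, no separators at all
}

# True where the expression accepts the line.
results = {name: SUGGESTED.search(line) is not None for name, line in samples.items()}
print(results)
```

Only the plain line and the separator-free "1234" are accepted; the two lines with extra fields are rejected by the ^ and $ anchors.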

    The only rule is: 4 numbers, separated by 1 or more spaces from each other and from other fields. So in my regular expression I used the lookbehind (?<=[ ]|^) and the lookahead (?=[ ]|$) to make sure the character before the number is a space or the line start, and the character after the number is a space or the line end. The reason to use lookaround is that it does not consume the character at the current position. If there is only 1 space between 2 numbers, as in "  35.00000 255.70000 ", matching the space would consume it, and the next number would then fail to match because the space is already part of the previous match. Everything works as designed until the repeat count {4} is added.

    It just can't match once I add the repeat count {4}! That's truly a headache O(>_<)O.


    • Edited by WWWFdisk Tuesday, April 2, 2013 2:30 AM Make more clear.
    Tuesday, April 2, 2013 1:55 AM
  • Remove the head and tail checks and it'll match 4 numbers anywhere in the string.  Whether there is text before or after wouldn't matter.  Lookahead is complicated and I believe it is very much overkill for your problem.  You could get more precise by requiring either the start of the line or one or more spaces before the first number and after the last to filter out any odd cases like

    TEXT123 456 789 012ABC
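
As a rough Python illustration, assuming "head and tail checks" refers to the ^ and $ anchors of the earlier expression: without them, the odd case above is accepted, and so is a run of digits inside a word. The input "HOLE1384P" is a made-up example.

```python
import re

# Earlier expression with ^ and $ removed (groups made non-capturing).
LOOSE = re.compile(r"(?:[+-]?\d+(?:\.\d+)?(?:E[+-]?\d+)?\s*){4}", re.IGNORECASE)

odd = LOOSE.search("TEXT123 456 789 012ABC")
glued = LOOSE.search("HOLE1384P")  # hypothetical: digits embedded in a label

print(odd.group(0))    # four "numbers" taken out of the middle of the text
print(glued.group(0))  # four single digits, since \s* also allows zero spaces
```

Requiring a space or line boundary around each number, as suggested, is what rules these cases out.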

    Tuesday, April 2, 2013 2:15 AM
  • Repeat count {4} means four continuous repeats, not just four matches.

    Here's the logic in your example: the parser takes the first number (94.2848E12), which matches the pattern; the lookbehind is valid (line beginning) and the lookahead is also valid (space character). Then the next repetition starts at the character right after the captured string. Since the lookahead character is excluded from the capture, that is the first space after the number, and a space cannot start another match of your pattern.

    So the simplest solution is to include spaces in your pattern:

    (?:(?<=\s|^)((?:\s*)[-+]?(?:[\d]+(?:\.[\d]*)?|\.[\d]+)(?:[eE][-+]?[\d]+)?(?:\s*))(?=\s|$)){4}

    But if you use the matched value somewhere, you'll need to trim it, as the spaces will also be captured.
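
A minimal Python sketch of this version, assuming the lookbehind is rewritten as (?:^|(?<=\s)) to satisfy Python's fixed-width lookbehind rule. It also shows why trimming is needed: the captured value keeps its surrounding whitespace. Note that Python keeps only the last repetition of a repeated group.

```python
import re

NUM = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"

# Assumption: (?<=\s|^) is expressed as (?:^|(?<=\s)) for Python's re.
WITH_SPACES = re.compile(rf"(?:(?:^|(?<=\s))((?:\s*){NUM}(?:\s*))(?=\s|$)){{4}}")

line = "94.2848E12    167.7924      0.0000      90.000"
m = WITH_SPACES.search(line)

whole = m.group(0)              # all four numbers plus the spaces between them
last_raw = m.group(1)           # last repetition of the group, still padded
last_number = last_raw.strip()  # trimmed value

print(repr(whole))
print(repr(last_raw), repr(last_number))
```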

    Tuesday, April 2, 2013 2:57 AM
  • Thank you very much. Alex Skalozub.

    But if only 1 space is between 2 numbers, your expression will not match.

    And I still do not quite understand what you said above: "Since lookahead value is excluded from capture, it will be the first space after number, and it doesn't match your pattern anymore."

    My understanding is, the next match will start from the space where the lookahead just looked.
    HOLE138P       35.00000       255.70000      0.00000        0.00000
                                          ^
                              Look ahead here, and start next search also at here.
    Why "it doesn't match your pattern anymore"? Thanks.

    Tuesday, April 2, 2013 3:30 AM
  • Hi, CoolDadTx. Do you mean this?
    (?:[ ]|^)(?:([-+]?(?:[\d]+(?:\.[\d]*)?|\.[\d]+)(?:[eE][-+]?[\d]+)?)(?:[ ]+|$)){4}
    Yes, it can do the work.

    But if I use this as the regular expression, it will not be a general one. By "general" I mean that this expression cannot be combined with other expressions, because the (?:[ ]|^) head is not part of the repeated portion. This is the reason why I use lookaround. Of course, if there is no other solution in the end, I will have to use this one and handle the head part manually...:-(.
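
For reference, a short Python sketch of this lookaround-free expression; it needs no adaptation because it contains no lookbehind. The variable names are illustrative, and the second sample line is the odd case mentioned earlier in the thread.

```python
import re

# Expression from this post: one leading space-or-line-start, then four
# numbers that are each followed by spaces or the end of the line.
GUARDED = re.compile(
    r"(?:[ ]|^)(?:([-+]?(?:[\d]+(?:\.[\d]*)?|\.[\d]+)(?:[eE][-+]?[\d]+)?)(?:[ ]+|$)){4}"
)

with_label = GUARDED.search("HOLE138P       35.00000       255.70000      0.00000        0.00000")
odd_case = GUARDED.search("TEXT123 456 789 012ABC")

# Python keeps only the last repetition of a repeated group, so the numbers
# are split out of the whole match instead.
fields = with_label.group(0).split()

print(fields)
print(odd_case)
```

The match for the first line begins with the single space consumed by the (?:[ ]|^) head, which is the part that sits outside the repetition.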


    • Edited by WWWFdisk Tuesday, April 2, 2013 3:52 AM Be more polite.
    Tuesday, April 2, 2013 3:51 AM
  • Oh, I forgot about * greediness, so it should be

    (?:(?<=\s|^)((?:\s*?)[-+]?(?:[\d]+(?:\.[\d]*)?|\.[\d]+)(?:[eE][-+]?[\d]+)?)(?=\s|$)(?:\s*)){4}

    This matches even with a single space between numbers.

    Yes, your understanding is correct: match continues from the first space after the number, that's what I've said earlier. But in your initial pattern a space is never matched, so your expression fails.

    There's a lot of lookahead/lookbehind internals nicely explained here: http://www.regular-expressions.info/lookaround.html

    Also, if it was me, I'd use something like that:

    (?:\b([-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:e[-+]?\d+)?)\b(?:\s+|$)){4} (with ignore-case mode)
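
A hedged Python comparison of the earlier and the revised pattern on single-spaced input. Both lookbehinds are rewritten as (?:^|(?<=\s)) for Python's re, and the input "1 2 3 4" is an invented minimal case.

```python
import re

NUM = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"

# Earlier version: trailing \s* sits inside the group, before the lookahead.
EARLIER = re.compile(rf"(?:(?:^|(?<=\s))((?:\s*){NUM}(?:\s*))(?=\s|$)){{4}}")
# Revised version: lazy leading \s*?, trailing \s* moved after the lookahead.
REVISED = re.compile(rf"(?:(?:^|(?<=\s))((?:\s*?){NUM})(?=\s|$)(?:\s*)){{4}}")

single_spaced = "1 2 3 4"  # hypothetical input with exactly one space between numbers

earlier_hit = EARLIER.search(single_spaced)
revised_hit = REVISED.search(single_spaced)

print(earlier_hit)
print(revised_hit.group(0))
```

In the revised pattern the space checked by the lookahead is consumed afterwards, so the next repetition starts right behind a whitespace character and its lookbehind succeeds.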

    • Edited by Alex Skalozub Tuesday, April 2, 2013 4:19 AM
    • Marked as answer by WWWFdisk Tuesday, April 2, 2013 6:18 AM
    Tuesday, April 2, 2013 4:04 AM
  • Yes, the first one is OK, except that it matches the extra spaces before the end of the preceding line: the trailing spaces of the line above are included in the match for the next line.
    94.28480       163.75000      0.00000        90.00000       TOP       PLACED         <---Some spaces here
    35.00000       255.70000      0.00000        0.00000

    I have to say, \s may not be so convenient to use, since it also matches line breaks; using [ ] to represent the space is safer. So I changed the expression to:
    (?:(?<=[ ]|^)(?:[ ]*?)([-+]?(?:[\d]+(?:\.[\d]*)?|\.[\d]+)(?:[eE][-+]?[\d]+)?)(?=[ ]|$)(?:[ ]*?)){4}
    This works just as I wish. Not a single extra space is included.
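
A Python sketch of the difference on the two sample lines, assuming the whole text is searched at once in multiline mode and with the lookbehinds rewritten as (?:^|(?<=...)) for Python's re. The names ANY_WS and SPACES_ONLY are illustrative.

```python
import re

NUM = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"

# \s-based pattern from the previous reply.
ANY_WS = re.compile(
    rf"(?:(?:^|(?<=\s))((?:\s*?){NUM})(?=\s|$)(?:\s*)){{4}}", re.MULTILINE
)
# [ ]-based pattern from this post.
SPACES_ONLY = re.compile(
    rf"(?:(?:^|(?<=[ ]))(?:[ ]*?)({NUM})(?=[ ]|$)(?:[ ]*?)){{4}}", re.MULTILINE
)

text = (
    "94.28480       163.75000      0.00000        90.00000       TOP       PLACED        \n"
    "35.00000       255.70000      0.00000        0.00000"
)

ws_matches = [m.group(0) for m in ANY_WS.finditer(text)]
sp_matches = [m.group(0) for m in SPACES_ONLY.finditer(text)]

for label, found in (("\\s", ws_matches), ("[ ]", sp_matches)):
    print(label, [repr(s) for s in found])
```

With \s, the second match starts inside the trailing spaces of the first line and runs across the line break; with [ ], each match on these two lines starts and ends on a number.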

    The second one you'd like to use is not quite OK: it cannot match the line below, so \b may not be so convenient to use either. In fact, the minus sign [-] causes the problem: \b needs a word character on one side, but both the minus sign and the space before it are non-word characters, so \b fails in front of a negative number:
    HOLE138P       35.00000       255.70000      -0.00000        0.0000
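
The \b problem can be checked with a short Python sketch. The \b expression is used as posted with case-insensitive matching; the [ ]-based one has its lookbehind rewritten as (?:^|(?<=[ ])) for Python's re. The "unsigned" line is the earlier sample without a minus sign.

```python
import re

NUM = r"[-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][-+]?\d+)?"

WORD_BOUNDARY = re.compile(
    r"(?:\b([-+]?(?:\d+(?:\.\d*)?|\.\d+)(?:e[-+]?\d+)?)\b(?:\s+|$)){4}",
    re.IGNORECASE,
)
SPACE_LOOKAROUND = re.compile(
    rf"(?:(?:^|(?<=[ ]))(?:[ ]*?)({NUM})(?=[ ]|$)(?:[ ]*?)){{4}}"
)

unsigned = "HOLE138P       35.00000       255.70000      0.00000        0.00000"
signed = "HOLE138P       35.00000       255.70000      -0.00000        0.0000"

wb_unsigned = WORD_BOUNDARY.search(unsigned)
wb_signed = WORD_BOUNDARY.search(signed)      # \b cannot hold between " " and "-"
sl_signed = SPACE_LOOKAROUND.search(signed)

print(wb_unsigned is not None, wb_signed is not None, sl_signed is not None)
```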




    • Marked as answer by WWWFdisk Tuesday, April 2, 2013 10:05 AM
    • Edited by WWWFdisk Tuesday, April 2, 2013 11:01 AM Modify the expression
    Tuesday, April 2, 2013 6:42 AM