Regex to extract delimited, repeating groups with double quotes involved

  • Question

  • I am trying to create a regex pattern string to parse a string with the following structure (there will be multiple lines with the same structure):

        I,F:V(,F:V)(,F:V) ... \n

     

    Note:

        The use of parens above is only intended to show the optional-occurring nature of “,F:V”.

        The use of “…” above is only intended to show the multiple-occurring nature of “,F:V”

     

    Where

        I = record id – occurs at beginning of each line once (can be represented by \w+)

        F = field name (can be represented by \w+)

        V = field value (gets tricky)

        \n = newline character at end of each line

     

    “,F:V” occurs at least once on each line and can repeat any number of times (1+) per line.

     

    Given the following string:

        start,id:925fe1,next:e~7f0866

        cform,id:7f0866,fid:OD_Start2,ver:3,next:e~24591d

     

    I have the basis of the pattern that generates the exact grouping I am looking for:

        (\w+,)*(\w+:)*([^,:\n]+)

     

    Match#  Full Match       Group 1  Group 2  Group 3
    1       start,id:925fe1  start,   id:      925fe1
    2       next:e~7f0866    (none)   next:    e~7f0866
    3       cform,id:7f0866  cform,   id:      7f0866
    4       fid:OD_Start2    (none)   fid:     OD_Start2
    5       ver:3            (none)   ver:     3
    6       next:e~24591d    (none)   next:    e~24591d
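
The grouping above can be reproduced with a short script. This is a minimal sketch that assumes Python's `re` module (the question does not name an engine), so a group that takes no part in a match is reported as `None`:

```python
import re

# Pattern copied from the question:
# group 1 = record id, group 2 = field name, group 3 = field value.
PATTERN = re.compile(r'(\w+,)*(\w+:)*([^,:\n]+)')

text = (
    "start,id:925fe1,next:e~7f0866\n"
    "cform,id:7f0866,fid:OD_Start2,ver:3,next:e~24591d\n"
)

# One row per match: (full match, group 1, group 2, group 3).
rows = [(m.group(0), m.group(1), m.group(2), m.group(3))
        for m in PATTERN.finditer(text)]

for number, row in enumerate(rows, 1):
    print(number, *row)
```

Group 1 is only populated for the first match on each line, because the record id is the only token followed directly by a comma.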

     

    I can deal with “,” and “:” being appended to the match groups, but ideally, I’d like to not have those appended, which I am able to accomplish with the following:

     

        (\w+)(?:,(\w+):([^,:\n]+))*
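
One caveat with this form: the F and V groups sit inside a repeated group, and some engines keep only the last repetition. Python's `re` is one of them. The sketch below assumes Python and uses a hypothetical helper, `parse_line`, that reads the record id first and then collects every `,F:V` pair with a second pattern:

```python
import re

LINE = "cform,id:7f0866,fid:OD_Start2,ver:3,next:e~24591d"

# The single-pattern form from the question.
WHOLE = re.compile(r'(\w+)(?:,(\w+):([^,:\n]+))*')
m = WHOLE.match(LINE)
# In Python, groups 2 and 3 hold only the final ",F:V" repetition.
last_pair = (m.group(2), m.group(3))

# Two-step alternative (an assumption, not part of the original question).
RECORD_ID = re.compile(r'(\w+)')
PAIR = re.compile(r',(\w+):([^,:\n]+)')

def parse_line(line):
    """Return (record_id, [(field, value), ...]) for one line."""
    head = RECORD_ID.match(line)
    if head is None:
        return None, []
    # Start scanning for pairs right after the record id.
    return head.group(1), PAIR.findall(line, head.end())

print(last_pair)
print(parse_line(LINE))
```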

      

    The problem:

    The value of ‘V’ can be a double-quoted string that may contain commas, colons, or nested double quotes. I need ‘V’ to be everything between the two outer double quotes (with or without the quotes themselves).

     

    The following line needs to resolve as follows:

    exit,id:24591d,desc:"some:"" text, more",lo:y

     

    Match#  Full Match                     Group 1  Group 2  Group 3
    1       exit,id:24591d                 exit,    id:      24591d
    2       desc:"some:"" text, more"      (none)   desc:    "some:"" text, more"
    3       lo:y                           (none)   lo:      y

     

    I am not very experienced with regex, so this is beyond my abilities to figure out.  I don't understand the more complex features of regular expressions.  I was playing with the following to deal with the double-quoted portion, but have not been able to incorporate it into the main pattern to make it work as a whole.  And also, it does not deal with nested double-quotes.

        ".*?[,:].*?"

    Thanks in advance to any help or insight you can provide!
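
To make the nested-quote limitation concrete, here is a small sketch, assuming Python's `re` module and assuming the nested quotes in the sample line are plain doubled quotes (`""`). The lazy pattern stops at the first quote it can reach, so the value is split in two:

```python
import re

# The quoted-value attempt from the question.
QUOTED = re.compile(r'".*?[,:].*?"')

line = 'exit,id:24591d,desc:"some:"" text, more",lo:y'

# Every non-overlapping match, left to right.
found = QUOTED.findall(line)
print(found)
```

The first match ends at the first half of the doubled quote, and the second match starts at the other half, so neither piece is the full value.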

    • Edited by JeremyRuth Saturday, January 12, 2019 8:22 PM
    Saturday, January 12, 2019 7:57 PM

All replies

  • Check this expression:

       (\w+,)*(\w+:)*("(?:[^"]|"")*"|[^,:\n]+)

    Saturday, January 12, 2019 8:29 PM
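
A quick check of this expression against the sample line from the question. This is a sketch that assumes Python's `re` module and assumes the nested quotes in that line are plain doubled quotes:

```python
import re

# Expression from the reply above: a quoted value may contain anything
# except a lone double quote; a doubled quote ("") stays inside the value.
PATTERN = re.compile(r'(\w+,)*(\w+:)*("(?:[^"]|"")*"|[^,:\n]+)')

line = 'exit,id:24591d,desc:"some:"" text, more",lo:y'

# One row per match: (full match, group 1, group 2, group 3).
rows = [(m.group(0), m.group(1), m.group(2), m.group(3))
        for m in PATTERN.finditer(line)]

for row in rows:
    print(row)
```

The quoted alternative is listed first, so a value that opens with a double quote is consumed through its closing quote before the unquoted alternative gets a chance.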
  • Viorel_,

    Thank you very much! This gets me a lot closer, but a few cases of 'V' still break it. In simplest terms, I need everything between the two outermost quotes to be grouped as one string, no matter what lies in between.

    Take the following example (sorry, it's a lot, but it's a real-world example) line:

    more,t:V_ReportBodyString,s:"TrimRight(V_ReportBodyString) +\"<fontsize><8>\"+ \" \" + SubStr(Text(51101),1,40) + \" ========\" + Char(10) + \" \" + Char(10) +\"<fontsize><8>\"+Format(\"\",\"25C\") + ToUpper(Text(51067)) + Format(Format(Val(D_GatBal.PREVIOUS_BALANCE),\"####0.00\"),\"8R\") + Char(10) + \"<fontsize><8>\"+Format(\"\",\"31C\")+ ToUpper(Text(51024)) + Format(Format(Val(D_GatBal.CREDIT),\"####0.00\"),\"8R\") + Char(10) +\"<fontsize><8>\"+ Format(\"\",\"30C\")+ Text(50042) + Format(Format(Val(V_TotInvAmount) + Val(D_GatBal.DEBIT),\"####0.00\"),\"8R\") + Char(10) + Format(\"\",\"29C\")+ ToUpper(Text(51068)) + Format(Format(Val(V_TotInvAmount) + Val(D_GatBal.CURRENT_BALANCE),\"####0.00\"),\"8R\")+ Char(10)"

    There should be one Group 1 match (more,) and two Group 2 matches (t: and s:). The s: match should have a single long Group 3 match that starts right after "s:" and runs to the end of the line.

    Instead, the pattern finds another Group 1 match after the following text: SubStr(Text(51101)

    Saturday, January 12, 2019 11:11 PM
  • If both \" and "" are allowed as escaped quotes inside a value, then try this:

       (\w+,)*(\w+:)*("(?:\\"|""|[^"])*"|[^,:\n]+)

    • Marked as answer by JeremyRuth Sunday, January 13, 2019 5:08 PM
    Sunday, January 13, 2019 7:18 AM
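
The difference between the two suggested expressions shows up on backslash-escaped quotes. The sketch below assumes Python's `re` module and uses a shortened, hypothetical stand-in for the real-world line, not the original data:

```python
import re

# Earlier suggestion: only doubled quotes ("") may appear inside a value.
FIRST = re.compile(r'(\w+,)*(\w+:)*("(?:[^"]|"")*"|[^,:\n]+)')
# Accepted answer: \" is also treated as part of the value.
FINAL = re.compile(r'(\w+,)*(\w+:)*("(?:\\"|""|[^"])*"|[^,:\n]+)')

# Shortened stand-in for the long example line (hypothetical content).
line = r'more,t:V_Report,s:"Trim(x) +\"<8>\"+ SubStr(Text(51101),1,40)"'

first_values = [m.group(3) for m in FIRST.finditer(line)]
final_values = [m.group(3) for m in FINAL.finditer(line)]

print(len(first_values), first_values)
print(len(final_values), final_values)
```

With the earlier expression the quoted value ends at the first `\"`, and the remainder is split at the commas. With the accepted one, `\\"` is tried before `[^"]`, so each backslash-quote pair is consumed together and the value runs to the final quote.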