(REGEX) My "zero-width negative look-ahead assertion" Is Not Working

  • Question

  • (Sorry, I don't see a "Forum Category" which allows me to specify the Regex forum.)

    I think that I need a "zero-width negative look-ahead assertion" and I think that I have correctly specified one but it doesn't seem to be working, meaning, of course, that it is not doing what I need it to do.  Any assistance will be greatly appreciated.  

    I have input which looks like this ...

                                    54  APRIL SHOWERS       97      BILL BAILEY, WON'T YOU PLEASE COME HOME?        
    18 A-TISKET A-TASKET 54 APRIL SHOWERS
    18 THE ABA DABA HONEYMOON 55 AQUARIUS 95 BILLBOARD MARCH
    19 ABILENE 53 AROUND THE WORLD 98 A BIRD IN A GILDED CAGE

    ... 2- or 3-digit page numbers followed by song titles.  Unfortunately, some titles contain strings of digits, so I have to allow for digits in the titles.  

    In Expresso I have been trying to develop the right pattern but what I have is not working and I can't figure out why.  The expression I've come up with is:

     ((?<=(^|\s*))((?<pageno>\s+\d{2,3})\s+(?<title>[ \w\'\,\-\d\?\(\)\!]+)(?!(\s+\d{2,3}\s+))))+

    There are two problems: 1) the zero-width negative look-ahead assertion doesn't seem to always work.  In the first record everything following the first page number is being captured as the title.  And 2) the last character of the titles is being lost.

    Expresso screen shot below.  I'll be grateful for any help.  Bob

    Monday, December 28, 2015 11:30 PM

All replies

  • The forum code did not correctly format the example of the input I included in the text.  Most records contain several song titles.  The Expresso screen shot is correct.
    Monday, December 28, 2015 11:37 PM
  • Can you show examples of titles that contain numbers, and explain how you will distinguish these numbers from the page numbers?

    Maybe it is possible to put each item on a separate line. Then it would be easier to extract the page numbers and the titles.

    • Edited by Viorel_MVP Tuesday, December 29, 2015 9:13 AM
    Tuesday, December 29, 2015 9:11 AM
  • Hi Bob,

    Based on your sample input, I haven't found a regular pattern.

     54  APRIL SHOWERS       97      BILL BAILEY, WON'T YOU PLEASE COME HOME?        
     18 A-TISKET A-TASKET
    54 APRIL SHOWERS
    
     18 THE ABA DABA HONEYMOON
    55 AQUARIUS
    95 BILLBOARD MARCH
     19 ABILENE  53 AROUND THE WORLD
    98 A BIRD IN A GILDED CAGE

    About zero-width assertions, here is a regex pattern. Note that it uses a positive look-behind and a positive look-ahead rather than a negative look-ahead.

    (?<=<(\w+)>).*(?=<\/\1>)

    With this pattern, if the prefix is <b>, the suffix must be </b>, and the whole expression matches the content between <b> and </b>. But it does not seem to meet the requirements of your scenario.

    Best regards,

    Kristin



    Tuesday, December 29, 2015 9:19 AM
  • In response to Viorel_ ...

    Thanks for your interest in my problem.

    There are well over 1000 song titles and I haven't examined them all.  The one record I've noticed is:

    511 PENNSYLVANIA 6-5000 508 PENNSYLVANIA POLKA 511 PENTHOUSE SERENADE 512 PERDIDO

    But since the numeric string in the first song title is not 2-3 digits long, it escapes my regex, as it should.

    Surely a regular expression cannot distinguish between a 2-3 digit page number and a 2-3 digit string which is part of a song title.  But I can manipulate the input to make sure that 2-3 digit strings in song titles look sufficiently different, i.e. change "50 shades of grey" to "_50_ shades of grey". 
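
    The marking idea can be checked with a small sketch. It is written in Python rather than .NET (an assumption made only for illustration), the page numbers 17 and 18 are made up, and `unmark` is a hypothetical helper for restoring a title after extraction. It only shows that a digit string wrapped in underscores no longer looks like "whitespace, 2-3 digits, whitespace".

```python
import re

# The separator test used inside the look-ahead: whitespace, 2-3 digits, whitespace.
PAGE_NUMBER = re.compile(r"\s+\d{2,3}\s+")


def page_number_hits(line):
    """Return every substring of the line that the separator test accepts."""
    return PAGE_NUMBER.findall(line)


def unmark(title):
    """Hypothetical helper: turn "_50_" back into "50" once a title is extracted."""
    return re.sub(r"_(\d{2,3})_", r"\1", title)


# Unmarked, the "50" in the title is indistinguishable from a page number.
print(page_number_hits("17 50 SHADES OF GREY 18 ABILENE"))
# Marked, only the real page number in the middle of the line is accepted.
print(page_number_hits("17 _50_ SHADES OF GREY 18 ABILENE"))
print(unmark("_50_ SHADES OF GREY"))
```

    The leading "17" is not reported in either case because the separator test requires whitespace before the digits.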

    Bob

    Tuesday, December 29, 2015 4:07 PM
  • Try this expression:

        (?<=^|\s)(?<pageno>\d{2,3})\s+(?<title>.+?)(?=\s+\d{2,3}\s|$)
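
    For readers who want to try this expression outside Expresso, here is a minimal sketch in Python. It is an adaptation, not the .NET original: Python writes named groups as (?P<name>...), and its re module only accepts fixed-width look-behind, so (?<=^|\s) is replaced by (?<!\S), which also means "at the start or after whitespace". The function name `parse_line` is made up for the example, and the trailing-space trim is an addition to cope with padded input lines.

```python
import re

# Python adaptation of the expression above (see the notes in the lead-in).
PATTERN = re.compile(
    r"(?<!\S)(?P<pageno>\d{2,3})\s+(?P<title>.+?)(?=\s+\d{2,3}\s|$)"
)


def parse_line(line):
    """Return (page number, title) pairs found in one line of the index."""
    return [
        # rstrip: the last title on a padded line runs up to "$", spaces included.
        (int(m.group("pageno")), m.group("title").rstrip())
        for m in PATTERN.finditer(line)
    ]


for sample in (
    "18 THE ABA DABA HONEYMOON 55 AQUARIUS 95 BILLBOARD MARCH",
    "511 PENNSYLVANIA 6-5000 508 PENNSYLVANIA POLKA 511 PENTHOUSE SERENADE 512 PERDIDO",
):
    print(parse_line(sample))
```

    The lazy ".+?" grows the title one character at a time and stops as soon as the look-ahead sees the next page number or the end of the line, which is why "6-5000" stays inside its title.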

    Tuesday, December 29, 2015 5:40 PM
  • Thank you very much Viorel_.  That seems to have worked wonderfully.  I like the simplification.  And it is faster than mine.  I especially like that you use "." to grab the title text.  That's very helpful because I've noticed many strange characters in some of the titles.  

    At this point I am happy just to have an expression which works so well.  But do you know why my expression was dropping the last character of most titles?  That's a real mystery and I had no idea how to figure out what I was doing wrong.

    Thanks again very much for your solution.

    Bob

    Tuesday, December 29, 2015 7:22 PM
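  • I think that with “(?!(\s+\d{2,3}\s+))” you say “what follows here is not whitespace, then two or three digits, then whitespace”. That assertion is tested only once, after the whole title has been matched. At the end of a title it fails, because the next page number does follow, so the engine backtracks by one character. At that position the next character is the last letter of the title, so the assertion succeeds, and that letter is left out of the found <title> and is lost.

    To keep using a zero-width negative look-ahead assertion, try this modification of the original expression, which tests the assertion before each character of the title: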

    (?<=(^|\s*))(?<pageno>\s+\d{2,3})\s+(?<title>((?!\s+\d{2,3}\s+)[ \w\'\,\-\d\?\(\)\!])+)
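
    The effect of where the assertion sits can be reproduced with a reduced sketch in Python. These are simplified stand-ins, not the exact expressions from this thread: the leading look-behind is dropped because Python's re module rejects variable-width look-behind, named groups use Python's (?P<name>...) syntax, and the names `AFTER`, `BEFORE` and `titles` are made up for the example.

```python
import re

CHARS = r"[ \w',\-?()!]"  # characters allowed in a title

# Assertion tested once, AFTER the greedy title (the shape of the original expression).
AFTER = re.compile(
    r"(?P<pageno>\d{2,3})\s+(?P<title>" + CHARS + r"+)(?!\s+\d{2,3}\s+)"
)
# Assertion tested BEFORE each title character (the shape of the modification).
BEFORE = re.compile(
    r"(?P<pageno>\d{2,3})\s+(?P<title>(?:(?!\s+\d{2,3}\s+)" + CHARS + r")+)"
)


def titles(pattern, text):
    """Return the title captured by each match of the pattern in the text."""
    return [m.group("title") for m in pattern.finditer(text)]


two_records = "19 ABILENE\n53 AROUND THE WORLD\n"
print(titles(AFTER, two_records))   # ['ABILEN', 'AROUND THE WORLD']
print(titles(BEFORE, two_records))  # ['ABILENE', 'AROUND THE WORLD']

one_record = "18 THE ABA DABA HONEYMOON 55 AQUARIUS 95 BILLBOARD MARCH"
print(titles(AFTER, one_record))    # one title that swallows the whole line
print(titles(BEFORE, one_record))   # three separate titles
```

    With the assertion after the title, the character class happily consumes the page numbers inside a line, and at a line break the engine gives back one letter to make the assertion succeed. With the assertion before each character, the title stops exactly where the next page number begins.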
    


    • Edited by Viorel_MVP Tuesday, December 29, 2015 7:50 PM
    • Marked as answer by eBob.com Tuesday, December 29, 2015 8:59 PM
    Tuesday, December 29, 2015 7:49 PM
  • Yes, I see.  Thanks.  And maybe you have cleared up something else for me.

    I was using a negative look-ahead assertion because I wanted the capture to stop at that point.  But I am thinking now that both positive and negative look-ahead assertions can be used to stop capturing.  The difference is whether the capture stops because of what is there or what isn't there.  Is that right?

    (So often I am reminded of that old saying "A little knowledge is a dangerous thing!")

    Thanks again for all of your help.

    Bob

    Tuesday, December 29, 2015 8:05 PM
  • It seems so. Just as a loop can be written either as “while condition” or as “while not opposite-condition”, this regular expression can be written with either a positive or a negative assertion.
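
    As a rough check of that equivalence, the sketch below writes the same extraction both ways in Python. It is an illustration under assumptions: Python's (?P<name>...) group syntax and a fixed-width (?<!\S) look-behind stand in for the .NET constructs used in this thread, and the names `POSITIVE`, `NEGATIVE` and `pairs` are made up.

```python
import re

# Positive form: extend the title lazily UNTIL the look-ahead sees a page number or the end.
POSITIVE = re.compile(
    r"(?<!\S)(?P<pageno>\d{2,3})\s+(?P<title>.+?)(?=\s+\d{2,3}\s|$)"
)
# Negative form: take characters WHILE the look-ahead does not see a page number.
NEGATIVE = re.compile(
    r"(?<!\S)(?P<pageno>\d{2,3})\s+(?P<title>(?:(?!\s+\d{2,3}\s).)+)"
)


def pairs(pattern, line):
    """Return (pageno, title) string pairs for every match in the line."""
    return [(m.group("pageno"), m.group("title")) for m in pattern.finditer(line)]


line = "511 PENNSYLVANIA 6-5000 508 PENNSYLVANIA POLKA 511 PENTHOUSE SERENADE 512 PERDIDO"
print(pairs(POSITIVE, line))
print(pairs(NEGATIVE, line))
print(pairs(POSITIVE, line) == pairs(NEGATIVE, line))
```

    Both forms stop the capture at the same place on this line. The positive one stops because of what is there, and the negative one continues only while it is not there.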

    • Marked as answer by eBob.com Tuesday, December 29, 2015 8:59 PM
    Tuesday, December 29, 2015 8:48 PM
  • Thanks Viorel.  You've been very helpful.

    Bob

    Tuesday, December 29, 2015 9:00 PM