none
Parsing a string of variable length

    Question

  • I have a string in each row of my data that I need to parse into the components that each part represents.  The string is of variable length, such that a different number of resulting components/variables/columns should be produced based on the data within the string.  Here's an example:

    Row 1: Lastname1, Firstname1 (AAA)(BBB) 01/01/2011 500.00 CCC LASTNAME2 FIRSTNAME2 1 MAIN STREET CITY1, ST1 99999 CAREER1  Lastname3, Firstname3 MI1 (DDD)(EEE) 01/01/2010 5.00 FFF LASTNAME4 FIRSTNAME4 2 OUT THE WAY CITY2, ST2 88888 CAREER2  Lastname5, Firstname5 (GGG)(HHH) 01/01/2009 50.00 III LASTNAME6 FIRSTNAME6 MI2 3 MAPLE LONG NAME AVENUE CITY3, ST3 77777 CAREER3

    Row 2:  Lastname7, Firstname7 (JJJ)(KKK) 01/01/2011 10.00 LLL LASTNAME8 FIRSTNAME8 10000 OAK BLVD LONG CITY NAME, ST4 66666 CAREER4

    Ideally, I'd like to produce a new worksheet with one row for each unique combination of mini-variables, where my "Row 2" example represents exactly one row on this new worksheet.  My "Row 1" example should produce 3 rows on the new worksheet.  I need a loop that will work for any number of "instances" in the string.

    The only delineator between real instances in my variable string is a double space (literally, two spaces next to each other).  Additionally, there are variable lengths within each instance.  Names can be different lengths, a middle initial occurs sometimes, street and city names can be one word or four words.

    Thanks in advance!

    Tuesday, July 26, 2011 3:52 PM

Answers

  • One of several ways -

    Sub test()
    Dim arr
    Dim i As Long
    Dim s As String
        s = "double  space  delimited  string"
       arr = Split(s, "  ")
    
    ' normally better to check arr is an array here
       For i = 1 To UBound(arr) + 1
           Cells(i, 1) = arr(i - 1)
       Next
    
    End Sub

    Peter Thornton

    • Marked as answer by SMMiller Monday, August 01, 2011 7:14 PM
    Tuesday, July 26, 2011 5:51 PM
    Moderator

All replies

  • Hello SMMiller:

    I believe it would help all the forum participants if you would show a simple "before" and "after" example.  In your existing example, it's hard to tell what is in each column.  If you identify the "before" and show what is in each column, and then the "after" and what is in each column, it would help all of us.

    Regards,


    Rich Locus Logicwurks, LLC www.logicwurks.com
    Tuesday, July 26, 2011 5:27 PM
  • One of several ways -

    Sub test()
    Dim arr
    Dim i As Long
    Dim s As String
        s = "double  space  delimited  string"
       arr = Split(s, "  ")
    
    ' normally better to check arr is an array here
       For i = 1 To UBound(arr) + 1
           Cells(i, 1) = arr(i - 1)
       Next
    
    End Sub

    Peter Thornton

    • Marked as answer by SMMiller Monday, August 01, 2011 7:14 PM
    Tuesday, July 26, 2011 5:51 PM
    Moderator
  • Of course - sorry!

    BEFORE

    Lastname1, Firstname1 (AAA)(BBB) 01/01/2011 500.00 CCC LASTNAME2 FIRSTNAME2 1 MAIN STREET CITY1, ST1 99999 CAREER1  Lastname3, Firstname3 MI1 (DDD)(EEE) 01/01/2010 5.00 FFF LASTNAME4 FIRSTNAME4 2 OUT THE WAY CITY2, ST2 88888 CAREER2  Lastname5, Firstname5 (GGG)(HHH) 01/01/2009 50.00 III LASTNAME6 FIRSTNAME6 MI2 3 MAPLE LONG NAME AVENUE CITY3, ST3 77777 CAREER3
    Lastname7, Firstname7 (JJJ)(KKK) 01/01/2011 10.00 LLL LASTNAME8 FIRSTNAME8 10000 OAK BLVD LONG CITY NAME, ST4 66666 CAREER4

     

    AFTER

    PayorLast PayorFirst PayorMI Code1 Code2 Date Amount Code3 PayeeLast PayeeFirst PayeeMI Street City State Zip Job
    Lastname1 Firstname1 AAA BBB 1/1/2011 500.00 CCC LASTNAME2 FIRSTNAME2 1 MAIN STREET CITY1 ST1 99999 CAREER1
    Lastname3 Firstname3 MI1 DDD EEE 1/1/2010 5.00 FFF LASTNAME4 FIRSTNAME4 2 OUT OF THE WAY CITY2 ST2 88888 CAREER2
    Lastname5 Firstname5 GGG HHH 1/1/2009 50.00 III LASTNAME6 FIRSTNAME6 MI2 3 MAPLE LONG NAME AVENUE CITY3 ST3 77777 CAREER3
    Lastname7 Firstname7 JJJ KKK 1/1/2011 10.00 LLL LASTNAME8 FIRSTNAME8 10000 OAK BLVD LONG CITY NAME ST4 66666 CAREER4

    Thanks so much!

    Tuesday, July 26, 2011 6:05 PM
  • SMMiller:

    Your example helps!!  Peter Thornton suggested using the Split function, which would appear to be the best way.  I've used it on other parsing applications.  What makes it a little more difficult is that one ROW of raw data translates into multiple rows in Excel.  Also, there doesn't seem to be any apparent way to tell when a Last Name starts for the next row.  The Split function just tears each item apart and stacks it into the Array.  Once in the array, how would you know when a "Last Name" begins for the next Excel row?  Also, some names have two components (Last, First) and some have three (Last, First MI).

    There needs to be a way to be "column specific" in identifying the data that is being sent from the "split" array back into Excel.

    Ideas?


    Rich Locus Logicwurks, LLC www.logicwurks.com
    Tuesday, July 26, 2011 6:17 PM
  • The delineator of two spaces is within the single row to show where the last name should begin a new row in the target worksheet, so if I'm understanding Peter's code correctly, each element in the resulting array should represent one future row in my target worksheet.

    In a cursory review of my data, it looks like I consistently have values for first and last names in both payee and payor.  The middle initial is definitely one of the difficulties.  I'm also finding that some data rows have nothing included for my Code1 column in the result I'm trying to achieve.

    There is consistency in the wrapping of both Code1 and Code2 in parentheses, so I could use those.  Actually, for my purposes, I'm also OK with losing the middle initial entirely.  We'll still experience some data corruption in the parsing, because it's programmatically so complex to know the difference between the name Martin Van Buren where "Van Buren" is the full last name and Martin B Smith where "Smith" is the last name.  Maybe a length check though?  My middle initials seem to be consistently either a single character or a single character followed by a period.

    My date is formatted beautifully with the full MM/DD/YYYY in every example I've reviewed, and my amount is always reported with the two decimal places.

    The issue that I've gotten stuck on is splitting street and city.  For the reports I'm generating, it's not a requirement, but in general, I hate to ever lose data.  I've not been able to come up with logic to programmatically separate the two though.  I suspect the only way I'll ever solve that is to build an array or table of street "enders".  Ex: {Avenue, Ave, Street, St, Boulevard, Blvd,...} and use those as multiple potential delineators for the street/city sub-string.

    Sorry for the rambling - just thinking out loud a little!

     

    Tuesday, July 26, 2011 6:50 PM
  • SMMiller:

    You may need to do some "grooming" on your data.  I looked at the example you posted and there was only one space between the first two people.  Yet, the third person had two spaces.

    All the other data items just had one space between them (only did a cursory review).

    So, you may need to do the Split function twice, first to segregate each person (using two spaces as the delimiter), and second to break down the elements within each person (using a one space delimiter).  It is starting to look like a rather involved project.  When you do the first split, it would load your array with 3 entries if you had three people.  Then, you would have to dump the 3 array entries back to temporary worksheet lines.  At that point, you would run another Split with only once space as the delimiter.

    Does that sound correct?


    Rich Locus Logicwurks, LLC www.logicwurks.com
    Tuesday, July 26, 2011 7:08 PM
  • Assuming your entire rows are reliably separated with double spaces use the Split example to get you started.

    If the sample data you posted is indicative of what you've got there are no obvious delimeters between each of the fields. If there were you could use Split again within the loop

    arr2 = Split(arr1, ",")   assuming comma separators
    ' dump arr2 into columns in the row

    But you've just got a mess! If you've only got up to say 100 rows probably better to manually insert unique separators as appropriate.

    FYI, another common way to parse is to find the location of known characters, eg, to separate your dates

    arr = Split(s, "  ")
    For i = 1 To UBound(arr) + 1
       Cells(i, 1) = arr(i - 1)
       pos = 0
       pos = InStr(1, arr(i - 1), "/")  ' first date separator
       Cells(i, 2) = Left$(arr(i - 1), pos - 3)
       Cells(i, 3) = Mid$(arr(i - 1), pos - 2, 5)
       Cells(i, 4) = Mid$(arr(i - 1), pos + 9, 300)
    Next

    Peter Thornton

    Tuesday, July 26, 2011 7:09 PM
    Moderator
  • This logic is working for me except when my original string (s in your sample code) is long.  There seems to be a limit/bug in the Split function when that happens.

    For example, when stepping through the VBA code,

    when s = "AAA 111  BBB 222  CCC 333  DDD 444  EEE 555  FFF 666  GGG 777  HHH 888  III 999  JJJ 000  KKK 111111  LLL 121212  MMM 131313  NNN 141414  OOO 151515"

    after Split(s, "  "), UBound(arr) = 14.

    On the other hand,

    when s = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 111  BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 222  CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC 333  DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD 444  EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 555  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF 666  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG 777  HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 888  IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 999  JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ 000  KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK 111111  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL 121212  MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 131313  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN 141414  OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO 151515"

    after Split(s, "  "), UBound(arr) = 11.

    In fact, if I make s longer, I can get a smaller UBound.  My real data produces UBound(arr) = 9, where it should be 14.

    Is there a different function that works regardless of size?  Work-around ideas?

    Thank you very much!

    Thursday, July 28, 2011 5:20 PM
  • Maybe the problem isn't the Split function, but a limitation in the Array type?

    I tried replacing the use of Split with the more basic code documented here http://support.microsoft.com/kb/126939 and arrived at the same place (problem) with my example.

    Thursday, July 28, 2011 7:05 PM
  • Hello again SMMiller:

    I used the following test program on a huge string and had no issues.  You should verify your delimiters.

    Option Explicit
    
    Public Sub TestSplit()
    Dim strParsedData() As String
    Dim i As Long
    
    ' ***********************************************************************
    ' Parse The Input File into multiple output records
    ' ***********************************************************************
    strParsedData = Split(Cells(1, 1), " ")
    
    For i = 0 To UBound(strParsedData)
     If strParsedData(i) <> "" Then
     MsgBox ("Value is " & strParsedData(i))
     End If
    Next i
    
    End Sub
    
    

    My UBound was 14 (for 15 entries 0 through 14)

    Here's the data using a single space as the delimiter: 11111 22222222222222222222222222222222222222222222222222222222222222222222 33333333333333333333333333333333333333333333333333333333333333333333 444444444444444444444444444444444444444444444444444444444444 55555555555555555555555555555555555555555555555555555555555555555555555555555555555555555555 6666666666666666666666666666666666666666666666666666666666666666666 7777777777777777777777777777777777777777777777 8888888888888888888888888888888888888888888888888888 99999999999999999999999999999999999999999999999999999999999999 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
    Rich Locus Logicwurks, LLC www.logicwurks.com

    Thursday, July 28, 2011 9:17 PM
  • Your example as posted returns UBound 14 with Split, ie 15 strings as expected.

    My guess is your input string is not what you think it is. Split should be able to efficiently handle vastly a greater input string and return an array with a UBound of several thousand.

    Are you quite sure all your spaces are indeed CHR(32) spaces.

    Peter Thornton

    Friday, July 29, 2011 3:41 PM
    Moderator
  • SMMiller:

    As did Peter Thorton, I copied your exact character string from your post, pasted it into TextPad (basically Notepad on steroids), then copied it from there to Excel and ran the same code as illustrated above.  I also achieved 15 entries (Ubound = 14).  So, there is something else going on with your application.  I looked at the text stream in "HexWorks", and indeed, all the delimiters were Hex 20 (A Space Character).  Try using my little snippet of code in my previous post with your data and see if you get the same results as Peter Thornton and I.

    Regards,


    Rich Locus Logicwurks, LLC www.logicwurks.com

    Friday, July 29, 2011 4:02 PM
  • Rich and Peter,

    I double-checked everything and verified my results a dozen times.  This morning, I set up a new laptop with MS Office 2007 and retried.  Everything works fine on this configuration!  It looks like I found a bug in Excel/VBA 2003.

    Thanks to both of you!  I'm going to mark the Split on a double-space solution as my answer.

    Monday, August 01, 2011 7:14 PM
  • SMMiller:

    A "Helpful" vote on my contribution would be appreciated.

    Regards,


    Rich Locus Logicwurks, LLC www.logicwurks.com
    Monday, August 01, 2011 9:00 PM
  •  Everything works fine on this configuration! It looks like I found a bug in Excel/VBA 2003.

    No, there's no difference in how VBA.Strings functions work in all Excel versions since XL-2000. Your String code will work exactly the same in 2003 and later versions.

    Peter Thornton

    Tuesday, August 02, 2011 8:23 AM
    Moderator