Multiline text with regular expression

  • Question

  • See the image above. I need to split this text using a regular expression. The rows can be grouped by the right-most value on each row: within each group that value runs 1, 2, 3 ... 8. Some values are repeated within a group, but 1, 2, 3 and 4 are not. Sometimes the file doesn't contain a row 4 for a group.

    I need to get the highlighted values from the records in each group. Below is the sample file.

     EVE999A12341082412A EVE999                                    12              1
     EVE999 950708 93 HOND  JHMCB7677PC011173                                      2
     EVE999 940901 DRIVER ANNIE CAR                                                3
     EVE999 44 940903 ARTS LICENSE PLATE                                           5
     EVE999 06 081031 DOJ STOP                                                     5
     EVE999 62 970616 EXEMPT BY STATUTE REG DEFERRED                               5
     EVE999 20 950829 UNCLAIMED REGISTRATION                                       5
     EVE999 083106 080100          DMV CARDRIVER                                   7
     EVE999                                                                        8
     EVE999 083106 080106          DMV CARDRIVER                     2             7
     EVE999                                                          2             8
     EVE999 092006 090106          COBB KATHLEEN                     3             7
     EVE999                                                          3             8
     EVE999 110606 101006          DRIVER PAUL                       4             7
     EVE999                                                          4             8
     EVE999 011207 102306          CARDRIVER ANY                     5             7
     EVE999                                                          5             8
     EVE999 110706 110106          RIFFLE OVETA                      6             7
     EVE999                                                          6             8
     EVE999 022807 022707          USED CAR DEALERSHIP OF CALIF      7             7
     EVE999                                                          7             8
     HUG999A12341082412A HUG999                                    13              1
     HUG999 950831 89 TOYT  JT4RN93P2K0013108                                      2
     HUG999 940908 DRIVER ANNIE CAR                                                3
     HUG999 44 940923 KIDS LICENSE PLATE - HAND       1                            4
     HUG999 62 970616 EXEMPT BY STATUTE REG DEFERRED                               5
     HUG999 26 951122 ADDITIONAL MAKES                                             5
     HUG999 46 940923 SMOG DUE 08/31/95                                            5
     HUG999 091206 090106          SAM MATTS                                       7
     HUG999                                                                        8
    

    Please help.

    • Moved by Lisa Zhu Thursday, March 14, 2013 6:29 AM Regular Expressions related
    Wednesday, March 13, 2013 8:41 AM

Answers

  • If this is a real app you are working on (instead of a programming course assignment), you don't need to do everything with regular expressions.  You could grab the character at the end of each line (1, 2, 3, 4, ...), examine it, and then use the appropriate regular expression to get the data that line provides.  I'd use String.Substring to get that last character (although you could do that with a regular expression too) and then a Select ... Case (switch in C#) to choose the right regular expression for that line type.  An advantage of this approach is that you avoid one really long regular expression.
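A minimal sketch of that line-by-line approach, in C# to match the other code in this thread. The three patterns and the group name `value` are illustrative assumptions about which fields matter on each line type; they are not rules taken from the file's specification. A dictionary lookup stands in for the Select ... Case.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class LineDispatcher
{
    // Hypothetical per-line-type patterns; adjust them to the real field rules.
    static readonly Dictionary<char, Regex> Patterns = new Dictionary<char, Regex>
    {
        { '1', new Regex(@"^\s*\w{7}(?<value>\w{12})\s") },
        { '2', new Regex(@"(?<value>\w{17})\s+2$") },
        { '3', new Regex(@"^\s*\w+\s+\d{6}\s+(?<value>.+?)\s+3$") },
    };

    // Returns the captured value, or null when the line type is not of interest.
    public static string Extract(string line)
    {
        string trimmed = line.TrimEnd();
        if (trimmed.Length == 0) return null;

        char lineType = trimmed[trimmed.Length - 1];   // right-most character
        Regex pattern;
        if (!Patterns.TryGetValue(lineType, out pattern)) return null;

        Match m = pattern.Match(trimmed);
        return m.Success ? m.Groups["value"].Value : null;
    }

    static void Main()
    {
        Console.WriteLine(Extract(" EVE999 950708 93 HOND  JHMCB7677PC011173          2"));
        Console.WriteLine(Extract(" EVE999 940901 DRIVER ANNIE CAR                    3"));
    }
}
```

Each pattern stays short because it only has to describe one line type, and supporting another line type means adding one dictionary entry.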

    But if, because this is a course assignment or just to make the project more interesting, you want to employ the full power of regular expressions, then you need to know that they can specify "optional" matches: a* matches zero or more "a"s, a+ matches one or more "a"s, and a? matches zero or one "a".  So the pattern     abcd?    will match abc and abcd.
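The quantifiers described above can be checked with a few lines of C#. This is only a sketch; `FullMatch` is a hypothetical helper that anchors the pattern so the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // True when the entire input matches the pattern (hypothetical helper).
    public static bool FullMatch(string pattern, string input)
    {
        return Regex.IsMatch(input, "^(?:" + pattern + ")$");
    }

    static void Main()
    {
        Console.WriteLine(FullMatch("abcd?", "abc"));    // True: the d is optional
        Console.WriteLine(FullMatch("abcd?", "abcd"));   // True
        Console.WriteLine(FullMatch("abcd?", "abcdd"));  // False: at most one d
        Console.WriteLine(FullMatch("a*", ""));          // True: zero a's is allowed
        Console.WriteLine(FullMatch("a+", ""));          // False: needs at least one a
    }
}
```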

    So here's a pattern which does not do exactly what you need but which might be part of what you need:

    (?<firstline>\w{3}999.+?1\r\n)\s*(?<secondline>\w{3}999.*?2\r\n)\s*(?<thirdline>\w{3}999.*?3\r\n)\s*(?<fourthline>\w{3}999.*?4\r\n)?

    The ? which matters in this expression is the very last one: it makes the whole fourthline group optional, so there is still a match when the fourth line is not present.  Note that a ? directly after + or * has a different meaning: it makes that quantifier lazy, so it matches as few characters as possible.  Here's the Expresso output using your data and the expression above:

    I am not capturing exactly what you need, I am capturing the entire line.  But you should be able to tailor the expression above to capture exactly the data you need. 
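As a rough illustration of how the pattern above separates one record set from the next, the sketch below runs it over a shortened copy of the sample data. The class and method names are hypothetical, and the text is built with explicit `\r\n` line endings because the pattern depends on them. `Regex.Matches` yields one match per record set, and `Groups["fourthline"].Success` tells whether the optional fourth line was present.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RecordGroups
{
    // Pattern quoted from the reply above; only the trailing group is optional.
    const string Pattern =
        @"(?<firstline>\w{3}999.+?1\r\n)\s*" +
        @"(?<secondline>\w{3}999.*?2\r\n)\s*" +
        @"(?<thirdline>\w{3}999.*?3\r\n)\s*" +
        @"(?<fourthline>\w{3}999.*?4\r\n)?";

    // One entry per record set: the first line, and whether a type-4 line was found.
    public static List<KeyValuePair<string, bool>> Parse(string text)
    {
        var result = new List<KeyValuePair<string, bool>>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            string first = m.Groups["firstline"].Value.Trim();
            bool hasFourth = m.Groups["fourthline"].Success;
            result.Add(new KeyValuePair<string, bool>(first, hasFourth));
        }
        return result;
    }

    static void Main()
    {
        // Shortened sample with explicit CRLF line endings.
        string text =
            " EVE999A12341082412A EVE999   12   1\r\n" +
            " EVE999 950708 93 HOND  JHMCB7677PC011173   2\r\n" +
            " EVE999 940901 DRIVER ANNIE CAR   3\r\n" +
            " EVE999 44 940903 ARTS LICENSE PLATE   5\r\n" +
            " HUG999A12341082412A HUG999   13   1\r\n" +
            " HUG999 950831 89 TOYT  JT4RN93P2K0013108   2\r\n" +
            " HUG999 940908 DRIVER ANNIE CAR   3\r\n" +
            " HUG999 44 940923 KIDS LICENSE PLATE - HAND   1   4\r\n";

        foreach (var record in Parse(text))
            Console.WriteLine("{0} | has line 4: {1}", record.Key, record.Value);
    }
}
```

Lines of type 5, 7 and 8 are simply skipped, because the next match can only start at a line that ends in 1.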

    I haven't explained all of the expression above because I want to "inspire" you to further your regex education.  But one thing I do want to explain is that I originally tried $ to match the "end of line".  For some reason which I did not research that did not work.  So I used "\r\n" instead of $. ALSO, I don't recall but the regex options may matter in this case. In case they do here are the options I used:
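On the `$` problem: one plausible cause, offered as an assumption since the reply did not investigate it, is the Windows line ending. In .NET, `$` with `RegexOptions.Multiline` matches just before `\n`, and `\r` is not treated as part of the line break, so a digit followed by `\r\n` is not directly followed by `$`. Allowing an optional `\r` before `$` is the usual workaround. The sketch below counts matches both ways.

```csharp
using System;
using System.Text.RegularExpressions;

static class EndOfLineDemo
{
    // Number of matches of the pattern with the Multiline option switched on.
    public static int Count(string text, string pattern)
    {
        return Regex.Matches(text, pattern, RegexOptions.Multiline).Count;
    }

    static void Main()
    {
        string text = "EVE999 1\r\nEVE999 2\r\n";

        Console.WriteLine(Count(text, @"\d$"));     // 0: the \r sits between the digit and $
        Console.WriteLine(Count(text, @"\d\r?$"));  // 2: the optional \r is consumed first
    }
}
```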

    Bob


    Sunday, March 24, 2013 11:23 PM

All replies

  • Wouldn't it be easier to read through the file and pull out what you are looking for as you parse the info rather than use a regex?

    Wednesday, March 13, 2013 9:35 AM
  • You can use the expression "\S+" to split the row.  I don't understand from your description which words are highlighted.  Can you give a better explanation?  The regex language is described on the webpage below

    http://msdn.microsoft.com/en-us/library/az24scfc.aspx


    jdweng

    Wednesday, March 13, 2013 9:45 AM
  • Please see the image. I have highlighted the records which I need using different colors.
    Wednesday, March 13, 2013 10:22 AM
  • Try something like the code below

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.IO;
    using System.Data;
    namespace ConsoleApplication2
    {
        class Program
        {
             static void Main(string[] args)
            {
                string filename = @"c:\temp\Tharaka.txt";
                string[] file = File.ReadAllLines(filename);
                Regex expression = new Regex(@"\w+");
                string[] fields = null;
                DataTable  data = new DataTable();
                data.Columns.Add("Field 1");
                data.Columns.Add("Field 2");
                data.Columns.Add("Field 3");
                data.Columns.Add("Field 4");
                data.Columns.Add("Field 5");
                foreach (string row in file)
                {
                    MatchCollection parseRow = expression.Matches(row);
                    // Skip blank lines and lines whose last token is not a numeric row type.
                    int rowindex;
                    if (parseRow.Count == 0 || !int.TryParse(parseRow[parseRow.Count - 1].Value, out rowindex))
                        continue;
                    switch (rowindex)
                    {
                        case 1:
                            if (fields != null)
                                data.Rows.Add(fields);
                            fields = new string[5];
                            fields[0] = parseRow[1].ToString();
                            fields[1] = parseRow[2].ToString();
                            break;
                        case 2:
                            fields[2] = parseRow[4].ToString();
                            break;
                        case 3:
                            fields[3] = parseRow[2].ToString() + " " + parseRow[3].ToString();
                            break;
                        case 4:
                            fields[4] = parseRow[2].ToString() + " " + parseRow[3].ToString();
                            break;
                    }
                }
                // Add the last record, if the file contained any.
                if (fields != null)
                    data.Rows.Add(fields);
            }
        }
    }


    jdweng

    Wednesday, March 13, 2013 11:52 AM
  • Yes, it is easier, but I need a solution based on regular expression grouping. If the template is modified in the future, I would then only need to change the regular expression, with no code changes.

    Wednesday, March 13, 2013 4:10 PM
  • I rewrote the code so that you only have to change the table at the top of the code to select different positions in the file.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.IO;
    using System.Data;
    namespace ConsoleApplication2
    {
        class Program
        {
            static List<List<int>> fieldpositions = new List<List<int>>() 
               { new List<int>() { 0, 1, 2 }, //field 0 : Row 1, column 2
                 new List<int>() { 1, 1, 3 }, //field 1 : Row 1, column 3
                 new List<int>() { 2, 2, 5 }, //field 2 : Row 2, column 5
                 new List<int>() { 3, 3, 3 }, //field 3 : Row 3, column 3
                 new List<int>() { 4, 4, 3 }, //field 4 : Row 4, column 3
               };
            
            static void Main(string[] args)
            {
                string filename = @"c:\temp\Tharaka.txt";
                string[] file = File.ReadAllLines(filename);
                Regex expression = new Regex(@"\w+");
                string[] fields = null;
                DataTable data = new DataTable();
                data.Columns.Add("Field 1");
                data.Columns.Add("Field 2");
                data.Columns.Add("Field 3");
                data.Columns.Add("Field 4");
                data.Columns.Add("Field 5");
                foreach (string row in file)
                {
                    MatchCollection parseRow = expression.Matches(row);
                    // Skip blank lines and lines whose last token is not a numeric row type.
                    int rowindex;
                    if (parseRow.Count == 0 || !int.TryParse(parseRow[parseRow.Count - 1].Value, out rowindex))
                        continue;
                    if (rowindex == 1)
                    {
                        if (fields != null)
                            data.Rows.Add(fields);
                        fields = new string[5];
                    }
                    var selectFields = from g in fieldpositions where g[1] == rowindex select g;
                    foreach (var selectrow in selectFields)
                    {
                        fields[selectrow[0]] = parseRow[selectrow[2] - 1].ToString();
                    }
      
                }
                // Add the last record, if the file contained any.
                if (fields != null)
                    data.Rows.Add(fields);
            }
        }
    }


    jdweng

    Wednesday, March 13, 2013 5:34 PM
  • Hi Tharaka ,

    From your description, I'd like to move this post to the most related forum. There are more experts in this area there, so you will get better support and may have more luck getting answers.

    Thanks for your understanding.

    Regards,


    Lisa Zhu [MSFT]

    Thursday, March 14, 2013 6:29 AM
  • Thank you, but I still haven't got a suitable solution.
    Thursday, March 14, 2013 5:16 PM
  • What is the issue with my posted code?

    jdweng

    Thursday, March 14, 2013 5:20 PM
  • Regular expressions are extremely useful and worth learning.  But expect to spend some time learning them.  Get a free product from UltraPico called Expresso.  It's very helpful for developing regular expressions.

    You have not said what the "rules" are for the strings you want to find, so I tried this expression:

    \w{3}\d{3}\w(?<firststring>\d{11}\w) (?<secondstring>\w{3}\d{3})

    Here's the output as shown by Expresso:

    But I had to assume some rules.  I assumed that firststring follows three alphanumeric characters, then three digits, then one alphanumeric character (e.g. EVE999A).  Then I assumed that firststring consists of 11 digits followed by one alphanumeric character (e.g. 12341082412A) and is followed by a blank.  I made similar assumptions about the rules for secondstring.
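For readers without Expresso, here is a minimal C# sketch that applies the same expression to the first sample line and reads the two named groups. The class and method names are hypothetical; the pattern is the one quoted in this reply.

```csharp
using System;
using System.Text.RegularExpressions;

static class HeaderFields
{
    // Expression from the reply, with its two named groups.
    const string Pattern =
        @"\w{3}\d{3}\w(?<firststring>\d{11}\w) (?<secondstring>\w{3}\d{3})";

    // Returns { firststring, secondstring }, or null when the line does not match.
    public static string[] Extract(string line)
    {
        Match m = Regex.Match(line, Pattern);
        if (!m.Success) return null;
        return new[] { m.Groups["firststring"].Value, m.Groups["secondstring"].Value };
    }

    static void Main()
    {
        string line = " EVE999A12341082412A EVE999                 12              1";
        string[] fields = Extract(line);
        Console.WriteLine(fields[0]);   // 12341082412A
        Console.WriteLine(fields[1]);   // EVE999
    }
}
```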

    If you want to use regular expressions, get Expresso, do some reading, make an attempt, and post any problems you encounter.

    Bob

    Saturday, March 16, 2013 5:30 PM
  • Hi,

    Thank you for your reply. That is exactly the type of solution I need. See the image I have attached below.

    You can see that the right-most column shows an integer sequence. From the text above we need only the lines with 1, 2, 3 or 4 in the right-most column, and we need to split those lines.

    I'll explain this through an example.

    Suppose you need to pass customer names, addresses, telephone and mobile numbers in text files. You have two options.

    Either you can pass them as follows

    <name><address><telephone>
    <name><address><telephone><mobile>
    <name><address><telephone>

    or as follows

    <name>
    <address>
    <telephone>
    <name>
    <address>
    <telephone>
    <mobile>
    <name>
    <address>
    <telephone>

    The data in my text file is received in the second format. To identify the row type, it shows an integer in the right-most column. We need only the rows numbered 1, 2 and 3. Also, you can see in my image that I have highlighted the needed fields in different colors.

    Does that make sense? Please let me know if you need any more details.

    Wednesday, March 20, 2013 2:34 AM
  • The code I posted does exactly what you need.  I put the data into a DataTable which has 5 fields (columns) for the 5 highlighted items.  The only thing I'm not sure of is the two words that are highlighted in blue.  I don't know if you are always going to have two words or sometimes only one.  Right now my code puts only the 1st of the two words into the DataTable.  Can you explain under what conditions you need to take one word and what conditions require two words in the final results?

    Try my code, changing the file name of the input.  I can upgrade the code to put the results into a DataGridView so you can see the resulting table.


    jdweng

    Wednesday, March 20, 2013 5:21 AM
  • Thanks Joel. Your solution is great! But I prefer a solution like the one Bob provided, because it is easier for me to maintain. What I want to do is keep the regular expression in the configuration section and change it when the file format changes. If I use a regular expression with grouping, then my application can retrieve the data by group name (like Bob's firststring, secondstring, etc.).
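A sketch of that configuration-driven idea, under the assumption that the pattern arrives as a plain string (in a real application it would be read from the configuration section; here it is hard-coded for illustration). The class and method names are hypothetical. `Regex.GetGroupNames()` lists the group names, so the code does not need to know them in advance.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ConfiguredExtractor
{
    // Returns the named captures of the first match, keyed by group name.
    public static Dictionary<string, string> Extract(string pattern, string input)
    {
        var regex = new Regex(pattern);
        var values = new Dictionary<string, string>();

        Match m = regex.Match(input);
        if (!m.Success) return values;

        foreach (string name in regex.GetGroupNames())
        {
            int ignored;
            if (int.TryParse(name, out ignored)) continue;   // skip numbered groups such as "0"
            if (m.Groups[name].Success) values[name] = m.Groups[name].Value;
        }
        return values;
    }

    static void Main()
    {
        // Stand-in for a value read from configuration (assumption).
        string pattern = @"\w{3}\d{3}\w(?<firststring>\d{11}\w) (?<secondstring>\w{3}\d{3})";
        string line = " EVE999A12341082412A EVE999                 12              1";

        foreach (var pair in Extract(pattern, line))
            Console.WriteLine("{0} = {1}", pair.Key, pair.Value);
    }
}
```

With this shape, a change in the file format means editing the configured pattern, while the calling code keeps reading values by group name.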

    That's why I prefer a solution like Bob's. But Bob's solution is not a complete one.

    Friday, March 22, 2013 2:15 AM
  • From the table I produced, you can use Linq to perform any searches you need very easily on the table after reading all the results. 

    I've been mining data from files like yours for almost 40 years.  The solution you are asking for has lots of deficiencies, and the code will give wrong results if some of the data is missing.


    jdweng

    Friday, March 22, 2013 9:17 AM
  • I didn't intend to provide a complete solution.  I'll be very glad to provide additional help, but you'll have to develop the solution yourself.  You do know something about regular expressions because my earlier help did not mention group names.  So read some more if you need to, use Expresso, try some code, and then post any specific problems or ask about any specific thing which you do not understand.

    Bob

    Friday, March 22, 2013 3:13 PM
  • Hi Bob,

    Thank you for the reply. Yes, I can create the solution. My only problem is how to separate the row sets.
    See the image I attached in my earlier post: some row sets have 3 rows, but some have 4 (right-most column 1, 2, 3 or 1, 2, 3, 4). I know how to process the first 3 rows (the first set), but I have no idea how to pick up the second row set. If you can help me do this, I can fix the problem myself.

    Regards,

    Sunday, March 24, 2013 6:47 PM
  • Thank you Bob. This is really helpful.
    Monday, March 25, 2013 12:28 PM