Regex Split

  • Question

  • hi,

    I am trying to split a line into an array of strings based on two criteria. First, certain columns are text-qualified with multi-character boundaries. Second, the delimiter is itself multi-character. Things get complicated when the qualifier and the delimiter share characters, and regex metacharacters such as $ and ^ add further challenges. Regex seems best suited for this purpose. The implementation below works in most cases, but breaks when the text-qualifier and/or the delimiter contain metacharacters:

    using System.Text.RegularExpressions;
    
    public string[] Split(string expression, string delimiter,
                string qualifier, bool ignoreCase)
    {
        string _Statement = String.Format
            ("{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                            Regex.Escape(delimiter), Regex.Escape(qualifier));
    
        RegexOptions _Options = RegexOptions.Compiled | RegexOptions.Multiline;
        if (ignoreCase) _Options = _Options | RegexOptions.IgnoreCase;
    
        Regex _Expression = new Regex(_Statement, _Options);
        return _Expression.Split(expression);
    }

    The above works for the majority of scenarios, but not where metacharacters such as $ are involved, especially as part of the text-qualifier. It looks like some particular handling of escaping is needed.

    string input = "*|This is an ..  example*|..Am2..Cool!";
    string input2 = "*|This is an $  example*|$Am2$Cool!";
    string input3 = "$|This is an $  example$|$Am2$Cool!";
    string input4 = "|$This is an $  example|$$Am2$Cool!";
    
    foreach (string _Part in Split(input, "..", "*|", true))
        Console.WriteLine(_Part);
    
    foreach (string _Part in Split(input2, "$", "*|", true))
        Console.WriteLine(_Part);
    
    foreach (string _Part in Split(input3, "$", "$|", true)) // doesn't work correctly
        Console.WriteLine(_Part);
    
    foreach (string _Part in Split(input4, "$", "|$", true)) // doesn't work correctly
        Console.WriteLine(_Part);
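
    A likely reason for the two failing calls, shown as a minimal sketch: Regex.Escape protects each character, but inside [^...] the escaped qualifier becomes a set of single characters rather than a two-character sequence. So [^\$\|] excludes every $ and every |, and the delimiter $ can also match the first character of the qualifier itself. PatternDemo, Build and SplitWith are hypothetical names used only for this illustration; the format string is the one from Split above.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration; not part of the original post.
    public static class PatternDemo
    {
        // Builds the same pattern string as Split above.
        public static string Build(string delimiter, string qualifier)
        {
            return String.Format(
                "{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
                Regex.Escape(delimiter), Regex.Escape(qualifier));
        }

        // Splits with that pattern so the resulting parts can be inspected.
        public static string[] SplitWith(string input, string delimiter, string qualifier)
        {
            return new Regex(Build(delimiter, qualifier)).Split(input);
        }
    }
    ```

    Build("$", "$|") returns \$(?=(?:[^\$\|]*\$\|[^\$\|]*\$\|)*(?![^\$\|]*\$\|)). Splitting input3 with it yields an empty first element, because the $ at index 0 (the start of the opening qualifier) is taken as a delimiter.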

    Could you please let me know how to handle all situations, including those where metacharacters are part of the text-qualifier and/or the delimiter?

    thank you
    Thursday, September 3, 2020 10:42 PM

All replies

  • Hi etl2016,

    Thank you for posting here.

    I haven't found a way to modify the regex to make it work, but I wrote a piece of code to replace it. Please try it and see whether it works for you.

            public static string[] Split2(string expression, string delimiter,
              string qualifier)
            {
                List<string> re = new List<string>();
                int beginQualifier = expression.IndexOf(qualifier);
                int endQualifier = expression.LastIndexOf(qualifier);
                if (beginQualifier>0)
                {
                    string start = expression.Substring(0, beginQualifier+1);
                    re.AddRange(start.Split(new string[] { delimiter }, StringSplitOptions.RemoveEmptyEntries));
                }
           
                string mid = expression.Substring(beginQualifier, endQualifier-beginQualifier+qualifier.Length);
                string end = expression.Substring(endQualifier + qualifier.Length);
               
                re.Add(mid);
                re.AddRange(end.Split(new string[] { delimiter }, StringSplitOptions.RemoveEmptyEntries));
                return re.ToArray();
            }

    Hope this could be helpful.

    Best Regards,

    Timon


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Friday, September 4, 2020 3:20 AM
  • If this approach does not work, maybe you can switch to scanning the string with Regex.Matches.

    Did you consider the case when the characters, which also denote text-qualifiers, appear inside the column? For example, how do you represent the column “This is an $|$ example” using “$|” as text-qualifier and “$” as delimiter? 
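
    The Regex.Matches scanning idea could be sketched roughly as follows. This is only one possible reading of it, not a tested solution: the pattern tries a complete qualified span first, then a delimiter, then any single character, and the fields are assembled from the matches. MatchSplitter is a hypothetical name, both arguments are assumed to be non-empty, and a qualifier sequence appearing inside a qualified column (the case asked about above) is not handled.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    // Hypothetical sketch; assumes delimiter and qualifier are non-empty.
    public static class MatchSplitter
    {
        public static string[] Split(string expression, string delimiter, string qualifier)
        {
            string q = Regex.Escape(qualifier);
            string d = Regex.Escape(delimiter);

            // Alternation order matters: qualified span, then delimiter, then any char.
            string pattern = "(?<q>" + q + ".*?" + q + ")|(?<d>" + d + ")|(?<c>.)";

            var fields = new List<string>();
            var current = new StringBuilder();

            foreach (Match m in Regex.Matches(expression, pattern, RegexOptions.Singleline))
            {
                if (m.Groups["d"].Success)
                {
                    // A delimiter outside any qualified span closes the current field.
                    fields.Add(current.ToString());
                    current.Clear();
                }
                else
                {
                    // Qualified spans and ordinary characters are kept verbatim.
                    current.Append(m.Value);
                }
            }

            fields.Add(current.ToString());
            return fields.ToArray();
        }
    }
    ```

    With this sketch, MatchSplitter.Split(input3, "$", "$|") gives three parts: the qualified text including its qualifiers, then Am2, then Cool!. An unpaired qualifier simply falls through to the delimiter and single-character branches.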



    • Edited by Viorel_MVP Friday, September 4, 2020 7:59 AM
    Friday, September 4, 2020 7:45 AM
  • Thanks Timon, the non-Regex version is easy to understand, debug and modify.

    However, the above code didn't work correctly for the sample scenario below (nor did its Regex counterpart in my first post).

    string input5 = "|$This is an $  example|$$Am2$Cool!|$$|";

    Here, if the text-qualifier is |$ and the delimiter is $, then the expected array of strings is as follows:

    a[0] is |$This is an $  example|$

    a[1] is Am2

    a[2] is Cool!

    a[3] is null

    In the above non-Regex code, beginQualifier evaluates to 0 and endQualifier evaluates to 35, whereas the closing qualifier actually appears much earlier in the input, just before $Am2.

    thanks

    Friday, September 4, 2020 11:58 AM
  • Hi etl2016,

    In this example, the qualifier appears 3 times, so expression.LastIndexOf(qualifier) in the previous code skips over the second qualifier.

    I replaced this call with IndexOf(String value, int startIndex), and now the program works for this example.

    But because the delimiter is '$', the final result looks like this:

    |$This is an $  example|$
    Am2
    Cool!|
    |

    I think this is not what you want, so I treat '|' as a delimiter too.

    The modified code is as follows:

            static void Main(string[] args)
            {
                string input = "*|This is an ..  example*|..Am2..Cool!";
                string input2 = "*|This is an $  example*|$Am2$Cool!";
                string input3 = "$Am2$Cool!$$|This is an $  example$|$Am2$Cool!";
                string input4 = "|$This is an $  example|$$Am2$Cool!";
                string input5 = "|$This is an $  example|$$Am2$Cool!|$$|";
                foreach (string _Part in Split2(input, new string[] { ".." }, "*|"))
                    Console.WriteLine(_Part);
                Console.WriteLine("~~~~~~~");
                foreach (string _Part in Split2(input2, new string[] { "$" }, "*|"))
                    Console.WriteLine(_Part);
                Console.WriteLine("~~~~~~~");
                foreach (string _Part in Split2(input3, new string[] { "$" }, "$|")) // doesn't work correctly
                    Console.WriteLine(_Part);
                Console.WriteLine("~~~~~~~");
                foreach (string _Part in Split2(input4, new string[] { "$" }, "|$")) //  doesn't work correctly
                    Console.WriteLine(_Part);
    
                Console.WriteLine("~~~~~~~");
                foreach (string _Part in Split2(input5, new string[] { "$" ,"|"}, "|$")) //  doesn't work correctly
                    Console.WriteLine(_Part);
                Console.WriteLine("~~~~~~~");
    
                Console.WriteLine("Press any key to continue...");
                Console.ReadLine();
            } 
            public static string[] Split2(string expression, string[] delimiter,
             string qualifier)
            {
                List<string> re = new List<string>();
                int beginQualifier = expression.IndexOf(qualifier);
                int endQualifier = expression.IndexOf(qualifier,beginQualifier + 1);
    
                if (beginQualifier>0)
                {
                    string start = expression.Substring(0, beginQualifier+1);
                    re.AddRange(start.Split(delimiter, StringSplitOptions.None));
                }
           
                string mid = expression.Substring(beginQualifier, endQualifier-beginQualifier+qualifier.Length);
                string end = expression.Substring(endQualifier + qualifier.Length);
               
                re.Add(mid);
                re.AddRange(end.Split(delimiter, StringSplitOptions.None));
                return re.ToArray();
            }

    My first version used StringSplitOptions.RemoveEmptyEntries when splitting the string, so empty strings were not displayed. The code above uses StringSplitOptions.None, which keeps them; switch back to RemoveEmptyEntries if you don't need them.

    But I think this solution still has shortcomings; some new string formats may still cause errors.

    For example, if there are 4 qualifiers in the text, should we treat each two as a pair and extract the qualified text in full, or treat the last two as ordinary characters?
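
    If the answer is that each two qualifiers form a pair, a single left-to-right scan might look like the sketch below. It is only an illustration under that assumption, not a replacement tested against every format. PairedSplitter is a hypothetical name; a qualifier without a closing partner is treated as ordinary text, and empty fields are kept.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical sketch: pairs up qualifiers from left to right.
    public static class PairedSplitter
    {
        public static string[] Split(string expression, string delimiter, string qualifier)
        {
            if (string.IsNullOrEmpty(delimiter) || string.IsNullOrEmpty(qualifier))
                throw new ArgumentException("delimiter and qualifier must be non-empty");

            var fields = new List<string>();
            int fieldStart = 0;
            int i = 0;

            while (i < expression.Length)
            {
                // Opening qualifier here? Jump past its closing partner, if there is one.
                if (string.CompareOrdinal(expression, i, qualifier, 0, qualifier.Length) == 0)
                {
                    int close = expression.IndexOf(qualifier, i + qualifier.Length,
                                                   StringComparison.Ordinal);
                    if (close >= 0)
                    {
                        i = close + qualifier.Length;
                        continue;
                    }
                }

                // Delimiter outside a qualified span: close the current field.
                if (string.CompareOrdinal(expression, i, delimiter, 0, delimiter.Length) == 0)
                {
                    fields.Add(expression.Substring(fieldStart, i - fieldStart));
                    i += delimiter.Length;
                    fieldStart = i;
                    continue;
                }

                i++;
            }

            fields.Add(expression.Substring(fieldStart));
            return fields.ToArray();
        }
    }
    ```

    For instance, PairedSplitter.Split("*|a$b*|$c$*|d$e*|", "$", "*|") returns three parts: *|a$b*|, c and *|d$e*|. Checking for the qualifier before the delimiter is what lets the two share characters, as in |$ and $.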

    Best Regards,

    Timon



    Monday, September 7, 2020 8:08 AM
  • Hi,

    Has your issue been resolved?

    If so, please click on the "Mark as answer" option of the reply that solved your question, so that it will help other members to find the solution quickly if they face a similar issue.

    Best Regards,

    Timon



    Thursday, September 17, 2020 7:52 AM
  • hi,

    I am trying to implement a solution to be able to split a line into array of strings, considering two criteria. Firstly- there are certain columns that are text-qualified with multi-character boundaries. Secondly, a multi-character delimiter. The situation may get complex when there are common characters in the two features. To add, metacharacters such as $ and ^ may add more challenges. It seems that, Regex is most suited for such purposes.

    What you're describing is more or less the same situation as XML/HTML:  You are dealing with irregular expressions in the input.  So it does not seem that RegEx is suited for anything like this purpose.

    Before you can learn anything new you have to learn that there's stuff you don't know.

    Thursday, September 17, 2020 1:09 PM