Parse String

  • Question

  • Hello,

    I have a string as follows: "One Phrase #123 #XY4".

    How can I get the values after # into an array?

    Thank You,

    Miguel

    Monday, February 4, 2013 7:20 PM

All replies

  • If the format is always similar to this, you can use:

    string[] values = theString.Split('#').Skip(1).Select(s => s.Trim()).ToArray();
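
    For anyone who wants to try the line above in isolation, here is a minimal sketch wrapped in a complete program. The class and method names (HashValues, Extract) are made up for illustration and are not part of the original answer.

    ```csharp
    using System;
    using System.Linq;

    public static class HashValues
    {
        // Hypothetical helper name; wraps the Split/Skip/Trim one-liner.
        public static string[] Extract(string theString)
        {
            return theString.Split('#')            // pieces between the # characters
                            .Skip(1)               // drop the text before the first #
                            .Select(s => s.Trim()) // remove surrounding whitespace
                            .ToArray();
        }
    }

    public static class Program
    {
        public static void Main()
        {
            string[] values = HashValues.Extract("One Phrase #123 #XY4");
            Console.WriteLine(string.Join(", ", values)); // prints "123, XY4"
        }
    }
    ```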
    


    Reed Copsey, Jr. - http://reedcopsey.com
    If a post answers your question, please click "Mark As Answer" on that post and "Mark as Helpful".

    Monday, February 4, 2013 7:22 PM
  • Yes,

    The format will always be similar.

    Thank you,

    Miguel

    Monday, February 4, 2013 7:37 PM
  • Reed Copsey, Jr's answer is great.

    But if the grammar of this expression is more complex than just splitting on # and trimming, then you'll want to switch to using a Regular Expression.

    Here's a small regular expression test harness so you can try some test cases and make sure that you get exactly the results that you want.

    I've included a Regex that approximates Reed's technique (note that, unlike Skip(1), it also returns any text before the first #), and I've also proposed one of my own. Adjust the #if directive to switch between them.

    using System;
    using System.Linq;
    using System.Text.RegularExpressions;
    
    class Program
    {
        static string[] Parse( string text )
        {
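    #if true
            // Approximates Reed Copsey, Jr's Split/Trim: runs of non-# text between word boundaries.
            // Unlike Skip(1), this also returns the text before the first #.
            string pattern = @"\b([^#]+)\b";
    #else
            // Wyck: capture everything after a # up to the next whitespace.
            string pattern = @"#([^\s]*)";
    #endif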
            return Regex.Matches( text, pattern ).Cast<Match>().Select( m => m.Groups[1].Value ).ToArray();
        }
    
        static void Main( string[] args )
        {
            string[] tests = new string[] {
                @"One Phrase #123 #XY4",
                @"One Phrase #123 #XY4  ",
                @"One Phrase   #123   #XY4",
                @"#1#2#3",
                @"Apple #banana boat #cloud atlas",
                @"#",
                @"##",
                @"###",
                @"#1#2 #3#4# 5#6",
                @"#hello#",
            };
    
            foreach( string test in tests )
            {
                string[] vals = Parse( test );
                    
                Console.WriteLine( "\"{0}\" : {1}", test, 
                    string.Join( ", ", 
                        vals.Select( x => string.Format(
                            "\"{0}\"", x
                            )).ToArray() )
                    );
            }
        }
    }
    

    • Proposed as answer by Lisa Zhu Wednesday, February 6, 2013 6:37 AM
    • Marked as answer by Jason Dot Wang Tuesday, February 12, 2013 5:52 AM
    Monday, February 4, 2013 8:03 PM
  • Hello Wyck,

    Thank you for your code. I also like that solution.

    I am considering simplifying my code by doing this in two steps.

    For example, consider the following string "User Verify #8024".

    This string will always be "User Verify #" + Number. It will have ONLY one #number!

    I would like to create two methods. One to validate and one to parse:

    public Boolean IsValid(string value) {
      // Check if it is valid, e.g, of type "User Verify #" + Number
    }
    public Int32 Parse(string value) {
      // Receives a string of type "User Verify #" + Number and if it is valid returns the number.
    }
    
    

    Can, or should, I implement this with Regex?

    Thank You,

    Miguel

    Tuesday, February 5, 2013 11:30 AM
  • Since the regular expression will both validate and return the result in one go, you'll end up running it twice: once when you call IsValid, and then again when you call Parse.

    Instead, I suggest you make your method signature something like this:

    public bool TryParse( string value, out int result )

    Then it can be implemented with a single call to a regular expression.

    You can add validation by anchoring the expression with ^ and \z, which match the beginning and the very end of the string respectively, so that nothing beyond what your expression describes is accepted. And you can use parentheses to produce captures, which package up parts of the matching text into individual string results.

    Here's an implementation with a test.

    using System;
    using System.Text.RegularExpressions;
    
    class Program
    {
        public static bool TryParse( string text, out int n )
        {
            string pattern = @"^User Verify #([\d]+)\z";
            Match m = Regex.Match( text, pattern );
            if( m.Success )
            {
                return int.TryParse( m.Groups[1].Value, out n );
            }
            else
            {
                n = 0;
                return false;
            }
        }
    
        static void Main( string[] args )
        {
            object[] testData = new object[] {
                @"User Verify #8024", true,
                @"User Verify#8024", false,
                @"user verify #8024", false, // case must match
                @"User Verify #8024.", false, // nothing extra at end.
                @" User Verify #8024", false, // nothing extra at beginning
                @"User Verify ##8024", false,
                @"User Verify # 8024", false,
                @"User Verify 8024", false, // missing #
                "\nUser Verify #8024", false,
                "User Verify #8024\n", false, // This catches using $ rather than \z in regex.
                "User Verify #8024\r\n", false,
                @"User Verify #", false,
                @"User Verify #0", true, // zero is ok
                @"User Verify #1", true, 
                @"User Verify #-1", false, // Negative disallowed.
                @"User Verify #111111111", true,
                @"User Verify #1111111111", true,
                @"User Verify #11111111111", false, // Exceeds int.MaxValue
                @"User Verify #999999999", true,
                @"User Verify #9999999999", false, // Exceeds int.MaxValue
                @"User Verify #99999999999", false, // Exceeds int.MaxValue
                @"User Verify #000000000", true,
                @"User Verify #0000000000", true,
                @"User Verify #00000000000", true,
            };
    
            int failures = 0;
            for( int i = 0; i + 1 < testData.Length; i += 2 )
            {
                int n;
                string testInput = (string)testData[i];
                bool expectedResult = (bool)testData[i + 1];
    
                bool actualResult = TryParse( testInput, out n );
                Console.Write( actualResult == expectedResult ? "PASS" : "FAIL" );
                if( actualResult != expectedResult ) ++failures;
                Console.Write( " \"{0}\"", testInput );
                if( actualResult )
                {
                    Console.WriteLine( " is valid: n = {0}", n );
                }
                else
                {
                    Console.WriteLine( " is invalid." );
                }
            }
            Console.WriteLine();
            Console.WriteLine( failures == 0 ? "PASS" : "FAIL" );
        }
    }
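
    As a side note on why the pattern ends in \z rather than $: in .NET, without RegexOptions.Multiline, $ also matches just before a final newline, so a trailing "\n" would slip through. The sketch below isolates that difference. The class and method names (AnchorDemo, MatchesWithDollar, MatchesWithSlashZ) are invented for this illustration.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class AnchorDemo
    {
        // $ matches at the end of the string OR just before a final '\n'.
        public static bool MatchesWithDollar(string text)
        {
            return Regex.IsMatch(text, @"^User Verify #(\d+)$");
        }

        // \z matches only at the very end of the string.
        public static bool MatchesWithSlashZ(string text)
        {
            return Regex.IsMatch(text, @"^User Verify #(\d+)\z");
        }
    }

    public static class Program
    {
        public static void Main()
        {
            string withNewline = "User Verify #8024\n";
            Console.WriteLine(AnchorDemo.MatchesWithDollar(withNewline)); // prints "True"
            Console.WriteLine(AnchorDemo.MatchesWithSlashZ(withNewline)); // prints "False"
        }
    }
    ```

    Both helpers accept "User Verify #8024" itself; they differ only on input with a trailing newline.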
    

    • Edited by Wyck Tuesday, February 5, 2013 2:17 PM oops, left in a tiny bit of cruft.
    Tuesday, February 5, 2013 2:14 PM
  • I'm curious.

    Reed shows you an easy, maintainable piece of code, and you want code that is complex for beginners and otherwise only harder to maintain.

    Do you have a reason for that?


    Success
    Cor

    Tuesday, February 5, 2013 2:50 PM
  • TL;DR: If the subject is "Parse String", my answer is going to be "use Regex".

    Alas, it is with great irony that my response to your question will be as complex as my original code response, but you asked for a reason, and reasons are complex things, so here goes.

    I'll provide two reasons.  One is a meta-reason: a philosophical introspection, and the other is more concrete, but highly specific to performance under extreme circumstances.

    I'm not an efficient programmer in that I don't produce results quickly.  I'll admit that.  I always favour the do-it-yourself approach and I tend to get involved in projects for great lengths of time.  But it's because I always strive for a deep understanding of what's going on under the hood. In this case, I fully acknowledge that the code itself is much easier in Reed's version, and it's definitely the right code for a beginner to be writing.

    As I tackle similar kinds of problems in the real world, I increasingly find that performance and scalability become the issue.  (In other career paths, some may find that code maintenance becomes the dominant issue, but historically, for me, it's performance and scale.)  So I feel compelled to provide an alternative that reflects my needs in my little corner of the real world.  Again, I admit that the larger world is a big place with lots of interesting requirements, and if code maintenance is the most important thing, then, by all means, please use Reed's approach.  But if you ever find yourself wondering about what can be done to make things more efficient in extreme circumstances, then you might enjoy what I have to say.  I've always had to have my mind on performance.  In my career, it didn't matter how long it took to write or maintain the code as long as the new code was faster or better in some way.  That's just my narrow view of the big world, though.

    But that's enough of a meta-answer.  Let's look at the actual efficiency for a moment.

    The bulk of the work is being done by string.Split.  Here we have a function that is easy to discover, easy to write correctly, and easy to maintain.  It's awesome, but it's not very efficient.  We must construct new strings for each of the results whether we use them or not.

    Skip(1) is wishful thinking.  If Split returned a lazily evaluated IEnumerable, or if Split were part of LINQ, then this would be advantageous, but Split returns an array.  I think it's useful to know this.  Skip is really just being used to throw out the first thing in the array.  Does it matter when it's just 1 thing?  No.

    Then we need to trim off the whitespace.  Reed has chosen string.Trim().  This is another favourite of many and it's very easy to write and discoverable, and all that, but again, it's inefficient in the big scheme of things in that it is using a temporary string as input, not the original string.  This is my biggest pet peeve about string processing.  This relies heavily on the garbage collection mechanism doing its job, which it does, provided there is enough "scratch" space for all the temporary results.  Is Trim() acceptable for most cases?  Yes, much to my disappointment, it is.  My technique only wins in a few kinds of circumstances.

    But if you look at the Regex Match, on the API alone it has the potential to be better for performance, if string operations are the expensive part.  Here we combine the description of splitting on the # and trimming whitespace into a single regular expression.  I'd like to point out that this regular expression can be compiled so that it can be reused on many different input strings.  At the expense of some up-front cost of constructing the regex, we can now efficiently process the strings.  The regular expression produces an object that can efficiently execute the "program" of the regular expression to produce matches.  It's highly fine-tuned to handle the specific parsing task (the "pattern").
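
    To make that concrete, here is a rough sketch of what "construct once, reuse many times" might look like. The names HashParser, Parse and Locate are invented for this illustration, and the pattern #(\S*) is just a shorter spelling of #([^\s]*). Locate reports where each value sits in the original string via Group.Index and Group.Length instead of building a substring for it; the Match objects themselves are still allocated, so treat this as an illustration of the idea rather than a measured optimization.

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    public static class HashParser
    {
        // Built once and shared. RegexOptions.Compiled costs more to construct
        // but is intended to make repeated matching cheaper.
        private static readonly Regex HashValue =
            new Regex(@"#(\S*)", RegexOptions.Compiled);

        // Returns the captured values as strings (one new string per value).
        public static string[] Parse(string text)
        {
            return HashValue.Matches(text)
                            .Cast<Match>()
                            .Select(m => m.Groups[1].Value)
                            .ToArray();
        }

        // Returns (offset, length) of each value within the original string,
        // without creating a substring for it.
        public static Tuple<int, int>[] Locate(string text)
        {
            return HashValue.Matches(text)
                            .Cast<Match>()
                            .Select(m => Tuple.Create(m.Groups[1].Index, m.Groups[1].Length))
                            .ToArray();
        }
    }

    public static class Program
    {
        public static void Main()
        {
            string[] inputs = { "One Phrase #123 #XY4", "Apple #banana boat #cloud atlas" };
            foreach (string input in inputs)
            {
                // The same compiled Regex instance handles every input.
                Console.WriteLine(string.Join(", ", HashParser.Parse(input)));
                foreach (Tuple<int, int> span in HashParser.Locate(input))
                {
                    Console.WriteLine("  offset {0}, length {1}", span.Item1, span.Item2);
                }
            }
        }
    }
    ```

    Whether the compiled form actually pays off depends on how many strings you process, so it is worth measuring before committing to it.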

    And like I said, if the grammar of the expression that you are trying to parse gets more complex, then Reed will have to dig deeper into the toolbox and pull out more tricks, like Split, Skip and Trim.  Similarly, I will have to dig deeper into my regular expression toolbox to find more things like ( # ) \z * + etc.  But if the strings are large and the expression is complex, then Reed's version is going to construct more temporary strings and use more memory than the version that uses the regular expression.  The regular expression takes much more up-front processing though.  And for such a simple example, it takes more to construct the Match object than it does to do the string processing, so it's moot for a simple grammar (which, again, is why I said Reed's answer is great).

    If I have failed to make it look as elegant as Reed's then it's because it's not a one liner, and I supplied some hack-ish unit-testing kind of supporting code, as well as the fact that I went to the effort of doing the validation, and parsing the int.

    I'm often reminded that my answers aren't the best and that my tendency to over-engineer and plan for circumstances that don't arise is a detriment to my career.  But every once in a while I get to knock one out of the park and create something huge, and that's been fun and educational.

    So finally, when asked to advise someone about how to parse, sure I can respond with the fact that there are a few cute and clever tools like Split and Trim that you can snap together to get short, elegant, high-performance solutions to simple parsing problems.  But I would be remiss not to mention that people have built regular expression tools that can very efficiently handle large parsing problems.  So for something tiny, go ahead and use the quick tool, but for something more complex, you can leverage the time and effort that has been put into solving large scale parsing problems by taking advantage of regular expressions.


    Tuesday, February 5, 2013 7:23 PM