New Article on Regex speed versus XML processing

Locked New Article on Regex speed versus XML processing

All Replies

  • Saturday, November 26, 2011 4:35 AM
     
      Has Code
    > I have written another article on regex speed which was based on a question asked in the C# forum..Net Regex: Can Regular Expression Parsing be Faster than XmlDocument or Linq to Xml?
     
     

    i have tested Regex, XmlDocument, XElement and have got the following results:
     
    { XElement = 75, Regex = 42, XmlDocument = 37 }	best: XmlDocument
    { XElement = 37, Regex = 33, XmlDocument = 31 }	best: XmlDocument
    { XElement = 35, Regex = 34, XmlDocument = 31 }	best: XmlDocument
    { XElement = 34, Regex = 34, XmlDocument = 31 }	best: XmlDocument
    { XElement = 34, Regex = 33, XmlDocument = 31 }	best: XmlDocument
    { XElement = 43, Regex = 36, XmlDocument = 32 }	best: XmlDocument
    { XElement = 33, Regex = 33, XmlDocument = 31 }	best: XmlDocument
    { XElement = 36, Regex = 37, XmlDocument = 33 }	best: XmlDocument
    { XElement = 36, Regex = 34, XmlDocument = 32 }	best: XmlDocument
    { XElement = 33, Regex = 34, XmlDocument = 31 }	best: XmlDocument
    { XElement = 35, Regex = 33, XmlDocument = 33 }	best: Regex
    { XElement = 34, Regex = 33, XmlDocument = 31 }	best: XmlDocument
    { XElement = 33, Regex = 32, XmlDocument = 30 }	best: XmlDocument
    { XElement = 34, Regex = 33, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 33, XmlDocument = 32 }	best: XmlDocument
    { XElement = 34, Regex = 37, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 32, XmlDocument = 31 }	best: XmlDocument
    { XElement = 38, Regex = 33, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 32, XmlDocument = 30 }	best: XmlDocument
    { XElement = 32, Regex = 33, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 33, XmlDocument = 31 }	best: XmlDocument
    { XElement = 34, Regex = 35, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 33, XmlDocument = 30 }	best: XmlDocument
    { XElement = 34, Regex = 33, XmlDocument = 31 }	best: XmlDocument
    { XElement = 33, Regex = 32, XmlDocument = 30 }	best: XmlDocument
    { XElement = 35, Regex = 33, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 33, XmlDocument = 33 }	best: XElement
    { XElement = 53, Regex = 34, XmlDocument = 30 }	best: XmlDocument
    { XElement = 33, Regex = 32, XmlDocument = 30 }	best: XmlDocument
    { XElement = 32, Regex = 32, XmlDocument = 30 }	best: XmlDocument


    below is a test code
     
    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml;
    using System.Xml.Linq;
    
    namespace WindowsFormsApplication4
    {
        static class Program
        {
            [STAThread]
            static void Main()
            {
                var di = new DirectoryInfo("test");
                if (di.Exists == false)
                {
                    di.Create();
                    var xml = @"<?xml version=""1.0"" encoding=""UTF-8""?>
                    <urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">" +
                            String
                                .Concat(Enumerable.Range(0, 100)
                                    .Select(c =>
                                        "<url><loc>http://localhost:" + c + "</loc></url>")) +
                        "</urlset>";
                    for (int i = 0; i < 100; i++)
                        File.WriteAllText(Path.Combine(di.FullName, "test" + i + ".xml"), xml);
                }
                var res = Enumerable.Range(0, 30).Select(c => new
                {
                    XElement = TestXElement(di),
                    Regex = TestRegex(di),
                    XmlDocument = TestXmlDocument(di)
                });
                foreach (var c in res)
                {
                    var best = c.GetType()
                        .GetProperties()
                        .Select(p => new { Name = p.Name, Value = (long)p.GetValue(c, null) })
                        .OrderBy(pv => pv.Value)
                        .First();
                    System.Diagnostics.Trace.WriteLine(c + "\tbest: " + best.Name);
                }
            }
    
            static long TestXElement(DirectoryInfo di)
            {
                var s = Stopwatch.StartNew();
                foreach (var file in di.EnumerateFiles())
                {
                    var xe = XElement.Load(file.FullName);
                    var c = xe.Elements().Count();
                }
                s.Stop();
                return s.ElapsedMilliseconds;
            }
    
            static long TestXmlDocument(DirectoryInfo di)
            {
                var s = Stopwatch.StartNew();
                foreach (var file in di.EnumerateFiles())
                {
                    var xe = new XmlDocument();
                    xe.Load(file.FullName);
                    var c = xe.DocumentElement.ChildNodes.Count;
                }
                s.Stop();
                return s.ElapsedMilliseconds;
            }
    
            static long TestRegex(DirectoryInfo di)
            {
                var s = Stopwatch.StartNew();
                var re = new Regex(@"<url>\s*<loc>", RegexOptions.Multiline | RegexOptions.IgnoreCase);
                foreach (var file in di.EnumerateFiles())
                {
                    var txt = File.ReadAllText(file.FullName);
                    var c = re.Matches(txt).Count;
                }
                s.Stop();
                return s.ElapsedMilliseconds;
            }
        }
    }
    
    
    please note that XmlDocument has been loaded through the method .Load(file.FullName) instead of .LoadXml(File.ReadAllText(file.FullName))

      
    • Edited by Malobukv Saturday, November 26, 2011 4:55 AM
    •  
  • Saturday, November 26, 2011 9:32 PM
    Moderator
     
     

    You have added to the load of the regular expression; which is signficant and possibly uncessarily slowing it down..

    1. Multiline is specified , though there is no $ or ^ present in the regex pattern! Why?
    2. IgnoreCase which causes the regex to really slow down. See (Want faster regular expressions? Maybe you should think about that IgnoreCase option...)
    3. Since you have created the regex object, why not used the compiled option? Using the static version caches the pattern. Question, though does that matter in the above code...

     

    Suffice it to say, one can make the regular expression parser behave slower by either bad patterns or by adding to its work load. Plus I can give you a regex which only counts subnodes...so the test would be fair...

    Thoughts?

     


    William Wegerson (www.OmegaCoder.Com)






  • Monday, December 05, 2011 9:43 PM
     
     
    Thanks William
    John Grove, Senior Software Engineer http://www.digitizedschematic.com/