Java vs C# vs String.Contains

    Question

  • Hi,

    I'm not sure anyone is going to read this, but I find that when I use the string.Contains() instance method instead of the static Regex.IsMatch() method, the C# code executes many times faster.

    Java Code console output as a baseline:
    found AAAGTAAGCC at 1000000 after 16 milliseconds
    found AAATGAAAAAG at 1048960 after 407 milliseconds
    found GAAAAAGTAAG at 1085441 after 532 milliseconds
    found TCTAAAAATAG at 1179694 after 813 milliseconds
    found ACGTGATGTAG at 1204636 after 891 milliseconds
    found AATAGATTCGG at 1548576 after 1938 milliseconds
    found TCGTACAAATG at 1576094 after 2016 milliseconds
    found CGGACGTGATG at 1599255 after 2094 milliseconds
    found ATTCGGACGTG at 1689064 after 2375 milliseconds
    found AGATTCGGACG at 1859204 after 2875 milliseconds
    found TGATGTAGTCG at 1984902 after 3250 milliseconds
    found AAATAGATTCG at 2000000 after 3297 milliseconds
    Java regex took 3297 milliseconds 

    C# code using Regex.IsMatch():
    found AAAGTAAGCC at 1000000 after 1 milliseconds
    found AAATGAAAAAG at 1048960 after 802 milliseconds
    found GAAAAAGTAAG at 1085441 after 1420 milliseconds
    found TCTAAAAATAG at 1179694 after 2987 milliseconds
    found ACGTGATGTAG at 1204636 after 3396 milliseconds
    found AATAGATTCGG at 1548576 after 9025 milliseconds
    found TCGTACAAATG at 1576094 after 9467 milliseconds
    found CGGACGTGATG at 1599255 after 9845 milliseconds
    found ATTCGGACGTG at 1689064 after 11346 milliseconds
    found AGATTCGGACG at 1859204 after 14098 milliseconds
    found TGATGTAGTCG at 1984902 after 16145 milliseconds
    found AAATAGATTCG at 2000000 after 16391 milliseconds
    .NET regex took 16391 milliseconds.

    C# code using string.Contains():
    found AAAGTAAGCC at 1000000 after 1 milliseconds
    found AAATGAAAAAG at 1048960 after 67 milliseconds
    found GAAAAAGTAAG at 1085441 after 116 milliseconds
    found TCTAAAAATAG at 1179694 after 243 milliseconds
    found ACGTGATGTAG at 1204636 after 276 milliseconds
    found AATAGATTCGG at 1548576 after 742 milliseconds
    found TCGTACAAATG at 1576094 after 779 milliseconds
    found CGGACGTGATG at 1599255 after 811 milliseconds
    found ATTCGGACGTG at 1689064 after 932 milliseconds
    found AGATTCGGACG at 1859204 after 1161 milliseconds
    found TGATGTAGTCG at 1984902 after 1340 milliseconds
    found AAATAGATTCG at 2000000 after 1361 milliseconds
    .NET string contains took 1361 milliseconds.



    In short:
    Java regex code          3297 ms.
    C# string.Contains()     1361 ms.
    C# Regex.IsMatch()      16391 ms.

    Any thoughts?
    Wednesday, August 06, 2008 1:37 PM

All replies

  • Hi,

    Is there someplace where I could download that genome to do my own test?

    I'm curious about the difference in the C# times. The only time I did my own benchmarks was years ago and iirc regex won, but of course things change a lot depending on input and pattern.

    Cheers,

    John
    Wednesday, August 06, 2008 8:14 PM
  • Hi,

    My post actually was a reply to this thread (over a year old):

    Java Regex faster than C# Regex
     

    See all relevant code there.

    To my surprise, searching for a substring within a string was much faster using 'string.Contains(substring)' than 'Regex.IsMatch()' (and also substantially faster than the faster Java regular expression classes).
    Wednesday, August 06, 2008 8:34 PM
  • It shouldn't be too surprising that a pattern matcher runs slower than a substring matcher. Regex has to treat its input as a pattern on every call and check it for wildcards, while string.Contains() looks for an exact match and does not need to preprocess the "pattern".

    Have you tried using the Java equivalent of string.Contains()?  I would expect that to be faster, too.



    Les Potter, Xalnix Corporation
    Wednesday, August 06, 2008 11:49 PM
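
A minimal Java sketch of the two kinds of search contrasted above: a plain substring test and a regex test over the same literal. The class name, method names, and the sample text are made up for illustration; this is not the benchmark code from the linked thread, and it makes no timing claims.

```java
import java.util.regex.Pattern;

public class SubstringVsRegex {
    // Plain substring search: nothing to parse, just an exact match.
    static boolean containsLiteral(String text, String needle) {
        return text.contains(needle);
    }

    // Regex search using a pattern the caller compiled once and reuses.
    static boolean containsRegex(String text, Pattern compiled) {
        return compiled.matcher(text).find();
    }

    public static void main(String[] args) {
        // Made-up sample text, not the genome data used in the thread.
        String text = "TTGACAAAGTAAGCCTTAG";
        // First sequence from the posted console output.
        String needle = "AAAGTAAGCC";
        // Pattern.quote treats the needle as a literal, so both calls
        // answer the same question.
        Pattern compiled = Pattern.compile(Pattern.quote(needle));

        System.out.println(containsLiteral(text, needle));
        System.out.println(containsRegex(text, compiled));
    }
}
```

Compiling the pattern once outside the loop is the usual way to keep the regex side of such a comparison from paying the parsing cost on every call.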
  • xalnix, you're absolutely right, that should be done, so:

    found AAAGTAAGCC at 1000000 after 0 milliseconds
    found AAATGAAAAAG at 1048960 after 141 milliseconds
    found GAAAAAGTAAG at 1085441 after 203 milliseconds
    found TCTAAAAATAG at 1179694 after 359 milliseconds
    found ACGTGATGTAG at 1204636 after 406 milliseconds
    found AATAGATTCGG at 1548576 after 953 milliseconds
    found TCGTACAAATG at 1576094 after 1000 milliseconds
    found CGGACGTGATG at 1599255 after 1047 milliseconds
    found ATTCGGACGTG at 1689064 after 1188 milliseconds
    found AGATTCGGACG at 1859204 after 1469 milliseconds
    found TGATGTAGTCG at 1984902 after 1672 milliseconds
    found AAATAGATTCG at 2000000 after 1688 milliseconds
    Java string contains took 1688 milliseconds

    The complete listing then:
    C# string.Contains()        1361 ms.
    Java String.contains()      1688 ms.
    Java regex code             3297 ms.
    C# Regex.IsMatch()         16391 ms.


    Thursday, August 07, 2008 9:38 AM
  • So if you are comparing apples to apples, C# is about 20% faster than Java on the same basic test jig when using string.Contains. Relative to that baseline, this suggests that C# Regex is even more than 5x slower than Java Regex. Are you using .NET Regex in Java (J#), or Sun Java and its built-in Regex library? Sorry if that sounds stupid; I am not a Java or J# user. Regardless, there are many flavors of Regex, so their performance will differ.
    Les Potter, Xalnix Corporation
    Thursday, August 07, 2008 1:53 PM
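
The ratios behind that comparison can be worked out from the four totals posted in the thread. The sketch below is only arithmetic on those numbers. The class name and the "normalized" figure (the raw regex ratio multiplied by the substring baseline ratio) are one assumed way of reading the "even more than 5x" remark, not something the posters computed.

```java
public class TimingRatios {
    // Ratio of two elapsed times; 2.0 means "a took twice as long as b".
    static double ratio(long aMillis, long bMillis) {
        return (double) aMillis / bMillis;
    }

    public static void main(String[] args) {
        // Totals copied from the listing in the thread, in milliseconds.
        long csContains = 1361;
        long javaContains = 1688;
        long javaRegex = 3297;
        long csRegex = 16391;

        // How much longer Java takes than C# on plain substring search.
        double baseline = ratio(javaContains, csContains);
        // How much longer C# takes than Java on the regex test.
        double regexRaw = ratio(csRegex, javaRegex);
        // Regex gap after factoring in the substring baseline (assumed reading).
        double regexNormalized = regexRaw * baseline;

        System.out.printf("Java/C# substring ratio: %.2f%n", baseline);
        System.out.printf("C#/Java regex ratio:     %.2f%n", regexRaw);
        System.out.printf("Normalized regex ratio:  %.2f%n", regexNormalized);
    }
}
```

On these totals the raw regex ratio comes out just under 5, and the normalized one a little over 6, which is consistent with "about 20% faster" on the substring test (C# takes roughly 81% of the Java time).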