Java vs C# vs String.Contains
-
Wednesday, August 06, 2008 1:37 PM
Hi,
I'm not sure anyone is going to read this, but I find that when I use the instance method string.Contains() instead of the static Regex.IsMatch() method, the C# code executes many times faster.
Java regex console output as a baseline:
found AAAGTAAGCC at 1000000 after 16 milliseconds
found AAATGAAAAAG at 1048960 after 407 milliseconds
found GAAAAAGTAAG at 1085441 after 532 milliseconds
found TCTAAAAATAG at 1179694 after 813 milliseconds
found ACGTGATGTAG at 1204636 after 891 milliseconds
found AATAGATTCGG at 1548576 after 1938 milliseconds
found TCGTACAAATG at 1576094 after 2016 milliseconds
found CGGACGTGATG at 1599255 after 2094 milliseconds
found ATTCGGACGTG at 1689064 after 2375 milliseconds
found AGATTCGGACG at 1859204 after 2875 milliseconds
found TGATGTAGTCG at 1984902 after 3250 milliseconds
found AAATAGATTCG at 2000000 after 3297 milliseconds
Java regex took 3297 milliseconds
C# code using Regex.IsMatch():
found AAAGTAAGCC at 1000000 after 1 milliseconds
found AAATGAAAAAG at 1048960 after 802 milliseconds
found GAAAAAGTAAG at 1085441 after 1420 milliseconds
found TCTAAAAATAG at 1179694 after 2987 milliseconds
found ACGTGATGTAG at 1204636 after 3396 milliseconds
found AATAGATTCGG at 1548576 after 9025 milliseconds
found TCGTACAAATG at 1576094 after 9467 milliseconds
found CGGACGTGATG at 1599255 after 9845 milliseconds
found ATTCGGACGTG at 1689064 after 11346 milliseconds
found AGATTCGGACG at 1859204 after 14098 milliseconds
found TGATGTAGTCG at 1984902 after 16145 milliseconds
found AAATAGATTCG at 2000000 after 16391 milliseconds
.NET regex took 16391 milliseconds.
C# code using string.Contains():
found AAAGTAAGCC at 1000000 after 1 milliseconds
found AAATGAAAAAG at 1048960 after 67 milliseconds
found GAAAAAGTAAG at 1085441 after 116 milliseconds
found TCTAAAAATAG at 1179694 after 243 milliseconds
found ACGTGATGTAG at 1204636 after 276 milliseconds
found AATAGATTCGG at 1548576 after 742 milliseconds
found TCGTACAAATG at 1576094 after 779 milliseconds
found CGGACGTGATG at 1599255 after 811 milliseconds
found ATTCGGACGTG at 1689064 after 932 milliseconds
found AGATTCGGACG at 1859204 after 1161 milliseconds
found TGATGTAGTCG at 1984902 after 1340 milliseconds
found AAATAGATTCG at 2000000 after 1361 milliseconds
.NET string contains took 1361 milliseconds.
In short:
Java regex code 3297 ms.
C# string.Contains() 1361 ms.
C# Regex.IsMatch() 16391 ms.
Any thoughts?
- Split by OmegaManMVP, Moderator, Wednesday, August 06, 2008 6:38 PM: split from original thread.
All Replies
-
Wednesday, August 06, 2008 8:14 PM
Hi,
Is there someplace where I could download that genome to do my own test?
I'm curious about the difference in the C# times. The only time I ran my own benchmarks was years ago and, IIRC, regex won, but of course things change a lot depending on the input and pattern.
Cheers,
John
-
Wednesday, August 06, 2008 8:34 PM
Hi,
My post actually was a reply to this thread (over a year old):
Java Regex faster than C# Regex
See all relevant code there.
To my surprise, searching for a substring within a string was much faster using 'string.Contains(substring)' than 'Regex.IsMatch()' (and also substantially faster than the faster Java regular expression classes).
-
Wednesday, August 06, 2008 11:49 PM
It shouldn't be too surprising that a pattern matcher runs slower than a substring matcher. Regex will have to scan the pattern to see if there are any wildcards each time. string.Contains() looks for an exact match and does not need to preprocess the "pattern".
Have you tried using the Java equivalent of string.Contains()? I would expect that to be faster, too.
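The comparison suggested here could be sketched in Java roughly as follows. This is a minimal sketch, not the code from the original thread: the class name SubstringVsRegex, the two helper methods, and the synthetic input string are assumptions made for illustration. A single timed run like this is only indicative, since JIT warm-up can distort the numbers.

```java
import java.util.regex.Pattern;

final class SubstringVsRegex {
    // Plain substring test: no pattern parsing is involved.
    static boolean containsLiteral(String text, String needle) {
        return text.contains(needle);
    }

    // Regex test using a pattern the caller compiled once and reuses.
    static boolean containsRegex(String text, Pattern compiled) {
        return compiled.matcher(text).find();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in data; the genome from the thread is not reproduced here.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200_000; i++) {
            sb.append("ACGT");
        }
        sb.append("AAATAGATTCG");
        String text = sb.toString();
        String needle = "AAATAGATTCG";

        // Pattern.LITERAL treats the needle as plain characters, not regex syntax.
        Pattern compiled = Pattern.compile(needle, Pattern.LITERAL);

        long t0 = System.nanoTime();
        boolean viaContains = containsLiteral(text, needle);
        long t1 = System.nanoTime();
        boolean viaRegex = containsRegex(text, compiled);
        long t2 = System.nanoTime();

        System.out.println("contains: " + viaContains + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("regex:    " + viaRegex + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Compiling the pattern once outside the timed section separates the cost of preparing the pattern from the cost of matching, which is the distinction being drawn above.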
Les Potter, Xalnix Corporation
-
Thursday, August 07, 2008 9:38 AM
xalnix, you're absolutely right, that should be done, so:
found AAAGTAAGCC at 1000000 after 0 milliseconds
found AAATGAAAAAG at 1048960 after 141 milliseconds
found GAAAAAGTAAG at 1085441 after 203 milliseconds
found TCTAAAAATAG at 1179694 after 359 milliseconds
found ACGTGATGTAG at 1204636 after 406 milliseconds
found AATAGATTCGG at 1548576 after 953 milliseconds
found TCGTACAAATG at 1576094 after 1000 milliseconds
found CGGACGTGATG at 1599255 after 1047 milliseconds
found ATTCGGACGTG at 1689064 after 1188 milliseconds
found AGATTCGGACG at 1859204 after 1469 milliseconds
found TGATGTAGTCG at 1984902 after 1672 milliseconds
found AAATAGATTCG at 2000000 after 1688 milliseconds
Java string contains took 1688 milliseconds
The complete listing then:
C# string.Contains() 1361 ms.
Java String.contains() 1688 ms.
Java regex code 3297 ms.
C# Regex.IsMatch() 16391 ms.
-
Thursday, August 07, 2008 1:53 PM
So if you are comparing apples to apples, C# is about 20% faster than Java for the same basic test jig when using string.Contains. This suggests that the C# Regex is even more than 5x slower than the Java Regex. Are you using .NET Regex in Java (J#), or Sun Java and its built-in Regex library? Sorry if that sounds stupid; I am not a Java or J# user. Regardless, there are many flavors of Regex, and their performance will differ.
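The arithmetic behind those ratios can be checked with a few lines of Java. This is only a sketch: the class name TimingRatios and the helper ratio are made up for illustration, and the four inputs are the totals reported earlier in this thread. The last figure divides each platform's regex time by its own string-contains time before comparing, which is one possible reading of the apples-to-apples adjustment.

```java
import java.util.Locale;

final class TimingRatios {
    // Ratio of two elapsed times; a value above 1 means the first took longer.
    static double ratio(long slowerMs, long fasterMs) {
        return (double) slowerMs / fasterMs;
    }

    public static void main(String[] args) {
        // Totals in milliseconds, as reported in the thread.
        long csContains = 1361;
        long javaContains = 1688;
        long javaRegex = 3297;
        long csRegex = 16391;

        // Raw comparisons between the two platforms.
        double containsRatio = ratio(javaContains, csContains);
        double regexRatio = ratio(csRegex, javaRegex);

        // Regex cost relative to each platform's own substring baseline.
        double csOverhead = ratio(csRegex, csContains);
        double javaOverhead = ratio(javaRegex, javaContains);
        double normalized = csOverhead / javaOverhead;

        System.out.println(String.format(Locale.ROOT,
                "Java contains / C# contains: %.2f", containsRatio)); // 1.24
        System.out.println(String.format(Locale.ROOT,
                "C# regex / Java regex:       %.2f", regexRatio));    // 4.97
        System.out.println(String.format(Locale.ROOT,
                "Normalized regex overhead:   %.2f", normalized));    // 6.17
    }
}
```

On the raw totals the C# regex run took just under 5x as long as the Java one; the "more than 5x" figure appears once each regex time is normalized by the same platform's string-contains time.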
Les Potter, Xalnix Corporation

