Can anyone share ideas on fastest way to compare and count 2 sets of input sequences of up to a million in each?
-
Wednesday, May 30, 2012 1:51 AM
Hey everyone:
Given a master list of sequences of keys (about a million long) and another input list of sequences of keys (about a million long) to compare against, what is the fastest way to compare the two lists counting the number of keys each sequence from our master list matches in each sequence in our input list?
Best illustrated by simple example:
master list
a,d,c
3,f,d
4,2,3input list
4,5,6
d,3,2
1,2,3
2,b,4output required:
input sequence: number of matches: number of times matched:
1,2,3 2 1
2,b,4 2 1
d,2,3 2 2
1,2,3 1 1
d,2,3 1 1We're currently using double parallel.foreach loops which, with a million sequences with up to 8 keys in each sequence, is taking nearly 40 hours to complete.
What's given is:
- each "master" list and the associated input list will ALWAYS be comma separated.
- Each sequence in the "master" list will always have the same number of keys in each sequence in the list.
- Each sequence in teh "input" list will always have the same number of keys in each sequence.
- The number of keys in a master list could differ from the number of keys in the in each sequence in the input list. That is, the master list could have 3 keys for each sequence, and the input list could have 5 keys, 6 keys, 3 keys, or any number of keys > 0.
Basic pseudo code of what we have is below. The trouble is, whether the loops are single threaded (paralleloptions = 1), unlimited, or set to a fixed value, it's only getting through the outer loop at the rate of about 2500 every 5 minutes, which equates to 40 hours for a million sequences.
If rewritten with just single threaded foreach loops it's worse.
Can anyone come up with ANY ideas on how to process this faster to deliver our desired results? We're looking for speed so don't care how pretty the code isn't. :-)
//pseudo code below Parallel.foreach (key in the master list => { hashset<string> h = new hashset(key.split(',')); //hashset faster than strings for comparisons parallel.foreach (inputkey in the input sequence => { int counter = 0; string[] temp = inputkey.split(','); for (int x=0; x<temp.length; x++) { if (h.contains(temp[x])) counter += 1; // other stuff such as adding //each sequence and count to a //concurrentdictionary for the output } Array.Clear(temp, 0, temp.length); temp = null; }); h.Clear(); h = null; });
- Moved by Mike Dos ZhangMicrosoft Contingent Staff, Moderator Wednesday, May 30, 2012 5:40 AM move to more appropriate forum (From:Visual C# General)
All Replies
-
Wednesday, May 30, 2012 4:11 AMIf your input list is not changing, you might consider a sorted list which is fast to search.
"Programming is an art form that fights back"
-
Wednesday, May 30, 2012 6:26 AM
Hi
i have some thing not understand following.
1. what's mean the number of matches?
2. what's mean number of times matched?
3. is there repeated value in any sequence like "2,2,3,b" ?
DON'T TRY SO HARD,THE BEST THINGS COME WHEN YOU LEAST EXPECT THEM TO.
- Edited by Matthew LIN Wednesday, May 30, 2012 6:27 AM
-
Wednesday, May 30, 2012 7:50 AM
HEy everyone!
I believe I found a solution which SIGNIFICANTLY reduces the amount of time!
For anyone who doesn't believe how much time allocation/deallocation can take, I wrote this little test program to see what I could do.
The first test does all the splitting and creating of the hashsets since by the time we get to this looping we already know what they are.
THe second test is based on our code, where every time through the loop the hashset and temp array are created to compare and sum up the instances.
On my little old laptop, when I set the const MAX to 30,000, it made a significant different in the time the two tests took to complete. Check it out! I'll have to try adjusting our code now to accommodate the pattern. If anyone has any other ideas that would be great!
Matthew Lin: to answer your question: each sequence will not have a repeated key.
const int MAX = 30000; string[] strTest = new string[MAX]; string[] intTest = new string[MAX]; HashSet<string>[] h1 = new HashSet<string>[MAX]; string[][] h2 = new string[MAX][]; int[][] id = new int[MAX][]; //global counter for each input to be summed up at end for (int x = 0; x < MAX; x++) { strTest[x] = "01,02,03,04,05,06,07"; //test master sequences intTest[x] = "01,02,03,08,05,06,07"; //test input sequences //pre create hashsets to all the master sequences h1[x] = new HashSet<string>(strTest[x].Split(',')); h2[x] = intTest[x].Split(','); //pre-split all the input sequences id[x] = new int[MAX]; } start = DateTime.Now; Console.WriteLine("Starting compare timing test: " + start); int a = 0; Parallel.For(0, h1.Length, x => { Parallel.For(0, intTest.Length, y => { int count = 0; for (int z = 0; z < h2[y].Length; z++) { count += h1[x].Contains(h2[y][z]) ? 1 : 0; } id[x][y] = count; }); }); //sum up everything for (int x=0; x<MAX; x++) for (int y=0; y<MAX; y++) a += id[x][y]; end = DateTime.Now; Console.WriteLine("a: " + a); Console.WriteLine("Finished at: " + end); Console.WriteLine("Time: " + (end - start)); Console.WriteLine(); //reset values to start again for (int x = 0; x < MAX; x++) { Array.Clear(id[x], 0, id[x].Length); h1[x].Clear(); Array.Clear(h2[x], 0, h2[x].Length); } start = DateTime.Now; Console.WriteLine("Starting compare timing test2: " + start); a = 0; Parallel.For(0, strTest.Length, x => { //splitting, allocation, and deallocation kept within loops HashSet<string> hs = new HashSet<string>(strTest[x].Split(',')); Parallel.For(0, intTest.Length, y => { int count = 0; string[] s = intTest[y].Split(','); //another allocation within the loop for (int z = 0; z < s.Length; z++) { count += hs.Contains(s[z]) ? 1 : 0; } Array.Clear(s, 0, s.Length); id[x][y] = count; }); hs.Clear(); //deallocation within loop }); //sum up everything again for (int x=0; x<MAX; x++) for (int y=0; y<MAX; y++) a += id[x][y]; end = DateTime.Now; Console.WriteLine("a: " + a); Console.WriteLine("Finished at: " + end); Console.WriteLine("Time: " + (end - start)); Console.WriteLine();
- Proposed As Answer by Stephen Toub - MSFTMicrosoft Employee, Owner Friday, August 24, 2012 6:07 AM
-
Wednesday, May 30, 2012 8:02 AM
Mike Zhang:
Why did you move this thread to the parallel forum as "more appropriate"?
It was posted in the C# General because I'm seeking the best solution and parallel was mentioned as one way of doing it. AND it's not restricted to just parallel computing options.
Now because you have moved it here only people who do parallel computer will look at the issue and respond. :P
If someone had suggested LINQ and thrown in linq code would you then move it to a LINQ forum??
Or if someone suggested inserting into a db and doing all sorts of joins on results would you then move it to an SQL forum??
Would love to know the reasoning behind why you moved it.
Thank you!
-
Wednesday, May 30, 2012 1:08 PMModerator
My first thinking after read your requirement is that you want the "fast" way to do this kind work, and the very first option in my mind is the Parallel Computing will help you to speed up it.
I will try to help you get the more appropriate solution for question(including find the more appropriate forum to let that tech aspect exports to help you). Though the programming language is C#, but the technical to solve the problem maybe a more special tech area, so I will try to use all the ways I can do to help you get the answer quickly.
If there's any concern, please feel free to let me know.
Best wishes,
Mike Zhang[MSFT]
MSDN Community Support | Feedback to us
-
Friday, June 01, 2012 12:13 AM
Your line h1[x].Contains(h2[y][z]) makes a call to string.GetHashCode() MAX times for each h2[y][z]. If you profile your sample code, you'll see it spends a significant portion of its CPU time generating those hash codes. If you could cache the hash codes themselves in, say, HashSet<int> objects for both your input and test data, you'd probably see a significant improvement, because the hash code of an int is the int itself.- Marked As Answer by FireMyst Friday, August 24, 2012 2:19 PM
-
Friday, August 24, 2012 2:20 PM
Your line h1[x].Contains(h2[y][z]) makes a call to string.GetHashCode() MAX times for each h2[y][z]. If you profile your sample code, you'll see it spends a significant portion of its CPU time generating those hash codes. If you could cache the hash codes themselves in, say, HashSet<int> objects for both your input and test data, you'd probably see a significant improvement, because the hash code of an int is the int itself.
Phil
That's exactly what I did for a very significant short term improvement. :-)

