Fastest way to search for a string in a list of strings, both single- and multi-threaded

  • Question

  • I am looking for the best way (performance-wise) to search for a string in a list.  (I could be searching for hundreds of thousands of strings, or perhaps even millions.)

    I would like to find the best single thread way and the best multi-thread way to do this search.

    I tested three different single-threaded approaches:

    1) A predicate

    2) Foreach

    3) For i

    The code is:

     

    	public class StringList : List<string>
    	{
    		public StringList()
    			: base()
    		{
    		}
    
    
    
    		public string FindStringPredicate(string s)
    		{
    			//string s2 = List.Find(o => o == s);
    
    			return (this.Find( o => o == s));
    		}
    
    
    
    		public string FindStringFori(string s)
    		{
    			int count = this.Count;
    			for(int i = 0; i < count; i++)
    			{
    				if(this[i] == s)
    					return (this[i]);
    			}
    			return (null);
    		}
    
    		public string FindStringForEach(string s)
    		{
    
    			foreach(string s2 in this)
    			{
    				if(s2 == s)
    					return (s2);
    			}
    			return (null);
    		}
    	}
    

    I created one million words and did 100,000 searches and my results came back as:

    Predicate: 2 minutes 16 seconds

    Foreach:   1 minute 48 seconds

    For i:        1 minute 30 seconds.

     

    Are there any other ways that could possibly be faster than the for i in a single thread?

    What about a multi-threaded solution?  Any idea what would be the best way to do that?

    Thursday, October 14, 2010 4:52 PM

Answers

  • Hi guys, does that mean that Tom's sequential search was really, really fast?

    I mean, did it report the right time for a sequential search for the last element, or do you think it just returned the last item without a search?

     


    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)

    No.  Look closely - his timing is just the console output, not the search...

     

    I put together a program to test all of these.  I'll post it afterwards, but first, the results:

    Starting run 1
    LINQ-Where: Found 6efb61b5f7724ddca93d9aa32e114cea in 17 ms
    LINQ-First: Found 6efb61b5f7724ddca93d9aa32e114cea in 21 ms
    PLINQ   : Found 6efb61b5f7724ddca93d9aa32e114cea in 19 ms
    foreach  : Found 6efb61b5f7724ddca93d9aa32e114cea in 13 ms
    for    : Found 6efb61b5f7724ddca93d9aa32e114cea in 11 ms
    List.Find : Found 6efb61b5f7724ddca93d9aa32e114cea in 11 ms
    Binary Src: Found 6efb61b5f7724ddca93d9aa32e114cea in 2724 ms
    
    Starting run 2
    LINQ-Where: Found 6efb61b5f7724ddca93d9aa32e114cea in 17 ms
    LINQ-First: Found 6efb61b5f7724ddca93d9aa32e114cea in 21 ms
    PLINQ   : Found 6efb61b5f7724ddca93d9aa32e114cea in 6 ms
    foreach  : Found 6efb61b5f7724ddca93d9aa32e114cea in 13 ms
    for    : Found 6efb61b5f7724ddca93d9aa32e114cea in 11 ms
    List.Find : Found 6efb61b5f7724ddca93d9aa32e114cea in 13 ms
    Binary Src: Found 6efb61b5f7724ddca93d9aa32e114cea in 2770 ms
    
    Starting run 3
    LINQ-Where: Found 6efb61b5f7724ddca93d9aa32e114cea in 17 ms
    LINQ-First: Found 6efb61b5f7724ddca93d9aa32e114cea in 21 ms
    PLINQ   : Found 6efb61b5f7724ddca93d9aa32e114cea in 6 ms
    foreach  : Found 6efb61b5f7724ddca93d9aa32e114cea in 13 ms
    for    : Found 6efb61b5f7724ddca93d9aa32e114cea in 11 ms
    List.Find : Found 6efb61b5f7724ddca93d9aa32e114cea in 11 ms
    Binary Src: Found 6efb61b5f7724ddca93d9aa32e114cea in 2757 ms

    As you can see, PLINQ is by far the fastest option (on my machine).  I had to up the search to 1,000,000 elements in order to get reasonable timings.  On the first run, of course, the timings are somewhat skewed by JIT overhead, but after that, it really pares down quickly.

    Of course, this is also somewhat unfair in that I'm doing the sort each iteration.  If you pull the sort out, the binary search is very fast - but it does require a sorted list in order to work.  The sort far overshadows the speed-up unless you can pre-sort the list and not re-add elements.  Also, as a comparison, I tested with a HashSet - with 1,000,000 elements, it was running in under 0.5 ms; I had to go to 10 million elements and measure in ticks to see it.  There, a HashSet search (Contains) performs at about 100,000x the speed of any of the other options ;) 
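
    For reference, the HashSet test described above boils down to something like the following (a minimal sketch of the idea, not the exact benchmark code that was run):

    ```csharp
    using System;
    using System.Collections.Generic;

    class HashSetSketch
    {
        static void Main()
        {
            // Build the set once up front; HashSet<T>.Contains is then
            // O(1) on average, versus O(n) for a linear scan of a List<string>.
            var words = new HashSet<string>();
            string last = null;
            for (int i = 0; i < 1000000; i++)
            {
                last = Guid.NewGuid().ToString("N");
                words.Add(last);
            }

            Console.WriteLine(words.Contains(last));             // prints True
            Console.WriteLine(words.Contains("not-in-the-set")); // prints False
        }
    }
    ```

    The catch is the same as with sorting for a binary search: the set has to be built up front, so it only pays off when the collection is reused across many lookups.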

    For anybody who wants to test, here's the full code - just make sure to run it in an optimized (release) build outside of VS.

     

    namespace ConsoleApplication1
    {
      using System;
      using System.Collections.Generic;
      using System.Diagnostics;
      using System.Linq;
    
      class Program
      {
        public static string last;
    
        static void Main(string[] args)
        {
          List<string> stringsOriginal = InitList();
    
          for (int iteration = 0; iteration < 3; ++iteration)
          {
            Console.WriteLine("Starting run {0}", iteration + 1);
    
            List<string> strings = new List<string>(stringsOriginal);
            Stopwatch stopwatch = new Stopwatch();
    
            stopwatch.Start();
            var value = (from s in strings where s == last select s).First();
            stopwatch.Stop();
    
            Console.WriteLine("LINQ-Where: Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
    
            stopwatch.Reset();
    
            stopwatch.Start();
            value = strings.First(s => s == last);
            stopwatch.Stop();
    
            Console.WriteLine("LINQ-First: Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
    
            stopwatch.Reset();
    
            strings = new List<string>(stringsOriginal);
            stopwatch.Start();
            value = strings.AsParallel().First(s => s == last);
            stopwatch.Stop();
    
            Console.WriteLine("PLINQ   : Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
    
            stopwatch.Reset();
    
            strings = new List<string>(stringsOriginal);
            stopwatch.Start();
            value = null;
            foreach (var s in strings)
            {
              if (s == last)
              {
                value = s;
                break;
              }
            }
            stopwatch.Stop();
    
            Console.WriteLine("foreach  : Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
    
            strings = new List<string>(stringsOriginal);
            stopwatch.Reset();
    
            stopwatch.Start();
            value = null;
            for (int i = 0; i < strings.Count; ++i)
            {
              if (strings[i] == last)
              {
                value = strings[i];
                break;
              }
            }
            stopwatch.Stop();
    
            Console.WriteLine("for    : Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
    
    
            stopwatch.Reset();
    
            strings = new List<string>(stringsOriginal);
            stopwatch.Start();
            value = strings.Find(s => s == last);
            stopwatch.Stop();
    
            Console.WriteLine("List.Find : Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
    
            stopwatch.Reset();
    
            strings = new List<string>(stringsOriginal);
            stopwatch.Start();
            strings.Sort();
            value = strings[strings.BinarySearch(last)];
            stopwatch.Stop();
    
            Console.WriteLine("Binary Src: Found {0} in {1} ms", value, stopwatch.ElapsedMilliseconds);
          }
        }
    
        static List<string> InitList()
        {
          List<string> strings = new List<string>(1000000);
    
          string value = string.Empty;
          for (int i = 0; i < 1000000; i++)
          {
            value = Guid.NewGuid().ToString("N");
            strings.Add(value);
          }
    
          last = value;
          return strings;
        }
      }
    }
    


    Reed Copsey, Jr. - http://reedcopsey.com
    • Marked as answer by Figo Fei Monday, October 25, 2010 2:43 AM
    Thursday, October 14, 2010 11:00 PM
    Moderator
  • Actually, the timing was not console output - it was the actual execution of the query.  The query does not execute until you iterate it - deferred execution....  You will see the same setup in the MSDN performance tests.  I just set up my query wrong, which was unfortunate.

    I also purposely used the Where to simulate not knowing where in the collection the string was.  I wanted to set it up so I would be timing the worst case - which, in a sequential search, would be the last item in the list.  I suppose Last would do that, but I wanted it to be more like a real search....

    In fact, in this case, this was not a particularly good example.  On my system, a simple for loop is faster than using the query - I suspect that there is not enough work going on to make up for the added overhead of the parallelism:

    stopwatch.Start ();
    for ( int i = 0; i < strings.Count; i++ )
    {
       if ( strings[i] == last )
       {
         Console.WriteLine ( strings[i] );
         break;
       }
    }
    stopwatch.Stop ();
    Console.WriteLine ( stopwatch.ElapsedMilliseconds );
    

    takes about 3-4 ms.  The corrected query takes about 17 or 18 ms.  The funny thing is my original version ends up at about 5-6 ms - probably because it's essentially running sequentially, and the timing is in the same range as a simple foreach:

    stopwatch.Start ();
    foreach ( string s in strings )
    {
      if ( s == last )
      {
        Console.WriteLine ( s );
        break;
      }
    }
    stopwatch.Stop ();
    Console.WriteLine ( stopwatch.ElapsedMilliseconds );
    
    Doing a List.Find is in the 3-4 ms range, like a for loop.

    I'm not convinced now that this is a good candidate for a multi-threaded operation, at least not at the 100,000 item range.

     


    Tom Shelton
    • Marked as answer by Figo Fei Monday, October 25, 2010 2:44 AM
    Friday, October 15, 2010 8:03 AM

  • Hey folks,

    First of all, thanks for sharing your code and results, Reed. It's interesting to see the comparison results of the different methods.

     

    I prefer to use multi-threading to do two things at the same time rather than using threading to do the same thing faster. The reason is that multi-threading rarely gives that much of a performance boost, due to all the overhead involved, and it usually just produces complex code. The complexity of the code doesn't justify the slight performance gain for me. I'm not saying it's a golden rule of mine, a silver bullet, but for the most part one CPU should be enough for a single task.

    Take this problem, for example; there are a couple of approaches. Halve the list and search each half separately with two threads? OK, but the time spent halving the list and then rejoining the results when finished adds processing that basically cancels out the performance gains. No point - it complicates the code for no real benefit. OK, how about two threads searching the same list? That should be fine - each thread only reads from the list, so no locks are needed - but what happens to the results, the matches? If they go in another list, then access to that result list needs to be synchronised, one thread at a time. There's the bottleneck that cancels out any benefit: complex code for no real gain.
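
    The "halve the list" approach described above can be sketched roughly like this (illustration only, not a recommendation - the lock around the shared result list is exactly the kind of synchronisation that cancels out the gains):

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;

    class SplitSearchSketch
    {
        // Rough sketch of searching each half of a list on its own thread.
        // Both tasks only read from 'list', so no lock is needed there, but
        // the shared 'results' list must be synchronised - the bottleneck.
        public static List<string> SearchInHalves(List<string> list, string target)
        {
            var results = new List<string>();
            object gate = new object();
            int mid = list.Count / 2;

            Action<int, int> searchRange = (start, end) =>
            {
                for (int i = start; i < end; i++)
                {
                    if (list[i] == target)
                    {
                        lock (gate) { results.Add(list[i]); }
                    }
                }
            };

            Task first = Task.Factory.StartNew(() => searchRange(0, mid));
            Task second = Task.Factory.StartNew(() => searchRange(mid, list.Count));
            Task.WaitAll(first, second);
            return results;
        }

        static void Main()
        {
            var list = new List<string> { "a", "b", "c", "b" };
            Console.WriteLine(SearchInHalves(list, "b").Count); // prints 2
        }
    }
    ```

    On a small list the cost of starting the two tasks and taking the lock outweighs anything saved, which is exactly the point being made.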

     

    They say the software industry needs to change to use multi-core CPUs, and they are right - developers need to develop software differently. But for me, the change isn't doing one thing faster but doing more things at the same time.

     

    There isn't enough information about the actual task at hand to really recommend an approach. PLINQ does look the fastest, but apart from the search, what else needs to happen in the application? Maybe the thing that follows the PLINQ query rules out the PLINQ approach. Who knows. For me, I'd stick with sequential searching until it proves to be a bottleneck. I'm more interested in how the strings get into the program; reading or generating millions of strings is, for me, a more likely candidate for a bottleneck.

    :)

    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)
    • Proposed as answer by Figo Fei Thursday, October 21, 2010 9:26 AM
    • Marked as answer by Figo Fei Monday, October 25, 2010 2:44 AM
    Friday, October 15, 2010 10:08 AM
  • Hi, no problem

    How much memory do you have on that bad boy?

    Post the times you get with the 160,000 searches - that seems like a lot... what's your software doing?

     

            private static long TimeGUIDSearch_SequentialContains(string[] guids)
            {
                int count = 0;
                string lastEntry = guids.Last();

                Stopwatch watch = new Stopwatch();
                watch.Restart();

                if (guids.Contains(lastEntry))
                    count = 1;

                watch.Stop();
                Console.WriteLine("Sequential : Contains : GUIDs found {0} : {1} ms", count, watch.ElapsedMilliseconds);

                return watch.ElapsedMilliseconds;
            }

     

    Virtually the same for parallel

     

            private static long TimeGUIDSearch_ParallelContains(string[] guids)
            {
                int count = 0;
                string lastEntry = guids.Last();

                Stopwatch watch = new Stopwatch();
                watch.Restart();

                if (guids.AsParallel().Contains(lastEntry))
                    count = 1;

                watch.Stop();
                Console.WriteLine("Parallel : Contains : GUIDs found {0} : {1} ms", count, watch.ElapsedMilliseconds);

                return watch.ElapsedMilliseconds;
            }

     


    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)
    • Marked as answer by Figo Fei Monday, October 25, 2010 2:43 AM
    Wednesday, October 20, 2010 5:20 PM

All replies

  • .NET 4, using PLINQ:

     

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Diagnostics;
    
    namespace ConsoleApplication2
    {
      class Program
      {
        public static string last;
    
        static void Main ( string[] args )
        {
    
          List<string> strings = InitList ();
    
          Stopwatch stopwatch = new Stopwatch ();
          var value = ( from s in strings where s == last select s ).AsParallel ();
    
          stopwatch.Start ();
          foreach ( var v in value )
            Console.WriteLine ( v );
          stopwatch.Stop ();
          Console.WriteLine ( stopwatch.Elapsed );
        }
    
        static List<string> InitList ()
        {
          List<string> strings = new List<string> ( 100000 );
    
          string value = string.Empty;
          for ( int i = 0; i < 100000; i++ )
          {
            value = Guid.NewGuid ().ToString ( "N" );
            strings.Add ( value );
          }
    
          last = value;
          return strings;
        }
      }
    }
    
    

     

    100,000 unique strings, doing a worst-case search (basically, searching for the last string in the list):

    Time: 00:00:00.0058429

    That's on a 3-year-old dual-core processor.


    Tom Shelton
    Thursday, October 14, 2010 5:39 PM
  • Tom,

    You've got your PLINQ query set up incorrectly.  This:

     

    var value = ( from s in strings where s == last select s ).AsParallel ();
    

    Should really be:

     

    var value = from s in strings.AsParallel() 
         where s == last
         select s;
    
    // Or:
    
    // var value = strings.AsParallel().Where(s => s == last);
    

     

    The way you wrote it, you're doing your entire query in serial.  It's effectively doing:

     

     

    var value = strings.Where(s => s == last).AsParallel();
    
    

     

    This causes strings to be enumerated sequentially using the Enumerable.Where method, then the results are converted to a ParallelQuery, which is effectively just being enumerated serially again.

     

     


    Reed Copsey, Jr. - http://reedcopsey.com
    Thursday, October 14, 2010 8:25 PM
    Moderator
  • Predicate: 2 minutes 16 seconds

    Foreach:   1 minute 48 seconds

    For i:        1 minute 30 seconds.

     

    Are there any other ways that could possibly be faster than the for i in a single thread?

    What about a multi-threaded solution?  Any idea what would be the best way to do that?

    I suspect your timings are off here.  I would recommend making sure that you run this in release mode, outside of the Visual Studio test host.  You'll probably find your timings change - they should be much faster.

     

    That being said, I would use PLINQ personally to multithread this.  My version would be:

     

    public string FindStringPLinq(string s)
    {
      return this.AsParallel().FirstOrDefault(str => s==str);
    }
    


    Reed Copsey, Jr. - http://reedcopsey.com
    Thursday, October 14, 2010 8:44 PM
    Moderator
  • LOL... you're right, duh!  Sorry :)

    Tom Shelton
    Thursday, October 14, 2010 9:36 PM
  • Hey man, how are the strings listed? Or how are they loaded into the program?

    Say they are in a database: then a SQL query would be faster than reading them all into memory. If they are in a file, then reading line by line sequentially will reduce the need for a lot of memory; or perhaps one big read into a StringBuilder followed by a compiled regular expression might be fast. Maybe the text changes a lot but the search pattern doesn't, or vice versa, or maybe both change all the time. If you have a large list of strings, reading them in - rather than searching - will take the longest time. Is the search on a per-line basis? It looks like it, so... lots of things could improve performance. Perhaps map/reduce-style functional programming techniques could be applied; I read that F# was capable of processing large data sets in short times precisely because functional techniques were applied - basically what LINQ is in C#.

    Is searching the biggest performance bottleneck?

     


    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)
    Thursday, October 14, 2010 9:54 PM
  • Are there any other ways that could possibly be faster than the for i in a single thread?

    list.Sort();

    list.BinarySearch(s)

    6 seconds to sort a million random GUIDs. . .

    0.5 seconds to run 100,000 searches

     


    gimme some slamming techno!!!!
    Thursday, October 14, 2010 9:58 PM
  • Hi guys, does that mean that Tom's sequential search was really, really fast?

    I mean, did it report the right time for a sequential search for the last element, or do you think it just returned the last item without a search?

     


    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)
    Thursday, October 14, 2010 9:58 PM
  • using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    
    namespace TestListSearch
    {
     public class Program 
     {
      private static List<string> CreateStringList()
      {
       var result = new List<string>();
       for (int i = 0; i < 1000000; i++)
       {
        result.Add(Guid.NewGuid().ToString());
       }
       return result;
      }
    
      static void Main(string[] args)
      {
       Console.WriteLine("Building list");
       var list = CreateStringList();
       var findThese = new List<string>();
       Random rand = new Random();
       Console.WriteLine("Building Search Criteria");
       for (int i = 0; i < 99000; i++)
       {
        findThese.Add(list[rand.Next(0, list.Count)]);
       }
    
       // Add 1% not found
       for(int i = 0; i < 1000; i++)
       {
        findThese.Add(Guid.NewGuid().ToString());
       }
       DateTime time = DateTime.Now;
       Console.WriteLine("Preparing for Binary Search (sorting the list)");
       list.Sort();
       TimeSpan timeToSort = DateTime.Now.Subtract(time);
       Console.WriteLine("Binary Search prepared: {0}", timeToSort.ToString(@"mm\mss\.ff\s"));
       Console.WriteLine("Running Binary Search ");
       findThese.ForEach(delegate(string s)
       {
        list.BinarySearch(s);
       });
       Console.WriteLine("BinarySearch (including time to Sort) :{0}", DateTime.Now.Subtract(time).ToString(@"mm\mss\.ff\s"));
       Console.WriteLine();
       Console.WriteLine("[Press any key to exit]");
       Console.ReadKey();
      }
     }
    
    }
    
    

    gimme some slamming techno!!!!
    Thursday, October 14, 2010 10:33 PM
  • As you can see, PLINQ is by far the fastest option (on my machine) . . . The sort far overshadows the speed up unless you can pre-sort this and not re-add elements. 

    Well... let's restate this.

    if the list is new every time, ie:

    list = Get a million Items
    item = find in list
    

    then parallel is faster

    if the list is fairly static

    list = Get a million Items
    findThese = Get a thousand Items
    found = list where contains findThese
    

    then a binary search is orders of magnitude faster
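    A rough sketch of that static-list approach (the sample data is illustrative): pay the sort cost once up front, then each lookup is O(log n) via List&lt;T&gt;.BinarySearch instead of the O(n) linear scan.

    ```csharp
    using System;
    using System.Collections.Generic;

    class BinarySearchSketch
    {
        static void Main()
        {
            var list = new List<string> { "pear", "apple", "mango", "banana" };

            // One-off cost: BinarySearch requires a sorted list, and the
            // same comparer must be used for the sort and the search.
            list.Sort(StringComparer.Ordinal);

            // Each lookup is then O(log n); a negative return means not found.
            Console.WriteLine(list.BinarySearch("mango", StringComparer.Ordinal) >= 0); // True
            Console.WriteLine(list.BinarySearch("kiwi", StringComparer.Ordinal) >= 0);  // False
        }
    }
    ```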


    gimme some slamming techno!!!!
    Friday, October 15, 2010 10:36 AM
  • if the list is fairly static

    list = Get a million Items
    findThese = Get a thousand Items
    found = list where contains findThese
    

    then a binary search is orders of magnitude faster


    gimme some slamming techno!!!!

    Ahh, but if it's just a membership check against the list, a HashSet&lt;T&gt; is faster than a binary search...  Also, a HashSet lets you change the contents of the collection without a re-sort.  (With List&lt;T&gt; and BinarySearch, any addition to or removal from the list, or a change to any element, requires a re-sort.)
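    A quick sketch of that point (the sample data is mine, not from the thread): HashSet&lt;T&gt; gives O(1) average-case lookups and stays valid across mutations.

    ```csharp
    using System;
    using System.Collections.Generic;

    class HashSetLookupSketch
    {
        static void Main()
        {
            var set = new HashSet<string> { "alpha", "beta", "gamma" };

            // O(1) average-case membership test; no sort required.
            Console.WriteLine(set.Contains("beta"));   // True

            // Mutating the set invalidates nothing, unlike the sorted
            // List<T> that BinarySearch depends on.
            set.Add("delta");
            set.Remove("alpha");
            Console.WriteLine(set.Contains("delta"));  // True
            Console.WriteLine(set.Contains("alpha"));  // False
        }
    }
    ```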

     


    Reed Copsey, Jr. - http://reedcopsey.com
    Friday, October 15, 2010 3:26 PM
    Moderator
    Hi, thanks for all the interesting responses.  I was in fact running it in debug.  When I switched to release I got new results:

     

                    DEBUG in VS            RELEASE in VS          RELEASE not in VS
    -------------------------------------------------------------------------------
    Predicate:      2 minutes 16 seconds   1 minute 14 seconds    58 seconds
    Foreach:        1 minute 48 seconds    1 minute 37 seconds    1 minute 16 seconds
    For i:          1 minute 30 seconds    1 minute 13 seconds    48 seconds
    PLINQ:          -                      7 minutes 6 seconds    6 minutes 26 seconds

    I guess I underestimated how well release mode could optimize code.  I am surprised the Predicate improved that much in release under VS, running at the speed of the for i.  Then I am surprised again that the for i beats the Predicate in release mode outside VS.

    In my test app I create all the words before searching.

    In my real app I will be generating words on the fly, and I want to make sure there are no duplicates. So when I make up a new word, I want to search the list of words: if the word is found in the list, don't add it; if it isn't found, add it.
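    For that generate-and-deduplicate loop, a HashSet&lt;string&gt; sidesteps the linear search entirely: Add returns false when the word is already present, so the membership test and the insert are a single O(1) step. A sketch with faked word generation (the sample words are mine):

    ```csharp
    using System;
    using System.Collections.Generic;

    class DedupSketch
    {
        static void Main()
        {
            var seen = new HashSet<string>();
            string[] generated = { "cat", "dog", "cat", "bird", "dog" };

            var unique = new List<string>();
            foreach (string word in generated)
            {
                // Add returns false if the word is already in the set,
                // so no separate search of the word list is needed.
                if (seen.Add(word))
                    unique.Add(word);
            }

            Console.WriteLine(string.Join(",", unique)); // cat,dog,bird
        }
    }
    ```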

     

    Why is the PLINQ statement running so slowly?

    I am using a statement that is very similar to:

     

    value = strings.AsParallel().First(s => s == last);
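    Two things are plausibly hurting that query (my reading, not something measured in this thread): First is an order-sensitive operator that early PLINQ handled conservatively, and each of the 160,000 queries also pays per-query setup cost. A search that doesn't care which match wins, such as Any or Contains, lets workers stop at the first hit. A sketch of that alternative:

    ```csharp
    using System;
    using System.Linq;

    class PlinqSearchSketch
    {
        static void Main()
        {
            // Stand-in for the thread's million generated words.
            string[] strings = Enumerable.Range(0, 1000000)
                                         .Select(i => "w" + i)
                                         .ToArray();
            string last = strings[strings.Length - 1];

            // Any() doesn't care which match is found first, so PLINQ
            // can stop as soon as any worker finds the value.
            bool found = strings.AsParallel().Any(s => s == last);
            Console.WriteLine(found); // True
        }
    }
    ```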

     

     

    • Edited by Lavagin Wednesday, October 20, 2010 4:46 PM aligned text better
    Tuesday, October 19, 2010 6:36 PM
  • Hi Lavagin,

    Been doing some more benchmarking as well. Your times seem very large; are you running these on a Spectrum? :) How many test cases are you using?

    I've found that the Contains() method of an unsorted array of GUIDs is showing very fast times compared to other options.

    Here are my results; the machine is a pile of poo with less memory than a wristwatch and a CPU built by Fred Flintstone.

     

    The For and Count methods look for and count all GUIDs that begin with the letter 'a'; the Contains method looks for the last GUID in the array.

     

    Parallel : 1000000 GUIDs created in 1363 ms

    Parallel : For : GUIDs found 62086 : 128 ms
    Parallel : Count : GUIDs found 62086 : 137 ms
    Parallel : Contains : GUIDs found 1 : 39 ms

    Sequential : For : GUIDs found 62086 : 207 ms
    Sequential : Count : GUIDs found 62086 : 220 ms
    Sequential : Contains : GUIDs found 1 : 25 ms



    Parallel : 2000000 GUIDs created in 2916 ms

    Parallel : For : GUIDs found 124207 : 229 ms
    Parallel : Count : GUIDs found 124207 : 234 ms
    Parallel : Contains : GUIDs found 1 : 47 ms

    Sequential : For : GUIDs found 124207 : 386 ms
    Sequential : Count : GUIDs found 124207 : 427 ms
    Sequential : Contains : GUIDs found 1 : 49 ms



    Parallel : 3000000 GUIDs created in 4085 ms

    Parallel : For : GUIDs found 187538 : 348 ms
    Parallel : Count : GUIDs found 187538 : 350 ms
    Parallel : Contains : GUIDs found 1 : 70 ms

    Sequential : For : GUIDs found 187538 : 575 ms
    Sequential : Count : GUIDs found 187538 : 643 ms
    Sequential : Contains : GUIDs found 1 : 74 ms

     

    This is quite a good learning exercise; I hadn't looked into the new .NET 4 parallel processing stuff. Going to see what else is on offer.

    How are you getting times of minutes?
    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)
    Wednesday, October 20, 2010 9:43 AM
    In my test sample I have a million strings and I do 160,000 searches.  The million strings and the 160,000 searches are exactly the same each time I run my test.

    The CPU I have been testing on is a Core i7 860 at 2.8 GHz, which has 4 cores and, with hyper-threading, can run 8 threads at once.

     

    Derek, could I see your code for your sequential and parallel Contains sections?

    Wednesday, October 20, 2010 5:04 PM
  • Hi, no problem

    How much memory do you have on that bad boy?

    Post the times you get with the 160,000 searches, which seems a lot... what's your software doing?

     

            private static long TimeGUIDSearch_SequentialContains(string[] guids)
            {
                int count = 0;
                string lastEntry = guids.Last();

                Stopwatch watch = new Stopwatch();
                watch.Restart();

                if (guids.Contains(lastEntry))
                    count = 1;

                watch.Stop();
                Console.WriteLine("Sequential : Contains : GUIDs found {0} : {1} ms", count, watch.ElapsedMilliseconds);

                return watch.ElapsedMilliseconds;
            }

     

    Virtually the same for parallel

     

            private static long TimeGUIDSearch_ParallelContains(string[] guids)
            {
                int count = 0;
                string lastEntry = guids.Last();

                Stopwatch watch = new Stopwatch();
                watch.Restart();

                if (guids.AsParallel().Contains(lastEntry))
                    count = 1;

                watch.Stop();
                Console.WriteLine("Parallel : Contains : GUIDs found {0} : {1} ms", count, watch.ElapsedMilliseconds);

                return watch.ElapsedMilliseconds;
            }

     


    …we each have more potential than we might ever presume to guess. (Blog: http://dsmyth.blogspot.com/)
    • Marked as answer by Figo Fei Monday, October 25, 2010 2:43 AM
    Wednesday, October 20, 2010 5:20 PM