Count occurrences of a substring

    Question

  • Hi All,

    I've searched for this subject all over the net, but couldn't find the exact answer I am looking for.
    I need to count occurrences of a substring within a string.
    This is because I have a text file with about 100,000 addresses, and each address has a code assigned, with multiple addresses per code.

    I need count-by-code functionality.

    Let's say I have 500 lines with code 1234 and 250 with 9876; I want to write these values to a text file. This text file needs to look something like:

    Code:               Occurrences:
    1234                500
    9876                250

    Writing to a text file won't be the problem, but at this point I have no idea how to count the occurrences of this substring throughout the file.

    Can someone point me in the correct direction? I don't need exact code-snippets, as I am still learning C# :)

    Thanks in advance,

    Steef
    Monday, July 28, 2008 9:10 AM

Answers

  • I would prefer this regular expression approach as it is less expensive than substring.

                // Requires: using System; using System.Text.RegularExpressions;
                string pattern = @"Sethi"; // use @"\bSethi\b" to match whole words only
                string input = "PradeepSethiPradeepSethi Sethi Pradeep sethi";
                Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
                MatchCollection matches = r.Matches(input);
                Console.WriteLine("Word: {0} - Count: {1}", pattern, matches.Count);
     


    Pradeep Sethi
    • Proposed as answer by Mr. Javaman Tuesday, July 29, 2008 10:11 PM
    • Marked as answer by Figo Fei Wednesday, July 30, 2008 3:36 AM
    Monday, July 28, 2008 10:07 AM
  • This is a fundamentally wrong approach.  Download and install the free SQL Server Express edition.  Create a database to store your addresses.  Write some code to parse the text file and stuff the data into the database.  Throw away your text file.  Now run simple SQL queries with the COUNT and GROUP BY keywords to get what you want.  You'll get your results in milliseconds rather than minutes, changing your queries will take seconds rather than hours.
    Hans Passant.
    • Proposed as answer by Mr. Javaman Tuesday, July 29, 2008 10:12 PM
    • Marked as answer by Figo Fei Wednesday, July 30, 2008 3:36 AM
    Monday, July 28, 2008 12:11 PM

All replies

  • There are a lot of answers to your question.

    Even if you use the methods of the string object (in C#, string is not a simple data type but a class), there will most likely be a loop involved.

    The fastest one will be: read the file as one big string, then use unsafe code (see: pointers and the unsafe keyword) to go over it.

    Another fast method would be to read your source file into a char array and then match against that.

    The easiest one (but the worst in terms of performance; you'll need to tweak this a lot) is to go over the file as one big string, taking a substring at each position and comparing it to your search string. It will look a bit like:

            string input = "This is abc a testabcstring, we'll count ABC the nuabcmber of times we find abc in thisabd text";
            string search = "abc";
            int length = search.Length;
            // Last index at which a full-length match can still start.
            int lastStart = input.Length - length;
            int result = 0;
            // <= so that a match at the very end of the text is not missed.
            for (int index = 0; index <= lastStart; index++)
            {
                string theSubString = input.Substring(index, length);
                // Case-insensitive comparison without creating lowercase copies.
                if (string.Equals(theSubString, search, StringComparison.OrdinalIgnoreCase))
                {
                    result++;
                }
            }
            Console.WriteLine(string.Format(
                    "We searched all over the text: {0} {1} for the text: {2} {1} and found it {3} times"
                    , input, Environment.NewLine, search, result));
            Console.ReadKey();

    The improbable we do, the impossible just takes a little longer. - Steven Parker
    Monday, July 28, 2008 9:32 AM
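
    As a possible middle ground between the substring loop and unsafe code, here is a minimal sketch that lets String.IndexOf do the scanning. The class and method names (SubstringCounter, CountOccurrences) are made up for this illustration, and the choice to count overlapping matches is an assumption made so that it behaves like the loop above.

```csharp
using System;

public static class SubstringCounter
{
    // Counts case-insensitive occurrences of search in text.
    // Overlapping matches are counted, because the scan resumes
    // one character after the start of each match.
    public static int CountOccurrences(string text, string search)
    {
        if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(search))
        {
            return 0;
        }

        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(search, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index++; // continue just past the start of this match
        }
        return count;
    }

    public static void Main()
    {
        string input = "This is abc a testabcstring, we'll count ABC the nuabcmber of times we find abc in thisabd text";
        Console.WriteLine(CountOccurrences(input, "abc"));
    }
}
```

    This avoids allocating a new string for every position, which is where most of the cost of the substring version goes.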
  • In this case you take "abc" as the string to search for. Will this also work when the string I am searching for differs throughout the file?
    I will try this option later today, as I am not in the office right now.

    Monday, July 28, 2008 9:49 AM
  • Check this thread to count the frequency of words. 
    Pradeep Sethi
    Monday, July 28, 2008 9:50 AM
  • You didn't specify the format of the strings, so I can't comment on how you split them up to find the substrings.

    However, on the subject of counting up the substrings once you have found them:

    I would use a Dictionary<string, int> to count them up:

    using System;
    using System.Collections.Generic;

    namespace Demo
    {
        public class Program
        {
            public static void Main()
            {
                Dictionary<string, int> counter = new Dictionary<string, int>();

                Count("One", counter);
                Count("Two", counter);
                Count("Three", counter);
                Count("One", counter);
                Count("Two", counter);
                Count("One", counter);

                Print(counter);
            }

            // Increments the tally for key, starting at 1 the first time it is seen.
            private static void Count(string key, Dictionary<string, int> counter)
            {
                int count;

                if (counter.TryGetValue(key, out count))
                {
                    counter[key] = count + 1;
                }
                else
                {
                    counter[key] = 1;
                }
            }

            // Writes one line per distinct key with its final count.
            private static void Print(Dictionary<string, int> counter)
            {
                foreach (KeyValuePair<string, int> item in counter)
                {
                    Console.WriteLine(item.Key + " occurred " + item.Value + " times.");
                }
            }
        }
    }

    Monday, July 28, 2008 10:07 AM
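
    To connect the dictionary idea to the original question, here is a rough sketch that assumes the code sits at a fixed position in every line. The names CodeCounter and CountCodes, the sample lines, and the start/size values are all invented for illustration; the real file layout is not given in this thread. In practice the lines could come from File.ReadLines or a StreamReader loop.

```csharp
using System;
using System.Collections.Generic;

public static class CodeCounter
{
    // Tallies the code found at a fixed position in each line.
    // start is 1-based and size is the field width (both assumptions).
    public static SortedDictionary<string, int> CountCodes(IEnumerable<string> lines, int start, int size)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            // Skip lines too short to contain the whole code field.
            if (line.Length < start - 1 + size)
            {
                continue;
            }

            string code = line.Substring(start - 1, size);
            int current;
            counts.TryGetValue(code, out current); // current is 0 when the code is new
            counts[code] = current + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Hypothetical sample data: a 4-character code starting at column 3.
        string[] lines = { "AA1234 Main Street", "BB9876 High Street", "CC1234 Park Lane" };
        foreach (KeyValuePair<string, int> entry in CountCodes(lines, 3, 4))
        {
            Console.WriteLine(entry.Key + " " + entry.Value);
        }
    }
}
```

    Because the search string is taken from each line rather than fixed in advance, this also covers the case where the code differs throughout the file.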
  • nobugz said:

    This is a fundamentally wrong approach.  Download and install the free SQL Server Express edition.  Create a database to store your addresses.  Write some code to parse the text file and stuff the data into the database.  Throw away your text file.  Now run simple SQL queries with the COUNT and GROUP BY keywords to get what you want.  You'll get your results in milliseconds rather than minutes, changing your queries will take seconds rather than hours.


    Hans Passant.



    Hey Hans,
    you're correct: if this is a project to create/manage contacts, then please use an intelligent server to store the data, and don't use text files for it.

    However, on the other hand, I must also say that we don't know whether he's using the text file as the actual storage. It could be produced and sent to him by a third-party application. Also, maybe the data isn't user-owned, or maybe there's no need to store the data after the calculation (e.g. processing daily results). Maybe he's just writing a 15 KB utility tool.


    The improbable we do, the impossible just takes a little longer. - Steven Parker
    Monday, July 28, 2008 12:45 PM
  • nobugz said:

    This is a fundamentally wrong approach.  Download and install the free SQL Server Express edition.  Create a database to store your addresses.  Write some code to parse the text file and stuff the data into the database.  Throw away your text file.  Now run simple SQL queries with the COUNT and GROUP BY keywords to get what you want.  You'll get your results in milliseconds rather than minutes, changing your queries will take seconds rather than hours.


    Hans Passant.


    Yes, if he can do that he should.

    However, without any other information, it's far from clear that he CAN do that.

    Monday, July 28, 2008 1:04 PM
  • Thanks for the posts so far. It's not about creating/managing contacts, but about text files with this information in them. I am a data processor and process dozens of text files with address details every day. Part of this is creating count-by-code reports for clients.

    So I don't think putting the data in a database and running queries on it is the best way.

    At this point I have the following:

    string line, source; 
     
                frmSources SrcFrm = new frmSources(); 
                OpenFileDialog ofd = new OpenFileDialog(); 
                ofd.Filter = "Text Files (*.txt, *.out, *.dat)|*.txt;*.out;*.dat|All Files (*.*)|*.*";
                ofd.InitialDirectory = "C:\\Tmp\\";
                StreamReader sr = null;
     
                if(SrcFrm.ShowDialog().Equals(DialogResult.OK) && 
                    ofd.ShowDialog().Equals(DialogResult.OK)) 
                { 
                    try 
                    { 
                        dsMaster.Clear(); // Clear out dsMaster. 
                        lbSources.Items.Clear(); // Clear out lbSources. 
     
                        start = SrcFrm.start; // Start position of source field 
                        size = SrcFrm.size; // Size of source field 
                        file = ofd.FileName; // Get the filename in a string 
                        sr = new StreamReader(file, System.Text.Encoding.Default, true); 
                        list = new SortedList(); 
     
                        while((line=sr.ReadLine())!=null) // Loop through the file 
                        { 
                            source = line.Substring(start-1, size); // Get the source 
                             
                            if(list.IndexOfKey(source) < 0)
                                list.Add(source, 1); // Source not found, add to SortedList. 
                            else 
                            { 
                                /* Cast the value of the existing source into an integer 
                                 * and increment its value with 1. */ 
                                int val = (int)list[source]; 
                                val++; 
                                list[source] = val; 
                            } 
                        } 
     
                        this.SuspendLayout(); 
                        for(int i=0;i<list.Count;i++) 
                        { 
                            lbSources.Items.Add(list.GetKey(i)+", "+ 
                                list[list.GetKey(i)]); 
                        } 
                        this.ResumeLayout(); 
                        btnNewSplit.Enabled = true;
                        lblStatus.Text = String.Empty; 
                    } 
                    catch(Exception ex) 
                    { 
                        MessageBox.Show("Error! "+ex.Message); 
                    } 
                    finally 
                    { 
                        if(sr!=null) 
                            sr.Close(); 
                    } 
                } 

    This partially works, but when I test it with a sample file of 10 lines containing 2 different codes (60/40), each key is added multiple times with a running count of occurrences (in the listbox for now).

    Like:

    code_a,1
    code_a,2
    code_a,3
    code_a,4
    code_b,1
    code_b,2
    code_b,3
    code_b,4
    code_b,5
    code_b,6

    I just want the highest value to remain.
    • Edited by dikkeeend Tuesday, July 29, 2008 9:47 AM typo
    Tuesday, July 29, 2008 9:16 AM
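
    For the report itself, here is a small sketch that writes only the final count per code, in the two-column layout from the original question. It assumes the counts have already been collected in a generic dictionary; the CountReport class, the Write method and the 20-character column width are illustrative choices, not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CountReport
{
    // Writes a header followed by one line per code with its final count.
    // The first column is left-aligned and padded to 20 characters.
    public static void Write(TextWriter writer, IDictionary<string, int> counts)
    {
        writer.WriteLine("{0,-20}{1}", "Code:", "Occurrences:");
        foreach (KeyValuePair<string, int> entry in counts)
        {
            writer.WriteLine("{0,-20}{1}", entry.Key, entry.Value);
        }
    }

    public static void Main()
    {
        // Sample totals taken from the question; a real run would fill these from the file.
        var counts = new SortedDictionary<string, int> { { "1234", 500 }, { "9876", 250 } };

        // Console.Out is used here; a StreamWriter on a file path works the same way.
        Write(Console.Out, counts);
    }
}
```

    Writing the output only after the read loop has finished means each code appears once, with its highest (final) value.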
  • I have tested your piece of code with my sample test data and it is working fine. I don't see anything wrong in the code. 
    Pradeep Sethi
    Tuesday, July 29, 2008 7:28 PM
  • Pradeep Sethi said:

    I have tested your piece of code with my sample test data and it is working fine. I don't see anything wrong in the code. 


    Pradeep Sethi

    So, you just get a few items in your listbox, and not like I describe above?

    This is making me go crazy!

    Wednesday, July 30, 2008 8:16 AM
  • yeah I just get unique code in the listbox with count. 
    Pradeep Sethi
    Wednesday, July 30, 2008 8:54 AM
  • I solved this problem in AWK (GAWK) very simply through its associative arrays:

    {
        line = $0                            # input line from the text file, like ReadLine
        # startpos and len must be set beforehand, e.g. with -v on the command line;
        # len is used rather than length, which is a built-in awk function
        code = substr(line, startpos, len)
        codefreq[code]++                     # counts code frequencies
    }
    # at END the whole file has been read
    END { for (c in codefreq) print c, codefreq[c] }

    This reports only the final count of each code.

    awk is one of the easiest, simplest tools for working with text files.

    gawk works on the command line and is fast, but I now need a GUI solution. I hoped C# would offer something similar, but perhaps I am missing something.

    Another problem with text files/tables is that they can contain control characters that disturb the ReadLine routine, e.g. a stray "lf" character.

    We can read the text into a byte buffer, separate records at cr/lf and deal with records broken across the end/beginning of a buffer, but it's a kludge!

    In GAWK we can define a record separator and read the file in an orderly way, record by record...

    C# designer guys, do something!

    dikkeeend, if you are OK with a proven command-line solution, tell me.

    Friday, November 26, 2010 4:49 PM
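
    On the record separator point, here is one possible C# sketch that splits only on an exact separator string, so a stray "lf" inside a record does not start a new record. RecordReader and ReadRecords are hypothetical names, and the sketch reads the whole input into memory, which is an assumption that should hold for files of the size discussed here but not for very large ones.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class RecordReader
{
    // Returns the records in the input, split only on the exact separator
    // (for example "\r\n"). A lone "\n" stays inside its record.
    public static IEnumerable<string> ReadRecords(TextReader reader, string separator)
    {
        string text = reader.ReadToEnd();
        string[] parts = text.Split(new[] { separator }, StringSplitOptions.None);

        for (int i = 0; i < parts.Length; i++)
        {
            // A trailing separator leaves one empty piece at the end; ignore it.
            if (i == parts.Length - 1 && parts[i].Length == 0)
            {
                continue;
            }
            yield return parts[i];
        }
    }

    public static void Main()
    {
        // The first record contains a stray line feed.
        var reader = new StringReader("1234 first\nline\r\n9876 second\r\n");
        foreach (string record in ReadRecords(reader, "\r\n"))
        {
            Console.WriteLine("[" + record.Replace("\n", "\\n") + "]");
        }
    }
}
```

    Each record can then be passed to the same counting logic that would otherwise receive the result of ReadLine.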
  • dikkeeend hasn't posted anything in over two years... I think he's not here any more.
    Friday, November 26, 2010 5:03 PM
  • string pattern = @"Sethi"; // use @"\bSethi\b" to match whole words only
                string input = "PradeepSethiPradeepSethi Sethi Pradeep sethi";
                Regex r = new Regex(pattern, RegexOptions.IgnoreCase);
                MatchCollection matches = r.Matches(input);
                Console.WriteLine("Word: {0} - Count: {1}", pattern, matches.Count);

    what if the string input is "PradeepSethiPradeepSethi Sethi Pradeep sethi1"?

    and I want the count to be 3 only( sethi1 will not be counted )


    beginner at everything
    Monday, January 17, 2011 8:18 AM
  • what if the string input is "PradeepSethiPradeepSethi Sethi Pradeep sethi1"?

    and I want the count to be 3 only( sethi1 will not be counted )


    That's what the RegexOptions.IgnoreCase is there for.
    Monday, January 17, 2011 11:06 AM
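
    A caveat on the reply above: RegexOptions.IgnoreCase only affects letter case, so with the plain pattern "Sethi" the text "sethi1" is still matched and the count for that input is 4. One way to get 3, assuming the intent is to skip any match that is immediately followed by a digit, is a negative lookahead. The RegexCountDemo class and its Count helper are names invented for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCountDemo
{
    // Counts case-insensitive matches of pattern in input.
    public static int Count(string input, string pattern)
    {
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        string input = "PradeepSethiPradeepSethi Sethi Pradeep sethi1";

        // Plain pattern: also matches the "sethi" in "sethi1".
        Console.WriteLine(Count(input, "Sethi"));

        // (?!\d) rejects a match that is directly followed by a digit.
        Console.WriteLine(Count(input, @"Sethi(?!\d)"));
    }
}
```

    If other trailing characters should also disqualify a match, the lookahead would need to be widened accordingly.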