locked
How to find 10 most common value (words) from a column table? RRS feed

  • Question

  • User-352524747 posted

    How to find 10 most common value (words) from a column table? What sql query to use?

    Thursday, June 29, 2017 2:44 PM

Answers

  • User-352524747 posted

    I get values from a MatchCollection and this doesn't work.

    I solved it using this query, to get all rows as one single row:

    db.QuerySingle(@"SELECT STUFF((SELECT TOP 100 ',' + Cities FROM EuMap FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1,'') AS One")

    And this function:

    public static List<string> CommonTags(string mct)
        {
            var hashtags = new List<string>();
            MatchCollection matchtgs = Regex.Matches(mct, @"\B#\w\w+", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            matchtgs.Cast<Match>().Select(m => m.Value.ToLowerInvariant()).GroupBy(s => s).OrderByDescending(g => g.Count()).Select(g => g.Key).Distinct().ToList().ForEach(delegate (string s)
            {
                hashtags.Add(s);
            }
            );
            return hashtags;
        }

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Monday, July 3, 2017 1:40 PM

All replies

  • User281315223 posted

    Are their individual words in the specific columns (i.e. single words and not paragraphs)? If so, you could simply group on the content of the specific column and order the size of the groupings:

      SELECT TOP 10 
             YourColumn,
             COUNT(YourColumn) AS Appearances
        FROM YourTable
    GROUP BY YourColumn
    ORDER BY Appearances DESC


    Thursday, June 29, 2017 4:30 PM
  • User-352524747 posted

    Are their individual words in the specific columns

    No, but they start with #. e.g. #eggs and are part of text paragraphs.

    Thursday, June 29, 2017 5:07 PM
  • User281315223 posted

    No, but they start with #. e.g. #eggs and are part of text paragraphs.

    This will complicate things a bit as you'll have to manually split the string up using a given delimiter (likely a space). SQL doesn't necessarily support this type of behavior out of the box (as easy as you might in C#), but the link that jkjhse provided will define a function that should handle that for you.

    You might consider if this needs to be done entirely in SQL, as it may be much easier to do this within your C# code (via a simple Split() call and a little bit of LINQ). 

    Friday, June 30, 2017 1:43 PM
  • User-352524747 posted

    I may consider something like this:

    Regex.Split("Hello World This is a great world, This World is simply great".ToLower(), @"\W+")
        .Where(s => s.Length > 3)
        .GroupBy(s => s)
        .OrderByDescending(g => g.Count())

    it may be much easier to do this within your C# code

    I think so, but what if i consider 100 rows or more? Any e.g.?

    Friday, June 30, 2017 4:10 PM
  • User281315223 posted

    I think so, but what if i consider 100 rows or more? Any e.g.?

    It shouldn't be any issue.

    Now if you start getting into thousands and thousands of rows, then you might start to run into some issues, but it really just depends on your scenario. Additionally, you could use a Take(10) on your current query to only grab the top 10 as well.

    Friday, June 30, 2017 9:32 PM
  • User-352524747 posted

    Just one last thing for the following code i'm trying to achieve it with Linq but there's an error i get

    System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Linq.IGrouping`2[System.Char,System.Char],System.Char]
    var query = db.Query("SELECT TOP 100 Cities FROM EuMap");
        foreach (var row in query)
        {
            MatchCollection links = Regex.Matches(row.Cities, @"(\B#)(\w\w+)", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            foreach (Match link in links)
            {
                <p>@link.Value.GroupBy(s => s).OrderByDescending(g => g.Count()).Select(g => g.Key)</p>
            }
        }


    If i don't use 

    .GroupBy(s => s).OrderByDescending(g => g.Count()).Select(g => g.Key)

    i get this results

    #Berlin
    #Bremen
    #Paris
    #Madrid
    #Munich
    #Berlin
    #Lyon
    #Berlin
    #Munich
    ...
    Sunday, July 2, 2017 9:22 AM
  • User347430248 posted

    Hi dow7,

    you can also try to refer example below may help you.

    Dictionary<string, int> GetTopKWords(string input, int k)
    {
        string[] words = Regex.Split(input, @"\W");
        var occurrences = new Dictionary<string, int>();
         
        foreach (var word in words)
        {
            string lowerWord = word.ToLowerInvariant();
            if (!occurrences.ContainsKey(lowerWord))
                occurrences.Add(lowerWord, 1);
            else
                occurrences[lowerWord]++;
        }
        return (from wp in occurrences.OrderByDescending(kvp => kvp.Value) select wp).Take(k).ToDictionary (kw => kw.Key, kw => kw.Value);
    }
    
    var input = "the quick brown fox is brown and jumps over the brown log over the long fire and quickly jumps to a brown fire fox";
    GetTopKWords(input, 10);
    

    Output:

    Reference:

    Using LINQ in C# to easily get a list of most often used words

    Regards

    Deepak

    Monday, July 3, 2017 7:27 AM
  • User-352524747 posted

    I get values from a MatchCollection and this doesn't work.

    I solved it using this query, to get all rows as one single row:

    db.QuerySingle(@"SELECT STUFF((SELECT TOP 100 ',' + Cities FROM EuMap FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1,'') AS One")

    And this function:

    public static List<string> CommonTags(string mct)
        {
            var hashtags = new List<string>();
            MatchCollection matchtgs = Regex.Matches(mct, @"\B#\w\w+", RegexOptions.IgnoreCase | RegexOptions.Compiled);
            matchtgs.Cast<Match>().Select(m => m.Value.ToLowerInvariant()).GroupBy(s => s).OrderByDescending(g => g.Count()).Select(g => g.Key).Distinct().ToList().ForEach(delegate (string s)
            {
                hashtags.Add(s);
            }
            );
            return hashtags;
        }

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Monday, July 3, 2017 1:40 PM