none
Using LINQ to Mark Duplicate Values in a Data Column RRS feed

  • Question

  • Hi, I am trying to mark duplicate values that exist in one specific column in a data table using LINQ. I have been using the standard For/Next Loop nested within another For/Next Loop to check for duplicates and it works, but when the records count gets to 500 it takes about 2 maybe 3 seconds to check the entire column. Of course this is not ideal. I've scoured the internet looking for a faster way to do this, with LINQ, but I keep coming across the LINQ method of finding distinct values instead.  

    I am not trying to delete the duplicate values either, just inform the user (via DataGridView) where the duplicate entries exist. I want it to be all-inclusive as well, meaning all the duplicate values get marked, not skipping the original. For example:

    A data column with values of:

    Jeff
    Jeff
    Bryan
    Debra
    Zack
    Jeff

    I want to get a datarow enumerated list via LINQ that I can iterate through so that I can mark ALL of the 'Jeff' entries as duplicates (all three). Or possibly there is another way but running through the column with nested for/each loops is too slow.

    If you're able to help, thanks a bunch.

    Friday, August 4, 2017 2:32 AM

Answers

  • A datatable in this case is slow no matter if you are using a manual loop yourself,  or you ask Linq to do it with Linq using its loop mechanism. In the future, you should look into using List(of T) with custom object for speed as opposed to a datatable for performance.

    https://chrisbitting.com/2016/05/04/datatable-row-loop-csharp-performance-testing-linq-vs-select-vs-parallel-vs-for/

    https://dzone.com/articles/reasons-move-datatables

    http://lauteikkehn.blogspot.com/2012/03/datatable-vs-list.html

    The link is using C# Linq with lambada to find all duplicates. The below I converted it to VB lambada. 

    https://stackoverflow.com/questions/14189537/select-entire-duplicate-row-from-datatable-with-linq-c-sharp

    dim allDuplicates = dt.AsEnumerable()
        .GroupBy(Function(dr) dr.Field<string>("VMDList"))
        .Where(Function(g) g.Count() > 1)
        .SelectMany(Function(g) g)
        .ToList();

    You can play in VB with Linqpad to run your VB Linq until it's right.

    https://www.linqpad.net/

    • Marked as answer by StyVB Saturday, August 5, 2017 12:19 AM
    Friday, August 4, 2017 3:20 AM
  • Check the next approach too:

    Dim s As New HashSet(Of String)
    Dim j = dt.Columns.IndexOf("Name")
    Dim k = dt.Columns.IndexOf("IsDuplicate")
    
    For i = 0 To dt.Rows.Count - 1
       s.Add(CStr(dt.Rows(i)(j)))
    Next
    
    For i = 0 To dt.Rows.Count - 1
       dt.Rows(i)(k) = If(s.Contains(CStr(dt.Rows(i)(j))), "*", "")
    Next

    where dt is the datatable, “Name” and “IsDuplicate” — the columns of type String. It puts “*” if the name is a duplicate.

    Compare with your original loops.




    • Edited by Viorel_MVP Friday, August 4, 2017 5:39 AM
    • Marked as answer by StyVB Saturday, August 5, 2017 12:19 AM
    Friday, August 4, 2017 5:35 AM
  • @styVB:

    Besides the given LINQ methods, VB.NET's LINQ can also help you like this:

    Imports System.Linq
    Module Module1
     
        Sub Main()
     
            Dim dt As New DataTable
            dt.Columns.Add("PersonName")
     
            dt.Rows.Add("John")
            dt.Rows.Add("John")
            dt.Rows.Add("John")
            dt.Rows.Add("Mary")
            dt.Rows.Add("Mary")
            dt.Rows.Add("Peter")
     
     
            Dim result = From row In dt.Rows.Cast(Of DataRow)
                         Group row By pname = row("PersonName").ToString() Into Group
                         Where Group.Count() > 1
                         Select pname
     
            For Each name As String In result
                Console.WriteLine(name)
            Next
        End Sub
     
    End Module

    @Cor Ligthert:

    "Expression" doesn't belong to DataTable object but for DataColumn Expression (You CAN NEITHER get the intellisense from DataTable, NOR fetch the property from DataTable object from MSDN, your link is something to do with DataColumn by carefully) . And I don't know what it has anything to do with the selection of duplicated rows? If you wanna use this, maybe you have to have another new column to do something like counting……to count the duplicated column values. However what if the user's table doesn't have such a column for Expression?

    Can you show your full codes to prove what you meant?


    ASP.NET Forum
    StackOverFlow
    FreeRice Donate
    Issues to report




    • Edited by ThankfulHeart Friday, August 4, 2017 3:07 PM
    • Marked as answer by StyVB Saturday, August 5, 2017 12:19 AM
    Friday, August 4, 2017 3:06 PM
  •      

    I tried this but obviously does not work...

    Dim allDuplicates = dtSamples.AsEnumerable().GroupBy(Function(dr) Not dr.RowState = DataRowState.Deleted AndAlso dr.Field(Of String)("sample_key").ToUpper).Where(Function(g) g.Count() > 1).SelectMany(Function(g) g).ToList()

    I then tried to declare allDuplicates as EnumerableRowCollection but i still get an error..can anyone help with this at all? Thanks.

    • Marked as answer by StyVB Saturday, August 5, 2017 2:34 PM
    Saturday, August 5, 2017 12:41 PM

All replies

  • A datatable in this case is slow no matter if you are using a manual loop yourself,  or you ask Linq to do it with Linq using its loop mechanism. In the future, you should look into using List(of T) with custom object for speed as opposed to a datatable for performance.

    https://chrisbitting.com/2016/05/04/datatable-row-loop-csharp-performance-testing-linq-vs-select-vs-parallel-vs-for/

    https://dzone.com/articles/reasons-move-datatables

    http://lauteikkehn.blogspot.com/2012/03/datatable-vs-list.html

    The link is using C# Linq with lambada to find all duplicates. The below I converted it to VB lambada. 

    https://stackoverflow.com/questions/14189537/select-entire-duplicate-row-from-datatable-with-linq-c-sharp

    dim allDuplicates = dt.AsEnumerable()
        .GroupBy(Function(dr) dr.Field<string>("VMDList"))
        .Where(Function(g) g.Count() > 1)
        .SelectMany(Function(g) g)
        .ToList();

    You can play in VB with Linqpad to run your VB Linq until it's right.

    https://www.linqpad.net/

    • Marked as answer by StyVB Saturday, August 5, 2017 12:19 AM
    Friday, August 4, 2017 3:20 AM
  • Check the next approach too:

    Dim s As New HashSet(Of String)
    Dim j = dt.Columns.IndexOf("Name")
    Dim k = dt.Columns.IndexOf("IsDuplicate")
    
    For i = 0 To dt.Rows.Count - 1
       s.Add(CStr(dt.Rows(i)(j)))
    Next
    
    For i = 0 To dt.Rows.Count - 1
       dt.Rows(i)(k) = If(s.Contains(CStr(dt.Rows(i)(j))), "*", "")
    Next

    where dt is the datatable, “Name” and “IsDuplicate” — the columns of type String. It puts “*” if the name is a duplicate.

    Compare with your original loops.




    • Edited by Viorel_MVP Friday, August 4, 2017 5:39 AM
    • Marked as answer by StyVB Saturday, August 5, 2017 12:19 AM
    Friday, August 4, 2017 5:35 AM
  • @styVB:

    Besides the given LINQ methods, VB.NET's LINQ can also help you like this:

    Imports System.Linq
    Module Module1
     
        Sub Main()
     
            Dim dt As New DataTable
            dt.Columns.Add("PersonName")
     
            dt.Rows.Add("John")
            dt.Rows.Add("John")
            dt.Rows.Add("John")
            dt.Rows.Add("Mary")
            dt.Rows.Add("Mary")
            dt.Rows.Add("Peter")
     
     
            Dim result = From row In dt.Rows.Cast(Of DataRow)
                         Group row By pname = row("PersonName").ToString() Into Group
                         Where Group.Count() > 1
                         Select pname
     
            For Each name As String In result
                Console.WriteLine(name)
            Next
        End Sub
     
    End Module

    @Cor Ligthert:

    "Expression" doesn't belong to DataTable object but for DataColumn Expression (You CAN NEITHER get the intellisense from DataTable, NOR fetch the property from DataTable object from MSDN, your link is something to do with DataColumn by carefully) . And I don't know what it has anything to do with the selection of duplicated rows? If you wanna use this, maybe you have to have another new column to do something like counting……to count the duplicated column values. However what if the user's table doesn't have such a column for Expression?

    Can you show your full codes to prove what you meant?


    ASP.NET Forum
    StackOverFlow
    FreeRice Donate
    Issues to report




    • Edited by ThankfulHeart Friday, August 4, 2017 3:07 PM
    • Marked as answer by StyVB Saturday, August 5, 2017 12:19 AM
    Friday, August 4, 2017 3:06 PM
  • @thankfull haert, I've complete misunderstood the question. I've deleted my reply

    Success
    Cor

    Friday, August 4, 2017 5:17 PM
  • Thank you guys for your help. I will also look into LinqPad as well.
    Saturday, August 5, 2017 12:20 AM
  • Thank you guys for your help. I will also look into LinqPad as well.

    Hi

    A bit late to the table,but here is yet another approach.

    This uses a DataTable Filter, and in the example,filters 9999 records almost instantaneous.

    If you want to try it out, the filter works on the current selected column and the filter is case insensitive. An empty TextBox will clear the filter and show all records.

    It may not suit your needs as it just lists all the matches.

    ' This example needs a Form1 with
    ' a blank DataGridView1, 
    ' TextBox1 - filter text
    ' Button1 - Filter button
    Option Strict On
    Option Explicit On
    Public Class Form1
        Dim myTable As New DataTable("Freddy")
        Dim view As New DataView(myTable)
        Dim rand As New Random
        Private Sub Form1_Load(sender As Object, e As EventArgs) Handles MyBase.Load
            With myTable
                .Columns.Add("ID", GetType(Integer))
                .Columns.Add("Name", GetType(String))
                .Columns.Add("Roll Number", GetType(String))
                .Columns.Add("Subject1", GetType(String))
                .Columns.Add("Subject2", GetType(String))
                .Columns.Add("Score", GetType(String))
    
                ' just some test data
                Dim r1() As String = {"N1", "N2", "N3", "N4"}
                Dim r2() As String = {"Math", "Social", "English", "Scottish"}
                For i As Integer = 0 To 9999
                    .Rows.Add(i, r1(rand.Next(4)), rand.Next(111), r2(rand.Next(4)), r2(rand.Next(4)), rand.NextDouble() * 999)
                Next
            End With
    
            ' just some formatting of DGV
            With DataGridView1
                .DataSource = myTable
                .MultiSelect = False
                .RowHeadersWidth = 24
                .ColumnHeadersDefaultCellStyle.WrapMode = DataGridViewTriState.True
                .ColumnHeadersDefaultCellStyle.Alignment = DataGridViewContentAlignment.MiddleCenter
                .AutoSizeColumnsMode = DataGridViewAutoSizeColumnsMode.Fill
                .SelectionMode = DataGridViewSelectionMode.CellSelect
                For i As Integer = 0 To 4
                    .Columns(i).DefaultCellStyle.Alignment = DataGridViewContentAlignment.MiddleCenter
                Next
            End With
            DataGridView1.DataSource = view
        End Sub
        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
            ' filter selected column with TextBox1 text
            ' if TextBox1 is empty then clear filter
            If Trim(TextBox1.Text).Length < 1 Or DataGridView1.SelectedCells.Count < 1 Then
                view.RowFilter = Nothing
                Exit Sub
            End If
            Dim c As Integer = DataGridView1.SelectedCells(0).ColumnIndex
            Dim h As String = DataGridView1.Columns(c).Name
            Dim v As String = DataGridView1.SelectedCells(0).Value.ToString
            view.RowFilter = h & "='" & TextBox1.Text & "'"
        End Sub
    End Class
    


    Regards Les, Livingston, Scotland

    Saturday, August 5, 2017 1:08 AM
  • Dim allDuplicates = dtSamples.AsEnumerable().GroupBy(Function(dr) dr.Field(Of String)("sample_key").ToUpper).Where(Function(g) g.Count() > 1).SelectMany(Function(g) g).ToList()

    This works great...until you have some rows in your datatable with a row state = deleted then there's an exception. Would you know how to avoid this exception? I'm not sure where I would put a line similar to 'where not .rowstate = datarowstate.deleted.

    Saturday, August 5, 2017 3:21 AM
  •      

    I tried this but obviously does not work...

    Dim allDuplicates = dtSamples.AsEnumerable().GroupBy(Function(dr) Not dr.RowState = DataRowState.Deleted AndAlso dr.Field(Of String)("sample_key").ToUpper).Where(Function(g) g.Count() > 1).SelectMany(Function(g) g).ToList()

    I then tried to declare allDuplicates as EnumerableRowCollection but i still get an error..can anyone help with this at all? Thanks.

    • Marked as answer by StyVB Saturday, August 5, 2017 2:34 PM
    Saturday, August 5, 2017 12:41 PM
  • More like:

       Dim allDuplicates = dtSamples.AsEnumerable().Where(Function(dr) dr.RowState <> DataRowState.Deleted).GroupBy(Function(dr) dr.Field(Of String)("sample_key").ToUpper).Where(Function(g) g.Count() > 1).SelectMany(Function(g) g).ToList()

    Dim allDuplicates = dtSamples.AsEnumerable().
                                 Where(Function(dr) dr.RowState <> DataRowState.Deleted).
                                 GroupBy(Function(dr) dr.Field(Of String)("sample_key").ToUpper).
                                 Where(Function(g) g.Count() > 1).SelectMany(Function(g) g).
                                 ToList()
    

        

    Saturday, August 5, 2017 1:18 PM
  • That did it, thank you VERY much! I'm learning a lot about Linq and just used what I gathered from your response to write edit one of my other Linq statements and it worked. Thanks a bunch!
    Saturday, August 5, 2017 2:35 PM