VB duplicate file finder

All replies

  • Hi RONATMOODYLAKE,

    Welcome to the MSDN forum.

    This forum discusses Visual Studio WPF/SL Designer, Visual Studio Guidance Automation Toolkit, Developer Documentation and Help System, and Visual Studio Editor.

    According to your description, your issue is related to VB development, so I will move this thread to the VB forum for a professional answer.

    Thanks for your understanding.

    Regards,

    Judyzh

    • Edited by Judy ZhuY Wednesday, February 21, 2018 3:17 AM
    Wednesday, February 21, 2018 3:15 AM
  • I am looking for VB code to find duplicate image files.

    RONATMOODYLAKE

    Where to find and what is a duplicate image?

    Success
    Cor

    Wednesday, February 21, 2018 3:48 AM
  • Where to find:  on my laptop files - thousands of jpg's, png's, etc.  A duplicate image is one image that is identical to another.

    RONATMOODYLAKE

    Wednesday, February 21, 2018 1:15 PM
  • Well, what does duplicate mean? The same file name, or exactly the same image under a different filename? And what is the premise for the search? Do you enter a filename and want all file paths with the same name?

    La vida loca

    Wednesday, February 21, 2018 1:17 PM
  • Where to find:  on my laptop files - thousands of jpg's, png's, etc.  A duplicate image is one image that is identical to another.

    RONATMOODYLAKE

    Here are two ways to compare images pixel by pixel in this Find Waldo thread.

    And here:

    https://social.msdn.microsoft.com/Forums/vstudio/en-US/d3624dd0-0903-4310-ad97-4b5fd0b116f4/match-image-at-location-on-screen?forum=vbgeneral

    Wednesday, February 21, 2018 1:24 PM
  • A duplicate file is an image that is identical to another in terms of pixel density, size, etc.  Duplicate files may have different names.

    RONATMOODYLAKE

    Wednesday, February 21, 2018 1:32 PM
  • Where to find:  on my laptop files - thousands of jpg's, png's, etc.  A duplicate image is one image that is identical to another.

    RONATMOODYLAKE

    Yes, that sounds easy the way you describe it, and many people would pay for such a program. However, the problem is the same as the wheat and chessboard problem: many forget how much processing time it would need. Before one duplicate file is found, you will probably be long deceased.

    https://en.wikipedia.org/wiki/Wheat_and_chessboard_problem


    Success
    Cor

    • Marked as answer by RONATMOODYLAKE Wednesday, February 21, 2018 1:38 PM
    Wednesday, February 21, 2018 1:35 PM
  • Good Point.

    I have not done it and don't claim to understand the wheat problem, but it seems there is one loop per file, i.e. compare file 1 with all the other files until a match is found, then mark file 1 as a duplicate. So if there were 1000 files, that is one loop of 1000 (or less). Now there are 999 files. Now loop file 2 through 999 (or less)... so you have at most about 1000 * 999 / 2 comparisons, roughly half a million.
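
    A minimal sketch of that count, assuming every file is compared once with every file after it in the list. The module and function names are made up for illustration; nothing here touches real files.

```vb
Imports System

Module PairCountSketch
    ' Number of distinct pairs among fileCount files: n * (n - 1) / 2.
    ' This is the worst case for a "compare each file with every later file" loop.
    Function PairCount(fileCount As Long) As Long
        Return fileCount * (fileCount - 1) \ 2
    End Function

    Sub Main()
        For Each n As Long In {1000L, 10000L}
            Console.WriteLine(n & " files -> at most " & PairCount(n) & " comparisons")
        Next
        ' Prints:
        ' 1000 files -> at most 499500 comparisons
        ' 10000 files -> at most 49995000 comparisons
    End Sub
End Module
```

    The count grows with the square of the number of files, so ten times as many files means roughly a hundred times as many comparisons.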

    However, most images are a non-match at the first pixel check. Checking just the diagonal eliminates something like 90 percent.

    So I am thinking it can be done in less than a lifetime. More like a minute or two with 1000 files.

    May have to try it to get convinced. The exact criteria will come into play... Also, what do you intend to do with the info? Make a report, delete the duplicates, or what?

    Still thinking...


    Wednesday, February 21, 2018 2:02 PM
  • This seems to work. The speed can be increased greatly using lockbits.

    The result lists the duplicates.

    Here is the result:

    Public Class Form7
        Dim bmpPaths As New List(Of String)
        Dim sw As New Stopwatch
    
        Private Sub Form7_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    
            Dim di As New IO.DirectoryInfo("c:\test5")
            Dim aryFi As IO.FileInfo() = di.GetFiles("*.png")
            Dim fi As IO.FileInfo
    
            For Each fi In aryFi
                bmpPaths.Add(fi.FullName)
            Next
    
        End Sub
    
        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
            Dim dups As String
            Dim bmpTest As Bitmap
            Dim bmpTarget As Bitmap
            Dim clrTarget As Color
            Dim clrUL, clrUR, clrLL, clrLR, clrTest As Color
            Dim found As Boolean
    
            sw.Reset()
            sw.Start()
    
            For j As Integer = 0 To bmpPaths.Count - 1
    
                bmpTest = New Bitmap(bmpPaths(j))
    
                For i As Integer = 0 To bmpPaths.Count - 1
    
    
                    If i <> j And bmpPaths(i) <> "" Then
    
                        bmpTarget = New Bitmap(bmpPaths(i))
    
                        clrUL = bmpTest.GetPixel(0, 0)
                        clrUR = bmpTest.GetPixel(bmpTarget.Width - 1, 0)
                        clrLL = bmpTest.GetPixel(0, bmpTarget.Height - 1)
                        clrLR = bmpTest.GetPixel(bmpTarget.Width - 1, bmpTarget.Height - 1)
    
                        For x As Integer = 0 To bmpTest.Width - bmpTarget.Width
                            For y As Integer = 0 To bmpTest.Height - bmpTarget.Height
    
                                clrTarget = bmpTarget.GetPixel(x, y)
    
                                'first check the corners
                                If clrTarget = clrUL Then
                                    'found the upperleft pixel check upper right
                                    clrTarget = bmpTarget.GetPixel(x + bmpTarget.Width - 1, y)
    
                                    If clrTarget = clrUR Then
                                        clrTarget = bmpTarget.GetPixel(x, y + bmpTarget.Height - 1)
                                        If clrTarget = clrLL Then
                                            clrTarget = bmpTarget.GetPixel(x + bmpTarget.Width - 1, y + bmpTarget.Height - 1)
                                            If clrTarget = clrLR Then
                                                'found all four courners
                                                'check the diagonal
                                                Dim w1 As Integer
                                                found = True
                                                If bmpTarget.Width > bmpTarget.Height Then w1 = bmpTarget.Height Else w1 = bmpTarget.Width
                                                For x1 As Integer = 1 To w1 - 2
                                                    clrTarget = bmpTarget.GetPixel(x1, x1)
                                                    clrTest = bmpTest.GetPixel(x1, x1)
    
                                                    If clrTarget <> clrTest Then
                                                        found = False
                                                        Exit For
                                                    End If
                                                Next
    
    
                                                If found Then
                                                    'diagonal is a match now check every pixel
                                                    For w As Integer = 0 To bmpTarget.Width - 1
                                                        For h As Integer = 0 To bmpTarget.Height - 1
                                                            clrTarget = bmpTarget.GetPixel(w, h)
                                                            clrTest = bmpTest.GetPixel(w, h)
    
                                                            If clrTarget <> clrTest Then
                                                                'not a match
                                                                found = False
                                                                Exit For
                                                            End If
                                                        Next
                                                    Next
    
                                                    If found Then
                                                        dups &= bmpPaths(i) & " - " & bmpPaths(j) & vbLf
                                                        'remove this dup
                                                        bmpPaths(j) = ""
                                                    End If
    
                                                End If
                                            End If
                                        End If
                                    End If
                                End If
                            Next
                        Next
                    End If
                Next
            Next
    
            sw.Stop()
            Label1.Text = bmpPaths.Count.ToString & " files in " & sw.ElapsedMilliseconds & " ms" & vbLf & vbLf & dups
    
        End Sub
    End Class

    Here are the files:

    PS The dumb editor won't let me add a line, and it moved the pic to the bottom; it was at the top.


    Wednesday, February 21, 2018 3:27 PM
  • I tried this.  I am getting an error on the line: clrUR = bmpTest.GetPixel(bmpTarget.Width - 1, 0).

    The error is: Additional information: Parameter must be positive and < Width.

    my load event is:

            Dim di As New IO.DirectoryInfo("C:\Contacts 2014\Contact_Images\Movie_Images")
            Dim aryFi As IO.FileInfo() = di.GetFiles("*.jpg")
            Dim fi As IO.FileInfo
            For Each fi In aryFi
                bmpPaths.Add(fi.FullName)
            Next

    Maybe this will not work with .jpg's?


    RONATMOODYLAKE

    Wednesday, February 21, 2018 7:58 PM
  • This seems to work. The speed can be increased greatly using lockbits.

    You can speed that up somewhat by comparing the file sizes before starting. If the file sizes are not the same, the images cannot be identical. Or, if a BMP and a PNG can be considered identical, by comparing the image sizes.
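
    One way to sketch the file-size pre-check, assuming "identical" means byte-identical files of the same format: group the files by length and keep only groups with more than one member, so the pixel comparison runs only inside those groups. `CandidateGroups` is a hypothetical helper name, and the folder and pattern are simply the ones used in the thread's example.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq

Module SizeGroupingSketch
    ' Returns groups of paths whose files have the same length in bytes.
    ' Groups with a single member are dropped: a file with a unique size
    ' cannot have a byte-identical copy in the folder.
    Function CandidateGroups(folder As String, pattern As String) As List(Of List(Of String))
        Dim di As New DirectoryInfo(folder)
        Return di.GetFiles(pattern).
            GroupBy(Function(fi) fi.Length).
            Where(Function(g) g.Count() > 1).
            Select(Function(g) g.Select(Function(fi) fi.FullName).ToList()).
            ToList()
    End Function

    Sub Main()
        ' "c:\test3" is the folder from the example in this thread; substitute your own.
        For Each grp As List(Of String) In CandidateGroups("c:\test3", "*.png")
            Console.WriteLine(grp.Count & " files share one size:")
            For Each p As String In grp
                Console.WriteLine("    " & p)
            Next
        Next
    End Sub
End Module
```

    Files in the same group are only candidates; they still need a pixel or byte comparison to confirm they are duplicates.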

    Wednesday, February 21, 2018 8:17 PM
  • Ok then try this.

    The problem was that I had forgotten to try it with files of different sizes. Doh...

    Png or jpg does not matter, I think. In fact, you can do both at the same time.

    It is just a quickie to prove the concept. It has to be debugged. It will only run once right now; then you have to restart.

    I changed it to use Acamar's suggestion.

    I am sure other issues will come up with more testing. Try it with 1000 files and see how long it takes. If less than a lifetime then maybe worth improving the speed...

    PS I think more will be required if there is more than one duplicate of a file, but I can't think that far ahead right now...

    Edit: Now uses lockbits to compare the images which is much faster.

    'check for duplicate images v6 - lock bits
    Imports System.Drawing.Imaging
    Imports System.Runtime.InteropServices
    
    Public Class Form8
        Private DisplayText As New TextBox With {.Parent = Me, .Dock = DockStyle.Bottom, .Top = 50,
            .ScrollBars = ScrollBars.Both, .Font = New Font("tahoma", 10), .Multiline = True, .WordWrap = False}
        Private WithEvents GoButton As New Button With {.Parent = Me, .Top = 10, .Left = 100, .Text = "Go"}
    
        Private bmpPaths As New List(Of String)
        Private bmpPathsBackup As New List(Of String)
        Private bmpSizes As New List(Of Long)
        Private sw As New Stopwatch
    
        Private Sub Form7_Load(sender As Object, e As EventArgs) Handles MyBase.Load
    
            Dim di As New IO.DirectoryInfo("c:\test3")
            Dim aryFi As IO.FileInfo() = di.GetFiles("*.png")
            Dim fi As IO.FileInfo
    
            For Each fi In aryFi
                bmpPathsBackup.Add(fi.FullName)
                bmpSizes.Add(fi.Length)
            Next
    
            Form7_Resize(0, Nothing)
    
        End Sub
    
        Private Sub Form7_Resize(sender As Object, e As EventArgs) Handles Me.Resize
            DisplayText.Height = ClientSize.Height - (GoButton.Bottom + 20)
        End Sub
    
        Private Sub Button1_Click(sender As Object, e As EventArgs) Handles GoButton.Click
            Dim dups As String = ""
            Dim bmpA, bmpB As Bitmap
            Dim duplicateBmp As Boolean
            Dim dupCount As Integer = 0
    
            'make unreferenced copy for multiple tests
            bmpPaths.Clear()
            For Each pth As String In bmpPathsBackup
                Dim pth2 As String = CType(pth.Clone, String)
                bmpPaths.Add(pth2)
            Next
    
            sw.Reset()
            sw.Start()
    
            For bmpAcount As Integer = 0 To bmpPaths.Count - 1
    
                DisplayText.Text = "Processing: " & bmpAcount.ToString &
                    vbCrLf & bmpPaths(bmpAcount) &
                    vbCrLf & vbCrLf & dups
                DisplayText.Refresh()
    
                bmpDispose(bmpA)
    
                'convert 24 to 32 bit
                Dim img As Image = Image.FromFile(bmpPaths(bmpAcount))
                bmpA = New Bitmap(img)
                img.Dispose()
    
                For bmpBcount As Integer = bmpAcount + 1 To bmpPaths.Count - 1
    
                    If bmpPaths(bmpBcount) <> "" AndAlso
                        bmpBcount <> bmpAcount AndAlso
                        bmpSizes(bmpBcount) = bmpSizes(bmpAcount) Then
    
                        bmpDispose(bmpB)
    
                        'convert 24 to 32 bit
                        Dim img2 As Image = Image.FromFile(bmpPaths(bmpBcount))
                        bmpB = New Bitmap(img2)
                        img2.Dispose()
    
                        If bmpB.Width = bmpA.Width AndAlso
                            bmpB.Height = bmpA.Height Then
    
                            duplicateBmp = CompareBmps(bmpA, bmpB)
    
                            If duplicateBmp Then
                                'remove this duplicate
                                dups = bmpPaths(bmpBcount) & vbCrLf & " - " & bmpPaths(bmpAcount) & vbCrLf & dups
                                bmpPaths(bmpAcount) = ""
                                dupCount += 1
                                Exit For
                            End If
                        End If
                    End If
                Next
            Next
    
            sw.Stop()
            DisplayText.Text = bmpPaths.Count.ToString & " files in " & sw.ElapsedMilliseconds & " ms" &
                vbCrLf & "    Duplicates: " & dupCount.ToString &
                vbCrLf & vbCrLf & dups
    
        End Sub
    
        Private Sub bmpDispose(ByRef thisBmp As Bitmap)
            'disposes thisbmp image reference
            If thisBmp IsNot Nothing Then
                Dim tempbmp As Bitmap = thisBmp
                thisBmp = Nothing
                tempbmp.Dispose()
            End If
        End Sub
    
        Public Function CompareBmps(ByVal bmpA As Bitmap, ByVal bmpB As Bitmap) As Boolean
            'returns true if A and B are the same bitmaps.  must be 32 bit bmps of equal size

            'copy bitmap A into an integer array (one Integer per 32 bit pixel)
            'ReadOnly is enough because the pixels are only read, never modified
            Dim bmpRect As New Rectangle(0, 0, bmpA.Width, bmpA.Height)
            Dim bmpDataA As BitmapData = bmpA.LockBits(bmpRect, ImageLockMode.ReadOnly, bmpA.PixelFormat)
            Dim PixelDataA(bmpA.Width * bmpA.Height - 1) As Integer
            Marshal.Copy(bmpDataA.Scan0, PixelDataA, 0, PixelDataA.Length)
            bmpA.UnlockBits(bmpDataA)

            'copy bitmap B the same way
            bmpRect = New Rectangle(0, 0, bmpB.Width, bmpB.Height)
            Dim bmpDataB As BitmapData = bmpB.LockBits(bmpRect, ImageLockMode.ReadOnly, bmpB.PixelFormat)
            Dim PixelDataB(bmpB.Width * bmpB.Height - 1) As Integer
            Marshal.Copy(bmpDataB.Scan0, PixelDataB, 0, PixelDataB.Length)
            bmpB.UnlockBits(bmpDataB)

            'different pixel counts can never match (also guards the loop below)
            If PixelDataA.Length <> PixelDataB.Length Then Return False

            'compare the two bitmap arrays, stopping at the first difference
            For i As Integer = 0 To PixelDataA.Length - 1
                If PixelDataA(i) <> PixelDataB(i) Then Return False
            Next

            Return True
        End Function
    
    End Class



    • Edited by tommytwotrain Thursday, February 22, 2018 7:39 PM v6 add img dispose
    Wednesday, February 21, 2018 11:00 PM
  • I am sure other issues will come up with more testing. Try it with 1000 files and see how long it takes. If less than a lifetime then maybe worth improving the speed...

    I haven't run this to check, but I would suspect a useful change might be to adjust the looping to:

            For j As Integer = 0 To bmpPaths.Count - 2
                Label1.Text = "Processing: " & j.ToString
                Label1.Refresh()
    
                bmpTest = New Bitmap(bmpPaths(j))
    
                For i As Integer = j + 1 To bmpPaths.Count - 1
                    ...
    

    Thursday, February 22, 2018 2:11 AM
  • Acamar,

    Oh yeah, good one! It dropped the time for these 227 images from 2.8 secs to 0.9 secs.

    As I mentioned, it's just a concept. It depends on exactly what one has in mind: how many files, how large the images are, etc. These are not a lot of large images.

    PS I updated my example above.

    PS Razerz and I have played with this and, as I recall, using lockbits is like 10 times faster or more, especially with large images of 1000x1000 and over. Most of the images I used above were less than 500. He knows lockbits; I am just beginning to get it.

    I have seen faster ways of doing a getpixel if one wants to get into that.

    Thursday, February 22, 2018 3:37 AM
  • Tommy,

    Doing it for 270 files is not the point; for me it is about many thousands of files.

    I know that I can run recursive searches on a disk to get all files with their sizes, one image type at a time.

    If I then sort that collection by size, I can go through it and check which of the files with equal sizes are actually identical.

    Of course, the computer should be dedicated to that process: if an image is added during that time, the result is wrong.

    My experience with that recursive approach on a current Microsoft OS has led to many disappointments, because of the many strange ways folders have been made read-only.

    For me, since Windows 7 it has been impossible to loop through all those files.

    Although image software is not my thing, I can do it easily for one folder, even if that folder has subfolders.

    The first problem is "How to get all the files and their sizes".
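
    A rough sketch of one way to collect the files and their sizes, assuming it is acceptable to skip folders the program is not allowed to read rather than abort the whole walk. `CollectFiles` is a hypothetical helper; the calls it uses (`Directory.GetFiles`, `Directory.GetDirectories`, `FileInfo.Length`) are standard `System.IO` members.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO

Module FileWalkSketch
    ' Collects (path, size) for every file under root that matches pattern.
    ' Each folder is handled inside its own Try block, so one protected or
    ' vanished folder is skipped without stopping the rest of the walk.
    Function CollectFiles(root As String, pattern As String) As List(Of KeyValuePair(Of String, Long))
        Dim results As New List(Of KeyValuePair(Of String, Long))
        Dim pending As New Stack(Of String)
        pending.Push(root)

        While pending.Count > 0
            Dim current As String = pending.Pop()
            Try
                For Each filePath As String In Directory.GetFiles(current, pattern)
                    results.Add(New KeyValuePair(Of String, Long)(filePath, New FileInfo(filePath).Length))
                Next
                For Each subDir As String In Directory.GetDirectories(current)
                    pending.Push(subDir)
                Next
            Catch ex As UnauthorizedAccessException
                ' Protected folder: skip it.
            Catch ex As IOException
                ' Folder disappeared or cannot be read: skip it.
            End Try
        End While

        ' Sort by size so that files with equal sizes end up next to each other.
        results.Sort(Function(a, b) a.Value.CompareTo(b.Value))
        Return results
    End Function

    Sub Main()
        ' The root folder is only an example; substitute your own.
        Dim files As List(Of KeyValuePair(Of String, Long)) = CollectFiles("c:\test3", "*.png")
        Console.WriteLine(files.Count & " files found")
    End Sub
End Module
```

    The list is still only a snapshot: as noted above, files added or changed while the walk is running will not be reflected in it.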


    Success
    Cor


    • Edited by Cor Ligthert Thursday, February 22, 2018 4:30 AM
    Thursday, February 22, 2018 4:28 AM

  • The first problem is "How to get all the files and its sizes".


    Success
    Cor


    Cor,

    I was just going to say that!

    But getting the files to test together is technically another problem?

    Just for kicks I tested v3 with 600 images, several hundred of them over 2000 x 1000 pixels; it takes 13 minutes. So I think optimizing can cut that by a factor of 10 or more.

    But, I agree, if you are talking lots of large files then...?


    PS I just realized the time is dependent on the number of duplicates, because each one must check every pixel to confirm a dup. In my test I had copied the large files in twice, so there were several hundred that had one duplicate.
    Thursday, February 22, 2018 4:50 AM
  • PS Razerz and I have played with this and, as I recall, using lockbits is like 10 times faster or more, especially with large images of 1000x1000 and over. Most of the images I used above were less than 500. He knows lockbits; I am just beginning to get it.

    I have never bothered implementing a lockbits version, although I am sure it is much faster, mainly because if I need to compare several thousand images I would usually let it run in the background while I get on with something more important.   But if I were to do a lockbits version, I think it would be interesting to implement the checking as an unmanaged memory block compare in assembler.  That would include forcing it to run in multiple threads, with each thread running on a single core, to optimise the caching and branch prediction!

    You might get a small additional improvement by fully implementing the early exit - currently it only applies to the inner loop.   It is a balance of an extra test for each iteration of the outer loop (which is additional processing when the images match) against unneeded (albeit probably brief) iterations of the outer loop when the images don't match.
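
    One possible way to get the full early exit without an extra flag test per outer iteration is to put the comparison in its own function and return at the first mismatch, which leaves both loops at once. The sketch below uses plain Integer grids in place of bitmaps so that it is self-contained; `SamePixels` is a hypothetical name, not part of the code posted earlier.

```vb
Imports System

Module EarlyExitSketch
    ' Compares two pixel grids of equal dimensions.
    ' Return inside the inner loop exits the inner AND the outer loop,
    ' so no work is done after the first differing pixel.
    Function SamePixels(a As Integer(,), b As Integer(,)) As Boolean
        If a.GetLength(0) <> b.GetLength(0) OrElse a.GetLength(1) <> b.GetLength(1) Then Return False

        For w As Integer = 0 To a.GetLength(0) - 1
            For h As Integer = 0 To a.GetLength(1) - 1
                If a(w, h) <> b(w, h) Then Return False
            Next
        Next

        Return True
    End Function

    Sub Main()
        Dim original As Integer(,) = {{1, 2}, {3, 4}}
        Dim copy As Integer(,) = {{1, 2}, {3, 4}}
        Dim changed As Integer(,) = {{1, 2}, {3, 5}}

        Console.WriteLine(SamePixels(original, copy))    ' prints True
        Console.WriteLine(SamePixels(original, changed)) ' prints False
    End Sub
End Module
```

    With a flag-based version, the same effect needs a check of the flag after the inner loop, which is the extra per-iteration test mentioned above.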

    Thursday, February 22, 2018 5:02 AM
  • I came up with a lockbits version and the 600 files that were 13 minutes now takes 1 minute.

    I updated the example above to v6.


    Thursday, February 22, 2018 7:26 PM
  • I used an MFT scan to get all .bmp, .gif, .jpg, .png and .tif file paths, then copied them all to a folder on my desktop; there are 13,580. I did the same with .ico and there are 708. I think it would take a while, especially having to handle multilayered .gifs and .icos.

    La vida loca

    Friday, February 23, 2018 4:58 AM