Querying duplicate files

  • Question

  • I'm learning LINQ and am trying to find "duplicate" files within a directory. I have a directory (let's call it "root") that may have any number of subdirectories. An application I'm using may create files with exactly the same name in different directories. Within a single directory, if the file already exists, it instead appends _2, _3, and so on to the file name, just before the file extension. Let's assume that the file extension is 3 characters and that the suffix never goes beyond _9. In SQL pseudo-code, I would write it something like this:

    WHERE FileName LIKE '%^_[2-9].___' ESCAPE '^'

    OR

    HAVING COUNT(FileName) > 1

    Can someone help me write a similar query in LINQ?  I have started with the code below that I found on MSDN, but it only accounts for exact duplicates.

    DirectoryInfo dir = new DirectoryInfo(strPath);

    IEnumerable<FileInfo> fileList = dir.GetFiles("*.*", SearchOption.AllDirectories);

    int nCharsToSkip = strPath.Length;

    var queryDupNames =
        from file in fileList
        group file.FullName.Substring(nCharsToSkip) by file.Name into fileGroup
        where fileGroup.Count() > 1
        select fileGroup;

    Thanks so much!

    Thursday, August 8, 2013 2:01 AM

Answers

  • Hello mateoc15,

    Thank you for posting in MSDN Forums.

    From your description, I understand that you are trying to find "duplicate" files within a directory.

    If I have misunderstood anything, please feel free to let me know.

    Using the code you provided, I made a sample.

    Following is my sample code:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    namespace ADONET
    {
        class Queryingduplicatefiles
        {
            internal void Show()
            {
                // Names collected so far, in the order they were found.
                List<string> collectList = new List<string>();
                // Change the root drive or folder if necessary.
                string startFolder = @"E:\BMX\WorkPlace";
                // Take a snapshot of the file system.
                DirectoryInfo dir = new DirectoryInfo(startFolder);
                // This method assumes that the application has discovery permissions
                // for all folders under the specified path.
                FileInfo[] fileList = dir.GetFiles("*.*", SearchOption.AllDirectories);
                foreach (FileInfo file in fileList)
                {
                    // Skip names that an earlier match has already collected.
                    if (collectList.Contains(file.Name))
                    {
                        continue;
                    }
                    // Base name without the extension. Unlike Substring with
                    // LastIndexOf('.'), this does not throw when a file has no extension.
                    string fileName = Path.GetFileNameWithoutExtension(file.Name);
                    // Substring match: finds exact duplicates and the _2, _3, ... variants.
                    List<string> fileNames = (from fileNew in fileList
                                              where fileNew.Name.Contains(fileName)
                                              select fileNew.Name).ToList();
                    if (fileNames.Count > 1)
                    {
                        foreach (string name in fileNames)
                        {
                            collectList.Add(name);
                        }
                    }
                }
                foreach (string name in collectList)
                {
                    Console.WriteLine(name);
                }
            }
        }
    }

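    It collects files whose names are true duplicates as well as files that have a _2, _3, etc. suffix. Because Contains is a substring match, any file whose name merely contains another file's base name is reported as well.

    The files 2013-09-01 and 2013-09-02 are also in the new folder, so they will be shown, too.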
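
    A stricter alternative to the substring match can be sketched as a LINQ grouping on a normalized name: strip a trailing _2 through _9 from the base name, then group and keep the groups with more than one member. This is only a sketch under the assumptions stated in the question (single-digit suffix directly before the extension). The names DuplicateFinder, NormalizeName and FindDuplicates are hypothetical, and the grouping is case-sensitive.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

static class DuplicateFinder
{
    // Matches a trailing _2 .. _9 at the very end of the base name
    // (assumption from the question: the suffix never goes beyond _9).
    static readonly Regex CopySuffix = new Regex(@"_[2-9]$");

    // "report_2.txt" -> "report.txt"; names without the suffix are returned unchanged.
    public static string NormalizeName(string fileName)
    {
        string baseName = Path.GetFileNameWithoutExtension(fileName);
        string extension = Path.GetExtension(fileName);
        return CopySuffix.Replace(baseName, "") + extension;
    }

    // Groups full paths by normalized file name and keeps only groups with
    // more than one member, so exact duplicates and _N copies land together.
    public static IEnumerable<IGrouping<string, string>> FindDuplicates(IEnumerable<string> paths)
    {
        return from path in paths
               group path by NormalizeName(Path.GetFileName(path)) into fileGroup
               where fileGroup.Count() > 1
               select fileGroup;
    }
}
```

    To run it against a real folder, pass Directory.EnumerateFiles(strPath, "*", SearchOption.AllDirectories) to FindDuplicates and print each group's Key followed by its members. Note that a file legitimately named with a trailing _2 is grouped with its un-suffixed namesake, which is the behavior the question asks for.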

    I look forward to hearing from you.

    Best Regards.



    Fred Bao

    Friday, August 9, 2013 5:05 AM

All replies

  • If you want to compare two strings for an exact match, use Equals.

    If you want to find a similar string, use Contains.
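
    As a small illustration of the difference, using made-up file names (the class and method names here are hypothetical): Equals is true only for identical strings, while Contains is true whenever one string appears anywhere inside the other, which also makes it match unrelated names.

```csharp
using System;

static class NameComparison
{
    // Exact, case-sensitive comparison of two file names.
    public static bool IsSameName(string a, string b)
    {
        return a.Equals(b);
    }

    // Case-sensitive substring test: true when fragment occurs anywhere in candidate.
    public static bool HasFragment(string candidate, string fragment)
    {
        return candidate.Contains(fragment);
    }
}

// NameComparison.IsSameName("report.txt", "report.txt")    is true
// NameComparison.IsSameName("report.txt", "report_2.txt")  is false
// NameComparison.HasFragment("report_2.txt", "report")     is true
// NameComparison.HasFragment("reporting.txt", "report")    is true (an over-match)
```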


    Thanks & Regards
    Syed Amjad Sr. Silverlight/WPF Developer,

    • Proposed as answer by Fred Bao Tuesday, August 13, 2013 9:49 AM
    Thursday, August 8, 2013 8:54 AM
  • I figured, but I'm struggling on how to combine the results of true duplicate file names and files that have a _2 or _3, etc.
    Thursday, August 8, 2013 12:29 PM