Compare two near-duplicate documents (PDF files) in C#

  • Question

  • I have developed a C# application to compare two near-duplicate documents (PDF files).

    It should compare the content of the two files and report how much of file1 matches file2, expressed as a percentage.

    My current code looks like this:

    private bool FileCompare(string file1, string file2)
    {
        // The same path referenced twice is trivially the same file.
        if (file1 == file2)
        {
            return true;
        }

        // Open both files read-only. "using" closes them even if an exception is thrown.
        using (FileStream fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read))
        using (FileStream fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read))
        {
            // Files of different length cannot be byte-identical.
            if (fs1.Length != fs2.Length)
            {
                return false;
            }

            int file1byte;
            int file2byte;

            // Read one byte from each file until the bytes differ
            // or the end of file1 is reached (ReadByte returns -1).
            do
            {
                file1byte = fs1.ReadByte();
                file2byte = fs2.ReadByte();
            }
            while ((file1byte == file2byte) && (file1byte != -1));

            // The bytes are equal at this point only if the files are the same.
            return file1byte == file2byte;
        }
    }
     
    private void PdfCompare_Load(object sender, EventArgs e)
    {
    }

    private void button1_Click(object sender, EventArgs e)
    {
        if (FileCompare(this.textBox1.Text, this.textBox2.Text))
        {
            MessageBox.Show("Files are equal.");
        }
        else
        {
            MessageBox.Show("Files are not equal.");
        }
    }
     
    }

    Please help me calculate the match percentage when comparing two near-duplicate PDF files.

    Friday, March 13, 2015 3:26 AM

All replies

  • Count all the bytes in one PDF file, count the mismatching bytes while comparing, then compute the percentage. Would that work?
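
    A minimal sketch of this byte-counting idea, not a definitive implementation. The class name ByteComparer and the method MatchPercentage are hypothetical names chosen for illustration. The sketch assumes both files fit in memory and counts positions that exist in only one file as mismatches.

```csharp
using System;
using System.IO;

public static class ByteComparer
{
    // Reads both files fully and compares them position by position.
    public static double MatchPercentage(string file1, string file2)
    {
        return MatchPercentage(File.ReadAllBytes(file1), File.ReadAllBytes(file2));
    }

    // Percentage of byte positions holding the same value in both arrays.
    // Positions beyond the end of the shorter array count as mismatches.
    public static double MatchPercentage(byte[] a, byte[] b)
    {
        int longer = Math.Max(a.Length, b.Length);
        if (longer == 0)
        {
            return 100.0; // two empty inputs are identical
        }

        int shorter = Math.Min(a.Length, b.Length);
        int matches = 0;
        for (int i = 0; i < shorter; i++)
        {
            if (a[i] == b[i])
            {
                matches++;
            }
        }
        return 100.0 * matches / longer;
    }
}
```

    Note that a positional comparison is sensitive to shifts: one inserted byte near the start makes most later positions mismatch, so for PDF files this number may say little about how similar the visible text is.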


    Friday, March 13, 2015 6:52 AM
  • My code looks like this:

    private void button1_Click(object sender, EventArgs e)
    {
        // Let the user pick the first PDF and show its full path in textBox1.
        using (OpenFileDialog openFileDialog = new OpenFileDialog())
        {
            openFileDialog.CheckFileExists = true;
            openFileDialog.AddExtension = true;
            openFileDialog.Filter = "PDF files (*.pdf)|*.pdf";
            if (openFileDialog.ShowDialog() == DialogResult.OK)
            {
                filename = Path.GetFileName(openFileDialog.FileName);
                path = Path.GetDirectoryName(openFileDialog.FileName);
                // Path.Combine avoids a doubled separator when path is a drive root.
                textBox1.Text = Path.Combine(path, filename);
            }
        }
    }

    private void button2_Click(object sender, EventArgs e)
    {
        // Same as button1_Click, but for the second PDF and textBox2.
        using (OpenFileDialog openFileDialog = new OpenFileDialog())
        {
            openFileDialog.CheckFileExists = true;
            openFileDialog.AddExtension = true;
            openFileDialog.Filter = "PDF files (*.pdf)|*.pdf";
            if (openFileDialog.ShowDialog() == DialogResult.OK)
            {
                filename = Path.GetFileName(openFileDialog.FileName);
                path = Path.GetDirectoryName(openFileDialog.FileName);
                textBox2.Text = Path.Combine(path, filename);
            }
        }
    }

    I want to compare the two PDF files and get the difference as a percentage.

    Please help me. Thank you.


    Wednesday, March 25, 2015 11:13 AM
  • Hi,

    Your case is specific to PDF. There is a related thread: Comparing two PDF file in C#

    You should read the content using the built-in methods of a PDF library. I think you can get some hints from that thread.

    To find the differing strings, please try this code:

      var Diff = str1.Where(x => !str2.Contains(x)).ToArray();  

    We are trying to better understand customer views on social support experience, so your participation in this interview project would be greatly appreciated if you have time. Thanks for helping make community forums a great place.
    Click HERE to participate the survey.


    • Edited by Kristin Xie Wednesday, March 25, 2015 11:47 AM
    Wednesday, March 25, 2015 11:46 AM
  • You may need a third-party DLL to read a PDF file, if that is what you want to do, e.g.

    http://www.squarepdf.net/pdfbox-in-net

    http://www.tallcomponents.com/

    If you only want to compare file sizes, you can use "FileInfo.Length" for this.

    https://msdn.microsoft.com/en-us/library/system.io.fileinfo.length%28v=vs.110%29.aspx

    regards

    joon

    Wednesday, March 25, 2015 12:22 PM
  • I want to compare the content of the PDF files as strings, not as bytes.

    For example, if the first PDF contains the word "ramesh" and the second PDF also contains "ramesh", that is a 100% match. If the first PDF contains "ramesh" and the second contains "rammi", I think the match might be around 65%.

    I want this type of code.

    Please help me.
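
    One possible way to get a string-level percentage like this is edit distance. The sketch below is an assumption about what "similarity" should mean, not the only choice: it uses Levenshtein distance (insertions, deletions, substitutions) divided by the longer string's length. TextSimilarity, LevenshteinDistance and SimilarityPercentage are hypothetical names, and the input strings are assumed to have been extracted from the PDF files already.

```csharp
using System;

public static class TextSimilarity
{
    // Classic dynamic-programming Levenshtein distance, keeping two rows in memory.
    public static int LevenshteinDistance(string s, string t)
    {
        int[] prev = new int[t.Length + 1];
        int[] curr = new int[t.Length + 1];

        // Turning an empty prefix of s into the first j chars of t takes j insertions.
        for (int j = 0; j <= t.Length; j++)
        {
            prev[j] = j;
        }

        for (int i = 1; i <= s.Length; i++)
        {
            curr[0] = i; // deleting the first i chars of s
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                int insertOrDelete = Math.Min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.Min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.Length];
    }

    // 100 * (1 - distance / length of the longer string).
    public static double SimilarityPercentage(string s, string t)
    {
        int longer = Math.Max(s.Length, t.Length);
        if (longer == 0)
        {
            return 100.0; // two empty strings are identical
        }
        return 100.0 * (longer - LevenshteinDistance(s, t)) / longer;
    }
}
```

    With this definition, SimilarityPercentage("ramesh", "ramesh") is 100 and SimilarityPercentage("ramesh", "rammi") is 50 (three edits out of six characters), so it will not reproduce the 65% guess; other metrics give other values. The running time grows with the product of the two lengths, so whole documents may need a word-level comparison instead.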

    Wednesday, March 25, 2015 12:50 PM
  • I don't want the difference. I want a similarity percentage: how much of one PDF's content matches the other PDF's content.
    Wednesday, March 25, 2015 1:35 PM
  • I don't want the difference. I want a similarity percentage: how much of one PDF's content matches the other PDF's content.

    Hi,

    I think you already know how to find the differences; what is left is calculating the percentage.

    Since the PDF files contain many words, you can match the strings one by one.

    Use Count() to count the differences, then use that count however you need.

    var Diff = str1.Where(x => !str2.Contains(x)).Count();
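
    A hedged sketch of how that count could be turned into a percentage at the word level. WordOverlap, Words and MatchPercentage are hypothetical names, the separator list is an arbitrary choice, and text1 and text2 are assumed to be strings already extracted from the two PDF files with a PDF library (not shown here).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordOverlap
{
    // Assumed word separators; adjust for the documents being compared.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?' };

    public static string[] Words(string text)
    {
        return text.Split(Separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Percentage of the words in text1 that also occur somewhere in text2.
    // Returns 0 when text1 contains no words, since there is nothing to match.
    public static double MatchPercentage(string text1, string text2)
    {
        string[] words1 = Words(text1);
        if (words1.Length == 0)
        {
            return 0.0;
        }

        // A case-insensitive set makes each Contains lookup cheap.
        var words2 = new HashSet<string>(Words(text2), StringComparer.OrdinalIgnoreCase);

        // Same shape as the Where/Contains/Count expression above, applied to words.
        int diff = words1.Where(w => !words2.Contains(w)).Count();

        return 100.0 * (words1.Length - diff) / words1.Length;
    }
}
```

    This measure is asymmetric (file1 against file2) and ignores word order and repetition, so it is only a rough indicator of similarity.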

    By the way, there is a good tool called i-net PDF content comparer.

    You can use it to verify your results.

    http://stackoverflow.com/questions/145657/tool-to-compare-large-numbers-of-pdf-files

    Best regards,

    Kristin

     




    • Edited by Kristin Xie Thursday, March 26, 2015 2:03 AM
    Thursday, March 26, 2015 2:01 AM