HTML and Text File comparison

  • Question

  • I have to build a utility that compares the files in a particular directory. The files can be plain text files or HTML files. The HTML files also contain image paths, so that the images are displayed when the file is loaded in a browser. These images are available inside the folder.

     

    I want to design a solution that can compare two HTML files for both the text they contain and the images referenced in their <img> tags. It is possible that the text is exactly the same and even the image name is the same, yet the actual image is different. For example, an image in one file might show one bar graph while the image with the same name in the other file shows a different bar graph. I need to build a robust solution that can deal with this situation.

     

    If someone can suggest good ways to do this, it will help me create a good solution.

     

    Thanks,

    PK

     

    Wednesday, March 26, 2008 7:28 AM

Answers

  • Hi

    Here is my suggestion for how to do this

     

    Assumption of the solution

     

    1) all HTML files will be XHTML 1.1

     

    How to compare

     

    1) Load HTML1 and HTML2 as DOM objects

    2) Traverse the HTML1 tree depth first and, for each node visited, try to find the same node in HTML2

    3) If the node found at that location in HTML2 passes the business rule of equivalence, continue; otherwise mark it as unmatched
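
    The three steps above could be sketched roughly as follows. This is only an illustration in Python using the standard xml.etree.ElementTree module, and it assumes both files are well-formed XHTML. The names nodes_equivalent and compare_trees are hypothetical, and the equivalence rule shown (same tag, attributes, and whitespace-normalized text) is just one possible business rule.

```python
import xml.etree.ElementTree as ET


def normalized_text(value):
    """Collapse whitespace so indentation differences are not reported."""
    return " ".join((value or "").split())


def nodes_equivalent(a, b):
    """Hypothetical equivalence rule: same tag, attributes, and text."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and normalized_text(a.text) == normalized_text(b.text)
        and normalized_text(a.tail) == normalized_text(b.tail)
    )


def compare_trees(a, b, path=None, unmatched=None):
    """Walk both trees depth first and collect paths of unmatched nodes."""
    path = path or f"/{a.tag}"
    if unmatched is None:
        unmatched = []
    if not nodes_equivalent(a, b):
        unmatched.append(path)
        return unmatched  # do not descend below a mismatch
    children_a, children_b = list(a), list(b)
    for index, (child_a, child_b) in enumerate(zip(children_a, children_b)):
        compare_trees(child_a, child_b, f"{path}/{child_a.tag}[{index}]", unmatched)
    if len(children_a) != len(children_b):
        unmatched.append(f"{path} (child count differs)")
    return unmatched


# Assumed sample documents; real input would come from ET.parse(filename).
doc1 = ET.fromstring("<html><body><p>Sales</p><img src='bar.gif'/></body></html>")
doc2 = ET.fromstring("<html><body><p>Sales</p><img src='pie.gif'/></body></html>")
print(compare_trees(doc1, doc2))  # ['/html/body[0]/img[1]']
```

    Children are paired by position here, so an inserted or deleted node shifts every later sibling. A real tool would probably need a smarter matching rule.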

     

    Hope this helps

    G

    Wednesday, March 26, 2008 9:03 AM
  • Hi PraveenK,
    For image comparison, your software should follow this decision tree:

    • If both image URLs (one from each compared source) are absolute (as opposed to relative) and they are equal (that is, calling Equals on one with the other as the argument returns true), the images are the same, so you don't need to go further
      • An absolute URL is http://msdn.microsoft.com/architecture
      • A relative URL is ./images/logo.gif
    • ELSE (if at least one URL is relative, it does not matter whether the two strings are equal: each resolves against its own document's location, so the resulting absolute URLs are not the same), you must access the files and perform a binary comparison. You may find more guidance on implementing such functionality here
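
    A minimal sketch of this decision tree, written in Python for illustration, might look like the following. The function names is_absolute and images_match are hypothetical, relative paths are assumed to resolve against the folder that holds each HTML file, and treating a mixed or differing pair of absolute URLs as "not equal" without downloading anything is an assumption of this sketch, not part of the advice above.

```python
import filecmp
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def is_absolute(url):
    """Treat a URL as absolute when it carries a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


def images_match(src1, src2, html_dir1, html_dir2):
    """Apply the decision tree: URL equality first, then byte comparison."""
    if is_absolute(src1) and is_absolute(src2) and src1 == src2:
        return True  # same absolute URL, same resource
    if is_absolute(src1) or is_absolute(src2):
        # Assumption: remote images are not downloaded in this sketch.
        return False
    file1 = Path(html_dir1) / src1
    file2 = Path(html_dir2) / src2
    if not (file1.is_file() and file2.is_file()):
        return False
    # shallow=False forces a comparison of the file contents,
    # not just size and modification time.
    return filecmp.cmp(file1, file2, shallow=False)


# Two folders holding an image with the same name but different bytes.
with tempfile.TemporaryDirectory() as dir1, tempfile.TemporaryDirectory() as dir2:
    Path(dir1, "chart.gif").write_bytes(b"GIF89a-first-graph")
    Path(dir2, "chart.gif").write_bytes(b"GIF89a-other-graph")
    same_name_differs = images_match("chart.gif", "chart.gif", dir1, dir2)
print(same_name_differs)  # False
```

    A byte comparison reports two files as different even when they render identically (for example, the same picture saved at a different compression level), so it answers "same file", not "same picture".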
    For the HTML comparison, given the sequential nature of the algorithm, I suggest something like XmlTextReader so you avoid dealing with the details of XML tags, markup, etc. The drawback of this approach is that both files must be XHTML compliant. Are they? If they aren't, you'll have to perform a fine-grained comparison yourself
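
    To illustrate the sequential, reader-style approach without loading whole trees, here is a rough Python analogue. It uses xml.etree.ElementTree.iterparse rather than XmlTextReader, so it is a stand-in for the idea and not the .NET API. The names event_stream and first_difference are hypothetical, and the sketch ignores text that follows a child element (mixed content).

```python
import io
import xml.etree.ElementTree as ET
from itertools import zip_longest


def event_stream(source):
    """Yield (event, tag, attributes, text) tuples while parsing incrementally."""
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            yield ("start", element.tag, dict(element.attrib), None)
        else:
            text = " ".join((element.text or "").split())
            yield ("end", element.tag, None, text)
            element.clear()  # drop the finished element's content to save memory


def first_difference(source1, source2):
    """Return (position, event1, event2) for the first mismatch, else None."""
    pairs = zip_longest(event_stream(source1), event_stream(source2))
    for position, (event1, event2) in enumerate(pairs):
        if event1 != event2:
            return position, event1, event2
    return None


# Assumed in-memory samples; file names or open files work the same way.
a = io.BytesIO(b"<html><body><p>Sales</p></body></html>")
b = io.BytesIO(b"<html><body><p>Costs</p></body></html>")
result = first_difference(a, b)
print(result)  # (3, ('end', 'p', None, 'Sales'), ('end', 'p', None, 'Costs'))
```

    Because attributes are compared as dictionaries, their order in the source does not matter. Input that is not well-formed XML raises a parse error instead of producing a result.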

    Something else to consider is the canonical form of an HTML file: these two HTML constructs look different but are canonically equivalent
    • <IMG SRC="logo.gif" WIDTH=280 HEIGHT=320>
    • <IMG WIDTH=280 HEIGHT=320 SRC="logo.gif">
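
    One way to see that the two constructs are equivalent is to parse each tag and compare the attributes as a sorted collection rather than as raw text. The sketch below uses Python's standard html.parser, which also accepts HTML that is not XHTML compliant. TagCollector and canonical_tags are hypothetical names for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag together with its attributes in sorted order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names; sorting removes
        # any dependence on the order in which attributes were written.
        self.tags.append((tag, tuple(sorted(attrs))))


def canonical_tags(html):
    """Return the list of (tag, sorted attributes) found in the markup."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


first = '<IMG SRC="logo.gif" WIDTH=280 HEIGHT=320>'
second = '<IMG WIDTH=280 HEIGHT=320 SRC="logo.gif">'
print(canonical_tags(first) == canonical_tags(second))  # True
```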
    The solution gverma offered helps, as you load both trees in memory and then ask for equality (the Equals() method, which takes care of the details). It still has these drawbacks:
    • The files must be XHTML compliant; if yours are, that's not an issue
    • For huge source files, you may need additional infrastructure. That depends on your expected sizes: if they are small relative to your available memory, go ahead and disregard this comment
    Hope this helps
    Friday, April 18, 2008 1:17 AM

All replies

  • Hi,

    Your suggestion didn't cover how I can compare the images used in the HTML files. I need to compare the images as well. Can you give me some suggestions for comparing them?

    Thanks,
    PK
    Wednesday, March 26, 2008 10:09 AM