The checksum of a document is different after it is uploaded to a document library

  • Question

  • I have a custom page that uploads documents to a document library. The page checks whether a document with the same content (same checksum) already exists; if it does, an error message is shown and the document is not uploaded.

    From my testing:
    1. Upload Test.doc (assuming there is no document in the library yet)
    2. Upload the same Test.doc again (the error message is not shown and the document is uploaded)

    I would like to know: when a document is uploaded to a document library, does its checksum change? If so, is there a way to check whether the document content already exists?
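
    One way to sidestep the problem is to compute the checksum from the bytes the user submits, before they are written to the library, and to compare against checksums recorded the same way rather than recomputing them from the stored copies. The sketch below only illustrates that idea. `DuplicateChecker` and `content_digest` are hypothetical names, SHA-256 is an arbitrary choice of checksum, and the in-memory dictionary stands in for wherever the page would really persist the digests (a library column, for example).

```python
import hashlib
from typing import Dict, Optional


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw bytes as submitted."""
    return hashlib.sha256(data).hexdigest()


class DuplicateChecker:
    """Hypothetical helper: remembers digests of files as they were submitted."""

    def __init__(self) -> None:
        # digest -> name of the first file submitted with that content
        self._seen: Dict[str, str] = {}

    def register(self, name: str, data: bytes) -> Optional[str]:
        """Return the name of an earlier file with identical bytes.

        If no earlier file matches, record this one and return None.
        """
        digest = content_digest(data)
        if digest in self._seen:
            return self._seen[digest]
        self._seen[digest] = name
        return None


checker = DuplicateChecker()
first = checker.register("Test.doc", b"example bytes")    # new content
second = checker.register("Test.doc", b"example bytes")   # same bytes again
print(first, second)
```

    Because the digest is taken before the upload, it does not matter whether the stored file is later modified.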

    Thanks!


    jingzo (^_^)
    Saturday, November 20, 2010 2:55 PM

Answers

  •  

    I love the questions in these forums!  They lead me down new paths in SharePoint that I would have never tried.

    Such a simple question: "if a document is uploaded to doc library, will the checksum of the document change"

    Well, I know of a few cases where it can, but I'll come back to that. Here's the first very weird thing I discovered... I uploaded the same file (a Word document) three different ways and got three different file sizes, and when downloaded, the copies were all different from the originally uploaded file.

    (File size from right-clicking the file in Windows Explorer and selecting Properties)

    • File on disk (C:):
         139,264 bytes
    • File uploaded by clicking "Upload" in the library and then checking its size from Open With Windows Explorer:
         140,288 bytes
    • File uploaded with "Upload Multiple":
         139,776 bytes
    • File uploaded by dragging from C: (Windows Explorer) to Open With Windows Explorer:
         139,264 bytes

    Now I downloaded the file using drag and drop from Open With Windows Explorer:

    • File uploaded by clicking "Upload"
        140,288    (unchanged)
    • File uploaded by clicking "Upload Multiple"
        139,776   (unchanged)
    • File uploaded by dragging from C: (Windows Explorer) to Open With Windows Explorer:
        139,776   (CHANGED!)

    Next I wrote some .NET code to access the documents via the API, and the size reported by SPListItem.File.Length is the same as the downloaded numbers (140,288, 139,776, 139,776).

     

    Remember... the original file on C: was 139,264 bytes.

     

    Now I opened each of the "uploaded and then downloaded" files in a HEX viewer:

    • The file uploaded by clicking "Upload Multiple" and the file uploaded by dragging from C: (Windows Explorer) to Open With Windows Explorer:
        All bytes are identical until the end of the file, where there is what looks like random bytes (different in both files) and an incomplete fragment of an XML structure (the same in both files). (Junk in the upload buffer???)
    • File uploaded by clicking "Upload":
        First byte changed from 00 to D0
        Bytes/text added at the end of the file with metadata from the library columns!

    So... I don't think you can rely on the checksum of the file!
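
    The same comparison can be done without a hex viewer. This is a minimal sketch, not the code I used for the test above: `describe_difference` is a hypothetical helper that takes the original bytes and the downloaded bytes and reports the size change and the first offset at which they differ.

```python
from typing import Dict, Optional


def describe_difference(original: bytes, downloaded: bytes) -> Dict[str, Optional[int]]:
    """Summarize how a downloaded copy differs from the original file."""
    common = min(len(original), len(downloaded))
    # Offset of the first differing byte within the shared length, if any.
    first_diff = next(
        (i for i in range(common) if original[i] != downloaded[i]),
        None,
    )
    return {
        "original_size": len(original),
        "downloaded_size": len(downloaded),
        "size_delta": len(downloaded) - len(original),
        "first_diff_offset": first_diff,
    }


# Made-up bytes standing in for a file and its round-tripped copy.
report = describe_difference(b"\x00abc", b"\xd0abc<meta/>")
print(report)
```

    If `first_diff_offset` is `None` and `size_delta` is positive, the downloaded copy is the original with extra bytes appended. To compare real files, read them with `pathlib.Path(...).read_bytes()` first.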

     

    Note: the above was for a Word document. A simple text file (.txt) was unchanged by SharePoint and always reported the same file length no matter how I uploaded or downloaded it.

     

    Now back to the cases I already knew of where files can be different...

    • If you are using Information Rights Management - the IRM wrapper is removed on upload (so files can be indexed for search) and reapplied on download.
    • If you have fields bound between an Office document (Word, etc) and the columns in the library, a user who edits the library columns will also be changing the content of the file.

    Mike Smith TechTrainingNotes.blogspot.com
    • Proposed as answer by Dennis Gaida Monday, November 22, 2010 6:06 PM
    • Marked as answer by Porter Wang Thursday, November 25, 2010 8:33 AM
    Saturday, November 20, 2010 8:25 PM

All replies

  • Very informative reply. Thanks Mike.
    Regards
    NLV
    Visit SharePoint User group - India
    Monday, November 22, 2010 9:03 AM
  • Hi Mike,

    Thanks for your answer. We have also confirmed that the checksum of Office documents changes after they are uploaded to SharePoint.

    Is there any way to check whether a document has changed from the original?

    We need to upload documents to SharePoint and check whether they have changed from the originals. Normally we use an MD5 hash for this, but it cannot be used for Office documents that were uploaded to SharePoint (their MD5 hash has changed).

    Thank you.

    Regards,

    Michal.

    Thursday, September 8, 2016 9:14 AM
  • Michal,

    I have a little more info here: http://techtrainingnotes.blogspot.com/2010/11/sharepoint-changes-files-as-they-are.html

    What I have learned is that SharePoint updates Office documents to store the SharePoint metadata that was added to the item in the library.

    As far as checking the documents for change, if they are Office documents then they WILL be changed. Not the content, but the metadata. As an example, upload a document to a library where you have added a Content Type. Edit the library item's Content Type properties and then open the Office document. In the Info, Advanced Properties panel you will see those properties. Properties not found here can be found in the Custom tab. In the example below, I used the Dublin Core Columns content type. The properties for Author and Comments were never manually added to Excel, only to SharePoint. But, the Excel file was updated while in SharePoint.
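
    If only the metadata changes, one possible workaround for the zip-based Office formats (.docx, .xlsx) is to hash just the main content part inside the package instead of the whole file. This is a sketch under assumptions, not something I have verified against SharePoint: it assumes the property updates land in other parts of the package, `part_digest` is a hypothetical helper, and it does not apply to the older binary .doc format. It also will not help when library columns are bound to fields inside the document body, since that changes the content part itself.

```python
import hashlib
import io
import zipfile


def part_digest(package: bytes, part_name: str = "word/document.xml") -> str:
    """Hash a single part of a zip-based Office package.

    part_name is the path of the part inside the package, for example
    "word/document.xml" for Word or "xl/workbook.xml" for Excel.
    Raises KeyError if the part is not present.
    """
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        return hashlib.sha256(zf.read(part_name)).hexdigest()


def make_package(body: str, custom_props: str) -> bytes:
    """Build a tiny stand-in package for the demonstration (not a real .docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", body)
        zf.writestr("docProps/custom.xml", custom_props)
    return buf.getvalue()


before = make_package("<w:document/>", "<Properties/>")
after = make_package("<w:document/>", "<Properties><Author>Mike</Author></Properties>")
print(before == after)                            # whole packages differ
print(part_digest(before) == part_digest(after))  # content part matches
```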


    Mike Smith TechTrainingNotes.blogspot.com
    Books: SharePoint 2007 2010 Customization for the Site Owner, SharePoint 2010 Security for the Site Owner

    Friday, September 9, 2016 12:41 AM
  • Hi, 
    is there any way to upload files without changing their checksums?
    I noticed that some Office files uploaded to a Record Center do not change their checksums, but not all of them (.htm and .html files are still being modified by SharePoint).
    Regards
    Tuesday, May 30, 2017 9:31 AM