Consolidate content md5 of various parts

  • Question

  • Hi,

    We use the Java storage client library to transfer files to Azure. For a large file, we split the file into multiple parts of a fixed size and compute the content MD5 for each part. Once all the parts are transferred, the final content MD5 needs to be computed from the individual parts and validated against the one provided by Azure.

    For example, a 500 MB file is split into 4 parts of up to 128 MB each (three of 128 MB and a final one of 116 MB), so I now have 4 different content MD5 values and need to compute the final content MD5. The computed value should also match the one from Azure.

    So what is the algorithm used to compute the final content MD5 from the individual content MD5s of the parts?



    • Edited by ArunagiriR Tuesday, April 24, 2018 6:35 PM
    Tuesday, April 24, 2018 8:30 AM

Answers

  • I just got an answer from the internal devs. There's not really a way of combining component MD5s to get an overall MD5 at this point (maybe in the future). They also recommended checking this post: https://stackoverflow.com/a/2214304/9270120. It provides a solution to your situation: recording the per-part MD5 hashes while still feeding the same data into the computation of the final MD5.
    Wednesday, May 2, 2018 6:08 PM
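
The approach described in the answer above (keep a running digest for the whole file while also producing one digest per part) could be sketched in Java roughly as follows. This is a minimal sketch, not the code from the linked post: the class name `PartAndTotalMd5`, the `Result` holder, and the `hash` method are hypothetical, and the tiny part size in `main` is only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

class PartAndTotalMd5 {
    /** Base64-encoded MD5 of each part, plus the MD5 of the whole stream. */
    static final class Result {
        final List<String> partMd5s = new ArrayList<>();
        String totalMd5;
    }

    static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    static String b64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Reads the stream once, feeding every byte to both the per-part and the overall digest. */
    static Result hash(InputStream in, int partSize) {
        MessageDigest total = newMd5();
        MessageDigest part = newMd5();
        Result result = new Result();
        byte[] buffer = new byte[8192];
        int inPart = 0; // bytes fed to the current part so far
        try {
            int n;
            // Never read past the end of the current part.
            while ((n = in.read(buffer, 0, Math.min(buffer.length, partSize - inPart))) != -1) {
                total.update(buffer, 0, n);
                part.update(buffer, 0, n);
                inPart += n;
                if (inPart == partSize) {
                    result.partMd5s.add(b64(part.digest())); // digest() also resets the instance
                    inPart = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (inPart > 0) {
            result.partMd5s.add(b64(part.digest())); // final, shorter part
        }
        result.totalMd5 = b64(total.digest());
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "hello world".getBytes(StandardCharsets.UTF_8);
        Result r = hash(new ByteArrayInputStream(data), 4); // 4-byte parts, for illustration only
        System.out.println("parts: " + r.partMd5s);
        System.out.println("total: " + r.totalMd5);
    }
}
```

For a real transfer, the input would presumably be a file stream and the part size 128 MB; the overall value is what would be compared with the Base64 content MD5 reported by Azure.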

All replies

  • Hi,

    there's no way to use the 4 separate MD5 hashes to calculate the hash for the complete file. How I would do it: combine all bytes of the 4 parts and calculate the hash over that. So in the end it will be 4 byte arrays, add those to one big byte array and calculate the hash.

    I'm not a Java guy, so I don't know how to do it in Java. But you can use the System.Security.Cryptography.MD5 class' ComputeHash method to calculate the hash for a byte array. If you convert the output of that method to a Base64 string, you can compare it to the MD5 hash stored on Azure side.

    In PowerShell, it would be something like this:

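    # Compute the Base64-encoded MD5 of a file, for comparison with the value on the Azure side
    $crypto = [System.Security.Cryptography.MD5]::Create()
    # ReadAllBytes also works in PowerShell 6+, where Get-Content no longer accepts -Encoding byte
    $content = [System.IO.File]::ReadAllBytes("C:\yourfile.txt")
    $hash = [System.Convert]::ToBase64String($crypto.ComputeHash($content))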

    The $hash variable contains the value that you can use for comparison on Azure side. I bet you can convert something like this to Java :)

    Hope this helps you out.


    Floris van der Ploeg - www.florisvanderploeg.com

    If my post was helpful, remember to click the "Propose as answer" button.

    Tuesday, April 24, 2018 11:41 AM
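
A rough Java counterpart to the PowerShell snippet above, offered as a sketch: it hashes a byte array with MD5 and Base64-encodes the result. The class and method names (`Md5Base64`, `md5Base64`) are made up for illustration, and the file path is the same placeholder used in the PowerShell version.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class Md5Base64 {
    /** Returns the Base64-encoded MD5 digest of the given bytes. */
    static String md5Base64(byte[] data) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return Base64.getEncoder().encodeToString(md5.digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path; reads the whole file into memory, so only suitable for small files.
        byte[] content = Files.readAllBytes(Paths.get("C:\\yourfile.txt"));
        System.out.println(md5Base64(content));
    }
}
```

Reading the whole file into one array will not scale to multi-gigabyte files; for those, feeding the digest incrementally with `MessageDigest.update` avoids holding everything in memory.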
  • I created a single byte array and copied the contents of the 4 byte arrays into it. The resulting MD5 value is 88 characters long and does not match Azure. If I push the file as a single part, the computed MD5 value is 24 characters long and exactly matches Azure. The problem arises only when the file is split into multiple parts and I try to consolidate the MD5 hashes of all the parts.


    • Edited by ArunagiriR Tuesday, April 24, 2018 6:46 PM
    Tuesday, April 24, 2018 4:00 PM
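
The two lengths reported above are consistent with Base64 arithmetic: a 16-byte MD5 digest encodes to 24 characters, while 64 bytes (for instance, four 16-byte digests placed end to end) encode to 88. Whether that is what happened here is an assumption, since the post does not show the code. The sketch below, with hypothetical names and toy data, illustrates both lengths and also compares the hash of the joined digests with the hash of the joined data.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

class DigestLengths {
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    static String b64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Concatenates the given arrays in order. */
    static byte[] concat(byte[]... arrays) {
        int total = 0;
        for (byte[] a : arrays) {
            total += a.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] a : arrays) {
            System.arraycopy(a, 0, out, pos, a.length);
            pos += a.length;
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy stand-ins for the four file parts.
        byte[] p1 = "part one ".getBytes(StandardCharsets.UTF_8);
        byte[] p2 = "part two ".getBytes(StandardCharsets.UTF_8);
        byte[] p3 = "part three ".getBytes(StandardCharsets.UTF_8);
        byte[] p4 = "part four".getBytes(StandardCharsets.UTF_8);

        byte[] joinedDigests = concat(md5(p1), md5(p2), md5(p3), md5(p4));

        System.out.println(b64(md5(p1)).length());       // 24: one 16-byte digest
        System.out.println(b64(joinedDigests).length()); // 88: four digests, 64 bytes

        // Hashing the joined digests is not the same operation as hashing the joined data.
        String ofDigests = b64(md5(joinedDigests));
        String ofData = b64(md5(concat(p1, p2, p3, p4)));
        System.out.println(ofDigests.equals(ofData));
    }
}
```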
  • Let me run this by the internal team, and I'll get back to you.

    Adam
    Tuesday, April 24, 2018 6:12 PM
  • And splitting those files, how is that done? Is it possible that the split process adds some header/footer to the file (which will modify the MD5 hash)?

    Floris van der Ploeg - www.florisvanderploeg.com

    If my post was helpful, remember to click the "Propose as answer" button.

    Wednesday, April 25, 2018 8:07 AM
  • I am not appending any additional information to the file. For example, a 500 MB file will be split into 4 parts (3 parts of 128 MB, and a 4th of 116 MB to be precise). I calculate the MD5 hash after reading the 128 MB for each part and store it in a 2D byte array, indexed by part number with the corresponding hash. The consolidated MD5 needs to be computed from the elements of that byte array.
    Wednesday, April 25, 2018 10:51 AM
  • Hi Adam,

    Any updates or suggestions on how to consolidate the content MD5s of the individual parts?

    Wednesday, May 2, 2018 8:38 AM
  • I followed the steps mentioned in the Stack Overflow post and it worked. Thanks Adam!

    Monday, May 14, 2018 11:53 AM