Microsoft Academic - gzip + md5sum to save time+egress

  • Question

  • Hi!

    I got the Microsoft Academic Graph data provisioned on one of my storage accounts.

    Since I'm indexing the data in another back end, I have to download it.

    This generates a large amount of egress cost (~50€) for <500GiB and it seems to be rate-limited. Is there a way to compress them with gzip (or comparable) before I download them?

    Also, the provisioned files somehow came without an md5sum, so I had to download the important ones again to check whether they were the same. Can I generate them somehow, or is that not possible when the files are provisioned by Microsoft Academic?

    Friday, March 29, 2019 2:37 PM

Answers

  • Apologies for the delayed response! Adding to Adam’s response:


    Azure Storage does not support on-the-wire compression today.  You can pre-compress data before you upload it, and then it will download compressed.
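    Since the compression has to happen client-side, the pre-compress-then-upload workflow could look roughly like the sketch below. This is only an illustration: the connection string, container, and blob names are placeholders, and the third-party `azure-storage-blob` package is assumed to be installed.

    ```python
    import gzip


    def compress_file(path: str) -> bytes:
        """Gzip-compress a local file and return the compressed bytes."""
        with open(path, "rb") as f:
            return gzip.compress(f.read())


    def upload_compressed(path: str, conn_str: str, container: str, blob_name: str) -> None:
        """Upload a gzip-compressed copy of a local file (assumes azure-storage-blob)."""
        from azure.storage.blob import BlobClient, ContentSettings

        blob = BlobClient.from_connection_string(
            conn_str, container_name=container, blob_name=blob_name
        )
        blob.upload_blob(
            compress_file(path),
            overwrite=True,
            # Content-Encoding: gzip lets HTTP clients that honor the header
            # decompress transparently on download.
            content_settings=ContentSettings(content_encoding="gzip"),
        )
    ```

    Setting `Content-Encoding` is optional; storing the blob as a plain `.gz` file and decompressing after download works just as well.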

    For md5, if you know the expected md5 for a blob you can call SetBlobProperties and update the Content-MD5 for the blob.  The service will not validate that the MD5 is correct, but the service will return the MD5 when a client downloads the whole blob, and the client libraries will validate the MD5 during the download.
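    In the current Python SDK (`azure-storage-blob`, assumed installed), `SetBlobProperties` corresponds to `set_http_headers`. A minimal sketch with placeholder names; the service stores whatever digest you provide without validating it:

    ```python
    import hashlib


    def md5_digest(data: bytes) -> bytes:
        """Raw 16-byte MD5 digest, the form a blob's Content-MD5 property stores."""
        return hashlib.md5(data).digest()


    def set_blob_md5(conn_str: str, container: str, blob_name: str, digest: bytes) -> None:
        """Set Content-MD5 on an existing blob (assumes azure-storage-blob)."""
        from azure.storage.blob import BlobClient, ContentSettings

        blob = BlobClient.from_connection_string(
            conn_str, container_name=container, blob_name=blob_name
        )
        # set_http_headers is the SDK wrapper around SetBlobProperties.
        blob.set_http_headers(
            content_settings=ContentSettings(content_md5=bytearray(digest))
        )
    ```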

    You can also use transactional MD5 to reliably download a blob that does not have a Content-MD5 set. In this case, you download the blob in chunks of at most 4 MB; the service calculates the MD5 for each requested range, and the client library can validate the transactional MD5 for that range.
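    A sketch of that chunked download with the Python SDK (`azure-storage-blob` assumed; `validate_content=True` asks the service for a per-range transactional MD5 and has the client verify each chunk on arrival):

    ```python
    def iter_ranges(total_size: int, chunk: int = 4 * 1024 * 1024):
        """Yield (offset, length) pairs covering total_size in chunks of at most 4 MB."""
        offset = 0
        while offset < total_size:
            yield offset, min(chunk, total_size - offset)
            offset += chunk


    def download_with_transactional_md5(conn_str, container, blob_name, dest_path):
        """Download a blob chunk-by-chunk, validating each range's MD5."""
        from azure.storage.blob import BlobClient

        blob = BlobClient.from_connection_string(
            conn_str, container_name=container, blob_name=blob_name
        )
        size = blob.get_blob_properties().size
        with open(dest_path, "wb") as out:
            for offset, length in iter_ranges(size):
                chunk = blob.download_blob(
                    offset=offset, length=length, validate_content=True
                )
                out.write(chunk.readall())
    ```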

    • Marked as answer by einsweniger Friday, April 5, 2019 8:26 AM
    Friday, April 5, 2019 4:58 AM

All replies

  • Can you please elaborate a bit more on this query?

    Microsoft Academic Graph is currently in free preview. Consumers incur costs only on their own Azure resource usage associated with the graph (i.e. storing, downloading, processing, analytics, etc.). See the pricing page for Azure cost estimator links that pre-populate storage costs associated with storing the approximate size of the graph.

    It's important to note that old versions of MAG are not removed or modified in any way by the provisioning process, so if you have signed up for automatic provisioning you are responsible for removing older releases.

    We update the data and send out a new copy every 2 weeks; you can delete the old copies once you receive the new one.

    * Price estimates are based on only the most recent version of MAG being retained in storage. Use the Azure estimator links above to model different usage scenarios, e.g. retaining older versions of MAG. For more details, refer to this article.

    ** MAG "core" version refers to the complete graph as detailed in the data schema

    Additional information: You may try the xz format for its better compression ratio compared to bzip2 or gzip. The downside is that the tools are not always installed by default.
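    To get a rough feel for the trade-off, Python's standard library can compare the two formats directly (the `lzma` module writes the .xz container). The sample below is synthetic, highly repetitive data; real MAG TSV files will compress differently:

    ```python
    import gzip
    import lzma  # Python's lzma module produces the .xz container format

    # Synthetic repetitive sample, only for illustrating relative sizes.
    sample = b"The quick brown fox jumps over the lazy dog. " * 2000

    gz = gzip.compress(sample, compresslevel=9)
    xz = lzma.compress(sample, preset=9)

    print(f"raw: {len(sample)} B, gzip -9: {len(gz)} B, xz -9: {len(xz)} B")
    ```

    On text with long-range redundancy, xz usually wins on ratio at the cost of compression time; zstd (which the asker mentions later in the thread) typically lands between the two on ratio while being considerably faster.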

    A quota can be applied to the API to ensure fair access and fast response times for our services.

    For more details you may refer to this link.

    We provide the Microsoft Academic Graph in RDF in several parts, enabling you to use only the specific parts you need.

    Kindly post the question on Stack Overflow to receive focused and immediate assistance from the right set of experts.

    Saturday, March 30, 2019 5:40 AM
  • I have to say, I'm quite confused by your answer. The first part (before 'additional information') does not relate to any of my questions. I know about various compression formats, but how do I apply this to the provisioned files? (I'd rather use zstd because it's faster, but let's stay on topic, please.)

    Also I cannot see what the CORE API or RDF have to do with anything related to my questions. And why should I post it on SO, tagged as "microsoft academic" when I have questions regarding the Azure Storage?

    I'll rephrase my questions:

    How can I compress artifacts in place in Azure Storage?
    How do I generate an md5sum in Azure Storage for files I did not upload myself?

    Please ignore the MAG part, it is just the motivation to frame my question and provide possibly necessary context.

    • Edited by einsweniger Saturday, March 30, 2019 10:35 AM
    Saturday, March 30, 2019 10:35 AM
  • @einsweniger, apologies for the misunderstanding in the previous answer.

    Gzip encoding is possible in Blob storage. There's an external blog written by StefanGordon here that includes a tool as well.
    More info on the supported compression types (Deflate, Gzip, Bzip2, Lzo, Snappy) can be found here.

    MD5: There is a blob property that allows you to set the Content-MD5 value:

    x-ms-blob-content-md5

    There are also external tools that could be used for this: a tool on GitHub can be used to get the hashes of several blobs at the same time.
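    One subtlety worth noting: `x-ms-blob-content-md5` expects the base64 encoding of the raw 16-byte digest, not the hex string that `md5sum` prints. A small stdlib helper for the conversion:

    ```python
    import base64
    import hashlib


    def content_md5_header(data: bytes) -> str:
        """Base64-encoded MD5 digest, the format x-ms-blob-content-md5 expects.

        Note: this is base64 of the raw digest bytes, not of the hex string
        that the md5sum command prints.
        """
        return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


    print(content_md5_header(b"hello"))
    ```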

    Let me know if any of these solutions work for you. I'll also escalate your request to the Storage team to get more content on the two topics mentioned: MD5 and compression.

    Thanks,

    Adam


    Wednesday, April 3, 2019 5:15 PM
  • Hi Adam, thank you for the response!

    I've looked at all the tools, and from what I gather, generating a gzip with the linked tool will download the content, encode it, and then upload it again; so that's not quite what I wanted. I presume there is no way to save on egress here.

    Regarding checksums, the tools do the same: download all the data, feed it through the hashing algorithm, and then set the property 'x-ms-blob-content-md5'. Also not quite what I wanted, but I guess those are the limitations of Blob storage.

    I'm new to Azure Storage, maybe the storage needs to be that simple? Let me know if the Storage team has a solution that will not require that much egress.

    If it's not possible, could you drop the Microsoft Academic Graph team a note that it would be great if they set the MD5 on the blobs after provisioning?

    Thank you!

    regards,
    David

    Thursday, April 4, 2019 7:53 AM