Appropriate Method for Storing PDF Repository in Azure

    Question

  • I am interested in leveraging Azure storage to host a large repository of PDFs.  Each PDF contains multiple reports, and upon request (via a web app) we extract page ranges from these PDFs and serve them up.  My attempts to use block blob storage for this failed miserably because extracting page ranges from PDFs requires a great deal of seeking to different byte offsets and reading many small pieces of data.  The overhead incurred by all the range requests across the REST interface for block blob storage makes this impossible to do at any reasonable speed.  I thought perhaps page blobs would be better suited for this, but I haven't seen anybody use page blobs for anything other than VHDs or specialized data structures (circular logs, etc.), and when I attempted to copy a PDF to my storage account using the Set-AzureStorageBlobContent PowerShell command, I received an error that the file size is invalid for a page blob (because of the 512-byte boundary).  This led me to feel like I'm trying to use this service incorrectly.

    TL;DR - If I need fast random access to thousands of large files in Azure Storage so that I can extract page ranges from PDFs, what would be the best way to go about that?

    Friday, September 4, 2015 3:11 PM
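
The page blob error described above comes from the requirement that a page blob's size be a multiple of 512 bytes. A minimal Python sketch of that arithmetic follows, in case padding is being considered. The names `PAGE_SIZE`, `padded_length`, and `pad_to_page_boundary` are illustrative and not part of any Azure SDK, and whether a given PDF library tolerates trailing zero bytes is an assumption that would need testing.

```python
PAGE_SIZE = 512  # page blobs are addressed in 512-byte pages


def padded_length(size_in_bytes: int) -> int:
    """Round a byte count up to the next multiple of PAGE_SIZE."""
    if size_in_bytes < 0:
        raise ValueError("size must be non-negative")
    # Ceiling division without floating point.
    return -(-size_in_bytes // PAGE_SIZE) * PAGE_SIZE


def pad_to_page_boundary(data: bytes) -> bytes:
    """Append zero bytes so the result length is a multiple of PAGE_SIZE."""
    return data + b"\x00" * (padded_length(len(data)) - len(data))
```

If padding is used, the original file length would have to be recorded somewhere (blob metadata, for example) so the padding can be trimmed before the PDF is parsed.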

Answers

All replies

  • Hi,

    We need more time to research this, and we'll keep you updated with our findings.
    We regret the inconvenience and appreciate your patience.

    Regards,
    Malar.


    Saturday, September 5, 2015 10:36 AM
  • My attempts to use block blob storage for this failed miserably because extracting page ranges from PDFs requires a great deal of seeking to different byte offsets and reading several small amounts of data.  The overhead incurred in all the range requests across the REST interface for block blob storage makes this impossible to do at any reasonable speed.  

    Hi BobMcLaren,

    Since I don't know anything about your requirements, I am not sure whether this suggestion is even appropriate for your case:

    Instead of seeking through the content of those PDFs directly in the files, wouldn't it be more efficient to index all of the PDF content using Azure Search (or Elasticsearch) and search the resulting index instead? Take a look at this example: http://wp.sjkp.dk/azure-search-pdf-indexing/

    Hope this helps!


    Best Regards,
    Carlos Sardo

    Sunday, September 6, 2015 6:49 PM
  • Thanks so much for your suggestion Carlos.  I have actually read that article and think it's great.  I am seriously considering using the Azure Search functionality for indexing my PDFs.  However, this does not address my root problem, which is efficiently extracting page ranges from a PDF on Azure storage.  The search service may be able to do a great job of telling me where to find the pages, but extracting them from the PDF is where my challenge lies.
    Monday, September 7, 2015 2:33 PM
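
One general way to cut the per-request overhead described in this thread is to merge nearby byte ranges before fetching them, trading a few extra bytes for fewer round trips. This is a technique not proposed by anyone in the thread, shown here as a minimal Python sketch. `coalesce_ranges` and the `max_gap` default are hypothetical choices, and the sketch assumes the needed (offset, length) pairs are known up front, which is not always true when walking a PDF's cross-reference table.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # (offset, length) in bytes


def coalesce_ranges(ranges: List[Range], max_gap: int = 64 * 1024) -> List[Range]:
    """Merge ranges that overlap or sit within max_gap bytes of each other."""
    merged: List[Range] = []
    for offset, length in sorted(ranges):
        if length <= 0:
            continue  # nothing to read for an empty range
        if merged:
            prev_offset, prev_length = merged[-1]
            prev_end = prev_offset + prev_length
            if offset - prev_end <= max_gap:
                # Extend the previous range to cover this one.
                new_end = max(prev_end, offset + length)
                merged[-1] = (prev_offset, new_end - prev_offset)
                continue
        merged.append((offset, length))
    return merged
```

Each merged range would then be fetched with a single ranged read, and the individual pieces sliced out of the returned buffer locally.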
  • I am not quite sure what you are going to use your PDFs for, but have you looked into Azure Files?

    It is a somewhat more vanilla solution, though probably not as robust as Blob storage.

    Tuesday, September 8, 2015 10:01 AM
  • Thanks Alex,

    I have looked at the Azure Files service, and indeed it looks like it would suit my needs well, but based on the latest article I read on the subject, it appears to still be in preview as of 8/4/2015.  This is for a client-facing, billable product with SLAs, so I want to make sure whatever service I am using is proven, tested, and guaranteed to stick around.

    Tuesday, September 8, 2015 1:41 PM
  • Hi Bob,

    File Storage is now Generally Available.
    You can refer to the following link for details:
    https://github.com/Azure/azure-content/blob/master/articles/storage/storage-dotnet-how-to-use-files.md

    Regards,
    Malar.

    • Marked as answer by BobMcLaren Wednesday, February 21, 2018 9:08 PM
    Friday, October 30, 2015 9:47 AM
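
Because Azure Files exposes a share over SMB, a mounted share can be read with ordinary file seeks instead of one REST call per range. A minimal Python sketch of that access pattern follows. `read_ranges` is a hypothetical helper, and it assumes the share is already mounted and the PDF has been opened in binary mode from that mount.

```python
from typing import BinaryIO, Iterable, List, Tuple


def read_ranges(stream: BinaryIO, ranges: Iterable[Tuple[int, int]]) -> List[bytes]:
    """Read each (offset, length) pair from a seekable binary stream."""
    chunks: List[bytes] = []
    for offset, length in ranges:
        stream.seek(offset)  # jump to the absolute byte offset
        chunks.append(stream.read(length))  # may be shorter near end of file
    return chunks
```

With a share mounted at a hypothetical path, usage would look like `with open(mounted_pdf_path, "rb") as pdf: parts = read_ranges(pdf, [(0, 8), (4096, 512)])`. Whether this is fast enough for a given workload would still need measuring, since the reads still travel over the network.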