Indexing a PDF with metadata per hit (coordinates per word)

  • Question

  • We are evaluating Azure Search for indexing a large number of PDF documents.

    1) The first challenge is extracting text from a PDF in order to store it in a "Text" field of Azure Search. We are experienced in this field and know how to handle it.

    2) The second challenge comes when retrieving the results. Ideally, we should show a hit directly in the PDF. This requires that we save the PDF coordinates of every single word along with the text. We have thought about two approaches:

    a) Save something like "Hello <meta c=123;456> World <meta c=887;978>" in the text field and have a custom analyzer ignore all <meta> tags. I think this is not (yet) possible with Azure Search, right?

    b) Have a second field with metadata tags for every word in the "Text" field. This requires that we get the index of the retrieved match, so that we can find the corresponding data in the metadata field. However, the index of a match is not returned, is that correct?
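    To make approach (b) concrete, here is a minimal sketch of how the two fields could be built so that the i-th metadata entry describes the i-th word of the text. The field names ("text", "wordCoords"), the tuple layout of the extracted words, and the JSON encoding are assumptions for illustration only, not anything Azure Search prescribes.

```python
import json

def build_document(doc_id, words):
    """Build a search document from extracted PDF words.

    words: list of (text, page, x, y) tuples, as a PDF extractor might
    produce them (hypothetical layout). Words must not contain whitespace.
    """
    text = " ".join(w[0] for w in words)
    # Entry i holds the coordinates of the i-th whitespace-separated word of `text`.
    coords = [{"p": page, "x": x, "y": y} for _, page, x, y in words]
    return {"id": doc_id, "text": text, "wordCoords": json.dumps(coords)}

def coords_for_word(document, word_index):
    """Look up the coordinates of one word, given its index in the text."""
    return json.loads(document["wordCoords"])[word_index]

doc = build_document("1", [("Hello", 1, 123, 456), ("World", 1, 887, 978)])
print(doc["text"])
print(coords_for_word(doc, 1))
```

    The lookup only works once the word index of the match is known, which is exactly the piece of information the question is about.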

    Is there another approach that we have overlooked? Do you have recommendations on how to store "per match" metadata in Azure Search?

    Thank you, Thomas

    Tuesday, June 30, 2015 11:35 AM

Answers

  • Hi Thomas,

    Regarding storing metadata with the text: Azure Search doesn't currently support custom analyzer chains, but it's on our roadmap. There is no exact date yet, but right now it looks at least several months away.

    You are correct that Azure Search doesn't return the index of a match in the response. Providing this information would be a pretty big undertaking for us, so I would expect it to take more time than custom analyzer chains.

    One approach you could take would be to search within the PDF for the highlight snippets returned by Azure Search. This could work if your PDFs aren't too large.
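    A rough sketch of that idea, assuming the PDF words and their coordinates are available on the client as a list of (text, page, x, y) tuples (a hypothetical layout) and that highlights arrive wrapped in <em> tags. It uses exact word matching and a linear scan, so it only illustrates the principle.

```python
OPEN, CLOSE = "\x00", "\x01"  # sentinel characters standing in for the highlight tags

def parse_snippet(snippet, pre="<em>", post="</em>"):
    """Return (plain words, indices of the highlighted words) for one snippet."""
    marked = snippet.replace(pre, OPEN).replace(post, CLOSE)
    words, hits = [], []
    inside = False  # True while we are between an opening and a closing tag
    for token in marked.split():
        cleaned = token.replace(OPEN, "").replace(CLOSE, "")
        if cleaned and (inside or OPEN in token):
            hits.append(len(words))
        if OPEN in token or CLOSE in token:
            # The last marker in the token decides the state for the next token.
            inside = token.rfind(OPEN) > token.rfind(CLOSE)
        if cleaned:
            words.append(cleaned)
    return words, hits

def locate_snippet(pdf_words, snippet):
    """Return the coordinates of the highlighted words for every occurrence."""
    words, hits = parse_snippet(snippet)
    if not words:
        return []
    texts = [w[0] for w in pdf_words]
    results = []
    for start in range(len(texts) - len(words) + 1):
        if texts[start:start + len(words)] == words:
            results.append([pdf_words[start + i][1:] for i in hits])
    return results

pdf_words = [("Hello", 1, 123, 456), ("World", 1, 887, 978)]
print(locate_snippet(pdf_words, "Hello <em>World</em>"))
```

    Each entry in the result corresponds to one place where the snippet occurs, so more than one entry means the hit cannot be placed unambiguously.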

    Hope this helps,

    -Bruce

    Thursday, July 2, 2015 8:36 PM

All replies

  • "One approach you could take would be to search within the PDF for the highlight snippets returned by Azure Search. This could work if your PDFs aren't too large."

    Thank you, Bruce

    The problem with this approach is that a snippet returned by Azure Search could occur multiple times in the PDF (think of footers, for example). Additionally, it is hard to perform a reliable search because the snippet and the original text differ in punctuation, line breaks, character "folding" (oe instead of ö), and similar normalization.
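    Both difficulties can be illustrated with a small sketch. The folding table and the normalization steps below are assumptions that merely approximate what an analyzer might do, and words split across line breaks are not handled. The function reports every position at which the normalized snippet matches, so repeated text such as a footer shows up as several candidates.

```python
import re
import unicodedata

# Assumed German-style folding; a real analyzer may fold differently.
FOLDS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def normalize(word):
    """Lowercase, fold umlauts, strip other diacritics and punctuation."""
    word = unicodedata.normalize("NFC", word).lower().translate(FOLDS)
    word = unicodedata.normalize("NFKD", word)
    word = "".join(c for c in word if not unicodedata.combining(c))
    return re.sub(r"[^\w]", "", word)

def find_occurrences(pdf_words, snippet_words):
    """Return the start index in pdf_words of every normalized match."""
    # Keep the original index of each word that survives normalization.
    hay = [(i, n) for i, w in enumerate(pdf_words) if (n := normalize(w))]
    needle = [n for w in snippet_words if (n := normalize(w))]
    if not needle:
        return []
    keys = [n for _, n in hay]
    return [hay[s][0]
            for s in range(len(keys) - len(needle) + 1)
            if keys[s:s + len(needle)] == needle]

page = ["Schöne", "Grüße,", "Seite", "1", "Schöne", "Grüße"]
matches = find_occurrences(page, ["schoene", "gruesse"])
print(matches)  # more than one index means the hit is ambiguous
```

    This only detects the ambiguity; deciding which occurrence the search engine actually matched still requires information that the response does not contain.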

    For these reasons, automatically locating an Azure Search hit in the PDF does not seem promising. We may have to let the end user search within the document, as they would on a web page.

    Friday, July 3, 2015 7:45 AM