Azure Search substring query

  • Question

  • Hi all

    We're starting our first Azure pilot project for a web-based application that allows users to search OCR'd PDFs stored in Azure Blob storage. We plan on using Azure Search to index the PDFs and allow users to query for text matches. Users should be able to filter results by year and month. The PDFs are named using a timestamp format. For instance, 19680909.pdf would be September 9, 1968. The file name is stored in a field called metadata_storage_name. Is there a way to run a search against this field using a substring query to extract the year and month from the file name and return just the matching results? If not, is it possible to create another field in my index with the year and month extracted from the file name?

    Thank you


    Steven

    Wednesday, January 10, 2018 2:46 PM

Answers

  • That makes sense now. We were initially going to use Azure Storage Explorer to upload, but I see we'd need to create an in-house app or script to extract the year and month from each file name and then write those details to the blobs in the storage container.

    Thanks Bruce


    Steven

    • Marked as answer by steven_455 Monday, January 15, 2018 3:07 PM
    Monday, January 15, 2018 3:07 PM

All replies

  • Hi Steven,

    One approach would be to use a Lucene Regex query. For example, if you're looking for September, 1968 your search text would look like this:

    metadata_storage_name:/196809[0-9][0-9]/

    You need to set queryType to "full" in order to use the regex syntax.
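
As a rough sketch of the regex approach, the snippet below only builds the JSON body for a search request; it does not call the service. The helper name `build_month_regex_query` is hypothetical, and the sketch assumes the file name is tokenized so that the eight-digit date is a term the regex can match.

```python
def build_month_regex_query(year: int, month: int) -> dict:
    """Build a search request body matching file names for one year and month.

    Hypothetical helper: only "search" and "queryType" come from the reply above.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    # Year and month are fixed; the two trailing digit classes stand for the day.
    pattern = f"{year:04d}{month:02d}[0-9][0-9]"
    return {
        "search": f"metadata_storage_name:/{pattern}/",
        "queryType": "full",  # required for Lucene regex syntax
    }


print(build_month_regex_query(1968, 9))
```

The resulting dictionary can be serialized to JSON and sent as the body of a search request against the index.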

    Another approach would be to preprocess the blobs before indexing so that each one carries separate, filterable year and month fields, and then use a filter in the query. This will probably perform better than the regex query, but it will increase the size of your index.
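
For the filter approach, the query side might look something like the sketch below. It assumes the index has two filterable string fields named `year` and `month`; those field names, their string type, and the helper name `build_month_filter_query` are all assumptions for illustration, not something defined in this thread.

```python
def build_month_filter_query(text: str, year: int, month: int) -> dict:
    """Build a search request body that filters on year and month fields.

    Assumes hypothetical filterable string fields "year" and "month" in the index.
    """
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return {
        "search": text,  # the user's full-text query against the PDF content
        # OData filter; string values are quoted because the fields are assumed to be strings.
        "filter": f"year eq '{year}' and month eq '{month}'",
    }


print(build_month_filter_query("harvest report", 1968, 9))
```

Unlike the regex query, the filter does not need queryType set to "full", since it does not rely on Lucene syntax.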

    Wednesday, January 10, 2018 11:31 PM
    Moderator
  • Thank you, Bruce.

    I'm not sure how I would go about 'preprocessing the blobs'.  If this were a more traditional application, I'd use something like an ETL package to split out the year and month from file name and throw into 2 table columns.  This is my first venture into Azure web apps, so I'm uncertain of what the design would look like.  Can you elaborate a little?

    Tks again


    Steven

    Thursday, January 11, 2018 7:21 PM
  • My suggestion to preprocess the blobs assumes that you have control over the code that uploads the files to Azure Blob Storage. Assuming that is true, you can set additional blob metadata properties when uploading. I'm imagining two new properties for year and month that you would also have in your index definition and indexer field mappings.

    You can use this method to set user-defined metadata on a blob.
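
A minimal sketch of the extraction step an upload script could run is shown below. It uses only the standard library and stops short of the upload itself; the function name `metadata_from_name` and the metadata keys `year` and `month` are hypothetical, and the file names are assumed to follow the YYYYMMDD.pdf pattern from the question.

```python
import re
from datetime import date

# Matches names such as 19680909.pdf: four-digit year, two-digit month, two-digit day.
_NAME_PATTERN = re.compile(r"^(\d{4})(\d{2})(\d{2})\.pdf$", re.IGNORECASE)


def metadata_from_name(blob_name: str) -> dict:
    """Derive year and month metadata from a YYYYMMDD.pdf file name."""
    match = _NAME_PATTERN.match(blob_name)
    if match is None:
        raise ValueError(f"unexpected file name: {blob_name!r}")
    year, month, day = (int(part) for part in match.groups())
    date(year, month, day)  # raises ValueError for impossible dates
    # Blob metadata values are strings, so the numbers are converted back.
    return {"year": str(year), "month": str(month)}


print(metadata_from_name("19680909.pdf"))
```

The returned dictionary is what the upload code would attach to the blob as user-defined metadata, so that the indexer can map it to the corresponding index fields.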


    Thursday, January 11, 2018 8:05 PM
    Moderator