Azure Search Indexing - How and when?

  • Question

  • Hello,

    I'm trying to understand how and when information gets indexed in Azure Search. Unfortunately, I've been unable to find any documentation on this. I started down this path because I was evaluating "uploads" and "merges". Based on what I've been able to glean from the Azure Search docs on MSDN, it seems like indexing can happen:

    • At the index level
    • At the row/document level
    • At the field level

    From my understanding, indexing is responsible for evaluating data so that when a query happens, the most accurate results are returned. This would imply that some relevancy score is assigned to a document within an index during indexing. I'm assuming the relevancy score comes from the scoring profile. If this is the case, it seems like indexing would always happen at the row/document level. If that is correct, I do not see how it would ever happen at the index or field level.

    Is my understanding correct? Is there documentation anywhere about how/when indexing works in Azure Search? Thanks

    Tuesday, March 8, 2016 12:24 PM

Answers

  • An "upload" is like an HTTP PUT in that it fully replaces an existing document.

    A "merge" assumes that the document already exists and only updates the fields sent.

    (You can see a description of the actions here)

    For example if you "upload" a document 

    {
      "@search.action": "upload",
      "hotelId": "1",
      "hotelName": "Fancy Stay"
    }

    and you "merge" a new field

    {
      "@search.action": "merge",
      "hotelId": "1",
      "description": "Best hotel in town"
    }

    The document will then look like

    {
      "hotelId": "1",
      "hotelName": "Fancy Stay",
      "description": "Best hotel in town"
    }

    However, if you were then to "upload" a new version of the document without 'hotelName' or 'description' it would overwrite the whole document

    {
      "@search.action": "upload",
      "hotelId": "1",
      "description_fr": "Meilleur hôtel en ville"
    }

    and then the document would look like

    {
      "hotelId": "1",
      "description_fr": "Meilleur hôtel en ville"
    }


    Performance-wise, indexing work is done only for the fields that are sent, and every piece of work has a cost.

    So in the first example, work is done for 'hotelName', and in the second example (the "merge"), work is done only for the 'description' field.

    I'd highly recommend reading our new articles on these topics

    https://azure.microsoft.com/en-us/documentation/articles/search-create-index-rest-api/

    https://azure.microsoft.com/en-us/documentation/articles/search-import-data-rest-api/

    • Edited by Sean Saleh Tuesday, March 8, 2016 8:04 PM adding links to further documentation
    • Marked as answer by bonzo82 Wednesday, March 9, 2016 5:45 PM
    Tuesday, March 8, 2016 7:57 PM

All replies

  • Hi Bonzo,

    As for the when: indexing happens on demand if you make your own calls to the Azure Search API, either through REST or through the official SDK.

    The alternative is to use Indexers, which automatically index your data and keep it up to date. Please refer to this article for details on Indexers and the API references.

    As for the how: indexing a document means that all of its indexable fields are indexed too. Which fields those are depends on your index structure (which fields were marked for indexing/searching). By default, results are ordered by a score based on the TF-IDF algorithm, but you can alter this with Scoring Profiles or by using Lucene query syntax in your query. This is all covered in the article I linked previously.

    • Marked as answer by bonzo82 Tuesday, March 8, 2016 6:04 PM
    • Unmarked as answer by bonzo82 Tuesday, March 8, 2016 6:04 PM
    Tuesday, March 8, 2016 2:15 PM
  • Just to add to Ealsur's answer: when you index a document, an analyzer (which you can configure: blog, docs) extracts searchable terms from it. We organize them and compute statistics such as the frequency of each term in the document, its frequency across documents in the index (scoped to a given field), and others. When you issue your query, we compare the query terms with the terms in the index and retrieve the documents that match. Based on the statistics I mentioned, at query time we compute a score for each document that reflects how relevant that document is to your query relative to all other documents retrieved. You can use scoring profiles and/or term boosting in the Lucene query language to influence how the score is computed.
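To make the index-time versus query-time split described above more concrete, here is a minimal Python sketch. It uses a textbook TF-IDF sum, which is not necessarily the formula Azure Search applies internally. The `analyze`, `build_stats`, and `score` names, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def build_stats(docs):
    """Index time: collect per-document term counts and per-term document counts."""
    term_freqs = {doc_id: Counter(analyze(text)) for doc_id, text in docs.items()}
    doc_freqs = Counter()
    for counts in term_freqs.values():
        doc_freqs.update(counts.keys())  # each document counts once per term
    return term_freqs, doc_freqs


def score(query, term_freqs, doc_freqs):
    """Query time: rank matching documents by a textbook TF-IDF sum."""
    n_docs = len(term_freqs)
    scores = {}
    for doc_id, counts in term_freqs.items():
        total = 0.0
        for term in analyze(query):
            if counts[term]:
                idf = math.log(n_docs / doc_freqs[term]) + 1.0
                total += counts[term] * idf
        if total > 0:  # documents with no matching term are not retrieved
            scores[doc_id] = total
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "1": "fancy hotel in town",
    "2": "budget hotel with pool",
    "3": "quiet spa",
}
term_freqs, doc_freqs = build_stats(docs)
ranking = score("fancy hotel", term_freqs, doc_freqs)
print(ranking)
```

Note that nothing in `build_stats` assigns a relevance score to a document. The score only exists once a query arrives, which matches the description above.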

    Let me know if this answers your question,

    Janusz

    Tuesday, March 8, 2016 2:58 PM
  • Thank you for your response. I keep reading about indexers and analyzers. However, one thing is still unclear to me. When updating documents in a search index, why should I use a merge instead of an upload? An "upload" lets me update documents, and so does a "merge". It seems like there has to be an advantage to using "merge" over "upload". That's why I'm interested in the how and when behind indexing.

    I keep diving into indexing to find something that says a merge is faster than an upload. But I don't know whether that statement is true or just my assumption.

    Are merges faster than uploads? I know this is beginning to go outside the scope of this question. However, I'm really wrestling with the purpose of "merge" over "upload". Performance is the only reason I could think of, but I can't find anything to back up my assumption.

    Thank you for sharing your insights.

    Tuesday, March 8, 2016 6:50 PM
  • Taking it from the API Documentation:

    • upload: An upload action is similar to an "upsert" where the document will be inserted if it is new and updated/replaced if it exists. Note that all fields are replaced in the update case.

    • merge: Merge updates an existing document with the specified fields. If the document doesn't exist, the merge will fail. Any field you specify in a merge will replace the existing field in the document. This includes fields of type Collection(Edm.String). For example, if the document contains a field "tags" with value ["budget"] and you execute a merge with value ["economy", "pool"] for "tags", the final value of the "tags" field will be ["economy", "pool"]. It will not be ["budget", "economy", "pool"].

    • mergeOrUpload: This action behaves like merge if a document with the given key already exists in the index. If the document does not exist, it behaves like upload with a new document.

    From my understanding (possibly Janusz can confirm or deny), in a merge you can specify a subset of attributes, and only those attributes get updated on the document; the rest remain untouched. This only works if the document already exists. Upload is an "upsert" that needs the whole document to be posted.
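The three actions quoted above can be sketched as a small in-memory simulation in Python. This is not the Azure Search SDK or service code. The `apply_action` helper and the dict-based `index` are hypothetical, and the field names are borrowed from the hotel examples elsewhere in this thread.

```python
def apply_action(index, action, key_field="hotelId"):
    """Apply one indexing action to `index`, a plain dict keyed by document key."""
    # Everything except the action verb is document content.
    fields = {k: v for k, v in action.items() if k != "@search.action"}
    verb = action["@search.action"]
    key = fields[key_field]
    exists = key in index

    if verb == "upload" or (verb == "mergeOrUpload" and not exists):
        index[key] = fields  # insert, or replace the whole document
    elif verb in ("merge", "mergeOrUpload"):
        if not exists:
            raise KeyError(f"merge failed: no document with key {key!r}")
        # Each field sent replaces the stored value wholesale, collections included.
        index[key].update(fields)
    else:
        raise ValueError(f"unsupported action: {verb!r}")
    return index


index = {}
apply_action(index, {"@search.action": "upload", "hotelId": "1",
                     "hotelName": "Fancy Stay", "tags": ["budget"]})
apply_action(index, {"@search.action": "merge", "hotelId": "1",
                     "tags": ["economy", "pool"]})
apply_action(index, {"@search.action": "mergeOrUpload", "hotelId": "2",
                     "hotelName": "Roach Motel"})

try:
    apply_action(index, {"@search.action": "merge", "hotelId": "3",
                         "hotelName": "Missing"})
    merge_failed = False
except KeyError:
    merge_failed = True

print(index["1"])
print(merge_failed)
```

After the merge, document "1" keeps its `hotelName`, and its `tags` collection is replaced rather than appended to.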

    To give a real-world example, we have an index with around 50,000 items, and during the day we perform delete/upsert operations as our users publish or delete their content. At night, a process calculates an internal score for each document based on some complex logic and issues a merge with just that attribute for the affected documents. This way we send only the data we actually need to send, and it's much faster.
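A nightly job like the one described could build its request bodies roughly as follows. This is a sketch under assumptions: the key field `id`, the score field `internalScore`, and the `build_merge_batches` helper are hypothetical names. It assumes the documents endpoint accepts a JSON body with a `value` array of actions and caps a batch at 1000 documents, so check the current limits before relying on that number. Sending each body over HTTP is left out.

```python
import json


def build_merge_batches(scores, key_field="id", score_field="internalScore",
                        batch_size=1000):
    """Turn {document key: score} into JSON request bodies of merge actions."""
    actions = [
        # Only the key and the one changed field are sent for each document.
        {"@search.action": "merge", key_field: doc_id, score_field: value}
        for doc_id, value in scores.items()
    ]
    return [
        json.dumps({"value": actions[i:i + batch_size]})
        for i in range(0, len(actions), batch_size)
    ]


nightly_scores = {"a1": 0.92, "b2": 0.15, "c3": 0.58}
batches = build_merge_batches(nightly_scores, batch_size=2)
first = json.loads(batches[0])
print(len(batches))
print(first["value"][0])
```

Each body carries two small fields per document instead of the full document, which is where the bandwidth saving comes from.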

    Tuesday, March 8, 2016 7:51 PM