Handling scoring and priority with search across multiple indexes

  • Question

  • Our application is now querying against multiple indexes simultaneously and aggregating the results (two indexes at the moment, but in the future it may be as many as ten). In our view, documents from one index are not necessarily more important than documents from another. Unfortunately, the score value is only assigned within the context of a given index, so our results are always grouped by index rather than ranked by a holistic score across all of the results.

    Has anyone else had to tackle this problem?  Any advice?  I tried sorting by the score given to each document just to see what would happen, but as expected, that basically groups the results by the indexes from which they came.
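    To illustrate, the aggregation we're doing is something along these lines (shown here in Python against the REST API just for illustration; the service URL, key, index names, and api-version are placeholders). Sorting the merged list by the raw @search.score is what produces the grouping-by-index effect, since each index computes scores from its own term statistics:

    import requests

    SERVICE = "https://my-service.search.windows.net"  # placeholder service URL
    API_VERSION = "2020-06-30"                          # placeholder api-version
    HEADERS = {"api-key": "<query-key>", "Content-Type": "application/json"}

    def search_index(index_name, query, top=10):
        # Run the same query against one index and tag each hit with its source index.
        url = f"{SERVICE}/indexes/{index_name}/docs/search?api-version={API_VERSION}"
        resp = requests.post(url, json={"search": query, "top": top}, headers=HEADERS)
        resp.raise_for_status()
        return [dict(hit, sourceIndex=index_name) for hit in resp.json()["value"]]

    def fan_out(query, index_names):
        # Merge hits from every index and sort by raw score. Because each
        # index scores against its own term statistics, one index's scores
        # tend to dominate, so the merged list comes back grouped by index.
        merged = []
        for name in index_names:
            merged.extend(search_index(name, query))
        return sorted(merged, key=lambda hit: hit["@search.score"], reverse=True)

    for hit in fan_out("quarterly report", ["index-a", "index-b"]):
        print(hit["sourceIndex"], hit["@search.score"])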

    Thanks!


    Wednesday, March 1, 2017 8:32 PM

All replies

  • As you observed, scoring is done per-index and there isn't really a way to harmonize them across indexes. Can you describe your scenario in a bit more detail? In particular, why are you querying multiple indexes?
    Wednesday, March 1, 2017 10:04 PM
    Moderator
  • I posted my scenario a few weeks ago: we have many different tables we are indexing, each with multiple columns we want to be searchable, and some columns have limited access for certain roles, etc. So, after explaining my situation here, it was recommended that we go with multiple indexes.
    Thursday, March 2, 2017 2:07 AM
  • ...I was just hoping someone had come up with a workaround of some kind.
    Thursday, March 2, 2017 2:07 AM
  • Ok, I remember now. :-)  Here is the thread for anyone else following along: https://social.msdn.microsoft.com/Forums/azure/en-US/e9b519aa-9827-487c-8e8d-58dba27936ba/when-to-use-multiple-indexes?forum=azuresearch

    I apologize for making my initial recommendation in haste. I've seen cases before that looked superficially similar to yours, but in fact they involved very different data models.

    From how you describe your query pattern, it sounds like a single index would be a better choice for you. The question is whether combining this data in a single index will cause problems that you can't work around. Some things to consider:

    1. How many distinct fields would you get across all of these different entity types combined? Azure Search has a limit on the number of fields in an index (100 for Basic SKU, 1000 for all other SKUs), so you'll have to fit within that.
    2. What are the cardinality relationships between the various entities that you're trying to model? Will updating one entity cause many cascading updates of other entities? If not, a single index should work.

    Besides these, did you find any other reasons why having all the entity types in a single index would cause problems for implementing your scenario?

    Thursday, March 2, 2017 2:48 AM
    Moderator
  • Our entities are very distinct (and we may want to use scoring profiles differently for different entities), and we are dealing with a lot of legacy tables that handled security for various roles in different ways in the past: some stored content in fields intended to be seen only by certain roles, others stored flags in columns indicating whether certain roles could see the data, etc. Another problem we saw with pushing everything into a single index is that we would have to de-dupe the results on the way back if we had to break one table's 5 columns of content into separate documents, since we don't necessarily want to show a result for the same entity more than once.

    One example: we are indexing a comments table that has columns linking it to other entities. So we may want a calendar item's data to be searchable, but also the comments related to that item (stored in another table), with the link to the related entity stored in the comments table.

    I dunno... We'll give it some more thought, but we may be forced to ship it as-is for now due to time constraints, since some significant refactoring would be required at this point.

    If we were to go to a single index, wouldn't the idea basically be to store every searchable column's content from the database tables into a single field in the index and just have some sort of field to indicate which table/column it represents? We wouldn't have a named field for every distinct column from every table in the same index, would we? That is an interesting thought... It sounds like a crazy idea in the SQL world, but maybe not in the search world?

     
    Thursday, March 2, 2017 3:34 AM
  • Another possible concern would be the number of documents in the index. We are starting with around 7 million, but after we add more entities, and at our growth rate, we may have 50 or 60 million within a year or two. Is that going to be scalable in terms of performance? How many is too many for an index? 100 million? 500 million? More?

    Thanks :)

    Thursday, March 2, 2017 1:42 PM
  • Regarding schema for a combined index, it's much easier to take the union of all fields rather than try to multiplex fields across different types. In other words, let's say you have emails with To, From, Subject, and Body fields, as well as contacts with Name and Address fields. In this case, you'd have these fields in your index: Id, To, From, Subject, Body, Name, Address. For emails, you'd leave the contact fields null, and for contacts, you'd leave the email fields null. One minor catch is that you need to assign IDs uniquely across all entity types. Assuming each entity already has a unique ID, you could concatenate some kind of entity type ID onto it to make it unique across all entity types.

    The reason this kind of schema is not a problem for Search is that your data is stored in an inverted index, which handles sparsity very well. Each unique value of a field (or each individual analyzed term for searchable fields) is a key into a list of documents that contain that value (or term). When a field is "null", it is simply not present in the index.
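    To make that concrete, a rough sketch of the create-index request for the email/contact example might look like the following (Python against the REST API, purely for illustration; I've added an optional entityType discriminator field, and the service URL, admin key, and api-version are placeholders):

    import requests

    SERVICE = "https://my-service.search.windows.net"  # placeholder
    API_VERSION = "2020-06-30"                          # placeholder
    HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}

    # Union of the email fields and the contact fields in one schema.
    # Documents of either type simply omit the fields that don't apply;
    # an absent ("null") field stores nothing in the inverted index.
    index_definition = {
        "name": "combined-index",
        "fields": [
            {"name": "id",         "type": "Edm.String", "key": True},
            {"name": "entityType", "type": "Edm.String", "filterable": True},
            # email fields
            {"name": "to",         "type": "Edm.String", "searchable": True},
            {"name": "from",       "type": "Edm.String", "searchable": True},
            {"name": "subject",    "type": "Edm.String", "searchable": True},
            {"name": "body",       "type": "Edm.String", "searchable": True},
            # contact fields
            {"name": "name",       "type": "Edm.String", "searchable": True},
            {"name": "address",    "type": "Edm.String", "searchable": True},
        ],
    }

    resp = requests.post(f"{SERVICE}/indexes?api-version={API_VERSION}",
                         json=index_definition, headers=HEADERS)
    resp.raise_for_status()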

    Regarding document count, that limit applies at the service level, not the index level. An index can be as big as you want, as long as it fits within the document count limit imposed by the SKU and the number of partitions that you choose.

    Thursday, March 2, 2017 7:10 PM
    Moderator
  • Thanks Bruce. Just to be sure I'm understanding correctly: let's say I have 3 tables I want to index, with a few searchable fields each. One of them represents emails, as in your example, another is a table of products, and the last is a table of calendar items. So, are you saying that I would want a single index representing the three tables (like below), and that each document I insert would only populate the fields representing its table (or document type, as I have below) and leave all other fields null, right?

    Index Name: "global-index-for-all-table-content"

    Fields:
    - documentTypeId (enuSearchableDocuments.Email, enuSearchableDocuments.Products, enuSearchableDocuments.CalendarItems, etc.)

    - emailsTo
    - emailsSubject
    - emailsBody
    - emailsSentDateTime
    - productsName
    - productsDescription
    - productsInventoryCount
    - productsPrice
    - calendarItemId
    - calendarItemDate
    - calendarItemName
    - calendarItemDescription
    - calendarItemOwnerId


    Correct?

    Thanks for your help!


    Friday, March 3, 2017 4:01 PM
  • That's correct, although you forgot to mention the id field that uniquely identifies each entity.
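    For example, a batch that pushes rows from your three tables into that index might look roughly like the sketch below (Python against the REST API, illustration only): the field names follow your list, the id values show one way of concatenating the entity type onto each row's id so keys stay unique across tables, and the service URL, key, api-version, and data are placeholders.

    import requests

    SERVICE = "https://my-service.search.windows.net"  # placeholder
    API_VERSION = "2020-06-30"                          # placeholder
    HEADERS = {"api-key": "<admin-key>", "Content-Type": "application/json"}
    INDEX = "global-index-for-all-table-content"

    # One batch can mix entity types; each document fills only the fields
    # that apply to its table and leaves the rest out (i.e. null).
    batch = {
        "value": [
            {
                "@search.action": "upload",
                "id": "email-12345",
                "documentTypeId": "Email",
                "emailsTo": "someone@example.com",
                "emailsSubject": "Q3 planning",
                "emailsBody": "Agenda attached",
            },
            {
                "@search.action": "upload",
                "id": "products-987",
                "documentTypeId": "Products",
                "productsName": "Widget",
                "productsDescription": "A very searchable widget",
            },
            {
                "@search.action": "upload",
                "id": "calendaritems-55",
                "documentTypeId": "CalendarItems",
                "calendarItemName": "Release review",
                "calendarItemDescription": "Review of the March release",
            },
        ]
    }

    resp = requests.post(f"{SERVICE}/indexes/{INDEX}/docs/index?api-version={API_VERSION}",
                         json=batch, headers=HEADERS)
    resp.raise_for_status()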
    Friday, March 3, 2017 8:10 PM
    Moderator