Faceting on Azure Search

  • Question

  • Hi

    I just started evaluating Azure Search as a potential replacement for Solr in our organization. While doing a POC for it, I found a weird scenario. I was requesting facets on a city field in my query, and with the default facet value count (10), I was getting results like Greenville(99). But when I filtered the same query with "city eq 'Greenville'", keeping the facet on city, the facet result became Greenville(172).

    Also, when I ran the default query with a facet value count of 1000, I started getting Greenville(172). I think something is really not right with faceting here. Has anybody else faced this?
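    The queries were shaped roughly like this (illustrative URLs only; the index name and api-version stand in for what we actually use):

        Default facet, count defaults to 10, returns Greenville(99):
        GET /indexes/{index}/docs?search=*&facet=city&api-version=2014-07-31-Preview

        Same query filtered to Greenville, returns Greenville(172):
        GET /indexes/{index}/docs?search=*&facet=city&$filter=city eq 'Greenville'&api-version=2014-07-31-Preview

        Default query with a larger facet count, returns Greenville(172):
        GET /indexes/{index}/docs?search=*&facet=city,count:1000&api-version=2014-07-31-Preview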

    Monday, November 17, 2014 7:15 AM


All replies

  • Hi Harsh,

    I am not sure of the answer to this, but I am discussing it with one of the engineers for this feature to see if I can provide a good answer for you.

    Liam


    Sr. Program Manager, SQL Azure Strategy

    Monday, November 17, 2014 5:06 PM
    Moderator
  • Hi Harsh,

    Can you share the full URLs of the queries you were executing? Also, if you can capture the request-id HTTP response header values and share those as well, that would be great. You can send the details to me directly at bruce DOT johnston AT microsoft.

    Thanks,

    -Bruce

    Monday, November 17, 2014 5:16 PM
    Moderator
  • Hi Bruce

    I have sent you the details. Please check.

    Tuesday, November 18, 2014 5:03 AM
  • Hi Harsh,

    Thanks for the info.

    The reason why you see different counts depending on the query is very subtle. It is related to the fact that all search queries are distributed among the various shards that comprise your index, even if you have only one search unit. Facet queries in particular are executed by retrieving the top "count" terms from each shard, and then selecting the top "count" terms from the combined list from all shards. One consequence of this is that if "count" is less than the number of unique terms (as it probably is for cities when "count" defaults to 10), it is possible that some values are missing from the final result because they were not in the top "count" results for one or more shards.

    For example, imagine you have two shards, A and B. Shard A contains 99 documents with city='Greenville', while shard B has 73 such documents. Further, imagine that most other documents in shard A are spread across a variety of less-popular cities, while in shard B at least 10 cities other than 'Greenville' each have more than 73 documents. When executing facet=city on shard A, the 99 'Greenville' hits will appear in the top 10, but on shard B the 73 'Greenville' hits will not make the top 10 (because the other cities have more hits). In this case, the final count for 'Greenville' will be 99, which is inaccurate. Increasing the facet count to 1000 solves the problem because each shard can contribute more terms, so the 'Greenville' documents in shard B get counted as well. Adding a filter expression also works, by reducing the population of candidate documents from each shard that are included in the facet result.
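    To make that concrete, here is a small Python sketch of the per-shard merge described above. It illustrates the mechanism only; it is not Azure Search's actual implementation:

        from collections import Counter

        def facet_merge(shards, count):
            """Simulate distributed term faceting: take the top `count` terms
            from each shard, then merge and keep the global top `count`."""
            merged = Counter()
            for shard in shards:
                for term, hits in shard.most_common(count):
                    merged[term] += hits
            return merged.most_common(count)

        # Shard A: 99 'Greenville' documents plus many low-frequency cities.
        shard_a = Counter({"Greenville": 99})
        shard_a.update({"SmallTown%d" % i: 5 for i in range(50)})

        # Shard B: 73 'Greenville' documents, but 10 other cities each have more hits.
        shard_b = Counter({"Greenville": 73})
        shard_b.update({"BigCity%d" % i: 80 for i in range(10)})

        print(facet_merge([shard_a, shard_b], 10))    # Greenville: 99 (shard B's 73 never surface)
        print(facet_merge([shard_a, shard_b], 1000))  # Greenville: 172 (99 + 73, accurate)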

    Note that your index actually has more than 2 shards; I just used 2 to simplify the example, but hopefully you get the idea. Also note that increasing the 'count' parameter has a performance cost, so you should use it judiciously.

    I hope this helps. We'll improve our API documentation to clarify this behavior.

    -Bruce

    Tuesday, November 18, 2014 10:35 PM
    Moderator
  • Hi Bruce

    This puts me in a difficult position. It means that if I make a default facet query, it will give me misleading results. Say I am building an e-commerce site where faceted city values are shown on the left side, e.g.

    Greenville(99)

    New York(100)... so on

    If a user clicks the Greenville link, they would expect to see exactly 99 results, but because the filtered query on the city field returns 172 results (counted across the whole index), the counts won't match up and the application won't look reliable.

    And if the suggestion is to request a large facet count (say 1000) on every query, then, as you said, performance will suffer.

    I have been using Solr for many years to handle this problem, and it always gives correct facet counts whether or not I am using shards. I think the logic for computing facets is incorrect here.

    Please correct me if I am wrong, or let me know if there is a way to get accurate counts without hurting performance.

    Wednesday, November 19, 2014 4:45 AM
  • Hi Harsh,

    I agree the user experience would suffer with inaccurate counts. Fortunately there is a change we can make that will improve accuracy. Rather than ask each shard for "count" terms, we can ask each shard for more than "count" terms. This will increase the chances of all top terms making it into the final result. We can expose this as an additional parameter to the facet query option (let's call it "shard_count" for now), or pick a reasonable default value (although it is most effective when the value is based on the characteristics of the underlying data). I will file a bug to make this change.

    In the meantime, the only workaround to get better accuracy is to provide a larger value for "count". The impact this has on performance will depend on the value you choose, on your data size, and on the number of unique terms in the data. We recommend experimenting with different values for "count" and testing with data that will be typical of your production data set if possible.
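    For example, a client can over-request and trim before display. A minimal sketch against the REST API (the service name, index, and key below are hypothetical):

        import requests

        URL = "https://myservice.search.windows.net/indexes/products/docs"
        HEADERS = {"api-key": "<your-query-key>"}
        PARAMS = {
            "api-version": "2014-07-31-Preview",
            "search": "*",
            "facet": "city,count:1000",  # oversample to improve count accuracy
        }

        facets = requests.get(URL, params=PARAMS, headers=HEADERS).json()["@search.facets"]

        # Display only the top 10 city buckets; the extra buckets were requested
        # purely to make these 10 counts more accurate.
        for bucket in facets["city"][:10]:
            print(bucket["value"], bucket["count"])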

    Once we introduce "shard_count", you will be able to make it very large to dramatically increase accuracy without affecting the shape of your query results. You could even get 100% accuracy if you pick a value that is guaranteed to be greater than the number of unique terms. However, it is still worthwhile to experiment to see what kind of performance you get. If your data size will be small, a very large "count" or "shard_count" may be perfectly fine.

    It's important to understand that Azure Search is built on distributed search technology because we can't make any assumptions about data size. We could use a more deterministic approach to faceting like SOLR, but it would limit scalability and performance in the general case. Hence the tradeoff between performance and accuracy.

    I hope "shard_count" will help. Please let me know if you have more questions.

    -Bruce

    Thursday, November 20, 2014 2:39 AM
    Moderator
  • Hi Bruce

    It still feels like a hack to get somewhat accurate results, and I cannot even be sure how accurate they are, since that depends on the data I have and the shards I have.

    BTW, SOLR is also a distributed technology, but it provides accurate faceting regardless of shards or the facet value limit. This could be a real setback for Azure Search, because we really don't want either the user experience or performance to suffer.

    Please suggest if something more can be done in this direction.

    Thursday, November 20, 2014 4:27 AM
  • Hi Harsh,

    SOLR evaluates facets by making multiple round trips to the shards, which could have a performance hit for large data sets. It sounds like this was not the case for your data. (I'd be interested in any performance and data size metrics you have if you'd be willing to share.)

    In general there are limits to SOLR's scalability, which is one reason we elected not to use it as the underlying search engine for Azure Search (see this article for one example: http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/).

    How many documents and what incoming query rate are you expecting for your indexes? Depending on these numbers, it may be the case that you can get 100% accuracy for facets with acceptable performance (once we make the improvements I mentioned above).

    -Bruce

    Friday, November 21, 2014 2:25 AM
    Moderator
  • Hi Bruce

    Thanks for the information. I will read more about this as well. We are trying different options for our application and I will keep your insights in mind while doing that.

    Thanks for your help. Maybe I will discuss some of my thoughts later as well.

    Harsh

    Friday, November 21, 2014 4:17 AM
    I'm facing the same problem. shard_count would be very valuable. In my case, I have a wine database with about 5,000 brands among several tens of thousands of wines. I want to facet by brand, but I cannot get accurate brand facet results. It all makes sense as explained above.

    When is shard_count going to be live? As it stands, I would have to take some drastic measures: calculate the number of unique values across all my facet fields before uploading to Azure Search, then use that data to formulate my queries with the proper count parameter.
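    In case it helps others, that pre-upload cardinality check can be a single pass over the documents. A rough sketch (the field names are made up for illustration):

        from collections import defaultdict

        def facet_cardinalities(documents, fields):
            """Count distinct values per facetable field, to pick a safe
            facet count for queries against each field."""
            distinct = defaultdict(set)
            for doc in documents:
                for field in fields:
                    if doc.get(field) is not None:
                        distinct[field].add(doc[field])
            return {field: len(values) for field, values in distinct.items()}

        # e.g. facet_cardinalities(wines, ["brand", "region", "grape"])
        # might return {"brand": 5000, ...}; then query with facet=brand,count:5000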

    What really worries me, however, is prices. I facet by price interval, as many customer-facing retail sites do, so people can shop within a budget. However, prices are very diverse, and count does not work for range facets. I'm guessing shard_count won't apply either? If so, how can I even use range intervals reliably? Or are they not going to suffer from this problem? The good case is when a range interval shows the wrong count; at least the end user can filter further and get the right counts with another click. The worst case is when an interval doesn't show up at all because the price wasn't within the top count results, yet products within the range DO exist. That may mean a lost sale, and I doubt I can convince my product team to accept that.

    So while shard_count would at least help, ultimately I hope range faceting doesn't suffer from this, although I suspect it does. I have yet to verify it.
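    For reference, the range facets I mean are the ones built with the "values" option, along these lines (the price bands are illustrative):

        GET /indexes/wines/docs?search=*&facet=price,values:10|25|50|100&api-version=2014-07-31-Preview

    which buckets prices into less than 10, 10 to 25, 25 to 50, 50 to 100, and 100 or more.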


    Wednesday, December 3, 2014 2:12 AM
  • Hi Leonardo

    I am still experimenting with different count values, trying to find an optimal solution, but as the facet value count and the number of facet fields increase, performance starts to suffer.

    But I am hoping that, after playing with different numbers, we can arrive at a formula for the facet value count based on the number of items in the index.
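    One way to automate that experiment is to raise the facet count until the top buckets stop changing, along these lines (the endpoint, key, and field are hypothetical):

        import requests

        URL = "https://myservice.search.windows.net/indexes/myindex/docs"
        HEADERS = {"api-key": "<your-query-key>"}

        def top_facets(count, field="city", top=10):
            """Fetch the top facet buckets for `field`, oversampling with `count`."""
            params = {
                "api-version": "2014-07-31-Preview",
                "search": "*",
                "facet": "%s,count:%d" % (field, count),
            }
            body = requests.get(URL, params=params, headers=HEADERS).json()
            return body["@search.facets"][field][:top]

        previous = None
        for count in (10, 100, 1000, 5000):
            current = top_facets(count)
            if current == previous:
                # Buckets stopped changing; counts are likely accurate at this size.
                print("stable from count", count)
                break
            previous = current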

    Friday, December 5, 2014 9:31 AM
  • Hi Leonardo,

    I'm not aware of any accuracy limitations for range or interval facets. There shouldn't be a problem for those facet types because their results must account for all query hits already. The reason this problem exists for term facets is that not all terms are included in the result (unless "count" is set high enough), but range and interval facets are necessarily computed across all matching documents.

    As for "shard_count", we are still discussing when to do this work, specifically whether or not it will make our General Availability milestone. I'll update this thread when I have news.

    -Bruce

    Monday, December 8, 2014 1:49 AM
    Moderator
  • Thanks Bruce,

    The fact that numeric range facets are accurate is great news. Basically, it was the one thing that would have kept me from pushing this onto our roadmap for 2015.

    I'm really hoping shard_count will make your GA. My workaround relies on keeping data statistics up to date on my side and then making sure my count is set to a good value. Of course, receiving facet lists that are 5,000+ items long is not ideal, particularly when I will hide all but the top 10. We ended up creating a wrapper service over your API that removes the excess, since our mobile clients have limited bandwidth and the facet objects in JSON were simply huge. But at least now we have accuracy, and my QA folks are going to be happy.

    Monday, December 8, 2014 10:44 PM
    Thanks to everyone for the comments here. One thing that would be really helpful is if each of you could post your vote for this feature on our UserVoice page. We use that page heavily as a major source for feature prioritization.

    I am not able to commit to having this for GA, but this feedback is certainly helpful in allowing us to best prioritize this feature.

    Liam


    Sr. Program Manager, SQL Azure Strategy

    Tuesday, December 9, 2014 12:23 AM
    Moderator