Highlights == null in search results??

  • Question

  • I'm having an issue with Azure Search where Highlights is null in some cases, and I can't figure out why. The highlight fields and pre/post tags are specified, and most results come back with a valid array of Highlights. However, some results have Highlights == null even though their content contains the search term and the field containing it is requested via the HighlightFields property on the search parameters object. Can anyone shed any light on this?

    Thanks!

    Thursday, February 23, 2017 4:29 PM

All replies

  • Are you using the Azure Search .NET SDK? If so, please open an issue in the Azure .NET SDK GitHub repo with enough information for us to repro the problem. Make sure to prefix the title of the issue with "Search SDK: ".

    Thursday, February 23, 2017 4:39 PM
    Moderator
  • I am using the .NET SDK and I will do that. Thanks
    Thursday, February 23, 2017 5:38 PM
  • Please ensure that the fields you are searching on (the searchFields= parameter in the REST API) match the fields you are highlighting (the highlight= parameter in the REST API). If, say, you are searching on two fields and highlighting only one, documents whose only matches are in the non-highlighted field will come back with no highlights.

    Another possibility is that you are issuing a very broad wildcard search query (for example, search=a* or search=/.*/). For highlighting, a wildcard search term is first expanded to a limited number of terms, and only those terms are highlighted. If a returned document matches the wildcard query but its matched term is not in the expanded set used for highlighting, the highlight will be missing for that document.
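
    To make the second scenario concrete, here is a toy model in Python. It is not how the service is implemented: the cap value, the alphabetical ordering of expanded terms, and all function names are assumptions made up for illustration. It only shows how a capped term set for highlighting can leave a matching document without a highlight.

```python
# Toy model only. EXPANSION_LIMIT and the alphabetical ordering are hypothetical;
# the thread does not state the service's real limit or ordering.
EXPANSION_LIMIT = 3


def expand_prefix(prefix, index_terms, limit=None):
    """Return index terms starting with `prefix`, optionally capped at `limit`."""
    matches = sorted(t for t in index_terms if t.startswith(prefix))
    return matches if limit is None else matches[:limit]


def search_and_highlight(prefix, docs):
    """Match with the full expansion, but highlight with the capped expansion."""
    index_terms = {word for text in docs.values() for word in text.lower().split()}
    match_terms = set(expand_prefix(prefix, index_terms))
    highlight_terms = set(expand_prefix(prefix, index_terms, EXPANSION_LIMIT))
    results = {}
    for doc_id, text in docs.items():
        words = text.lower().split()
        if not match_terms.intersection(words):
            continue  # document does not match the query at all
        marked = [f"<em>{w}</em>" if w in highlight_terms else w for w in words]
        # A matching document gets None when none of its terms made the capped set.
        has_highlight = bool(highlight_terms.intersection(words))
        results[doc_id] = " ".join(marked) if has_highlight else None
    return results


docs = {
    "1": "able archer",
    "2": "amber alert",
    "3": "azure search",
}
results = search_and_highlight("a", docs)
for doc_id, snippet in results.items():
    print(doc_id, snippet)
```

    In this sketch all three documents match "a*", but document 3 only matches through "azure", which falls outside the three-term cap, so it is returned with no highlight.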

    Hope this helps. Please let me know if you have any further questions.

    Nate

    Thursday, February 23, 2017 6:11 PM
  • Thanks Nate. In this case, there is only one field which contains the content, and that field is being specified in the HighlightFields property of the SearchParameters object (.NET SDK). If it weren't, I wouldn't have some of the results for the same search term being returned with Highlights and some not. So, I think we're good on that.

    On the second part of your message, I don't understand why the highlighting code wouldn't use the same expanded search terms as the standard search. What is the point in that? In our search, we currently replace spaces ("ORs") with "+" ("ANDs"), but that doesn't come into play here because I'm only searching on a few letters, with no spaces. We also append "*" to the search term, but again, I'm not sure why highlighting would be treated differently from the search itself. I could be even more understanding of this behavior if what I'm typing didn't literally appear in the string exactly as I typed it. I suppose I can search the full content of the returned document, find the first instance, and do my own highlighting, but that reduces the value of the highlighting feature to almost nothing for me if it isn't going to work 100% of the time and I have to write code to assist when it doesn't...
    Thursday, February 23, 2017 6:40 PM
  • Adding the link to the GitHub issue here for posterity: https://github.com/Azure/azure-sdk-for-net/issues/2853
    Thursday, February 23, 2017 8:10 PM
    Moderator
  • Thanks Trevor. If you are issuing prefix search queries with only a few letters most of the time, I recommend using a custom analyzer with EdgeNGramTokenFilter to create a search index better tailored to that experience. Wildcard search queries can be very expansive. All wildcard search queries (prefix, regex, fuzzy) are rewritten internally with the matching terms in the search index, and if the matching criteria are very broad (one or two letters with *), the set of expanded search terms can grow to many (> 1000) terms. We apply heuristics to such expensive queries and rewrite them with predicates that are more performant in broader searches. Highlighting is an independent post-process that currently works by expanding to a term set with an internal limit, which can cause a discrepancy for broader searches. We are aware of this issue and are actively working to address it. I will update the thread once we have a fix.

    Now, going back to the recommendation of a custom analyzer with EdgeNGramTokenFilter: given the term "hello", EdgeNGramTokenFilter produces the following tokens, which are prefixes of the input.

    <h> <he> <hel> <hell> <hello>

    As the prefix tokens themselves are now stored in the search index, your searches no longer need to go through the expensive rewriting process. You can directly issue a term query "h" instead of the prefix search query "h*" and the query will find the document. The approach does take more storage, but I presume that won't be an issue since you are searching on one field. The search will be much more performant and won't have the highlighting issue.
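
    As a rough illustration of the idea (a sketch in Python, not the service's analyzer), the helper below produces the same leading-edge prefixes and stores them in a tiny in-memory inverted index. The function name, the default gram bounds of 1 and 5, and the dictionary-based index are assumptions for the example only.

```python
def edge_ngrams(token, min_gram=1, max_gram=5):
    """Return the leading-edge n-grams of `token`, from min_gram up to max_gram characters."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


prefixes = edge_ngrams("hello")
print(prefixes)  # ['h', 'he', 'hel', 'hell', 'hello']

# Hypothetical in-memory inverted index: each prefix maps to the ids of documents containing it.
index = {}
for doc_id, text in {"1": "hello world"}.items():
    for word in text.lower().split():
        for gram in edge_ngrams(word):
            index.setdefault(gram, set()).add(doc_id)

# A plain term lookup for "h" finds the document; no wildcard expansion is involved.
hits = index.get("h", set())
print(hits)  # {'1'}
```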

    Please take a look at the documentation below for more information.

    https://docs.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search

    Hope this helps.

    Nate

    Thursday, February 23, 2017 8:57 PM
  • Thanks Nate. At the moment we have two indexes: one of them has a single searchable field (but that field can contain a fair amount of data) and the other has two searchable fields. When our users perform a search, we search both indexes simultaneously (using parallel foreach). However, this is just the beginning of our search plans. We will be adding another 6 - 8 indexes in the foreseeable future, which may have anywhere from 1 - 5 searchable fields. Knowing that, does this approach still sound like a good fit? Will we need to create a separate field in every index for each searchable field in that index to hold the tokens? ...AND if the search term is found in the token fields, isn't the highlight match going to return the token it found in the added token field and not the context of the word in the original field with the highlighting?

    I read the article you mentioned above, but it wasn't clear to me how to implement it and I haven't actually found clear information anywhere on exactly how to set this up. The EdgeNGram analyzer is not available via the UI when creating an index in Azure, so it sounds like this will require programmatically creating the index again?

    Thanks
    Friday, February 24, 2017 1:54 PM
  • Yes, I think the approach will be a good fit. Below is a slightly modified version of it. In the index schema I have two fields: one analyzed with the standard analyzer and the other containing prefixes of up to 5 chars. The same data is uploaded to both fields. Below is the schema.

    PUT : http://[service_name].search.windows.net/indexes/custom?api-version=2016-09-01

    {
       "name":"custom",
       "fields":[
          {
             "name":"id",
             "type":"Edm.String",
             "searchable":false,
             "filterable":true,
             "retrievable":true,
             "sortable":false,
             "facetable":true,
             "key":true,
             "indexAnalyzer":null,
             "searchAnalyzer":null,
             "analyzer":null
          },
          {
             "name":"string",
             "type":"Edm.String",
             "searchable":true,
             "filterable":true,
             "retrievable":true,
             "sortable":false,
             "facetable":true,
             "key":false,
             "analyzer":"standard"
          },
          {
             "name":"prefix",
             "type":"Edm.String",
             "searchable":true,
             "filterable":true,
             "retrievable":true,
             "sortable":false,
             "facetable":true,
             "key":false,
             "indexAnalyzer":null,
             "searchAnalyzer":null,
             "analyzer":"custom"
          }
       ],
       "analyzers":[
          {
             "name":"custom",
             "@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
             "tokenizer":"standard",
             "tokenFilters":[
                "lowercase",
                "edge5"
             ]
          }
       ],
       "tokenFilters":[
          {
             "name":"edge5",
             "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilter",
             "minGram":1,
             "maxGram":5
          }
       ]
    }

    Say you are indexing the word "communications" into both fields. In the string field, the standard analyzer turns it into a single token <communications>. In the prefix field, the custom analyzer turns it into <c>, <co>, <com>, <comm>, <commu>.
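
    Since both fields receive the same text, the upload body simply repeats the value. Below is a minimal Python sketch that builds such a body, assuming it is sent to the index's document upload endpoint (docs/index) with the usual "value" / "@search.action" batch shape. The helper name and the sample record are made up for illustration.

```python
import json


def build_upload_batch(records):
    """Build an indexing batch that copies each text value into both searchable fields."""
    actions = []
    for doc_id, text in records:
        actions.append({
            "@search.action": "upload",
            "id": doc_id,
            "string": text,  # analyzed by the standard analyzer
            "prefix": text,  # analyzed by the custom edge n-gram analyzer
        })
    return {"value": actions}


batch = build_upload_batch([("1", "communications")])
body = json.dumps(batch, indent=2)
print(body)
```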

    At query time, you decide which field to search by looking at the number of chars in the user input. If the input is longer than 5 chars, you can do a prefix search (with *) against the string field, because the prefix is long enough to keep the search relatively narrow. If the input is 5 chars or fewer, the search is issued as a plain term query against the prefix field, which holds the edge n-grams.

    If input > 5 chars:

    GET: http://[service_name].search.windows.net/indexes/custom/docs?search=string:communica*&api-version=2016-09-01&queryType=full&highlight=string

    If input <= 5 chars:

    GET: http://[service_name].search.windows.net/indexes/custom/docs?search=prefix:comm&api-version=2016-09-01&queryType=full&highlight=prefix

    Highlighting returns matches in the context of the data you uploaded, not in the context of the internally analyzed tokens, so it should return hits correctly in either case.
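
    The length check could be wrapped in a small client-side helper. The Python sketch below is only one way to do it: the function name is hypothetical, it assumes a single-word input, it does not escape Lucene special characters, and it uses https for the request URL.

```python
from urllib.parse import parse_qs, urlencode, urlparse

API_VERSION = "2016-09-01"
MAX_GRAM = 5  # keep in sync with maxGram of the edge5 token filter


def build_search_url(service_name, user_input):
    """Pick the field to query from the input length and return the request URL."""
    term = user_input.strip()
    if len(term) > MAX_GRAM:
        # Long input: prefix (wildcard) query against the standard-analyzed field.
        field, query = "string", f"string:{term}*"
    else:
        # Short input: plain term query against the edge n-gram field.
        field, query = "prefix", f"prefix:{term}"
    params = urlencode({
        "search": query,
        "queryType": "full",
        "highlight": field,
        "api-version": API_VERSION,
    })
    return f"https://{service_name}.search.windows.net/indexes/custom/docs?{params}"


long_url = build_search_url("myservice", "communica")
short_url = build_search_url("myservice", "comm")
print(parse_qs(urlparse(long_url).query)["search"])   # ['string:communica*']
print(parse_qs(urlparse(short_url).query)["search"])  # ['prefix:comm']
```

    Note that urlencode percent-encodes ":" and "*" in the query string; the values decode back to the same search expressions shown above.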

    Let me know if you have any additional questions.

    Nate

    Friday, February 24, 2017 10:28 PM
  • Note that all the REST API requests in Nate's post above can also be done using the .NET SDK. We don't have samples of how to configure custom analyzers yet, but there are unit tests on GitHub that might help: https://github.com/Azure/azure-sdk-for-net/blob/AutoRest/src/Search/Search.Tests/Tests/CustomAnalyzerTests.cs
    Friday, February 24, 2017 10:38 PM
    Moderator
  • Thanks Nate / Bruce. I appreciate the help. We're going to weigh our options and make a decision soon. In order to help us make that decision, is improving the match highlighting on the radar as something that will be dealt with in the near future?
    Tuesday, February 28, 2017 12:38 PM
  • I think this will be my last question on this and I appreciate your help:

    In the search result, I need to be able to identify which field the search term was found in and also show the highlighted text in the context of the main searchable content. Initially, I was going to use the Highlights for this, but that is no longer possible or at least not "out of the box".

    If I implement the EdgeNGram with the additional prefix field, won't the highlights just start returning matches on the prefix field instead of the "real" preview content I want to show my users? Also, won't the field associated with the returned HighLight be the "prefix" field rather than the actual field containing the preview text that I want to show my user?

    Thanks!

    Trevor

    Tuesday, February 28, 2017 4:00 PM
  • Presuming you uploaded the same content to both fields, highlights from the prefix field will return the same snippet you would get from the original field. The only difference would be the field name.

    For example, docs?search=prefix:hel&highlight=prefix&queryType=full

    will return

       "@search.highlights": {
        "prefix": ["<em>hello</em> world"]
       }
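
    If the client needs the highlight reported under the original field name, the response can be remapped after the fact. This is a minimal Python sketch that assumes the result has already been parsed into a dict; the helper name and the prefix-to-string mapping are illustrative choices, not part of any SDK.

```python
def normalize_highlights(result, alias_to_field=None):
    """Return highlights keyed by the display field, or {} when the result has none."""
    alias_to_field = alias_to_field or {"prefix": "string"}
    highlights = result.get("@search.highlights") or {}
    normalized = {}
    for field, snippets in highlights.items():
        if not isinstance(snippets, list):
            continue  # skip anything that is not a list of snippets
        target = alias_to_field.get(field, field)
        normalized.setdefault(target, []).extend(snippets)
    return normalized


result = {
    "id": "1",
    "string": "hello world",
    "@search.highlights": {"prefix": ["<em>hello</em> world"]},
}
remapped = normalize_highlights(result)
print(remapped)  # {'string': ['<em>hello</em> world']}

# A result with null or missing highlights yields an empty dict instead of raising.
empty = normalize_highlights({"id": "2", "@search.highlights": None})
print(empty)  # {}
```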

    The approach I recommended involves using two fields, prefix and the original. A simpler alternative would be to use a single field with a sufficiently long prefix (maxGram) length. The single field uses the same custom analysis configuration with the following tokenFilter.

       "tokenFilters":[
          {
             "name":"edge5",
             "@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilter",
             "minGram":1,
             "maxGram":200
          }
       ]

    I initially did not recommend this approach because I was afraid it might bloat the index: for a long word like "communication", this analyzer configuration produces every prefix of the word, <c>, <co>, <com>, <comm>, <commu>, <commun>, ... <communication>, instead of stopping at length 5 as in the initial approach. But on reflection, this may be a good alternative to consider because it removes the complexity in querying and in post-processing the response.
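
    To get a feel for the size difference, the number of prefix tokens per word can be estimated with a little arithmetic. The Python sketch below is a back-of-the-envelope count only; the function names are made up, and it says nothing about how the service actually stores terms.

```python
def prefix_token_count(word, min_gram=1, max_gram=5):
    """Number of edge n-gram tokens produced for `word` with the given gram bounds."""
    upper = min(len(word), max_gram)
    return max(0, upper - min_gram + 1)


def prefix_char_total(word, min_gram=1, max_gram=5):
    """Total characters across all edge n-gram tokens produced for `word`."""
    upper = min(len(word), max_gram)
    return sum(range(min_gram, upper + 1))


word = "communication"
short_tokens = prefix_token_count(word, max_gram=5)
long_tokens = prefix_token_count(word, max_gram=200)
print(short_tokens, long_tokens)  # 5 13
print(prefix_char_total(word, max_gram=5), prefix_char_total(word, max_gram=200))  # 15 91
```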

    Questions are always welcome. Hope this helps.

    Thanks,

    Nate

    Wednesday, March 1, 2017 10:26 PM