en.microsoft analyzer settings to enable creation of a tweaked custom analyser?

  • Question

  • Hi

    We are currently using the en.microsoft analyser, and it's going great at the moment.

    Is it possible to get the settings for this analyser (maybe the JSON needed to recreate it exactly), so I can recreate it as a custom analyzer with different stopwords and term lengths? I have checked online and can't find it, but if anyone knows where it is, that would be great.

    The reason for this is:

    When I use Postman on my index

    https://*************.windows.net/indexes/***********/analyze?api-version=2017-11-11

    and POST in

    {
      "text": "Space X running research & development, or R & D",
      "analyzer": "en.microsoft"
    }

    I get

    {
        "@odata.context": "https://focus-search.search.windows.net/$metadata#Microsoft.Azure.Search.V2017_11_11.AnalyzeResult",
        "tokens": [
            {
                "token": "space",
                "startOffset": 0,
                "endOffset": 5,
                "position": 0
            },
            {
                "token": "run",
                "startOffset": 8,
                "endOffset": 15,
                "position": 2
            },
            {
                "token": "running",
                "startOffset": 8,
                "endOffset": 15,
                "position": 2
            },
            {
                "token": "research",
                "startOffset": 16,
                "endOffset": 24,
                "position": 3
            },
            {
                "token": "development",
                "startOffset": 27,
                "endOffset": 38,
                "position": 4
            }
        ]
    }

    I would just like to change the minimum term length and have the option of changing the stopwords, while keeping all other settings of the analyzer the same, to enable us to search for "Space X" and "R & D" (I know it should be R&D, but OCR is not perfect).

    Thanks for your assistance with this.

    Regards

    Dave

    Monday, September 24, 2018 1:41 PM

All replies

  • I am unable to locate the answer you are looking for off the top of my head. I have reached out internally to see if we can maybe get the settings or JSON file for you. I hope to have an update either way shortly.
    Thursday, September 27, 2018 2:13 AM
    Moderator
  • Hi Dave,

    Sorry for the late response. You can use the following custom analyzer definition to specify your own custom list of stopwords and to mark specific terms as keywords:

    {
        "@odata.context": "https://testservice.search.windows.net/$metadata#indexes/$entity",
        "@odata.etag": "\"......\"",
        "name": "testindex",
        "fields": [
            {
                "name": "id",
                "type": "Edm.String",
                "searchable": false,
                "filterable": false,
                "retrievable": true,
                "sortable": false,
                "facetable": false,
                "key": true,
                "indexAnalyzer": null,
                "searchAnalyzer": null,
                "analyzer": null,
                "synonymMaps": []
            }
        ],
        "scoringProfiles": [],
        "defaultScoringProfile": "",
        "corsOptions": null,
        "suggesters": [],
        "analyzers": [
            {
                "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
                "name": "my_analyzer",
                "tokenizer": "my_tokenizer",
                "tokenFilters": [
                    "my_stopwords_filter",
    "is_keywords_filter"
                ],
                "charFilters": []
            }
        ],
        "tokenizers": [
            {
                "@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
                "name": "my_tokenizer",
                "maxTokenLength": 20,
                "isSearchTokenizer": false,
                "language": "english"
            }
        ],
        "tokenFilters": [
            {
                "@odata.type": "#Microsoft.Azure.Search.StopwordsTokenFilter",
                "name": "my_stopwords_filter",
                "stopwords": [
                    "stopword1",
                    "stopword2"
                ],
                "stopwordsList": null,
                "ignoreCase": false,
                "removeTrailing": true
            },
            {
                "@odata.type": "#Microsoft.Azure.Search.KeywordMarkerTokenFilter",
                "name": "is_keywords_filter",
                "keywords": [
                    "running",
                    "word2"
                ],
                "ignoreCase": false
            }
        ],
        "charFilters": []
    }
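
    Note that in the sample above the analyzer is not assigned to any field. To put it to use, a searchable field in the same index definition would reference it by name, along these lines (the "content" field name is just a placeholder, not something from your index):

    {
        "name": "content",
        "type": "Edm.String",
        "searchable": true,
        "analyzer": "my_analyzer"
    }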

    The MicrosoftLanguageStemmingTokenizer has an option to set maxTokenLength, which specifies the maximum length of a term during tokenization. If you want to remove tokens that are shorter or longer than a specified length range, you can use the LengthTokenFilter instead. You can check the other types of token filters available here: https://docs.microsoft.com/en-us/rest/api/searchservice/custom-analyzers-in-azure-search#TokenFilters
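
    For example, a LengthTokenFilter could be defined alongside the other token filters and then added to the analyzer's "tokenFilters" list; the name and the min/max values below are just placeholders you would adjust to your requirements:

    {
        "@odata.type": "#Microsoft.Azure.Search.LengthTokenFilter",
        "name": "my_length_filter",
        "min": 2,
        "max": 300
    }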

    If I am misinterpreting your ask about the min term length, let us know. 
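
    Once the index is created with the custom analyzer, you should be able to verify the token output with the same Analyze API call you used earlier, just pointing it at the custom analyzer by name:

    {
        "text": "Space X running research & development, or R & D",
        "analyzer": "my_analyzer"
    }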

    Saturday, October 13, 2018 11:33 PM