Custom Word breaker deliverables


  • Hello everyone,

    I assume we may need to develop a custom word breaker for English as part of our project work.

    Can somebody explain in detail what the inputs and deliverables of a custom word breaker are?

    Monday, February 27, 2012 10:46 AM


All replies

  • Here is where you should start.

    looking for a book on SQL Server 2008 Administration? looking for a book on SQL Server 2008 Full-Text Search?

    • Marked as answer by KJian_ Monday, March 05, 2012 6:38 AM
    Monday, February 27, 2012 2:13 PM
  • Hi Hilary,

    I understand that people go for a custom word breaker in the following situations:

    1. When there is no word breaker for the specific language.

    2. When the existing word breaker for the language does not address the project's needs (for example, special treatment of special characters).

    Unfortunately, we are in a situation where the messages stored in our tables do not follow the rules of natural languages, and data in multiple languages is stored in the same column.

    Logically, I would like to call this an "industrial language", on a par with English, German, Spanish, and Japanese.

    Below are my questions.

    1. Is it true that a custom word breaker can be developed only for a specific language that has an LCID?

    2. Is it possible to develop a custom word breaker/stemmer for the "industrial language" mentioned above?

    3. If yes, how are LCIDs for new languages created/registered?

    I hope I have conveyed the problem properly.

    Thanks & Regards



    Thursday, March 08, 2012 7:59 AM
  • Hello Hilary,

    Can you please provide your views on this concern?

    Thanks & Regards


    Wednesday, March 14, 2012 12:06 PM
    1) Yes.

    2) The problem is how you are going to do language detection so that language-specific word breaker rules can be applied to "industrial" or multilanguage/blended content. You are free to tag a language with an LCID and have a word breaker written for that language, but your word breaker is going to be very complex. Most people break the content into different columns and apply a different word breaker to each column, i.e. one column for German, one for English, one for Japanese, etc.

    3) You need to open a support incident with Microsoft for guidance on how to do this.
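
    A rough sketch of the column-splitting idea from point 2, written in Python rather than as SQL Server code. It assumes the language of each fragment is already known, and it routes each fragment to a per-language column that could then be indexed with the matching word breaker. The column names (body_en, body_de, body_ja) and the helper route_fragments are hypothetical; the LCIDs are the standard ones for English (US), German, and Japanese.

```python
# Standard Windows LCIDs for the three languages named above.
LCID_BY_LANG = {"en": 1033, "de": 1031, "ja": 1041}

# Hypothetical column names, one per language-specific full-text column.
COLUMN_BY_LCID = {1033: "body_en", 1031: "body_de", 1041: "body_ja"}


def route_fragments(fragments):
    """Group (language tag, text) pairs into one text value per column."""
    columns = {name: [] for name in COLUMN_BY_LCID.values()}
    for lang_tag, text in fragments:
        # Simplification: regional variants such as "en-us" collapse to "en".
        primary = lang_tag.lower().split("-")[0]
        lcid = LCID_BY_LANG.get(primary)
        if lcid is None:
            raise ValueError(f"no column configured for language {lang_tag!r}")
        columns[COLUMN_BY_LCID[lcid]].append(text.strip())
    # Join the fragments of each language into a single column value.
    return {name: " ".join(parts) for name, parts in columns.items()}


row = route_fragments([
    ("en-us", "Yukon full-text search"),
    ("de", "Yukon full-text search (german equivalent)"),
])
print(row)
```

    Raising an error for an unknown language tag is a design choice in this sketch: it makes content that no configured word breaker covers visible instead of silently dropping it.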

    Wednesday, March 14, 2012 12:48 PM
  • Hi Hilary,

    Thank you very much for your response.

    Below is the sample format of the language-blended XML data stored in a single column.

    Based on the user-specified LCID and search query, the appropriate language content has to be retrieved.

    What would be the best approach for implementing FTS with minimal effort?

    Regarding answer 2, do you mean that I should simply register the new custom word breaker (for the industrial language) under one of the existing languages and select that language's word breaker when creating the full-text index? Is my understanding correct?

    Thursday, March 15, 2012 5:27 AM
  • Hi Hilary,

    Could you please advise on the above issue?



    Monday, March 19, 2012 6:04 AM
  • That will not work. Here is an example of how to set it. You need to use the xml:lang attribute.

    <docENUTitle xml:lang="en-us">
    Yukon full-text search
    </docENUTitle>
    <docDEUTitle xml:lang="de">
    Yukon full-text search (german equivalent)
    </docDEUTitle>
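
    To check what language tags a document like this actually carries, a minimal Python sketch can list the text found under each xml:lang value. This is only an illustration and says nothing about how SQL Server processes the column. The <doc> root element, the closing tags, and the helper text_by_language are assumptions added so the sample parses.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumed well-formed version of the sample: a hypothetical <doc> root
# and closing tags are added so the parser accepts it.
SAMPLE = """<doc>
  <docENUTitle xml:lang="en-us">Yukon full-text search</docENUTitle>
  <docDEUTitle xml:lang="de">Yukon full-text search (german equivalent)</docDEUTitle>
</doc>"""


def text_by_language(xml_text):
    """Return {language tag: [text, ...]} for elements that carry xml:lang.

    Only the element's own leading text is read; language inheritance by
    child elements is not handled in this sketch.
    """
    result = {}
    for element in ET.fromstring(xml_text).iter():
        lang = element.get(XML_LANG)
        if lang is not None and element.text and element.text.strip():
            result.setdefault(lang, []).append(element.text.strip())
    return result


print(text_by_language(SAMPLE))
```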

    Tuesday, March 20, 2012 2:18 PM
  • Hello Hilary,

    Thanks for your response.

    Assume I have the XML document with the xml:lang attribute set for the respective languages.

    1. Which word breaker do I need to choose when creating the full-text index?

    2. How should the FTS query be written, given that the search term can be in any language?

    I really appreciate your time for answering these questions.



    Wednesday, March 21, 2012 4:00 AM