Full Text Search - Retrieving the synonyms used by a FORMSOF Thesaurus query
-
Monday, July 30, 2012 11:48 PM
I have a C#.NET application that allows users to query a collection of quotes by using keyword searches. One of the requirements for the system is to highlight in each quote the keyword that was searched for. I have accomplished this using a CONTAINS query to get the quote collection and a regular expression to highlight the keywords.
I have a similar requirement to allow users to expand their search by choosing to include synonyms in the search results. I have implemented this using a FORMS OF Thesaurus query and a Thesaurus file that I created using WordNET. The issue I am having though is how to highlight the words in the quote collection that match not only the keyword that was searched for but the synonyms of the keyword that were used by the Full Text Search engine.
For example, say that the user searched for the word 'false' and the following expansion set exists in the Thesaurus file:
<expansion>
<sub>false</sub>
<sub>wrong</sub>
</expansion>I would like to return the following quotes and highlight the word 'false' and its synonyms ('wrong') such as the following:
"All generalizations are false, including this one."
"Truth is mighty and will prevail. There is nothing wrong with this, except that it ain't so."
Thus, my question is how can I figure out based on what is returned from a FORMSOF Thesaurus query, what the Full Text Search engine used as synonyms for the keyword? Or, how can I query the Thesaurus file to get a particular expansion set? Any suggestions or possible workarounds would be greatly appreciated.
Thanks!
Leo Wantz
All Replies
-
Thursday, August 02, 2012 9:19 AMModerator
Hi Leo,
Thank you for your question.
I am trying to involve someone more familiar with this topic for a further look at this issue. Sometime delay might be expected from the job transferring. Your patience is greatly appreciated.
Thank you for your understanding and support.Thanks,
MaggiePlease remember to mark the replies as answers if they help and unmark them if they provide no help. This can be beneficial to other community members reading the thread.
-
Thursday, August 09, 2012 9:05 PMAnswerer
Hello,
Thank you for your post. This is a common question that current versions of full-text are not able to implement. You may have seen this other post where other customers are also asking for this functionality? - http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/35717c10-dd62-487e-a89a-db8b5c6e6ebf/
Regrettably, there is no current built-in way to return the word found or its location (both of which would be helpful for client applications which want to highlight the found word.) Having this functionality built-in could possibly lead to getting everything you need from a single pass. Without this, it will likely take your application multiple passes of the data and have built-in logic in the application.
You can somewhat get halfway using the Distributed Management Views (DMVs) for full-text that are included in SQL Server versions 2008, 2008 R2, and 2012. The DMV sys.dm_fts_index_keywords_by_document can be used to return all the words found when the document was indexed and it also tells you how many occurrences of the word are in the document. It just doesn't tell their location or their variations. You can also use sys.dm_fts_parser to return the words from a phrase to see how it is parsed.
The thesaurus file is just an XML file in the SQL Server folder that is per language. You can import it (or them if using multiple languages) and then query it to find the synonyms for any given word which you have added your own custimizations.
The challenge you will have is that not all words that are expanded are in the Thesaurus. For example, FormsOf is going to return all tenses of a word. Take the following example:
CREATE TABLE [dbo].[FTTest]( [id] [int] IDENTITY(1,1) NOT NULL, [Text] [text] NULL, PRIMARY KEY CLUSTERED ( [id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY] GO --NOTE: You may have to change the primary key reference CREATE FULLTEXT INDEX ON [dbo].[FTTest]( [Text] LANGUAGE [English]) KEY INDEX [PK__FTTest__3213E83F239E4DCF]ON ([FTTest], FILEGROUP [PRIMARY]) WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM) GO Insert into FTTest(Text) values ('I like to ride a bike.') Insert into FTTest(Text) values ('I rode the bus today.') Insert into FTTest(Text) values ('Did you see who was riding that motorcycle?') go select * from FTTest where contains(TEXT, ' FORMSOF (INFLECTIONAL, ride) '); RESULTS: ======== All 3 rows are returned and yet none of these expansions are in the actual thesuarus. You're application would have to know these are all the possible inflections of the word "ride" since it is not part of the Thesaurus. id Text 7 I like to ride a bike. 8 I rode the bus today. 9 Did you see who was riding that motorcycle?I have put in a request to the developers for full-text to be enhanced to return the word found and its location. It is possible that multiple variations of the word are found in the same row of data so it could be possible that the returned result may be multiple rows of information that would need to be handled by the application when highlighting the word(s) found in a row of data.
Wish I had better news.
Sincerely,
Rob Beene, MSFT
- Edited by Robert Be - MSFTMicrosoft Employee, Editor Thursday, August 09, 2012 9:09 PM
-
Friday, August 10, 2012 2:29 AM"Robert Be - MSFT" wrote in message news:6a15f751-1df3-4c17-be46-92b6984a4f36...
Hello,
Thank you for your post. This is a common question that current versions of full-text are not able to implement. You may have seen this other post where other customers are also asking for this functionality? - http://social.msdn.microsoft..com/Forums/en-US/sqldatabaseengine/thread/35717c10-dd62-487e-a89a-db8b5c6e6ebf/
Regrettably, there is no current built-in way to return the word found or its location (both of which would be helpful for client applications which want to highlight the found word.) Having this functionality built-in could possibly lead to getting everything you need from a single pass. Without this, it will likely take your application multiple passes of the data and have built-in logic in the application.
You can somewhat get halfway using the Distributed Management Views (DMVs) for full-text that are included in SQL Server versions 2008, 2008 R2, and 2012. The DMV sys.dm_fts_index_keywords_by_document can be used to return all the words found when the document was indexed and it also tells you how many occurrences of the word are in the document. It just doesn't tell their location or their variations. You can also use sys.dm_fts_parser to return the words from a phrase to see how it is parsed.
The thesaurus file is just an XML file in the SQL Server folder that is per language. You can import it (or them if using multiple languages) and then query it to find the synonyms for any given word which you have added your own custimizations.
The challenge you will have is that not all words that are expanded are in the Thesaurus. For example, FormsOf is going to return all tenses of a word. Take the following example:
CREATE TABLE [dbo].[FTTest]( [id] [int] IDENTITY(1,1) NOT NULL, [Text] [text] NULL, PRIMARY KEY CLUSTERED ( [id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY] GO --NOTE: You may have to change the primary key reference CREATE FULLTEXT INDEX ON [dbo].[FTTest]( [Text] LANGUAGE [English]) KEY INDEX [PK__FTTest__3213E83F239E4DCF]ON ([FTTest], FILEGROUP [PRIMARY]) WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM) GO Insert into FTTest(Text) values ('I like to ride a bike.') Insert into FTTest(Text) values ('I rode the bus today.') Insert into FTTest(Text) values ('Did you see who was riding that motorcycle?') go select * from FTTest where contains(TEXT, ' FORMSOF (INFLECTIONAL, ride) '); RESULTS: ======== All 3 rows are returned and yet none of these expansions are in the actual thesuarus. You're application would have to know these are all the possible inflections of the word "ride" since it is not part of the Thesaurus. id Text 7 I like to ride a bike. 8 I rode the bus today. 9 Did you see who was riding that motorcycle?I have put in a request to the developers for full-text to be enhanced to return the word found and its location. It is possible that multiple variations of the word are found in the same row of data so it could be possible that the returned result may be multiple rows of information that would need to be handled by the application when highlighting the word(s) found in a row of data.
Wish I had better news.
Sincerely,
Rob Beene, MSFT
-
Friday, August 10, 2012 2:31 AMIn future you might get a better response (although the previous response is fine) by posting full-text questions directly to the SQL Server full-text-specific forum here: http://social.technet.microsoft.com/Forums/en-US/sqlsearch/threadsCheers, Bob
-
Monday, August 13, 2012 11:10 PM
Bob
Thank you for your reply. As you said, as a workaround I could query the Thesaurus file to get an expansion set for a given keyword and use these words in my highlighting routine.
I am confused about how SQL Server is getting each tense of a word though. Is it possible to get to that information using SQL? For example, a query that passes in "ride" and returns "rode" and "riding."
Thanks,
Leo
-
Thursday, August 30, 2012 5:53 PMAnswerer
Hey Leo,
My apologies for not following up sooner. You can use sys.dm_fts_parser with FORMSOF({FREETEXT, THESAURUS}, {TERM}) which can then be parsed and used
in the highlighting routine. This would allow you to get all tenses of the word that full-text would search for. You would still have to determine which of these forms was found in a particular result set and there could potentially be multiple tenses found in the same row.Hope this helps.
Sincerely,
Rob Beene, MSFT
- Proposed As Answer by JamesYiModerator Thursday, September 06, 2012 2:06 AM

