Word analysis with Cosmos DB RRS feed

  • Question

  • I need to extract specific types of phrases from contracts.  A specific type of phrase can be written a variety of different ways, however, there will be a common set of rules to identify the phrase.  Currently we have some data scientists creating regular expressions in python which does a pretty good job but its slow and clunky and not perfect.  I attempted replacing that process using SQL Server Full Text Search which did a pretty good job in identifying the paragraphs that contained the phrases but FTS was no help in extracted the phrase text itself.  So I'm thinking of loading the contract numbers (contract IDs) as nodes.  then the paragraph IDs of each paragraph as nodes under the contract nodes, and lastly each word of the paragraphs as a node under the paragraph node.  This would allow me to query a string of words similar to FTS:

    WHERE CONTAINS(DocPr_Text, 'NEAR((assign,whole), 7)')

    -give me the words starting from "assign" to "whole" with a word count no greater than 7 in-between.

    Of course this is simple logic but it will get much more complex.  for example, add to that logic:

    -and include words after "whole" to the nearest number if the number is no greater that 3 words away.

    -and the number can be written as "25" or "twenty five".

    SQL server doesnt know that "twenty five" is a number but in Cosmos DB we could assign a property to nodes like this such as "IsNumber = True".

    I can envision adding a lot of properties to words to embellish them for richer searches including synonyms.

    QUESTION:  Am I on the right track here and does this make sense?  are other people doing this?  I've been Googling this and haven't found anything - but maybe In not searching correctly.

    Thank you for any advice you can offer.

    Thursday, September 19, 2019 7:17 AM

All replies

  • If I understood it correctly, you have text that needs to be parsed and extracted for words based on a look up criteria?

    I had trouble understanding this part - "So I'm thinking of loading the contract numbers (contract IDs) as nodes.  then the paragraph IDs of each paragraph as nodes under the contract nodes, and lastly each word of the paragraphs as a node under the paragraph node"

    I remember solving a similar issue a using Data Quality Services. We had to look for anomalies in huge text files that were loaded into SQL dataware house tables. These were put through DQS where we could specify business rules to identify keywords and the train the model accordingly.

    Could you please elaborate your use case.

    Friday, September 20, 2019 12:38 PM
  • Hi

    The last reply has been proposed as an answer. Please mark it as an answer if this has helped you or get back to us with more questions.

    Navtej S

    Tuesday, September 24, 2019 6:32 PM
  • Thanks Kaylan.

    We are doing machine leaning on contracts to identify specific phrases.  these phrases can be a little complex and on one project there were close to 1000 phrase types that were categorized and about 600k total phrases - but ML was not used on that project.  it was all done manually.

    Currently I'm developing prototype where we have data scientists writing regular expressions in python to pre-annotate the text in prep for machine learning.  but this is a painfully slow process because of the many complex regex expressions requested to the data scientists.  then after they run those and analyst needs to review the contracts to see if things were annotated correctly of see if anything was missed (so much for automation...).

    My concept is to design an app that would provide a lot of query options to users who are not developers so they could plug in various parameters and choose some logic options.  the result would be a list of phrases extracted from the text.  This way you can see the results of your search instantly and refine the search as needed, an only need minimal support from the data scientist as needed for the pre-annotation process.

    I tried this with SQL FTS and had some success, but ultimately it falls short.

    My hopes are that a graph database can do most of what we need to identity these phrases by special queries and traversing the graph.

    also, sometimes things are written like "twenty five" or "25".  My thought is even though initially "twenty five" is 2 words (and therefore 2 vertices), we could run a process that would combine these into 1 vertex and assign 2 properties: IsNumeric-true, NumericValue-25.  this is also something you cant do nicely in sql server.  

    One BIG problem is 6000 contracts have about 800,000 paragraphs and maybe 200,000,000+ words.  thats a lot of vertices and edges and this will probably be very costly.

    Does this help you understand our case?

    Tuesday, September 24, 2019 7:57 PM
  • Thank you for elaborating. This is indeed a complex scenario.

    I would recommend posting this on Stackoverflow too to reach a wider audience.

    Thursday, September 26, 2019 6:28 AM