I'm trying out DQS and cannot for the life of me figure out how to overcome my following problem and could really use some advice on this because surely this is the exact type of thing DQS was built for.
I get raw data that I need to import into my database that I have no control over the format of. One of the columns in the raw data is Name. However, within that name field, there is both Business and Human Names. I was wanting to use DQS to help me cleanse this data and have the following set up: (this is just for clarity to what my issues are)
I have a database with a list of known FirstNames from the Social Security Office, a list of known LastNames from Census data and a large list of words in the english language i know are not human names, i.e. Acupuncture, Concrete...
So I load my raw data to a staging table then from there I have a view that splits the Name Field out into either a BusinessName field or a PersonName field based on an algorithm placed in a UDF that checks the words in the name against my above mentioned lists. I would say it is roughly 95% accurate on determining if the name is a business or a person and this is where I would want DQS to come in and help me.
So I created a knowledge base with the following domains: CompanyName, FirstName, MiddleName, LastName, PersonName(CD). My PersonName Composite Domain is made up of FirstName, MiddleName, LastName and under the advanced options I have it set up to Parse using the space delimiter but also with Use Knowledge Based Parsing Checked. On top of that, I imported my table of FirstNames into the FirstName Domain Values and I did the same with my table of LastNames for the LastName Domain.
So theoretically at this point, with over 10,000 FirstNames and 50,000 LastNames I would think that my knowledge based name parsing would be rather successful but I am running into the following two problems that I can't seem to overcome.
1) The 5% or so inaccurate Company Names that wind up getting parsed into PersonName fields. Since my algorithm to determine these fields is outside of DQS it is a painstaking process to mark the values in the Person Name fields as invalid and ultimately does me no good because if I dont add the Business Names to my Business Word table then the next time i run data through DQS I am going to have the same problem since my algorithm occurs before the data hits DQS. Am I missing a better way to architect all of this within DQS itself? Reference Data Services are out of our budget.
2) The knowledge based domain parsing is TERRIBLE. I get a lot of Names that have a First and a Middle Name but no Last Name even though the second Name was a valid Last Name in my Domain Values. For Example, I have in my First Name Domain Values the name Donald and in my Last Name Domain Values the name Alexander. However, DQS parses this as FirstName: Donald, MiddleName:Alexander, LastName:<Blank>. How do I fix that specific example? I dont want to mark Alexander as invalid because it is entirely possible that someone would have the middle name Alexander but I need that moved to the LastName field in all cases that there are only 2 names in the person name and I dont see a business rule that would allow me to do that.
I really want DQS to work for my company. It seems like it has so much potential and I will be the first to admit that I am a complete beginner on this and there very well could be advanced ways to achieve what I am trying to do. If there is I would love any advice I could get. Just please don't tell me that the solution is to purchase a Reference Data Service, that would be rather disheartening to hear!
Thanks in advance.
Hope this might help:
1) In regards to issue number 1 where in some cases the company name is in the data and is added to the person name - If you want to avoid the company name being added to the person name you can add an additional spare domain to your CD as the last domain. Thus when the DQS knowledge base parsing will find the correct First Name, Last Name and Middle Name it will put all the left-overs from the parsing in this domain and will not add them to the other domains. The knowledge base parsing by default takes all the left over values and adds them to the domain on the "left" of the value so we do not loose data when parsing.
2) Regarding issue number 2 can you please verify that in the CD under the CD proprieties advanced part the "use knowledge based parsing" option is checked.
I tried your scenario with the following steps -
1. Created a PersonName (CD) which is comprised of first name, middle name and last name and set in the advanced the parsing method to space with use knowledge base parsing.
2. In the first name domain I added Donald as a value and in the last name domain I added Alexander as the value.
3. I tried cleansing the following data -
Donald Alexander Donald M Alexander Donald Alexander M Donald Alexander Microsoft Microsoft Donald M Alexander
Donald M Microsoft Alexander
For all records the parsing parsed correctly - the First name was Donald and the Last name Alexander.
Please let me know if this helped.
Thanks for the reply. I havent had time to go back and check the second issue I raised from above regarding the knowledge based parsing but it's probable that it was human error on my part trying to set it all up for the first time. I will let you know if I am still finding issues with it when I get back to it. As for issue #1 above, how would your solution work for something along the lines of a company named "John Smith Construction Inc"? John and Smith would be valid name parts but the name as a whole should only be in the company name field. Would DQS be smart enough with its knowledge based parsing to figure that out or would I likely get FirstName: John, LastName: Smith, CompanyName: Construction Inc?
Thanks for the response.
In regards to "John Smith Construction Inc" if you enhance your CD and include a company name domain within the CD with a list of values which includes "John Smith Construction Inc" in the company name domain than the Knowledge based parsing will be "smart" and identify "John Smith Construction Inc" as a company name and will not split it. The knowledge based parsing first tries to find the longest string in the domains and then reduces the strings until it finds matches, once a match is found it identifies that value as the correct value and places it in the domain.
Hope this helps,
Just a reminder, that Melissa Data offers a commercial Name Service for DQS that handles this exact problem built off of 190,000 First and Last names and Companies. Mapping your Full Name domain to the Melissa Data Name Check Service in DQS can handle most of these issues and more.
Here is the description.
Name Check splits and genderizes full, dual, inverse and mixed format names to enable personalized communications, determine overall gender makeup of a database or list, and create targeted, gender-based campaigns for greater response. Name Parser will parse names into five components (Prefix, First, Middle or Initial, Last, and Suffix), add a gender code to the first name, and recognizes more than 190,000 first and last names to correct misspelled names. Name Check will also flag names containing possible vulgar words, nuisance names, and associated company words to help you screen out possible hoaxes and pranks; recognizes company names (using a special table to understand what are words and what are acronyms in company names) and returns them in upper or lower case; and can create the desired custom salutations based on the salutation format you want to use, for instance Formal, Informal, default Slug, etc.
Here is the link to the documentation: http://www.melissadata.com/manuals/dqs-azure-namecheck-reference.pdf
And the Datamarket listing:
- 已建议为答案 Welly S LeeMicrosoft Employee, Moderator 2012年8月3日 2:33