Correcting multiple columns in one pass/one check - composite domains?

Unanswered Correcting multiple columns in one pass/one check - composite domains?

  • Tuesday, September 18, 2012 11:47 PM
     
     

    I am working with some address data, that is split into columns for streetnumber, street name, street extension, suburb, postcode, state, municipality. 

    I have created a domain for each of the text fields, with known values in my domains. While not the core question here, while performing knowledge discovery on my domain, I pointed the client at a lookup table populated with all known values for (eg.) street. Once knowledge discovery was complete, I noticed DQS had added a bunch of error records with autocorrect values - I didn't want this, as my lookup table only contained valid values. Is this because I had "Enable speller" checked on the domain properties tab?

    My incoming data is manually entered from many different sources, and can contain errors - both spelling and relational, as well as missing values - no street extension is a common one.

    I am currently able to use the domains I have defined to pick up invalid values (i.e. garbage), or misspelled values, and the suggestion functionality of DQS cleanse is quite good for correcting these values. 

    However I also know from my source/lookup data that certain combinations of street name and street extension are valid, or for example certain streets only exist in particular suburbs, or certain suburbs are only valid in certain postcodes etc.

    I can see that this sort of logic can be implemented through the use of domain rules, however I don't fancy entering thousands of domain rules. 

    Is there any way I can make use of composite domains? I could create new columns of concatenated data in my data to be cleansed - i.e. streetname + street extension, and then a composite domain to validate against, however this would only enable me to find invalid values, not correct them - which is helpful, but not as helpful as I was hoping. 

    Also, what is the standard approach to using the output of a cleanse project to update the source data? i.e. it's not possible to export the corrections to overwrite the input data, only to export to a new table, either including or excluding cleanse information. If I export the output of the cleanse project, would I then use SSIS/custom SPs/Sql to selectively update corrected values? i.e. update source where exported <col>_confidence > x (eg. 0.5)?

All Replies

  • Monday, September 24, 2012 12:30 PM
     
     

    Yes, your scenario is an apt use case for using composite domains. From looking at your requirement of validating as well as enriching the data (filling in missing values for street extension, cross-validation between domains i.e. certain streets only exist in particular suburbs) without having to define domain rules, you might consider using the Reference Data Service feature in DQS. You can look at this topic on how one can use the RDS feature in DQS to cleanse addresses: Cleanse Your Data Using Refence Data Knowledge.

    Also, in addition to validating values in domains within a CD using the CD rules, you can also correct values in a domain based on values in other domains within the CD using the definitive Then cross-domain rule: Value is equal to. For more information, see Data Correction using Definitive Cross-Domain Rules.

    Exporting the Processed Data: You can export the processed data to a new table in SQL Server database or to an Excel or CSV file. It does not overwrite your source data. Its upto you how you use the processed data to update/overwrite your source data.

    Thanks,
    Vivek
    SQL Server Documentation