locked
Normalizing personal contact information RRS feed

  • Question

  • I have a large data set with 10s of millions of rows of contact information.  The data is in CSV format and contains 48 columns of information (First name, MI, last name, 4 part address, 10+ demographic points, etc.) and I'm struggling with how I should design the database and normalize this data, or if I should normalize this data.

    My 2 thoughts for design were either:

    1. Break the columns into logical categorical tables (i.e. BasicContactInfo, Demographics, Financials, Interests, etc.)
    2. Keep the entire row in one table, and pull out the "Objects" into another table (i.e. ContactInformation, States, ZIPCodes, EmployementStatus, EthnicityCodes, etc.)

    The data will be immutable for the most part, and when I get new data, I'll just create a new database and replace the old one.

    The reason I like option 1 is because it makes importing easier, since I can just insert the appropriate columns from each row into the appropriate tables.  Option number 2 feels like it would be faster to get metrics on the data, like how many contacts live in which states, or what is the total number of unique occupations in the data set.  Plus I'll be able to make relationships between the tables, like which state is tied to which zipcode, which city is tied with which county, etc.  Importing that data might be more tricky, since I don't think SQL Bulk Copy will allow for inserting into normalized tables like that.

    The primary use for this data is to allow our sales force to create custom lists of contact information based on a faceted search page.  The sales person would create the filter, and then I will provide them with the resulting data so they can start making business contacts.  Search performance needs to be good.  Insert, update, and deletes won't happen once the data has been imported.

    What should I look for in designing this database?  Are there any good articles on designing tables around wide data sets like my contact information?  Any good advice?

    Friday, May 22, 2015 7:06 PM

Answers

  • I've really never worked with this large of a data set before, so I'm assuming if I want to get this data imported in this lifetime, I'll probably want to stage the raw data and then normalize it afterwards?

    If so, what is the recommended way to go about doing this?  I'm a developer first, so I'm not sure what all the cool tricks are when migrating data in SQL server (2014 if that matters).

    Is there any way to normalize data like this w/o having to iterate over each row?  Here is a subset of the columns I'm working with:

    [FirstName] [MiddleInitial] [LastName] [DateOfBirth]
    [Address] [SuiteApt] [City] [State] [Zip5]  [Zip4]
    [TelephoneNumber] [DoNotCallFlag] [TimeZone]
    [GenderCode] [HomeOwnerCode] [HouseholdIncomeCode]

    Assuming I understand what you were suggesting, would you break out these columns as such?

    ContactsTable:
    [Id] [FirstName] [MiddleInitial] [LastName] [DateOfBirth] [AddressId] [TelephoneId] [TimeZone] [GenderId] [HomeOwnerId] [HouseholdIncomeId]

    AddressTable:
    [Id] [StreetAddress] [SuiteApt] [CityId] [StateId] [ZipId]

    CityTable:
    [Id] [CityName] [StateId] [ZipId]

    StateTable:
    [Id] [Name]

    ZipTable:
    [Id] [Zip5] [Zip4] [StateId]

    PhoneTable:
    [Id] [TelephoneNumber] [DoNotCallFlag]

    GenderTable, HomeOwnerTable, HouseholdIncomeTable

    I have all of the Gender Codes, HomeOwnerCodes and HouseholdIncomeCodes, but I haven't populated the list of cities, zips or states (the last one I could probably do w/o querying the database... if I could only remember that song...)

    What would a stored proc look like that would, for example, check if a city exists in that table, and create it if it doesn't exist?  I'm imagining that this will quickly turn into a 3 week long query if I have to iterate over each row!


     one Advice - Microsoft has a sample database called - AdventureWorks .  you use this database as reference and understand how you can normalize the tables.

    AdventureWorks has almost the same tables like yours and i believe, you acts as very good reference if you look through the tables and  how constraints are defined between the tables to enforce the rules.

    you can get the sample ones here http://msftdbprodsamples.codeplex.com/

     I do not know what version you are using, so,  you can refer the link above and download the correct sample database version. Do not download the dataware house version . download the OLTP adventure works database.

    make use of the database diagram and drag and drop the tables you need and you can easilt understand the links between them


    Hope it Helps!!

    Friday, May 22, 2015 9:52 PM

All replies

  • I always trust 3rd normal form. 

    I would have a contact table and address table, Address Type Table if a contact can have multiple address. (Same with phone)

    If there are multiple occurrences of interests and financials I would separate them too.


    Andy Tauber
    Data Architect
    The Vancouver Clinic
    Website | LinkedIn

    This posting is provided "AS IS" with no warranties, and confers no rights. Please remember to click "Mark as Answer" and "Vote as Helpful" on posts that help you. This can be beneficial to other community members reading the thread.

    • Proposed as answer by Uri DimantMVP Saturday, May 23, 2015 6:16 AM
    Friday, May 22, 2015 7:39 PM
  • I've really never worked with this large of a data set before, so I'm assuming if I want to get this data imported in this lifetime, I'll probably want to stage the raw data and then normalize it afterwards?

    If so, what is the recommended way to go about doing this?  I'm a developer first, so I'm not sure what all the cool tricks are when migrating data in SQL server (2014 if that matters).

    Is there any way to normalize data like this w/o having to iterate over each row?  Here is a subset of the columns I'm working with:

    [FirstName] [MiddleInitial] [LastName] [DateOfBirth]
    [Address] [SuiteApt] [City] [State] [Zip5]  [Zip4]
    [TelephoneNumber] [DoNotCallFlag] [TimeZone]
    [GenderCode] [HomeOwnerCode] [HouseholdIncomeCode]

    Assuming I understand what you were suggesting, would you break out these columns as such?

    ContactsTable:
    [Id] [FirstName] [MiddleInitial] [LastName] [DateOfBirth] [AddressId] [TelephoneId] [TimeZone] [GenderId] [HomeOwnerId] [HouseholdIncomeId]

    AddressTable:
    [Id] [StreetAddress] [SuiteApt] [CityId] [StateId] [ZipId]

    CityTable:
    [Id] [CityName] [StateId] [ZipId]

    StateTable:
    [Id] [Name]

    ZipTable:
    [Id] [Zip5] [Zip4] [StateId]

    PhoneTable:
    [Id] [TelephoneNumber] [DoNotCallFlag]

    GenderTable, HomeOwnerTable, HouseholdIncomeTable

    I have all of the Gender Codes, HomeOwnerCodes and HouseholdIncomeCodes, but I haven't populated the list of cities, zips or states (the last one I could probably do w/o querying the database... if I could only remember that song...)

    What would a stored proc look like that would, for example, check if a city exists in that table, and create it if it doesn't exist?  I'm imagining that this will quickly turn into a 3 week long query if I have to iterate over each row!


    Friday, May 22, 2015 8:35 PM
  • I don't think I would split out City state and zip unless you have a specific reason to do so.

    Also is address 1 to 1? If not you will need a many to many table.

    I would stage it using SSIS to a single table and normalize it from there.

    You might want to consider creating the IDs for the associated table in the staging database so you will have them to use for the foreign keys.


    Andy Tauber
    Data Architect
    The Vancouver Clinic
    Website | LinkedIn

    This posting is provided "AS IS" with no warranties, and confers no rights. Please remember to click "Mark as Answer" and "Vote as Helpful" on posts that help you. This can be beneficial to other community members reading the thread.

    Friday, May 22, 2015 9:43 PM
  • "What would a stored proc look like that would, for example, check if a city exists in that table, and create it if it doesn't exist?  "

    Take a look at the t-sql MERGE statement.


    Andy Tauber
    Data Architect
    The Vancouver Clinic
    Website | LinkedIn

    This posting is provided "AS IS" with no warranties, and confers no rights. Please remember to click "Mark as Answer" and "Vote as Helpful" on posts that help you. This can be beneficial to other community members reading the thread.

    Friday, May 22, 2015 9:44 PM
  • I've really never worked with this large of a data set before, so I'm assuming if I want to get this data imported in this lifetime, I'll probably want to stage the raw data and then normalize it afterwards?

    If so, what is the recommended way to go about doing this?  I'm a developer first, so I'm not sure what all the cool tricks are when migrating data in SQL server (2014 if that matters).

    Is there any way to normalize data like this w/o having to iterate over each row?  Here is a subset of the columns I'm working with:

    [FirstName] [MiddleInitial] [LastName] [DateOfBirth]
    [Address] [SuiteApt] [City] [State] [Zip5]  [Zip4]
    [TelephoneNumber] [DoNotCallFlag] [TimeZone]
    [GenderCode] [HomeOwnerCode] [HouseholdIncomeCode]

    Assuming I understand what you were suggesting, would you break out these columns as such?

    ContactsTable:
    [Id] [FirstName] [MiddleInitial] [LastName] [DateOfBirth] [AddressId] [TelephoneId] [TimeZone] [GenderId] [HomeOwnerId] [HouseholdIncomeId]

    AddressTable:
    [Id] [StreetAddress] [SuiteApt] [CityId] [StateId] [ZipId]

    CityTable:
    [Id] [CityName] [StateId] [ZipId]

    StateTable:
    [Id] [Name]

    ZipTable:
    [Id] [Zip5] [Zip4] [StateId]

    PhoneTable:
    [Id] [TelephoneNumber] [DoNotCallFlag]

    GenderTable, HomeOwnerTable, HouseholdIncomeTable

    I have all of the Gender Codes, HomeOwnerCodes and HouseholdIncomeCodes, but I haven't populated the list of cities, zips or states (the last one I could probably do w/o querying the database... if I could only remember that song...)

    What would a stored proc look like that would, for example, check if a city exists in that table, and create it if it doesn't exist?  I'm imagining that this will quickly turn into a 3 week long query if I have to iterate over each row!


     one Advice - Microsoft has a sample database called - AdventureWorks .  you use this database as reference and understand how you can normalize the tables.

    AdventureWorks has almost the same tables like yours and i believe, you acts as very good reference if you look through the tables and  how constraints are defined between the tables to enforce the rules.

    you can get the sample ones here http://msftdbprodsamples.codeplex.com/

     I do not know what version you are using, so,  you can refer the link above and download the correct sample database version. Do not download the dataware house version . download the OLTP adventure works database.

    make use of the database diagram and drag and drop the tables you need and you can easilt understand the links between them


    Hope it Helps!!

    Friday, May 22, 2015 9:52 PM
  • I had broken out the city, state, and zip because you had suggested I normalize to third normal form. I'll take a look at the Adventureworks database and fire up my SSIS! I'll post back with the results
    Friday, May 22, 2015 10:22 PM