Anyone please help me

  • Question

  • Please prepare:

      • 1. High-level design (table structures, constraints)
      • 2. Pseudo-code / logic showing how you'll solve this

     You can assume whatever you need in order to solve this problem. But remember that you aren't working with 5 records: there are 10,000,000 records, with thousands of location names among them.
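For the table-structure part, here is a minimal sketch of one possible design, not a definitive answer. It assumes the location names and abbreviations live in their own lookup tables so that rules can be added as data rather than code. All table and column names other than ID, CompName and ParentName are hypothetical, and SQLite (through Python's sqlite3 module) is used only so the sketch is runnable.

```python
import sqlite3

# Hypothetical schema: raw input, two lookup tables, and the cleaned output.
SCHEMA = """
CREATE TABLE Company (
    ID       INTEGER PRIMARY KEY,
    CompName TEXT NOT NULL
);
-- One row per city/state/country name; NOCASE makes comparisons case-insensitive.
CREATE TABLE Location (
    LocationName TEXT COLLATE NOCASE PRIMARY KEY
);
-- One row per abbreviation, e.g. ('Pvt', 'Private').
CREATE TABLE Abbreviation (
    ShortForm TEXT COLLATE NOCASE PRIMARY KEY,
    LongForm  TEXT NOT NULL
);
-- Cleaned result; ParentName stays NULL when no parent was found.
-- (SQLite only enforces REFERENCES when PRAGMA foreign_keys is ON.)
CREATE TABLE CompanyClean (
    ID         INTEGER PRIMARY KEY REFERENCES Company(ID),
    CompName   TEXT NOT NULL,
    ParentName TEXT
);
"""


def create_schema(conn):
    """Create the hypothetical tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO Location VALUES ('Chennai')")
conn.execute("INSERT INTO Abbreviation VALUES ('Pvt', 'Private')")
```

With primary keys on the lookup tables, checking whether a parenthesised word is a known location is an index lookup rather than a scan, which matters with thousands of locations and 10,000,000 names.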

     

    Assume we have 10,000,000 company names like the ones below. Some names contain city/state/country location names. Some contain parent company names. Some have other details, such as Private, within ().

     

    Sample Input:

     

    ID  CompName
    1   Test eKnowledge (Chennai) Pvt. Ltd
    2   Test eKnowledge (Private) Ltd
    3   Test eKnowledge Ltd (an tell Company)
    4   Fukushima Medical Corp (Japan), a subsidiary of Yashika
    5   Fukushima Medical Corporation

     

    Business Rules: (please see the Sample Output)

    1. REMOVE: Location names (city, state, country, etc.) should be removed along with their enclosing (). Note that there are a lot of cities, states and countries in the input data.
    2. EXPAND/STANDARDIZE: Pvt must be replaced with Private, Corp with Corporation, Ltd with Limited, and so on for many more abbreviations.
    3. SPLIT: If the name has "a subsidiary of", or "(an …)", or "(a …)", then the 2nd company name must be split out into another column.
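The three rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: `LOCATIONS` and `EXPANSIONS` are hypothetical stand-ins for the real lookup lists (which would hold thousands of entries), `clean_name` is a made-up helper name, and the patterns cover just the shapes that appear in the sample input.

```python
import re

# Hypothetical lookup data; the real lists would be loaded from tables.
LOCATIONS = {"chennai", "japan"}
EXPANSIONS = {"pvt": "Private", "corp": "Corporation", "ltd": "Limited"}

# Rule 3 patterns: "..., a subsidiary of X" and "(a X)" / "(an X)".
SUBSIDIARY_RE = re.compile(r",?\s*\ba subsidiary of\s+(?P<parent>.+)$", re.IGNORECASE)
PAREN_PARENT_RE = re.compile(r"\(\s*an?\s+(?P<parent>[^)]+)\)", re.IGNORECASE)
# Any parenthesised group, and any alphabetic word with an optional trailing dot.
PAREN_RE = re.compile(r"\(([^)]*)\)")
WORD_RE = re.compile(r"\b([A-Za-z]+)\b\.?")


def clean_name(raw):
    """Return (CompName, ParentName) for one raw company name."""
    name, parent = raw, ""

    # Rule 3 (SPLIT): move the second company name into its own value.
    m = SUBSIDIARY_RE.search(name)
    if m:
        parent = m.group("parent").strip()
        name = name[:m.start()]
    else:
        m = PAREN_PARENT_RE.search(name)
        if m:
            parent = m.group("parent").strip()
            name = name[:m.start()] + name[m.end():]

    # Rule 1 (REMOVE): drop "(...)" only when its content is a known location.
    def drop_location(match):
        inner = match.group(1).strip().lower()
        return "" if inner in LOCATIONS else match.group(0)

    name = PAREN_RE.sub(drop_location, name)

    # Rule 2 (EXPAND): replace known abbreviations, swallowing a trailing dot.
    def expand(match):
        full = EXPANSIONS.get(match.group(1).lower())
        return full if full else match.group(0)

    name = WORD_RE.sub(expand, name)

    # Collapse the extra spaces left behind by the removals.
    return " ".join(name.split()), parent


print(clean_name("Fukushima Medical Corp (Japan), a subsidiary of Yashika"))
# → ('Fukushima Medical Corporation', 'Yashika')
```

Locations are checked with a set lookup per parenthesised group rather than one huge regular expression of all location names, so the cost per name does not grow with the size of the location list. Locations that appear outside () are not handled by this sketch.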

     

    Sample Output: (Sample Input → Business Rules applied → Sample Output)

    ID  CompName                           ParentName
    1   Test eKnowledge Private Limited
    2   Test eKnowledge (Private) Limited
    3   Test eKnowledge Limited            tell Company
    4   Fukushima Medical Corporation      Yashika
    5   Fukushima Medical Corporation

    Wednesday, January 22, 2020 10:11 AM

All replies

  • There is no good way to automatically clean up data as you describe.  You may be able to handle 60-70% of it using string search/replace, but there are far too many variables to cover the rest.

    The best solution to your problem is to look up and standardize the names against another data source, such as the post office, a business list, DNBS, or something similar that provides standardized naming.
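    As a rough illustration of the lookup idea, the sketch below matches an input name against a reference list, first by exact normalized key and then by fuzzy similarity using Python's standard difflib module. The `REFERENCE_NAMES` list, the `normalize_key` and `standardize` helpers, and the 0.8 cutoff are all assumptions made for the example; they do not represent any real data source.

```python
import difflib

# Hypothetical reference list standing in for an authoritative data source.
REFERENCE_NAMES = [
    "Fukushima Medical Corporation",
    "Test eKnowledge Private Limited",
]


def normalize_key(name):
    """Lower-case, drop dots, and collapse whitespace to build a match key."""
    return " ".join(name.lower().replace(".", " ").split())


# Map each normalized key back to its standardized spelling.
REFERENCE_INDEX = {normalize_key(n): n for n in REFERENCE_NAMES}


def standardize(name, cutoff=0.8):
    """Return the reference spelling for name, or None when nothing is close."""
    key = normalize_key(name)
    if key in REFERENCE_INDEX:  # exact match on the normalized key
        return REFERENCE_INDEX[key]
    # Fall back to the single closest key at or above the similarity cutoff.
    close = difflib.get_close_matches(key, list(REFERENCE_INDEX), n=1, cutoff=cutoff)
    return REFERENCE_INDEX[close[0]] if close else None


print(standardize("Fukushima Medical Corp"))
# → Fukushima Medical Corporation
```

    The fuzzy step compares the input against every reference key, so on 10,000,000 names it would need some form of indexing or pre-grouping; it is shown here only to make the lookup approach concrete.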


    • Edited by Tom Phillips Wednesday, January 22, 2020 12:52 PM
    Wednesday, January 22, 2020 12:52 PM