none
Name control parsing RRS feed

  • Question

  • Good morning,

    I'm struggling to find a way to parse name control (4 characters from last name) from the taxpayer's name. What I need for query to do is to loop through the name and after every space get 4 characters, or pad with space if less then 4. I need this loop to create multiple records, so one record for every match and then move to another name. Its parsing based on https://www.irs.gov/pub/irs-pdf/p4164.pdf page 199, exibit 5. Reason for multiple records is because its very hard to figure out someone's last name, especially foreign names, so submitting it multiple different times will probably get us a match. Taxpayer's name is just one field, no choice on that since we don't create that data. Middle initial should be skipped, there is no period after it.

    Thanks

    On thing to add, when we receive data its only one person's full name.. one part of the example shows joint filing.
    • Edited by Amko4217 Monday, April 22, 2019 4:43 PM
    Monday, April 22, 2019 4:37 PM

Answers

  • I agree that SQL is not the best language for this task.  But here is an attempt.  If I did it right, it handles all the samples in the IRS instructions except "Dang Van Le Nhat Thi Pham" (for which the IRS says the correct response is "LE PHAM") and "Yen-Yin Chiu Jin-Zhang Qui (for which the IRS says the correct response is "CHIU QUI")".  But real data probably has more special cases (like a suffix of ", III" and not just suffixes like ", Sr.".  So you probably need more special cases and won't ever be able to handle everything, but at least you can come close.

    The following code inputs the 71 cases from the IRS sample, finds 233 guesses for the match and succeeds on 69 of the 71 cases.

    Declare @ValidNameControls Table(Name varchar(100), NameControl char(40));
    Insert @ValidNameControls(Name, NameControl) Values
    ('John Brown',				'BROW'),
    ('Mary Smith & John Jones',	'SMIT'),
    ('Ralph Teak',				'TEAK'),
    ('Dorothy Willow',			'WILL'),
    ('Joe McCedar',				'MCCE'),
    ('Joe McCarty',				'MCCA'),
    ('Torn MacDouglas',			'MACD'),
    ('Joseph MacTitus',			'MACT'),
    ('John Hardy, Minor',		'HARD'),
    ('April May Jordan',		'JORD'),
    ('John Lea-Smith',			'LEA-'),
    ('Thomas A. El-Oak',		'EL-O'),
    ('Rana Al-Smadi',			'AL-S'),
    ('John O''Neil',				'ONEI'),
    ('Ann O''Spruce',			'OSPR'),
    ('Mark D''Magnolia',			'DMAG'),
    ('John O''Willow',			'OWIL'),
    ('Danette B',				'B'),
    ('James P. Ai',				'AI'),
    ('John A. Fir',				'FIR'),
    ('John Ao, Sr.',			'AO'),
    ('John En, Sr.',			'EN'),
    ('Daniel P. Di Almond',		'DIAL'),
    ('Mary J. Van Elm',			'VANE'),
    ('Susan L. Von Birch',		'VONB'),
    ('Aya Abu Sham',			'ABUS'),
    ('Donald Vander Oak',		'VAND'),
    ('Otto Von Hickory',		'VONH'),
    ('Nabil Al Feyez',			'ALFE'),
    ('Amr El Bayoumi',			'ELBA'),
    ('Janet C. Redbud Laurel',	'LAUR'),
    ('Dee (Plum) Birch',		'BIRC'),
    ('Mary Johnson Garcia',		'GARC'),
    ('Joan Hickory-Hawthorn',	'HICK'),
    ('Dale Redwood-Cedar',		'REDW'),
    ('John Lea-Wren',			'LEA-'),
    ('Dell Ash & Linda Birch',	'ASH'),
    ('Trey & Joan Eucalyptus',	'EUCA'),
    ('Linda Birch & Dell Ash',	'BIRC'),
    ('Mary Smith & Mike Best',	'SMIT'),
    ('Abdullah Allar-Sid',		'ALLA'),
    ('Jose Alvarado Nogales',	'ALVA'),
    ('Juan de la Rosa Y Obregon',	'DELA'),
    ('Pedro Paz-Ayala',			'PAZ-'),
    ('Donald Vander Neut',		'VAND'),
    ('Otto Von Wodtke',			'VONW'),
    ('John Smith Gonzalez',		'GONZ'),
    ('Maria Acevedo Smith',		'SMIT'),
    ('John Garcia Garza Hernandez',	'GARZ'),
    ('Elena del Valle',			'DELV'),
    ('Eduardo de la Rosa',		'DELA'),
    ('Pablo De Martinez',		'DEMA'),
    ('Miguel de Torres',		'DETO'),
    ('Juanita de la Fuente',	'DELA'),
    ('B.A. De Rodriguez',		'DERO'),
    ('M.D. de Garcia',			'DEGA'),
    ('Juan Garza Morales',		'GARZ'),
    ('Maria Lopez y Moreno',	'LOPE'),
    ('Sylvia Juarez de Garcia',	'JUAR'),
    ('Binh To La',				'LA'),
    ('Nam Quoc Tran & Thuy Thanh Vo',	'TRAN'),
    ('Dang Van Le Nhat Thi Pham',	'LE PHAM'),
    ('Kim Van Nguyen & Thi Tran',	'NGUY'),
    ('Kwan Kim Van & Yue Le',		'KWAN'),
    ('Yen-Yin Chiu Jin-Zhang Qui',	'CHIU QUI'),
    ('Earline Old Person',		'OLDP'),
    ('Spike Big Horn',			'BIGH'),
    ('Spike Bighorn',			'BIGH'),
    ('Mary Between Lodges',		'BETW'),
    ('Gail George Moonface',	'GEOR'),
    ('Night & James Lou Mary Her Many Horses',	'HERM');
    
    -- Get substring of each Name after any space
    Declare @ProposedNameControls Table(Name varchar(100), NameControl char(40), ProposedNameControl varchar(100));
    
    ;With Numbers As
    (Select 1 As Number
    Union All
    Select Number + 1
    From Numbers
    Where Number <= 100)
    Insert @ProposedNameControls(Name, NameControl, ProposedNameControl)
    Select v.Name, v.NameControl, Upper(Substring(v.Name, Number, Len(v.Name))) As ProposedNameControl
    From @ValidNameControls v
    Cross Join Numbers n
    Where Substring(v.Name, Number, 1) = '' And SUBSTRING(v.Name, Number, Len(v.Name)) <> ''
    -- Also add the whole name
    Union All
    Select v.Name, v.NameControl, v.Name
    From @ValidNameControls v;
    
    -- Remove ", Sr." and ", Jr."
    -- You probably need to add others like , III and , IV
    Update @ProposedNameControls
    Set ProposedNameControl =
      Replace(Replace(ProposedNameControl, ', Sr.', ''), 'Jr. ', '');
    
    
    -- Remove spaces, ', ., and commas - then truncate to 4 characters
    Update @ProposedNameControls
    Set ProposedNameControl =
      Substring(Replace(Replace(Replace(Replace(ProposedNameControl, ' ', ''), '''', ''), '.', ''), ',', '')  , 1, 4);
    
    -- Replace & with 5 spaces to always get last name of joint names separated by &
    Update @ProposedNameControls
    Set ProposedNameControl =
      Replace(ProposedNameControl, '&', '     ');
    
    -- Return any rows in the IRS sample for which we did not find the correct NameControl
    Select Name, NameControl
    From @ValidNameControls
    Except
    Select Name, ProposedNameControl From @ProposedNameControls Order By Name;
    
    Tom
    • Marked as answer by Amko4217 Thursday, May 9, 2019 7:11 PM
    Tuesday, April 23, 2019 4:01 AM

All replies

  • Good morning,

    Good evening! :-)

    I'm struggling to find a way to parse name control (4 characters from last name) from the taxpayer's name. What I need for query to do is to loop through the name and after every space get 4 characters, or pad with space if less then 4. I need this loop to create multiple records, so one record for every match and then move to another name. Its parsing based on https://www.irs.gov/pub/irs-pdf/p4164.pdf page 199, exibit 5. Reason for multiple records is because its very hard to figure out someone's last name, especially foreign names, so submitting it multiple different times will probably get us a match. Taxpayer's name is just one field, no choice on that since we don't create that data. Middle initial should be skipped, there is no period after it.

    Ouch! I am glad that I don't have to work with that! So Mary J. Van Elm becomes VANE, but I don't think Van Morrisson should become VANM. (Van is his first name.) And these middle initials... Danielle B results in B only, but what if you have John F Smith, that is the period is missing from the middle initial?

    Given the complexity of these rules, I would not like to do this in T-SQL, but rather use the RegEx classes in the CLR have have a CLR function to return the possible name codes. If you are SQL 2017, you could also invoke a Pyhton script for the same task.


    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Monday, April 22, 2019 6:53 PM
  • It is complex and unpredictable but that's why I would like to send multiple matches per name, so instead of figuring out one name control which is almost impossible I would send 2, 3 or more.. I figure one of those would get good match. We are allowed to send multiple records for one account.

    For example for Donald Vander Neut I would create 2 records, one with name control Vand and another as Neut.. of them would be match. Things get more complex with Van, Von, Del, El and so on would be obviously more complex as they are considered part of last name... so they should be concatenated with the next word.

    Monday, April 22, 2019 7:20 PM
  • I agree that SQL is not the best language for this task.  But here is an attempt.  If I did it right, it handles all the samples in the IRS instructions except "Dang Van Le Nhat Thi Pham" (for which the IRS says the correct response is "LE PHAM") and "Yen-Yin Chiu Jin-Zhang Qui (for which the IRS says the correct response is "CHIU QUI")".  But real data probably has more special cases (like a suffix of ", III" and not just suffixes like ", Sr.".  So you probably need more special cases and won't ever be able to handle everything, but at least you can come close.

    The following code inputs the 71 cases from the IRS sample, finds 233 guesses for the match and succeeds on 69 of the 71 cases.

    Declare @ValidNameControls Table(Name varchar(100), NameControl char(40));
    Insert @ValidNameControls(Name, NameControl) Values
    ('John Brown',				'BROW'),
    ('Mary Smith & John Jones',	'SMIT'),
    ('Ralph Teak',				'TEAK'),
    ('Dorothy Willow',			'WILL'),
    ('Joe McCedar',				'MCCE'),
    ('Joe McCarty',				'MCCA'),
    ('Torn MacDouglas',			'MACD'),
    ('Joseph MacTitus',			'MACT'),
    ('John Hardy, Minor',		'HARD'),
    ('April May Jordan',		'JORD'),
    ('John Lea-Smith',			'LEA-'),
    ('Thomas A. El-Oak',		'EL-O'),
    ('Rana Al-Smadi',			'AL-S'),
    ('John O''Neil',				'ONEI'),
    ('Ann O''Spruce',			'OSPR'),
    ('Mark D''Magnolia',			'DMAG'),
    ('John O''Willow',			'OWIL'),
    ('Danette B',				'B'),
    ('James P. Ai',				'AI'),
    ('John A. Fir',				'FIR'),
    ('John Ao, Sr.',			'AO'),
    ('John En, Sr.',			'EN'),
    ('Daniel P. Di Almond',		'DIAL'),
    ('Mary J. Van Elm',			'VANE'),
    ('Susan L. Von Birch',		'VONB'),
    ('Aya Abu Sham',			'ABUS'),
    ('Donald Vander Oak',		'VAND'),
    ('Otto Von Hickory',		'VONH'),
    ('Nabil Al Feyez',			'ALFE'),
    ('Amr El Bayoumi',			'ELBA'),
    ('Janet C. Redbud Laurel',	'LAUR'),
    ('Dee (Plum) Birch',		'BIRC'),
    ('Mary Johnson Garcia',		'GARC'),
    ('Joan Hickory-Hawthorn',	'HICK'),
    ('Dale Redwood-Cedar',		'REDW'),
    ('John Lea-Wren',			'LEA-'),
    ('Dell Ash & Linda Birch',	'ASH'),
    ('Trey & Joan Eucalyptus',	'EUCA'),
    ('Linda Birch & Dell Ash',	'BIRC'),
    ('Mary Smith & Mike Best',	'SMIT'),
    ('Abdullah Allar-Sid',		'ALLA'),
    ('Jose Alvarado Nogales',	'ALVA'),
    ('Juan de la Rosa Y Obregon',	'DELA'),
    ('Pedro Paz-Ayala',			'PAZ-'),
    ('Donald Vander Neut',		'VAND'),
    ('Otto Von Wodtke',			'VONW'),
    ('John Smith Gonzalez',		'GONZ'),
    ('Maria Acevedo Smith',		'SMIT'),
    ('John Garcia Garza Hernandez',	'GARZ'),
    ('Elena del Valle',			'DELV'),
    ('Eduardo de la Rosa',		'DELA'),
    ('Pablo De Martinez',		'DEMA'),
    ('Miguel de Torres',		'DETO'),
    ('Juanita de la Fuente',	'DELA'),
    ('B.A. De Rodriguez',		'DERO'),
    ('M.D. de Garcia',			'DEGA'),
    ('Juan Garza Morales',		'GARZ'),
    ('Maria Lopez y Moreno',	'LOPE'),
    ('Sylvia Juarez de Garcia',	'JUAR'),
    ('Binh To La',				'LA'),
    ('Nam Quoc Tran & Thuy Thanh Vo',	'TRAN'),
    ('Dang Van Le Nhat Thi Pham',	'LE PHAM'),
    ('Kim Van Nguyen & Thi Tran',	'NGUY'),
    ('Kwan Kim Van & Yue Le',		'KWAN'),
    ('Yen-Yin Chiu Jin-Zhang Qui',	'CHIU QUI'),
    ('Earline Old Person',		'OLDP'),
    ('Spike Big Horn',			'BIGH'),
    ('Spike Bighorn',			'BIGH'),
    ('Mary Between Lodges',		'BETW'),
    ('Gail George Moonface',	'GEOR'),
    ('Night & James Lou Mary Her Many Horses',	'HERM');
    
    -- Get substring of each Name after any space
    Declare @ProposedNameControls Table(Name varchar(100), NameControl char(40), ProposedNameControl varchar(100));
    
    ;With Numbers As
    (Select 1 As Number
    Union All
    Select Number + 1
    From Numbers
    Where Number <= 100)
    Insert @ProposedNameControls(Name, NameControl, ProposedNameControl)
    Select v.Name, v.NameControl, Upper(Substring(v.Name, Number, Len(v.Name))) As ProposedNameControl
    From @ValidNameControls v
    Cross Join Numbers n
    Where Substring(v.Name, Number, 1) = '' And SUBSTRING(v.Name, Number, Len(v.Name)) <> ''
    -- Also add the whole name
    Union All
    Select v.Name, v.NameControl, v.Name
    From @ValidNameControls v;
    
    -- Remove ", Sr." and ", Jr."
    -- You probably need to add others like , III and , IV
    Update @ProposedNameControls
    Set ProposedNameControl =
      Replace(Replace(ProposedNameControl, ', Sr.', ''), 'Jr. ', '');
    
    
    -- Remove spaces, ', ., and commas - then truncate to 4 characters
    Update @ProposedNameControls
    Set ProposedNameControl =
      Substring(Replace(Replace(Replace(Replace(ProposedNameControl, ' ', ''), '''', ''), '.', ''), ',', '')  , 1, 4);
    
    -- Replace & with 5 spaces to always get last name of joint names separated by &
    Update @ProposedNameControls
    Set ProposedNameControl =
      Replace(ProposedNameControl, '&', '     ');
    
    -- Return any rows in the IRS sample for which we did not find the correct NameControl
    Select Name, NameControl
    From @ValidNameControls
    Except
    Select Name, ProposedNameControl From @ProposedNameControls Order By Name;
    
    Tom
    • Marked as answer by Amko4217 Thursday, May 9, 2019 7:11 PM
    Tuesday, April 23, 2019 4:01 AM
  • Amazing work, Tom!

    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Tuesday, April 23, 2019 1:07 PM
  • Hi Tom, 

    We are working on making it stored procedure so that once we populate our address exchange table SP will execute and create name controls. I'll update this thread soon with results.

    Your code looks amazing, thanks Tom!



    Tuesday, April 23, 2019 4:01 PM
  • So I have imported 936 Accounts into tblNameControl table and for those accounts we do know what the name control should be. With that said after we ran the SP we found that biggest issue is names we get don't have period after middle initial so they are included into name control. In the IRS guideline they have period as well as parenthesis which we do not receive. So for example Mary J. Van Elm would come over as Mary J Van Elm and when SP is ran name control comes out to JVAN (again just an example since I cannot share data). We also receive MC CARTHY instead of MCCARTHY and so on.. but they are correct in name control.
    • Edited by Amko4217 Tuesday, April 23, 2019 4:55 PM
    Tuesday, April 23, 2019 4:55 PM
  • Upon closer look I do see that since it created multiple entries, one of the entries is almost always right so that should really help in us getting accurate hits. We might clean it up a bit, as I said middle initial is currently used as well as title (I, II, III, IV, JR, SR) are all creating new name control record. For example Johnny Cash Jr has 3 records; John , Cash, and Jr 

    JR and SR and other titles do not have period after it so I removed it in the SP but it still created records for it.. not that big of deal at the moment but just trying to fine tune it bit.

    Maybe best option would to add (A, B, C, etc and JR, SR, I, II, III,IV) as not process keywords and we can add more as needed to that list.

    • Edited by Amko4217 Tuesday, April 23, 2019 6:05 PM
    Tuesday, April 23, 2019 6:01 PM
  • I just wanted to post what we ended up with. Process works really good and we are almost ready for our first address exchange request.

    --truncate TAR table
    truncate table tblTAROutputCore
    --insert addresses with bad address marker into tblTAROutputCore table... sql omitted.
    
    --start the process
    
    DROP TABLE tblNameControl
    
    CREATE TABLE tblNameControl(TaxpayerName varchar(100), NameControl varchar(100), AgencyControlNumber int)
    ;
    With Numbers As
    (Select 1 As Number
    Union All
    Select Number + 1
    From Numbers
    Where Number <= 100)
    
    Insert tblNameControl(TaxpayerName, NameControl, AgencyControlNumber)
    Select v.TaxpayerName, Upper(Substring(v.TaxpayerName, Number, Len(v.TaxpayerName))) As NameControl, AgencyControlNumber
    From tblTAROutputCore v
    Cross Join Numbers n
    Where Substring(TaxpayerName, Number, 1) = '' And SUBSTRING(TaxpayerName, Number, Len(TaxpayerName)) <> ''
    
    -- Remove spaces, ', ., and commas - then truncate to 4 characters
    Update tblNameControl
    Set NameControl =
      Substring(Replace(Replace(Replace(Replace(NameControl, ' ', ''), '''', ''), '.', ''), ',', '')  , 1, 4);
    
    -- Remove suffix
    Delete from tblNameControl
    Where NameControl like '%jr' and taxpayername like '% jr' or 
    NameControl like '%SR' and taxpayername like '% Sr' or
    NameControl like '%II' and taxpayername like '% II' or 
    NameControl like '%III' and taxpayername like '% III' or 
    NameControl like '%IV' and taxpayername like '% IV'
      --Replace(Replace(NameControl, ' III', ''), ' Jr', '');
    
    -- Replace & with 5 spaces to always get last name of joint names separated by &
    Update tblNameControl
    Set NameControl =
      Replace(NameControl, '&', '     ');

    Thursday, April 25, 2019 2:08 PM
  • Hi Amko,

    What is the progress of your work? Any update for the current thread?

    Best Regards,

    Will


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Monday, April 29, 2019 9:26 AM
    Moderator
  • Hi Will, I don't have any additional progress. We are going to use what i posted on April 25th but at the moment we are waiting on IRS to respond to previously sent file (using old name control process), once they respond and it should be this week, we are going to send another address tickler to them using the stored procedure outlined on 25th. IRS should respond within 7 working days if file format is correct and earlier if not. So for the moment we are waiting. I'm really liking how the SP works and how name control is parsed out, I don't expect any problems.

    I'll post on here what happens when the file comes back, we should be able to calculate rough percentages and see how this new process is working.

    Quick edit; I was able to send file over to Fed and for 2,569 records system created 4,507 name control matches. Everything worked as intended and now we wait for about 7 working days until we get results back.



    • Edited by Amko4217 Monday, April 29, 2019 7:02 PM
    Monday, April 29, 2019 2:24 PM
  • Final update, I have received data back from Fed and name control SP worked really well. I think that about 85% of requested accounts got matches back. Hard to say why 15% didn't.. probably something to do with their record keeping or hyphenated names that didn't come over with "-".

    Thank you 


     
    Thursday, May 9, 2019 7:17 PM