none
SSIS Package to change column format & remove hidden characters / white spaces RRS feed

  • Question

  • Hi There

    I have an issue with loading data from an excel spreadsheet to SQL database using SSIS package. It seems to be data columns have different formats or hidden characters which are not accepting by the database so SSIS package is erroring out as a result.

    As its not practical to check every single spreadsheet manually, is there a way we can add another SSIS package to check data first and change the format if its needed or remove if there are any hidden characters and then save them in a staging table before run the data loading package?

    I would greatly appreciate if any of you can help with this. Happy to provide more information if needed.

    Many thanks & have a wonderful weekend!

    Sapphire


    Friday, September 13, 2019 3:57 AM

All replies

  • Hi there,

    it is best not to work with Excel files at all.


    Arthur

    MyBlog


    Twitter

    Friday, September 13, 2019 2:45 PM
    Moderator
  • Hi Sapphire,

    As ArthurZ already pointed out, Excel is like a paper which tolerates any scribblings. It is much better to switch your data feeds into XML format at the system of origin.

    In the meantime, you can treat Excel files as virtual database tables on the file system.

    SELECT from them, JOIN, GROUP, SORT, etc,. and analyze for data quality, and "...change column format & remove hidden characters / white spaces…"

    Check it out in SSMS:

    -- 0. Check what 64-bit OLEDB providers are installed
    -- Microsoft.ACE.OLEDB.12.0 and
    -- Microsoft.ACE.OLEDB.15.0 or 16.0
    EXEC master.sys.sp_MSset_oledb_prop;
    
    -- 1. as rectangular
    SELECT *
     FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
       'Excel 12.0 Xml; HDR=NO;
        Database=c:\Users\...\dada.xlsx',
        [Sheet1$]);
    
    -- 2. as rectangular range
    SELECT *
     FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
       'Excel 12.0 Xml;
        Database=c:\Users\...\dada.xlsx',
        'SELECT * FROM [Sheet1$A1:D2]');
    
    -- 3. as XML
    SELECT *
     FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
       'Excel 12.0 Xml; HDR=NO;
        Database=c:\Users\...\dada.xlsx',
        [data$]) AS wow
       FOR XML PATH, ROOT('root');
    
    -- 4. INSERT if needed
    INSERT INTO ...
    SELECT *
    FROM OPENROWSET('Microsoft.ACE.OLEDB.12.0',
      'Excel 12.0 Xml; HDR=NO;
      Database=c:\Users\...\dada.xlsx',
      [Sheet1$]);
    Friday, September 13, 2019 4:37 PM
  • Dear Yitzhak Khabinsky

    Thank you so much for the great information you provided. I feel like this will help with my question but unfortunately I don’t know how to apply this. I am still in a learning level so if you can explain little bit more how to apply this would be very much appreciated.

    Thank you so much

    Saphire


    Monday, September 16, 2019 2:29 AM
  • Thank you Arthur but this is an existing system and I am going to update it…


    • Edited by Saphire77 Monday, September 16, 2019 2:31 AM
    Monday, September 16, 2019 2:30 AM
  • Hi Sapphire,

    Why not use more sources and destinations in data flow task?

    This is very useful to keep the integrity of your data.

    Best Regards,

    Mona


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Monday, September 16, 2019 9:55 AM
  • Hi Sapphire,

    You can treat Excel files as virtual database tables on the file system.

    (1) and (2) allow you to query via T-SQL your Excel files in SQL Server Management Studio (SSMS) for analysis.

    (4) allows to load Excel files data into a database table via SSIS Execute SQL Task.

    Additionally, you can use a CTE construct for additional data manipulation/cleansing before INSERT ...

    Here is a useful udf_tokenize() user-defined function for you:

    USE yourDBname;
    GO
    
    /*
    1. All invisible TAB, Carriage Return, and Line Feed characters will be replaced with spaces.
    2. Then leading and trailing spaces are removed from the value. 
    3. Further, contiguous occurrences of more than one space will be replaced with a single space.
    */
    CREATE FUNCTION dbo.udf_tokenize(@input VARCHAR(MAX))
    	RETURNS VARCHAR(MAX)
    AS
    BEGIN 
    	RETURN (SELECT CAST('<r><![CDATA[' + @input + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)'));
    END

    Use of the udf_tokenize() function:

    DECLARE @input VARCHAR(MAX) = '   dog	is             barking  ;  ]'
    
    SELECT dbo.udf_tokenize(@input);

    Output, cleansed data:
    dog is barking ; ]
    Monday, September 16, 2019 1:28 PM
  • Hi Sapphire,

    What's the latest?

    Did the proposed solution help you to resolve the issue?

    Wednesday, September 18, 2019 8:18 PM
  • Dear Yitzhak

    You're such a helpful… Sorry for getting back to you too late. I spent lots of time to understand why my SSIS package failed to loading data to a table. I've checked every single bit and found it's not an issue with formatting but with some hidden characters/white spaces. After I've manually delete all blank cells before execute the SSIS package I could load data with no issues. So that tells me there are some hidden characters/white spaces in blank cells.

    Now I want to know how to change my SSIS package or add a new SSIS package to delete/remove all hidden characters/white spaces in blank cells before load the data to a table.

    I am not an experienced SSIS user so I would greatly appreciate if you can help with this.

    Many thanks

    Saphire


    Thursday, September 19, 2019 1:11 AM
  • Hi Saphire77,

    Please remember to click "Mark as Answer" the responses that resolved your issue.

    This can be beneficial to other community members reading this thread. 

    Thank you.

    Regards,

    Mona


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com

    Friday, September 20, 2019 10:27 AM
  • Dear Yitzhak

    You're such a helpful… Sorry for getting back to you too late. I spent lots of time to understand why my SSIS package failed to loading data to a table. I've checked every single bit and found it's not an issue with formatting but with some hidden characters/white spaces. After I've manually delete all blank cells before execute the SSIS package I could load data with no issues. So that tells me there are some hidden characters/white spaces in blank cells.

    Now I want to know how to change my SSIS package or add a new SSIS package to delete/remove all hidden characters/white spaces in blank cells before load the data to a table.

    I am not an experienced SSIS user so I would greatly appreciate if you can help with this.

    Many thanks

    Saphire


    Hi Saphire,

    Here is what you need to do, step-by-step:

    1. Load Excel file as-is into a staging table where all columns are VARCHAR() data type to tolerate any data quality issues.
    2. Apply the udf_tokenize() user-defined function to all columns. It will cleanse the invisible not needed characters.
    3. Load cleansed staging table to a final destination table.
    Friday, September 20, 2019 3:35 PM
  • Thank you so much for the step-by-step instructions Yitzhak, I greatly appreciate your help.

     

    Can you also please advise me how to apply the udf_tokenize()user-defined function to all columns. There are 11 columns I need to check for hidden characters.  I am not very clear on this step so your kind help on this would be very much appreciated.

     

    Thank you very much

     

    Saphire

     




    Sunday, September 22, 2019 10:22 PM
  • Hi Saphire,

    Here is how to do it. Obviously, you need to use your real table instead of @mockTbl.

    USE yourDatabaseName;
    GO
    
    /*
    1. All invisible TAB, Carriage Return, and Line Feed characters will be replaced with spaces.
    2. Then leading and trailing spaces are removed from the value. 
    3. Further, contiguous occurrences of more than one space will be replaced with a single space.
    */
    CREATE FUNCTION dbo.udf_tokenize(@input VARCHAR(MAX))
       RETURNS VARCHAR(MAX)
    AS
    BEGIN 
       RETURN (SELECT CAST('<r><![CDATA[' + @input + ']]></r>' AS XML).value('(/r/text())[1] cast as xs:token?','VARCHAR(MAX)'));
    END
    
    -- DDL and sample data population, start
    DECLARE @mockTbl TABLE (ID INT IDENTITY(1,1), col_1 VARCHAR(100), col_2 VARCHAR(100));
    INSERT INTO @mockTbl (col_1, col_2)
    VALUES ('  FL   ', '  Miami')
    	, ('  FL   ', '  Fort       Lauderdale   ')
    	, ('  NY   ', '  New           York   ');
    -- DDL and sample data population, end
    
    -- before
    SELECT * FROM @mockTbl;
    
    -- remove invisible chars
    UPDATE @mockTbl
    SET col_1 = dbo.udf_tokenize(col_1)
    , col_2 = dbo.udf_tokenize(col_2);
    
    -- after
    SELECT * FROM @mockTbl; 


    Sunday, September 22, 2019 10:25 PM