CSV importing unwanted columns

  • Question

  • Hi all,

    I wonder if anyone can help with a strange issue that's occurring when I try to import some csv files via SSIS.
    My flat file connection manager details are as follows:

    Format: Delimited
    Text Qualifier: <none>
    Header row delimiter: Comma {,}
    Header rows to skip: 0
    Column names ARE NOT in the first data row.

    The problem is that if I have specified 3 columns, as below:

    Column1 ColumnDelimiter: Comma{,}
    Column2 ColumnDelimiter: Comma{,}
    Column3 ColumnDelimiter: {CR}{LF}

    If the csv file has data populated in the first 4 columns (i.e. my defined columns 1-3 plus a column 4 which I have not specified), then when I look at the Preview tab in SSIS, I get the value followed by a comma (',') in the Column3 field. What's worse, if I have many more rows, it populates all of those other Column3 entries with a trailing comma (',') as well.

    Has anyone experienced this before, and more importantly, is there a fix for it? I only want SSIS to read the column values that I have specified in the flat file connection manager; I don't want to read/import anything after my last defined column.
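    To illustrate the mechanism (a rough Python sketch, not SSIS, just to show what I mean): because my Column3 delimiter is {CR}{LF}, everything after the second comma ends up inside Column3, whereas I only want the first three fields.

    ```python
    # Illustrative only: the sample row is an assumption to show the
    # difference, not actual SSIS parsing code.
    line = "AA1,Joe,Bloggs,Blaah"

    # What SSIS effectively does: the first two commas delimit Column1 and
    # Column2, and the rest of the line (up to {CR}{LF}) lands in Column3.
    print(line.split(",", 2))   # ['AA1', 'Joe', 'Bloggs,Blaah']

    # What I want: keep only the first three fields and drop the rest.
    print(line.split(",")[:3])  # ['AA1', 'Joe', 'Bloggs']
    ```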

    Any help would be gratefully appreciated.

    Wednesday, April 13, 2011 10:43 AM

Answers

  • I got around this by adding a Script Task component before my dataflow task was called. Inside the Script Task component I had something like the below.

        Public Sub Main()

            'Requires Imports System.IO at the top of the script for StreamReader.
            Dim sourceFilename As String = Dts.Variables("RunTime_FileName").Value.ToString()
            Dim sColumnSeparator As String = ","
            Dim sColumnCount As Integer = 13
            Dim fileReader As StreamReader = New StreamReader(sourceFilename)
            Dim line As String
            Dim DataRow() As String

            'Start with success status. Set to failure if required.
            Dts.TaskResult = Dts.Results.Success

            While (Not fileReader.EndOfStream)
                line = fileReader.ReadLine()
                DataRow = line.Split(sColumnSeparator.ToCharArray())
                If DataRow.Length > sColumnCount Then
                    Dts.TaskResult = Dts.Results.Failure
                    Exit While 'No need to scan the remaining rows.
                End If
            End While
            fileReader.Close()
        End Sub

    Seems to work how I intend, i.e. if there are more columns than I have defined then the task fails, at which point I will handle the failure with a suitable message to the user. The variable sColumnCount will be changed to be read from a config file.

    Hope it helps, definitely sorted me out!

     

    • Proposed as answer by Jerry Nee Monday, April 18, 2011 10:50 AM
    • Marked as answer by Jerry Nee Wednesday, April 27, 2011 10:26 AM
    Thursday, April 14, 2011 11:47 AM

All replies

  • In the connection manager you need to define the fourth column anyway, because there is a comma between columns 3 and 4.

    However, in the data source you can untick column 4 to import just the 3 columns.

    • Proposed as answer by Koen Verbeeck (MVP) Wednesday, April 13, 2011 11:32 AM
    • Unproposed as answer by wattyjnr Wednesday, April 13, 2011 12:09 PM
    Wednesday, April 13, 2011 11:09 AM
  • I think when you define a connection to the flat file, it has to read the whole file - you can't (in the connection) select which columns to read. However, in the Columns tab of the connection manager you can choose which columns to pass on to the next task.
    every day is a school day
    My Blog
    Wednesday, April 13, 2011 11:10 AM
  • Hi, thanks for the reply.

    I am not sure I understand your answer, maybe I didn't explain the question properly.

    But for clarity: in SSIS, if I have 3 column names defined that I want to import (EmployeeID, FirstName & Surname), then when a user fills out the csv to be imported I am expecting:

    AA1  Joe    Bloggs
    AA2  Sally  Anne
    AA3  Tim    Smith

    If for whatever reason they decide to add more data to the cell next to surname, then I want to ignore that, i.e.

    AA1  Joe    Bloggs  Blaah (I want to ignore the last column cell and just import the first 3)
    AA2  Sally  Anne    Blaah (I want to ignore the last column cell and just import the first 3)
    AA3  Tim    Smith
    AA4  Fae   

    In SSIS in the columns tab for the above data it is showing the following;

    EmployeeID    FirstName   Surname
    AA1           Joe         Bloggs,
    AA2           Sally       Anne,
    AA3           Tim         Smith,
    AA4           Fae         ,

    Does that explain my issue any more clearly? The only other information I failed to mention is that the OutputColumnWidth for each field is 255.

    Wednesday, April 13, 2011 11:43 AM
  • Hi, thanks for your reply. I was hoping I could maybe use a Script Task component to dynamically check the number of columns and bring them in that way.
    Wednesday, April 13, 2011 11:45 AM
  • I think I now understand a little more what your situation is: your problem is that you are not in control of the source data file.

    If you can guarantee that your first 3 columns will always be there, then there is a simple solution:

    1. Create your flat file connection and set the format to ragged right, which will import your file as one column only.

    2. Use the Derived Column transformation to extract the first, second and third columns.

    3. This ensures you only import 3 fields.

    An example of a derived column expression can be:

    SUBSTRING([Column 0], 1, FINDSTRING([Column 0], ",", 1) - 1) as col1

    SUBSTRING([Column 0], FINDSTRING([Column 0], ",", 1) + 1, FINDSTRING([Column 0], ",", 2) - FINDSTRING([Column 0], ",", 1) - 1) as col2

    etc
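    If it helps to see the logic outside SSIS expression syntax, here is the same comma-position arithmetic sketched in Python (purely illustrative; the sample row is an assumption):

    ```python
    # The whole row arrives as one string ([Column 0] in SSIS terms); the
    # first three fields are cut out by locating the comma positions, the
    # same way the FINDSTRING/SUBSTRING expressions do it.
    row = "AA1,Joe,Bloggs,Blaah"

    first = row.find(",")              # position of the 1st comma
    second = row.find(",", first + 1)  # position of the 2nd comma
    third = row.find(",", second + 1)  # position of the 3rd comma (-1 if none)

    col1 = row[:first]
    col2 = row[first + 1:second]
    col3 = row[second + 1:third] if third != -1 else row[second + 1:]
    print(col1, col2, col3)  # AA1 Joe Bloggs
    ```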

    Wednesday, April 13, 2011 12:24 PM
  • Take a look at this:

    http://microsoft-ssis.blogspot.com/2011/02/script-component-as-source.html

    It's about a CSV file where not all columns are filled (or extra columns are added).


    Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com
    Wednesday, April 13, 2011 12:34 PM
  • Hi Steven. Thanks for your reply. I've come across that solution and it seems good; however, I can't guarantee that the columns I have defined will always contain a value. If that's the case, I have some business logic later on in SQL Server which flags those records as validation errors. I simply want to stop SSIS giving me extra characters in my output columns if a user inadvertently adds extra data to a column that has not been defined (this could be quite easy to do in my case, especially as the csv files do not contain header rows).
    Wednesday, April 13, 2011 3:35 PM
  • Have you any more suggestions?
    Wednesday, April 13, 2011 4:10 PM
  • Hi SSISJoost,

    Thanks for the link. Is there any way that I can do it while still using the flat file connection manager?

    Thursday, April 14, 2011 8:34 AM
  • Hi SSISJoost,

    Thanks for the link. Is there any way that I can do it while still using the flat file connection manager?

    You can probably use the connection string of that connection manager, but this script isn't using the other properties of the connection manager (such as column names).

    The columns are defined in the Script Component instead of the connection manager.

    But feel free to try different approaches...

     


    Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com

    Thursday, April 14, 2011 8:44 AM
  • Hi SSISJoost,

    Thanks for the link. Is there any way that I can do it while still using the flat file connection manager?

    And if you don't like scripting C# or VB.NET, you can always read all columns as one big string and split it into smaller pieces/columns using a Derived Column.

    You will need the FINDSTRING expression and the SUBSTRING expression to accomplish that.

     

    Another option is to ask for correct csv files from your source. These are the basic csv rules: http://en.wikipedia.org/wiki/Comma-separated_values#Basic_rules

     


    Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com
    Thursday, April 14, 2011 8:51 AM
  • Good point SSISJoost. I don't mind the scripting, that's not a problem. I'm just worried about how I would handle errors/occasions where the data itself contains commas (,)! Any ideas on how I can handle that, or is that a separate question for another post? CSVs... argh!
    Thursday, April 14, 2011 9:00 AM
  • Good point SSISJoost. I don't mind the scripting, that's not a problem. I'm just worried about how I would handle errors/occasions where the data itself contains commas (,)! Any ideas on how I can handle that, or is that a separate question for another post? CSVs... argh!


    That could be quite complicated... but the regular flat file connection manager has the same problem. If you add extra commas to your data you will need to use qualifiers, otherwise you will get errors... and then you have to reckon with qualifiers within your text...

    So an incorrectly supplied CSV can be quite a pain :-(
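    For what it's worth, qualified fields are exactly what solves the embedded-comma case: a comma inside a quoted field is no longer treated as a column separator, so the column count stays right. A quick illustration using Python's csv module, just to show the principle (the sample data is an assumption; in SSIS you would set the Text qualifier on the connection manager instead):

    ```python
    import csv
    import io

    # Assumed sample data: the third field of the first row contains a
    # comma, so it is wrapped in text qualifiers (double quotes).
    data = 'AA1,Joe,"Bloggs, Jr."\nAA2,Sally,Anne\n'

    rows = list(csv.reader(io.StringIO(data)))
    for row in rows:
        print(len(row), row)
    # 3 ['AA1', 'Joe', 'Bloggs, Jr.']
    # 3 ['AA2', 'Sally', 'Anne']
    ```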

     


    Please mark the post as answered if it answers your question | My SSIS Blog: http://microsoft-ssis.blogspot.com
    Thursday, April 14, 2011 9:13 AM