none
Double Quotes in CSV File

    Question

  • I have a lot of data coming in from CSV files.  I have many CSV files (~20), with varying amounts of data- but some are quite large (largest file is ~230mb)

    I'm trying to import it into a SQL database via SSIS, but the data is a little bit frustrating. 

    I have data that looks like this:

    "Text from vendor ""Vendor Name, Inc."" blah blah", "Next string", "", 1234

    Many things to notice here - as well you can imagine some of the difficulties here.

    After parsing, this data should have 4 columns-

    Column1                                                                          Column2              Column3               Column4

    Text From Vendor "Vendor Name, Inc." blah blah      Next string            <blank>                1234

    The biggest problems stem from the double quotes mixed in with the comma because it is a comma delimited file with quotes as the text qualifier.  The other problem is the double quotes with blank text ... which prevents me from doing a replace on the double quotes ... I think ...

    What would be your suggestions to help me parse this out?  (don't forget, some of these are large files with about 260,000 records- )

    Thanks in advance,

    Rob

    Tuesday, October 24, 2006 9:20 PM

All replies

  • In addition, if I were to do my own parsing, what would the algorithm be to do that? 

    It would be something like, Find 1st quote - after first quote, if exists double quote, replace with aribtrary character (#)  - then keep parsing through until next single quote is found?

    Then I can go around and replace the arbitrary character with the double quotes later on-

    Does that sound about right?

    Tuesday, October 24, 2006 9:37 PM
  • When you set up your Flat File Source, on the General Tab you will see a text box called "Text qualifier"

    Place a " in the box. It will parse correctly.

    Tuesday, October 24, 2006 10:01 PM
  • It seems like it is bad form on two levels and I would recommend that you not accept that format from a vendor unless tweaks are made.

    1. The quotes are used to seperate data in the CSV and allow the meta character, comma, to be allowed in data such as "$1,110.00". By including quotes within the quoted data that breaks form. Request that single quotes be used within double quotes if needed or require an escape of the quote within the data area such as "Ficus said, \"to be or not\". Was his reply", .
    2. The number in your example should be quoted to, so the form is not broken. Its like any computer grammar, one needs things to be consistent.

    Just my opinion...I could be wrong. <g>

    Tuesday, October 24, 2006 10:06 PM
  • trust me, I have that box checked but it's not parsing correctly
    Tuesday, October 24, 2006 10:10 PM
  • Sorry, I missed that extra comma in the double-quoted text.

    Is requesting a change to the file format out of the question? That's a really broken file. If they could give it to you Tab delimited you would be set. I don't like using printable characters for delimiters for this very reason.

    You could run a RegEx on the files, but with those large files I don't know if any of the software products available can handle them.

    Check out PowerGrep. http://www.powergrep.com/ I would email them first to see if it will choke on those file sizes. You can point it at a directory, and it will loop through all of the files.

    If you're good with .NET you could write your own utility using the FileStream class to stream the file in, and do the RegEx replace. File size wouldn't matter in this case.

    Wednesday, October 25, 2006 2:53 AM
  • We have ActiveX script like this to solve this problem:

    Function Main()
       
    Set fso = CreateObject("Scripting.FileSystemObject")
       
        Set source = fso.OpenTextFile("_data.txt")
        Set destination = fso.CreateTextFile("data_good.txt")
     
           While Not source.AtEndOfStream
               destination.writeline fixQuotes(source.Readline)
           Wend 
     
        Set source = nothing
        Set destination = nothing
    Set fso = nothing

    Main = DTSTaskExecResult_Success

    End Function


    Function fixQuotes(str)
        qiStr = chr(34) & "," & chr(34)
        str = Replace(str,qiStr,"|&|")
        str = Replace(str,chr(34),"")
        fixQuotes = chr(34) & Replace(str,"|&|",qiStr) & chr(34)
    End Function
    Wednesday, October 25, 2006 4:04 AM
  • I've just encoutered this issue same with yours,and have solved like that below

    Using Script component in DF

    and try TextFieldParser class which is new in .net framework 2.0

    Good Luck

     

    Wednesday, October 25, 2006 4:24 AM
  • Awesome - that sounds like what I am looking for - I will need to figure out how to make it work, but I found a reference so I'll just need to implement it.

    Thanks for the suggestion!  I'll post my solution here if I can work it out

    Wednesday, October 25, 2006 3:52 PM
  • VB 2005 Script version:

        Sub ProcessFile(ByVal filename As String)
            Dim sReader As New System.IO.StreamReader(DIR_Downloads & "\" & filename)
            Dim sWriter As New System.IO.StreamWriter(DIR_RawFiles & "\" & filename)
            Do
                sWriter.WriteLine(fixQuotes(sReader.ReadLine()))
            Loop Until sReader.Peek = -1
            sWriter.Close()
            sReader.Close()
        End Sub

        Function fixQuotes(ByVal str As String) As String
            Dim qiStr As String = Chr(34) & "," & Chr(34)   ' ","
            str = str.Substring(1, str.Length - 2)
            str = Replace(str, qiStr, "|&|")
            str = Replace(str, Chr(34), "&#34;")
            fixQuotes = Chr(34) & Replace(str, "|&|", qiStr) & Chr(34)
        End Function
    Friday, December 08, 2006 6:15 PM
  • hi all,

     

    I am also facing the same error with flat file, which consists of ""Investment Accounts""", where in the text qualifier is as ". But the data flow task is throwing an error as "the column delimiter could not found ". Can you please help to remove or replace the double quotes i.e. "".

    Please sugesst me to go step by step.

     

    With Regards,

    anand

    Monday, July 30, 2007 11:48 AM
  • The script described above isn't working for you?

     

    Wednesday, August 01, 2007 12:37 AM
  • I ran into this same issue and when I did a search ended up in this forum. The script advise from ViewMaster to parse out the double quotes in the file using ScriptTask works but it assumes that all the fields are qualified with a double quote. In my case I have fields that are not enclosed in double quotes and some fields that are enclosed in double quotes. Then I spoted the comment from Andew Knight above to use a TextFieldParser class and I tested that out and lo and behold the TextFieldParser class parses the file pefectly fine. Not sure why SSIS flat file connection manager cannot parse it correctly. So here is my solution to this issue. I created a ScriptTask and used the TaskFieldParser class to read the csv file and convert it to tab delmited file and then modified my subsequent tasks to read the tab file. Just for these reasons I prefer to work with Tab delimited file but in this case the input file is beyond my control, so I converted it to tab and went my merry way. Sucks to do extra processing but what else a poor programmer to do.

     

    To recap the issue

    if you have a record like this

    "AU ","MISSOULA ""LAMBROS"", MT","",100328

     

    you would expect it to be parsed as

     col1        col2                                         col3     col4

     AU          MISSOULA "LAMBROS", MT    Null     100328

     

    But SSIS parses it as

    col1         col2                                          col3    col4         col5

    AU           MISSOULA ""LAMBROS"          MT"    Null          100328

     

     

    Here is the code for the ScriptTask if anyone is interested. You will need a reference to Microsoft.VisualBasic add two File Connection Managers one for the csv file and for the tab file.

    Code Snippet

     

    Imports System.IO

    Imports System.Text

    Imports Microsoft.VisualBasic.FileIO

     

    Public Sub Main()

    '

    'This routine uses the VB TextFieldParser class to properly parse the csv file and convert to a

    'tab delimited file and subsequent tasks in the package can read from the tab file. Apparently the

    'TaskFieldParser class does not have this limitation and it works as we expected. Besides it also

    'provides an added feature where by you can trim extra spaces from text fields.

     

    Dim csvFileFullPath As String

    Dim tabFileFullPath As String

    csvFileFullPath = Dts.Connections("CSV File Connection").ConnectionString

    tabFileFullPath = Dts.Connections("TAB File Connection").ConnectionString

     

    Using tabStreamWriter As New StreamWriter(tabFileFullPath)

    Using csvFileReader As New TextFieldParser(csvFileFullPath)

     

    csvFileReader.TextFieldType = FieldType.Delimited

    csvFileReader.Delimiters = New String() {","}

    csvFileReader.HasFieldsEnclosedInQuotes = True

    csvFileReader.TrimWhiteSpace = True

     

    Dim currentRow As String()

    While Not csvFileReader.EndOfData

    Try

    Dim i As Int32 = 1

    Dim outputRow As New Text.StringBuilder()

    currentRow = csvFileReader.ReadFields()

    For Each currentField As String In currentRow

    'currentField = currentField.Replace(Chr(34), Chr(39)) 'replace double quote with single quote if needed

    outputRow.Append(currentField)

    If i < currentRow.Length Then

    outputRow.Append(Chr(9)) 'add a tab for each field except last one

    End If

    i = i + 1

    Next

    tabStreamWriter.WriteLine(outputRow.ToString())

    Catch ex As Microsoft.VisualBasic.FileIO.MalformedLineException

    Dts.TaskResult = Dts.Results.Failure

    End Try

    End While

    End Using

    End Using

    Dts.TaskResult = Dts.Results.Success

    End Sub

     

     

     

    Friday, August 10, 2007 11:44 PM
  • Hi Recruz,
    I had the similar problem and i used text qualifier box in the flatfile source to be " then i added a derived column where i used a replace function as  REPLACE(Column_Name,"& chr(34)","") and that solved my problem.
    Try if  it works for u
    Wednesday, February 06, 2008 10:52 AM
  • How lame this is?  Did Microsoft forgot what qouted identifier means?  This is a bug since begining of 2008 and last quarter of 2010, still its a bug.
    reyshaik
    Monday, October 18, 2010 5:41 PM