Problem reading tab-delimited flat file source with missing tabs

    Question

  • I have a tab-delimited flat file with, say, 60 columns. Any column can hold a null value, which the file represents as an empty field between tabs.

    Now, if a row has null values in its last 10 columns, there should ideally be 10 blank tabs padded onto the end of that row. But that is not the case with this file: in the example above, the line break comes right after column 50.

    Extrapolating from this behavior, most rows have null values in their last 'x' columns, where 'x' varies from row to row.

    When I try to customize a flat file connection for this file, SSIS does not pad the missing tabs. In the example above, it should insert nulls into the last 10 columns of that row. Instead, and weirdly, it continues reading into the next row, so column 1 of the next row ends up in column 51 of the row above.

    Even Excel 2000 handles this same file easily with the same delimiters (row: LF, column: tab).

    Any ideas?
    Wednesday, August 03, 2005 8:59 PM

Answers

  • The flat file parser parses data column by column, using the current column's delimiter and then advancing to the next one. The row delimiter is treated as nothing more than the column delimiter of the last column.

    We considered adding what you are asking for: always check for the row delimiter in addition to the current column delimiter, and if the row delimiter appears before the expected column delimiter, pad the remaining columns of that row.

    We decided not to implement this by default for performance reasons. It is likely we will have it in the next version of the product, but only as an option that has to be explicitly selected.
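
    The padding check described above is simple to apply as a pre-processing step outside SSIS; here is a minimal sketch (Python for illustration; it assumes a tab column delimiter and takes the first row as defining the full column count):

```python
def pad_short_rows(lines, delimiter="\t"):
    """Pad rows that end early so every row has as many columns as the first."""
    fixed = []
    column_count = None
    for line in lines:
        columns = line.rstrip("\r\n").split(delimiter)
        if column_count is None:
            column_count = len(columns)  # the first row defines the width
        # When the row delimiter arrives early, add one empty field per
        # missing column instead of bleeding into the next row.
        columns += [""] * (column_count - len(columns))
        fixed.append(delimiter.join(columns))
    return fixed
```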

    HTH.

    Thanks,
    Wednesday, August 03, 2005 10:00 PM

All replies

  • Thanks, Bob, for clarifying that this feature was dropped.

    However, I was never disappointed with the performance of DTS 2000 on such a file with 350,000 rows. It would be nice to see this feature back, even if it needs explicit selection. I can bet it would be better than writing my own buggy ;-) parser routine.

    Now, will this solution work?

    Step 1: Create a DTS 2000 package that reads the file properly and outputs it either as a text file with a vertical-bar delimiter or into a SQL Server 2005 database table.

    Step 2: Migrate this package to SQL Server 2005 using the Migration Wizard.

    Step 3: In the SSIS designer, use an 'Execute Package' control flow task to run the migrated package.


    Any inputs?


    Thursday, August 04, 2005 12:47 PM
  • For other people having the same problem, here is a script you can use to fix the missing tabs with a "Script Component" transformation:

    Please note that this approach works only if the first row has column names.
    Feel free to mail me if you need more insight on this.

    Thanks,
    nitesh.ambastha@csfb.com


    ----------------------------------------------------------

    Imports System
    Imports System.Text
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
    Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

    Public Class ScriptMain
        Inherits UserComponent

        Private columnCount As Int32 = 0
        ' The first row contains column names.
        Private dataRow As Boolean = False

        Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)
            Dim columns As String() = Nothing
            Dim buffer As StringBuilder = New StringBuilder()

            ' Keep the row text, minus the trailing row-delimiter character.
            buffer.Append(Row.Column0.Substring(0, Row.Column0.Length - 1))
            columns = Row.Column0.Split(New [Char]() {Chr(9)})

            ' The header row defines how many columns every row should have.
            If Not dataRow Then
                columnCount = columns.Length
            End If

            ' Pad short rows with one tab per missing column.
            Dim N As Integer
            If columns.Length < columnCount Then
                For N = 1 To columnCount - columns.Length
                    buffer.Append(Chr(9))
                Next N
            End If

            dataRow = True
            buffer.Append(Chr(13))
            Row.columnOut = buffer.ToString()
        End Sub
    End Class

    Friday, August 05, 2005 12:45 PM
  • This is great!

    I'm having a similar problem with a file with pipe-delimited columns. I think I understand your script (I'm getting up to speed with programming). To help me understand it better, what part of the code would you change in my case? Also, the column that is sometimes missing is the last column.

    I know there are other ways to get clean data, i.e. making sure the data extract is correct before I receive it, but this looks like a good exercise for later.

    Thanks for your help
    Thursday, August 11, 2005 6:26 PM
  • Has anybody modified this script to handle the opposite problem, which is also not backwards compatible? I.e., we have some rows with extra delimiters on the end, which are getting included in the data for the last column.

    It would be something like this:

    If columns.Length > columnCount Then

    End If

    where you would have to pull X characters off the end of the buffer, where X = columns.Length - columnCount (i.e. pull X delimiters off the end of the buffer before writing it out).

    Just need to figure out how to do that part.

     

    thanks
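
    The trimming step asked about above can be sketched like this (Python for illustration; it assumes the surplus trailing fields carry no data and can simply be dropped):

```python
def trim_extra_columns(line, column_count, delimiter="\t"):
    """Drop surplus trailing fields so the row has exactly column_count columns."""
    columns = line.split(delimiter)
    if len(columns) > column_count:
        columns = columns[:column_count]  # pull the extra fields off the end
    return delimiter.join(columns)
```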

    Friday, March 10, 2006 4:11 PM
  • Just set the last column to x characters in length, and change the error output to ignore truncation for that final data column - so you are just truncating the extra stuff. Make sense?

    knowthyself, is your script a source or a transform component? We too have discovered this flaw, which is going to cost painstaking hours of tweaking just so we don't get an extra row {crlf} in the focused row transform! Are you handling/filtering bad rows similarly to what the ff_source allows? If so, how do you do it?

    Friday, March 10, 2006 7:34 PM
  • Here is an alternate solution using a stored procedure to fix rows; it handles both missing and extra columns, as well as text-delimited fields.

     

    set ANSI_NULLS ON
    set QUOTED_IDENTIFIER ON
    go

    ALTER PROCEDURE [dbo].[Fix_input_line](

          @columns int,
          @inline varchar(8000),
          @terminator char(1),
          @textdelimiter char(1),
          @debug int,
          @outputline varchar(8000) output)

    AS

    BEGIN


     declare @outline varchar(8000);
     declare @current varchar(8000);
     declare @done int;
     declare @len int;
     Declare @last int;
     Declare @currentloc int;
     Declare @patindex varchar(3);
     Declare @word varchar(1000);
     Declare @colcount int;
     Declare @end int;
     Declare @totallinelen int;

     set @patindex = '%' + @terminator + '%';

    set @current=@inline;
    set @done = 0;
    set @last = 0;
    set @colcount = 0;
    set @totallinelen = len(@inline);
    set @outline = '';

    set @currentloc = Patindex(@patindex,@current);
    if (@currentloc = 0)
      set @last = 1;


    while ((@currentloc != 0) or (@done != 1))
    begin

      set @len = len(@current);

      if (@last = 1)
        begin
          set @currentloc = @len;
          set @end = @len;
        end
      else
        begin
         set @end = @currentloc - 1;
        end

       set @word = Substring(@current,1,@end);
       if (@debug = 1) print 'got column = ' + @word

       if (@textdelimiter is not null and @textdelimiter != '')
           begin
              if (@debug = 1) print 'comparing ' + substring(@word,@end,@end) + ' to text delimiter'

             if (left(@word,1) = @textdelimiter) and (substring(@word,@end,@end) != @textdelimiter)
               begin
                   declare @currentalt varchar(8000);
                   declare @char char(1);
                   declare @endnotfound int;
               -- found a text delimited column keep going
               if (@debug = 1) print ' found text delimited string, skipping...';
               set @currentalt = Substring(@current,@currentloc+1,@len-1)
               set @char = Substring(@current,@currentloc,1)
               set @endnotfound = 0;
              while (@char != @textdelimiter)
                begin
        
                   if (@debug = 1) print 'in while loop searching for ' + @textdelimiter + ' cc = ' + @char;
        
                  set @currentloc = @currentloc + 1;
                  if (@currentloc = @totallinelen + 1)
                    begin
                       set @endnotfound = 1;
                       break;
                    end
                  set @char = Substring(@current,@currentloc,1)
                end
     
               if (@debug = 1) print 'found text ending delimiter';
               if (@endnotfound = 1)
                 begin
                     print 'ERROR: in input file NO text terminator found';
                     set @outline = @inline;
                     return;
                 end

               set @currentloc = @currentloc + 1;
               set @char = Substring(@current,@currentloc,1)

                  if ((@char != @terminator) and (@currentloc != @len -1))
                   begin
                      print 'ERROR: in input file no terminator after text delimiter';
                     set @outline = @inline;
                     return;
                    end

               -- now reset data to go on
               set @len = len(@current);
               if (@last = 1)
                 begin
                  set @currentloc = @len;
                  set @end = @len;
                  end
               else
               begin
                 set @end = @currentloc - 1;
               end

              set @word = Substring(@current,1,@end);
               end -- found text del. string
            end -- have text delimited operations


     set @colcount = @colcount + 1;
     if (@debug = 1) print 'doing column checks colcount = ' + cast(@colcount as varchar(15)) + ' columns = ' + cast(@columns as varchar(15));
     if (@colcount < @columns)
        set @outline = @outline + @word + @terminator;
     if (@colcount = @columns)
        set @outline = @outline + @word;

      else
        if (@last = 1)
           begin
              while (@colcount < @columns-1)
              begin
                set @outline = @outline + @terminator;
                set @colcount = @colcount + 1;
              end
            end

     

       if (@last != 1)
        begin
           set @current= Substring(@current,@currentloc+1,@len-1);
           set @currentloc = Patindex(@patindex,@current);
           if (@currentloc = 0)
             begin
              set @last = 1;
             end
        end
        else
         begin
            set @done = 1;
            set @currentloc = 0;
         end
     
     if (@debug = 1)
      begin
       print 'at end of loop outline so far = ' + @outline;
       print ''
      end

     end -- while loop

     select @outputline = @outline;
    end

     

    Monday, March 13, 2006 3:37 PM
  • Truly ragged comma-delimited formats are very common in the telecommunications world, where gigabytes of data a day are spit out in this format.

    It's inefficient in the real-time world to pad the rows with commas or whatnot when a line can contain anywhere from one column of data to 200.

    The row delimiter used to be read before the column delimiter in the DTS world... it seems logical. I do understand that someone simply tried to generalize, but this shouldn't have gone out the door without an option to treat a row of data as a row.

    Having said that, is there any way to extend the connection manager/data source to make the row delimiter be read first? Any fix available, etc.? We're currently stuck using SSIS for this, and I do not want to be releasing hacks such as the ones listed below into production.

    But surely, if this cannot be achieved in a clean way, management might believe the techs that SQL 2005 is still a beta product and won't be production quality until a service pack or two, and thus can be convinced to go with what the techs preferred in the first place: Oracle. Although this might sound extreme for a simple import, it's a flaming sword in the fight against letting MS technologies into a Unix-dominated world.

     

    Friday, March 17, 2006 4:40 PM
  • Sadly, I must agree with loki55; it is truly disheartening to see this not available in the main product. It is becoming quite a task now to scope out this new workaround, and our organization and new prospects are looking outside of SSIS for this particular feature in new projects going forward. Is there anyone at MS (Kirk?) who can shed some light on the reality of this matter? Is it going to be incorporated into a service pack, or are there any customizations (outside of parsing, then re-parsing) that aren't going to be so tedious to implement?

    Wednesday, March 22, 2006 5:37 PM
  • Same problem...

    Did MS fix it yet????
    Wednesday, November 08, 2006 7:25 PM
  • No, MS didn't fix it. Here is our final solution, which is better than the SP.

    Since we normally call DTS from a batch file anyway, I wrote a C program that fixes the column issues and call it first, before calling the DTS package.

    Here is the usage:

    C:\J:\dbd\fixi>fixinput
    FixInput (c) 2006 larry kahn - icentrix.com (kahn@lgk.com)

    syntax: inputfile outputfile columns terminator delimiter debug

    Note: to pass a quote or a comma, put it in quotes, i.e.: fixinput infilename.csv outfilename.csv 10 "," """ 1
    To pass a tab character as the terminator, pass T, i.e.: fixinput infilename.txt outfilename.txt "T" "" 1

    Here is an example:

    C:\OPar>cat icxauth.bat
    fixinput ICXAUTH.txt FIXEDICXAUTH.txt 17 "|" """ 0
    dtexec /F F:\OPar\ICXAUTH.dtsx
    C:\OPar>

    It will pad extra columns as necessary, or delete them if needed.

    Here is the file:

    fixinput: http://stage1.icentrix.com/fixinput.exe

     

    • Proposed as answer by Nigel Ainscoe Wednesday, February 03, 2010 2:07 PM
    Wednesday, November 08, 2006 8:07 PM
  • Instead of a transformation, I wrote a Script Task:

    Imports System
    Imports System.Data
    Imports System.Math
    Imports Microsoft.SqlServer.Dts.Runtime

    Public Class ScriptMain
        Public Sub Main()
            Dim rowCount As Int32 = 0, columnCount As Int32 = 0
            Dim columns As String(), currentRow As String
            Dim sourceFile As String = CStr(Dts.Variables("MySourceFile").Value)
            Dim destFile As String = Dts.Connections("MyDestinationFile").ConnectionString
            Dim reader As System.IO.StreamReader = New System.IO.StreamReader(sourceFile)
            Dim writer As System.IO.StreamWriter = New System.IO.StreamWriter(destFile, False)
            Do
                currentRow = reader.ReadLine()
                If currentRow Is Nothing Then Exit Do ' guard against an empty file
                writer.Write(currentRow)
                columns = currentRow.Split(New [Char]() {Chr(9)})
                ' The first row defines the expected column count.
                If rowCount = 0 Then
                    columnCount = columns.Length
                End If

                ' Pad short rows with one tab per missing column.
                Dim N As Integer
                If columns.Length < columnCount Then
                    For N = 1 To columnCount - columns.Length Step 1
                        writer.Write(Chr(9))
                    Next N
                End If

                writer.Write(Chr(13))
                writer.Write(Chr(10))
                rowCount += 1
            Loop Until reader.Peek = -1
            reader.Close()
            writer.Flush()
            writer.Close()
            reader = Nothing
            writer = Nothing

            Dts.TaskResult = Dts.Results.Success
        End Sub
    End Class

    Wednesday, November 08, 2006 8:43 PM
  • When will the next version be available? This would be very helpful to us. Thanks,

     

    Wednesday, January 17, 2007 7:33 PM
  • After reading all the posts, I agree that this was a feature in DTS that unfortunately was not carried forward into SSIS. Too bad. When using DTS, we used .csv instead of .xls because DTS didn't handle .xls as well. Now things seem to have reversed themselves.
    Friday, July 06, 2007 6:10 PM
  • You may try DataDefractor. It handles issues like a variable number of input columns pretty well.

    Regards,
    Ivan

    Sunday, August 19, 2007 6:13 PM
  •  

    Hi,

    I'm also having a similar issue. I have a tab-delimited file without a header.

    Some columns in between can have null values, but I get an error while trying to extract the data from the tab-delimited flat file into a SQL Server 2005 table (using a Flat File connection manager, a Flat File source, and a SQL destination).

     

    Thanks

    Wednesday, September 19, 2007 2:57 PM
  • What error message are you receiving? Do you have missing tabs, or just null or empty values in certain columns?

     

    Friday, September 21, 2007 3:12 PM
  • I found the easiest way out of this import problem.

    If you have SQL Server 2000, create a DTS package to import the file into a temporary table, embed that DTS package in an SSIS package, and carry on with your work.

    I spent nearly a whole day trying to achieve this task in other ways, but this is the easiest and, above all, it works.

     

     

    Tuesday, October 09, 2007 6:44 AM
  • What about files like ours, where extra tabs can appear in the middle of a row instead of at the end? Also, some columns can be NULL. We thought the only solution was some kind of column pattern matching. Does anyone know how to do this in SSIS?

    Friday, November 02, 2007 3:21 PM
  • See this post for an example of using a script to handle inconsistent delimiters.

     

    http://agilebi.com/cs/blogs/jwelch/archive/2007/05/07/handling-flat-files-with-varying-numbers-of-columns.aspx

     

    Tuesday, November 06, 2007 1:59 AM
  • Hi,

     

    I want to thank you for providing such a valuable tool in fixinput.exe. It has proven to be powerful, fast, and stable every time I have used it. Recently, though, I noticed a few issues with it when repairing CSV files.

    Text before conversion:

    % write,23.184,18.778

    Text after conversion:

    rite,23.184,18.778

    The conversion process is eliminating the "%" sign and the two characters after it whenever a "%" appears in the text. Any help on this matter is highly appreciated.

    Saturday, March 08, 2008 1:30 PM
  • Bob -

    I like the straightforward way you've described a solution to this problem.  Implementing this technique would give the developer the choice of tradeoffs.  As others have stated, there's really no 'performance advantage' to be gained when additional custom code must be built to fill the void.

    Since it's been three years now since your post of 03 Aug 2005 . . .


    What is the status of this design idea?  Is it now included in SQL SSIS 2008?  Is it available as an update to SQL 2005?  No, the workarounds I've seen are not really an option.  Currently, I'm doing my flat file ETL work using an ACCESS front end to SQL where it may be slow, but it gets the job done.  I'd rather move the entire project to SQL and SSIS, but can't do so because of this issue.  Automation, after all, is the whole purpose of software technology.  If I have to manually intervene in the preparation of a flat file before SSIS can understand it, then what's the point?

    Each month, I receive dozens of flat files (csv, xls, txt, etc.) from multiple sources, every one of which is perfectly formed. Every row of every file is consistent in its number of columns and character delimiters. So what's the problem? Each month, an additional column is included, representing the most recent month's sales figures (for instance). Your suggested solution would allow me to define the incoming file with many months'/years' worth of additional columns, which I could then normalize/clean up with an UnPivot/Union. Simple. (No, I do not have control over the form of the original data source.)   But, enough about me.
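
    The define-wide-then-unpivot plan described above can also be sketched outside SSIS (Python for illustration; the column layout here is hypothetical):

```python
def unpivot_monthly(rows, fixed_cols, delimiter=","):
    """Turn rows like "region,jan,feb,..." with a variable number of trailing
    month columns into (key..., month_index, value) tuples."""
    out = []
    for row in rows:
        fields = row.split(delimiter)
        key, months = fields[:fixed_cols], fields[fixed_cols:]
        for i, value in enumerate(months, start=1):
            out.append((*key, i, value))  # one normalized record per month
    return out
```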

    What is your current thinking on this problem and what information can point me to that brings me up-to-date?

    Thanks and best regards,

    SHL
    Friday, June 20, 2008 7:38 PM
  • Hi TheViewMaster,

    Thanks for your code. I used it for a tab-delimited flat file source. I have one more problem: in some rows, the data in the last column (which contains comments) extends onto a new line, so SSIS treats the continuation as a new row and sends it to the error file. How can I rejoin that line with the column's existing data? I know some VB.NET, but I am not getting this logic.

    Thanks in advance
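
    One way to approach this (a sketch only; it assumes every real row carries its full set of tabs, so only wrapped comment text falls short) is to rejoin continuation lines before the file reaches SSIS:

```python
def rejoin_wrapped_rows(lines, column_count, delimiter="\t"):
    """Merge continuation lines (caused by line breaks inside the last
    column) back into their parent row."""
    rows = []
    for line in lines:
        # A line with too few delimiters is wrapped comment text,
        # not a new row.
        if rows and line.count(delimiter) < column_count - 1:
            rows[-1] += " " + line
        else:
            rows.append(line)
    return rows
```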

    Brahma,


    ------------------------ Brahma http://vbrchowdary.info
    Thursday, October 08, 2009 7:21 PM
  • I'd like to publicly thank you for a fine solution to this surprising (that's the kindest word I can think of) omission in the flat file connector's ability to deal with files with variable numbers of delimited fields. I can't believe this has been passed off as a "feature" in 2005, when anyone who deals with data loading on a daily basis will need it.
    Nigel Ainscoe
    Wednesday, February 03, 2010 2:07 PM