none
Export SQL View to Flat File with UTF-8 Encoding

    Question

  • I've setup a package in SSIS to export a SQL view to a flat file and it's working fine.  I now need to make that flat file UTF-8 encoded.  The package executes but still shows the files as ANSI encoded.

    My package consists of a Source (SQL View) -> Derived Column (casts the fields to DT_WSTR) -> Destination Flat File (Set to output UTF-8 file).

    I don't get any errors to help me troubleshoot further.  I'm running SQL Server 2005 SP2.
    Friday, December 12, 2008 6:08 PM

Answers

  • You are right, guys. It looks like we do not add BOMs to UTF-8 encoded files. We treat it like any other ANSI file encoded using specified code page. The UTF-8 (65001) code page is kind of specific, and we should have probably added the appropriate BOM to such files. Feel free to open a bug for this on the connect site (connect.microsoft.com).

    In the meantime, I can think of a workaround where you prepare your files in advance (with a proper BOM in it) and then append data to it by unchecking the "Overwrite data in the file" option in the Flat File Destination editor.

    HTH,
    Bob
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Monday, December 15, 2008 9:30 PM
    Owner

All replies

  • How did you configure your Flat File Destination to export into UTF-8 file? I am assuming you set the code page on the flat file connection manager to 65001.

    Thanks,
    Bob
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Saturday, December 13, 2008 1:22 AM
    Owner
  • Bob,
    Yep, that's exactly what the settings are...  Paul e-mailed me offline and just today I had a chance to repro it on 3042.  Setting the flat file connection manager to 65001, still yields a file that is encoded in ANSI format.

    Phil

    EDIT - Scratch this post...  Most editors can't tell the difference between a UTF-8 and an ANSI encoded file if the data is only standard ascii and a byte-order-marker isn't included.  See my next post below.
    Sunday, December 14, 2008 12:48 AM
    Moderator
  • Unless there is a Byte-Order-Marker (BOM - hex file prefix: EF BB BF) at the beginning of the file, and unless your data contains non-ASCII characters, I'm unsure there is a technical difference in the files, Paul.

    That is, even if the file is "encoded" UTF-8, if your data is only ASCII values (decimal values 0-127, hex 00-7F), UTF-8 doesn't really serve a purpose over ANSI encoding.  Now if you're looking for UTF-8 with specifically the BOM included, and your data is all standard ASCII, the Flat File Connection Manager can't do that, it seems.

    What the flat file connection manager is doing correctly though, is encoding values that are over decimal 127/hex 7F in UTF-8 when the encoding of the connection manager is set to 65001 (UTF-8).

    Example:
    Input data built with a script component as a source (code at the bottom of this post) and with only one WSTR output column hooked to a flat file destination component:

    a string containing only decimal value 225 (german Eszett character - ß)

    Encoding set to ANSI 1252 looks like:
    E1 0D 0A (which is the ANSI encoding of the decimal character value 225 (E1) and a CR-LF (0D 0A)

    Encoding set to UTF-8 65001 looks like:
    C3 A1 0D 0A  (which is the UTF-8 encoding of the decimal character value 225 (C3 A1) and a CR-LF (0D 0A)

    Note that for values over decimal 127, UTF-8 takes at least two bytes and up to four for the remaining values available.

    So, I'm comfortable now, after sitting down and going through this, that the flat file connection manager is working correctly, unless you need a BOM.

    1 Imports System  
    2 Imports System.Data  
    3 Imports System.Math  
    4 Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper  
    5 Imports Microsoft.SqlServer.Dts.Runtime.Wrapper  
    6  
    7 Public Class ScriptMain  
    8     Inherits UserComponent  
    9  
    10     Public Overrides Sub CreateNewOutputRows()  
    11         Output0Buffer.AddRow()  
    12         Output0Buffer.col1 = ChrW(225)  
    13     End Sub 
    14  
    15 End Class 

    Phil
    Sunday, December 14, 2008 1:56 AM
    Moderator
  • When I was testing my project which consisted of five different exports one of them would export and show that it was UTF-8 without BOM and the others would only show ANSI even when I set them up identically.  I was using Notepad++ to view the files and encodings which would show me if it had the BOM or not and I could never get it to reflect UTF-8 except on one of the exports.

    I've tested using a program called Unifer to convert the files after the export and it seems to be doing the encoding and adding the BOM as I need.
    Monday, December 15, 2008 4:33 PM
  • But if that extra file had only ascii characters in it, the encoding looks exactly the same as UTF-8.  Unless there is a BOM, of course.
    Monday, December 15, 2008 4:41 PM
    Moderator
  • Phil Brammer said:

    Now if you're looking for UTF-8 with specifically the BOM included, and your data is all standard ASCII, the Flat File Connection Manager can't do that, it seems.


    Just re-read your above post.  Do you have any other suggestions for adding the BOM so it would be recognized as UTF-8?  Or any other solution for this?
    Monday, December 15, 2008 4:52 PM
  • I've asked Bob to take a moment of his busy time to step back in and comment.

    [Microsoft follow-up]

    Thanks,
    Phil
    Monday, December 15, 2008 8:59 PM
    Moderator
  • You are right, guys. It looks like we do not add BOMs to UTF-8 encoded files. We treat it like any other ANSI file encoded using specified code page. The UTF-8 (65001) code page is kind of specific, and we should have probably added the appropriate BOM to such files. Feel free to open a bug for this on the connect site (connect.microsoft.com).

    In the meantime, I can think of a workaround where you prepare your files in advance (with a proper BOM in it) and then append data to it by unchecking the "Overwrite data in the file" option in the Flat File Destination editor.

    HTH,
    Bob
    This posting is provided "AS IS" with no warranties, and confers no rights.
    Monday, December 15, 2008 9:30 PM
    Owner
  • Thank you both for your follow-ups to my questions.  I'll log a bug as you mentioned.  I don't think the suggestion will work in my situation though as I need to do a nightly dump of a SQL View and if I append the file will grow daily w/ duplicate data.  Again, I really appreciate your time and help!
    Monday, December 15, 2008 9:41 PM
  • panderson253 said:

    Thank you both for your follow-ups to my questions.  I'll log a bug as you mentioned.  I don't think the suggestion will work in my situation though as I need to do a nightly dump of a SQL View and if I append the file will grow daily w/ duplicate data.  Again, I really appreciate your time and help!



    Nah, that should be easy to do...  Have a template file with just the BOM sitting in a directory.  Then at the start of your package, use a filesystem task to copy that file to your output location...  Then in the later task where you append to that file, it's appending to a new file...
    Monday, December 15, 2008 10:21 PM
    Moderator