sqlcmd is not processing unicode (utf-8) files correctly

  • Question

  • Our SQL script, which inserts Arabic characters into some tables, is encoded as UTF-8 (without a BOM). When the script is executed by our deployment script, which uses sqlcmd, we see weird symbols in the tables rather than Arabic. We have tried the following:

    - Run the UTF-8 script via sqlcmd (issues).

    - Run the UTF-8 script via SSMS (no issues).

    - Save the script as UTF-8-BOM and run it via sqlcmd (no issues).

    - Run the UTF-8 script via sqlcmd with -f 65001 (no issues).

    We are using the last option (we updated our deployment script to use -f 65001) as a workaround, but we really need to know why sqlcmd behaves this way. I'm surprised by these test results. Does sqlcmd not know the input encoding unless a BOM is included? Is this a bug, and is a fix available? We're running SQL Server 2014.
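The "save the script as UTF-8-BOM" workaround can also be automated before deployment. The following is a minimal sketch, not part of the original deployment script: the helper name `ensure_utf8_bom` is hypothetical, and it assumes the script file is small enough to read into memory.

```python
import codecs
from pathlib import Path


def ensure_utf8_bom(path: str) -> bool:
    """Prepend a UTF-8 BOM to the file at `path` if it lacks one.

    Returns True if the file was rewritten, False if it already had a BOM.
    (Hypothetical helper for illustration only.)
    """
    data = Path(path).read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    # Raise UnicodeDecodeError if the content is not valid UTF-8,
    # rather than labelling an arbitrary file as UTF-8.
    data.decode("utf-8")
    Path(path).write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Calling `ensure_utf8_bom("deploy.sql")` (the file name is only an example) rewrites the file once; a second call leaves it unchanged.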


    • Edited by kjs1028x Thursday, September 22, 2016 2:23 AM
    Thursday, September 22, 2016 2:22 AM

Answers

  • "Unicode" in Microsoft speak is UTF-16LE. UTF-8 is another "Multi-byte character set".

    Besides, trying to detect UTF-8 without a BOM is advanced guesswork and bound to fail at some point. Explicitly specifying the code page appears to be a much better option.
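To illustrate why the BOM matters, here is a rough sketch of BOM sniffing. It is not sqlcmd's actual implementation, which is not shown in this thread; the function name `sniff_bom` is hypothetical. It only shows that a BOM gives a reader something definite to test for, while a BOM-less UTF-8 file gives no such signal.

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    (Hypothetical helper; a sketch of the idea, not sqlcmd's logic.)
    """
    if data.startswith(codecs.BOM_UTF8):      # bytes EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None


# Five Arabic letters, written as escapes to keep the source ASCII-only.
arabic_utf8 = "\u0645\u0631\u062d\u0628\u0627".encode("utf-8")
arabic_utf8_bom = codecs.BOM_UTF8 + arabic_utf8
```

With these values, `sniff_bom(arabic_utf8_bom)` identifies the file as UTF-8, while `sniff_bom(arabic_utf8)` returns `None`, so a reader would have to guess or fall back to a default code page.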

    Thursday, September 22, 2016 9:52 PM

All replies

  • From the sqlcmd page in BOL:

    Input/Output Options

    -f codepage | i:codepage[,o:codepage] | o:codepage[,i:codepage]
    Specifies the input and output code pages. The codepage number is a numeric value that specifies an installed Windows code page.

    Code-page conversion rules:

    • If no code pages are specified, sqlcmd will use the current code page for both input and output files, unless the input file is a Unicode file, in which case no conversion is required.

    • sqlcmd automatically recognizes both big-endian and little-endian Unicode input files. If the -u option has been specified, the output will always be little-endian Unicode.

    • If no output file is specified, the output code page will be the console code page. This enables the output to be displayed correctly on the console.

    • Multiple input files are assumed to be of the same code page. Unicode and non-Unicode input files can be mixed.

    Enter chcp at the command prompt to verify the code page of Cmd.exe.

    Check what the chcp command returns in the session where you run sqlcmd, and see whether that explains the issue.


    Martin Cairney SQL Server MVP

    Thursday, September 22, 2016 2:37 AM
  • Thanks for the reply.

    PS C:\Users\Administrator> chcp
    Active code page: 437

    However, I thought this should not matter, because the input SQL script is UTF-8, i.e. Unicode, and according to the sqlcmd page:

    • If no code pages are specified, sqlcmd will use the current code page for both input and output files, unless the input file is a Unicode file, in which case no conversion is required.

    • sqlcmd automatically recognizes both big-endian and little-endian Unicode input files. If the -u option has been specified, the output will always be little-endian Unicode.
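The "weird symbols" are consistent with the UTF-8 bytes being read as code page 437. The sketch below assumes that sqlcmd falls back to the current code page (437 in the chcp output above) when it does not recognise the file as Unicode; it reproduces the effect in Python rather than in sqlcmd itself.

```python
# Five Arabic letters, written as escapes to keep the source ASCII-only.
text = "\u0645\u0631\u062d\u0628\u0627"

# What is actually stored in a UTF-8 script file: two bytes per letter.
utf8_bytes = text.encode("utf-8")

# What a reader sees if it treats those bytes as code page 437:
# each byte becomes one unrelated character.
mojibake = utf8_bytes.decode("cp437")

# The bytes themselves are intact, so reversing the wrong decoding
# and applying the right one recovers the original text.
recovered = mojibake.encode("cp437").decode("utf-8")
```

Here `mojibake` has ten characters instead of five, none of them Arabic, while `recovered` equals `text`. The file content is unchanged in either case; only the code page assumed while reading differs.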

    Thursday, September 22, 2016 2:46 AM
  • Hi, could you please post the sqlcmd example for the following?

    UTF-8 script via sqlcmd with -f 65001 (no issues).

    I tried the following, but it did not work.

    sqlcmd -S %1 -d %2 -U %3 -P %4 -f 65001 -b -i "filepath"

    Thursday, August 16, 2018 9:07 PM
  • Please start a new thread, explaining your problem from beginning to end and do not piggyback on an old thread.

    Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

    Thursday, August 16, 2018 9:45 PM