Polybase / External Table and Data Factory failing for Rows with character " é "

  • Question

  • I am using the Polybase / External Table option to load data from a text file to Azure SQL DWH.  For rows that contain the character "é" in a column, it fails with a "UTF8 decode failed" message when doing a SELECT from the External Table, so I am unable to load the data into the target table in Azure SQL DWH.

    I also tried the Data Factory way; it copies the data but drops the "é" character, e.g. if the data is "Café" then only "Caf" is inserted into the target table.

    Any idea how to handle this if I want the data as-is?


    Raghavendra Narayana



    Monday, August 27, 2018 9:42 AM

Answers

  • Hi Raghavendra,

    Are you expecting these characters to convert from "é" to "e"? Can you perform a small exercise where you export your data using bcp with UTF-16 encoding (bcp uses UTF-16 by default) and then run the following PowerShell command on the exported .rtf file:

    Get-ChildItem -rec -fi *.rtf | %{(get-content $_ -enc string) | set-content -enc utf8 $_.fullname}

    This will convert the .rtf file from UTF-16 to UTF-8, which is compatible with Polybase. The idea here is to export a small sample of your data (that contains the undesired characters) using bcp, and then convert this to UTF-8 to see if the characters have been reformatted. 
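    As a concrete sketch of that exercise, the steps might look like the following; the server, credentials, table, and file names are placeholders, not from this thread:

    # Sketch only -- names below are placeholders. Export a small sample with
    # bcp in widechar (UTF-16) mode, then re-encode it to UTF-8 for PolyBase.
    bcp "SELECT TOP 100 Col1 FROM dbo.SampleTable" queryout sample.rtf -w -S myserver.database.windows.net -d mydb -U myuser -P mypassword

    # The one-liner above, spelled out: read each exported file as UTF-16
    # (Unicode), then write it back in place as UTF-8.
    Get-ChildItem -Recurse -Filter *.rtf | ForEach-Object {
        (Get-Content $_.FullName -Encoding Unicode) | Set-Content -Encoding UTF8 $_.FullName
    }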

    There is additional information in a third-party blog post: Loading Data into Azure SQL Data Warehouse

    This is one approach. I am researching others. ~Mike

    Tuesday, September 4, 2018 11:06 PM
    Moderator
  • I found the root cause: the export text file created by the source database is encoded as ANSI when the data contains special characters; otherwise it creates the file in UTF-8 format. The file has to be converted manually, either by opening it in Notepad or by using PowerShell / .NET commands.
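    A minimal sketch of that manual conversion in PowerShell (the path is a placeholder; a BOM-less UTF-8 encoding is used here in case the loader treats a byte-order mark as data):

    # Sketch only -- the path is a placeholder. Read the export using the
    # system's default ANSI code page, then rewrite it as UTF-8 without a BOM.
    $path = 'C:\exports\data.txt'
    $text = [System.IO.File]::ReadAllText($path, [System.Text.Encoding]::Default)
    $utf8NoBom = New-Object System.Text.UTF8Encoding($false)
    [System.IO.File]::WriteAllText($path, $text, $utf8NoBom)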



    Raghavendra Narayana

    Wednesday, September 12, 2018 11:27 AM

All replies

  • Thanks Mike, I will check this.

    But I want to keep "é" as-is in the target, and this issue happens with several other special characters as well; I need to keep them all as-is.  (I have tried a direct INSERT into the table with this character; Azure SQL DWH accepts and stores the data as-is, but it does not work through Polybase or Data Factory.)


    Raghavendra Narayana

    Wednesday, September 5, 2018 10:22 AM
  • Thanks Mike.  That PowerShell command is working fine.

    (I found the root cause: the export text file created by the source database is encoded as ANSI when the data contains special characters; otherwise it creates the file in UTF-8 format. The file has to be converted manually, either by opening it in Notepad or by using PowerShell / .NET commands.)
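    A quick way to check which case a given export falls into before converting (again, the path is a placeholder):

    # Sketch only. A UTF-8 byte-order mark (EF BB BF) at the start of the file
    # suggests UTF-8; its absence suggests ANSI (or BOM-less UTF-8).
    $bytes = [System.IO.File]::ReadAllBytes('C:\exports\data.txt')
    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        'UTF-8 with BOM'
    } else {
        'No UTF-8 BOM: likely ANSI, or UTF-8 without a BOM'
    }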


    Raghavendra Narayana

    Wednesday, September 12, 2018 11:39 AM