Azure SQL DW - PolyBase did not work with UTF-8 files

  • Question

  • We are using Polybase to copy data from blob to Azure SQL Data Warehouse.

    PolyBase did not work with our UTF-8 files, and we are receiving the error message "UTF-8 decode failed". When I converted them to UTF-8 with BOM using a PowerShell script, it worked. Other files in the same folder are also of type UTF-8 and work fine.

    Could you please help us with this?
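    For reference, the BOM conversion described above can be sketched in Python (the thread's actual script was PowerShell; file names and function name here are illustrative):

    ```python
    # Sketch: re-encode a text file as UTF-8 with a BOM, as described above.
    # The "utf-8-sig" codec writes the EF BB BF byte-order mark at the start
    # of the file. Paths and the assumed source encoding are placeholders.
    from pathlib import Path

    def add_utf8_bom(src: str, dst: str, src_encoding: str = "utf-8") -> None:
        text = Path(src).read_text(encoding=src_encoding)
        Path(dst).write_text(text, encoding="utf-8-sig")
    ```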

    Thursday, June 2, 2016 5:36 AM


All replies

  • Hi Anusha,

    You might be interested in taking a look at Data Platform Studio - it has an Azure SQL Data Warehouse importer that handles all the necessary conversions, including UTF-8.

    Disclosure: I work for Redgate on Data Platform Studio. It's in early preview at the moment, but feel free to sign up and get in contact - I'd be happy to bump you up the queue.

    • Edited by Jonathan-R Thursday, June 2, 2016 10:04 AM formatting error
    Thursday, June 2, 2016 10:04 AM
  • Hi Anusha,

    Yes, to load files using PolyBase, they must first be converted to UTF-8. We have some sample code here to help you.
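    The conversion step can be sketched in Python as follows (this is an illustrative streaming approach, not the linked sample code; the source encoding, here windows-1252 as a default, must be known up front):

    ```python
    # Sketch: convert a delimited text file to UTF-8 before loading it
    # with PolyBase. Streams in chunks so large extracts do not need to
    # fit in memory. Function name and encodings are illustrative.
    import codecs

    def to_utf8(src: str, dst: str, src_encoding: str = "windows-1252") -> None:
        with codecs.open(src, "r", encoding=src_encoding) as fin, \
             codecs.open(dst, "w", encoding="utf-8") as fout:
            # read() returns "" at EOF, which terminates the loop
            for chunk in iter(lambda: fin.read(1 << 20), ""):
                fout.write(chunk)
    ```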



    Thursday, June 2, 2016 3:01 PM
  • Hi Sonya,

    Thank you.

    Currently we are using the same PowerShell script.

    Please let us know if you have any future plans to let us specify the encoding name as "UTF-8" in the external file format so that PolyBase will take care of the conversion.



    Friday, June 3, 2016 4:25 AM
  • Hi Jonathan,

    Yes, I am interested and will explore it.

    Thank you.

    Friday, June 3, 2016 4:28 AM
  • Hi Anusha,

    Great idea! We have heard this feedback, but I haven't seen anyone add this feature request to our feedback page yet. I believe it is a high-priority feature, so it would be great if you could add the request to our feedback page, including any details on how the feature could best be implemented to serve your needs. By adding the idea, you will receive updates on the feature's progress. We also periodically give the followers of a feature request access to previews of the feature, for early feedback on whether it meets their needs.



    Friday, June 3, 2016 11:10 PM
  • Hi Sonya,

    We are facing the same challenge. We are copying Oracle data (which has an NLS_CHARACTERSET of UTF-8) into a Parquet file on Azure Blob Storage, and per your suggestion, when I tried to change that Parquet file's encoding to UTF-8 I got the error below:

    EXTERNAL TABLE access failed due to internal error: 'File /oracle/sod_utf: HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: RuntimeException: wasbs:/ is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [82, 49, 13, 10]'

    I used the PowerShell command below to convert the encoding:

    Get-Content sod | Set-Content -Encoding utf8 sod_utf

    And when I tried changing it with C# code, I got a different error:

    Msg 110802, Level 16, State 1, Line 6
    110802;An internal DMS error occurred that caused this operation to fail. Details: Exception: Microsoft.SqlServer.DataWarehouse.DataMovement.Common.ExternalAccess.HdfsAccessException, Message: Error occurred while accessing HDFS external file[/oracle/sod_utf8][0]: Java exception raised on call to HdfsBridge_CreateRecordReader_V2. Java exception message:
    HdfsBridge::CreateRecordReader - Unexpected error encountered creating the record reader: TProtocolException: Required field 'version' was not found in serialized data! Struct: FileMetaData(version:0, schema:null, num_rows:0, row_groups:null)

    Looking forward to your help.
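    As a side note on the first error: a valid Parquet file is binary and both begins and ends with the four-byte magic number `PAR1` (ASCII bytes [80, 65, 82, 49], exactly what the message above says it expected). Running text-encoding tools such as `Get-Content`/`Set-Content` over a Parquet file destroys those bytes, and since Parquet already stores strings as UTF-8 internally, no re-encoding should be needed. A minimal sanity check (illustrative Python, names are my own):

    ```python
    # Sketch: check for the 4-byte Parquet magic number "PAR1" at the
    # head and tail of a file. A file that fails this check has been
    # corrupted (e.g. by text re-encoding) or was never Parquet at all.
    PARQUET_MAGIC = b"PAR1"  # bytes [80, 65, 82, 49]

    def looks_like_parquet(path: str) -> bool:
        with open(path, "rb") as f:
            head = f.read(4)
            f.seek(-4, 2)  # 2 = os.SEEK_END: last four bytes of the file
            tail = f.read(4)
        return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
    ```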


    • Edited by Amit-Tomar Friday, February 1, 2019 6:53 AM
    Friday, February 1, 2019 6:23 AM
  • Amit,

    You need to define the external file format (CREATE EXTERNAL FILE FORMAT (Transact-SQL)) before you can create an external table (CREATE EXTERNAL TABLE (Transact-SQL)) and import the file into it using PolyBase.
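    For a Parquet source like the one above, the sequence looks roughly like this (object names, columns, location, and the external data source are placeholders, not a definitive implementation; note that a Parquet format needs no encoding conversion):

    ```sql
    -- Illustrative sketch: define the file format first, then the
    -- external table that references it. All names are placeholders.
    CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

    CREATE EXTERNAL TABLE dbo.SodExternal
    (
        Id      INT,
        Payload NVARCHAR(4000)
    )
    WITH (
        LOCATION    = '/oracle/sod_utf',
        DATA_SOURCE = MyBlobStorage,   -- assumes an existing external data source
        FILE_FORMAT = ParquetFormat
    );
    ```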



    Friday, February 1, 2019 11:03 PM