U-SQL Error Extracting from TXT file

    Question

  • When running my extract, I got this error:

    Found invalid character-encoding for UTF-8 encoding in input. The input file may contain corrupted data, or the specified input encoding in the extractor does not match the actual file encoding. See the DETAILS section for a hexadecimal dump of the file segment containing the invalid character-encoding.

    I am not able to read UTF-8 character data with the U-SQL script below:

    @cgadmdomain =
        EXTRACT row_id string,
                orgarea_name string,
                last_changed_time string,
                start_date string,
                stop_date string,
                domain_name string,
                gui_description string,
                media string,
                direction string,
                distribution string,
                threshold1 string,
                threshold2 string
        FROM @cgadmdomainInPath
        USING Extractors.Text(delimiter: ';');

    The file has the value "Test Kö CB" in the media column. If I remove this particular record, the script runs fine. Please let me know if I need to add anything to the extractor parameters.

    Monday, April 9, 2018 8:57 AM

All replies

  • Check that the input file was actually saved with UTF-8 encoding; a mismatch there could be the cause of this error.
    Monday, April 9, 2018 5:27 PM
  • See also my answer, and my request for more details, on the same question on Stack Overflow.

    Michael Rys

    Monday, April 9, 2018 6:18 PM
    Moderator
  • The sample data is copied from Blob storage to Azure Data Lake Store, and during the copy activity the data is automatically encoded as UTF-8. When the U-SQL activity runs, its input is therefore already in UTF-8.
    Tuesday, April 10, 2018 7:10 AM
  • Either your conversion creates invalid UTF-8 characters or the conversion is not taking place. Can you open the file in a tool like Notepad++ to see what encoding it detects? Or, if the file contains no privacy-related data, can you send it to me (mrys at microsoft) so I can take a look at the encoding?

    Michael Rys

    Tuesday, April 10, 2018 4:06 PM
    Moderator
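
The encoding check suggested in the last reply can also be done with a script instead of an editor. The Python sketch below is not from the thread. The helper names `find_invalid_utf8` and `transcode_to_utf8` are made up for illustration, and the guess that the file was written in Windows-1252 (`cp1252`) is an assumption the thread does not confirm. It shows why a value like "Test Kö CB" could trigger the extractor error: in a single-byte encoding, "ö" is stored as the lone byte 0xF6, which is not a valid UTF-8 sequence.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if all of data decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start:err.end]
    return None


def transcode_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode data as UTF-8, ASSUMING it is currently in source_encoding.
    The cp1252 default is a guess, not something the thread confirms."""
    return data.decode(source_encoding).encode("utf-8")


# The value from the question ("Test Kö CB"), as a Windows-1252 file
# would store it: the "ö" becomes the single byte 0xF6.
sample = "Test K\u00f6 CB".encode("cp1252")

print(find_invalid_utf8(sample))  # (6, b'\xf6')

# After transcoding, "ö" is the two-byte sequence 0xC3 0xB6 and the check passes.
fixed = transcode_to_utf8(sample)
print(find_invalid_utf8(fixed))   # None
```

To check a real file, read it in binary mode (for example `open(path, "rb").read()`, where `path` is the local copy of the input) and pass the bytes to `find_invalid_utf8`. The reported offset can then be compared against the hexadecimal dump in the DETAILS section of the U-SQL error.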