Problem: Extra characters in JSON files from Azure Data Factory v1

  • Question

  • UPDATE: The hidden characters are a Unicode BOM (Byte Order Mark) that Data Factory has prepended to the output file it generates during the copy. Is there a way to have Data Factory generate a file without the BOM?

    Problem: Extra characters in JSON files from Azure Data Factory v1.

    I have two files which are straight copies of two Cosmos DB collections.
    I used Data Factory v1, selecting defaults, to copy the collections to a Blob storage container,
    then used Azure Storage Explorer to copy the JSON files to a Windows 10
    desktop.

    A. Looking at the files in an editor (vi, Notepad, WordPad, UltraEdit, Visual
    Studio Code) they look OK.

    B. When I attempt to read the files into a simple Node.js (v9) application,
    I get a JSON parse error:

    SyntaxError: C:\Users\ricko\Desktop\whippy\MorpheusDataProduction-01052018-20.json: Unexpected token � in JSON at position 0
        at JSON.parse (<anonymous>)
        at Object.Module._extensions..json (module.js:654:27)
        at Module.load (module.js:554:32)
        at tryModuleLoad (module.js:497:12)
        at Function.Module._load (module.js:489:3)
        at Module.require (module.js:579:17)
        at require (internal/module.js:11:18)
        at Object.<anonymous> (C:\Users\ricko\Desktop\whippy\appUpdateJSON.js:5:17)
        at Module._compile (module.js:635:30)
        at Object.Module._extensions..js (module.js:646:10)

    C. A single line validates using JSONLint (http://jsonlint.com). Multiple lines do not
    validate, giving a parse error:

    Error: Parse error on line 15:
    ..."_ts": 1512601730} { "path": "Dropbox\
    ----------------------^
    Expecting 'EOF', '}', ',', ']', got '{'
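    That parse error is what JSONLint reports when the file is newline-delimited JSON (one document per line) rather than a single JSON array, which matches how the export above looks. A minimal sketch of reading such a file in Node.js, assuming each line is one document (the function name `parseJsonLines` is mine, not from the thread):

    ```javascript
    const fs = require('fs');

    // Parse a newline-delimited JSON export: each line is one document,
    // so JSON.parse is applied per line rather than to the whole file.
    function parseJsonLines(path) {
      return fs.readFileSync(path, 'utf8')
        .replace(/^\uFEFF/, '')        // drop a leading BOM if present
        .split(/\r?\n/)                // ADF writes Windows "\r\n" line breaks by default
        .filter(line => line.trim() !== '')
        .map(line => JSON.parse(line));
    }
    ```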


    D. Also, using Node to read the file directly and then writing a record
    to the console, I get a weird double-spaced version. I also saw two ��
    characters at the beginning of the line, in front of the opening brace, in one attempt.

      { " S T E P _ N A M E " : " O p e n s   T e s t " , " N A M E " : " A r t h u r   J o b e r t " , " D A T E " : " 1 2
    1 9 / 2 0 1 6   3 : 5 7 : 4 7   P M " , " L O T " : " C G 1 5 " , " W A F E R " : " " , " P R O C E S S _ S T E P " : "
    7 . 0 _ 1 . 0 " , " M E A S U R E M E N T " : " R d c : C - C " , " B A T C H " : " " , " D I E _ R O W " : " 3 " , " D
    I E _ C O L U M N " : " 3 " , " S I T E _ N U M " : " C - C " , " D A T A " : " 2 2 8 . 9 9 5 0 0 0 " , " U N I T " : "
    o h m s " , " S T A T I O N _ I D " : " P r o b e r 1 " , " M I D " : " M a t t h e w s : : * : : 3 : : 3 : : 1 2 - 1 9
    - 2 0 1 6   3 : 5 7 : 4 7   P M : : 1 8 " , " C O M P A N Y " : " M a t t h e w s " , " F I L E " : " D r o p b o x \ \
    D a t a b a s e   s w e e p / 2 0 1 6 - 1 2 - 1 9 - a t - 1 5 5 7 _ 0 0   C G 1 5   d i e 1 = ( 5 , 1 )   N = 4 3   T e
    s t e r = P r o b e r 1 . c s v " , " R E C O R D " : 1 8 , " i d " : " d 2 5 9 d 1 0 b - b 6 5 a - 4 2 9 9 - 4 a a 2 -
    4 0 0 c 2 d 1 b 7 4 9 7 " , " _ r i d " : " v t Y Y A O V t L w G v A A A A A A A A A A = = " , " _ s e l f " : " d b s
    / v t Y Y A A = = / c o l l s / v t Y Y A O V t L w E = / d o c s / v t Y Y A O V t L w G v A A A A A A A A A A = = / "
    , " _ e t a g " : " \ " 0 3 0 1 5 2 6 e - 0 0 0 0 - 0 0 0 0 - 0 0 0 0 - 5 a 2 8 7 8 8 1 0 0 0 0 \ " " , " _ a t t a c h
      e n t s " : " a t t a c h m e n t s / " , " _ t s " : 1 5 1 2 6 0 1 7 2 9 }
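    The double-spaced output and the leading "��" are symptoms of decoding with the wrong encoding: 0xFF 0xFE at the start of a file is the UTF-16 LE BOM, while 0xEF 0xBB 0xBF is the UTF-8 BOM. A minimal sketch of sniffing the first bytes to pick a decoder (the function name `readTextAutoBom` is my own illustration, not from the thread):

    ```javascript
    const fs = require('fs');

    // Inspect the first bytes of the file and decode with the encoding
    // the BOM indicates, skipping the BOM itself.
    function readTextAutoBom(path) {
      const buf = fs.readFileSync(path);
      if (buf.length >= 2 && buf[0] === 0xFF && buf[1] === 0xFE) {
        return buf.toString('utf16le', 2);   // UTF-16 little-endian, BOM skipped
      }
      if (buf.length >= 3 && buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) {
        return buf.toString('utf8', 3);      // UTF-8, BOM skipped
      }
      return buf.toString('utf8');           // no BOM: assume plain UTF-8
    }
    ```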

     
    Sunday, January 14, 2018 7:08 PM

Answers

  • Hi,

    Unicode with BOM is used by default; sorry that it currently can't be configured from user input, but as far as I know, many existing applications handle this extra character. However, I also checked Node.js and found that many people have had the same issue with the BOM in Node.js; here are some proposed workarounds:

    https://stackoverflow.com/questions/24356713/node-js-readfile-error-with-utf8-encoded-file-on-windows

    Also, I found it was fixed in a certain version of node-config; could you check if this helps?

    https://github.com/lorenwest/node-config/pull/216/files

    https://github.com/lorenwest/node-config/issues/215

    Thanks,

    Eva 

    Tuesday, January 16, 2018 10:27 AM

All replies

  • Hi Ossinger,

    Per your description, I suspect this could be caused by the new-line character or the encoding. FYI, ADF copy with JSON format by default uses the Windows new line "\r\n" as the line break, and UTF-8 is the default encoding. Please set your application to the expected new-line character and encoding and try again.

    Thanks,
    Eva

    Monday, January 15, 2018 7:37 AM
  • Not newlines. The error is due to the Unicode BOM (Byte Order Mark), which is the hidden character at the beginning of the file.

    Monday, January 15, 2018 5:06 PM
  • https://stackoverflow.com/questions/24356713/node-js-readfile-error-with-utf8-encoded-file-on-windows

    provided the answer. Specifically, using data = data.replace(/^\uFEFF/, ''); to strip the BOM off.
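    A minimal sketch of that workaround in context, assuming the export is a single JSON document (the function name `readJsonWithoutBom` is my own illustration):

    ```javascript
    const fs = require('fs');

    // Read the exported file as UTF-8 text and strip a leading U+FEFF
    // before handing the contents to JSON.parse.
    function readJsonWithoutBom(path) {
      let data = fs.readFileSync(path, 'utf8');
      data = data.replace(/^\uFEFF/, ''); // remove the BOM Data Factory prepends
      return JSON.parse(data);
    }
    ```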

    Thank you.

    Tuesday, January 16, 2018 7:44 PM