Extremely Puzzling - Raw text file reading

  • Question

  • I'm working on processing CSV files. Some of these are UTF-8, but others are plain "ASCII" with no byte order mark prefix or anything.
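    One way to tell the two kinds of file apart is to inspect the leading bytes directly. The following is a minimal sketch in Python (not the .NET API used in this thread); the helper names `has_utf8_bom` and `is_plain_ascii` are hypothetical and chosen only for illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (0x00 to 0x7F)."""
    return all(b < 0x80 for b in data)


# Hypothetical sample data standing in for the first bytes of two CSV files.
sample_with_bom = codecs.BOM_UTF8 + b"a,b\r\n"
sample_ascii = b"a,b\r\n"

print(has_utf8_bom(sample_with_bom))  # BOM present
print(has_utf8_bom(sample_ascii))     # no BOM
print(is_plain_ascii(sample_ascii))   # only 7-bit bytes
```

    Note that a UTF-8 file is not required to carry a BOM, so the absence of one does not by itself prove that a file is ASCII.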

    If I read the data as a stream and read the raw bytes (ReadByte) then in the case of the UTF-8 files I "see" every CR LF as a distinct character.

    But if I read the plain ASCII file I do see the CR LF chars at the end of each line BUT if I insert a CR LF in the middle of a record (e.g. in Notepad++ just put cursor partway along a record and press ENTER) then these inserted CR LF chars are not seen.

    Somehow the reading mechanism (Stream.ReadByte) seems to know that the originally present CR LF are real and it returns them but the inserted CR LF chars are to be ignored.

    I can clearly see these CR LF bytes in Neo Hex Editor, and as a sequence of chars it looks as expected, but I cannot read these chars: the "real" CR LF chars are seen but the ones I inserted are not.

    So for example if I begin with this simple text

    AAAAAAAAAA\r\n - we can see and read the \r and \n chars.

    If I edit this to be:

    AA\r\n

    AAAAAAAA\r\n

    (and save the file)

    Then we see a sequence of 10 'A's followed by a \r and \n - we never see the preceding \r\n - but in the hex editor I see no reason at all for this.

    Can anyone explain this?

    Thursday, September 12, 2019 6:04 PM

All replies

  • You are overlooking something somewhere; you are making an assumption that you are not validating.

    So please provide exact instructions to re-create the problem. You say start with AAAAAAAAAA\r\n then change it. What exactly are you using to do that? Then you say you see something; exactly what do you do to see that?
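    A self-contained reproduction might look something like the sketch below. It is written in Python rather than .NET, so `stream.read(1)` only stands in for `Stream.ReadByte`; the file content and the helper names `read_bytes_one_at_a_time` and `hex_dump` are assumptions made for illustration. It writes the edited record to a temporary file, reads it back one byte at a time, and prints every byte in hex so the result can be compared with a hex editor.

```python
import os
import tempfile
from typing import List


def read_bytes_one_at_a_time(path: str) -> List[int]:
    """Read a file in binary mode, one byte per call, and return every byte value."""
    values = []
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(1)  # roughly analogous to one ReadByte call
            if not chunk:           # an empty result means end of file
                break
            values.append(chunk[0])
    return values


def hex_dump(values: List[int]) -> str:
    """Format byte values as space-separated two-digit hex."""
    return " ".join(f"{v:02X}" for v in values)


# Assumed test content: the edited record described in the question.
content = b"AA\r\nAAAAAAAA\r\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

try:
    seen = read_bytes_one_at_a_time(path)
    print(hex_dump(seen))
    # prints: 41 41 0D 0A 41 41 41 41 41 41 41 41 0D 0A
finally:
    os.remove(path)
```

    In this sketch both CR LF pairs come back from the binary read, so if a real program sees something different, the difference most likely lies in the file contents or in a layer above the raw byte read.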



    Sam Hobbs
    SimpleSamples.Info

    Thursday, September 12, 2019 6:13 PM
  • For example, I used Visual Studio Code to put AAAAAAAAAA into a text file. I then opened it in Visual Studio using the Binary Editor as in:

    And I saw:

    Then in Visual Studio Code I added a new line as you describe and then Visual Studio refreshed the file and then I saw:

    I could have used Visual Studio for all that but I did it the way I did it. You can use Notepad instead of Visual Studio Code. Do you get different results?



    Sam Hobbs
    SimpleSamples.Info

    Thursday, September 12, 2019 6:30 PM
  • I'll try to answer your question and provide more details tomorrow but it's rather involved with various test projects and test files and FTP servers and so on.

    But please note: I saw the data in the file using Neo Hex editor - how I changed the data does not matter, it was as described - TEXT 0D 0A TEXT 0D 0A.

    The trailing 0D 0A was always present and those bytes were always read, always visible when reading the stream.

    The inserted 0D 0A (inserted by editing the file in Notepad++ and saving it), once inserted, are invisible when reading the stream; the code reading the stream sees (in effect) TEXTTEXT0D0A.

    I tried reading using Read and ReadByte and nothing I tried gave different results.

    The hex editor presumably shows me every single byte in the NTFS file stream; how the bytes got there is unimportant, surely?

    Thursday, September 12, 2019 9:46 PM
  • But please note: I saw the data in the file using Neo Hex editor

    I don't have that so use something we are more likely to have, such as Visual Studio. I do have Notepad++ but it is better to not assume everyone does.

    how I changed the data does not matter

    It does if you want help.



    Sam Hobbs
    SimpleSamples.Info


    Friday, September 13, 2019 12:34 AM
  • If you append "AA" at the end of file, do you read them?

    AAAAAAAAAA\r\nAA

    I suspect this could be a caching issue. I've used Stream.Read() for 15+ years and have never run into this kind of problem with it before.

    Friday, September 13, 2019 1:13 AM
  • The KEY thing you are missing here is the difference between a character and its hex value.  I assume you can see that when you have AAAA in a file, those four characters have the hex value 41 41 41 41.  A and 0x41 are the same thing; it's all in how you view that byte of data.  But if you type 41 in notepad, you will not get an A.  You will get 34 31.

    Similarly, when you type 0D0A in notepad, you are not typing hex bytes.  You are typing characters.  The characters 0D0A will be represented in the file as 30 44 30 41.  That's NOT a newline.  Now, if you press the ENTER key in Notepad, you will get carriage return (0D) and linefeed (0A) in your file.

    In memory, the byte of memory might be 0x41.  If you print that out as an integer, you'll see 65.  If you print it out as a hex value, you'll see 41.  If you print it out as a character, you'll see A.  Three representations of the exact same byte value.
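    The distinction can be checked with a few lines of code. This is a minimal sketch in Python (not the .NET API discussed in the thread); the helper name `to_hex` is hypothetical. It prints one byte in three representations, then compares the bytes produced by typing the characters with the bytes produced by a real line break.

```python
def to_hex(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex."""
    return " ".join(f"{b:02X}" for b in data)


value = 0x41
print(value)                 # as a decimal integer: 65
print(format(value, "02X"))  # as a hex value: 41
print(chr(value))            # as a character: A

typed_41 = "41".encode("ascii")      # the two characters '4' and '1'
typed_0d0a = "0D0A".encode("ascii")  # the four characters '0', 'D', '0', 'A'
enter_key = b"\r\n"                  # an actual carriage return and linefeed

print(to_hex(typed_41))    # 34 31
print(to_hex(typed_0d0a))  # 30 44 30 41
print(to_hex(enter_key))   # 0D 0A
```

    Only the last case puts the bytes 0D 0A in the file; the typed text is four ordinary printable characters.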


    Tim Roberts | Driver MVP Emeritus | Providenza & Boekelheide, Inc.

    Saturday, September 14, 2019 7:02 AM