StreamReader.Read: How Many Bytes Per Character?

    Question

  • Hi, I've been trying all sorts of things for a couple of days but just can't work this one out. I have an app that reads in text-based reports one page at a time using StreamReader.Read() until it reaches a page break, at which point the page is passed to another routine for processing. As each character is read, the page is built up in a StringBuilder and a byte count is kept. The byte count currently assumes 1 byte per character, but obviously not every character is one byte.

    I need to be able to cut out specific pages later from another application and this is currently done using that byte count and Stream.Seek() so it is very important that the byte count is accurate when reading in the characters.

    Due to some inconsistencies when reading characters on different machines, a decision was made to use the Windows Latin-1 code page 1252.

    The part that seems to be causing me confusion is how do I tell how many bytes each character takes up as they are being read in by the streamreader?

    What I really need to know is how many bytes each character consumes in the actual file. Knowing the encoding of the file would probably help. Most files contain only single-byte characters, and for those everything works when I count one byte per character. Any pointers on this would be hugely appreciated.


    Below is a code example which has me baffled. When I read the character in using code page 1252 encoding it tells me the character is 8250, however code page 1252 should be a single byte character set if I am correct. Therefore I had assumed it should only go up to 255. 

    When I use GetByteCount below it does not tell me the byte count of the character in the actual file. It does tell me the byte count of the character if it is encoded by that encoding type.

    I am guessing I will need to read in bytes using a BinaryReader instead and then do some analysis on each byte.


    Dim sr As StreamReader = New StreamReader("C:\Test\Test.txt", Encoding.GetEncoding(1252))  
     
    Dim intChar As Integer 
       
    ''Test variables  
    Dim buf(0) As Char 
    Dim ByteArr() As Byte 
    Dim tstByteCount As Integer 
     
       
    ''   
    ''Test Case 1  
    ''  
    intChar = sr.Peek  
    ''IntChar = 8250  
    ''So the character for this example is 8250  
     
    ''In the real program each character is read using sr.Read  
    ''If intChar < 256 Then Chr(intChar) is appended to the StringBuilder  
    ''If intChar > 255 Then ChrW(intChar) will be appended to the StringBuilder  
       
     
    ''For this test read one character into the array buf(0)  
    sr.ReadBlock(buf, 0, 1)  
    ''buf(0) = "›"  
     
    ByteArr = Encoding.GetEncoding(1252).GetBytes(buf(0))  
    ''ByteArr(0) = 155  
    tstByteCount = Encoding.GetEncoding(1252).GetByteCount(ChrW(intChar))
    ''tstByteCount = 1   
     
    ByteArr = Encoding.UTF8.GetBytes(buf(0))  
    ''ByteArr(0) = 226  
    ''ByteArr(1) = 128  
    ''ByteArr(2) = 186  
    tstByteCount = Encoding.UTF8.GetByteCount(ChrW(intChar))
    ''tstByteCount = 3   
     
    ByteArr = Encoding.Unicode.GetBytes(buf(0))  
    ''ByteArr(0) = 58  
    ''ByteArr(1) = 32  
    tstByteCount = Encoding.Unicode.GetByteCount(ChrW(intChar))
    ''tstByteCount = 2  
     
    ''  
    ''test Case 2  
    ''  
    intChar = sr.Peek  
    ''IntChar = 8217  
    ''So the character for this example is 8217  
    ''For this test read one character into the array buf(0)  
     
    sr.ReadBlock(buf, 0, 1)  
    ''buf(0) = "’"  
     
    ByteArr = Encoding.GetEncoding(1252).GetBytes(buf(0))  
    ''ByteArr(0) = 146  
    tstByteCount = Encoding.GetEncoding(1252).GetByteCount(ChrW(intChar))
    ''tstByteCount = 1   
     
    ByteArr = Encoding.UTF8.GetBytes(buf(0))  
    ''ByteArr(0) = 226  
    ''ByteArr(1) = 128  
    ''ByteArr(2) = 153  
    tstByteCount = Encoding.UTF8.GetByteCount(ChrW(intChar))
    ''tstByteCount = 3   
       
    ByteArr = Encoding.Unicode.GetBytes(buf(0))  
    ''ByteArr(0) = 25  
    ''ByteArr(1) = 32  
    tstByteCount = Encoding.Unicode.GetByteCount(ChrW(intChar))
    ''tstByteCount = 2  
     
     
     

     

    This link explains how byte 155 in code page 1252 is Unicode character 8250 and byte 146 in code page 1252 is Unicode character 8217:
    http://www.pemberley.com/janeinfo/latin1.html

    Thanks in advance,
    Dennis



    Monday, June 30, 2008 1:41 PM
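
    The mapping described in the question can be checked directly. The following is a minimal C# sketch, not code from the thread: the class and method names (Cp1252Demo, CodePointForByte, BytesInFile) are made up for illustration. It decodes the raw bytes 155 and 146 with code page 1252 and shows that the value StreamReader hands back is the Unicode code point, while the character still occupies one byte in a 1252 file. The RegisterProvider line is only needed on .NET Core / .NET 5 and later; on the .NET Framework it can be removed.

```csharp
using System;
using System.Text;

static class Cp1252Demo
{
    // Returns the Windows-1252 encoding. Registering the provider is required
    // on .NET Core / .NET 5+, where code page encodings are opt-in.
    public static Encoding GetCp1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Unicode code point (what StreamReader.Read returns) for one file byte.
    public static int CodePointForByte(byte b)
    {
        return GetCp1252().GetChars(new[] { b })[0];
    }

    // Number of bytes the character occupies in a file stored in fileEncoding.
    public static int BytesInFile(char c, Encoding fileEncoding)
    {
        return fileEncoding.GetByteCount(new[] { c });
    }

    static void Main()
    {
        Encoding cp1252 = GetCp1252();
        foreach (byte b in new byte[] { 155, 146 })
        {
            char c = (char)CodePointForByte(b);
            Console.WriteLine(
                "byte {0} -> U+{1:X4} ({1}): {2} byte(s) in a 1252 file, {3} in UTF-8",
                b, (int)c, BytesInFile(c, cp1252), BytesInFile(c, Encoding.UTF8));
        }
    }
}
```

    In other words, the number 8250 says nothing about the size on disk. The size depends only on the encoding the file was written in, which is why GetByteCount has to be called on that encoding.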

Answers

  • Ugh, fugly problem.  There's no way you can use StreamReader: it has a small internal buffer that ensures it always has enough bytes to properly decode the Unicode character.  In other words, it reads ahead, so the position of the underlying stream tells you nothing about the character you just read.  BinaryReader can't read strings, you'll have to use FileStream.  But that doesn't solve anything, you are still faced with the challenge of properly decoding the bytes.

    Change the problem: don't use Seek().  Buffering the strings should work.

    Hans Passant.
    • Marked as answer by Bruno Yu Thursday, July 03, 2008 2:54 AM
    Monday, June 30, 2008 2:16 PM
    Moderator
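
    The read-ahead behaviour is easy to observe. The sketch below is an illustration only (ReadAheadDemo and PositionAfterOneRead are hypothetical names, and a MemoryStream stands in for the report file): it reads a single character through a StreamReader and then asks the underlying stream how far it has advanced. The exact number depends on the reader's buffer size, so none is claimed here.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadAheadDemo
{
    // Reads one character through a StreamReader and returns how far the
    // underlying stream has advanced afterwards.
    public static long PositionAfterOneRead(byte[] fileBytes)
    {
        using (var stream = new MemoryStream(fileBytes))
        using (var reader = new StreamReader(stream, Encoding.ASCII))
        {
            reader.Read();           // consume exactly one character
            return stream.Position;  // bytes the reader has pulled so far
        }
    }

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes(new string('x', 10000));
        Console.WriteLine("One character read, stream position is now {0}",
            PositionAfterOneRead(data));
    }
}
```

    Because the reported position is a whole buffer ahead of the character just consumed, it cannot serve as the byte offset of a page break.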
  • Hi Dennis,

    I'm afraid I don't have the skill to help with the question you asked so my instinct too would be to look at a different approach. 

    Could you use a FileStream to read a page in binary, then decode the whole page?

    The code below does a similar thing by reading a line at a time from the FileStream and passing the file position and the decoded text on for processing (it requires each line to be terminated by \r\n).

    Sorry not to be more help,

    John

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            using (FileStream fs = new FileStream(@"C:\someFile.txt", FileMode.Open))
            {
                long bytesToRead = fs.Length;

                const int cr = 13;
                const int lf = 10;

                // TODO: Advance past BOM

                long position = 0;

                while (bytesToRead > 0)
                {
                    // Collect the raw bytes of one line, including the trailing \r\n
                    List<byte> lineBytes = new List<byte>();
                    bool gotLine = false;
                    int currentByte;
                    int previousByte = 0;
                    while (!gotLine && (bytesToRead > 0))
                    {
                        currentByte = fs.ReadByte();
                        lineBytes.Add((byte)currentByte);
                        bytesToRead--;
                        if (currentByte == lf && previousByte == cr)
                            gotLine = true;
                        previousByte = currentByte;
                    }
                    // Decode the whole line at once; position is the byte offset where it starts
                    string text = Encoding.GetEncoding(1252).GetString(lineBytes.ToArray());
                    ProcessLine(position, text);
                    position += lineBytes.Count;
                }
            }
        }

        static void ProcessLine(long position, string text)
        {
            Console.Write(string.Concat(position.ToString(), ": ", text));
        }
    }
     

    • Marked as answer by Bruno Yu Thursday, July 03, 2008 2:54 AM
    Monday, June 30, 2008 3:35 PM
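
    The "Advance past BOM" TODO in the code above could be handled along the following lines. This is a hedged sketch with hypothetical names (BomDemo, GetBomLength, SkipBom), and it is not the GetFileEncoding routine mentioned later in the thread. It only recognises the UTF-8, UTF-16 and UTF-32 byte order marks; a code page 1252 file normally has no BOM at all, so for such files it simply returns 0.

```csharp
using System;
using System.IO;

static class BomDemo
{
    // Returns the length of a UTF-8, UTF-16 or UTF-32 byte order mark at the
    // start of head, or 0 if none is present. UTF-32 LE is checked first
    // because its BOM begins with the same two bytes as the UTF-16 LE one.
    public static int GetBomLength(byte[] head)
    {
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0 && head[3] == 0) return 4; // UTF-32 LE
        if (head.Length >= 4 && head[0] == 0 && head[1] == 0 && head[2] == 0xFE && head[3] == 0xFF) return 4; // UTF-32 BE
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF) return 3;              // UTF-8
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE) return 2;                                 // UTF-16 LE
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF) return 2;                                 // UTF-16 BE
        return 0;
    }

    // Positions a seekable stream just past any BOM and returns the BOM length.
    // Assumes a single Read call returns the first bytes of the stream, which
    // holds for the in-memory stream used here.
    public static int SkipBom(Stream stream)
    {
        byte[] head = new byte[4];
        stream.Position = 0;
        int read = stream.Read(head, 0, head.Length);
        Array.Resize(ref head, read);
        int bomLength = GetBomLength(head);
        stream.Position = bomLength;
        return bomLength;
    }

    static void Main()
    {
        byte[] utf8WithBom = { 0xEF, 0xBB, 0xBF, (byte)'H', (byte)'i' };
        using (var stream = new MemoryStream(utf8WithBom))
        {
            Console.WriteLine("BOM length: {0}", SkipBom(stream));
        }
    }
}
```

    If something like this were used with the loop above, position and bytesToRead would both need to be adjusted by the returned BOM length so that the offsets passed to ProcessLine still refer to the file.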
  • Thanks for the replies. I ended up writing a solution that works well for my applications. I needed to keep the solution compatible with the existing data, since someone else developed this app initially. In addition to the code below, I also implemented GetFileEncoding (called on line 3) from here, which has helped a lot with using the correct encoding.

    1     Public Sub TestStringFs(ByVal FilePath As String)  
    2  
    3         Dim e As Encoding = GetFileEncoding(FilePath)  
    4         Dim fs As New FileStream(FilePath, FileMode.Open)  
    5         Dim intByte As Integer 
    6         Dim BufferSize As Integer = 8192  
    7  
    8         Dim ByteArr(BufferSize - 1) As Byte 
    9  
    10         Dim blnFirstByte As Boolean = True 
    11         Dim PageString As String = "" 
    12         Dim ByteCount As Integer 
    13         Dim PageCount As Integer 
    14         Dim i As Integer 
    15  
    16         While intByte > -1  
    17  
    18             Do Until intByte = 12  
    19  
    20                 intByte = fs.ReadByte  
    21                 '-1 returned at end of file     
    22                 If intByte = -1 Then 
    23                     Exit Do 
    24                 End If 
    25  
    26                 ByteArr(i) = intByte  
    27  
    28                 If blnFirstByte = True Then 
    29                     blnFirstByte = False 
    30                     If intByte = 12 Then 
    31                         'Some reports have a form feed for the first character     
    32                         intByte = fs.ReadByte  
    33  
    34                         '-1 returned at end of file     
    35                         If intByte = -1 Then 
    36                             Exit Do 
    37                         Else 
    38                             i += 1  
    39                             ByteCount += 1  
    40                             ByteArr(i) = intByte  
    41                         End If 
    42                     End If 
    43                 End If 
    44  
    45                 i += 1  
    46                 ByteCount += 1  
    47  
    48                 If i = BufferSize Then 
    49                     If PageString = "" Then 
    50                         PageString = e.GetChars(ByteArr, 0, i)  
    51                     Else 
    52                         PageString = PageString & e.GetChars(ByteArr, 0, i)  
    53                     End If 
    54                     Array.Clear(ByteArr, 0, i)  
    55                     i = 0  
    56                 End If 
    57             Loop 
    58  
    59             PageCount += 1  
    60  
    61             If PageString = "" Then 
    62                 PageString = e.GetChars(ByteArr, 0, i)  
    63             Else 
    64                 PageString = PageString & e.GetChars(ByteArr, 0, i)  
    65             End If 
    66  
    67             Array.Clear(ByteArr, 0, i)  
    68             i = 0  
    69  
    70             ''DO EXTRA PAGE PROCESSING HERE ON PageString               
    71             PageString = "" 
    72             If intByte = 12 Then intByte = 0  'Reset, otherwise Do Until exits immediately for the next page
    73         End While 
    74     End Sub 
    • Marked as answer by Bruno Yu Thursday, July 03, 2008 2:55 AM
    Tuesday, July 01, 2008 10:11 AM
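
    For the other half of the problem, cutting a page out later with Seek(), the byte counts gathered while reading can be kept as a list of page start offsets. The sketch below is one possible shape for that, under stated assumptions: the names (PageIndexDemo, FindPageOffsets, ReadPage) are hypothetical, the stream is seekable, and byte 12 only ever occurs as a form feed, which holds for single-byte code pages and UTF-8 but not for UTF-16. Unlike the VB code above, it does not treat a leading form feed specially.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class PageIndexDemo
{
    const int FormFeed = 12;

    // Scans the raw bytes and returns the byte offset at which each page starts.
    // A form feed ends a page; the next page starts on the following byte.
    public static List<long> FindPageOffsets(Stream stream)
    {
        var offsets = new List<long>();
        stream.Position = 0;
        if (stream.Length > 0) offsets.Add(0);
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            // Skip a trailing form feed so no empty last page is recorded.
            if (b == FormFeed && stream.Position < stream.Length)
                offsets.Add(stream.Position);
        }
        return offsets;
    }

    // Seeks to a page by byte offset and decodes just that page.
    public static string ReadPage(Stream stream, List<long> offsets, int page, Encoding encoding)
    {
        long start = offsets[page];
        long end = page + 1 < offsets.Count ? offsets[page + 1] : stream.Length;
        byte[] buffer = new byte[end - start];
        stream.Seek(start, SeekOrigin.Begin);
        int read = 0;
        while (read < buffer.Length)
        {
            int n = stream.Read(buffer, read, buffer.Length - read);
            if (n == 0) break;  // unexpected end of stream
            read += n;
        }
        return encoding.GetString(buffer, 0, read);
    }

    static void Main()
    {
        byte[] file = Encoding.UTF8.GetBytes("page one\fpage two\fpage three");
        using (var stream = new MemoryStream(file))
        {
            List<long> offsets = FindPageOffsets(stream);
            Console.WriteLine(ReadPage(stream, offsets, 1, Encoding.UTF8));
        }
    }
}
```

    Since the offsets are counted on raw bytes before any decoding happens, they stay valid whatever encoding is later used to turn a page into text.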

  • Thanks for the update.

    Btw, if performance is an issue you might want to look at lines 52 & 64.  My understanding is that these will cause the whole PageString content to be copied from one memory location to another on every append, so you might want to look at using StringBuilder instead.

    Good Luck,

    John
    Tuesday, July 01, 2008 10:30 AM
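
    A rough C# sketch of the StringBuilder suggestion follows. The names (PageBuilderDemo, DecodeChunks) are hypothetical, and the use of a Decoder is an addition that is not in the thread: calling GetChars on fixed-size chunks is fine for a single-byte code page such as 1252, but with a multi-byte encoding a character could be split across two chunks, and a Decoder carries that state over.

```csharp
using System;
using System.Text;

static class PageBuilderDemo
{
    // Decodes a sequence of byte chunks into one string without re-copying
    // the text accumulated so far on every chunk.
    public static string DecodeChunks(byte[][] chunks, Encoding encoding)
    {
        var page = new StringBuilder();
        // The decoder remembers incomplete byte sequences between calls.
        Decoder decoder = encoding.GetDecoder();
        foreach (byte[] chunk in chunks)
        {
            char[] chars = new char[encoding.GetMaxCharCount(chunk.Length)];
            int n = decoder.GetChars(chunk, 0, chunk.Length, chars, 0);
            page.Append(chars, 0, n);
        }
        return page.ToString();
    }

    static void Main()
    {
        // The three UTF-8 bytes of U+203A are deliberately split across chunks.
        byte[][] chunks =
        {
            new byte[] { 0x41, 0xE2, 0x80 },
            new byte[] { 0xBA, 0x42 }
        };
        string text = DecodeChunks(chunks, Encoding.UTF8);
        Console.WriteLine(text.Length);
    }
}
```

    The byte count used for seeking is unaffected by this change, because it is still taken from the chunk lengths rather than from the decoded text.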
  • Hi John, you're right. I am doing that in the production code. In the test code I was just lazy :)
    Wednesday, July 02, 2008 12:23 AM