StreamReader and File Position

    Question

  • I'm processing large binary files. These are PCL files, and I'm looking for page boundaries. I want to store the position of each Form Feed, which in PCL is decimal 12, hex 0C. However, that byte can also exist as part of a raster or other binary structure.

    So I loop through the file, and when I find a "12", I read ahead 14 bytes to compare them to a known string. If I get a match, I know the 12 was a real Form Feed, and I store its position in an ArrayList.

    This works fine using a FileStream object and its ReadByte() method and .Position property. The problem is it is very slow. I'd like to use a StreamReader to take advantage of buffering. However, when I use a StreamReader, the FileStream's Position property points to the amount that's been buffered, not the actual file position.

    So my question is, how can I have the speed of StreamReader, but still maintain an accurate file position?

    Sample code, the StreamReader Version. Hopefully, someone can suggest a change that would report the "virtual" file position of the "current byte", rather than the current file position reached through buffering.



    using System;
    using System.IO;
    using System.Text;
    using System.Collections;

    namespace pcl_proc
    {
        /// <summary>
        /// Summary description for Class1.
        /// </summary>
        class Class1
        {
            [STAThread]
            static void Main(string[] args)
            {
                ArrayList page_positions = new ArrayList();
                ArrayList page_type = new ArrayList();
                string asciiString;

                string bgn_of_page = " &l8c1E *p0x0Y";
                string header;

                long curr_pos;

                int pcl_char;

                char[] test;

                string filename = @"C:\Statements-05-03-05.pcl";
                FileStream infile = new FileStream(filename, FileMode.Open, FileAccess.Read);
                StreamReader input = new StreamReader(infile);

                // need to initialize header and position of first page.

                test = new char[1024];

                input.Read(test, 0 , test.Length);
                asciiString = new String(test);

                header = asciiString.Substring(0,asciiString.IndexOf("*b0M") + 4);
                page_positions.Add(header.Length);
                page_type.Add("B");

                while (input.Peek() >= 0 )
                {  
                    pcl_char = input.Read();
               
                    if (pcl_char == 12)
                    {  
                        test = new char[14];

                        // this next line doesn't record the accurate position
                        // of the "12" found by the input.Read().
                        // How can I get the actual position?
                        curr_pos = infile.Position;

                        input.Read(test, 0, test.Length);

                        asciiString = new string(test);

                        if (asciiString == bgn_of_page)
                        {
                            page_positions.Add(curr_pos);
                        } // if (asciiString == bgn_of_page)
                    } // if (pcl_char == 12)
                } // while (input.Peek() >= 0)

                infile.Close();
            }
        }
    }

     



    Note: the "bgn_of_page" string is actually 14 bytes, the forum stripped out the two "escape" characters. I mention this in case anyone wonders why I'm reading 14 bytes and comparing it to a 12 byte string.

    Note: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemiostreamreaderclassbasestreamtopic.asp,  contains this enigmatic statement:

    "StreamReader might buffer input such that the position of the underlying stream will not match the StreamReader position." Yes, that's right. But they offer no method or example to deal with that situation. Also, they refer to "the StreamReader position". Well, what is the StreamReader position? How do I find it? What method or property returns it?

    Wednesday, May 04, 2005 2:53 PM

All replies

  • Take a look at StreamReader.DiscardBufferedData(), which, if I'm reading the docs right, should work for you.

    Note that you should check the result of input.Read(...) to make sure you are getting all of the data you are expecting, as Read is not required to return everything you asked for in a single call.
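
    A minimal sketch of that check, assuming a hypothetical helper named ReadFully (it is not part of the framework): it keeps calling Read until the requested number of characters has arrived or the reader is exhausted, and returns how many it actually got. TextReader.ReadBlock offers similar behavior out of the box.

    ```csharp
    using System;
    using System.IO;

    public static class ReadHelpers
    {
        // Hypothetical helper: reads up to `count` chars into `buffer`,
        // looping over short reads. Returns the number of chars read,
        // which is less than `count` only at end of input.
        public static int ReadFully(TextReader reader, char[] buffer, int offset, int count)
        {
            int total = 0;
            while (total < count)
            {
                int n = reader.Read(buffer, offset + total, count - total);
                if (n == 0) break;   // end of input
                total += n;
            }
            return total;
        }
    }
    ```

    The caller would compare the return value with the expected length (14 in the sample above) before building the string to compare.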
    Wednesday, May 04, 2005 3:12 PM
  • If I use the .DiscardBufferedData() method, I would effectively remove all buffering. I would be discarding 1023 bytes of buffered data each and every read. That would take me back to the performance levels achieved by the FileStream.ReadByte() version of code I wrote.

    I did manage to get high performance by effectively writing my own "buffering" system, by using the FileStream.Read() method to read in 8k of data, and then loop through the resulting byte array.

    However, I still would like an answer to the basic question:

    When using StreamReader, since it "may" buffer, how do you get the "calculated" file position of the underlying stream? If I've read three 1k "buffers" automatically while using StreamReader.Read(), but the character I want is the 5th character in the 3rd buffer, I have a file position of 2048 + 5.

    I shouldn't have to make that calculation, or keep track of how many buffers have been read, etc. There should be a "Position" property and/or method for the StreamReader object that will return the position, in the underlying stream, of the current "read" byte, regardless of buffering. I want buffering to be transparent.


    Tuesday, May 10, 2005 4:25 PM
  • Can you not read a bulk of data into a local buffer and do a mem compare instead of reading byte by byte from a file? 
    Friday, May 13, 2005 7:41 AM
  • I guess that's exactly what I'm doing, though you'd need to clarify what a "mem compare" is.

    I'm reading in large chunks to a byte array (local buffer). I'm looping through the bytes, looking for my target byte.

    If I want to know the file position of the "current byte", I must multiply the size of the buffer/byte array, by the number of times I've looped, then subtract the index of the current byte from the size of the buffer... plus I have to handle situations where the target is too close to the start and/or end of the buffer.

    I shouldn't have to do all of this, is my point.
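
    For reference, the bookkeeping described above can be sketched roughly as follows. This is only an illustration under assumptions: MarkerScanner, FindAll and chunkSize are hypothetical names, and the marker is whatever byte sequence identifies a real page break (for example the Form Feed followed by the 14 known bytes). The file position of a hit is the absolute offset of the first byte in the buffer plus the index inside the buffer, and the last marker.Length - 1 bytes are carried into the next chunk so a marker that straddles two chunks is still found.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;

    public static class MarkerScanner
    {
        // Returns the absolute byte offset of every occurrence of `marker`,
        // counted from the stream position at the time of the call.
        public static List<long> FindAll(Stream stream, byte[] marker, int chunkSize = 8192)
        {
            if (marker == null || marker.Length == 0)
                throw new ArgumentException("marker must not be empty");

            var hits = new List<long>();
            int overlap = marker.Length - 1;
            var buffer = new byte[Math.Max(chunkSize, marker.Length) + overlap];
            int carried = 0;        // bytes kept from the previous chunk
            long bufferStart = 0;   // absolute offset of buffer[0]

            while (true)
            {
                int read = stream.Read(buffer, carried, buffer.Length - carried);
                if (read == 0) break;            // end of stream
                int valid = carried + read;

                for (int i = 0; i + marker.Length <= valid; i++)
                {
                    if (Matches(buffer, i, marker))
                        hits.Add(bufferStart + i);
                }

                // Keep the tail that could still be the start of a marker.
                int keep = Math.Min(overlap, valid);
                Array.Copy(buffer, valid - keep, buffer, 0, keep);
                bufferStart += valid - keep;
                carried = keep;
            }
            return hits;
        }

        private static bool Matches(byte[] buffer, int start, byte[] marker)
        {
            for (int j = 0; j < marker.Length; j++)
            {
                if (buffer[start + j] != marker[j]) return false;
            }
            return true;
        }
    }
    ```

    Called with a FileStream opened at the start of the file, the returned offsets can be passed straight to FileStream.Seek later on.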
    Thursday, May 19, 2005 9:38 PM
  • Hmm.... I'm not sure StreamReader is really what you want here. You are thinking of the file as a series of bytes, and looking for positions within that series of bytes where pages break. Any text reader must use an encoding, by default UTF-8, to convert bytes to characters, and in most encodings characters can span multiple bytes. So it generally doesn't make sense to try to associate a physical stream position with a text reader's logical position. You might get something close by setting the minimum buffer size for the StreamReader and depending on the FileStream to do all the real buffering, but I think you are better off just reading bytes straight from the FileStream and doing byte-wise comparisons.
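
    To illustrate the point about encodings, here is a small sketch (EncodingRoundTrip and ThroughUtf8 are hypothetical names used only for this example). It decodes raw bytes as UTF-8, the way a default StreamReader would, and encodes the resulting text again, which shows that binary data does not survive the trip and that character counts stop matching byte counts.

    ```csharp
    using System;
    using System.Text;

    public static class EncodingRoundTrip
    {
        // Decodes raw bytes as UTF-8 text and encodes that text again.
        // Bytes that are not valid UTF-8 are replaced during decoding,
        // so the result can differ from the input in content and length.
        public static byte[] ThroughUtf8(byte[] raw)
        {
            string text = Encoding.UTF8.GetString(raw);
            return Encoding.UTF8.GetBytes(text);
        }
    }
    ```

    For example, the three bytes 0C FF 41 come back as five bytes, because 0xFF is not valid UTF-8 and is replaced by U+FFFD, which encodes as EF BF BD.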
    Wednesday, July 06, 2005 9:26 PM
  • I know it's been a while since I started this thread, but the issue continues to re-appear in various projects. Let me see if I can be clear:

    In the latest incarnation of this problem, I'm processing a large TEXT file. I want to use a StreamReader. However, as I process the file, I need to "note" certain records. Imagine the file to be a large "document" consisting of many "pages". I may want to extract a "chapter". I know how to recognize when a "chapter" begins, and when it ends.

    Once I encounter the record that ends a "chapter", I want to go back to the start of the chapter, and capture all intervening records to a second file.

    What I really need is a "Position" property, to know that a certain record BEGINS at a specific byte-position in the underlying file. However, the base stream's Position property refers to how many BUFFERS have been read, not the actual Position of the CURRENT RECORD.

    Is there an elegant way to process a TEXT file, using StreamReader.ReadLine(), and yet still have an accurate "Position" property?
    Wednesday, November 09, 2005 12:38 AM
  • I wanted to do the same thing myself some time ago.  I ended up deriving my own class from StreamReader and then overriding the ReadLine( ) method so that it would keep track of exactly how many characters it has read (including the end-of-line characters not returned by ReadLine( ) ).  I then exposed a property that gave the byte position at the end of the last line read.  Then before seeking to a previously read position, you must call DiscardBufferedData( ).

    I found source code for the StreamReader class at
    http://www.123aspx.com/rotor/rotorsrc.aspx?rot=42055


    Wednesday, November 09, 2005 4:33 PM
  • Very nice. Care to share, or shall I re-invent the wheel?

    Wednesday, November 09, 2005 8:07 PM
  • Here is the snippet of changes that I made to the ReadLine() method.
    For my original use, I provided a read-only property that gave the byte length of the line that was just read by ReadLine().  You can easily modify it to keep track of the file position.
    Lines that I changed from the original code are indicated by /*--mod--*/




    private int _lineLength;                     /*--mod--*/
    public int LineLength {                      /*--mod--*/
      get{return _lineLength;}                   /*--mod--*/
    }

    public override String ReadLine() {
      _lineLength = 0;                           /*--mod--*/
      if (stream == null)
        __Error.ReaderClosed();
      if (charPos == charLen) {
        if (ReadBuffer() == 0) return null;
      }
      StringBuilder sb = null;
      do {
        int i = charPos;
        do {
          char ch = charBuffer[ i ];
          int EolChars = 0;                      /*--mod--*/
          if (ch == '\r' || ch == '\n') {
            EolChars = 1;                        /*--mod--*/
            String s;
            if (sb != null) {
              sb.Append(charBuffer, charPos, i - charPos);
              s = sb.ToString();
            }
            else {
              s = new String(charBuffer, charPos, i - charPos);
            }
            charPos = i + 1;
            if (ch=='\r' &&  (charPos<charLen || ReadBuffer()>0)) {
              if (charBuffer[charPos] == '\n') {
                charPos++;
                EolChars = 2;                   /*--mod--*/
              }
            }
            _lineLength = s.Length + EolChars;  /*--mod--*/
            return s;
          }
          i++;
        } while (i < charLen);
        i = charLen - charPos;
        if (sb == null) sb = new StringBuilder(i + 80);
        sb.Append(charBuffer, charPos, i);
      } while (ReadBuffer() > 0);
      string ss = sb.ToString();
      _lineLength = ss.Length;                 /*--mod--*/
      return ss;
    }
           

     


    Wednesday, November 09, 2005 10:28 PM
  • Thanks. I need help now with implementing the code. How, in an overridden method, do you access private base members? If I create a new class, with the following code, I'll get errors that "stream", "charPos", and "ReadBuffer" are inaccessible due to their protection level. This is because, in the StreamReader class, they are private members.



    using System;
    using System.Text;
    using System.Runtime.InteropServices;
    using System.IO;

    namespace streamOR
    {
     public class StreamReader2 : System.IO.StreamReader
     {

      private int _lineLength;
      public int LineLength
      {
       get{return _lineLength;}
      }

      public StreamReader2(String path) : base(path)
      {
      }

      public override String ReadLine()
      {
       _lineLength = 0; /* added dac */
       if (stream == null)
        throw new NullReferenceException("Reader is closed");

       if (charPos == charLen)
       {
        if (ReadBuffer() == 0) return null;
       }
       StringBuilder sb = null;
       do
       {
        int i = charPos;
        do
        {
         char ch = charBuffer[i ];
         // Note the following common line feed chars:
         // \n - UNIX, \r\n - DOS, \r - Mac
         int EolChars = 0; /* added dac */
         if (ch == '\r' || ch == '\n')
         {
          EolChars = 1; /* added dac */
          String s;
          if (sb != null)
          {
           sb.Append(charBuffer, charPos, i - charPos);
           s = sb.ToString();
          }
          else
          {
           s = new String(charBuffer, charPos, i - charPos);
          }
          charPos = i + 1;
          if (ch == '\r' && (charPos < charLen || ReadBuffer() > 0))
          {
           if (charBuffer[charPos] == '\n')
           {
            charPos++;
            EolChars = 2; /* added dac */
           }
          }
          _lineLength = s.Length + EolChars; /* added dac */
          return s;
         }
         i++;
        } while (i < charLen);
        i = charLen - charPos;
        if (sb == null) sb = new StringBuilder(i + 80);
        sb.Append(charBuffer, charPos, i);
       } while (ReadBuffer() > 0);
       string ss = sb.ToString();
       _lineLength = ss.Length; /* added dac */
       return ss;
      }
     
     }
    }

     

    Wednesday, November 09, 2005 10:44 PM
  • Bummer! I thought that since ReadLine() was a virtual method in the base class it would be best to override it in a derived class. I guess MS didn't really design for that, though. The way I implemented this modification was to take the original StreamReader code, paste it into a new class, and then modify ReadLine(). It has been a while since I did that -- maybe I first tried to derive the new class and failed in the same way you did.
    Wednesday, November 09, 2005 11:42 PM
  • I have had a similar problem in a couple of instances as well. I used a two pass solution. In Pass one I mark all the interesting bytes. In pass two, I start from known positions and read however much I need to.

    Here are two classes that encapsulate that behavior. IndexedFile scans an input file for markers you specify (as regular expressions or just a string) and creates a list of locations in the file where they are located. You can iterate over the IndexedFile positions, or you can access them directly by index. (Say you want to go to the third paragraph).

    In my example main, I am calling DiscardBufferedData(); This is the only way to get the streamreader to sync back up to the underlying file stream. In this contrived example, you pay a lot for discarding the buffer before each read. However, in practice you won't be doing this very often. You will seek to an index in the file, then get lines from it for a while.

    If you never wanted to call DiscardBufferedData(), you could instead read from the current location to the next location minus one. If your data has markers often enough, you won't consume too much memory and you will have just the data you wanted to work with.

     

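    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text.RegularExpressions;

    class Program // example code for using the classes below.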
        {
            static void Main(string[] args)
            {
                string sampleDataFileName = @"somefile";
                Regex SectionHeadingRe = new Regex(@"Chapter:(.*)");
                IndexedFile file = new IndexedFile(sampleDataFileName, SectionHeadingRe);
                FileStream fs = new FileStream(sampleDataFileName, FileMode.Open);
                StreamReader sr = new StreamReader(fs);
    
                foreach (long position in file)
                {
                    Console.WriteLine(position);
                    fs.Seek(position, SeekOrigin.Begin);
                    sr.DiscardBufferedData();
                    Console.WriteLine(sr.ReadLine());
                }
            }
        }
    
    
        public class IndexedFile : System.Collections.IEnumerable
        {
            #region private properties
            private FileStream fs;
            private StreamReader sr;
            private List<long> bookmarks = new List<long>();
            #endregion
    
            #region constructors
            private IndexedFile()
            {
                //no default constructor. This means filename and pattern are required.
            }
    
            ~IndexedFile()
            {
                Close();
            }
            
            /// <summary>
            /// Opens a file and parses it for the pattern string. Constructs a list of locations where the string is located.
            /// </summary>
            /// <param name="filename">File to scan.</param>
            /// <param name="pattern">String to mark indexes for.</param>
            public IndexedFile(string filename, string pattern)
            {
                init(filename);
                scanFile(pattern);
            }
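            /// <summary>
            /// Opens a file and parses it using the supplied regular expression. Constructs a list of locations where the string is located.
            /// </summary>
            /// <param name="filename">File to scan.</param>
            /// <param name="patternRegex">Regular expression to mark indexes for.</param>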
            public IndexedFile(string filename, Regex patternRegex)
            {
                init(filename);
                scanFile(patternRegex);
            }
            #endregion
    
            #region public accessors
            public int Count
            {
                get { return bookmarks.Count; }         
            }
    
            public long this[int index]
            {
                get { return bookmarks[index]; }
            }
    
            #endregion
    
            #region public methods
            public void Close()
            {            
                sr.Close();
                fs.Close();
            }
    
            #endregion
    
            #region private methods
            private void init(string filename)
            {
                fs = new FileStream(filename, FileMode.Open);
                sr = new StreamReader(fs);
            }
    
            #endregion
    
    
            private void scanFile(string pattern)
            {
                string p = Regex.Escape(pattern);
                Regex patternAsRe = new Regex(p);
                scanFile(patternAsRe);
            }
    
            private void scanFile(Regex pattern)
            {            
                long seekPos = 0;
                string line = string.Empty;
                while (sr.Peek() != -1)
                {
                    line = sr.ReadLine();
                    MatchCollection matches = pattern.Matches(line);
                    foreach (Match m in matches)
                    {
                        if (m.Success)
                        {
                            bookmarks.Add(m.Index + seekPos);
                        }
                    }
                    // Add two for the CR/LF that ReadLine strips. This assumes CR/LF line
                    // endings and one byte per character; otherwise the offsets drift.
                    seekPos = seekPos + line.Length + 2;
                }
                Close();
            }
    
            #region IEnumerable Members
    
            public System.Collections.IEnumerator GetEnumerator()
            {
                return new IndexedFileEnumerator(this);
            }
    
            #endregion
        }
    
        public class IndexedFileEnumerator :System.Collections.IEnumerator
        {
            private IndexedFile iFile = null;
            private int index = -1;
    
            public IndexedFileEnumerator(IndexedFile indexedFile)
            {
                this.iFile = indexedFile;
            }
    
            #region IEnumerator Members
    
            public object Current
            {
                get
                {
                    try
                    {
                        return iFile[index];
                    }
                    catch (ArgumentOutOfRangeException) // thrown by the List<T> indexer for a bad index
                    {
                        throw new InvalidOperationException();
                    }
                }
            }
    
            public bool MoveNext()
            {
                index++;
                return (index < iFile.Count);
            }
    
            public void Reset()
            {
                index = -1;
            }
    
            #endregion
        }
    
    
    
    
    
    Wednesday, August 02, 2006 1:48 AM
  • The thing that really gets your ticker is that if you put a watch on a StreamReader object, you can see private fields called charPos and bytePos. But *sigh* that doesn't help.

     

    I personally create a readInALine method (or whatever) and track the number of bytes into an attribute in the class.  Not good design but a quick fix.
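
    A rough sketch of that kind of reader, under stated assumptions: PositionedLineReader is a hypothetical class, not a framework type. It does its own buffering over the raw bytes, splits lines on the CR and LF bytes, and decodes each line afterwards. That only works for encodings where those two bytes never occur inside another character (ASCII, Latin-1, UTF-8), and it does not skip a byte order mark. Position is counted from where the reader started, so it matches the file offset only if the stream was at offset 0.

    ```csharp
    using System;
    using System.IO;
    using System.Text;

    public sealed class PositionedLineReader
    {
        private readonly Stream _stream;
        private readonly Encoding _encoding;
        private readonly byte[] _buffer = new byte[8192];
        private int _pos;        // next unread index in _buffer
        private int _len;        // number of valid bytes in _buffer
        private long _consumed;  // bytes handed out so far

        public PositionedLineReader(Stream stream, Encoding encoding)
        {
            _stream = stream;
            _encoding = encoding;
        }

        // Byte offset at which the next ReadLine call will start reading.
        public long Position
        {
            get { return _consumed; }
        }

        // Returns the next line without its terminator, or null at end of input.
        public string ReadLine()
        {
            int b = NextByte();
            if (b < 0) return null;

            var line = new MemoryStream();
            while (b >= 0 && b != '\n' && b != '\r')
            {
                line.WriteByte((byte)b);
                b = NextByte();
            }
            // Treat CR LF as a single terminator.
            if (b == '\r' && PeekByte() == '\n') NextByte();
            return _encoding.GetString(line.ToArray());
        }

        private bool Fill()
        {
            _len = _stream.Read(_buffer, 0, _buffer.Length);
            _pos = 0;
            return _len > 0;
        }

        private int PeekByte()
        {
            if (_pos == _len && !Fill()) return -1;
            return _buffer[_pos];
        }

        private int NextByte()
        {
            if (_pos == _len && !Fill()) return -1;
            _consumed++;
            return _buffer[_pos++];
        }
    }
    ```

    Reading Position before each ReadLine call gives the byte offset where that record begins, which can later be passed to Seek on a separate stream over the same file.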

    Thursday, February 21, 2008 11:57 AM
  • Old post I know, but I just had to solve this myself. Here's the solution I used.

      public sealed class StreamReaderBuffer {
        public StreamReaderBuffer(Stream stream) {
          this.reader = new StreamReader(stream);
        }
    
        public char ReadChar() {
          if(this.readAheadBuffer == null 
            || this.readAheadBufferPosition == this.readAheadBuffer.Length) {
    
            return (char)reader.Read();
    
          } else {
    
            return this.readAheadBuffer[this.readAheadBufferPosition++];
          }
        }
    
        public string ReadAhead(int length) {
    
          if(this.readAheadBuffer != null && 
            this.readAheadBufferPosition + length < this.readAheadBuffer.Length) {
    
            return new string(this.readAheadBuffer).Substring(this.readAheadBufferPosition, length);
          }
    
          char[] buffer = new char[length];
          int i;
          for(i = 0; i < length; i++) {
            buffer[i] = (char)reader.Read();
            if(reader.EndOfStream) {
              break;
            }
          }
    
          if(i < length) {
            Array.Resize<char>(ref buffer, i + 1);
          }
    
          if(this.readAheadBuffer == null
            || this.readAheadBufferPosition == this.readAheadBuffer.Length) {
    
            this.readAheadBuffer = buffer;
            this.readAheadBufferPosition = 0;
    
            return new string(this.readAheadBuffer);
    
          } else {
            char[] combinedBuffer = new char[buffer.Length + this.readAheadBuffer.Length];
            this.readAheadBuffer.CopyTo(combinedBuffer, 0);
            buffer.CopyTo(combinedBuffer, this.readAheadBuffer.Length);
            this.readAheadBuffer = combinedBuffer;
            return new string(this.readAheadBuffer).Substring(this.readAheadBufferPosition, length);
          }
        }
    
        public bool EndOfStream {
          get {
            return reader.EndOfStream;
          }
        }
    
        private char[] readAheadBuffer;
        private int readAheadBufferPosition;
        private StreamReader reader;
      }

    Monday, June 07, 2010 11:22 AM
  • You can use a BinaryReader instead of StreamReader. Here's the extension method I created to do so:

    public static class BinaryReaderExtensions
    {
            public static string ReadLine(this BinaryReader reader)
            {
                StringBuilder builder = new StringBuilder();
         
                char c;
    
                do
                {
                    try
                    {
                        c = reader.ReadChar();
                    }
                    catch (EndOfStreamException)
                    {
                        c = '\n';
                    }
    
                    if (c != '\r' && c != '\n')
                        builder.Append(c);
    
                } while (c != '\n');
    
                return builder.ToString();
            }
     }


    • Edited by vIndEx Sunday, July 08, 2012 7:35 PM
    Sunday, July 08, 2012 7:34 PM
  • I found this worked for me, accessing the hidden character position of the Reader through reflection... One of my projects indexes multi-GB files using StreamReader, so this code is very handy when I need to index the position of certain lines...

        ''' <summary>
        ''' Determine where in the file stream we are
        ''' </summary>
        ''' <param name="s"></param>
        ''' <returns></returns>
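        ''' <remarks>
        ''' Reads private StreamReader fields by name, so it depends on the framework's internal implementation.
        ''' charPos and charLen count characters while BaseStream.Position counts bytes, so the result is only
        ''' reliable when every character occupies one byte.
        ''' </remarks>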
        Private Function GetCharpos(s As StreamReader) As Long
            Dim charpos As Int32 = DirectCast(s.[GetType]().InvokeMember("charPos", BindingFlags.DeclaredOnly Or BindingFlags.[Public] Or BindingFlags.NonPublic Or BindingFlags.Instance Or BindingFlags.GetField, Nothing, s, Nothing), Int32)
            Dim charlen As Int32 = DirectCast(s.[GetType]().InvokeMember("charLen", BindingFlags.DeclaredOnly Or BindingFlags.[Public] Or BindingFlags.NonPublic Or BindingFlags.Instance Or BindingFlags.GetField, Nothing, s, Nothing), Int32)
            Return s.BaseStream.Position - charlen + charpos
        End Function



    Friday, November 01, 2013 1:31 AM