How to restore a Decoder object's state?

  • Question

  • Hi,

    I'm implementing a ReadTo(string text) function over a stream of bytes from the network, where the user can provide an Encoding object via a property. The Decoder returned by its GetDecoder() method is used to convert the bytes to characters before the text is searched for.

    One error case is when no data arrives. I want to raise a TimeoutException, but I also want to restore the Decoder object to the state it was in before ReadTo() was called, so that the operation is atomic.

    That way, the user can call ReadTo() again at a later time and get the expected results. Not restoring the Decoder to its earlier state might cause subtle failures, especially for MBCS streams.

    Thanks
    Jason.

    Tuesday, September 4, 2012 5:25 PM

All replies

  • There's no specific way to restore the decoder state but you could serialize/deserialize it as needed.

    But I don't understand what you need this for. If no data arrives, there is nothing to pass to Convert, so the decoder state doesn't change.

    Tuesday, September 4, 2012 7:17 PM
    Moderator
  • Hi Mike,

    Thanks for the tip. I've had little to do with serialization/deserialization. Are there any tips?

    The code implements a Stream, with Read(byte[], offset, count), Read(char[], offset, count), ReadByte(), ReadChar(), ReadLine(), ReadTo(). So there's a potential mix between bytes and chars.

    Let's take a case where we're waiting on a set of 5 Chinese characters, each transmitted as UTF-8. After a previous call to Read(char[], ...) there might be 2.5 characters in the incoming byte buffer. Two characters are in my char buffer after the read returns. The other 0.5 is now captured by the internal state of the Decoder object. Let's say that this half character is the start of the first character of the string we're looking for with ReadTo().
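
    To make the "0.5 of a character" case concrete, here is a minimal sketch of a UTF-8 Decoder holding on to a partial sequence between calls. The class and method names are hypothetical and chosen only for illustration, and the sample text assumes three-byte characters.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical demo class; not part of the poster's code.
    public static class PartialSequenceDemo
    {
        // Decodes the first `split` bytes, then the remainder, using the same
        // Decoder instance. Returns the text produced by each of the two calls.
        public static string[] DecodeInTwoCalls(byte[] bytes, int split)
        {
            Decoder decoder = Encoding.UTF8.GetDecoder();
            char[] chars = new char[bytes.Length]; // UTF-8 never yields more chars than bytes

            // flush: false tells the decoder to keep any incomplete trailing sequence.
            int n1 = decoder.GetChars(bytes, 0, split, chars, 0, false);
            string first = new string(chars, 0, n1);

            // The bytes kept from the first call are combined with the new ones.
            int n2 = decoder.GetChars(bytes, split, bytes.Length - split, chars, 0, false);
            string second = new string(chars, 0, n2);

            return new[] { first, second };
        }

        public static void Main()
        {
            byte[] bytes = Encoding.UTF8.GetBytes("中文字"); // 3 characters, 9 bytes
            string[] parts = DecodeInTwoCalls(bytes, 7);     // 2 whole characters + 1 stray byte
            Console.WriteLine(parts[0].Length); // 2
            Console.WriteLine(parts[1].Length); // 1
        }
    }
    ```

    The stray seventh byte is not returned to the caller and is no longer needed from the byte buffer: it exists only inside the Decoder, which is why the Decoder's state matters between calls.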

    So the first call to ReadTo(), with a timeout, needs to decode what's in the byte buffer. It reads three characters, but that's not enough for ReadTo() to succeed, so it returns. 100 ms later an event indicates that more data has arrived, and ReadTo() is called again.

    Now ReadTo() should be an atomic operation - all or nothing. But the first ReadTo() actually isn't atomic, because it depends on the state of the Decoder and this actually changes between calls.

    I could even decide to count the bytes first and only check whether the minimum number of bytes is there. But I can still construct another use case that doesn't behave as it should. For example, a ReadTo() fails, and the user then calls ReadChar() followed by ReadTo(). The ReadChar() is likely to return incorrect data, because the internal state of the Decoder is not what it was before the first ReadTo().

    I'd like to avoid using GetCharCount to look for individual characters, as I "feel" this is extremely inefficient for the normal use case. A cheap copy of the Decoder object, which is only restored if the ReadTo() fails, would hopefully be simpler and more efficient.

    Regards,
    Jason


    • Edited by Jason Curl Tuesday, September 4, 2012 8:19 PM grammar
    Tuesday, September 4, 2012 8:18 PM
  • "I've had little to do with serialization/deserialization? Are there any tips?"

    Nothing special except that I have some concerns about its efficiency in this case. Every time you call ReadTo you'd need to serialize the decoder to a MemoryStream and keep that stream around until ReadTo completes successfully (or fails, in which case you need to deserialize, but this probably happens rarely).

    I still don't understand how ReadTo is supposed to work.

    For one thing, what stops Read(char[], ...) from reading through all 5 characters that ReadTo expects?

    And OK, ReadTo reads only 3 characters and times out. But by restoring the state of the decoder you'll basically put those 3 characters back in the byte buffer. What for? If another ReadTo call comes after 100 ms, it should notice it already has 3 characters in the char buffer and wait for 2 more. Or perhaps you are mixing char[] reading with byte[] reading?

    • Marked as answer by Jason Curl Monday, September 10, 2012 6:59 PM
    Wednesday, September 5, 2012 4:52 AM
    Moderator
    Have a look at the description of SerialPort.ReadTo(). This is the interface I'm trying to follow. Data in the byte stream is converted to chars and searched for the string "text". If it's found, all text up to, but not including, "text" is returned. The string "text" is also consumed from the input byte buffer. If the string is not found within a specified period of time while waiting for more data to arrive from the network, a TimeoutException is thrown.

    Looking at the SerialPort.ReadTo() implementation (e.g. via Reflector): on an exception, it takes the chars already returned by the decoder, converts them back to bytes with the encoder, and puts them back in the buffer! This definitely has quirks, especially if the byte stream contains sequences not understood by the decoder, which can definitely happen when decoding UTF-8 to Unicode/UCS-2. So I don't want to do this.
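
    As a small illustration of why decoding and then re-encoding is not a safe way to "put bytes back", the sketch below round-trips a buffer containing a byte that is invalid in UTF-8. It does not reproduce SerialPort's code; the class and method names are hypothetical, and it assumes the default replacement fallback of Encoding.UTF8.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical demo class; illustrates the lossy round trip only.
    public static class RoundTripDemo
    {
        // Decodes the bytes to a string and encodes that string again.
        public static byte[] DecodeThenEncode(byte[] input)
        {
            string decoded = Encoding.UTF8.GetString(input); // invalid bytes become U+FFFD
            return Encoding.UTF8.GetBytes(decoded);
        }

        public static void Main()
        {
            byte[] original = { 0x41, 0xFF, 0x42 }; // 0xFF can never appear in valid UTF-8
            byte[] roundTrip = DecodeThenEncode(original);
            Console.WriteLine(BitConverter.ToString(original));  // 41-FF-42
            Console.WriteLine(BitConverter.ToString(roundTrip)); // 41-EF-BF-BD-42
        }
    }
    ```

    The buffer comes back with different contents and a different length, so a later Read(byte[], ...) would no longer see what actually arrived on the wire.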

    By restoring the decoder, I only want to put it back in the state it was in before ReadTo() was called. What happens in the byte buffer is completely under my control. I have the offset and length of my buffer, so in case of an error the data simply isn't consumed and the buffer remains as it is, effectively putting the bytes back in a very cheap way.

    What you suggest is the other solution that I began thinking about last night as well. It would mean maintaining a character buffer internally, and it would solve the problem without having to restore the state of the decoder. It does increase the complexity of my other functions, as I now have extra buffers to manage. But looking around, I don't see another solution either.

    Thanks for your time and useful comments
    Jason.

    Wednesday, September 5, 2012 8:08 AM
  • "This would imply maintaining a character buffer internally."

    I see, for some reason my impression was that you already do that.

    Yes, it's a bit more complicated but I think it's better than serializing the decoder or doing what the SerialPort does. Anyway, the primary users of Decoder/Encoder classes are the TextReader/Writer classes. And StreamReader does maintain a byte buffer and a char buffer and uses them in ReadLine for example.
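
    A minimal sketch of the char-buffer idea, under stated assumptions: the class name CachedTextReader and its Append/TryReadTo methods are hypothetical, the timeout and waiting logic is left out, and the search is deliberately naive. Every byte passes through the Decoder exactly once, so a failed search never needs the decoder state to be restored.

    ```csharp
    using System;
    using System.Text;

    // Hypothetical reader; a sketch, not the poster's or StreamReader's implementation.
    public sealed class CachedTextReader
    {
        private readonly Encoding encoding;
        private readonly Decoder decoder;
        private readonly StringBuilder charCache = new StringBuilder();

        public CachedTextReader(Encoding encoding)
        {
            this.encoding = encoding;
            decoder = encoding.GetDecoder();
        }

        // Call whenever bytes arrive. Decoded chars are kept in the cache; an
        // incomplete trailing sequence stays inside the decoder until more bytes come.
        public void Append(byte[] data, int offset, int count)
        {
            char[] chars = new char[encoding.GetMaxCharCount(count)];
            int n = decoder.GetChars(data, offset, count, chars, 0, false);
            charCache.Append(chars, 0, n);
        }

        // Non-blocking core of a ReadTo(): all or nothing. On failure the cache
        // and the decoder are left untouched, so the call can simply be retried.
        public bool TryReadTo(string text, out string result)
        {
            string cached = charCache.ToString(); // naive: copies the cache on every call
            int index = cached.IndexOf(text, StringComparison.Ordinal);
            if (index < 0)
            {
                result = string.Empty;
                return false;
            }
            result = cached.Substring(0, index);
            charCache.Remove(0, index + text.Length); // consume the match as well
            return true;
        }
    }
    ```

    A blocking ReadTo() could call TryReadTo in a loop, waiting for new data between attempts, and throw a TimeoutException once the deadline passes. ReadChar() and Read(char[], ...) would then take their characters from the same cache.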

    Wednesday, September 5, 2012 8:33 AM
    Moderator
  • Thanks Mike. Yes, I was initially trying to avoid having both a "byte" buffer and a "char" buffer. But that's the way I will go, caching the converted values in between. This appears to be the most reliable and fastest method available.
    Monday, September 10, 2012 6:59 PM