Scan a byte array for a specific character in a given encoding

  • Question

  • Both System.Text.Encoding and its workhorse System.Text.Decoder provide straightforward means to decode raw bytes into a string when the string size is known in advance:

    System.Text.Decoder.GetChars(Byte[], Int32, Int32, Char[], Int32)
    System.Text.Encoding.GetChars(Byte[], Int32, Int32)
    System.Text.Encoding.GetString(Byte[], Int32, Int32)

    However, when that is not the case, and the string's end must be determined by searching for a specific character (a terminator, generally not a NUL) — what are the standard methods for this? Are there any functions like those below?

    IndexOf(Char terminator, Byte[] buffer, Int32 start, Int32 maxsize)
    EnumerateChars(Byte[] buffer, Int32 start, Int32 count, SomeDelegateType callback)

    Surprisingly, I was unable to find any, so I am wondering whether I am missing something obvious. If not, then how is this task usually accomplished — without making any assumptions on the encoding (take EBCDIC for example)?

    For reference, in the C/C++ world this functionality is provided by something like

    unsigned char *_mbschr_l(
       const unsigned char *str,
       unsigned int c,
       _locale_t locale
    ); // C only 

    or other routines of non-Microsoft flavors.

    • Edited by Anton Samsonov Tuesday, June 30, 2015 12:28 PM Added _mbschr_l as well-known example
    Tuesday, June 16, 2015 6:25 PM

Answers

  • "I am somewhere in the middle on that scale, thus very pessimistic about the “DIY” approach."

    A DIY approach doesn't have to involve doing the actual decoding, that can be left to existing .NET Framework functionality:

    using System;
    using System.Text;
    
    class Program {
        static void Main() {
            var s = "aËd";
    
            var encoding = Encoding.UTF8;
            var bytes = encoding.GetBytes(s);
    
            Console.WriteLine(IndexOf(bytes, 'Ë', encoding));
            Console.WriteLine(IndexOf(bytes, 'd', encoding));
        }
    
        private static int IndexOf(byte[] bytes, char ch, Encoding encoding) {
            var decoder = encoding.GetDecoder();
            var chars = new char[1];
    
            for (int byteIndex = 0; byteIndex < bytes.Length; ) {
                int bytesUsed;
                int charsUsed;
                bool completed;
    
                decoder.Convert(bytes, byteIndex, bytes.Length - byteIndex, chars, 0, 1, false, out bytesUsed, out charsUsed, out completed);
    
                if (charsUsed == 1 && chars[0] == ch)
                    return byteIndex;
    
                byteIndex += bytesUsed;
            }
    
            return -1;
        }
    }
    

    There are a few things that this code probably doesn't handle correctly, but those should be fixable:

    • I don't know what happens if "bytes" contains an invalid codepoint; if bytesUsed ends up being 0, you'll be stuck in an infinite loop
    • I don't know what happens if "bytes" contains characters that decode to a surrogate pair
    • Using the pointer overload of Convert might be more efficient
    • Attempting to decode more than one character at once might be more efficient
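
    For what it's worth, the first two bullet points could be addressed along these lines. This is only a sketch, not a drop-in replacement, and `IndexOfChar` is a made-up name:

    ```csharp
    using System;
    using System.Text;

    class Sketch {
        // A variant of the IndexOf above with two guards added:
        // - bail out if the decoder makes no progress at all, which
        //   avoids a potential infinite loop on malformed input, and
        // - never match a surrogate half, since a single terminator
        //   char cannot legitimately equal one half of a pair.
        public static int IndexOfChar(byte[] bytes, char ch, Encoding encoding) {
            var decoder = encoding.GetDecoder();
            var chars = new char[1];

            for (int byteIndex = 0; byteIndex < bytes.Length; ) {
                int bytesUsed, charsUsed;
                bool completed;

                decoder.Convert(bytes, byteIndex, bytes.Length - byteIndex,
                                chars, 0, 1, false,
                                out bytesUsed, out charsUsed, out completed);

                if (bytesUsed == 0 && charsUsed == 0)
                    return -1; // no progress at all: treat as undecodable

                if (charsUsed == 1 && chars[0] == ch && !char.IsSurrogate(chars[0]))
                    return byteIndex;

                byteIndex += bytesUsed;
            }
            return -1;
        }

        static void Main() {
            var enc = Encoding.UTF8;
            Console.WriteLine(IndexOfChar(enc.GetBytes("aËd"), 'Ë', enc)); // 1
            Console.WriteLine(IndexOfChar(enc.GetBytes("aËd"), 'd', enc)); // 3
        }
    }
    ```

    The returned index is still the byte offset at which the matching character's encoded form begins, exactly as in the original.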
    Tuesday, June 30, 2015 2:09 PM
    Moderator

All replies

  • Hello Anton,

    Without the length AND a NUL terminator it is very hard to find a specific character.

    It would be nice if you gave us the encoding of the text, because some encodings vary the byte count (depending on the character, one char takes 1 up to 6 bytes), while others always use the same size per char.

    You could try to revert the encoding and encode each char and then check it. If it is not what you are looking for, skip it and take the next char.

    For example, UTF-8.

    You see that you can find out how many bytes the char must occupy and how to convert it. Then you can check it and skip ahead if it is wrong.
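
    To illustrate that idea for UTF-8 specifically (a sketch only; `Utf8SequenceLength` is a made-up helper): the leading byte alone tells you how long the sequence is, so you can skip whole characters without fully decoding them.

    ```csharp
    using System;

    class Utf8Scan {
        // Length in bytes of the UTF-8 sequence starting with 'lead',
        // derived from the leading byte's bit pattern. Returns -1 for
        // a continuation byte or an invalid lead byte. (Modern UTF-8
        // caps sequences at 4 bytes; the original 1993 scheme went to 6.)
        public static int Utf8SequenceLength(byte lead) {
            if (lead < 0x80) return 1;           // 0xxxxxxx: ASCII
            if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
            if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
            if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
            return -1;                           // 10xxxxxx continuation, or invalid
        }

        static void Main() {
            Console.WriteLine(Utf8SequenceLength(0x41)); // 'A'           -> 1
            Console.WriteLine(Utf8SequenceLength(0xC3)); // lead of 'Ë'   -> 2
            Console.WriteLine(Utf8SequenceLength(0xE2)); // lead of '€'   -> 3
        }
    }
    ```

    Note that this works only because UTF-8 was designed to be self-synchronizing; as discussed further down the thread, many legacy encodings give you no such guarantee.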


    © 2015 Thomas Roskop
    Germany //  Deutschland

    Tuesday, June 16, 2015 6:49 PM
  • It is very hard to find some character without the length and the NUL-terminator.

    It should be very easy for an Encoding or Decoder object, because it surely knows how to interpret a byte stream as a sequence of characters.

    Give us the encoding. Some encodings use variable character size, while others use a constant one.

    The encoding may be totally arbitrary — at least all the encodings instantiated by .Net, but also any derived class based on System.Text.Encoding.

    You could try to revert the encoding and encode each char and then check it. If it is not what you are looking for, skip it and take the next char.

    I don't quite understand what “to revert the encoding” means. The text is already in some encoding, stored in a file with an identification of the encoding being used. A corresponding Encoding (or Decoder) object is selected for that. What do you suggest then? To take 1, then 2, and so on up to 6 bytes at a time and ask the Decoder to interpret that chunk as a string, until it stops throwing exceptions? And then to advance to the next chunk? That would be not only dumb, but incredibly slow as well.

    For example UTF-8. You see that you can find out how many bytes the char must have and how to convert the char. Then you can check it and skip if wrong.

    The outside user of a class should not make any assumptions about that class's internals, i.e. the encoding being used — the code should be compatible with any Encoding, not just ASCII or UTF-8 or any other concrete type. Or do you suggest re-implementing all the encodings? You must be kidding me. Terminator-delimited strings are so common in the computing world, yet .Net provides no way to deal with them? I just can't believe this.

    Wednesday, June 17, 2015 6:25 PM
  • Hello Anton,

    >>Scan a byte array for a specific character in a given encoding

    As far as I know, .NET does not provide a built-in method for this. Since you already know the encoding, what I suggest for scanning a byte array for a specific character is to first convert the character you want to find to its byte pattern in that encoding, and then search for that byte pattern in the byte[] array. Here is a discussion about this topic:

    https://social.msdn.microsoft.com/Forums/vstudio/en-US/15514c1a-b6a1-44f5-a06c-9b029c4164d7/searching-a-byte-array-for-a-pattern-of-bytes?forum=csharpgeneral
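
    The linked suggestion boils down to something like the following naive pattern search — a sketch only (a Boyer–Moore-style algorithm would be faster), and the follow-up below explains why matching raw bytes is not safe for every encoding:

    ```csharp
    using System;
    using System.Text;

    class PatternSearch {
        // Naive search for 'pattern' inside 'buffer'; returns the first
        // byte offset of a full match, or -1 if the pattern never occurs.
        public static int IndexOfBytes(byte[] buffer, byte[] pattern) {
            for (int i = 0; i + pattern.Length <= buffer.Length; i++) {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) return i;
            }
            return -1;
        }

        static void Main() {
            var enc = Encoding.UTF8;
            var haystack = enc.GetBytes("aËd"); // bytes: 61 C3 8B 64
            Console.WriteLine(IndexOfBytes(haystack, enc.GetBytes("Ë"))); // 1
            Console.WriteLine(IndexOfBytes(haystack, enc.GetBytes("x"))); // -1
        }
    }
    ```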

    Regards.



    Monday, June 22, 2015 7:24 AM
    Moderator
  • Since you already know the encoding, you could firstly convert the character you want to find to the byte pattern in this encoding, and to search/match for a byte pattern in an byte[] array.

    That is again an assumption, which is totally incorrect in the most general case. First, obviously, coding-unit size must be respected at all times — one cannot simply find a byte sequence starting in the middle of a UTF-16 unit and expect it to be a valid solution. Second, much less obvious and the real source of complication: interpreting a single unit out of context — without a deep understanding of what exactly it means in combination with the preceding unit(s) — would be wrong; it is just like pointing to an arbitrary position in a compressed or encrypted stream and trying to decode it (although you may still get a correct answer sometimes with plenty of luck).
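
    The first point can be demonstrated concretely with UTF-16LE (2-byte code units): a raw byte search can "find" the terminator at an offset where no such character exists at all. The code points below were chosen purely to manufacture the collision:

    ```csharp
    using System;
    using System.Text;

    class FalseMatch {
        // Naive byte-pattern search, as proposed above.
        public static int NaiveByteIndexOf(byte[] haystack, byte[] needle) {
            for (int i = 0; i + needle.Length <= haystack.Length; i++) {
                int j = 0;
                while (j < needle.Length && haystack[i + j] == needle[j]) j++;
                if (j == needle.Length) return i;
            }
            return -1;
        }

        static void Main() {
            var enc = Encoding.Unicode;                  // UTF-16LE
            // U+4142 encodes as bytes 42 41, U+4200 as bytes 00 42.
            var haystack = enc.GetBytes("\u4142\u4200"); // 42 41 00 42
            var needle   = enc.GetBytes("A");            // 41 00

            // The byte pattern of 'A' appears at (misaligned!) offset 1...
            Console.WriteLine(NaiveByteIndexOf(haystack, needle));   // 1
            // ...yet the decoded string contains no 'A' at all:
            Console.WriteLine(enc.GetString(haystack).IndexOf('A')); // -1
        }
    }
    ```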

    Monday, June 22, 2015 12:31 PM
  • Every encoding will have a specific byte array which represents a character. For a received byte stream, split it into an array and make sure each item's size meets the given encoding's size, then search the array for your given specific character (of course you need to convert it to a byte array first).
    Friday, June 26, 2015 2:37 AM
  • Every encoding will have a specific byte array which represents a character.

    But the reverse does not hold true in general — that is the problem. Well, there are indeed robust encodings like UTF-8 which make it always possible to tell a leading unit from a trailing one. However, there are also legacy encodings like ISO 2022-CN / -JP / -KR or Big-5 and Shift JIS, which use overlapping regions for both kinds of units. For example, if one sees a byte 0x41 in Shift JIS, it is impossible to know (except by heuristics) whether it constitutes a Latin letter “A” or the second byte of a double-byte JIS X 0208 character. Moreover, ISO 2022 encodings use escape sequences to switch character sets in the long run, which makes them even more stateful: one cannot simply start decoding a sequence of bytes from an arbitrary position, even on a coding-unit boundary. If you think that those encodings are rare, then you may be surprised to find out that all of them are in fact supported in .NET and specified in standard exchange formats like ISO 8211.

    Make sure each item size meets the given encoding size.

    There is no such thing as “item size” or “encoding size”. Encodings deal with coding units (code units) of 1, 2 or 4 bytes to represent a coding position (codepoint) in a character set, such as Unicode, by a combination of 1 or more coding units, perhaps prefixed by an escape sequence (shift mark); shifts may return explicitly or implicitly. To make things even more complicated, Unicode allows the same logical “character” to be composed from different codepoints, which may or may not be later unified according to a set of rules called a normalization form.
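
    That last complication can be seen directly in .NET: precomposed U+00CB (“Ë”) and “E” followed by combining diaeresis U+0308 are different strings until normalized. A small sketch:

    ```csharp
    using System;
    using System.Text;

    class Composition {
        static void Main() {
            string precomposed = "\u00CB";  // Ë as a single codepoint
            string decomposed  = "E\u0308"; // E + combining diaeresis

            // Ordinal comparison sees two different strings...
            Console.WriteLine(precomposed == decomposed);             // False
            // ...until both are brought to the same normalization form.
            Console.WriteLine(precomposed == decomposed.Normalize()); // True (NFC)
            Console.WriteLine(precomposed.Normalize(NormalizationForm.FormD)
                              == decomposed);                         // True (NFD)
        }
    }
    ```

    So even a correctly decoded character stream may still fail a naive char-by-char comparison against the terminator, depending on how the text was composed.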

    So, to summarize: anyone thinking that encodings are easy and that it is a good idea to implement them yourself, knows either everything or virtually nothing about them. I am somewhere in the middle on that scale, thus very pessimistic about the “DIY” approach.

    Tuesday, June 30, 2015 1:38 PM