Problem with .Net socket -- random characters being inserted when reading data

    Question

  • I've written a very small application that uses a socket to get data from a URL.  The code is based on several snippets of MSDN example code that show how to connect to an HTTP server, send a URL request, and then slurp the data back.

     

    The code works fine for the majority of the data read; however, extra characters are randomly being inserted in the data returned by the socket that I don't see if I use a browser like IE7 to hit the URL.  To demonstrate the problem, I've cut this down to minimal functionality--all it basically does is print what it has read.  Here's a sample of the output; note the a000 on a line by itself in the last record below--these characters are not present in the data returned from the URL. The code below can be run using the following command line, with output written to junk.txt:

     

    SimpleAstral.exe "http://astral.berkeley.edu/seq.cgi?get=scopdom-seqres-gd-sel-gs-bib;ver=1.71;item=seqs;cut=95" > junk.txt

     

    I find it really hard to believe that socket's Receive method is broken, so I must be missing some setup option or initialization code.  Because I can hit this url successfully from IE7 every time, I don't think there is any problem with connectivity or network hardware.

     

    Thanks!

     

    --Andrew

     

    SAMPLE OUTPUT--NOTE THE BOGUS CHARACTERS (THE a000 LINE IN THE LAST RECORD):

    Sample Output
    >d1hqz1_ d.109.1.2 (1:) Cofilin-like domain of actin-binding protein abp1p {Baker's yeast (Saccharomyces cerevisiae)}
    lepidytthsreidaeylkivrgsdpdttwliispnakkeyepestgssfhdflqlfdet
    kvqyglarvsppgsdvekiiiigwcpdsaplktrasfaanfaavannlfkgyhvqvtard
    eddldenellmkisnaaga
    >d1ak7__ d.109.1.2 (-) Destrin {Human and pig (Homo sapiens) and (Sus scrofa)}
    tmitpssgnsasgvqvadevcrifydmkvrkcstpeeikkrkkavifclsadkkciivee
    gkeilvgdvgvtitdpfkhfvgmlpekdcryalydasfetkesrkeelmfflwapelapl
    kskmiyasskdaikkkfqgikhecqangpedlnraciaeklggslivafegcpv
    >d1m4ja_ d.109.1.2 (A:) Adf-H domain of twinfilin isoform-1 {Mouse (Mus musculus)}
    iqasedvkeifararngkyrllkisieneqlvvgscsppsdsweqdydsfvlplledkqp
    cyvlfrldsqnaqgyewifiawspdhshvrqkmlyaatratlkkefggghikdevfgtvk
    edvslhgykkyll
    >d1udma_ d.109.1.2 (A:) Coactosin-like protein Cotl1 (Clp) {Mouse (Mus musculus)}
    gsegaatmatkidkeacraaynlvrddgsaviwvtfrydgativpgdqgadyqhfiqqct
    ddvrlfafvrfttgdamskrskfalitwigedvsglqraktgtdktlvkevvqnfakefv
    isdrkeleedfirselkkagganydaqse
    >d1t2la_ d.109.1.
    a000
    2 (A:) Coactosin-like protein Cotl1 (Clp) {Human (Homo sapiens)}
    atkidkeacraaynlvrddgsaviwvtfkydgstivpgeqgaeyqhfiqqctddvrlfaf
    vrfttgdamskrskfalitwigenvsglqraktgtdktlvkevvqnfakefvisdrkele
    edfikselkkaggany

     

     


     

    Code Snippet

    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.IO;
    using System.Net;
    using System.Net.Sockets;

    namespace SimpleAstral
    {
        class Program
        {
            static int Main(string[] args)
            {
                string url = args[0];
                Uri u = new Uri(url); // Uri object to decode/parse passed in URL (and parameters)
                Socket s = null; // Socket for communication
                IPHostEntry hostEntry = null;
                string version = "";
                string cutoff = "";
                Match vcm;

                // Pick off version and cutoff values from passed in URL
                Regex versrx = new Regex(@"ver=(?[^;]+)");
                Regex cutrx = new Regex(@"cut=(?[^;]+)");
                vcm = versrx.Match(url);
                if (vcm.Success)
                {
                    version = vcm.Groups["version"].Value;
                }
                vcm = cutrx.Match(url);
                if (vcm.Success)
                {
                    cutoff = vcm.Groups["cutoff"].Value;
                }

                // Get host related information.
                hostEntry = Dns.GetHostEntry(u.Host);

                // Loop through the AddressList to obtain the supported AddressFamily. This is to avoid
                // an exception that occurs when the host IP Address is not compatible with the address family
                // (typical in the IPv6 case).
                foreach (IPAddress address in hostEntry.AddressList)
                {
                    IPEndPoint ipe = new IPEndPoint(address, u.Port);
                    Socket tempSocket =
                        new Socket(ipe.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
                    tempSocket.NoDelay = true;
                    tempSocket.Connect(ipe);
                    if (tempSocket.Connected)
                    {
                        s = tempSocket;
                        break;
                    }
                    else
                    {
                        continue;
                    }
                }

                string request = "GET " + u.PathAndQuery + " HTTP/1.1\r\nHost: " + u.Host +
                    "\r\nConnection: Close\r\n\r\n";
                //s.ReceiveBufferSize = 16384;
                Byte[] bytesSent = Encoding.ASCII.GetBytes(request);
                Byte[] bytesReceived = new Byte[s.ReceiveBufferSize];
                if (s == null)
                    return (1);

                // Send request to the server.
                s.Send(bytesSent, bytesSent.Length, 0);
                int bytes = 0;
                string overflow = "";
                string buffer = "";
                int rc = 0; // return code
                Console.WriteLine("Retrieving data from URL: " + u.AbsoluteUri);

                // There will be several lines of http style headers returned initially,
                // these must be read (and discarded) before parsing the rest of the data
                try
                {
                    int inheader = 1;
                    Regex rx = new Regex(@"\AHTTP\/\d+\.\d+ 200 OK([\r][\n]|[\n\r])(.*:.*([\r][\n]|[\n\r]))+([\r][\n]|[\n\r])", RegexOptions.Compiled | RegexOptions.Multiline);
                    Regex jnk = new Regex(@"\A[^\>]+", RegexOptions.Compiled | RegexOptions.Multiline);
                    Match m;
                    buffer = "";
                    do
                    {
                        bytes = 0;
                        bytes = s.Receive(bytesReceived, bytesReceived.Length, 0);
                        buffer = buffer + Encoding.UTF8.GetString(bytesReceived, 0, bytes);
                        m = rx.Match(buffer);
                        if (m.Success)
                        {
                            overflow = rx.Replace(buffer, "");
                            inheader = 0;
                        }
                    } while (inheader == 1 || bytes > 0);

                    // always seems to be one line of junk after the http headers
                    m = jnk.Match(overflow);
                    if (m.Success)
                    {
                        overflow = jnk.Replace(overflow, "");
                    }
                }
                catch (Exception e)
                {
                    Console.WriteLine("Error reading headers: " + e.Message);
                    rc = 1;
                    return rc;
                }

                try
                {
                    // Now parse actual data elements using multi-line regular expressions...
                    Regex astralrecs = new Regex(@"^(?>[^>]+)", RegexOptions.Compiled | RegexOptions.Multiline);
                    Regex astralflds = new Regex(@"\A>(?\S+)\s+(?\S+)\s+\((?[^)]+)\)\s+(?[^\r\n]*)\s+(?.*)", RegexOptions.Compiled | RegexOptions.Singleline);
                    Regex ovfl = new Regex(@"(?\>.*)\Z", RegexOptions.Compiled | RegexOptions.Multiline);
                    int recc = 0;
                    Match om; // overflow match
                    Match fm; // field match
                    MatchCollection mc; // general match collection
                    do
                    {
                        bytes = s.Receive(bytesReceived, bytesReceived.Length, SocketFlags.None);
                        buffer = overflow + Encoding.UTF8.GetString(bytesReceived, 0, bytes);
                        overflow = "";
                        Console.WriteLine("*************************************************");
                        Console.WriteLine("Buffer is: " + buffer);
                        Console.WriteLine("*************************************************");
                        if (bytes != 0)
                        {
                            om = ovfl.Match(buffer);
                            if (om.Success)
                            {
                                // there may be a partial record at the end of the buffer, so always pull off the last
                                // > xxxx and following lines
                                overflow = om.Groups["lastline"].Value;
                                Console.WriteLine("Overflow is: " + overflow);
                                Console.WriteLine("******************************************************");
                                buffer = ovfl.Replace(buffer, "");
                            }
                        }
                        mc = astralrecs.Matches(buffer);
                        foreach (Match m in mc)
                        {
                            recc++;
                            //SqlContext.Pipe.Send(string.Format("{0:G}: {1}", recc, m.Groups["record"].Value));
                            fm = astralflds.Match(m.Groups["record"].Value);
                            if (! fm.Success)
                            {
                                Console.WriteLine("Parse Failed: " + m.Groups["record"].Value);
                            }
                        }
                        buffer = "";
                    }
                    while (bytes > 0);

                    // Process any remaining record
                    if (overflow.Length > 0)
                    {
                        mc = astralrecs.Matches(overflow);
                        foreach (Match m in mc)
                        {
                            recc++;
                            fm = astralflds.Match(m.Groups["record"].Value);
                            if (!fm.Success)
                            {
                                Console.WriteLine("Parse Failed: " + m.Groups["record"].Value);
                            }
                        }
                    }
                    Console.WriteLine("done.");
                }
                catch (Exception err)
                {
                    rc = 1;
                    Console.WriteLine("An error occured: " + err.Message);
                }
                finally
                {
                    if (s.Connected)
                    {
                        s.Close();
                    }
                }
                return rc;
            }
        }
    }

     

     

    Wednesday, August 29, 2007 7:02 PM

Answers

  • Is there one right at the beginning of the download?  It looks like the server is using HTTP/1.1's chunked transfer-coding: the response is sent in chunks, each preceded by its length as an ASCII-encoded hexadecimal size value, e.g. "a000".  From RFC 2616 section 3.6.1:

           Chunked-Body   = *chunk
                            last-chunk
                            trailer
                            CRLF

           chunk          = chunk-size [ chunk-extension ] CRLF
                            chunk-data CRLF

    Look in the response headers for "Transfer-Encoding: chunked".  I couldn't get your program to work, as something bad seems to have happened to the regex patterns (ArgumentException: parsing "ver=(?[^;]+)" - Unrecognized grouping construct.)
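The Groups["version"] and Groups["cutoff"] lookups in the posted code suggest that the patterns originally used .NET named groups, written (?<name>...), and that the angle-bracketed names were stripped when the code was posted. A minimal sketch under that assumption follows; the restored patterns are a guess rather than the poster's confirmed source, and the UrlParams class is a hypothetical name used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrlParams
{
    // Assumed originals: the group names are taken from the Groups["..."]
    // lookups in the posted code, not from the patterns themselves.
    static readonly Regex VersionRx = new Regex(@"ver=(?<version>[^;]+)");
    static readonly Regex CutoffRx = new Regex(@"cut=(?<cutoff>[^;]+)");

    // Returns the value of the "ver" parameter, or "" if it is absent.
    public static string Version(string url)
    {
        Match m = VersionRx.Match(url);
        return m.Success ? m.Groups["version"].Value : "";
    }

    // Returns the value of the "cut" parameter, or "" if it is absent.
    public static string Cutoff(string url)
    {
        Match m = CutoffRx.Match(url);
        return m.Success ? m.Groups["cutoff"].Value : "";
    }
}
```

With the URL from the question, UrlParams.Version should yield 1.71 and UrlParams.Cutoff should yield 95.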

    Anyway, either switch to using HttpWebRequest, or change your request line to specify HTTP/1.0.
    Wednesday, August 29, 2007 9:37 PM
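To make the chunk framing concrete, here is a minimal sketch of a decoder for a chunked body that has already been read fully into memory and separated from the response headers. ChunkedBody.Decode is a hypothetical helper, not a framework API; it assumes well-formed input, skips chunk extensions, and ignores any trailer headers. HttpWebRequest and WebClient handle all of this internally, so this is only meant to show why a line such as a000 appears in the raw socket data.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

static class ChunkedBody
{
    // Strips the chunk-size lines and CRLF separators from a chunked body
    // and returns the concatenated chunk data.
    public static byte[] Decode(byte[] body)
    {
        MemoryStream output = new MemoryStream();
        int pos = 0;
        while (true)
        {
            // Each chunk starts with a hexadecimal size line ending in CRLF.
            int lineEnd = IndexOfCrLf(body, pos);
            if (lineEnd < 0) throw new FormatException("Missing chunk-size line.");
            string sizeLine = Encoding.ASCII.GetString(body, pos, lineEnd - pos);

            // Drop an optional chunk-extension after ';'.
            int semi = sizeLine.IndexOf(';');
            if (semi >= 0) sizeLine = sizeLine.Substring(0, semi);
            int size = int.Parse(sizeLine.Trim(), NumberStyles.HexNumber,
                                 CultureInfo.InvariantCulture);

            pos = lineEnd + 2;       // move past the CRLF after the size
            if (size == 0) break;    // last-chunk: stop (trailer is ignored)
            if (pos + size > body.Length) throw new FormatException("Truncated chunk.");

            output.Write(body, pos, size);
            pos += size + 2;         // skip chunk-data and its trailing CRLF
        }
        return output.ToArray();
    }

    // Returns the index of the next CRLF at or after start, or -1 if none.
    static int IndexOfCrLf(byte[] data, int start)
    {
        for (int i = start; i + 1 < data.Length; i++)
        {
            if (data[i] == (byte)'\r' && data[i + 1] == (byte)'\n') return i;
        }
        return -1;
    }
}
```

For example, decoding the bytes of "5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n" gives the bytes of "hello world". In the question's output, a000 is such a size line: hexadecimal a000 is 40960, the length of the chunk that follows.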
  • These are not random characters, but chunk-size indicators. A quick look at the HTTP headers sent by this web server reveals:

     

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked
    Content-Type: text/plain

     

    Quite frankly, I don't see why one would want to write such a program using plain sockets. Both System.Net.WebClient and System.Net.HttpWebRequest offer a much better programming model for HTTP-based applications, and implement HTTP 1.0/1.1 correctly (a lack thereof being exactly the problem here).

     

    Leaving out the parsing, your code would boil down to this:

     

    Code Snippet

    WebClient client = new WebClient();
    client.Encoding = Encoding.UTF8;
    string response = client.DownloadString(
        "http://astral.berkeley.edu/seq.cgi?get=scopdom-seqres-gd-sel-gs-bib;ver=1.71;item=seqs;cut=95");

     

     

     

    Wednesday, August 29, 2007 9:56 PM
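For a large response, reading it as a stream avoids holding the whole download in one string. The following is a minimal sketch using HttpWebRequest, which removes the chunked transfer-coding before the data reaches the reader. The AstralFetch class and both method names are hypothetical, and counting lines that start with '>' is only a stand-in for the real record parsing.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class AstralFetch
{
    // Counts records in FASTA-style text: each record begins with a
    // header line whose first character is '>'.
    public static int CountRecords(TextReader reader)
    {
        int count = 0;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0 && line[0] == '>') count++;
        }
        return count;
    }

    // Downloads the URL and counts records line by line as the data arrives.
    public static int CountRecordsAt(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
        {
            return CountRecords(reader);
        }
    }
}
```

CountRecords takes a TextReader so that the parsing can be exercised against a StringReader without touching the network.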

All replies

  •  

    Thanks -- using a 1.0 request fixes the issue, and using the much more compact HttpWebRequest also works great.  --Andrew
    Thursday, August 30, 2007 4:37 PM
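The HTTP/1.0 fix works because RFC 2616 forbids a server from sending transfer-codings such as chunked to an HTTP/1.0 client, so the body arrives unframed and ends when the connection closes. A minimal sketch of the changed request, assuming the rest of the socket code stays as posted; RawHttp.BuildGetRequest is a hypothetical helper name:

```csharp
using System;
using System.Text;

static class RawHttp
{
    // Builds the bytes of a GET request that declares HTTP/1.0, so a
    // compliant server will not reply with a chunked body.
    public static byte[] BuildGetRequest(Uri u)
    {
        string request = "GET " + u.PathAndQuery + " HTTP/1.0\r\n" +
                         "Host: " + u.Host + "\r\n" +
                         "Connection: Close\r\n\r\n";
        return Encoding.ASCII.GetBytes(request);
    }
}
```

The result can be passed to Socket.Send in place of the bytesSent array from the original program.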