Scraping Unicode

  • Question

  • User-1848572166 posted
    I have been trying to build a scraper to help with word translations into different languages. The current code uses Google as the translation service (www.google.com/translate_t) and works well as long as you stay within the English character set. I have been pulling my hair out for some time trying to get it to work with Japanese, Korean and Chinese. For some reason, I cannot make the connection between the data that is returned (Unicode) and the glyphs for display.

    What I would like to do is take the returned UTF-8 encoded text and generate a string like &#x30BA;&#x30BB;...... that I can move around the system as a plain string.

    An example is worth a thousand words. Create a web page, paste the two routines provided below into the code-behind, and run it. You should see a Google translation page that contains the 3 glyphs チーズ, but instead you see characters rendered as if it doesn't understand that the page contains Unicode. Any help is appreciated. TIA

    // Requires: using System; using System.IO; using System.Net; using System.Text;
    private string Fetch(string postData, string url)
    {
        // Encode the form body first so ContentLength is a byte count, not a character count.
        byte[] body = Encoding.UTF8.GetBytes(postData);

        HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(url);
        objRequest.Method = "POST";
        objRequest.ContentType = "application/x-www-form-urlencoded";
        objRequest.ContentLength = body.Length;

        try
        {
            // The using block closes the stream even if the write fails,
            // and avoids calling Close() on a writer that was never created.
            using (Stream requestStream = objRequest.GetRequestStream())
            {
                requestStream.Write(body, 0, body.Length);
            }
        }
        catch (Exception e)
        {
            return e.Message;
        }

        // The response is decoded as UTF-8 here, regardless of what the server actually sent.
        using (HttpWebResponse objResponse = (HttpWebResponse)objRequest.GetResponse())
        using (StreamReader sr = new StreamReader(objResponse.GetResponseStream(), Encoding.UTF8))
        {
            return sr.ReadToEnd();
        }
    }

    protected override void Render(HtmlTextWriter writer)
    {
        writer.WriteLine(Fetch("langpair=en|ja&hl=en&safe=off&text=cheese", "http://www.google.com/translate_t"));
    }
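
The target format described in the question (hexadecimal numeric character references such as &#x30BA;) can be produced from any .NET string once the text has been decoded correctly. What follows is only a minimal sketch under that assumption; `NcrEncoder` and `ToNumericReferences` are hypothetical names and are not part of the posted code.

```csharp
using System;
using System.Text;

public static class NcrEncoder
{
    // Replaces every non-ASCII character with a hexadecimal numeric
    // character reference (for example U+30BA becomes "&#x30BA;").
    // ASCII characters are copied through unchanged.
    // Note: char.ConvertToUtf32 throws on unpaired surrogates.
    public static string ToNumericReferences(string text)
    {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint = char.ConvertToUtf32(text, i);
            if (char.IsHighSurrogate(text[i]))
            {
                i++; // the low surrogate was consumed as part of this code point
            }

            if (codePoint < 0x80)
            {
                sb.Append((char)codePoint);
            }
            else
            {
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            }
        }
        return sb.ToString();
    }
}
```

Called as `NcrEncoder.ToNumericReferences("\u30C1\u30FC\u30BA")`, this yields a pure ASCII string that can be passed around and written into HTML as-is.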

    Tuesday, July 26, 2005 4:12 PM

All replies

  • User-1848572166 posted

    I think there is more in play here than it appears. Looking at the data returned via HttpWebRequest (or WebClient, I tried both), the returned byte data is:
    0x83 0x60 0x81 0x5B 0x83 0x59

    but when you visit the website (www.google.com/translate_t), do the same translation, and look at the data returned on the wire, it is:
    0xE3 0x83 0x81 0xE3 0x83 0xBC 0xE3 0x82 0xBA

    which really makes more sense, since Japanese in UTF-8 is typically 3 bytes per glyph. Plus, if you look at the last 3 bytes, I believe they do map to the code point 0x30BA, which is the ズ glyph.
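
The mapping from those bytes to code points can be checked by hand. A three-byte UTF-8 sequence has the bit layout 1110xxxx 10yyyyyy 10zzzzzz, and the code point is the concatenation of the x, y and z bits. The sketch below assumes well-formed input and does no validation; `Utf8Check` and its method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Decodes one three-byte UTF-8 sequence (1110xxxx 10yyyyyy 10zzzzzz)
    // into its code point. The caller must pass a well-formed sequence.
    public static int DecodeThreeBytes(byte b0, byte b1, byte b2)
    {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    // Decodes a whole UTF-8 buffer with the framework's own decoder.
    public static string DecodeAll(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

For the last three bytes, `Utf8Check.DecodeThreeBytes(0xE3, 0x82, 0xBA)` gives 0x30BA, and the first two groups give 0x30C1 and 0x30FC, which together spell チーズ.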

    At first I thought Google had added something to prevent scraping, but then I remembered it works for the English character set, so it probably isn't an intentional attempt to thwart scraping. Any ideas are appreciated. I am going to have to give this problem a rest. I think I am too close to the trees to see the forest.
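
One possible reading, which the thread itself does not confirm: the six bytes returned to the scraper match the Shift_JIS encoding of チーズ, while the browser received the same text as UTF-8. ASCII letters and digits have the same byte values in both encodings, which would be consistent with English text coming through intact. The sketch below decodes both byte sequences. It assumes .NET Core 3.0 or later, where legacy code pages must be registered through `CodePagesEncodingProvider` (on the .NET Framework, `Encoding.GetEncoding("shift_jis")` works without that line); `EncodingComparison` is a hypothetical name.

```csharp
using System;
using System.Text;

public static class EncodingComparison
{
    // Decodes bytes as Shift_JIS (Windows code page 932).
    public static string DecodeShiftJis(byte[] bytes)
    {
        // Assumption: running on .NET Core 3.0+ / .NET 5+, where legacy
        // code pages are not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("shift_jis").GetString(bytes);
    }

    // Decodes bytes as UTF-8.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

Passing the scraper's bytes (0x83 0x60 0x81 0x5B 0x83 0x59) to `DecodeShiftJis` and the browser's bytes to `DecodeUtf8` produces the same three-character string, so a reader fixed to UTF-8 would garble the first sequence.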

    TIA

    Wednesday, July 27, 2005 12:58 PM