How do I get the text content of a web page?

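  • Question

  • I want to get the text of a web page. I have tried two approaches:

    Method 1: fetch the page's HTML with HttpWebRequest, then strip the HTML tags.

    Method 2: open the page in a browser -> Select All -> Copy -> open Notepad -> Paste.

    The text in Notepad is laid out in a particular format, whereas the text left after fetching the HTML with HttpWebRequest and stripping the tags is formatted differently.

    How can I get the Method 2 content from .NET?

    An example:

    Open this site: http://67.222.12.123/cgi-bin/fp.pl/showlines?lines=100&sortor=3&refresh=58

    Method 1 gives the following result: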

     

     Recently checked public proxy servers list with speed rating 
     
     Recently checked 
    
     
     
     login 
     lists 
     new 
     .txt 
     dump 
     cgi/web 
     geo 
     chk 
     anon 
     https 
     email 
     relay 
     dnsbl 
     head 
     whois 
     docs 
     howto 
     tored 
     links 
     bboard 
     
     
    
     Last successfuly checked open proxy list 
     
     
     
     
     
     cookies are possibly disabled No correct COOKIES found in your request. Please enable your COOKIES * or/and try again/try to reload/refresh this page or click THIS LINK . * to protect server load from crawlers  
    
     
     
     
     
    
     select form Process last logs lines,
    sort by 
     unsorted 
     host name 
     port 
     speed 
     check date 
     
     
    filter 
     HTTP 
     HTTPS 
     ALL 
     SOCKS 
    
     
    reverse sort order MAX. 20 slowest hosts + yesterday DEMO hosts for NON-MEMBERS

     

     

     

    Method 2 gives the following result:

     

    Recently checked
    
    login
    lists
    new
    .txt
    dump
    cgi/web
    geo
    chk
    anon
    https
    email
    relay
    dnsbl
    head
    whois
    docs
    howto
    tored
    links
    bboard
    Last successfuly checked open proxy list
    
    select form
    Process last logs lines, sort by 
    filter reverse sort order 
    MAX. 20 slowest hosts + yesterday DEMO hosts for NON-MEMBERS
    HOST:port	speed
    Kb/s	date	 
    connect://216.52.207.73:80	41	yesterday	stat anon-chk -ssl whois
    connect://ip42.208-100-40.static.steadfast.net:80	35	yesterday	stat anon-chk -ssl whois
    http://91.202.165.254:3128	33	yesterday	stat anon-chk -ssl whois
    http://95.172.68.149:80	32	yesterday	stat anon-chk -ssl whois
    http://rrcs-67-53-19-30.west.biz.rr.com:8080	32	yesterday	stat anon-chk -ssl whois
    http://174.129.247.49:3128	32	yesterday	stat anon-chk -ssl whois
    http://matrixpro.cust.sloane.cz:3128	31	yesterday	stat anon-chk -ssl whois
    http://213.33.189.2:80	31	yesterday	stat anon-chk -ssl whois
    http://95.168.163.14:3128	31	yesterday	stat anon-chk -ssl whois
    http://pn3.v.rusonyx.ru:80	30	yesterday	stat anon-chk -ssl whois
    http://173.236.51.190:8080	30	yesterday	stat anon-chk -ssl whois
    http://95-31-208-227.broadband.corbina.ru:3128	29	yesterday	stat anon-chk -ssl whois
    http://93.99.38.20:8080	29	yesterday	stat anon-chk -ssl whois
    http://188.75.135.130:80	29	yesterday	stat anon-chk -ssl whois
    connect://sd-19704.dedibox.fr:443	29	yesterday	stat anon-chk -ssl whois
    http://80.82.150.158:3128	29	yesterday	stat anon-chk -ssl whois
    connect://atlanta.serverchamber.com:3129	28	yesterday	stat anon-chk -ssl whois
    http://72.240.34.17:80	28	yesterday	stat anon-chk -ssl whois
    http://178.216.49.122:3128	27	yesterday	stat anon-chk -ssl whois
    http://primaria.pecica.arad.astral.ro:8080	27	yesterday	stat anon-chk -ssl whois
    http://atlanta.serverchamber.com:3129	27	yesterday	stat anon-chk -ssl whois
    connect://91.202.164.57:3128	27	yesterday	stat anon-chk -ssl whois
    http://mail.glstaffing.com:8080	27	yesterday	stat anon-chk -ssl whois
    connect://avproxy1.la.inty.net:3128	26	yesterday	stat anon-chk -ssl whois
    http://217.23.1.239:3128	26	yesterday	stat anon-chk -ssl whois
    connect://80.82.150.158:3128	26	yesterday	stat anon-chk -ssl whois
    connect://208.112.111.204:9000	26	yesterday	stat anon-chk -ssl whois
    connect://91.202.164.105:3128	26	yesterday	stat anon-chk -ssl whois
    http://u15318074.onlinehome-server.com:3128	26	yesterday	stat anon-chk -ssl whois
    http://72.240.34.13:80	26	yesterday	stat anon-chk -ssl whois
    http://91.202.164.57:3128	26	yesterday	stat anon-chk -ssl whois
    connect://213.197.81.17:8080	25	yesterday	stat anon-chk -ssl whois
    http://188.175.113.50:8080	25	yesterday	stat anon-chk -ssl whois
    http://213.76.166.82:80	25	yesterday	stat anon-chk -ssl whois
    connect://188.75.135.130:80	25	yesterday	stat anon-chk -ssl whois
    http://117.243-pool.nikopol.net:3128	24	yesterday	stat anon-chk -ssl whois
    http://93-127-44-36.static.vega-ua.net:3128	24	yesterday	stat anon-chk -ssl whois
    http://93.99.22.174:3128	24	yesterday	stat anon-chk -ssl whois
    http://217.79.67.146:3128	24	yesterday	stat anon-chk -ssl whois
    connect://u15318074.onlinehome-server.com:3128	24	yesterday	stat anon-chk -ssl whois
    

     

    December 19, 2010 16:19

All replies

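  • When you made the request with HttpWebRequest, you did not send the cookie information the server requires, so part of the content was not returned.

    cookies are possibly disabled No correct COOKIES found in your request. Please enable your COOKIES * or/and try again/try to reload/refresh this page or click THIS LINK . * to protect server load from crawlers 

    The message states this clearly: you must be logged in and send cookies to access the page. Without them the server treats your request as a web crawler.

    I suggest using WebClient to simulate the request and submit the required cookies. Then you can fetch the page from your program.

    Ask again if you run into problems.


    family as water
    December 21, 2010 2:33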
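
The cookie round trip suggested above can be sketched roughly as follows: read the Set-Cookie headers from the first response and send the name=value pairs back in a Cookie header on the next request. In .NET, assigning a CookieContainer to HttpWebRequest.CookieContainer does this bookkeeping for you; the JavaScript below only illustrates the mechanism. The function name buildCookieHeader and the cookie names in the example are hypothetical, since the thread does not say which cookies the site sets.

```javascript
// Turn the Set-Cookie header values from a response into the value of the
// Cookie header for the next request. Attributes such as Path or Expires
// are dropped; only the leading name=value pair is sent back.
function buildCookieHeader(setCookieValues) {
  const jar = new Map();
  for (const header of setCookieValues) {
    const pair = header.split(';')[0].trim();
    const eq = pair.indexOf('=');
    if (eq <= 0) continue; // skip malformed entries that have no name
    // A later cookie with the same name overrides an earlier one.
    jar.set(pair.slice(0, eq), pair.slice(eq + 1));
  }
  return Array.from(jar, ([name, value]) => `${name}=${value}`).join('; ');
}

// Hypothetical cookies; the real names depend on the site.
const cookieHeader = buildCookieHeader([
  'session=abc123; Path=/; HttpOnly',
  'visited=1; Expires=Wed, 22 Dec 2010 00:00:00 GMT',
]);
console.log(cookieHeader); // session=abc123; visited=1
```

This sketch ignores expiry, domain, and path matching, which a real cookie store has to honor.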
  • After enabling cookies I get the output below. You can see that the page's code is obfuscated: the source differs from what the browser displays.

    Recently checked public proxy servers list with speed rating Recently checked login 
    lists 
    new 
    .txt 
    dump 
    cgi/web 
    geo 
    chk 
    anon 
    https 
    email 
    relay 
    dnsbl 
    head 
    whois 
    docs 
    howto 
    tored 
    links 
    bboard Last successfuly checked open proxy list google_ad_client = "pub-8829969316757294";
    //hlinks_wide
    google_ad_slot = "1556974262";
    google_ad_width = 728;
    google_ad_height = 15;
    //
     select formProcess last logs lines,
    sort by 
    unsorted 
    host name 
    port 
    speed 
    check date filter 
    HTTP 
    HTTPS 
    ALL 
    SOCKS reverse sort order MAX. 20 slowest hosts + yesterday DEMO hosts for NON-MEMBERS function hideTxt(str){var t='';var s=unescape(str);for(i=0; i<s.length; i++) t+=String.fromCharCode(s.charCodeAt(i)^3);document.write(t);
    }
    
    HOST:portspeedKb/sdate hideTxt('pl%60hp9%2c%2c%60%2e%3a%3b%2e100%2e2%3a1%2e216%2dkpg2%2dg%60%2d%60ln%60bpw%2dmfw92%3a%3a5');
    
    40yesterdaystat whois hideTxt('kwws9%2c%2c2%3a6%2d12%3a%2d163%2d2279%3b3');
    
    40yesterdaystat anon-chk -ssl whois hideTxt('pl%60hp9%2c%2c240%2d56%2d2%3b5%2d22094162');
    
    35yesterdaystat whois hideTxt('kwws9%2c%2cnfmwlqbqwjp%2dkv9021%3b');
    
    32yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c23%3a%2d105%2d5%3b%2d2119%3b3');
    
    32yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c127%2e11%3a%2e20%2e5%3a%2d%60vpw%2dsqlsbdbwjlm%2dmfw9021%3b');
    
    31yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c%3a0%2d264%2d2%3a1%2d2339%3b3%3b3');
    
    31yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2c%3a0%2d%3a%3a%2d11%2d2479021%3b');
    
    31yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2c124%2d2%3a7%2d113%2d%3b09%3b3%3b3');
    
    31yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c247%2d21%3a%2d200%2d129%3b333');
    
    30yesterdaystat anon-chk -ssl whois hideTxt('pl%60hp9%2c%2c%60%2e46%2e42%2e2%3b7%2e226%2dkpg2%2d%60l%2d%60ln%60bpw%2dmfw92620');
    
    29yesterdaystat whois hideTxt('pl%60hp9%2c%2cjs267%2e%3b7%2dpmo%2dthv%2dfgv96%3b572');
    
    29yesterdaystat whois hideTxt('%60lmmf%60w9%2c%2c2%3b%3b%2d246%2d220%2d639%3b3%3b3');
    
    28yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2cbofj%7bbmgqf%2dfjj%2dvp%2dfp9%3b3%3b3');
    
    28yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c41%2d173%2d07%2d259%3b3');
    
    27yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2c%3b0%2d111%2d117%2d13910');
    
    26yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2c23%3a%2d105%2d%3b3%2d2079021%3b');
    
    26yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c41%2d173%2d07%2d219%3b3');
    
    25yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2cf15102%2dvs%60%2ef%2d%60kfool%2dmo9%3b3');
    
    25yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c136%2d120%2d2%3a6%2d439%3b3%3b3');
    
    25yesterdaystat anon-chk -ssl whois hideTxt('pl%60hp9%2c%2c%3a2%2d2%3a7%2d10%3a%2d51923%3b3');
    
    25yesterdaystat whois hideTxt('%60lmmf%60w9%2c%2c%3a6%2e02%2e13%3b%2e114%2daqlbgabmg%2d%60lqajmb%2dqv9021%3b');
    
    24yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c%60%2e5%3a%2e270%2e130%2e247%2dkpg2%2dub%2d%60ln%60bpw%2dmfw9%3b210');
    
    24yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2cpg%2e2%3a437%2dgfgjal%7b%2deq9%3b3');
    
    23yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c44%2d%3a2%2d2%3a6%2d259021%3b');
    
    23yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c2%3b%3b%2d6%3a%2d161%2d2%3a39%3b3');
    
    23yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c%60%2e%3a%3b%2e114%2e55%2e44%2dkpg2%2djm%2d%60ln%60bpw%2dmfw9%3b3%3b6');
    
    23yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c%3a0%2d2%3b6%2d01%2d2029021%3b');
    
    22yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c120%2d2%3a1%2d%3b3%2d1159021%3b');
    
    22yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c%3a6%2e03%2e113%2e24%2daqlbgabmg%2d%60lqajmb%2dqv9021%3b');
    
    22yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2cssslf%2e27%3a%2e133%2dvmj%60pad%2dmfw9%3b3');
    
    22yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c2%3a5%2d1%3a%2d252%2d%3b69%3b3');
    
    21yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c51%2d271%2d64%2d419%3b3%3b3');
    
    21yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c%60%2e42%2e103%2e204%2e55%2dkpg2%2dsb%2d%60ln%60bpw%2dmfw9%3b3%3b6');
    
    21yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2cdt%2eqgn%2ejnslqw%2e1%2doo%2emph%2dypwwh%2dqv9021%3b');
    
    21yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c2%3b7%2d%3b1%2d72%2d449%3b3%3b3');
    
    21yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2c2%3a6%2d%3a6%2d123%2d1759021%3b');
    
    21yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2c124%2d2%3a%2d73%2d1159%3b3%3b3');
    
    20yesterdaystat anon-chk -ssl whois hideTxt('%60lmmf%60w9%2c%2cb%60wlq0%2dwkfbwqj%60bo%2dvsbwqbp%2ddq9021%3b');
    
    20yesterdaystat anon-chk -ssl whois hideTxt('kwws9%2c%2c24%3b%2d125%2d7%3a%2d2129021%3b');
    
    20yesterdaystat anon-chk -ssl whois google_ad_client = "pub-8829969316757294";
    //wide
    google_ad_slot = "3805105281";
    google_ad_width = 728;
    google_ad_height = 90;
    //
     google_ad_client = "pub-8829969316757294";
    //vertical_160x600
    google_ad_slot = "1065458462";
    google_ad_width = 160;
    google_ad_height = 600;
    //
     

     

    December 21, 2010 5:27
  • The content is obfuscated with JavaScript.

    Try the WebBrowser control; it can execute JavaScript.


    family as water
    December 21, 2010 7:17
  • Really? Is the only option to create a WebBrowser, navigate to the page, and simulate Ctrl+A and Ctrl+C?

    It also seems that a WebBrowser cannot be created inside multithreaded code?

    December 22, 2010 1:52
  • You need to find the page's hideTxt function, copy it into a page of your own, and pass the data to it from .NET. Alternatively, reimplement hideTxt directly in .NET.

    Got an iPad; lately I have been reading the forum on it
    March 11, 2011 6:43
    Moderator
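
As a rough illustration of the second suggestion, here is a standalone port of the hideTxt function that appears in the output posted above. It unescapes the argument and XORs each character code with 3, as the page's script does, but returns the string instead of calling document.write. The names decodeHideTxt and extractHosts are hypothetical, and the sketch assumes the arguments contain only ASCII %xx escapes, as in the posted sample. The same logic is a few lines in .NET: unescape the string, then XOR each char with 3.

```javascript
// Standalone port of the page's hideTxt(): unescape the argument, then XOR
// every character code with 3. Returns the text instead of writing it into
// the document, so no browser is needed.
function decodeHideTxt(encoded) {
  // The page uses the legacy unescape(); decodeURIComponent gives the same
  // result for the plain ASCII %xx escapes seen in the sample.
  const unescaped = decodeURIComponent(encoded);
  let out = '';
  for (let i = 0; i < unescaped.length; i++) {
    out += String.fromCharCode(unescaped.charCodeAt(i) ^ 3);
  }
  return out;
}

// Hypothetical helper: find every hideTxt('...') call in the fetched HTML
// and decode its argument.
function extractHosts(html) {
  const pattern = /hideTxt\('([^']*)'\)/g;
  const hosts = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    hosts.push(decodeHideTxt(match[1]));
  }
  return hosts;
}

// Two calls copied from the output posted earlier in the thread.
const sample =
  "hideTxt('kwws9%2c%2c41%2d173%2d07%2d259%3b3'); " +
  "hideTxt('%60lmmf%60w9%2c%2c%3b0%2d111%2d117%2d13910');";
console.log(extractHosts(sample));
```

For the two sample calls this prints http://72.240.34.16:80 and connect://83.222.224.20:23. Because the decoding is a pure string transformation, it does not need a WebBrowser instance or a UI thread.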
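  • Besides the WebBrowser control, there are many other ways to execute JavaScript, for example wscript (Windows Script Host) or IE COM automation.


    Got an iPad; lately I have been reading the forum on it
    March 11, 2011 6:44
    Moderator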