Pattern Match

  • Question

  • I have developed a web scraper that loads the content of a website. It is working fine.

    I want to search for the URLs in the loaded data. I am using a regular expression for the URL pattern and the Matches method to do the work, but it is not giving the correct output.

    Sample code:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Net;
    using System.IO;
    using System.Text.RegularExpressions;

    namespace Web_Scrapper__Console_
    {
        class Program
        {
            static void Main(string[] args)
            {
                try
                {

                    string url = Console.ReadLine();
                    string strResult = "";

                    WebResponse objResponse;
                    WebRequest objRequest = System.Net.HttpWebRequest.Create(url);

                    Console.WriteLine("Extracting...\nPlease Wait\n");
                    objResponse = objRequest.GetResponse();

                    using (StreamReader sr = new StreamReader(objResponse.GetResponseStream()))
                    {
                        strResult = sr.ReadToEnd();
                        // No explicit Close() is needed: the using block disposes the StreamReader
                    }

                    string pat=@"^[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu|COM|ORG|NET|MIL|EDU)$";
                    
                    Regex r = new Regex(pat);

                    foreach (Match m in r.Matches(strResult))
                    {
                        Console.WriteLine(m.Value);
                    }

                    Console.ReadKey();
                }
                catch (Exception ex)
                {
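                    // Concatenate instead of passing ex.Message as a format string,
                    // which would throw if the message contained braces
                    Console.WriteLine("Error: " + ex.Message);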
                    Console.ReadKey();
                }
            }
        }
    }

    If the value in strResult is just a URL such as "www.google.com", the output is www.google.com, but if it is "abc www.google.com", nothing is printed. Likewise, when strResult holds its original value (the content loaded from the website), the screen stays blank.

    How can I get the required result?


    Monday, December 17, 2012 6:14 PM

Answers

  • The problem is caused by ‘^’ and ‘$’, which have a special meaning: they anchor the match to the beginning and the end of the tested string, so the pattern only matches when the whole string is a URL. Try replacing them with ‘\b’, the word boundary.

    Monday, December 17, 2012 8:07 PM
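
    The suggestion above can be sketched roughly as follows. This is only an illustration, not the asker's final code: the HostFinder class, the FindHosts method and the sample input are hypothetical names made up for the example, and RegexOptions.IgnoreCase is an assumed substitute for listing the upper-case suffixes separately.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, introduced only for this illustration.
static class HostFinder
{
    // Same character class and suffix list as the pattern in the question,
    // but with \b (word boundary) in place of ^ and $, so a match may sit
    // anywhere inside a longer string. IgnoreCase is an assumption that
    // stands in for the separate COM|ORG|NET|MIL|EDU alternatives.
    static readonly Regex HostPattern = new Regex(
        @"\b[a-zA-Z0-9\-\.]+\.(com|org|net|mil|edu)\b",
        RegexOptions.IgnoreCase);

    // Returns every substring of text that matches the pattern, in order.
    public static List<string> FindHosts(string text)
    {
        var hosts = new List<string>();
        foreach (Match m in HostPattern.Matches(text))
        {
            hosts.Add(m.Value);
        }
        return hosts;
    }
}

class Demo
{
    static void Main()
    {
        // Made-up sample input: plain text plus an HTML-like fragment.
        string sample = "abc www.google.com and <a href=\"http://example.org/x\">";

        foreach (string host in HostFinder.FindHosts(sample))
        {
            Console.WriteLine(host);
        }
        // Prints:
        // www.google.com
        // example.org
    }
}
```

    Note that this pattern only captures the host part, so the scheme ("http://") and any path after the suffix are not included in the match.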

All replies

  • Thanks, it is working now.
    Monday, December 17, 2012 10:13 PM