Count the URLs in XML files

    Question

  • Hi,
    I have more than one thousand (1000) XML files, such as rud.xml and rod02.xml, in the Temp folder, which contain more than fifty thousand (50000) URLs. The structure of the XML files is shown below.
    My requirement is to count the URLs in all the XML files in the Temp folder and show the count
    in a label control, as I am working in C# WinForms. Suppose one XML file contains 100 URLs and another contains 200 URLs, and so on; then after the loop completes the label should show the count 300.
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.abc.com/pub/valentina-galetta/43/298/9A5</loc></url>
    <url><loc>http://www.cool.com/pub/isoboye-george/43/298/9A4</loc></url>
    <url><loc>http://www.good.com/pub/evcil-sever/43/298/9A3</loc></url>
    <url><loc>http://www.fond.com/pub/yogesh-talekar/43/298/9A2</loc></url>
    <url><loc>http://www.solid.com/pub/viriginia-guerrero/43/298/9A1</loc></url>
    <url><loc>http://www.jeo.com/pub/bekir-3f-3feng-3f-3fn/43/298/9A0</loc></url>
    <url><loc>http://www.fan.com/pub/marco-rossini/43/298/99B</loc></url>
    <url><loc>http://www.force.com/pub/celia-jim-3f-3fnez/43/298/99A</loc></url>
    <url><loc>http://www.super.com/pub/thumma-ramakrishna/43/298/999</loc></url>
    <url><loc>http://www.duper.com/pub/natalia-esp-3f-3fndola/43/298/997</loc></url>
    <url><loc>http://www.look.com/pub/nikhil-parashar/43/298/996</loc></url>
    <url><loc>http://www.my.com/pub/andr-3f-3fs-plaza-20-editorial-20alianza-/43/298/995</loc></url>
    <url><loc>http://www.shop.com/pub/marcel-erdmann/43/298/993</loc></url>
    <url><loc>http://www.poo.com/pub/enrique-cortes/43/298/992</loc></url>
    <url><loc>http://www.in.com/pub/prajyot-dhumal/43/298/990</loc></url>
    <url><loc>http://www.out.com/pub/milton-rodrigues/43/298/98B</loc></url>
    <url><loc>http://www.week.com/pub/rolf-schmitz/43/298/98A</loc></url>
    <url><loc>http://www.strong.com/pub/anil-anil/43/298/988</loc></url>
    <url><loc>http://www.qual.com/pub/joe-soap/43/298/986</loc></url>
    <url><loc>http://www.time.com/pub/silviu-pintea/43/298/985</loc></url>
    <url><loc>http://www.super.com/pub/paulo-primo/43/298/983</loc></url>
    </urlset> 




    Tuesday, November 22, 2011 1:54 PM

Answers

  • Assuming XML structure won't change

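            // Requires: using System.Xml;
            int GetUrlCount(string folderPath)
            {
                int count = 0;
                // Only pick up XML files; any other file in the folder would fail to parse
                string[] files = System.IO.Directory.GetFiles(folderPath, "*.xml");
    
                foreach (string file in files)
                {
                    count += ParseAndGetCount(file);
                }
    
                return count;
            }
    
            int ParseAndGetCount(string file)
            {
                XmlDocument doc = new XmlDocument();
                doc.LoadXml(System.IO.File.ReadAllText(file));
    
                // DocumentElement is the root <urlset> element. Indexing ChildNodes[1]
                // only works when an XML declaration comes first, and fails otherwise.
                XmlElement root = doc.DocumentElement;
                if (root != null && root.HasChildNodes)
                    return root.ChildNodes.Count;
                
                return 0;
            }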
    
    


    Use

    //Change folder path as per your requirement
    lblCount.Text = GetUrlCount("c:\\temp").ToString();
    



    Thanks,
    A.m.a.L Hashim
    Microsoft Most Valuable Professional
    Dot Net Goodies
    Tuesday, November 22, 2011 2:33 PM
  • Try this, boss:

       // Requires: using System.IO; using System.Linq; using System.Xml.Linq;
       DirectoryInfo di = new DirectoryInfo(@"c:\temp\check");
       int count = 0;
       foreach (FileInfo fi in di.GetFiles("*.xml"))
       {
           XDocument xdoc = XDocument.Load(fi.FullName);
           // Count the direct children of the root element (the <url> entries)
           count += xdoc.Root.Elements().Count();
       }

    Wednesday, November 23, 2011 7:01 AM

All replies

  • @A.m.a.l, you didn't handle the xml namespace.

    @mohammad, try this:

    using System;
    using System.IO;
    using System.Linq;
    using System.Windows.Forms;
    using System.Xml.Linq;
    
    namespace XmlUrlCountWinApp
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }
    
            private void Form1_Load(object sender, EventArgs e)
            {
                try
                {
                    label1.Text = GetUrlCountFromTempDir().ToString();
                }
                catch (Exception ex)
                {
                    label1.Text = ex.Message;
                }
            }
    
            private int GetUrlCountFromTempDir()
            {
                string tempPath = Environment.GetEnvironmentVariable("TEMP");
                if (tempPath != null)
                {
                    string[] files = Directory.GetFiles(tempPath, "*.xml");
                    return files.Sum(file => GetUrlCountFromXml(file));
                }
                throw new Exception("The environment variable %TEMP% is not defined.");
            }
    
            private int GetUrlCountFromXml(string xmlPath)
            {
                XNamespace xn = "http://www.sitemaps.org/schemas/sitemap/0.9";
                return XElement.Load(xmlPath).Descendants(xn + "loc").Count();
            }
        }
    }
    



    aelassas.free.fr
    Tuesday, November 22, 2011 5:07 PM
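    The namespace remark matters for any lookup by element name. The following is a minimal sketch, not code from the thread: the class name NamespaceDemo and the two-entry sample document are made up for illustration. It shows that asking LINQ to XML for "loc" without the sitemap namespace finds nothing, while the qualified name finds every entry.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Hypothetical two-entry sitemap, used only for illustration.
    public const string SampleXml =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
        "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">" +
        "<url><loc>http://www.example.com/a</loc></url>" +
        "<url><loc>http://www.example.com/b</loc></url>" +
        "</urlset>";

    // Looks up <loc> by local name only. Every element inherits the default
    // sitemap namespace, so the unqualified name matches nothing.
    public static int CountWithoutNamespace(string xml)
    {
        return XDocument.Parse(xml).Descendants("loc").Count();
    }

    // Qualifies the name with the sitemap namespace, as in the reply above.
    public static int CountWithNamespace(string xml)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        return XDocument.Parse(xml).Descendants(ns + "loc").Count();
    }
}
```

    Code that counts child nodes by position rather than by name is not affected by the namespace, which is why the positional answers in this thread still return a count.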
  • Thanks to all of you for replying. Actually, some of the files are in the format below as well.

    If we want to count the URLs in the XML files below, what changes are needed in the code above?

     

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/barcelona.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/basel.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/internet-web2.0-startups-social-networking/bath.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/sheffield.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/singapore.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/slough.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/slovak-republic.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/slovenia.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/south-africa.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/spain.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/spokane.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/sri-lanka.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/st-louis.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/stevenage.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/stockholm.html</loc><changefreq>weekly</changefreq></url>
    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/sweden.html</loc><changefreq>weekly</changefreq></url>
    </urlset> 
    





    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.linkedin.com/groups/gid-2431604</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2430868</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/Wireless-Carrier-Reps-Past-Present-2430807</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2430694</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2430575</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2431452</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432377</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2428508</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432379</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432380</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432381</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432383</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432384</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432385</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432388</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/groups/gid-2432391</loc><changefreq>monthly</changefreq></url>
    </urlset> 
    
    



     


    salman
    Wednesday, November 23, 2011 1:08 PM
  • Have you tried executing the code I posted? I tested it and it works for both formats.
    Thanks,
    A.m.a.L Hashim
    Microsoft Most Valuable Professional
    Dot Net Goodies
    Wednesday, November 23, 2011 1:16 PM
  • Yes, mine as well. We are not walking through each child node, so it will work fine for the structure mentioned above as well.
    • Edited by Prahalnathan Thursday, November 24, 2011 6:14 AM
    Thursday, November 24, 2011 6:13 AM
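    To illustrate why the extra changefreq element does not change the result, here is a small sketch under stated assumptions: the class name DirectChildCountDemo and the sample strings in the usage note are hypothetical. It counts only the direct url children of the root, so whatever sits inside each url entry is ignored.

```csharp
using System.Linq;
using System.Xml.Linq;

public static class DirectChildCountDemo
{
    // Counts the <url> elements directly under the root <urlset>.
    // Anything nested inside a <url> (loc, changefreq, ...) is not visited.
    public static int CountUrlEntries(string xml)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        return XDocument.Parse(xml).Root.Elements(ns + "url").Count();
    }
}
```

    Calling CountUrlEntries on a document whose entries contain only loc, and on one whose entries contain both loc and changefreq, gives the same count for the same number of url entries.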
  • Thanks for replying. When I use Hashim's code, it gives me this error:

    ' ', hexadecimal value 0x1F, is an invalid character. Line 1, position 1.

     

    And when I run Prahalnathan's code, it does not show the count for XML files with the structure above; at least 22 XML files are in this format. Waiting for your reply.


    salman

    Thursday, November 24, 2011 6:52 AM
  • Boss,

    in your XML the first character (line 1, position 1) is wrong. The XML file should start with <....

    Check that and reply.

    Thursday, November 24, 2011 8:09 AM
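    One way to check what a file really starts with is to look at its first bytes. The sketch below is only a diagnostic aid with hypothetical names (LeadingBytesCheck, DescribeLeadingBytes, LooksLikeGzip). A well-formed file of the kind shown in this thread starts with 3C-3F (the characters "<?"), or with EF-BB-BF if it carries a UTF-8 byte order mark. As an assumption that nothing in the thread confirms, a leading 0x1F could also mean the file is gzip-compressed, since gzip data starts with the bytes 1F-8B.

```csharp
using System;
using System.IO;

public static class LeadingBytesCheck
{
    // Returns the first few bytes of a file as hex text, for example "3C-3F-78-6D".
    public static string DescribeLeadingBytes(string path, int count = 4)
    {
        byte[] buffer = new byte[count];
        int read;
        using (FileStream fs = File.OpenRead(path))
        {
            read = fs.Read(buffer, 0, buffer.Length);
        }
        return BitConverter.ToString(buffer, 0, read);
    }

    // True when the file starts with the gzip magic bytes 0x1F 0x8B.
    public static bool LooksLikeGzip(string path)
    {
        return DescribeLeadingBytes(path, 2) == "1F-8B";
    }
}
```

    Running DescribeLeadingBytes on the file that raises the error shows whether anything precedes the first "<".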
  • Hi Prahalnathan,

    the file that shows the error is the first file, and its structure is shown below. The error comes when I run Hashim's code. Waiting for your reply.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.linkedin.com/company/default%20value</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/time-warner-inc.</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/teledyne</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/telefonica-europe-plc</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/rational-software</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/informix-software</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/unicible</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/ibm-global-services</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/societe_conseil_groupe_lgs</loc><changefreq>monthly</changefreq></url>
    <url><loc>http://www.linkedin.com/company/pureatria</loc><changefreq>monthly</changefreq></url>
    </urlset>
    


    salman
    Thursday, November 24, 2011 8:30 AM
  • Check that there is no character (not even a space) before the first <, because for me both of our code samples work fine for the XML shown above.
    • Edited by Prahalnathan Thursday, November 24, 2011 9:25 AM
    Thursday, November 24, 2011 9:22 AM
  • Hi,

    Did you try my code? I have tested it with your files and it works well.

    Kind regards,


    aelassas.free.fr
    Thursday, November 24, 2011 9:39 AM
  • Hi Prahalnathan,

    can you send me your email ID so that I can send you the file? Then you can check it on your system.


    salman
    Thursday, November 24, 2011 9:44 AM
  • Boss, all the code samples are working fine. I think the problem is in the XML file.
    Thursday, November 24, 2011 9:45 AM
  • OK, my brother (mere bhai),

    but the same file works fine with your code.


    salman
    Thursday, November 24, 2011 9:56 AM
  • Yes, my ID is masilaster@gmail.com. Send me the file along with the file name; if possible, the folder itself.
    Thursday, November 24, 2011 10:03 AM
  • I have sent you the file. Run it with your code.
    salman
    Thursday, November 24, 2011 10:37 AM
  • Hey boss,

    the problem is with the line

    <url><loc>http://www.linkedin.com/directory/companies/computer-networking/so-paulo.html</loc><changefreq>weekly</changefreq></url>

    The ‹ character is the problem.

    In my fix I am using an XML document object, so it throws an exception, as ‹ is not a valid character in XML.

     

    So use Hashim's code. As he loads the XML content as a string, it will work fine for that.

    Post how it goes.

    Thursday, November 24, 2011 11:18 AM
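    If a parser still rejects a file because of a stray control character, one option is to strip the characters that XML 1.0 does not allow before parsing. This is a sketch under stated assumptions, not a fix taken from the thread: TolerantUrlCounter and its methods are hypothetical names, and it relies on XmlConvert.IsXmlChar, which is available from .NET 4.0. As far as I know that check also rejects surrogate halves, so characters outside the Basic Multilingual Plane are dropped too; that changes the text but not the number of url entries.

```csharp
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class TolerantUrlCounter
{
    // Removes every character that XmlConvert.IsXmlChar reports as invalid
    // (for example the control character 0x1F).
    public static string StripInvalidXmlChars(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            if (XmlConvert.IsXmlChar(c))
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Cleans the text first, then counts the <url> children of the root.
    public static int CountUrls(string xmlText)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
        XDocument doc = XDocument.Parse(StripInvalidXmlChars(xmlText));
        return doc.Root.Elements(ns + "url").Count();
    }
}
```

    A call such as TolerantUrlCounter.CountUrls(System.IO.File.ReadAllText(path)) could replace the parsing step in any of the approaches above.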
  • OK Prahalnathan,

    I have also run Hashim's code on that XML file, but it gives me a count of 0.


    salman
    Thursday, November 24, 2011 11:24 AM
  • I got a count of 49999 by copying and pasting the code below:

    XmlDocument doc = new XmlDocument();
    doc.LoadXml(System.IO.File.ReadAllText("C:\\d_0.xml"));
    
    if (doc.ChildNodes.Count > 0)
        if (doc.ChildNodes[1].HasChildNodes)
            return doc.ChildNodes[1].ChildNodes.Count;

    Are you getting any error?

    Thursday, November 24, 2011 11:33 AM
  • Why open the files in xml mode? Too hard...use regular expressions for this pattern matching.

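    // Requires: using System; using System.IO; using System.Linq; using System.Text.RegularExpressions;
    string pattern = @"(<url>\s*<loc>)";
    
    // Sum the number of pattern matches over every XML file in the folder
    var count = Directory.EnumerateFiles(@"D:\temp", "*.xml")
                         .Sum(fl => Regex.Matches(File.ReadAllText(fl), pattern).Count);
    
    Console.WriteLine("Total Urls " + count);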
    
    

    Tested on both of the xml types you showed. HTH

    William Wegerson (www.OmegaCoder.Com)

    Thursday, November 24, 2011 11:50 AM
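    Besides regular expressions and in-memory documents, a third option that nobody in the thread proposed is a forward-only XmlReader, which walks the file node by node without building a tree. The sketch below assumes the sitemap namespace shown in the question; the class name StreamingUrlCounter is hypothetical.

```csharp
using System.IO;
using System.Xml;

public static class StreamingUrlCounter
{
    private const string SitemapNs = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Reads the input node by node and counts <url> start tags
    // that belong to the sitemap namespace.
    public static int CountUrls(TextReader input)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element
                    && reader.LocalName == "url"
                    && reader.NamespaceURI == SitemapNs)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

    For a file on disk, the method can be called as StreamingUrlCounter.CountUrls(new StreamReader(path)); the reader still validates characters, so a file with invalid characters fails here as well.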
  • Thanks, Prahalnathan, for replying.

    Actually, the count has to be returned from an int method, int GetcountUrl(), in brother Hashim's code. Now I am getting the count of all the URLs. Thanks, Hashim.


    salman
    Thursday, November 24, 2011 12:47 PM
  • Cool, brother. If you have any other doubts, do post.
    Thursday, November 24, 2011 1:14 PM
  • Have you tried with Linq To XML (the code I provided in my first reply) ?
    aelassas.free.fr
    Thursday, November 24, 2011 1:25 PM
  • Hi mohammad salmaan,

    Welcome to the MSDN forum. Very glad to hear that you have fixed your issue. If you have any problem, please feel free to post in the forum. There are lots of specialists with a fantastic level of technique, like Prahalnathan, Link.fr, OmegaMan and A.m.a.L, who can help you solve the issue effectively.

    Best Regards.


    Allen Li [MSFT]
    MSDN Community Support | Feedback to us
    Friday, November 25, 2011 2:00 AM
  • > Have you tried with Linq To XML (the code I provided in my first reply) ?
    > aelassas.free.fr


    I took notice!

    I created a blog article on the three different suggestions and timed them. The marked answer, the XmlDocument approach, was the slowest! Guess who was the fastest.

    Find out here:

    .Net Regex: Can Regular Expression Parsing be Faster than XmlDocument or Linq to Xml? 


    William Wegerson (www.OmegaCoder.Com)
    Friday, November 25, 2011 6:10 PM
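    For anyone who wants to repeat such a comparison on their own files, a simple Stopwatch wrapper is enough for a rough measurement. This is a sketch only, not the code from the blog article; the names CountTimer and Time are hypothetical, and a single run like this is sensitive to disk caching and JIT warm-up.

```csharp
using System;
using System.Diagnostics;

public static class CountTimer
{
    // Runs a counting function once, returns the elapsed time,
    // and hands the computed count back through the out parameter.
    public static TimeSpan Time(Func<int> counter, out int result)
    {
        Stopwatch sw = Stopwatch.StartNew();
        result = counter();
        sw.Stop();
        return sw.Elapsed;
    }
}
```

    Usage could look like: int n; TimeSpan t = CountTimer.Time(() => GetUrlCount(@"c:\temp"), out n); repeated for each approach, ideally several times each.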
  • > the marked answer the xmldocument process was the slowest!!!....guess who was the fastest.
    >
    > Find out here:
    >
    > .Net Regex: Can Regular Expression Parsing be Faster than XmlDocument or Linq to Xml?
    >
    > William Wegerson (www.OmegaCoder.Com)

    In conclusion, the final ranking on this thread is:
    1. OmegaMan
    2. Link.fr
    3. Hashim

    aelassas.free.fr
    • Edited by Link.fr Friday, November 25, 2011 7:13 PM
    Friday, November 25, 2011 7:12 PM
  • > the marked answer the xmldocument process was the slowest
     
     

    very rarely. take a look here

     
    • Edited by Malobukv Saturday, November 26, 2011 4:46 AM
    Saturday, November 26, 2011 4:43 AM
  • > the marked answer the xmldocument process was the slowest
     
     

    very rarely. take a look here

     

    He used different regex options that were unnecessary and that significantly slowed his regex tests down. See his link, where I gave my reply.

    William Wegerson (www.OmegaCoder.Com)
    Saturday, November 26, 2011 9:34 PM