Querying very large xml files with XML extractor

    Question

  • I have a very large merged XML file, on the scale of GBs. I am using the following code with XPath queries to read and process the data.

    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // All output columns must be strings, since values are emitted as InnerXml.
        IColumn column = output.Schema.FirstOrDefault(col => col.Type != typeof(string));
        if (column != null)
        {
            throw new ArgumentException(string.Format("Column '{0}' must be of type 'string', not '{1}'", column.Name, column.Type.Name));
        }

        // Load the entire input into a DOM; this requires the whole document in memory.
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Auto;
        XmlReader r = XmlReader.Create(input.BaseStream, settings);
        XmlDocument xmlDocument = new XmlDocument();
        xmlDocument.Load(r);

        // Register the user-supplied namespace prefixes so the XPath queries can use them.
        XmlNamespaceManager nsmgr = new XmlNamespaceManager(xmlDocument.NameTable);
        if (this.namespaces != null)
        {
            foreach (Match nsdef in xmlns.Matches(this.namespaces))
            {
                string prefix = nsdef.Groups[1].Value;
                string uri = nsdef.Groups[3].Value;
                nsmgr.AddNamespace(prefix, uri);
            }
        }

        // Emit one output row per node matching the row path.
        foreach (XmlNode xmlNode in xmlDocument.DocumentElement.SelectNodes(this.rowPath, nsmgr))
        {
            foreach (IColumn col in output.Schema)
            {
                var explicitColumnMapping = this.columnPaths.FirstOrDefault(columnPath => columnPath.Value == col.Name);
                XmlNode xml = xmlNode.SelectSingleNode(explicitColumnMapping.Key ?? col.Name, nsmgr);
                output.Set(explicitColumnMapping.Value ?? col.Name, xml == null ? null : xml.InnerXml);
            }
            yield return output.AsReadOnly();
        }
    }

    However, it only works well for smaller files on the scale of MBs. It runs fine locally but fails in ADLA. I also need to keep the namespace manager. How can I scale this so that I can process bigger files? When I submit a job with a huge file, I always get the following error with no extra information:

    VertexFailedError

    (This thread can also be found on stackoverflow at: http://stackoverflow.com/questions/36356218/querying-very-large-xml-files)

    Monday, April 4, 2016 10:53 AM

Answers

  • U-SQL extractors are, by default, scaled out to work in parallel over smaller parts of the input file, called extents. Each extent is about 250 MB in size.

    If the data you are processing cannot fit into a single extent, you have to tell U-SQL with a C# attribute that the extractor has to see the file in its entirety. You do that by adding the following attribute ahead of your extractor class:

        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]

    Now in your case, an XML document obviously cannot be split, since the parser needs to see both the beginning and the end of the document. This is especially true if you only have a single XML document (side note: having GBs in a single XML or JSON document is, in my opinion, often a bad idea).

    Furthermore, I would suggest that you look at the sample XML extractor that we provide on our GitHub site here: https://github.com/Azure/usql/tree/master/Examples/DataFormats
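
    For illustration only, here is a minimal sketch of where the attribute goes, assuming the Extract method from the question lives in a class called XmlDomExtractor (the actual class name is not shown in the question):

        using System.Collections.Generic;
        using Microsoft.Analytics.Interfaces;

        // Sketch: the attribute tells U-SQL not to split the input file into extents,
        // so the extractor sees the whole XML document. The class name is assumed.
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class XmlDomExtractor : IExtractor
        {
            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                // ... the XmlDocument/XPath logic from the question goes here ...
                yield break;
            }
        }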


    Michael Rys

    • Proposed as answer by Michael Amadi Thursday, April 7, 2016 10:09 PM
    • Marked as answer by umar-qureshi Friday, April 8, 2016 5:46 AM
    Thursday, April 7, 2016 8:42 PM
    Moderator

All replies

  • I followed the XMLReader sample from GitHub and extended it to support attributes and namespaces for our use case. It worked very well locally as well as with ADL. However, we might have to discard it because of the atomic file processing behavior. Millions of small XML files were merged to form a big XML file of several GBs because of the 3,000-file limitation.
    Friday, April 8, 2016 5:46 AM
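
    For readers following along, here is a minimal sketch of the streaming approach taken by the GitHub XmlReader sample mentioned in the reply above: it materializes one row element at a time with XmlReader instead of loading the whole document into an XmlDocument. This is not the poster's actual extension; the class name, the row element name, and the simple column handling are assumptions:

        using System.Collections.Generic;
        using System.Xml;
        using Microsoft.Analytics.Interfaces;

        // Sketch only: streams through the document and parses one row element at a time,
        // so memory use stays bounded even for a very large single XML document.
        [SqlUserDefinedExtractor(AtomicFileProcessing = true)]
        public class StreamingXmlExtractor : IExtractor
        {
            private readonly string rowElementName;

            public StreamingXmlExtractor(string rowElementName)
            {
                this.rowElementName = rowElementName;
            }

            public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
            {
                using (XmlReader reader = XmlReader.Create(input.BaseStream))
                {
                    reader.MoveToContent();
                    while (!reader.EOF)
                    {
                        if (reader.NodeType == XmlNodeType.Element && reader.LocalName == this.rowElementName)
                        {
                            // ReadOuterXml consumes the current element and advances past it,
                            // so only one row element is held in memory at a time.
                            XmlDocument rowDoc = new XmlDocument();
                            rowDoc.LoadXml(reader.ReadOuterXml());

                            foreach (IColumn col in output.Schema)
                            {
                                XmlNode valueNode = rowDoc.DocumentElement.SelectSingleNode(col.Name);
                                output.Set(col.Name, valueNode == null ? null : valueNode.InnerXml);
                            }
                            yield return output.AsReadOnly();
                        }
                        else
                        {
                            reader.Read();
                        }
                    }
                }
            }
        }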
  • If you merge many small XML documents into one big file, I suggest that you remove the CR and LF characters from each XML document and use CR LF as the document boundary. As long as each XML document is less than about 4 MB, you can then use the CR LF boundaries to split the file into rows and operate on one XML document per row. That way you regain parallel processing.
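
    A minimal sketch of that preprocessing step, assuming it runs as a plain C# console tool outside of U-SQL (folder and file names are placeholders, not anything from this thread):

        using System;
        using System.IO;

        // Sketch only: merges many small XML files into one file with exactly one XML
        // document per line, so a line-oriented extractor can split the work into
        // parallel vertices again.
        class MergeXmlToLines
        {
            static void Main(string[] args)
            {
                string inputFolder = args[0];   // folder containing the small XML files
                string outputFile = args[1];    // merged file, one document per line

                using (StreamWriter writer = new StreamWriter(outputFile))
                {
                    foreach (string path in Directory.EnumerateFiles(inputFolder, "*.xml"))
                    {
                        string xml = File.ReadAllText(path);
                        // Remove CR and LF inside the document so the document itself
                        // contains no line breaks; CR LF then only marks document boundaries.
                        string oneLine = xml.Replace("\r", " ").Replace("\n", " ");
                        writer.WriteLine(oneLine);
                    }
                }
            }
        }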


    Michael Rys

    Monday, April 11, 2016 11:15 PM
    Moderator