none
Databricks Read a XML file

    Question

  • Hello,

    I'm trying to read a directory full of XML files into a SQL DW. First did it with Azure Functions, but got the advice to switch to Databricks for lesser server load while using Polybase. Volume is about 20.000 files per hour.

    But I can't find any example on how to read a xml file in python. I found some easy sample for python itself but when trying to import the libraries that are in those script it fails.

    Anybody could push me in the right direction.

    This is for example a script I've received for python local.

    import pandas as pd
    import xml.etree.ElementTree as ET
    import re
    import os
    xmlfolder = 'Energy_RT'
    xmlfiles = os.listdir(xmlfolder)
    ##Get attribute names (for now I took all leafs of the xml structure)
    firstfile = os.path.join(xmlfolder, xmlfiles[0])
    root = ET.parse(firstfile).getroot()
    attributes = [node.tag for node in root.iter() if len(node)==0]
    clean_attribute_names = [re.sub(r'\{.*\}', '', a) for a in attributes]
    #Create Dataframe and save it as csv
    df = pd.DataFrame(columns=clean_attribute_names, index=xmlfiles)
    for xf in xmlfiles:
        root = ET.parse(os.path.join(xmlfolder,xf)).getroot()
        df.loc[xf] = [node.text for node in root.iter() if node.tag in attributes]
    df.to_csv('out.csv')

    Mounting, listing, unmount all works fine, but the xml to csv conversion breaks my head.

    I'm not a python developer ...

    Tuesday, October 9, 2018 5:42 PM

All replies

  • Hello,

    What errors are you getting with the libraries?

    Tuesday, October 9, 2018 10:27 PM
    Moderator
  • Error: java.lang.RuntimeException: Installation failed with message: Collecting xml.etree.ElementTree Collecting xml.etree.ElementTree Collecting xml.etree.ElementTree Could not find a version that satisfies the requirement xml.etree.ElementTree (from versions: ) No matching distribution found for xml.etree.ElementTree You are using pip version 10.0.1, however version 18.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command. Could not find a version that satisfies the requirement xml.etree.ElementTree (from versions: ) No matching distribution found for xml.etree.ElementTree You are using pip version 10.0.1, however version 18.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command. Could not find a version that satisfies the requirement xml.etree.ElementTree (from versions: ) No matching distribution found for xml.etree.ElementTree You are using pip version 10.0.1, however version 18.1 is available. You should consider upgrading via the 'pip install --upgrade pip' command. 
    Wednesday, October 10, 2018 7:13 AM
  •  A reduced sample of the xml is

    <msg:TrainTrackingMessage xmlns:msg="be:brail:nmbs-it:esb:msg:traintraffic" xmlns:trtf="be:brail:nmbs-it:esb:traintraffic" xmlns:gene="be:brail:nmbs-it:esb:generalelements">
    	<gene:Event>
    		<gene:EventSource>Infrabel</gene:EventSource>
    		<gene:EventType>tracking</gene:EventType>
    		<gene:EventMessage>TrainTracking</gene:EventMessage>
    		<gene:EventTimeStamp>2018-09-27T14:13:15.458439</gene:EventTimeStamp>
    	</gene:Event>
    	<gene:Train>
    		<gene:TrainKey>
    			<gene:CirculationType>1</gene:CirculationType>
    			<gene:Discriminator>0</gene:Discriminator>
    			<gene:DepartureDate>2018-09-27</gene:DepartureDate>
    		</gene:TrainKey>
    		<gene:TrainDetails>
    			<gene:Operator>SNCB/NMBS</gene:Operator>
    			<gene:TrainGroup>1</gene:TrainGroup>
    		</gene:TrainDetails>
    	</gene:Train>
    	<trtf:TrainTracking>
    		<gene:ItineraryPoint>
    			<gene:PtcarIdentification>592</gene:PtcarIdentification>
    			<gene:OrderNumber>150</gene:OrderNumber>
    			<gene:ItineraryPointTimeInfo>
    				<gene:ArrivalTime>14:10:47</gene:ArrivalTime>
    				<gene:ArrivalMidnightCount>0</gene:ArrivalMidnightCount>
    				<gene:DepartureTime>14:10:54</gene:DepartureTime>
    				<gene:DepartureMidnightCount>0</gene:DepartureMidnightCount>
    				<gene:BufferTime>0</gene:BufferTime>
    				<gene:ReserveTime>2</gene:ReserveTime>
    			</gene:ItineraryPointTimeInfo>
    	</trtf:TrainTracking>
    </msg:TrainTrackingMessage>
    

    Wednesday, October 10, 2018 7:19 AM
  • Hello,

    It looks like you need to make the ElementTree library available on the cluster:

    https://docs.azuredatabricks.net/user-guide/libraries.html

    Tuesday, October 23, 2018 6:34 PM
    Moderator