How to get attribute values of certain nodes containing some string?

  • Question

  • I have some xml files that have some nodes like (among other nodes)

    <disp-formula id="deqn1">
    ...\tag{1}
    </disp-formula>
    <disp-formula id="deqn2">
    ...\tag{2}
    </disp-formula>
    <disp-formula id="deqn3-6">
    ...\tag{3}
    ...
    ...\tag{4}
    ...\tag{5}...
    ...
    ......\tag{6}
    </disp-formula>

    ... etc and nodes like

    <xref ref-type="disp-formula" rid="deqn5">(5)</xref>
    <xref ref-type="disp-formula" rid="deqn2">2</xref>
    <xref ref-type="disp-formula" rid="deqn4">3, 4</xref>
    <xref ref-type="disp-formula" rid="deqn6">(5)-(6)</xref>

    I want to modify the rid attribute values excluding the string deqn of the nodes which also have the attribute ref-type="disp-formula" by using the following logic:

    1) Take the contents (only the numeric or alpha_numeric values) of the node xref which have the attribute ref-type="disp-formula",

    2) Check the file for that same value whether it exists inside a string \tag{} which is inside a node  <disp-formula> with an attribute named id,

    3) If yes, then get that id value and paste it in the rid attribute of the respective node <xref>

    The modifications in the file should be

    <xref ref-type="disp-formula" rid="deqn3-6">(5)</xref>
    <xref ref-type="disp-formula" rid="deqn2">2</xref>
    <xref ref-type="disp-formula" rid="deqn3-6">3, 4</xref>
    <xref ref-type="disp-formula" rid="deqn3-6">(5)-(6)</xref>


    I'm stuck at the very beginning and running out of ideas

    public static void Main(string[] args)
            {
                XDocument doc = XDocument.Load(@"C:\Users\sample.xml");
                var eqIDS=from x in doc.Descendants("xref")
                    where x.Attribute("ref-type").Value=="disp-formula"
                    let _x=x.Attribute("rid").Value.ToString().Substring(4)
                    select _x;
    
                if (eqIDS.Any())
                {
                    foreach (var element in eqIDS)
                    {
                        ????????
                    }
                }
                else
                {
                    Console.WriteLine("NO matches found");
                }
                Console.ReadLine();
            }


    Can anyone help!!!


    • Edited by Bumba_007 Friday, December 29, 2017 3:33 PM
    Friday, December 29, 2017 2:15 AM

All replies

  • The sample input and the sample output do not line up based upon your rules. You mentioned that you look for a tag with the given value (after the deqn I assume). For line deqnc you didn't provide any matching input values yet your output auto-magically mapped it to 3-6. Since you didn't specify how this range works and there isn't any sample input it is hard to understand your actual rules because the algorithm isn't complete.

    Here's the general algorithm you would follow based solely upon your initial set of values that work with your defined rules.

    Select all nodes called disp-formula
    For each node
        Determine the tag(s) that will map to it
        Add the tag(s) as keys to a dictionary with the value set to the id attribute

    Select all nodes called xref with a ref-type of disp-formula, using XPath
    For each of the nodes 
       Get the value of rid skipping the 'deqn' start string
       Look up the calculated value in the dictionary created earlier
       If the key is found then the value is the new value of the rid attribute
       Otherwise ??
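
    The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class and method names (TagLookupSketch, FixRids) are hypothetical, the tags are assumed to appear as \tag{...} in the element text, and an rid with no matching tag is simply left unchanged, which is an assumption since that case is left open above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class TagLookupSketch   // hypothetical name, for illustration only
{
    // Matches \tag{...} and captures whatever sits between the braces.
    static readonly Regex TagPattern = new Regex(@"\\tag\{([^}]+)\}");

    public static void FixRids(XDocument doc)
    {
        // Step 1: map each tag value to the id of the disp-formula that contains it.
        var idByTag = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (var formula in doc.Descendants("disp-formula"))
        {
            var id = (string)formula.Attribute("id");
            if (id == null) continue;

            foreach (Match m in TagPattern.Matches(formula.Value))
                idByTag[m.Groups[1].Value.Trim()] = id;   // on duplicate tags, the last formula wins
        }

        // Step 2: rewrite rid on each disp-formula xref whose suffix is a known tag.
        foreach (var xref in doc.Descendants("xref"))
        {
            if ((string)xref.Attribute("ref-type") != "disp-formula") continue;

            var rid = (string)xref.Attribute("rid");
            if (rid == null || !rid.StartsWith("deqn", StringComparison.Ordinal)) continue;

            string newId;
            if (idByTag.TryGetValue(rid.Substring(4), out newId))
                xref.SetAttributeValue("rid", newId);
            // Otherwise: leave rid unchanged (assumption; this case is not specified).
        }
    }
}
```

    With the sample from the question, rid="deqn5" would look up the key "5", find it under the formula with id="deqn3-6", and be rewritten to that id. Because the keys are compared as strings, alphanumeric tags such as 2a need no special handling.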


    Michael Taylor http://www.michaeltaylorp3.net

    Friday, December 29, 2017 2:50 PM
    Moderator
  • Hi, CoolDadTx

    I've updated the sample input in the question.

    The general idea is to search the id value of the below nodes (excluding the string deqn)

    <xref ref-type="disp-formula" rid="deqn3">

    <xref ref-type="disp-formula" rid="deqn4">

    ...


    then search the same file for that value (i.e. 3 and 4 in this case) inside a <disp-formula> node in the form \tag{3} (i.e. the value sits inside a \tag{...} string). If there is a match, get the id attribute value of the <disp-formula> node where \tag{3} resides and store it in a variable for further use.



    • Edited by Bumba_007 Friday, December 29, 2017 3:42 PM
    Friday, December 29, 2017 3:40 PM
  • Then the algorithm I described should work. Give it a try and if it doesn't work then post the code you have along with the issue you're seeing and we can debug it.

    Michael Taylor http://www.michaeltaylorp3.net

    Friday, December 29, 2017 3:49 PM
    Moderator
  • Hi,

    I tried to do the first part of your algorithm

    Select all nodes called disp-formula
    For each node
        Determine the tag(s) that will map to it
        Add the tag(s) as keys to a dictionary with the value set to the id attribute

    But struggling to do so, can you help...

    This is what I came up with but it is not doing anything...

    Dictionary<string, string> dict = new Dictionary<string, string> ();
    			Regex pattern=new Regex("\\tag{{(\\w+)}}");
    			XDocument doc = XDocument.Load(@"C:\Test\sample.xml");
    			var deqns = from x in doc.Descendants("disp-formula")
    				where x.Attribute("id").Value.Contains("deqn")
    				select x;
    			foreach (var deqn in deqns)
    			{
    				string input=deqn.Value;
    				Match match = pattern.Match(input);
    				if (match.Success)
    				{
    					string v = match.Groups[1].Value;
    					dict.Add(v,deqn.Attribute("id").Value.Substring(4));
    					
    				}
    			}
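
    A likely reason the snippet above finds nothing is the pattern itself. In a regular C# string, "\\tag{{(\\w+)}}" becomes the regex \tag{{(\w+)}}, where \t is the escape for a tab character, so the pattern looks for a tab followed by "ag" and doubled braces. Match also returns only the first hit, so a formula with several tags would contribute just one key. A minimal sketch of an alternative, assuming tags contain only word characters (the name TagRegexSketch is hypothetical):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TagRegexSketch   // hypothetical name, for illustration only
{
    // Verbatim string: \\ matches a literal backslash, \{ and \} match literal braces.
    static readonly Regex TagPattern = new Regex(@"\\tag\{(\w+)\}");

    // Returns every tag value found in the text, in order of appearance.
    public static List<string> ExtractTags(string text)
    {
        var tags = new List<string>();
        foreach (Match m in TagPattern.Matches(text))
            tags.Add(m.Groups[1].Value);
        return tags;
    }

    static void Main()
    {
        var sample = @"a+b=10\tag{2}\\ v=3.2 \tag{3}\\ x=1 \tag{5a}";
        Console.WriteLine(string.Join(",", ExtractTags(sample)));   // prints 2,3,5a
    }
}
```

    Two further points to watch in the original loop: dict.Add throws if the same tag appears twice, and Substring(4) stores "3-6" rather than the full id "deqn3-6" that the xref needs.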


    • Edited by Bumba_007 Sunday, December 31, 2017 2:00 PM
    Sunday, December 31, 2017 1:59 PM
  • Here's one approach. I'm making assumptions about the actual XML format, and no error checking is being done. I also assume this is part of a larger set of changes being made, so the "formulas" are in a separate class for ease of use. I'm using simple string parsing here rather than REs. REs would work but seemed overkill for this problem.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Xml.Linq;

    class Program
    {
        static void Main ( string[] args )
        {
            var sourceXml = "<?xml version=\"1.0\" ?>" +
                "<root>" +
                "   <xrefs>" +
                "      <xref ref-type=\"disp-formula\" rid=\"deqn5\">(5)</xref>" +
                "      <xref ref-type=\"disp-formula\" rid=\"deqn2\">2</xref>" +
                "      <xref ref-type=\"disp-formula\" rid=\"deqn4\">3, 4</xref>" +
                "      <xref ref-type=\"disp-formula\" rid=\"deqn6\">(5)-(6)</xref>" +
                "   </xrefs>" +
                "   <formulas>" +
                "      <disp-formula id=\"deqn1\">\\tag{1}</disp-formula>" +
                "      <disp-formula id=\"deqn2\">\\tag{2}</disp-formula>" +
                "      <disp-formula id=\"deqn3-6\">\\tag{3}\\tag{4}\\tag{5}\\tag{6}</disp-formula>" +
                "   </formulas>" +
                "</root>";
    
            var xml = XDocument.Parse(sourceXml);
    
            //Build the list of formulas
            var formulas = ReadFormulas(xml);
    
            //Get the xref entries- assuming they appear only once
            var xrefs = xml.Descendants("xref");
            FixUpXRefs(xrefs, formulas);
    
            xml.Save(@"C:\Temp\test.xml");
        }
    
        static void FixUpXRefs ( IEnumerable<XElement> xrefs, IEnumerable<Formula> formulas )
        {
            foreach (var xref in xrefs)
            {
                //Get the subset of "rid" containing the actual value
                var idText = xref.Attribute("rid")?.Value ?? "";
                var id = (idText.Length > 4) ? idText.Substring(4) : "";
    
                //Find matching formula, if any
                var formula = formulas.FirstOrDefault(f => f.Tags.Contains(id, StringComparer.OrdinalIgnoreCase));
    
                //Fix up xref
                if (formula != null)
                    xref.Attribute("rid").Value = formula.Id;
            };
        }
    
        //Not sure how your tags are actually stored as the XML posted wouldn't make sense so we'll assume
        //that the tags are part of the element text and we'll just look for \tag
        static IEnumerable<string> ParseTags ( string value )
        {
            var tags = value.Split(new[] { @"\tag" }, StringSplitOptions.RemoveEmptyEntries );
            foreach (var tag in tags)
            {
                //Could use RE here but this is pretty simple to do using simple string search
                var startIndex = tag.IndexOf('{');
                var endIndex = (startIndex >= 0) ? tag.IndexOf('}') : -1;
                var text = (endIndex > 0) ? tag.Substring(startIndex + 1, endIndex - startIndex - 1) : "";
                if (!String.IsNullOrEmpty(text))
                    yield return text;
            };
        }
    
        //Using a strong type here rather than a simple dictionary because I assume you may want to do more with 
        //formulas later
        static IEnumerable<Formula> ReadFormulas ( XDocument doc )
        {
            //Find the formulas - assuming these elements only occur in one child element
            var nodes = doc.Descendants("disp-formula");
                
            foreach (var node in nodes)
            {
                //Not doing any error checking here...
                yield return new Formula()
                {
                    Id = node.Attribute("id").Value,
                    Tags = ParseTags(node.Value).ToList()
                };                
            };           
        }
    }
    
    class Formula
    {
        public string Id { get; set; }
    
        public List<string> Tags { get; set; } = new List<string>();
    }


    Michael Taylor http://www.michaeltaylorp3.net

    Sunday, December 31, 2017 7:49 PM
    Moderator
  • Hello Bumba_007,

    You also could try the below example.

    For a given xml file

    <?xml version="1.0" encoding="utf-8" ?>
    <root> 
      <xrefs>
        <xref ref-type="disp-formula" rid="deqn5">(5)</xref>
        <xref ref-type="disp-formula" rid="deqn2">2</xref>
        <xref ref-type="disp-formula" rid="deqn4">3, 4</xref>
        <xref ref-type="disp-formula" rid="deqn6">(5)-(6)</xref>
      </xrefs>
      
      <formulas>
        <disp-formula id="deqn1">\tag{1}</disp-formula>
        <disp-formula id="deqn2">\tag{2}</disp-formula>
        <disp-formula id="deqn3-6">\tag{3}\tag{4}\tag{5}\tag{6}</disp-formula>
      </formulas>  
    </root>
    


    The linq example

     static class ExteinClass {
    
            public static Boolean IsInRange(this string str, string range) {
                var value = Convert.ToInt32(str.Replace("deqn",""));
                var minimum = Convert.ToInt32(range.Replace("deqn", "").Split('-')[0]);
                var maximum = Convert.ToInt32(range.Replace("deqn", "").Split('-')[1]);
                return value >= minimum && value <= maximum;
            }
        }
        class Program
        {
            static void Main(string[] args)
            {
                XDocument doc = XDocument.Load(@"../../XMLFile1.xml");
    
                var DeqnRanges = doc.Descendants("disp-formula").Where(
                     x => x.Attribute("id").Value.Contains("deqn") && x.Attribute("id").Value.Contains("-")
                     ).Attributes("id");
    
                var Xref_Deqns = doc.Descendants("xref");
    
                foreach (var deqn in Xref_Deqns)
                {
                    foreach (var range in DeqnRanges)
                    {
                        if (deqn.Attribute("rid").Value.IsInRange(range.Value))
                        {
                            deqn.Attribute("rid").Value = range.Value;
                        }
                    }
                }
    
                doc.Save("new.xml");
            }
        }

    The result

      <xrefs>
        <xref ref-type="disp-formula" rid="deqn3-6">(5)</xref>
        <xref ref-type="disp-formula" rid="deqn2">2</xref>
        <xref ref-type="disp-formula" rid="deqn3-6">3, 4</xref>
        <xref ref-type="disp-formula" rid="deqn3-6">(5)-(6)</xref>
      </xrefs>

    Best regards,

    Neil Hu


    MSDN Community Support
    Please remember to click "Mark as Answer" the responses that resolved your issue, and to click "Unmark as Answer" if not. This can be beneficial to other community members reading this thread. If you have any compliments or complaints to MSDN Support, feel free to contact MSDNFSF@microsoft.com.

    Monday, January 1, 2018 9:01 AM
    Moderator
  • Hi, Neil Hu

    Thank you for posting your answer. Your answer was much easier to understand.

    However, there are a few issues with the program.

    If the file already contains nodes in this format before the program runs

    <xref ref-type="disp-formula" rid="deqn3-6">(5)</xref>
    <xref ref-type="disp-formula" rid="deqn1-2">(2)</xref>

    I get an exception
    System.FormatException: Input string was not in a correct format.

    The same thing happens if there is a non-integer value like 2a, c6, 5a-5g ... in rid="deqn...", in \tag{...}, or in both.

    Monday, January 1, 2018 3:37 PM
  • The XML file format looks like this

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD with OASIS Tables v1.0 20120330//EN" "JATS-journalpublishing-oasis-article1.dtd">
    <article article-type="proceedings" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table">
    <front>
    <journal-meta>
    <journal-title-group>
    <journal-title>Eleventh International Conference on Correlation Optics</journal-title>
    </journal-title-group>
    <issn pub-type="epub">0337-034N</issn>
    <publisher>
    <publisher-name>Elsevier</publisher-name>
    </publisher>
    </journal-meta>
    <article-meta>
    <article-id pub-id-type="doi">10.1037/rmh0000008</article-id>
    <title-group>
    <article-title>The Internet of Things for Health Care: A Comprehensive Survey</article-title>
    </title-group>
    <contrib-group>
    <contrib contrib-type="author">
    <name>
    <given-names>S. M. Riazul</given-names> <surname>Islam</surname>
    </name>
    <xref ref-type="aff" rid="a1"><sup>a</sup></xref>
    </contrib>
    <contrib contrib-type="author" corresp="yes">
    <name>
    <given-names>Daehan</given-names> <surname>Kwak</surname>
    </name>
    <xref ref-type="aff" rid="a2"><sup>b</sup></xref>
    <xref ref-type="corresp" rid="cor1">&#x002A;</xref>
    </contrib>
    </contrib-group>
    <aff id="a1"><label><sup>a</sup></label>MIT, USA</aff>
    <aff id="a2"><label><sup>b</sup></label>Catech, USA</aff>
    </article-meta>
    </front>
    <body>
    <section id="sect1">
    <p>The Internet of Things (IoT) makes smart objects <xref ref-type="disp-formula" rid="deqn5">(5)</xref> the ultimate building blocks in the development of cyber-physical smart pervasive frameworks.
    <disp-formula id="deqn1">ax+b=13\tag{1}</disp-formula>
    </p>
    <p>The IoT revolution is redesigning modern health care with <xref ref-type="disp-formula" rid="deqn1">2</xref> promising technological, economic, and social prospects.</p>
    <disp-formula id="deqn2-5">a+b=10\tag{2}\\
    v=3.2 \tag{3}\\
    v-op=x \tag{4}\\
    a+p-z=0 \tag{5}
    </disp-formula>
    </section>
    </body>
    </article>



    Also there could be multiple identical link nodes (i.e. with the same rid value)

    <xref ref-type="disp-formula" rid="deqn...">...</xref>
    

     in the file

    • Edited by Bumba_007 Monday, January 1, 2018 3:53 PM
    Monday, January 1, 2018 3:48 PM
  • I believe the code I posted should work for your XML. You'll just need to adjust the XPath for each of the elements you care about. Use XPathSelectElements to find the elements given an XPath. Since the paths seem relatively arbitrary, but unique, I'd try just searching for xref and disp-formula without regard for the parent elements (i.e. //xref and //disp-formula).

    "Also there could be multiple identical link nodes"

    Shouldn't matter for the xref. If there were dup formulas then you'd have to decide which one to use though.
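
    As a rough illustration of the XPath suggestion, the sketch below selects both kinds of elements at any depth. The inline XML is a made-up fragment in the shape of the posted file, and the class name XPathSketch is hypothetical. Note that XPathSelectElements is an extension method that needs the System.Xml.XPath namespace.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;   // brings in the XPathSelectElements extension method

static class XPathSketch   // hypothetical name, for illustration only
{
    static void Main()
    {
        // Made-up fragment shaped like the posted file.
        var doc = XDocument.Parse(
            "<article><body><p>See <xref ref-type=\"disp-formula\" rid=\"deqn5\">(5)</xref> and " +
            "<xref ref-type=\"aff\" rid=\"a1\">a</xref>.</p>" +
            "<disp-formula id=\"deqn2-5\">a+b=10\\tag{2}</disp-formula></body></article>");

        // Only xref elements that point at display formulas, regardless of parent.
        var xrefs = doc.XPathSelectElements("//xref[@ref-type='disp-formula']").ToList();

        // Every disp-formula element, regardless of parent.
        var formulas = doc.XPathSelectElements("//disp-formula").ToList();

        Console.WriteLine(xrefs.Count);      // prints 1
        Console.WriteLine(formulas.Count);   // prints 1
    }
}
```

    Filtering on @ref-type in the XPath keeps the author affiliation links (ref-type="aff") out of the result, so they are never touched.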


    Michael Taylor http://www.michaeltaylorp3.net

    Monday, January 1, 2018 7:51 PM
    Moderator
  • Hello Bumba_007,

    Based on your description, I have listed some possible circumstances at the link below.

    I need to confirm some situations with you.

    1. Option 1 of Xref_rid meets option 2 of Disp-formula_id (i.e. 4, 3f-5g)

    2. Option 2 of Xref_rid meets option 1 of Disp-formula_id

    3. Option 3 of Xref_rid meets option 2 of Disp-formula_id

    4. Option 4 of Xref_rid meets option 1 of Disp-formula_id

    You then just need to implement this logic in the IsInRange method.

    Best regards,

    Neil Hu



    Tuesday, January 2, 2018 10:37 AM
    Moderator
  • Hi Neil Hu,

    I know the IsInRange method is not going to work for alphanumeric strings, so how else can this be done?

    If it is not possible, how can I make the program find only those <xref> nodes whose rid="deqn..." contains a purely numeric value (i.e. deqn6, deqn25 ... etc), modify only those, and leave the remaining <xref> nodes as they are, without crashing the program?

    BTW, does the IsInRange method work properly for numbers with dots, i.e. 1.3, 2.4.16?



    Also the circumstances you described in your last post are true.
    • Edited by Bumba_007 Tuesday, January 2, 2018 2:14 PM
    Tuesday, January 2, 2018 2:12 PM
  • Hello Bumba_007,

    You just need to add some if-else logic to the IsInRange method to handle the situations above. Because there is a variety of circumstances, each one has to be handled with its own hard-coded case.

      static class ExteinClass {
            public static Boolean IsInRange(this string str, string range) {
    
                //situation1: all numbers
                // Anchored so that values such as deqn2a or deqn3-6 are not treated as plain numbers
                Regex strRegex = new Regex(@"^deqn[0-9]+$");
                Regex RangeRegex = new Regex(@"^deqn[0-9]+-[0-9]+$");
    
                if (strRegex.IsMatch(str) && RangeRegex.IsMatch(range))
                {
                    var value = Convert.ToInt32(str.Replace("deqn", ""));
                    var minimum = Convert.ToInt32(range.Replace("deqn", "").Split('-')[0]);
                    var maximum = Convert.ToInt32(range.Replace("deqn", "").Split('-')[1]);
    
                    return value >= minimum && value <= maximum;
                }
                //situation 2:number and letters
    
                else if (new Regex(@"deqn[A-Za-z0-9]+").IsMatch(str) && new Regex(@"deqn[A-Za-z0-9]+-[A-Za-z0-9]+").IsMatch(range)){
    
                }
                //situation 3: change the regex string to catch your situation
                else if (new Regex(@"").IsMatch(str) && new Regex(@"").IsMatch(range)) {
    
                }
                return false;         
            }
        }
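
    For the narrower question of modifying only purely numeric rid values and leaving everything else alone, one possible sketch uses int.TryParse instead of Convert.ToInt32, so that values such as 2a, 5a-5g, 1.3 or an already-converted 3-6 return false rather than throwing. The names RangeSketch and IsInNumericRange are hypothetical, and the check assumes the tags inside a range formula are contiguous integers.

```csharp
using System;

static class RangeSketch   // hypothetical name, for illustration only
{
    // True only when rid is "deqn<int>" and rangeId is "deqn<int>-<int>" with the
    // value inside the range; anything else (2a, 5a-5g, 1.3, 3-6) returns false.
    public static bool IsInNumericRange(string rid, string rangeId)
    {
        const string prefix = "deqn";
        if (rid == null || rangeId == null) return false;
        if (!rid.StartsWith(prefix, StringComparison.Ordinal)) return false;
        if (!rangeId.StartsWith(prefix, StringComparison.Ordinal)) return false;

        var bounds = rangeId.Substring(prefix.Length).Split('-');
        if (bounds.Length != 2) return false;

        int value, min, max;
        if (!int.TryParse(rid.Substring(prefix.Length), out value)) return false;
        if (!int.TryParse(bounds[0], out min) || !int.TryParse(bounds[1], out max)) return false;

        return value >= min && value <= max;
    }

    static void Main()
    {
        Console.WriteLine(IsInNumericRange("deqn5", "deqn3-6"));     // True
        Console.WriteLine(IsInNumericRange("deqn3-6", "deqn3-6"));   // False
        Console.WriteLine(IsInNumericRange("deqn2a", "deqn3-6"));    // False
        Console.WriteLine(IsInNumericRange("deqn1.3", "deqn1-2"));   // False
    }
}
```

    Convert.ToInt32 throws a FormatException for inputs like "1.3" or "2.4.16", which is why dotted numbers would crash the original method; with TryParse they are simply skipped.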

    Hope this would be helpful.

    Best regards,

    Neil Hu



    Thursday, January 4, 2018 11:16 AM
    Moderator