String Extraction

  • Question

  • User1867929564 posted

     Hi,

    When I use the OpenNLP parser, I get output like the sample below for any sentence I enter.
    For example, when I enter

    The suburb of Saffron Park lay on the sunset side of London, as red and ragged as a cloud of sunset.
    I get

    (TOP (S (NP (NP (DT The) (NN suburb)) (PP (IN of) (NP (NNP Saffron) (NNP Park)))) (VP (VBD lay) (PP (IN on) (NP (NP (DT the) (NN sunset) (NN side)) (PP (IN of) (NP (NNP London))))) (, ,) (PP (IN as) (ADJP (ADJP (JJ red) (CC and) (JJ ragged)) (PP (IN as) (NP (NP (DT a) (NN cloud)) (PP (IN of) (NP (NN sunset)))))))) (. .)))

    I want all the words tagged NN, NNP, NNS, or NNPS.

    How do I go about it?

    Tuesday, August 10, 2010 2:29 AM

All replies

  • User-1071856410 posted

    Are you looking to extract this pattern from your C# code?

    If so, you could give Regex a try:

            string input = "..."; // replace with your parser output
    
            MatchCollection matches = Regex.Matches(input, @"(NNP*S*)");
            foreach (Match nextMatch in matches)
            {
                Response.Write(nextMatch.Groups[1] + "<br />");
            }


     

    Tuesday, August 10, 2010 5:12 AM
  • User1867929564 posted

     Look at the innermost brackets. I want to extract the actual word associated with each tag (NNPS, NNP, NNS, NN). In my example, that would be

    suburb,Saffron,Park,sunset,side,London,cloud
    The second "sunset" is left out because duplicate words should appear only once.

    Tuesday, August 10, 2010 6:33 AM
  • User-1071856410 posted

    Try this,

            ArrayList al = new ArrayList();
            string input = "(TOP (S (NP (NP (DT The) (NN suburb)) (PP (IN of) (NP (NNP Saffron) (NNP Park)))) (VP (VBD lay) (PP (IN on) (NP (NP (DT the) (NN sunset) (NN side)) (PP (IN of) (NP (NNP London))))) (, ,) (PP (IN as) (ADJP (ADJP (JJ red) (CC and) (JJ ragged)) (PP (IN as) (NP (NP (DT a) (NN cloud)) (PP (IN of) (NP (NN sunset)))))))) (. .)))";
            string pattern = @"((NNP*S*)(\s+)(\w+))";
            foreach (Match match in Regex.Matches(input, pattern, RegexOptions.Multiline))
            {
                string pattern1 = @"(\s+)(\w+)";
                foreach (Match m1 in Regex.Matches(match.Value, pattern1, RegexOptions.Multiline))
                {
                    // Group 2 is the word itself; m1.Value would also include the leading whitespace.
                    string word = m1.Groups[2].Value;
                    if (!al.Contains(word))
                    {
                        al.Add(word);
                    }
                }
            }
    
            foreach (object o in al)
            {
                Response.Write(o.ToString() + "<br />");
            }


     

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Tuesday, August 10, 2010 9:15 AM
  • User1867929564 posted

     Hi,

    Thanks a lot. It's almost correct.
    Suppose I enter "L&T bags Rs 747 CR orders".
    It becomes
    (TOP (S (NP (NN L&T) (NNS bags) (NNS Rs)) (NP (CD 747)) (VP (VBD CR) (NP (NNS orders)))))

    The nouns I get now are
    L,bags,Rs,orders
    It would be more accurate if I got
    L&T,bags,Rs,orders

    Also, can you explain what these two strings do?

    string pattern = @"((NNP*S*)(\s+)(\w+))";
    string pattern1 = @"(\s+)(\w+)";   


    Thanks again.

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, August 11, 2010 12:26 AM
  • User1867929564 posted

     Hi,

    I finally got it working.
    I made my own list of probable terms that must appear in the output, generally abbreviations:
     string[] strSearch = { "L&T", "NIIT" }; and so on.
    After the first output, I check whether the original title contains any of the strSearch entries.
    If it does, I check whether that entry is already in the first output; if it is, I leave it, otherwise I add it.
    That gives the final output.
    Also, I ignore any word shorter than 3 characters.
    This way I ignore words such as "RS".
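
    A minimal sketch of the post-processing described above, for anyone following along. The class and method names (NounPostProcessor, Refine) are hypothetical, and it assumes the length filter applies only to the words from the first pass, not to the strSearch entries; the post does not say which.

```csharp
using System;
using System.Collections.Generic;

public static class NounPostProcessor
{
    // title:     the original sentence that was sent to the parser
    // firstPass: the words extracted by the regex
    // strSearch: known terms (for example abbreviations) that must not be lost
    public static List<string> Refine(string title, IEnumerable<string> firstPass, string[] strSearch)
    {
        var result = new List<string>();

        // Keep first-pass words of 3 or more characters, without duplicates.
        foreach (string word in firstPass)
        {
            if (word.Length >= 3 && !result.Contains(word))
            {
                result.Add(word);
            }
        }

        // Add any known term that occurs in the title but is missing from the output.
        foreach (string term in strSearch)
        {
            if (title.Contains(term) && !result.Contains(term))
            {
                result.Add(term);
            }
        }

        return result;
    }

    public static void Main()
    {
        string title = "L&T bags Rs 747 CR orders";
        string[] firstPass = { "L", "bags", "Rs", "orders" };
        string[] strSearch = { "L&T", "NIIT" };

        List<string> nouns = Refine(title, firstPass, strSearch);
        Console.WriteLine(string.Join(",", nouns)); // prints "bags,orders,L&T"
    }
}
```

    Note that title.Contains is a case-sensitive substring check, so "NIIT" would also be found inside a longer token.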

    But you still need to explain the thing I asked about.
    Thanks again

    Wednesday, August 11, 2010 4:22 AM
  • User-1071856410 posted

    It would be more accurate if i get,
    L&T,bags,Rs,orders

     

    Try modifying it as below to accept the given special characters inside a word.

            ArrayList al = new ArrayList();
            
            // List the special characters that you want to allow inside a word here
            string allowedSpecialChars = "[!@#$%^&*]";
    
            string input = "(TOP (S (NP (NN L&T) (NNS bags) (NNS Rs)) (NP (CD 747)) (VP (VBD CR) (NP (NNS orders)))))";
            string pattern = @"((NNP*S*)(\s+)(\w+" + allowedSpecialChars + @"*\w*))";
           
            foreach (Match match in Regex.Matches(input, pattern, RegexOptions.Multiline))
            {
                string pattern1 = @"(\s+)(\w+" + allowedSpecialChars  + @"*\w*)";
                foreach (Match m1 in Regex.Matches(match.Value, pattern1, RegexOptions.Multiline))
                {
                    // Group 2 is the word itself; m1.Value would also include the leading whitespace.
                    string word = m1.Groups[2].Value;
                    if (!al.Contains(word))
                    {
                        al.Add(word);
                    }
                }
            }
    
            foreach (object o in al)
            {
                Response.Write(o.ToString() + "<br />");
            }


     

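    Also, can you explain what these two strings do?
    string pattern = @"((NNP*S*)(\s+)(\w+))";
    string pattern1 = @"(\s+)(\w+)";

    These are regular expression patterns for matching words in the string.

    For example, ((NNP*S*)(\s+)(\w+)) matches two mandatory N characters, followed by zero or more P, followed by zero or more S, followed by one or more whitespace characters, followed by one or more word characters (letters, digits, or underscore). The parentheses only group and capture; they do not match the literal brackets in the parser output. pattern1, (\s+)(\w+), is then run on each match to pick out the whitespace and the word that follows the tag.

    Now we have modified the pattern to accept the given special characters inside the word.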
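
    To make the grouping concrete, here is a small sketch that prints the tag and the word captured by the modified pattern. The class and method names (PatternDemo, Describe) and the shortened input are made up for illustration; only the pattern itself comes from the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Returns one "tag -> word" string per match.
    public static List<string> Describe(string input)
    {
        // Groups are numbered by their opening parenthesis:
        //   1 = whole match, 2 = tag, 3 = whitespace, 4 = word
        string pattern = @"((NNP*S*)(\s+)(\w+[!@#$%^&*]*\w*))";

        var lines = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            lines.Add(m.Groups[2].Value + " -> " + m.Groups[4].Value);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("(NP (NN L&T) (NNS bags) (CD 747))"))
        {
            Console.WriteLine(line);
        }
        // prints:
        // NN -> L&T
        // NNS -> bags
    }
}
```

    Because P* and S* allow any number of repeats, the pattern would also accept strings such as NNPP. If that matters, NNP?S? is a stricter way to cover NN, NNP, NNS and NNPS.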

    • Marked as answer by Anonymous Thursday, October 7, 2021 12:00 AM
    Wednesday, August 11, 2010 4:28 AM