Lexical Analysis using GPLEX

    Question

  •  Jackson McCann wrote:
    ...but have you considered a lexical analysis approach? GPLEX is a C# lexer generator and can be found here: http://plas.fit.qut.edu.au/gplex/. It might provide an alternative approach.


    I like that link. Have you actually used the Aussie's product or the Managed Package Lex? It looks interesting. I am going to add the link to the .Net Regex Resources reference.

    Thanks
    March 16, 2007, 13:45
    Moderator

Answers

  • Well, here is my simple example; it took about an hour to create. It starts with some very simple HTML (extracting the content between the body tags isn't covered):

    This is some text in <b>bold</b> and <i>italics</i>
    <ul>
    <li>one</li>
    <li><b>two</b></li>
    <li>Three</li>
    </ul>
    <ol>
    <li>four</li>
    <li>five</li>
    <li>six</li>
    </ol>

    Here is the code that GPLEX turns into a C# program.

    /*
    This is a simple parser for html.  It tries to turn the html into something that
    can be displayed by a wiki. 
     */

    %namespace LexScanner
    %option noparser

    %x DOTLIST
    %x NUMLIST

    bl  [<]
    br  [>]
    bls  [<][/]
    any  [^<>]

    %%
    \n|\r\n? { Console.WriteLine() ; /* End of a line */ }

    <*>{bl}b{br} { Console.Write("'''") ; /* Three single quotes switches on bolding */ }
    <*>{bls}b{br} { Console.Write("'''") ; /* Three single quotes switches off bolding */ }

    <*>{bl}i{br} { Console.Write("'") ; /* A single quote switches on italics */ }
    <*>{bls}i{br} { Console.Write("'") ; /* A single quote switches off italics */ }

    <INITIAL>{bl}ul{br}  { BEGIN(DOTLIST) ;  /* Detect the start of simple list */ }
    <DOTLIST>{bls}ul{br} { BEGIN(INITIAL) ;  /* Detect the end of the list */ }
    <DOTLIST>{bl}li{br}  { Console.Write("   * ") ; /* Wiki code for a bulleted list item */ }
    <DOTLIST>{bls}li{br} { /* Ignore the end of list item tag */ }

    <INITIAL>{bl}ol{br}  { BEGIN(NUMLIST) ; /* Detect the start of numbered list */ }
    <NUMLIST>{bls}ol{br} { BEGIN(INITIAL) ; /* Detect the end of the list */ }
    <NUMLIST>{bl}li{br}  { Console.Write("   1 ") ; /* Wiki code for a numbered list item */ }
    <NUMLIST>{bls}li{br} { /* Ignore the end of list item tag */ }

    <*>{any}   { Console.Write(yytext) ; }

    %%

        public static void Main(string[] argp) {
            if (argp.Length == 0)
                Console.WriteLine("Usage: WordCount filename(s)");
            for (int idx = 0; idx < argp.Length; idx++) {
                string name = argp[idx];
                try {
                    int tok;
                    FileStream file = new FileStream(name, FileMode.Open);
                    Scanner scnr = new Scanner(file);
                    Console.WriteLine("File: " + name);
                    do {
                        tok = scnr.yylex();
                    } while (tok > (int)Tokens.EOF);
                } catch (IOException) {
                    Console.WriteLine("File " + name + " not found");
                }
            }
        }

    It isn't hard to get it to create a .dll that you can use from a program. I create a C# console app, add the lex code above as htmllexer.lex, and then put a pre-build event into the project to create a .cs file from the .lex file.

    gplex /minimize /summary  /out:$(ProjectDir)/htmllexer.cs $(ProjectDir)/htmllexer.lex

    And when I run the above on my test file I get:

    This is some text in '''bold''' and 'italics'

       * one
       * '''two'''
       * Three


       1 four
       1 five
       1 six

     I had to make some edits to stop the code turning into smileys - so apologies if I've introduced a typo.
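
For readers new to lex start conditions, the rules above can be approximated by hand in plain C#. This is only a sketch of the state logic, not the code GPLEX generates and not the poster's code. The names `WikiSketch`, `ToWiki`, and `Match` are made up for illustration. Two simplifications are assumptions of the sketch: line endings are copied through unchanged, and a stray `<` or `>` that matches no rule is also copied through.

```csharp
using System;
using System.Text;

// Hand-written approximation of the start-condition logic in the .lex file.
public static class WikiSketch
{
    // One value per start condition: INITIAL, DOTLIST, NUMLIST.
    private enum State { Initial, DotList, NumList }

    public static string ToWiki(string html)
    {
        var output = new StringBuilder();
        var state = State.Initial;
        int pos = 0;

        while (pos < html.Length)
        {
            // Rules marked <*> in the .lex file: tried in every state.
            if (Match(html, ref pos, "<b>") || Match(html, ref pos, "</b>"))
            {
                output.Append("'''");
            }
            else if (Match(html, ref pos, "<i>") || Match(html, ref pos, "</i>"))
            {
                output.Append("'");
            }
            // Rules tied to one state: only tried while that state is active.
            else if (state == State.Initial && Match(html, ref pos, "<ul>"))
            {
                state = State.DotList;
            }
            else if (state == State.Initial && Match(html, ref pos, "<ol>"))
            {
                state = State.NumList;
            }
            else if (state == State.DotList && Match(html, ref pos, "</ul>"))
            {
                state = State.Initial;
            }
            else if (state == State.NumList && Match(html, ref pos, "</ol>"))
            {
                state = State.Initial;
            }
            else if (state != State.Initial && Match(html, ref pos, "<li>"))
            {
                output.Append(state == State.DotList ? "   * " : "   1 ");
            }
            else if (state != State.Initial && Match(html, ref pos, "</li>"))
            {
                // End-of-item tags produce no output.
            }
            else
            {
                // Anything else is copied through one character at a time.
                output.Append(html[pos]);
                pos++;
            }
        }
        return output.ToString();
    }

    // Consumes 'tag' if the input at 'pos' starts with it.
    private static bool Match(string text, ref int pos, string tag)
    {
        if (string.CompareOrdinal(text, pos, tag, 0, tag.Length) != 0)
        {
            return false;
        }
        pos += tag.Length;
        return true;
    }
}

public static class Program
{
    public static void Main()
    {
        // Prints:    * '''two'''
        Console.WriteLine(WikiSketch.ToWiki("<ul><li><b>two</b></li></ul>"));
    }
}
```

The point of the sketch is that `<li>` means different things depending on the current state, which is exactly what the `<DOTLIST>` and `<NUMLIST>` prefixes express in the lex file without any explicit branching.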

    March 16, 2007, 15:47

All replies

  • I've used it once, to create a simple reader for a flat file containing formatted metadata about BEA Tuxedo field definitions. It took about half a day to get up to speed with the app, integrate it into VS2005, and create the .lex file.
    March 16, 2007, 15:16
  • Jackson,

    thank you for your time explaining this; it helps.

    March 16, 2007, 16:00