Contrib.Html.HtmlParse Class Reference

Yet another small HTML parser.

Inherited by wx.ZipRC.DoxygenHtbConverter.

Public Member Functions

 HtmlParse (HtmlLex src)
void Parse ()
 Starts the parser, reads from the source, and raises the parsing events.

Protected Member Functions

virtual bool OnAttribute (string attributeName, string value, IDictionary attributes)
 This will be called on defining an attribute of an element.
virtual void OnClosingElement (string currentElement, string tag)
 This will be called whenever the stack decreases.
virtual void OnDefaultEvent (string token)
 This will be called for every token that is not part of a more specific event.
virtual void OnEndElementTag (string currentElement, string tagString, IDictionary attributes)
 Override this to react to the start of an element.
virtual void OnRemark (string remark)
 This is called whenever the parser has read a remark.
virtual void OnStartElementTag (string currentElement)
 This will be called before any other event handler when a new element tag starts.

Properties

int Depth [get]
 This is the depth of the internal stack representing the nested structure of tags.
StackEntry this [int depth] [get]
 This will access the internal stack element according to the given depth.

Classes

class  StackEntry
 This class represents an entry in the element stack.


Detailed Description

Yet another small HTML parser.

This class uses HtmlLex to scan for tokens in an HTML text stream. It calls virtual methods such as OnStartElementTag() on certain parsing events. Additionally, specializations can store state information in a stack that grows when the parser enters nested tags and shrinks when it leaves them, either on parsing the corresponding end tag or on parsing the end tag of an element deeper in the stack.

This procedure is optimized for robustness rather than for standards compliance. In fact, nearly everything defined by the W3C is ignored here. The only tested purpose is adapting doxygen output to the wx.NET help viewer.

Definition at line 28 of file HtmlParse.cs.
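
The event model described above can be illustrated with a small specialization. This is only a sketch: OutlinePrinter is a hypothetical class, it assumes that HtmlLex lives in the same Contrib.Html namespace as HtmlParse, and it leaves the construction of the HtmlLex instance to the caller because the HtmlLex constructor is not documented on this page.

```csharp
using System;
using Contrib.Html; // assumption: HtmlLex shares the namespace of HtmlParse

// Hypothetical specialization that prints an indented outline of the
// elements encountered while parsing.
class OutlinePrinter : HtmlParse
{
    public OutlinePrinter(HtmlLex src) : base(src) { }

    protected override void OnStartElementTag(string currentElement)
    {
        // The new element is already on the stack when this is called,
        // so Depth is at least 1 here.
        Console.WriteLine(new string(' ', 2 * (Depth - 1)) + "<" + currentElement + ">");
    }

    protected override void OnClosingElement(string currentElement, string tag)
    {
        // The closing element is still on the stack when this is called.
        Console.WriteLine(new string(' ', 2 * (Depth - 1)) + "</" + currentElement + ">");
    }
}
```

Given an HtmlLex instance named lexer (hypothetical), the outline would be produced by new OutlinePrinter(lexer).Parse().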


Constructor & Destructor Documentation

Contrib.Html.HtmlParse.HtmlParse ( HtmlLex  src  ) 

Definition at line 53 of file HtmlParse.cs.

00054         {
00055             this._src = src;
00056             this._stack = new ArrayList();
00057         }


Member Function Documentation

virtual bool Contrib.Html.HtmlParse.OnAttribute ( string  attributeName,
string  value,
IDictionary  attributes 
) [protected, virtual]

This will be called on defining an attribute of an element.

this[0] will always be the current element. OnEndElementTag() has not yet been called. The value will be stripped of quotes if necessary.

Parameters:
attributeName is the name of the attribute
value is the value (without quotes), or empty if attributeName is an attribute without a value.
attributes maps the names of the parsed attributes to their values. You may add entries to introduce new attributes that have not been parsed but that shall be processed by OnEndElementTag().
Returns:
true tells the parser to add the received attribute to the list of attributes to be passed to OnEndElementTag().

Reimplemented in wx.ZipRC.DoxygenHtbConverter.

Definition at line 132 of file HtmlParse.cs.

00133         {
00134             return true;
00135         }
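
As an illustration of the return value and the attributes parameter, the following sketch drops one attribute and introduces another. AttributeFilter is a hypothetical class, the choice of "style", "name", and "id" is purely illustrative, and the namespace of HtmlLex is an assumption.

```csharp
using System.Collections;
using Contrib.Html; // assumption: HtmlLex shares the namespace of HtmlParse

// Hypothetical specialization that filters and extends attributes.
class AttributeFilter : HtmlParse
{
    public AttributeFilter(HtmlLex src) : base(src) { }

    protected override bool OnAttribute(string attributeName, string value, IDictionary attributes)
    {
        string name = attributeName.ToLowerInvariant();

        // Returning false tells the parser not to keep this attribute.
        if (name == "style")
            return false;

        // Entries added here are new attributes that were not parsed.
        // Illustrative only: mirror every "name" attribute as an "id".
        if (name == "name")
            attributes["id"] = value;

        return true;
    }
}
```

Judging from the Parse() listing on this page, introduced entries are copied into the element's attribute table with Add(), so introducing the same key twice for one element would presumably raise an exception.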

virtual void Contrib.Html.HtmlParse.OnClosingElement ( string  currentElement,
string  tag 
) [protected, virtual]

This will be called whenever the stack decreases.

this[0] is always the currentElement.

Reimplemented in wx.ZipRC.DoxygenHtbConverter.

Definition at line 140 of file HtmlParse.cs.

00141         {
00142         }

virtual void Contrib.Html.HtmlParse.OnDefaultEvent ( string  token  )  [protected, virtual]

This will be called for every token that is not part of a more specific event.

Reimplemented in wx.ZipRC.DoxygenHtbConverter.

Definition at line 146 of file HtmlParse.cs.

00147         {
00148         }

virtual void Contrib.Html.HtmlParse.OnEndElementTag ( string  currentElement,
string  tagString,
IDictionary  attributes 
) [protected, virtual]

Override this to react to the start of an element. It is called once the start tag, including its attributes, has been read completely.

Parameters:
currentElement is the element's name in lower case letters (e.g. "ul"). this[0] will always be this element.
tagString is the full string describing the current element, including attributes introduced by OnAttribute().
attributes maps the names of the parsed attributes to their values.

Reimplemented in wx.ZipRC.DoxygenHtbConverter.

Definition at line 117 of file HtmlParse.cs.

00118         {
00119         }

virtual void Contrib.Html.HtmlParse.OnRemark ( string  remark  )  [protected, virtual]

This is called whenever the parser has read a remark.

Parameters:
remark is the remark text without surrounding tags.

Definition at line 153 of file HtmlParse.cs.

00154         {
00155         }

virtual void Contrib.Html.HtmlParse.OnStartElementTag ( string  currentElement  )  [protected, virtual]

This will be called before any other event handler when a new element tag starts.

Reimplemented in wx.ZipRC.DoxygenHtbConverter.

Definition at line 105 of file HtmlParse.cs.

00106         {
00107         }

void Contrib.Html.HtmlParse.Parse (  ) 

Starts the parser, reads from the source, and raises the parsing events.

Definition at line 159 of file HtmlParse.cs.

References Contrib.Html.HtmlParse.StackEntry.Element.

Referenced by wx.ZipRC.ZipResourceCompiler.Main().

00160         {
00161             IDictionary attributes=null;
00162             string attributeName=null;
00163             bool readingAttributeValue=false;
00164             StringBuilder fullElementString = null;
00165             for (string token = this._src.NextToken();
00166                 token != null;
00167                 token = this._src.NextToken())
00168             {
00169                 if (token.StartsWith("<!--"))
00170                 {
00171                     // processing remarks
00172                     token = token.Substring(4, token.Length - 7);
00173                     token = token.Trim();
00174                     this.OnRemark(token);
00175                     token = this._src.NextToken();
00176                     if (token == null) break;
00177                 }
00178                 if (token.TrimStart().StartsWith("</"))
00179                 {
00180                     token = token.Trim();
00181                     // end tag. decrease stack.
00182                     string element = token.Substring(2).ToLower();
00183                     token+=this._src.NextToken().Trim();
00184                     // ignore end tags without start. doxygen seems to add some 
00185                     // end tags without start. So, first search for a start.
00186                     bool foundTag = false;
00187                     foreach (StackEntry entry in this._stack)
00188                     {
00189                         if (entry.Element == element)
00190                         {
00191                             foundTag = true;
00192                             break;
00193                         }
00194                     }
00195                     if (foundTag)
00196                     {
00197                         // If we found a possible start, remove stack elements until
00198                         // start reached.
00199                         while (this._stack.Count > 0)
00200                         {
00201                             StackEntry current = (StackEntry)this._stack[0];
00202                             if (current.Element == element)
00203                             {
00204                                 this.OnClosingElement(current.Element, token);
00205                                 this._stack.RemoveAt(0);
00206                                 break;
00207                             }
00208                             else
00209                                 this._stack.RemoveAt(0);
00210                         }
00211                     }
00212                 }
00213                 else if (token.TrimStart().StartsWith("<"))
00214                 {
00215                     // start tag. stack grows.
00216                     token = token.Trim();
00217                     string element = token.Substring(1).ToLower();
00218                     if (this.Depth > 0 && this[0].Element == element)
00219                     {
00220                         // the new element is equal to the old element.
00221                         // act as if the old element has been closed explicitly.
00222                         this.OnClosingElement(this[0].Element, string.Format("</{0}>", this[0].Element));
00223                         this._stack.RemoveAt(0);
00224                     }
00225                     this._stack.Insert(0, new StackEntry(element));
00226                     if (this.Depth > 1)
00227                         this[0].State = (ICloneable)this[1].State.Clone(); // State inheritance
00228                     this.OnStartElementTag(element);
00229                     attributes = new Hashtable(); // Start reading attributes
00230                     fullElementString = new StringBuilder();
00231                     fullElementString.AppendFormat("{0} ", token);
00232                 }
00233                 else if (token.Trim() == ">")
00234                 {
00235                     token = token.Trim();
00236                     // closing the element definition. Only interesting if reading attributes.
00237                     if (attributes != null && fullElementString != null)
00238                     {
00239                         this.OnEndElementTag(this[0].Element, fullElementString.ToString()+token, attributes);
00240                     }
00241                     fullElementString = null;
00242                     attributes = null;
00243                     attributeName = null;
00244                 }
00245                 else if (attributes!=null && attributeName != null && readingAttributeValue)
00246                 {
00247                     token = token.Trim();
00248                     // the token is an attribute value.
00249                     string attributeValue = token;
00250                     if (attributeValue.StartsWith("\"") && attributeValue.EndsWith("\"") && attributeValue.Length > 1)
00251                         attributeValue = attributeValue.Substring(1, attributeValue.Length - 2);
00252                     IDictionary newAttributes = new Hashtable();
00253                     bool addThis=this.OnAttribute(attributeName, attributeValue, newAttributes);
00254                     foreach (DictionaryEntry entry in newAttributes)
00255                     {
00256                         attributes.Add(entry.Key, entry.Value);
00257                         if (entry.Value==null || entry.Value.ToString().Length > 0)
00258                             fullElementString.AppendFormat("{0}=\"{1}\" ", entry.Key, entry.Value);
00259                         else
00260                             fullElementString.AppendFormat("{0} ", entry.Key);
00261                     }
00262                     if (addThis)
00263                     {
00264                         fullElementString.AppendFormat("{0}=\"{1}\" ", attributeName, attributeValue);
00265                     }
00266                     readingAttributeValue = false;
00267                     attributeName = null;
00268                 }
00269                 else if (attributes != null && attributeName == null)
00270                 {
00271                     // the token is an attribute name
00272                     token = token.Trim();
00273                     attributeName = token;
00274                 }
00275                 else if (attributes != null)
00276                 {
00277                     // this branch expects an equal sign.
00278                     token = token.Trim();
00279                     if (token == "=")
00280                     {
00281                         readingAttributeValue = true;
00282                     }
00283                     else
00284                     {
00285                         IDictionary newAttributes=new Hashtable();
00286                         bool addThis=this.OnAttribute(attributeName, "", newAttributes);
00287                         foreach (DictionaryEntry entry in newAttributes)
00288                         {
00289                             attributes.Add(entry.Key, entry.Value);
00290                             if (entry.Value == null || entry.Value.ToString().Length > 0)
00291                                 fullElementString.AppendFormat("{0}=\"{1}\" ", entry.Key, entry.Value);
00292                             else
00293                                 fullElementString.AppendFormat("{0} ", entry.Key);
00294                         }
00295                         if (addThis)
00296                         {
00297                             fullElementString.AppendFormat("{0} ", attributeName);
00298                         }
00299                         readingAttributeValue = false;
00300                         attributeName = null;
00301                     }
00302                 }
00303                 else
00304                 {
00305                     this.OnDefaultEvent(token);
00306                 }
00307             }
00308         }
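
The stack handling in the listing above can be isolated into a small stand-alone sketch. It is a simplified model, not the class itself: it keeps plain element names in a List<string> instead of StackEntry objects, raises no events, and TagStackSketch, Open, and Close are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class TagStackSketch
{
    // Index 0 is the innermost open element, as with this[0] in HtmlParse.
    static readonly List<string> stack = new List<string>();

    static void Open(string element)
    {
        // A start tag that repeats the current element closes it implicitly.
        if (stack.Count > 0 && stack[0] == element)
            stack.RemoveAt(0);
        stack.Insert(0, element);
    }

    static void Close(string element)
    {
        // End tags without a matching start tag are ignored.
        int index = stack.IndexOf(element);
        if (index < 0)
            return;
        // Discard everything above the match, then the match itself.
        stack.RemoveRange(0, index + 1);
    }

    static void Main()
    {
        Open("ul");
        Open("li");
        Open("b");
        Close("i");                                  // no <i> is open: ignored
        Console.WriteLine(string.Join(" ", stack));  // prints "b li ul"
        Close("ul");                                 // also discards <b> and <li>
        Console.WriteLine(stack.Count);              // prints "0"
    }
}
```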


Property Documentation

int Contrib.Html.HtmlParse.Depth [get, protected]

This is the depth of the internal stack representing the nested structure of tags.

Definition at line 61 of file HtmlParse.cs.

StackEntry Contrib.Html.HtmlParse.this[int depth] (  )  [get, protected]

This will access the internal stack element according to the given depth.

this[0] returns the entry of the current element. this[1] will return the entry of the element containing the current element, or null if there is no containing element. All elements without explicit end tag are considered to contain everything that their predecessor contains. So,

         <p>
         A paragraph.
         <p>
         Another paragraph.
         
will be parsed like
         <p>
         A paragraph.
         <p>
         Another paragraph.
         </p>
         </p>
         
rather than
         <p>
         A paragraph.
         </p>
         <p>
         Another paragraph.
         </p>
         

Definition at line 94 of file HtmlParse.cs.
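
A specialization can walk the stack through this indexer, for instance to ask whether the parser is currently somewhere inside a given element. The sketch below is hypothetical: PreformattedTextCollector and IsInside are illustrative names, and it assumes that StackEntry.Element is a readable string accessible to derived classes and that HtmlLex lives in the Contrib.Html namespace.

```csharp
using System.Text;
using Contrib.Html; // assumption: HtmlLex shares the namespace of HtmlParse

// Hypothetical specialization that collects all tokens found inside <pre>.
class PreformattedTextCollector : HtmlParse
{
    public readonly StringBuilder Text = new StringBuilder();

    public PreformattedTextCollector(HtmlLex src) : base(src) { }

    // Returns true if any element on the stack has the given lower case name.
    bool IsInside(string element)
    {
        for (int depth = 0; depth < Depth; depth++)
        {
            if (this[depth].Element == element)
                return true;
        }
        return false;
    }

    protected override void OnDefaultEvent(string token)
    {
        // Tokens outside of tags arrive here; keep those nested in <pre>.
        if (IsInside("pre"))
            Text.Append(token);
    }
}
```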


Manual of the wx.NET   (c) 2003-2010 the wx.NET project