Tokenizer

The HTML5 tokenizer.

The tokenizer's role is reading data from the scanner and gathering it into semantic units. From the tokenizer, data is emitted to an event handler, which may (for example) create a DOM tree.

The HTML5 specification has a detailed explanation of tokenizing HTML5. We follow that specification to the maximum extent that we can. If you find a discrepancy that is not documented, please file a bug and/or submit a patch.

This tokenizer is implemented as a recursive descent parser.

Within the API documentation, you may see references to the specific section of the HTML5 spec that the code attempts to reproduce. Example: 8.2.4.1. This refers to section 8.2.4.1 of the HTML5 CR specification.

Tags
see
http://www.w3.org/TR/2012/CR-html5-20121217/

Table of Contents

Constants

CONFORMANT_HTML  = 'html'
CONFORMANT_XML  = 'xml'

Properties

$carryOn  : mixed
$events  : mixed
$mode  : mixed
$scanner  : mixed
$text  : mixed
Buffer for text.
$textMode  : mixed
$tok  : mixed
$untilTag  : mixed

Methods

__construct()  : mixed
Create a new tokenizer.
parse()  : mixed
Begin parsing.
setTextMode()  : mixed
Set the text mode for the character data reader.
attribute()  : bool
Parse attributes from inside of a tag.
attributeValue()  : string|null
Consume an attribute value. See section 8.2.4.37 and after.
bogusComment()  : bool
Consume malformed markup as if it were a comment.
buffer()  : mixed
Add text to the temporary buffer.
cdataSection()  : bool
Handle a CDATA section.
characterData()  : mixed
Parse anything that looks like character data.
comment()  : bool
Read a comment.
consumeData()  : mixed
Consume a character and make a move.
decodeCharacterReference()  : string
Decode a character reference and return the string.
doctype()  : bool
Parse a DOCTYPE.
endTag()  : mixed
Consume an end tag. See section 8.2.4.9.
eof()  : mixed
If the document has been fully read, emit an EOF event.
flushBuffer()  : mixed
Send a TEXT event with the contents of the text buffer.
is_alpha()  : bool
Checks whether a (single-byte) character is an ASCII letter or not.
isCommentEnd()  : bool
Check if the scanner has reached the end of a comment.
isTagEnd()  : mixed
Check if the scanner has reached the end of a tag.
markupDeclaration()  : mixed
Look for markup.
parseError()  : string
Emit a parse error.
processingInstruction()  : bool
Handle a processing instruction.
quotedAttributeValue()  : string
Get an attribute value string.
quotedString()  : mixed
Utility for reading a quoted string.
rawText()  : bool
Read text in RAW mode.
rcdata()  : bool
Read text in RCDATA mode.
readUntilSequence()  : string
Read from the input stream until we reach the desired sequence or hit the end of the input stream.
sequenceMatches()  : bool
Check if upcoming chars match the given sequence.
tagName()  : mixed
Consume a tag name and body. See section 8.2.4.10.
text()  : bool
This buffers the current token as character data.
unquotedAttributeValue()  : mixed

Constants

CONFORMANT_HTML

public mixed CONFORMANT_HTML = 'html'

CONFORMANT_XML

public mixed CONFORMANT_XML = 'xml'

Properties

$mode

protected mixed $mode = self::CONFORMANT_HTML

$text

Buffer for text.

protected mixed $text = ''

Methods

__construct()

Create a new tokenizer.

public __construct(Scanner $scanner, EventHandler $eventHandler[, string $mode = self::CONFORMANT_HTML ]) : mixed

Typically, parsing a document involves creating a new tokenizer, giving it a scanner (input) and an event handler (output), and then calling the Tokenizer::parse() method.

Parameters
$scanner : Scanner

A scanner initialized with an input stream.

$eventHandler : EventHandler

An event handler, initialized and ready to receive events.

$mode : string = self::CONFORMANT_HTML
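
The wiring described above might be sketched as follows. To stay self-contained, the sketch uses hypothetical stand-ins (SketchScanner, SketchRecorder, SketchTokenizer) instead of the library's own Scanner, EventHandler, and Tokenizer classes, and it only understands start tags, end tags, and text. It illustrates the flow from scanner to tokenizer to event handler, not the real implementation.

```php
<?php
// Hypothetical stand-ins for illustration only; NOT the library's classes.

final class SketchScanner
{
    /** @var string */
    private $data;
    /** @var int */
    private $pos = 0;

    public function __construct(string $data)
    {
        $this->data = $data;
    }

    // Return the current character, or null at end of input.
    public function current(): ?string
    {
        return $this->pos < strlen($this->data) ? $this->data[$this->pos] : null;
    }

    // Advance one character, never moving past the end of input.
    public function next(): void
    {
        if ($this->pos < strlen($this->data)) {
            $this->pos++;
        }
    }

    // Consume and return everything up to (not including) $char or end of input.
    public function readUntil(string $char): string
    {
        $len = strcspn($this->data, $char, $this->pos);
        $out = (string) substr($this->data, $this->pos, $len);
        $this->pos += $len;
        return $out;
    }
}

final class SketchRecorder
{
    /** @var array */
    public $events = [];

    public function startTag(string $name): void { $this->events[] = ['startTag', $name]; }
    public function endTag(string $name): void   { $this->events[] = ['endTag', $name]; }
    public function text(string $data): void     { $this->events[] = ['text', $data]; }
    public function eof(): void                  { $this->events[] = ['eof']; }
}

final class SketchTokenizer
{
    private $scanner;
    private $events;

    public function __construct(SketchScanner $scanner, SketchRecorder $events)
    {
        $this->scanner = $scanner;
        $this->events = $events;
    }

    public function parse(): void
    {
        while (($c = $this->scanner->current()) !== null) {
            if ($c === '<') {
                $this->scanner->next();                  // skip '<'
                $name = $this->scanner->readUntil('>');
                $this->scanner->next();                  // skip '>'
                if ($name !== '' && $name[0] === '/') {
                    $this->events->endTag(substr($name, 1));
                } else {
                    $this->events->startTag($name);
                }
            } else {
                $this->events->text($this->scanner->readUntil('<'));
            }
        }
        $this->events->eof();
    }
}

$handler = new SketchRecorder();
$tokenizer = new SketchTokenizer(new SketchScanner('<p>Hi</p>'), $handler);
$tokenizer->parse();
// $handler->events now holds startTag "p", text "Hi", endTag "p", then eof.
```

Keeping input (scanner) and output (handler) as constructor arguments means the same tokenizer can feed a DOM builder, a recorder like the one above, or any other consumer.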

parse()

Begin parsing.

public parse() : mixed

This will begin scanning the document, tokenizing as it goes. Tokens are emitted into the event handler.

Tokenizing will continue until the document is completely read. Errors are emitted into the event handler, but the parser will attempt to continue parsing until the entire input stream is read.
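
The "report the error, then keep going" behaviour can be illustrated with a small standalone sketch. The function name, event names, and error messages here are hypothetical, chosen for the example; they are not the library's output.

```php
<?php
// Hypothetical sketch: problems are reported as events and scanning
// continues until the whole input has been read.
function sketchParseWithRecovery(string $html): array
{
    $events = [];
    $pos = 0;
    $len = strlen($html);

    while ($pos < $len) {
        if ($html[$pos] !== '<') {
            // Plain text runs up to the next '<' or the end of input.
            $run = strcspn($html, '<', $pos);
            $events[] = ['text', substr($html, $pos, $run)];
            $pos += $run;
            continue;
        }

        $close = strpos($html, '>', $pos);
        if ($close === false) {
            // Unterminated tag: report it, then treat the rest as text.
            $events[] = ['parseError', 'Unexpected end of input inside a tag'];
            $events[] = ['text', substr($html, $pos)];
            break;
        }

        $name = (string) substr($html, $pos + 1, $close - $pos - 1);
        if ($name === '') {
            // An empty tag name: report it and carry on after the '>'.
            $events[] = ['parseError', 'Empty tag name'];
        } else {
            $events[] = ['tag', $name];
        }
        $pos = $close + 1;
    }

    $events[] = ['eof'];
    return $events;
}

$events = sketchParseWithRecovery('a<>b<i>c');
// text "a", parseError, text "b", tag "i", text "c", eof
```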

setTextMode()

Set the text mode for the character data reader.

public setTextMode(int $textmode[, string $untilTag = null ]) : mixed

HTML5 defines three different modes for reading text:

  • Normal: Read until a tag is encountered.
  • RCDATA: Read until a tag is encountered, but skip a few otherwise-special characters.
  • Raw: Read until a special closing tag is encountered (viz. pre, script).

This allows those modes to be set.

Normally, setting is done by the event handler via a special return code on startTag(), but it can also be set manually using this function.

Parameters
$textmode : int

One of Elements::TEXT_*.

$untilTag : string = null

The tag that should stop RAW or RCDATA mode. Normal mode does not use this indicator.
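
As a rough illustration of how the three modes differ, here is a minimal sketch. It assumes the HTML5 definitions (RCDATA still decodes character references, raw text does not), and it uses its own hypothetical SKETCH_TEXT_* constants because the values of the library's Elements::TEXT_* constants are not given here.

```php
<?php
// Hypothetical mode constants for this sketch only.
const SKETCH_TEXT_NORMAL = 0;
const SKETCH_TEXT_RAW = 1;
const SKETCH_TEXT_RCDATA = 2;

// Read character data starting at $pos and return it; $pos is advanced.
function sketchReadText(string $html, int &$pos, int $mode, ?string $untilTag = null): string
{
    if ($mode === SKETCH_TEXT_NORMAL) {
        // Normal: any '<' ends the text run.
        $len = strcspn($html, '<', $pos);
    } else {
        // RAW and RCDATA: only the matching close tag ends the run.
        $end = stripos($html, '</' . $untilTag, $pos);
        $len = ($end === false ? strlen($html) : $end) - $pos;
    }

    $text = (string) substr($html, $pos, $len);
    $pos += $len;

    // Assumption: every mode except RAW decodes character references.
    if ($mode !== SKETCH_TEXT_RAW) {
        $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    return $text;
}

$pos = 0;
$raw = sketchReadText('if (a < b) { x &amp; y }</script>', $pos, SKETCH_TEXT_RAW, 'script');
// $raw is 'if (a < b) { x &amp; y }'

$pos = 0;
$rcdata = sketchReadText('Tom &amp; <b>Jerry</b></title>', $pos, SKETCH_TEXT_RCDATA, 'title');
// $rcdata is 'Tom & <b>Jerry</b>'

$pos = 0;
$normal = sketchReadText('Tom &amp; <b>Jerry</b>', $pos, SKETCH_TEXT_NORMAL);
// $normal is 'Tom & '
```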

attribute()

Parse attributes from inside of a tag.

protected attribute(array<string|int, string> &$attributes) : bool
Parameters
$attributes : array<string|int, string>
Tags
throws
ParseError
Return values
bool

attributeValue()

Consume an attribute value. See section 8.2.4.37 and after.

protected attributeValue() : string|null
Return values
string|null

bogusComment()

Consume malformed markup as if it were a comment.

protected bogusComment([string $leading = '' ]) : bool

8.2.4.44.

The spec requires that the ENTIRE tag-like thing be enclosed inside of the comment. So this will generate comments like:

<!--</+foo>-->

Parameters
$leading : string = ''

Prepend any leading characters. This essentially negates the need to backtrack, but it's sort of a hack.

Return values
bool
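
The role of $leading can be shown with a small sketch. The helper below is hypothetical and works on a plain string with an explicit cursor, whereas the real method reads from the scanner.

```php
<?php
// Hypothetical sketch: swallow malformed markup up to the next '>' and
// return it as comment text. $leading holds characters the caller has
// already consumed, so no backtracking is needed.
function sketchBogusComment(string $html, int &$pos, string $leading = ''): string
{
    $len = strcspn($html, '>', $pos);
    $comment = $leading . substr($html, $pos, $len);
    $pos += $len;

    // Include the closing '>' in the comment, if there is one.
    if ($pos < strlen($html)) {
        $comment .= '>';
        $pos++;
    }
    return $comment;
}

// The caller consumed "</" before noticing that the tag is malformed.
$html = '</+foo>bar';
$pos = 2;
$comment = sketchBogusComment($html, $pos, '</');
// $comment is '</+foo>' and $pos now points at 'bar'.
```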

buffer()

Add text to the temporary buffer.

protected buffer(string $str) : mixed
Parameters
$str : string
Tags
see
flushBuffer()

cdataSection()

Handle a CDATA section.

protected cdataSection() : bool
Return values
bool

characterData()

Parse anything that looks like character data.

protected characterData() : mixed

Different rules apply based on the current text mode.

Tags
see
Elements::TEXT_RAW

Elements::TEXT_RCDATA.

comment()

Read a comment.

protected comment() : bool

Expects the first token to be inside of the comment.

Return values
bool

consumeData()

Consume a character and make a move.

protected consumeData() : mixed

HTML5 8.2.4.1.

decodeCharacterReference()

Decode a character reference and return the string.

protected decodeCharacterReference([bool $inAttribute = false ]) : string

If $inAttribute is set to true, a bare & will be returned as-is.

Parameters
$inAttribute : bool = false

Set to true if the text is inside of an attribute value. false otherwise.

Return values
string
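
A minimal sketch of the idea, using PHP's built-in html_entity_decode() instead of the library's own lookup table. The function name, the explicit cursor, and the $errors list are hypothetical. The sketch also assumes that "returned as-is" means an unmatched & inside an attribute value is passed through without being reported as a parse error.

```php
<?php
// Hypothetical sketch, not the library's implementation. $pos must point just
// past an '&'. On success the reference is consumed and its decoded form is
// returned; otherwise a literal '&' is returned and $pos is left unchanged.
function sketchDecodeReference(string $text, int &$pos, bool $inAttribute, array &$errors): string
{
    $pattern = '/\G(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);/';
    if (preg_match($pattern, $text, $m, 0, $pos) === 1) {
        $entity = '&' . $m[0];
        $decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');
        if ($decoded !== $entity) {
            $pos += strlen($m[0]);
            return $decoded;
        }
    }

    // Bare or unknown '&'. Assumption: inside an attribute value it is passed
    // through silently, while elsewhere it is also reported as a parse error.
    if (!$inAttribute) {
        $errors[] = 'Not a character reference';
    }
    return '&';
}

$errors = [];

$text = 'lt;b';   // the caller has already consumed the '&'
$pos = 0;
$a = sketchDecodeReference($text, $pos, false, $errors);
// $a is '<' and $pos is 3

$text = ' b';     // from the attribute value "a & b"
$pos = 0;
$b = sketchDecodeReference($text, $pos, true, $errors);
// $b is '&', $pos is still 0, and $errors is still empty
```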

doctype()

Parse a DOCTYPE.

protected doctype() : bool

Parse a DOCTYPE declaration. This method has a strong bearing on whether or not quirks mode is enabled on the event handler.

Tags
todo

This method is a little long. Should probably refactor.

Return values
bool

endTag()

Consume an end tag. See section 8.2.4.9.

protected endTag() : mixed

eof()

If the document has been fully read, emit an EOF event.

protected eof() : mixed

flushBuffer()

Send a TEXT event with the contents of the text buffer.

protected flushBuffer() : mixed

This emits an EventHandler::text() event with the current contents of the temporary text buffer. (The buffer is used to group as much PCDATA as we can instead of emitting lots and lots of TEXT events.)
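
The buffering idea can be sketched on its own. SketchTextBuffer is a hypothetical class for illustration, with a callback standing in for EventHandler::text().

```php
<?php
// Hypothetical sketch of the buffering idea: collect text fragments and
// hand them to the handler as a single TEXT event.
final class SketchTextBuffer
{
    /** @var string */
    private $text = '';
    /** @var callable */
    private $onText;

    public function __construct(callable $onText)
    {
        $this->onText = $onText;
    }

    public function buffer(string $str): void
    {
        $this->text .= $str;
    }

    public function flush(): void
    {
        if ($this->text === '') {
            return; // nothing buffered, so no event
        }
        ($this->onText)($this->text);
        $this->text = '';
    }
}

$emitted = [];
$buffer = new SketchTextBuffer(function (string $text) use (&$emitted) {
    $emitted[] = $text;
});
$buffer->buffer('Hello');
$buffer->buffer(', ');
$buffer->buffer('world');
$buffer->flush();
$buffer->flush(); // the second flush emits nothing
// $emitted is ['Hello, world']
```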

is_alpha()

Checks whether a (single-byte) character is an ASCII letter or not.

protected is_alpha(string $input) : bool
Parameters
$input : string

A single-byte string

Return values
bool

True if it is a letter, False otherwise
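
One way such a check could be written is a plain byte-range comparison, which does not depend on the current locale. The function name is hypothetical and the body is a sketch, not necessarily how the library implements it.

```php
<?php
// Hypothetical sketch of a single-byte ASCII letter check.
function sketchIsAsciiLetter(string $input): bool
{
    if (strlen($input) !== 1) {
        return false; // the check is defined for exactly one byte
    }
    $code = ord($input);
    return ($code >= 0x41 && $code <= 0x5A)   // 'A'..'Z'
        || ($code >= 0x61 && $code <= 0x7A);  // 'a'..'z'
}

$letters = array_filter(['a', 'Z', '1', '<', ''], 'sketchIsAsciiLetter');
// $letters keeps only 'a' and 'Z'
```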

isCommentEnd()

Check if the scanner has reached the end of a comment.

protected isCommentEnd() : bool
Return values
bool

isTagEnd()

Check if the scanner has reached the end of a tag.

protected isTagEnd(mixed &$selfClose) : mixed
Parameters
$selfClose : mixed

markupDeclaration()

Look for markup.

protected markupDeclaration() : mixed

parseError()

Emit a parse error.

protected parseError(string $msg) : string

This method always returns false, because emitting a parse error never consumes any characters.

Parameters
$msg : string
Return values
string

processingInstruction()

Handle a processing instruction.

protected processingInstruction() : bool

XML processing instructions are supposed to be ignored in HTML5 and treated as "bogus comments". However, since we're not a user agent, we allow them. We consume until ?> and then issue an EventHandler::processingInstruction() event.

Return values
bool
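
A minimal sketch of the "consume until the closing marker" step. The function name and the returned [name, data] pair are hypothetical; the real method emits an event and returns bool.

```php
<?php
// Hypothetical sketch. $pos must point just past the opening marker.
// Returns [name, data], or null when the closing marker is missing.
function sketchProcessingInstruction(string $html, int &$pos): ?array
{
    $end = strpos($html, '?>', $pos);
    if ($end === false) {
        return null; // unterminated: consume nothing
    }

    $body = (string) substr($html, $pos, $end - $pos);
    $pos = $end + 2; // step over the two-character closing marker

    // The first whitespace-delimited word is the target name; the rest is data.
    $parts = preg_split('/\s+/', trim($body), 2);
    return [$parts[0], $parts[1] ?? ''];
}

$html = '<?xml-stylesheet href="a.css"?><p>';
$pos = 2;
$pi = sketchProcessingInstruction($html, $pos);
// $pi is ['xml-stylesheet', 'href="a.css"'] and $pos now points at '<p>'.
```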

quotedAttributeValue()

Get an attribute value string.

protected quotedAttributeValue(string $quote) : string
Parameters
$quote : string

IMPORTANT: This is a series of chars! Any one of which will be considered termination of an attribute's value. E.g. "\"'" will stop at either ' or ".

Return values
string

The attribute value.
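
The "series of chars" behaviour maps naturally onto PHP's strcspn(), which stops at the first occurrence of any byte in a set. The helper below is a hypothetical sketch over a plain string and cursor, not the library's code.

```php
<?php
// Hypothetical sketch: $stop is a set of characters, not a single delimiter.
function sketchReadUntilAny(string $html, int &$pos, string $stop): string
{
    // strcspn() counts the bytes before the first occurrence of ANY byte in $stop.
    $len = strcspn($html, $stop, $pos);
    $value = (string) substr($html, $pos, $len);
    $pos += $len;
    return $value;
}

$html = "big red' class=x";
$pos = 0;
$value = sketchReadUntilAny($html, $pos, "\"'");
// $value is 'big red'; $pos points at the closing quote.
```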

quotedString()

Utility for reading a quoted string.

protected quotedString(string $stopchars) : mixed
Parameters
$stopchars : string

Characters (in addition to a close-quote) that should stop the string. E.g. sometimes '>' is higher precedence than '"' or "'".

Return values
mixed

String if one is found (quotations omitted).

rawText()

Read text in RAW mode.

protected rawText(string $tok) : bool
Parameters
$tok : string

The current token.

Return values
bool

rcdata()

Read text in RCDATA mode.

protected rcdata(string $tok) : bool
Parameters
$tok : string

The current token.

Return values
bool

readUntilSequence()

Read from the input stream until we reach the desired sequence or hit the end of the input stream.

protected readUntilSequence(string $sequence) : string
Parameters
$sequence : string
Return values
string
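
A minimal sketch of the behaviour, using a plain string and cursor in place of the scanner. The function name is hypothetical, and the sketch assumes the sequence itself is left unread for the caller.

```php
<?php
// Hypothetical sketch: return everything before $sequence (or up to the end
// of input) and leave $pos at the start of the sequence.
function sketchReadUntilSequence(string $input, int &$pos, string $sequence): string
{
    $end = strpos($input, $sequence, $pos);
    if ($end === false) {
        $end = strlen($input); // sequence never appears: read to the end
    }
    $data = (string) substr($input, $pos, $end - $pos);
    $pos = $end;
    return $data;
}

$input = 'alert(1)</script>rest';
$pos = 0;
$body = sketchReadUntilSequence($input, $pos, '</script>');
// $body is 'alert(1)' and $pos is 8, the offset of '</script>'.
```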

sequenceMatches()

Check if upcoming chars match the given sequence.

protected sequenceMatches(string $sequence[, bool $caseSensitive = true ]) : bool

This reads ahead in the stream looking for $sequence and returns true if it is found, false otherwise. Because any chars it reads are unconsumed afterwards, the caller still needs to consume the sequence itself, even if this returns true.

Example: $this->scanner->sequenceMatches('</script>') will see if the input stream is at the start of a '</script>' string.

Parameters
$sequence : string
$caseSensitive : bool = true
Return values
bool
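
The non-consuming nature of the check can be sketched with a cursor passed by value. The function name is hypothetical and the sketch operates on a plain string, not on the scanner.

```php
<?php
// Hypothetical sketch: peek at the upcoming bytes without moving the cursor.
function sketchSequenceMatches(string $input, int $pos, string $sequence, bool $caseSensitive = true): bool
{
    // $pos is passed by value, so the caller's cursor is never advanced.
    $ahead = (string) substr($input, $pos, strlen($sequence));
    if (strlen($ahead) !== strlen($sequence)) {
        return false; // not enough input left
    }
    return $caseSensitive
        ? $ahead === $sequence
        : strcasecmp($ahead, $sequence) === 0;
}

$input = 'x</SCRIPT>y';
$pos = 1;
$loose = sketchSequenceMatches($input, $pos, '</script>', false);  // true
$strict = sketchSequenceMatches($input, $pos, '</script>', true);  // false
if ($loose) {
    // The match consumed nothing, so the caller advances explicitly.
    $pos += strlen('</script>');
}
// $pos now points at 'y'.
```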

tagName()

Consume a tag name and body. See section 8.2.4.10.

protected tagName() : mixed

text()

This buffers the current token as character data.

protected text(string $tok) : bool
Parameters
$tok : string

The current token.

Return values
bool

unquotedAttributeValue()

protected unquotedAttributeValue() : mixed

        