Discussion:
Tokenizer Tests Errors
Mohammad Houssami
2013-07-20 11:04:15 UTC
1- I am currently building an HTML5 parser according to the WHATWG
specification. While testing the tokenizer against the html5lib tests, I
noticed that some of them appear to have bugs. These are the main things I
have noticed.
I am referring to this set of tests:
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/test1.test

There are similar cases in test2, test3 and test4, but let's just
stick with test1 for now.

Consider the places where you have DOCTYPE tokens, like this one:

{"description":"Correct Doctype lowercase",

"input":"<!DOCTYPE html>",

"output":[["DOCTYPE", "html", null, null, true]]}

The force-quirks flag is given as true, whereas the specification
describes it as either on or off.

See, for example, the EOF case here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#doctype-state



2- Where several character tokens follow one another, the tests give a
single character token with all the characters as data, when there should
be one character token for every character. Here is an example.

{"description":"Ampersand ampersand EOF",

"input":"&&",

"output":[["Character", "&&"]]}

My expected output for this is two character tokens, each with a single
ampersand as data, rather than just one token.



3- Assuming true stands for on and false for off, many force-quirks flags
are inverted: true (on) is given where it should be false (off). The
earlier case I gave is an example.

The states that should be covered with this input are the following:
Data state: <!DOCTYPE html>

Tag open state: <!DOCTYPE html>

Markup declaration open state: <!DOCTYPE html>
DOCTYPE state: <!DOCTYPE html>

Before doctype name state: <!DOCTYPE html>

Doctype name state: <!DOCTYPE html>

Doctype name state: <!DOCTYPE html>

Doctype name state: <!DOCTYPE html>

Doctype name state: <!DOCTYPE html>
For U+003E GREATER-THAN SIGN (>), the state says the following:

Switch to the data state
<http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state>.
Emit the current DOCTYPE token.

Then, in the data state, EOF is read, so nothing touches the
force-quirks flag. The specification says the following: "When a
DOCTYPE token is created, its name, public identifier, and system
identifier must be marked as missing (which is a distinct state from the
empty string), and the *force-quirks flag* must be set to *off* (its other
state is *on*)." So by default it has to be off (false).

There is one thing I am not certain about: whether this output is the
output after parsing happens. I am testing the tokenizer without any of
the tree construction stages, and this might be the problem.

If I am wrong anywhere, please correct me so that I can see where I am
going wrong.
--
You received this message because you are subscribed to the Google Groups "html5lib-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to html5lib-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To post to this group, send an email to html5lib-discuss-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
Visit this group at http://groups.google.com/group/html5lib-discuss.
For more options, visit https://groups.google.com/groups/opt_out.
Geoffrey Sneddon
2013-09-08 12:08:58 UTC
To preface this — you'll probably have better luck filing bugs (over at
GitHub, as the Google Code page says) for things you think are issues
than emailing here.

Also see <http://wiki.whatwg.org/wiki/Parser_tests>.
Post by Mohammad Houssami
1- I am currently building an HTML5 parser according to the WHATWG
specification. While testing the tokenizer against the html5lib tests, I
noticed that some of them appear to have bugs. These are the main things I
have noticed.
I am referring to this set of tests:
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/test1.test
There are similar cases in test2, test3 and test4, but let's just
stick with test1 for now.
{"description":"Correct Doctype lowercase",
"input":"<!DOCTYPE html>",
"output":[["DOCTYPE", "html", null, null, true]]}
The force-quirks flag is given as true, whereas the specification
describes it as either on or off.
See, for example, the EOF case here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#doctype-state
"on" is mapped to false, "off" to true. Sure, we could use strings, but
I'm not sure that changing this is worthwhile given the number of
implementations already using the tests (as well as being a fair amount
of work to change!).
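To make the mapping concrete, here is a minimal Python sketch of how a test harness might turn a tokenizer's DOCTYPE token into the test format. The function name `expected_doctype` and its parameters are made up for illustration and are not part of html5lib; the only thing taken from the thread is that "off" maps to true and "on" to false.

```python
def expected_doctype(name, public_id, system_id, force_quirks):
    """Build a test-format DOCTYPE token (hypothetical helper).

    force_quirks is the spec's flag: True means "on", False means "off".
    The last field in the test output is the inverse, so a flag that is
    "off" is written as true.
    """
    return ["DOCTYPE", name, public_id, system_id, not force_quirks]


# "<!DOCTYPE html>" leaves the flag at its default, "off".
print(expected_doctype("html", None, None, False))
# prints ['DOCTYPE', 'html', None, None, True]
```

With this inversion applied, the "Correct Doctype lowercase" test quoted above matches a tokenizer that leaves the flag off.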
Post by Mohammad Houssami
2- Where several character tokens follow one another, the tests give a
single character token with all the characters as data, when there should
be one character token for every character. Here is an example.
{"description":"Ampersand ampersand EOF",
"input":"&&",
"output":[["Character", "&&"]]}
My expected output for this is two character tokens, each with a single
ampersand as data, rather than just one token.
The tests compress adjacent character tokens down to one token, as most
implementations do, as it makes using the tests simpler. This can
obviously be reversed for something doing a 1:1 implementation of the spec.
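For a tokenizer that emits one character token per character, the simplest route is probably to merge adjacent character tokens before comparing against the expected output. A rough Python sketch, assuming tokens are lists in the same shape as the test output (the name `coalesce_characters` is made up for illustration):

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token."""
    merged = []
    for token in tokens:
        if token[0] == "Character" and merged and merged[-1][0] == "Character":
            # Append this token's data to the previous character token.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            merged.append(list(token))
    return merged


# Two single-ampersand tokens, as a 1:1 implementation of the spec emits them.
print(coalesce_characters([["Character", "&"], ["Character", "&"]]))
# prints [['Character', '&&']]
```

Non-character tokens pass through unchanged and break up runs, so characters on either side of a tag stay in separate tokens.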
Post by Mohammad Houssami
3- Assuming true stands for on and false for off, many force-quirks flags
are inverted: true (on) is given where it should be false (off). The
earlier case I gave is an example.
Data state: <!DOCTYPE html>
Tag open state: <!DOCTYPE html>
Markup declaration open state: <!DOCTYPE html>
DOCTYPE state: <!DOCTYPE html>
Before doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
For U+003E GREATER-THAN SIGN (>), the state says the following:
Switch to the data state
<http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state>.
Emit the current DOCTYPE token.
Then, in the data state, EOF is read, so nothing touches the
force-quirks flag. The specification says the following: "When a
DOCTYPE token is created, its name, public identifier, and system
identifier must be marked as missing (which is a distinct state from the
empty string), and the *force-quirks flag* must be set to *off* (its other
state is *on*)." So by default it has to be off (false).
See above, true/false map the non-obvious way.
Post by Mohammad Houssami
There is one thing I am not certain about: whether this output is the
output after parsing happens. I am testing the tokenizer without any of
the tree construction stages, and this might be the problem.
No, the tokenizer tests cover only the tokens passed to the parser. You
need no parser to run them (though make sure you start in the right
initial state, which I believe is the "initialState" property,
defaulting to the data state, fairly obviously).
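
As a rough illustration of running one test without a parser, here is a Python sketch. It assumes the property is spelled "initialState" as guessed above (check the test files for the exact key and state names), and `tokenize` is a hypothetical callable from your own implementation, not an html5lib API.

```python
def run_test(test, tokenize):
    """Run one tokenizer test case and report whether it passed.

    `tokenize` is a hypothetical callable supplied by your implementation:
    tokenize(input_text, initial_state) -> list of tokens in test format.
    """
    # Assumed key name and state label; fall back to the data state.
    state = test.get("initialState", "data state")
    return tokenize(test["input"], state) == test["output"]


# Stand-in tokenizer used only to show the call shape.
def fake_tokenize(text, state):
    return [["Character", text]] if state == "data state" else []


case = {"description": "Ampersand ampersand EOF",
        "input": "&&",
        "output": [["Character", "&&"]]}
print(run_test(case, fake_tokenize))
# prints True
```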

HTH,

Geoffrey.