Mohammad Houssami
2013-07-20 11:04:15 UTC
1- I am currently working on building an HTML5 parser according to
the WHATWG specification, and while testing the tokenizer against the
html5lib tests I have noticed that some of them seem to have bugs. These
are the major things I have noticed.
I am referring to this set of tests:
http://code.google.com/p/html5lib/source/browse/testdata/tokenizer/test1.test
There are similar cases in test2, test3 and test4, but let's just
stick with test1 for now.
I have noticed that in places where you have DOCTYPE tokens like this one:
{"description":"Correct Doctype lowercase",
"input":"<!DOCTYPE html>",
"output":[["DOCTYPE", "html", null, null, true]]}
The force-quirks flag is set to true, whereas the specification describes
it as being either on or off.
See, for example, the EOF case here:
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#doctype-state
2- In places where several character tokens follow one another, the tests
give 1 character token with all the characters as data, when they should
give 1 character token for every single character. Here is an example.
{"description":"Ampersand ampersand EOF",
"input":"&&",
"output":[["Character", "&&"]]}
My expected output for this is 2 character tokens, each with an
ampersand as data, rather than just 1 token.
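One way to compare a per-character tokenizer against this expected output
would be to merge adjacent character tokens before comparing. This is only
a minimal sketch, assuming the test file expects runs of character tokens
merged into one; the function name coalesce_characters is hypothetical and
not taken from html5lib:

```python
def coalesce_characters(tokens):
    """Merge runs of adjacent ["Character", data] tokens into one token."""
    merged = []
    for token in tokens:
        if token[0] == "Character" and merged and merged[-1][0] == "Character":
            # Extend the previous character token instead of adding a new one.
            merged[-1] = ["Character", merged[-1][1] + token[1]]
        else:
            # Copy the token so the caller's list is left untouched.
            merged.append(list(token))
    return merged


# Two single-character tokens, as a per-character tokenizer would emit for "&&".
per_char = [["Character", "&"], ["Character", "&"]]
print(coalesce_characters(per_char))
```

Tokens of any other type break a run, so character data on either side of
a tag or comment would stay in separate tokens.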
3- Assuming true stands for on and false for off, many force-quirks flags
are inverted: true (on) is given when it has to be false (off). The
earlier case I gave is an example.
The states that should be covered with this input are the following:
Data state: <!DOCTYPE html>
Tag open state: <!DOCTYPE html>
Markup declaration open state: <!DOCTYPE html>
DOCTYPE state: <!DOCTYPE html>
Before doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
Doctype name state: <!DOCTYPE html>
The state says the following: U+003E GREATER-THAN SIGN (>)
Switch to the data state <http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#data-state>.
Emit the current DOCTYPE token.
And then in the data state the EOF is read, so nothing there touches the
force-quirks flag, and the specification says the following: "When a
DOCTYPE token is created, its name, public identifier, and system
identifier must be marked as missing (which is a distinct state from the
empty string), and the *force-quirks flag* must be set to *off* (its other
state is *on*)." So by default it has to be off (false).
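The defaults quoted above can be sketched like this. It is only an
illustration: the class DoctypeToken, the function to_test_output and its
parameter fifth_is_force_quirks are hypothetical names, and whether the
fifth field of the test output mirrors the force-quirks flag or its
inverse is exactly the assumption I am unsure about, so the sketch shows
both readings:

```python
class DoctypeToken:
    """DOCTYPE token with the creation-time defaults from the spec text."""

    def __init__(self):
        # None stands for "missing", which is distinct from the empty string.
        self.name = None
        self.public_id = None
        self.system_id = None
        # The force-quirks flag starts out "off".
        self.force_quirks = False


def to_test_output(token, fifth_is_force_quirks=True):
    # ASSUMPTION: by default the fifth field is the force-quirks flag itself.
    # Pass fifth_is_force_quirks=False to treat it as the inverse instead.
    flag = token.force_quirks if fifth_is_force_quirks else not token.force_quirks
    return ["DOCTYPE", token.name, token.public_id, token.system_id, flag]


# "<!DOCTYPE html>" only sets the name; nothing turns force-quirks on.
token = DoctypeToken()
token.name = "html"
print(to_test_output(token))
print(to_test_output(token, fifth_is_force_quirks=False))
```

Under the first reading the last field comes out as False, under the
second as True, which is the value the test file lists.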
Now there is one thing I am not certain about: whether this output is the
output after parsing has happened. I am testing the tokenizer without
any of the tree construction stages, and this might be the problem.
If I am wrong anywhere, please correct me so that I can see
where I am going wrong.
--
You received this message because you are subscribed to the Google Groups "html5lib-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to html5lib-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To post to this group, send an email to html5lib-discuss-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
Visit this group at http://groups.google.com/group/html5lib-discuss.
For more options, visit https://groups.google.com/groups/opt_out.