Discussion:
Issue 211 in html5lib: Awful memory leak / infinite loop
h***@public.gmane.org
2012-08-24 08:27:02 UTC
Status: New
Owner: ----

New issue 211 by jos...-DeyAJaLaifDnwRyio+***@public.gmane.org: Awful memory leak / infinite loop
http://code.google.com/p/html5lib/issues/detail?id=211

So I know this is not well-formed HTML, but it occurred in the wild as the
output from Markdown.

I am using the latest html5lib Python package from PyPI (__version__ =
0.95-dev).

If I try to parse the following HTML, my program goes into an infinite loop
and memory usage grows without bound:

u"<p>So theres no shortage of info out there on rounded corners and I've
been through much of it and I'm posting to get the communities opinons at
this piont.</p>\n<p>My scenario is that we're developing a rounded corner
dependant design, mainly used for interactions (<button> and <a>). We are
going to use border radius for the good browsers on the block that play
nice with it and then use the server to send down javscript to browsers
that don't</p>\n<p>What I'm wondering is what to use to up scale the
browsers that ignore border radius CSS? I need something that works on
button aswell as a, div etc. I've been looking at the following and have
found that some don't play nice with <button>. Also the site already uses
jQuery.</p>\n<p>http://www.curvycorners.net/ -
http://code.google.com/p/jquerycurvycorners/</p>\n<p>http://www.html.it/articoli/niftycube/index.html</p>\n<p>http://www.malsup.com/jquery/corner/</p>"
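
For anyone who wants to reproduce a hang like this without locking up their own session, a minimal sketch follows. It runs the parse in a child Python process and kills it after a timeout. The helper name `run_parse`, the `CHILD_CODE` string and the 10-second default are assumptions made for illustration; `html5lib.parse()` is the library's top-level entry point, and html5lib must be installed for the default child to succeed.

```python
import subprocess
import sys

# Child program: read markup from stdin and hand it to html5lib.
# Assumes html5lib is importable in the child interpreter.
CHILD_CODE = "import sys, html5lib; html5lib.parse(sys.stdin.read())"


def run_parse(markup, timeout=10.0, child_code=CHILD_CODE):
    """Run child_code in a separate interpreter, feeding it markup on stdin.

    Returns "ok" if the child exits with status 0, "error" if it exits
    with any other status, and "timeout" if it is still running after
    `timeout` seconds (subprocess.run kills the child in that case).
    """
    try:
        done = subprocess.run(
            [sys.executable, "-c", child_code],
            input=markup.encode("utf-8"),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if done.returncode == 0 else "error"
```

Calling `run_parse(markup)` with the string above should report "timeout" on an affected version. Because the child is killed at the deadline, its memory growth is bounded by how much it can allocate in that time.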
--
You received this message because you are subscribed to the Google Groups "html5lib-discuss" group.
To post to this group, send an email to html5lib-discuss-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To unsubscribe from this group, send email to html5lib-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit this group at http://groups.google.com/group/html5lib-discuss?hl=en-GB.
h***@public.gmane.org
2012-08-24 16:24:18 UTC
Comment #1 on issue 211 by way...-***@public.gmane.org: Awful memory leak / infinite loop
http://code.google.com/p/html5lib/issues/detail?id=211

I can't comment on the infinite loop, but as the maintainer of the Markdown
library, I was concerned regarding the original reporter's implication that
Markdown may be producing invalid HTML. While only the output is provided,
not the input, it appears to me that the invalid output is a result of
invalid input. You should be wrapping those random angle-bracket tags in
code tags. So "(`<button>` and `<a>`)" (note the backticks surrounding each
tag) would be output by Markdown as "(<code>&lt;button&gt;</code> and <code>&lt;a&gt;</code>)", which is valid HTML and will not result in an
infinite loop in html5lib.
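
To illustrate the escaping described above, here is a small sketch using only the Python standard library. The helper `as_code_span` is hypothetical and is not part of the Markdown library's API; it only mimics what a backtick code span produces for these two tags.

```python
import html


def as_code_span(raw):
    """Mimic a Markdown code span: escape the text and wrap it in <code>."""
    return "<code>" + html.escape(raw, quote=False) + "</code>"


print("(%s and %s)" % (as_code_span("<button>"), as_code_span("<a>")))
# prints: (<code>&lt;button&gt;</code> and <code>&lt;a&gt;</code>)
```

The point is that the angle brackets reach the HTML parser as character references rather than as unclosed start tags.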

If the Markdown input is coming from an untrusted third party, then you
absolutely should be sanitizing it before passing it on to anything else.

That said, one way to sanitize (and my recommendation) is to use the
Bleach library [1], which uses html5lib internally. So I guess we're back
to that infinite loop.

[1]: http://bleach.readthedocs.org/en/latest/index.html
h***@public.gmane.org
2012-08-24 16:33:55 UTC
Comment #2 on issue 211 by jos...-DeyAJaLaifDnwRyio+***@public.gmane.org: Awful memory leak / infinite loop
http://code.google.com/p/html5lib/issues/detail?id=211

The Markdown comes from the wild and is probably invalid.

My idea was to pass the HTML through tidy before running an HTML parser,
thus avoiding an infinite loop. There are several tidy wrappers in Python.
I used pytidylib.

I didn't play with the options to make tidy more strict, and even after
tidy, html5lib still goes into an infinite loop. So my current workaround
is to use tidy followed by lxml :\
h***@public.gmane.org
2013-04-09 20:53:50 UTC
Updates:
Status: GitHub

Comment #3 on issue 211 by geoffers: Awful memory leak / infinite loop
http://code.google.com/p/html5lib/issues/detail?id=211

https://github.com/html5lib/html5lib-python/issues/4
--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings