Discussion:
Issue 189 in html5lib: HTMLParser is not threadsafe
h***@public.gmane.org
2011-07-25 09:26:59 UTC
Status: New
Owner: ----

New issue 189 by akvadr...-***@public.gmane.org: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189

Hi. I realize this is by design, but it's not intuitive, since similar
standard classes like YamlDecoder and JSONDecoder are.

It would be clearer if the input stream were supplied to the constructor,
as with ElementTree.

But at least, please document this in the class.
--
You received this message because you are subscribed to the Google Groups "html5lib-discuss" group.
To post to this group, send an email to html5lib-discuss-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
To unsubscribe from this group, send email to html5lib-discuss+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/***@public.gmane.org
For more options, visit this group at http://groups.google.com/group/html5lib-discuss?hl=en-GB.
h***@public.gmane.org
2011-09-16 12:24:47 UTC
Comment #1 on issue 189 by geoffers: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189

Is there any reason to document it? This is the case with all Python code
in CPython (other implementations may differ), so the cases where things
are threadsafe are the notable exceptions.
--
h***@public.gmane.org
2011-09-16 13:06:17 UTC
Comment #2 on issue 189 by akvadr...-***@public.gmane.org: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189

(Most?) Everything in the Python standard library is threadsafe and most
extensions are. I think you are referring to the GIL, which is different.
That prevents parallel execution, but if one thread is blocking, the others
can run safely.

The problem with the design of HTMLParser is that two threads can interfere
with each other, even if they are not running at the same time.
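
To make the point above concrete, here is a minimal sketch of how state kept on a shared parser object can leak between callers without any parallel execution. `ToyParser`, `begin`, `finish` and `parser_for_this_thread` are hypothetical names invented for illustration; this is not html5lib's actual code, only a toy showing the general failure mode and one possible workaround (one parser instance per thread).

```python
import threading


class ToyParser:
    """Hypothetical stand-in for a parser that keeps per-parse state on self."""

    def begin(self, text):
        # Per-parse state lives on the instance, so every caller holding
        # a reference to this object shares it.
        self.tokens = text.split()

    def finish(self):
        return list(self.tokens)


shared = ToyParser()

# An interleaving that a thread switch could produce, replayed sequentially:
shared.begin("a b c")                  # caller 1 starts a parse
shared.begin("x y")                    # caller 2 starts before caller 1 finishes
result_for_caller_1 = shared.finish()  # caller 1 now reads caller 2's state

# One possible workaround: give each thread its own parser instance.
_local = threading.local()


def parser_for_this_thread():
    """Return a ToyParser that belongs only to the calling thread."""
    if not hasattr(_local, "parser"):
        _local.parser = ToyParser()
    return _local.parser
```

In this toy, caller 1 gets back the tokens of caller 2's input. Creating a fresh parser per parse, or per thread as sketched, sidesteps the problem as long as the parser keeps no module-level state.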
--
h***@public.gmane.org
2012-03-11 20:24:32 UTC
Comment #3 on issue 189 by na...-***@public.gmane.org: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189

This is clearly a defect. This is an object-oriented library in an
object-oriented language. Two parsers should be completely independent of each
other, with no shared global variables, and thus thread-safe. If that's not
the case, this is a defect.

Do I have to scrap my plans to convert a parallel web crawler from
BeautifulSoup to html5lib?

This looks fixable. The trouble spots include at least these global
variables:

dom.py: moduleCache

That could be easily fixed with a lock in getDomModule. That's a once per
parse event, so there's no performance issue. All that's needed is

import threading
...
lock = threading.Lock()  # one module-level lock guarding moduleCache

with lock:  # acquired on entry, released on exit
    ...  # critical section
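
The locking idea above could look roughly like the following sketch. The names `_module_cache`, `_cache_lock`, `get_cached_module` and the `build` callback are assumptions made for illustration; they are not taken from html5lib's dom.py or etree.py, and the real getDomModule may be structured differently.

```python
import threading

_module_cache = {}              # shared module-level cache (hypothetical name)
_cache_lock = threading.Lock()  # guards every check-then-insert on the cache


def get_cached_module(key, build):
    """Return the cached value for key, calling build(key) at most once."""
    with _cache_lock:
        if key not in _module_cache:
            # Only one thread at a time can reach this point, so the
            # expensive build step runs once per key.
            _module_cache[key] = build(key)
        return _module_cache[key]
```

Holding the lock across the whole check-and-insert keeps the logic simple. Since the lookup happens once per parse rather than once per token, contention on the lock should be negligible.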


etree.py: moduleCache

Same issue.

etree.lxml: fullTree

This seems to be set only once, at load time. Is it changed elsewhere?

What have I missed? Some lower-level library? Is Python's SAX parser
unsafe?

This can and should be fixed.
--
h***@public.gmane.org
2013-04-09 21:28:12 UTC
Updates:
Status: GitHub

Comment #4 on issue 189 by geoffers: HTMLParser is not threadsafe
http://code.google.com/p/html5lib/issues/detail?id=189

https://github.com/html5lib/html5lib-python/issues/8
--
You received this message because this project is configured to send all
issue notifications to this address.
You may adjust your notification preferences at:
https://code.google.com/hosting/settings