You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I believe there is a bug in xml.dom.xmlbuilder.parse().
xml.dom.xmlbuilder is not documented by python, we don't have any tests for
them, and I can't find evidence on github of these classes being in use. That's
okay, because in this case the classes are implementations of a w3c standard
which was at the time a draft, and which is heavily documented.
First, some digital archeology. The first commit of Lib/xml/dom/xmlbuilder.py
is from 2003-01-25, with the message "Import from PyXML 1.10."
The issue already exists in that commit.
The earlier remains of the PyXML project are on sourceforce
in a cvs repo. Looking at the log there, I find that the first commit for xmlbuilder.py
is from 2002-02-21 with this message:
Preliminary implementation of the new direct-from-Expat minidom builder.
xmlbuilder contains the basic interface code for the DOM Level 3 "LS-Load"
feature; it can certainly be generalized a little more, but this seems a
reasonable start. expatbuilder contains the code that xmlbuilder uses to
glue minidom and Expat together. There's nothing general there! ;-)
The bug is already present in this version as well.
The file sees a few more commits, four of which reference updating to track
the new version of the draft standard, the latest of which is from 2002-08-02.
None of the commits since being imported into CPython reference the standard.
This means that in looking at the code, our best reference is the 2002-07-25 draft
version: https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/.
What's called DOMBuilder in that draft and in our implementation, would in later
drafts be renamed to DOMParser and then LSParser, but the basic class remains
in the standard.
Going to the standard now:
The primary interface in the standard is DOMImplementationLS, a mixin for the DOMImplementation interface. The relevant methods for us are:
DOMBuilder createDOMBuilder(in unsigned short mode, in DOMString schemaType)raises(DOMException);
DOMInputSource createDOMInputSource();
Described as:
createDOMBuilder
Create a new DOMBuilder. The newly constructed parser may then be configured by means of its setFeature method, and used to parse documents by means of its parse method.
createDOMInputSource
Create a new "empty" DOMInputSource.
This interface represents a single input source for an XML entity.
This interface allows an application to encapsulate information about an input source in a single object, which may include a public identifier, a system identifier, a byte stream (possibly with a specified encoding), and/or a character stream.
The DOMBuilder will use the DOMInputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly; if not, the parser will use a byte stream, if available; if neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.
A DOMInputSource object belongs to the application: the parser shall never modify it in any way (it may modify a copy if necessary).
Finally, documentation for the DOMBuilder interface:
A interface to an object that is able to build a DOM tree from various input sources.
DOMBuilder provides an API for parsing XML documents and building the corresponding DOM document tree. A DOMBuilder instance is obtained from the DOMImplementationLS interface by invoking its createDOMBuildermethod.
DOMBuilders have a number of named features that can be queried or set. The name of DOMBuilder features must be valid XML names. Implementation specific features (extensions) should choose a implementation specific prefix to avoid name collisions.
Even if all features must be recognized by all implementations, being able to set a state (true or false) is not always required. The following list of recognized features indicates the definitions of each feature state, if setting the state to true or false must be supported or is optional and, which state is the default one:
parse
Parse an XML document from a resource identified by a DOMInputSource.
Parameters:
is of type DOMInputSource
The DOMInputSource from which the source document is to be read.
Return Value:
Document
So an application using this interface is intended to:
Get a hold of a DOMImplementation object. This is documented in python,
and accomplished via xml.dom.getDOMImplementation().
Next, call createDOMBuilder() and createDOMInputSource() to get
those two respective objects.
Configure the options for the DOMBuilder using the methods canSetFeature, getFeature, and setFeature.
Configure the values of the DOMInputSource as needed. As noted above,
DOMInput source will first prefer its characterStream attribute. If that
attribute is null, it will prefer the byteStream attribute. If both of
these are null, then the URI from its systemId attribute is used.
Finally, pass the DOMInputSource to DOMBuilder.parse() to receive a Document object.
In Python, this looks like this:
fromxml.domimportgetDOMImplementationimp=getDOMImplementation()
builder=imp.createDOMBuilder(imp.MODE_SYNCHRONOUS, None)
# configure the builder, if neededsource=imp.createDOMInputSource()
# set one or more attributes on the "empty" source objectsource.systemId="http://www.w3.org/2000/svg"document=builder.parse(source)
# Do whatever with the document
For a use case like this, with no additional customization of the source object,
the spec also has a DOMBuilder.parseURI() method which accepts a URI directly and
constructs the necessary DOMInputSource internally.
With all that as background, here's our current implementation of DOMBuilder.parse():
The problem is on line 192. systemId is not an attribute that exists on options here.
It's an instance of the xml.dom.xmlbuilder.Options class and represents the
features that can be configured on DOMBuilder. Those options are documented in
the specification, and are set using DOMBuilder.setFeature(). Our implementation
has a dictionary DOMBuilder._settings which contains the options from the
spec that we support, which starts here:
The important thing is that systemId is not an configuration setting in either
the specificiation or in our implementation, so as written we can never reach
the if branch in that method.
(The implementation is also non-conformant because it fails to consider input.characterStream before using input.byteStream, but that's a different
probem. I'm not looking to add a missing feature right now, just fix the already-existing-but-broken one.)
I'm going to tack on to this issue that DOMBuilder.parseURI() is entirely broken right now. It relies on DOMEntityResolver.resolveEntity, which uses the private method DOMEntityResolver._guess_media_encoding, which makes a call to http.client.HTTPMessage.plist, which is a method that hasn't existed since python 2.7 when the mimetools module went away. It's just broken.
I can fix that one too. I should probably write some tests for these at this point(?)
For the record, I'm finding these obscure issues because I'm working on the type stubs for the xml module, and these popped out when I tried to run a type check on the source.
Bug report
Bug description:
I believe there is a bug in
xml.dom.xmlbuilder.parse()
.xml.dom.xmlbuilder
is not documented by python, we don't have any tests forthem, and I can't find evidence on github of these classes being in use. That's
okay, because in this case the classes are implementations of a w3c standard
which was at the time a draft, and which is heavily documented.
First, some digital archeology. The first commit of
Lib/xml/dom/xmlbuilder.py
is from 2003-01-25, with the message "Import from PyXML 1.10."
The issue already exists in that commit.
The earlier remains of the PyXML project are on sourceforce
in a cvs repo. Looking at the log there, I find that the first commit for xmlbuilder.py
is from 2002-02-21 with this message:
The bug is already present in this version as well.
The file sees a few more commits, four of which reference updating to track
the new version of the draft standard, the latest of which is from 2002-08-02.
None of the commits since being imported into CPython reference the standard.
This means that in looking at the code, our best reference is the 2002-07-25 draft
version: https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/.
What's called DOMBuilder in that draft and in our implementation, would in later
drafts be renamed to DOMParser and then LSParser, but the basic class remains
in the standard.
Going to the standard now:
The primary interface in the standard is DOMImplementationLS, a mixin for the
DOMImplementation
interface. The relevant methods for us are:Described as:
The documentation for DOMInputSource says:
Finally, documentation for the DOMBuilder interface:
And its parse method:
So an application using this interface is intended to:
and accomplished via
xml.dom.getDOMImplementation()
.createDOMBuilder()
andcreateDOMInputSource()
to getthose two respective objects.
canSetFeature
,getFeature
, andsetFeature
.DOMInput source will first prefer its
characterStream
attribute. If thatattribute is null, it will prefer the
byteStream
attribute. If both ofthese are null, then the URI from its
systemId
attribute is used.DOMInputSource
toDOMBuilder.parse()
to receive aDocument
object.In Python, this looks like this:
For a use case like this, with no additional customization of the source object,
the spec also has a
DOMBuilder.parseURI()
method which accepts a URI directly andconstructs the necessary DOMInputSource internally.
With all that as background, here's our current implementation of
DOMBuilder.parse()
:cpython/Lib/xml/dom/xmlbuilder.py
Lines 187 to 195 in aeb9b65
The problem is on line 192.
systemId
is not an attribute that exists onoptions
here.It's an instance of the
xml.dom.xmlbuilder.Options
class and represents thefeatures that can be configured on
DOMBuilder
. Those options are documented inthe specification, and are set using
DOMBuilder.setFeature()
. Our implementationhas a dictionary
DOMBuilder._settings
which contains the options from thespec that we support, which starts here:
cpython/Lib/xml/dom/xmlbuilder.py
Lines 101 to 103 in aeb9b65
The important thing is that
systemId
is not an configuration setting in eitherthe specificiation or in our implementation, so as written we can never reach
the if branch in that method.
I believe that the relevant lines should be:
(The implementation is also non-conformant because it fails to consider
input.characterStream
before usinginput.byteStream
, but that's a differentprobem. I'm not looking to add a missing feature right now, just fix the already-existing-but-broken one.)
CPython versions tested on:
3.13
Operating systems tested on:
No response
Linked PRs
xml.dom.xmlbuilder.DOMBuilder.parse()
#128284The text was updated successfully, but these errors were encountered: