Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in xml.dom.xmlbuilder.DOMBuilder.parse() #128302

Open
tungol opened this issue Dec 28, 2024 · 3 comments
Open

Bug in xml.dom.xmlbuilder.DOMBuilder.parse() #128302

tungol opened this issue Dec 28, 2024 · 3 comments
Labels
stdlib Python modules in the Lib dir topic-XML type-bug An unexpected behavior, bug, or error

Comments

@tungol
Copy link
Contributor

tungol commented Dec 28, 2024

Bug report

Bug description:

I believe there is a bug in xml.dom.xmlbuilder.parse().

xml.dom.xmlbuilder is not documented by python, we don't have any tests for
them, and I can't find evidence on github of these classes being in use. That's
okay, because in this case the classes are implementations of a w3c standard
which was at the time a draft, and which is heavily documented.

First, some digital archeology. The first commit of Lib/xml/dom/xmlbuilder.py
is from 2003-01-25, with the message "Import from PyXML 1.10."
The issue already exists in that commit.

The earlier remains of the PyXML project are on sourceforce
in a cvs repo. Looking at the log there, I find that the first commit for xmlbuilder.py
is from 2002-02-21 with this message:

Preliminary implementation of the new direct-from-Expat minidom builder.
xmlbuilder contains the basic interface code for the DOM Level 3 "LS-Load"
feature; it can certainly be generalized a little more, but this seems a
reasonable start.  expatbuilder contains the code that xmlbuilder uses to
glue minidom and Expat together.  There's nothing general there!  ;-)

The bug is already present in this version as well.

The file sees a few more commits, four of which reference updating to track
the new version of the draft standard, the latest of which is from 2002-08-02.
None of the commits since being imported into CPython reference the standard.
This means that in looking at the code, our best reference is the 2002-07-25 draft
version: https://www.w3.org/TR/2002/WD-DOM-Level-3-LS-20020725/.

What's called DOMBuilder in that draft and in our implementation, would in later
drafts be renamed to DOMParser and then LSParser, but the basic class remains
in the standard.

Going to the standard now:

The primary interface in the standard is DOMImplementationLS, a mixin for the
DOMImplementation interface. The relevant methods for us are:

DOMBuilder         createDOMBuilder(in unsigned short mode, in DOMString schemaType)raises(DOMException);
DOMInputSource     createDOMInputSource();

Described as:

createDOMBuilder
Create a new DOMBuilder. The newly constructed parser may then be configured by means of its setFeature method, and used to parse documents by means of its parse method.

createDOMInputSource
Create a new "empty" DOMInputSource.

The documentation for DOMInputSource says:

This interface represents a single input source for an XML entity.

This interface allows an application to encapsulate information about an input source in a single object, which may include a public identifier, a system identifier, a byte stream (possibly with a specified encoding), and/or a character stream.

The DOMBuilder will use the DOMInputSource object to determine how to read XML input. If there is a character stream available, the parser will read that stream directly; if not, the parser will use a byte stream, if available; if neither a character stream nor a byte stream is available, the parser will attempt to open a URI connection to the resource identified by the system identifier.

A DOMInputSource object belongs to the application: the parser shall never modify it in any way (it may modify a copy if necessary).

Finally, documentation for the DOMBuilder interface:

A interface to an object that is able to build a DOM tree from various input sources.

DOMBuilder provides an API for parsing XML documents and building the corresponding DOM document tree. A DOMBuilder instance is obtained from the DOMImplementationLS interface by invoking its createDOMBuildermethod.

DOMBuilders have a number of named features that can be queried or set. The name of DOMBuilder features must be valid XML names. Implementation specific features (extensions) should choose a implementation specific prefix to avoid name collisions.

Even if all features must be recognized by all implementations, being able to set a state (true or false) is not always required. The following list of recognized features indicates the definitions of each feature state, if setting the state to true or false must be supported or is optional and, which state is the default one:

And its parse method:

parse
Parse an XML document from a resource identified by a DOMInputSource.
Parameters:
is of type DOMInputSource
The DOMInputSource from which the source document is to be read.
Return Value:
Document

So an application using this interface is intended to:

  1. Get a hold of a DOMImplementation object. This is documented in python,
    and accomplished via xml.dom.getDOMImplementation().
  2. Next, call createDOMBuilder() and createDOMInputSource() to get
    those two respective objects.
  3. Configure the options for the DOMBuilder using the methods canSetFeature,
    getFeature, and setFeature.
  4. Configure the values of the DOMInputSource as needed. As noted above,
    DOMInput source will first prefer its characterStream attribute. If that
    attribute is null, it will prefer the byteStream attribute. If both of
    these are null, then the URI from its systemId attribute is used.
  5. Finally, pass the DOMInputSource to DOMBuilder.parse() to receive a
    Document object.

In Python, this looks like this:

from xml.dom import getDOMImplementation

imp = getDOMImplementation()
builder = imp.createDOMBuilder(imp.MODE_SYNCHRONOUS, None)
# configure the builder, if needed
source = imp.createDOMInputSource()
# set one or more attributes on the "empty" source object
source.systemId = "http://www.w3.org/2000/svg"
document = builder.parse(source)
# Do whatever with the document

For a use case like this, with no additional customization of the source object,
the spec also has a DOMBuilder.parseURI() method which accepts a URI directly and
constructs the necessary DOMInputSource internally.

With all that as background, here's our current implementation of DOMBuilder.parse():

def parse(self, input):
options = copy.copy(self._options)
options.filter = self.filter
options.errorHandler = self.errorHandler
fp = input.byteStream
if fp is None and options.systemId:
import urllib.request
fp = urllib.request.urlopen(input.systemId)
return self._parse_bytestream(fp, options)

The problem is on line 192. systemId is not an attribute that exists on options here.
It's an instance of the xml.dom.xmlbuilder.Options class and represents the
features that can be configured on DOMBuilder. Those options are documented in
the specification, and are set using DOMBuilder.setFeature(). Our implementation
has a dictionary DOMBuilder._settings which contains the options from the
spec that we support, which starts here:

_settings = {
("namespace_declarations", 0): [
("namespace_declarations", 0)],

The important thing is that systemId is not an configuration setting in either
the specificiation or in our implementation, so as written we can never reach
the if branch in that method.

I believe that the relevant lines should be:

        fp = input.byteStream
        if fp is None and input.systemId:
            import urllib.request
            fp = urllib.request.urlopen(input.systemId)

(The implementation is also non-conformant because it fails to consider
input.characterStream before using input.byteStream, but that's a different
probem. I'm not looking to add a missing feature right now, just fix the already-existing-but-broken one.)

CPython versions tested on:

3.13

Operating systems tested on:

No response

Linked PRs

@tungol tungol added the type-bug An unexpected behavior, bug, or error label Dec 28, 2024
@tungol tungol changed the title Bug in `xml.dom.xmlbuilder.DOMBuilder.parse() Bug in xml.dom.xmlbuilder.DOMBuilder.parse() Dec 28, 2024
@tungol
Copy link
Contributor Author

tungol commented Dec 28, 2024

I'm going to tack on to this issue that DOMBuilder.parseURI() is entirely broken right now. It relies on DOMEntityResolver.resolveEntity, which uses the private method DOMEntityResolver._guess_media_encoding, which makes a call to http.client.HTTPMessage.plist, which is a method that hasn't existed since python 2.7 when the mimetools module went away. It's just broken.

I can fix that one too. I should probably write some tests for these at this point(?)

Currently:

def _guess_media_encoding(self, source):
info = source.byteStream.info()
if "Content-Type" in info:
for param in info.getplist():
if param.startswith("charset="):
return param.split("=", 1)[1].lower()

A version that works for python 3 is:

    def _guess_media_encoding(self, source):
        info = source.byteStream.info()
        if "Content-Type" in info:
            for param in info.get_params([]):
                if param[0] == 'charset':
                    return param[1].lower()

@tungol
Copy link
Contributor Author

tungol commented Dec 28, 2024

For the record, I'm finding these obscure issues because I'm working on the type stubs for the xml module, and these popped out when I tried to run a type check on the source.

@picnixz picnixz added stdlib Python modules in the Lib dir topic-XML labels Dec 28, 2024
@picnixz
Copy link
Contributor

picnixz commented Dec 28, 2024

Thank you very much for this thorough investigation! I'll try to have a look at your PR today or next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-XML type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

2 participants