Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: hostname format check fails on empty string #677

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jvtm
Copy link
Contributor

@jvtm jvtm commented Jul 13, 2023

Assert that hostname format validation fails gracefully on empty strings.

This is especially for Python jsonschema library that raises an unexpected ValueError exception on hostname check (python-jsonschema/jsonschema#1121).

Adds similar test for:

  • draft3: host-name
  • draft4: hostname
  • draft6: hosntame
  • draft7: hostname, idn-hostname
  • draft2019-09: hostname, idn-hostname
  • draft2020-12: hostname, idn-hostname
  • draft-next: hostname, idn-hostname

Assert that hostname format validation fails gracefully on empty strings.

This is especially for Python `jsonschema` library that raises an unexpected
ValueError exception on `hostname` check (python-jsonschema/jsonschema#1121).

Adds similar test for:
 * draft3: host-name
 * draft4: hostname
 * draft6: hosntame
 * draft7: hostname, idn-hostname
 * draft2019-09: hostname, idn-hostname
 * draft2020-12: hostname, idn-hostname
 * draft-next: hostname, idn-hostname
@jvtm jvtm requested a review from a team as a code owner July 13, 2023 09:12
jvtm added a commit to jvtm/jsonschema that referenced this pull request Jul 13, 2023
Doing hostname format check on empty string seems to raise a ValueError:

    >>> from jsonschema.validators import validator_for
    >>> schema = {"$schema": "https://json-schema.org/draft/2020-12/schema", "type": "string", "format": "hostname"}
    >>> vcls = validator_for(schema)
    >>> validator = vcls(schema, format_checker=vcls.FORMAT_CHECKER)
    >>> list(validator.iter_errors(""))
    ...
      File "lib/python3.10/site-packages/jsonschema/_format.py", line 276, in is_host_name
        return FQDN(instance).is_valid
      File "lib/python3.10/site-packages/fqdn/__init__.py", line 44, in __init__
        raise ValueError("fqdn must be str")
    ValueError: fqdn must be str

Fix by adding `raises=ValueError` to the related `@_checks_drafts` decorator call.

See also json-schema-org/JSON-Schema-Test-Suite#677.

Fixes python-jsonschema#1121.
@Julian
Copy link
Member

Julian commented Jul 13, 2023

So, reading the spec(s), I'm not sure this is correct actually!

In particular, excluding draft 2020-12 which I'll get back to, earlier drafts all say the equivalent of:

A string instance is valid against this attribute if it is a valid representation for an Internet host name, as defined by RFC 1034, section 3.1 [RFC1034].

(see e.g. here though as I say you can flip through all the drafts and they say more or less the same thing).

Looking at that section, it says:

Each node has a label, which is zero to 63 octets in length. Brother nodes may not have the same label, although the same label can be used for nodes which are not brothers. One label is reserved, and that is the null (i.e., zero length) label used for the root.

which seems to say that the empty string is indeed allowed, and in the context of DNS, refers to the root.

Draft 2020-12 on the other hand was changed to reference RFC 1123 and not RFC 1034 (some context is here):

hostname: As defined by RFC 1123, section 2.1 [RFC1123] [...]

where RFC 1123 in that section seems to simply pass things along to RFC 952:

The syntax of a legal Internet host name was specified in RFC-952
[DNS:4]. One aspect of host name syntax is hereby changed: the
restriction on the first character is relaxed to allow either a
letter or a digit. Host software MUST support this more liberal
syntax.

where the grammar in the latter seems to say:

<hname> ::= <name>*["."<name>]
<name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]

which also seems to allow the empty string.

Have a look and if you have some existing domain knowledge let me know, but yeah my first read looks like these are actually valid under the quoted specs.

@gregsdennis
Copy link
Member

Does this mean we should keep the tests, but ensure that validation passes instead?

@Julian
Copy link
Member

Julian commented Jul 13, 2023

To my read, yes.

@jvtm
Copy link
Contributor Author

jvtm commented Jul 13, 2023

Uhh, what a can of worms.. ;)

Some docs here and there limit DNS labels to be 1-63 octets (and some 0-63), and full entry max 253. But no mentions of allowing / denying no labels at all. And then the usual "use a single dot for the root".

I wonder how various real life tools work.

@Julian
Copy link
Member

Julian commented Jul 13, 2023

Yes well no good deed goes unpunished certainly. The authoritative thing here is the spec(s) / RFCs, not what tools do for better or worse, though certainly looking through them can be insightful anyhow -- I certainly saw one allowing an empty string.

@jvtm
Copy link
Contributor Author

jvtm commented Jul 13, 2023

I'll reverse the check and reword the commit and PR later today.

Is it worth adding any other checks?

@Julian
Copy link
Member

Julian commented Jul 13, 2023

None that I can immediately think of, let's see if anyone else chimes in between now and then, and regardless thanks for raising the PR!

@gregsdennis
Copy link
Member

I'll give this a run through my implementation tomorrow.

@jvtm
Copy link
Contributor Author

jvtm commented Jul 13, 2023

Looks like in Python jsonschema idn-hostname for both zero length string and a single dot fail.

The check uses idna library https://github.com/kjd/idna/blob/8c703c782bf8144091c63470fcdbf29ac47eeb15/idna/core.py#L340-L370

@Julian
Copy link
Member

Julian commented Jul 13, 2023

(That's not surprising to me -- the question is usually twofold --

  • is the library we / I used doing what the JSON Schema spec says it should do, i.e. is it implementing the section that the JSON Schema spec references
  • does it have a bug :D

In that case, I'm also somewhat trusting the JSON Schema spec which says "all hostnames are valid idn-hostnames" though that too may be wrong! Tricky tricky...)

@karenetheridge
Copy link
Member

FWIW, my implementation does not accept empty string:

$; json-schema-eval --validate_formats
enter data instance, followed by ^D:
""
enter schema, followed by ^D:
{"type":"string","format":"hostname"}
^D
{
  "errors": [
    {
      "error": "not a hostname",
      "instanceLocation": "",
      "keywordLocation": "/format"
    }
  ],
  "valid": false
}

..which is just a wrapper around https://metacpan.org/pod/Data::Validate::Domain#is_domain($domain,-\%options) which says it implements the RFCs.

@gregsdennis
Copy link
Member

Weirdly, my implementation says hostname with empty string is valid, but idn-hostname with empty string is invalid.

Ah! For idn-hostname, I'm using System.Uri.CheckHostName() provided by .Net. But for hostname, I'm using the regex ^[a-zA-Z][-.a-zA-Z0-9]{0,22}[a-zA-Z0-9]$ which is very likely incorrect; I'm not sure where I got it (probably SO).

@Julian Julian added the waiting for author A pull request which is waiting for an update from its author. label Jul 17, 2023
@jvtm
Copy link
Contributor Author

jvtm commented Jul 18, 2023

I don't have time to work on this at the moment. I can see that there's some interest, and in theory the tests should be strict, but I'll leave that for others to decide. Feel free to take over and add the required tests.

For what it's worth, I added additional pattern to the actual use-case (that does not allow empty string, or actually any top level domain either).

@Julian
Copy link
Member

Julian commented Jul 18, 2023

As far as I can tell, for anyone picking this back up, there isn't any additional test to add I don't think beyond the ones already in this PR, so the work is simply to flip the "false"'s to "true" as the expected result, or to otherwise show where in the RFCs I quoted is a restriction on empty strings.

@Julian
Copy link
Member

Julian commented Jul 18, 2023

(And thanks @jvtm for getting this started!)

@shane-kearns
Copy link

In the context of DNS, the root is an empty fragment to allow for absolute paths.
For example, if your DNS search path is example.com
github.com could either be resolved to the A record in the github.com zone.
Or it could be resolved to whatever is at github.com.example.com if that failed.
Whereas github.com. is an absolute domain that must not use search paths.

The above behaviour is a gross security issue, so the default configuration for most implementations of DNS resolvers is to use the search path for path length of 1. (e.g. host would resolve to host.example.com but github.com would not due to an implementation defined constraint)

My interpretation of this:

Each node has a label, which is zero to 63 octets in length. Brother
nodes may not have the same label, although the same label can be used
for nodes which are not brothers. One label is reserved, and that is
the null (i.e., zero length) label used for the root.

"a..example.com" -> not a valid hostname (empty label is reserved for the root)
".example.com" -> not a valid hostname (empty label is reserved for the root)
"example.com." -> valid (empty label at the end indicates explicit root)
"" -> not a valid hostname, but is a valid domain name (you can't resolve it usefully, but joining a host to a domain like python's ".".join(['github', 'com', '']) is a useful and valid thing to do.

About the change from RFC1034 to RFC1123
DNS is very explicitly a binary protocol with no restrictions other than length on what each segment of a name can be. You can see this clarified in RFC2181.
Whereas RFC1123 defines the sort of alphanumeric host name character set we are used to seeing (and what is for example a valid hostname in a URL)

Considering this grammar in RFC952:

  <hname> ::= <name>*["."<name>]
  <name>  ::= <let>[*[<let-or-digit-or-hyphen>]<let-or-digit>]

A name consists of a let (relaxed by RFC to let-or-digit), optionally followed by any number of let-or-digit-or-hyphen terminated by let-or-digit.

These character sets are not defined in the grammar, but referring to the assumptions in section 1:

<let-or-digit> = alphanumeric, i.e. `[A-Za-z0-9]`
<let-or-digit-or-hyphen> = alphanumeric + hyphen, i.e. `[A-Za-z0-9\-]`

so a name is represented by the regular expression [A-Za-z0-9]([A-Za-z0-9\-]*[A-Za-z0-9])?
"" -> invalid
"a" -> valid
"2" -> valid (RFC1123 updates)
"-aleph" -> invalid (hyphen only allowed in the middle of a name not start or end)
aleph-beth -> valid

and a hname (hostname) is 1 or more names joined with a .
Interestingly this grammar does not allow "github.com." since names are not permitted to be empty.

A further complication are DNS service records, these explicitly begin with an underscore to distinguish them from hostnames.
e.g. _ldap._tcp.example.com would be where to find the SRV record for DNS service discovery of the LDAP service for example.com, if one exists.
I think it is reasonable to say that DNS service records are not hostnames and should not validate as "format": "hostname" in json-schema.

@Julian
Copy link
Member

Julian commented Mar 4, 2024

I had some comment written out about how I had read your comment 2 or 3 times and couldn't fully follow it because it had outside info that wasn't seemingly relevant to answering "what is the spec's behavior regardless of what you want to use it for and whether it's appropriate for that use case" -- but then the fourth time reading it I noticed a quite simple error (on my part) which I think explains the behavior, namely:

<hname> ::= <name>*["."<name>]

I think I carelessly assumed and/or forgot that that BNF syntax in RFCs puts the * repetition before and not after, so that means as you said, a name with one or more separated names, and very much not "zero or more names, followed by ..."

So I'm at least personally convinced now that they indeed are not allowed in the grammar -- I'm not sure if anyone else actually was on either side here honestly, as usual I find the "my implementation does X" not very helpful compared to "I understand the behavior to be X because that's what it says" -- if anyone does have some understanding and wants to speak up on either side feel free -- otherwise if it was just me as a holdout, I'm happy to see the merge conflicts fixed and this merged with them being invalid under these lines in the spec.

@Julian
Copy link
Member

Julian commented Mar 4, 2024

Sorry, I realize now I wrote that whole comment out the second time without saying "thank you! clearly your analysis was helpful for getting to the bottom of this now it seems"!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting for author A pull request which is waiting for an update from its author.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants