Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Switch to platform-specific TLS/SSL certificate verification #1529

Merged
merged 2 commits into from
Oct 15, 2024

Conversation

tsibley
Copy link
Member

@tsibley tsibley commented Oct 15, 2024

Use rustls-platform-verifier to verify certificates with the OS/platform's standard methods, including the system's standard CA trust store. See rustls-platform-verifier's documentation for details on each platform.¹

Previously, the system's trust store was ignored in favor of a baked in and unconfigurable trust store provided by webpki-roots.

Updates reqwest as necessary to ensure a single compatible rustls version remains in use. Updates wasm-bindgen as necessary for the newer reqwest.

¹ https://github.com/rustls/rustls-platform-verifier/blob/v/0.3.4/README.md

Resolves: #726
Related-to: #1527

tsibley and others added 2 commits October 14, 2024 22:24
Use rustls-platform-verifier to verify certificates with the
OS/platform's standard methods, including the system's standard CA trust
store.  See rustls-platform-verifier's documentation for details on each
platform.¹

Previously, the system's trust store was ignored in favor of a baked in
and unconfigurable trust store provided by webpki-roots.

Updates reqwest as necessary to ensure a single compatible rustls
version remains in use.  Updates wasm-bindgen as necessary for the newer
reqwest.

¹ <https://github.com/rustls/rustls-platform-verifier/blob/v/0.3.4/README.md>

Resolves: <#726>
Related-to: <#1527>
Copy link

vercel bot commented Oct 15, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Updated (UTC)
nextclade ✅ Ready (Inspect) Visit Preview Oct 15, 2024 5:39am

@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

Looks like the builds went well in CI for all targets, but a few checks failed because there were no binaries uploaded and so none to download for those checks. That's because the binaries are written to .build/… but .out/… is what's uploaded as a workflow artifact for subsequent steps, and the conditions for ./docker/dev to copy from .build/.out/

nextclade/docker/dev

Lines 573 to 580 in 9b2d9dd

if [ -n "${CROSS:-}" ] && [ -n "${RELEASE:-}" ] && { [ "${BUILD:-}" == 1 ] || [ "${RUN:-}" == 1 ]; }; then
mkdir -p .out/
if [[ "${CROSS}" == *windows* ]]; then
cp "${BUILD_DIR}/${CROSS}/release/${PACKAGE_NAME}.exe" ".out/${PACKAGE_NAME}-${CROSS}.exe"
else
cp "${BUILD_DIR}/${CROSS}/release/${PACKAGE_NAME}" ".out/${PACKAGE_NAME}-${CROSS}"
fi
fi

weren't met. Huh. This seems intentional (?) but I admit I don't get why.

tsibley added a commit that referenced this pull request Oct 15, 2024
Allows adding additional CA certificates to the trust store specifically
for Nextclade, for when modifying the system's trust store isn't
desirable/possible.  Works across all platforms.

Currently requires using landed but unreleased changes to
rustls-platform-verifier so that macOS and Windows have the needed
Verifier::new_with_extra_certs() API.  Notably, that API also changes
signature between the latest released version (0.3.4) and the Git
revision we're using.

Related-to: <#1529>
Related-to: <#726>
tsibley added a commit that referenced this pull request Oct 15, 2024
Allows adding additional CA certificates to the trust store specifically
for Nextclade, for when modifying the system's trust store isn't
desirable/possible.  Works across all platforms.

Currently requires using landed unreleased changes to
rustls-platform-verifier so that macOS and Windows have the needed
Verifier::new_with_extra_certs() API.¹  That API also has signature
changes proposed² which, while unmerged, seem likely to land, so use
those in anticipation.  They actually bring the signature back closer to
the latest release (0.3.4).

¹ <rustls/rustls-platform-verifier#58>
² <rustls/rustls-platform-verifier#145>

Related-to: <#1529>
Related-to: <#726>
tsibley added a commit that referenced this pull request Oct 15, 2024
Allows adding additional CA certificates to the trust store specifically
for Nextclade, for when modifying the system's trust store isn't
desirable/possible.  Works across all platforms.

Currently requires using landed but unreleased changes to
rustls-platform-verifier so that macOS and Windows have the needed
Verifier::new_with_extra_certs() API.¹  That API also has signature
changes proposed² which, while unmerged, seem likely to land, so use
those in anticipation.  They actually bring the signature back closer to
the latest release (0.3.4).

¹ <rustls/rustls-platform-verifier#58>
² <rustls/rustls-platform-verifier#145>

Related-to: <#1529>
Related-to: <#726>
@ivan-aksamentov ivan-aksamentov mentioned this pull request Oct 15, 2024
@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Oct 15, 2024

@tsibley Thanks Tom! I am in favor of merging.

Indeed it seems that CI is borked now for some reason. I made an issue: #1532

This will have to be investigated. If you noticed potential breakage sources or have any ideas regarding improving the (admittedly horrible) dev script, the container and CI setup, I appreciate your advice :)

I believe that the bash checks should be met: running with br (build-release) means both BUILD=1 and RELEASE is set, and then CROSS should be set - the env var is provided explicitly when calling the script

@ivan-aksamentov ivan-aksamentov merged commit ed94c59 into master Oct 15, 2024
17 of 20 checks passed
@ivan-aksamentov ivan-aksamentov deleted the trs/rustls-platform-verifier branch October 15, 2024 14:40
@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

I believe that the bash checks should be met: running with br (build-release) means both BUILD=1 and RELEASE is set, and then CROSS should be set - the env var is provided explicitly when calling the script

Oh, you're totally right. I misread. It seems as if it's either 1) never getting to the cp, as cp would complain if anything was missing or 2) it's getting to the cp just fine, but for some other reason the upload-artifact step isn't seeing .out/.

And writing that out, it just clicked. actions/upload-artifact@v4 made some pretty big breaking changes—I recall from when I upgraded to v4 in other repos—and indeed, here's the one that's biting us now:

With v4.4 and later, hidden files are excluded by default.

We need to set:

include-hidden-files: true

in its params. Maybe also if-no-files-found: error instead of the default of only warning, since it seems like binaries should always be present.

further investigation on #1532 and a fix in #1533

@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

If you noticed potential breakage sources or have any ideas regarding improving the (admittedly horrible) dev script, the container and CI setup, I appreciate your advice :)

The dev setup is pretty… complicated. I can get some of it to work for me, but other parts are a little broken in various ways due to little details that don't line up. The sort of thing that accumulates when only a couple people regularly run a program and then someone new tries to run it.

  • Docker images are pulled, but then docker buildx build is still run immediately afterwards. This seems like it's not necessary? Maybe it's a quick-enough no-op in that case, but I couldn't tell because I ran into ERROR: cache export feature is currently not supported for docker driver. Please switch to a different driver (eg. "docker buildx create --use"). I know what's going on and how to fix this… but I realized I didn't have to since I didn't need to build.

  • ./docker/dev build works for me now, but CROSS=x86_64-unknown-linux-gnu ./docker/dev build does not: /usr/bin/time: cannot run cargo: Permission denied. My uid is 1000 but it seems like the cross container's files are owned by and only visible to 1001. My host is x86_64-unknown-linux-gnu.

I don't have the energy to fix 'em now, and I haven't really had to run ./docker/dev to hack on things yet, so… shrug

@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

Oh, I see, the UID issue is directly caused by using the pulled images without rebuilding them locally. That's why the rebuild happens always.

@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

Presumably if I fix the buildx builder instance issue and force a rebuild with ./docker/dev docker-image-build (because it won't be rebuilt now that they've been pulled), then the cargo permission denied issue will go away. But also presumably a better set umask in the images would make some of the UID stuff unnecessary.

@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

Fixing CI revealed another CI issue (and maybe larger issue?) with this PR: systems without a trust store cause an error, e.g. like the bare bones Linux containers without ca-certificates installed that we use in ./tests/test-linux-distros.

We could address this with 1) default certs from webpki-roots in code or 2) in the CI setup by ensuring ca-certificates is installed. The "right" solution depends on if such an environment without a system trust store matters for actual use cases. I guess in other user-built containers it might already be relied upon (even without realizing it)?

@ivan-aksamentov
Copy link
Member

As mentioned in the comment in the dev script, none of this is required. It's just basically my personal giant set of bash aliases :)

Most people would just use Cargo and Node.js directly or in whatever the setup they like.

But then I dragged it into CI, and now there is no easy way back.

Docker images are pulled, but then docker buildx build is still run immediately afterwards.

This is to make sure the build is not stale, because I often tweak stuff. It checks hashes of files involved in docker build and rebuilds only if they differ. The pulled image is used for layer cache then.

And another reason to rebuild is the UID and GID, but that's not handled. Ideally I'd like to do rootless docker sometimes to avoid the hassle.

ERROR: cache export feature is currently not supported for docker driver. Please switch to a different driver (eg. "docker buildx create --use")

Could it be a problem with docker version? They constantly tweak cache-related features. I am using 27.1.2. It "works" (well, worked for the past few years) in CI with no particular dance, so that's the only thing on top of my head.

@ivan-aksamentov
Copy link
Member

For the missing ca-certs - not sure what to do. Can we install it on CI and say "user responsibility" in all other cases?

@ivan-aksamentov
Copy link
Member

I ship debian, alpine and scratch images, so that might need to be adjusted somehow.

Could there be also problems in Conda envs? I am not sure what they use.

I am hesitant to ship a fix for 3 users which potentially breaks 1000 users.

@tsibley
Copy link
Member Author

tsibley commented Oct 15, 2024

As mentioned in the comment in the dev script, none of this is required. It's just basically my personal giant set of bash aliases :)
[…]
But then I dragged it into CI, and now there is no easy way back.

Yeah, that's why I ignored it. :-)

But then I wanted to test what CI was testing…

Could it be a problem with docker version?

Ah, no, but I do know what the issue is (as I said): there's an incorrect assumption that the default buildx builder driver is docker-container not docker. (We handle this in nextstrain/docker-base by not assuming that.) I might address it so I can locally 1) build a proper binary against an old-enough glibc and 2) run tests/test-linux-distros.

I ship debian, alpine and scratch images, so that might need to be adjusted somehow.

I expect Debian image would be fine (ca-certificates is installed), but that the Alpine and scratch images would not be.

Could there be also problems in Conda envs? I am not sure what they use.

Conda envs basically always have ca-certificates installed on Linux, and on other systems they'll use the platform.

I don't really want to say "user responsibility" as it feels unfriendly and is essentially a breaking change. I'll address this in additional code to fall back to webpki-roots (which is what was always used previously).

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Oct 15, 2024

Thanks so much for digging and explaining Tom!

But also presumably a better set umask in the images would make some of the UID stuff unnecessary.

I like running my daily tasks in docker and the fact that everything runs and writes as root annoyed me for years. I desperately looking for a solution better than UID & GID build args. Last time I tried rootless docker a couple of years ago it was a major pain and I failed to even set it up on my system. Now you mentioned umask. If you know a way to set things up such that UID/GID are not baked-into the image, can you teach me please? Does it have limitations compared to setting UID/GID?

there's an incorrect assumption that the default buildx builder driver is docker-container not docker. (We handle this in nextstrain/docker-base by not assuming that.) I might address it so I can locally 1) build a proper binary against an old-enough glibc and 2) run tests/test-linux-distros.

I am not very familiar with the new builders stuff, neither I can repro it on my system or on CI, so I appreciate if you look at it. No rush though.

My main question is if it was working for me on different machines over the years and on CI, why it does not work for you? Do you have a different local docker configuration that prevents the docker-container driver assumed by my script from working?

We lived past 10 years without it all and now docker devs felt compelled to add this new complexities. I guess that there is a good reason. I need to learn more about it sometimes to avoid falling into these traps.

P.S. Maybe it's more convenient to discuss these things on Slack

tsibley added a commit that referenced this pull request Oct 16, 2024
…ault

We require the "docker-container" driver (via options we pass to `docker
buildx build`, e.g. --cache-to) but the default builder instance may use
the "docker" driver, leading to errors on dev machines like:

    ERROR: cache export feature is currently not supported for docker
    driver. Please switch to a different driver (eg. "docker buildx
    create --use")

Being explicit about our builder avoids that.

Related-to: <#1529 (comment)>
tsibley added a commit that referenced this pull request Oct 16, 2024
…ault

We require the "docker-container" driver (via options we pass to `docker
buildx build`, e.g. --cache-to) but the default builder instance may use
the "docker" driver, leading to errors on dev machines like:

    ERROR: cache export feature is currently not supported for docker
    driver. Please switch to a different driver (eg. "docker buildx
    create --use")

Being explicit about our builder avoids that.

This same pattern is used in nextstrain/docker-base's devel/build.¹

¹ <nextstrain/docker-base@fac834e5>
  <https://github.com/nextstrain/docker-base/blob/0acdc799/devel/build#L48-L56>

Related-to: <#1529 (comment)>
@tsibley
Copy link
Member Author

tsibley commented Oct 16, 2024

Last time I tried rootless docker a couple of years ago it was a major pain and I failed to even set it up on my system.

You might look into Podman, which has a much simpler rootless configuration due to its design not using a daemon (docker + dockerd) but forking+execing directly.

Now you mentioned umask. If you know a way to set things up such that UID/GID are not baked-into the image, can you teach me please? Does it have limitations compared to setting UID/GID?

(In case it's not clear, the umask I'm talking about is umask (man page, Wikipedia). I used "umask" as a shorthand for "default file permissions".)

I mentioned it because the uid/gid problems I ran into only seemed to arise in the first place because files/directories were set to be readable/executable only by the owner and no one else. That might because of the umask setting at image build time, or it might be because something explicitly set them that way. Regardless, in a container context with a single user, it rarely makes the most sense to have owner-only perms, as then the invoking uid (--user=…) and baked-in uid (USER in Dockerfile) have to match. The problem I ran into was because they didn't: my uid was 1000 and the baked-in uid was 1001 (because I used a pulled image directly instead of rebaking my uid into it). If the files/dirs were readable/executable by everyone, then this mismatch wouldn't have mattered.

My rules of thumb are to:

  1. Make sure that the Dockerfile produces images with world-readable/executable (and writable, if applicable) permissions. umask may be a part of this, or you may choose/need to achieve it other ways. If you want, you can also use a non-root default user, but for tool images like this, that's not strictly necessary if you'll also always be doing (2) below. (For service images like, say, Pg, it's more necessary as you'd less commonly invoke them with --user.)
  2. Invoke such images with --user=$(id -u):$(id -g) to ensure that any new files/dirs created on mounts shared with the host remain owned by the current user/group.

I am not very familiar with the new builders stuff, neither I can repro it on my system or on CI, so I appreciate if you look at it. No rush though.

#1534

My main question is if it was working for me on different machines over the years and on CI, why it does not work for you? Do you have a different local docker configuration that prevents the docker-container driver assumed by my script from working?

Yeah. My default buildx builder uses the docker driver, not the docker-container driver. This is the standard default, so you must have switched yours at some point long ago.

For reference, these are the builders I have right now:

$ docker buildx ls
NAME/NODE             DRIVER/ENDPOINT             STATUS  BUILDKIT             PLATFORMS
nextclade-builder     docker-container                                         
  nextclade-builder0  unix:///var/run/docker.sock stopped                      
nextstrain-builder    docker-container                                         
  nextstrain-builder0 unix:///var/run/docker.sock stopped                      
default *             docker                                                   
  default             default                     running v0.11.7+d3e6c1360f6e linux/amd64, linux/amd64/v2, linux/amd64/v3, linux/386

The output indicates the currently selected builder instance (default for me) with a *.

We lived past 10 years without it all and now docker devs felt compelled to add this new complexities. I guess that there is a good reason. I need to learn more about it sometimes to avoid falling into these traps.

I hear ya! I only know about this shit because we dealt with it for nextstrain/docker-base.

@tsibley
Copy link
Member Author

tsibley commented Oct 16, 2024

I'll address this in additional code to fall back to webpki-roots (which is what was always used previously).

Reasonable options I see for getting out of the hole I dug (lolsob):

  1. Use platform certs + built-in webpki-roots certs on all platforms…

    1. …using rustls_platform_verifier::Verifier::new_with_extra_roots(), similar to what feat: Support a NEXTCLADE_EXTRA_CA_CERTS environment variable and --extra-ca-certs option #1527 does for NEXTCLADE_EXTRA_CA_CERTS.

      This relies on currently-unreleased code, so we either accept that and move forward or revert this PR's merge to master and wait until the dep is released to switch to rustls-platform-verifier. Frankly, the former seems ok to me: merged-but-not-released seems very similar to merged-and-released for this package.

    2. …by enabling both the rustls-tls-native-roots and rustls-tls-webpki-roots features of reqwest. Currently we only enable the latter (via rustls feature alias), but we can enable both at once. This would use rustls-native-certs to acquire the platform certs, if any. It's better than no platform certs but not quite as good as using the platform's own cert verification methods (via rustls-platform-verifier).

      We could also support NEXTCLADE_EXTRA_CA_CERTS in this implementation by calling reqwest::blocking::ClientBuilder::add_root_certificate() repeatedly.

  2. Only add in certs from webpki-roots for Linux, where rustls_platform_verifier::Verifier::new_with_extra_roots() is available in released code. Other platforms should always have a system trust store with certs in it, so shouldn't need webpki-roots.

    When feat: Support a NEXTCLADE_EXTRA_CA_CERTS environment variable and --extra-ca-certs option #1527 lands, still don't add in webpki-roots for non-Linux platforms.

Thoughts? Mine are:

(1) is more like the Nextclade status quo before any of my work here: it just adds certs from the platform alongside previously used certs from webpki-roots. (1)(i) is maybe the least change from the status quo.

(1)(i), especially if we accept the unreleased code, is doubling down on switching to rustls-platform-verifier. (1)(ii) is walking it back and changing tack; we could still switch to it later (maybe when/if reqwest adopts it, as folks hope?) but maybe we won't.

(1)(i) and (2) preserve rustls-platform-verifier's full use of the platform's own verification methods beyond CA certs themselves, which is nice. Maybe not really something we need to work hard for, but at this point most of the work is done.

(2) is more "pure" in that it doesn't mix sources of certs unnecessarily. (1) is maybe more "consistent" though.

Having spent way more time on this now than I bargained for and really want to, I'm inclined to go with (1)(ii): ditch rustls-platform-verifier, go with something simpler and less different than what we have now, still solve CA cert problems now not later.

tsibley added a commit that referenced this pull request Oct 16, 2024
…verifier"

This reverts commit ed94c59, reversing
changes made to 48801fa.

Related-to: <#1529 (comment)>
tsibley added a commit that referenced this pull request Oct 16, 2024
…L trust store

Previously, the platform's trust store was ignored in favor of a baked
in and unconfigurable trust store provided by webpki-roots.  Now the
reqwest trust store will contain both certs obtained from the platform
at run time as well as certs baked in via webpki-roots.

Obtaining certs from the platform means that Nextclade will respect
OS-level configuration to trust private CAs / self-signed certs.
Keeping webpki-roots for all platforms is a precaution that makes this
change merely additive for backwards compatibility, in case of platforms
which lack a trust store (like some Linux containers) or platforms with
out-of-date trust stores.  It means that Nextclade binaries should
continue to Just Work™.

reqwest uses rustls-native-roots to obtain trusted CA certificates from
the standard trust stores for the OS/platform.  See the crate's
documentation for details on each platform.¹  Notably, this does not use
the platform's standard certificate verification methods like
rustls-platform-verifier; it just extracts certificates.  We may in the
future want to switch to rustls-platform-verifier (ourselves or by
waiting for reqwest to do so).

Updates reqwest because an earlier (but problematic and now reverted²)
change did so and there were some public API changes I'd like to use.
Updates wasm-bindgen as necessary for the newer reqwest (≥0.2.89) and
then a little further (0.2.93) to avoid Clippy warnings.³

¹ <https://docs.rs/crate/rustls-native-certs/0.8.0>
² <#1529 (comment)>.
³ <rustwasm/wasm-bindgen#3985>

Resolves: <#726>
Related-to: <#1529>
Related-to: <#1527>
tsibley added a commit that referenced this pull request Oct 16, 2024
…d --extra-ca-certs option

Allows adding additional CA certificates to the trust store specifically
for Nextclade, for when modifying the system's trust store isn't
desirable/possible.  Works across all platforms.

An option may be preferred to an env var by some users or in some
invocations.  It also provides a convenient place for documentation of
default CA cert handling and point of discovery for users.

Co-authored-by: ivan-aksamentov <[email protected]>
Supersedes: <#1527>
Related-to: <#1529>
Related-to: <#726>
tsibley added a commit that referenced this pull request Oct 16, 2024
…d --extra-ca-certs option

Allows adding additional CA certificates to the trust store specifically
for Nextclade, for when modifying the system's trust store isn't
desirable/possible.  Works across all platforms.

An option may be preferred to an env var by some users or in some
invocations.  It also provides a convenient place for documentation of
default CA cert handling and point of discovery for users.

Co-authored-by: ivan-aksamentov <[email protected]>
Supersedes: <#1527>
Related-to: <#1529>
Related-to: <#726>
@tsibley
Copy link
Member Author

tsibley commented Oct 16, 2024

Having spent way more time on this now than I bargained for and really want to, I'm inclined to go with (1)(ii): ditch rustls-platform-verifier, go with something simpler and less different than what we have now, still solve CA cert problems now not later.

I've implemented this as #1536.

tsibley added a commit that referenced this pull request Oct 16, 2024
…d --extra-ca-certs option

Allows adding additional CA certificates to the trust store specifically
for Nextclade, for when modifying the system's trust store isn't
desirable/possible.  Works across all platforms.

An option may be preferred to an env var by some users or in some
invocations.  It also provides a convenient place for documentation of
default CA cert handling and point of discovery for users.

Co-authored-by: ivan-aksamentov <[email protected]>
Supersedes: <#1527>
Related-to: <#1529>
Related-to: <#726>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants