Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Misc Dockerfiles updates to reduce image footprints #1363

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ashahba
Copy link
Collaborator

@ashahba ashahba commented Jan 8, 2025

Description

This PR, leverages multi-stage builds and a few other BKMs to reduce Docker image footprints as much as possible and still sticking with same base images and ensuring functionality remains intact.
Once the two images (one from main branch and one using Dockerfile provided in this PR) are built the results shows 20%+ image size reduction:

$ docker images | grep chatqna
chatqna                updated         a16d17afc110   12 minutes ago   649MB
chatqna                main            ef7dccfe29d2   15 minutes ago   828MB

There are a few other enhancements like introducing build-args for GenAIComps repo and branch names to be cloned which improves local development and testing experience and is less error prone since it doesn't require modifying the Dockerfiles themselves.

My goal is not to get this PR merged, but I like some early feedback review so that we can collect other BKMs into one before going after all Dockerfiles and decide to improve every single one of them.

Issues

This is a known issue.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

N/A

Tests

This is a WIP PR and there is a CI test to cover it as well.

Copy link

github-actions bot commented Jan 8, 2025

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

Copy link
Contributor

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this saves too little, please see:

Comment on lines 11 to +12
libgl1-mesa-glx \
libjemalloc-dev \
git
libjemalloc-dev && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these are installed? Nothing should be using Mesa / 3D libs, and jemalloc usage would need explicit e.g. LD_PRELOAD for it, which I do not see here.

If those are dependencies of some pip package, pip should take care it, shouldn't it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No unfortunately Pip won't take care of development packages.

Copy link
Contributor

@eero-t eero-t Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What actually needs these specific packages / why?

I mean, pip install in that Dockerfile works fine without either of them, application itself is Python / does not link such libs, and I do not see any problem with the app when those libs are missing...

Copy link
Contributor

@eero-t eero-t Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tracked from Git history through file renames/moves when these were first used. However, neither the related PRs nor files in them did not mention why these were added. See e.g: #136

There were few earlier Dockerfiles that already apt-get these, that used langchain as base, but they were running something else than ChatQnA. So it seems like these items are just blindly copy-pasted, without any justification, and maybe used just for some development activities, instead of being needed in production.

@ashahba Could you find out whether there is some justification for adding these?

(If yes, I should be included to base image after all, and document the reason there.)

Comment on lines +25 to +31
ARG GenAICompsRepo="https://github.com/opea-project/GenAIComps"
ARG GenAICompsBranch="main" # Branch name or Commit ID

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
curl

RUN curl -sSL --retry 5 ${GenAICompsRepo}/tarball/${GenAICompsBranch} | tar --strip-components=1 -xzf -
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using base image from GenAIComps project is the final goal. These optimization are unnecessary there, and shared base image gives >10x space improvement, when all application images are present.

=> I would skip this and wait for the base image, so that Dockerfiles can be (greatly) simplified to reduce the resulting image size.

Copy link
Contributor

@eero-t eero-t Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I compared using curl against git clone --depth 1 used in my PR: #1031

Both took about 6 secs (through company VPN), which was a bit disappointing.

I would assume curling compressed tarball to use a bit less bandwidth than Git's "packed" data though.

Extracted tarball took 24MB space. Git clone included extra 14MB in .git subdir, but that would not go not go to final target, it's only on the temporary layer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, there's a rather large difference in installing curl vs. git on top of python:3.11.

Building curl image took 9s, and git one 18s, probably due to extra dependencies / their sizes (which won't be in the final image):

$ docker images |grep  -E "(git|curl)-test"
localhost/git-test                                        latest           d5dfb6922688  2 minutes ago  239 MB
localhost/curl-test                                       latest           0e44588dadfa  3 minutes ago  158 MB

=> I'll add note about doing this change to my PR, if I need to update it again.

Comment on lines +45 to +47
COPY --from=devel /usr/local/lib/${PYTHON}/site-packages /usr/local/lib/${PYTHON}/site-packages
COPY --from=devel /usr/local/bin /usr/local/bin
COPY --from=devel /home/user/GenAIComps /home/user/GenAIComps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I tested using separate stage for ChatQnA pip install, and copying the files from there, like you do here, it saved only couple of MBs. With shared base image, that's pretty insignificant, so I would skip it for simplicity.

Or are you seeing larger savings?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please double check your setup.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a build with just pip install being done in separate step, or not, and image size difference reported by Docker was just few megs. How did you test it, and what different did you see?

(I tested it with the comps base image, but that should not matter as it does exactly the same install step.)

Copy link
Contributor

@eero-t eero-t Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When caching is disabled, pip should clean out any downloading artefacts after installation, so I think only size difference would come from pip & setuptools upgrades, and those are only couple of MB in size. Which corresponds to what I saw...

Are you sure you're not confusing the size difference with disk space used for curl install? apt-get update itself takes some space, and curl + its deps take ~4.5MB.

(I'm asking because if there is significant size difference, then I should consider same thing for base image.)


RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
ENV LANG C.UTF-8
Copy link
Contributor

@eero-t eero-t Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python base images already include this workaround for Python versions needing it: https://github.com/docker-library/python/blob/master/3.11/slim-bookworm/Dockerfile

(LANG env is removed in 3.13 version.)

So I do not think such override to be needed for the base image. Do you agree?

@eero-t
Copy link
Contributor

eero-t commented Jan 20, 2025

I guess this can be closed, as #1031 was merged, and next step will be using base image (#1369)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants