Enhancement request: Use deterministic UID/GID in Dockerfile #1555

Open
ghsa-retrieval opened this issue Jan 21, 2025 · 4 comments

Comments

@ghsa-retrieval

ghsa-retrieval commented Jan 21, 2025

Is your enhancement request related to a problem? Please describe.
The UID and GID of the container user can change because the Dockerfile creates the user and group by name only. This affects deployments that need to identify the user exactly. For example, security context settings in Kubernetes/Helm charts, such as runAsUser and runAsGroup, cannot be applied, since the UID and GID are not known ahead of time and may change between versions. Similarly, user namespace configurations rely on this information.
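
To illustrate why a fixed ID matters, here is a minimal sketch of a Kubernetes Pod that pins the security context. The Pod name, the image reference, and the IDs 1000/1000 are assumptions chosen for illustration (1000 matches the value used in the compose example later in this thread); they are not values confirmed by the project.

```yaml
# Hypothetical Pod spec; every name and ID here is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: scancodeio-web            # assumed name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000               # must match the UID baked into the image
    runAsGroup: 1000              # must match the GID baked into the image
  containers:
    - name: web
      image: example.invalid/scancodeio:placeholder   # assumed image reference
```

If the image assigns the UID and GID non-deterministically, the values above cannot be written down in advance, which is the problem described here.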

What are the benefits of the requested enhancement?
The user and group are no longer assigned a non-deterministic ID. You can set up user namespaces in a predictable way.

Describe the solution you would like
Modify the adduser and addgroup commands in the Dockerfile to assign an explicit numerical UID and GID instead of relying on the name alone. The UID and GID should be ones that are not already occupied by the Python base image.

Additional notes
Using a numerical UID and GID instead of name is also recommended according to Docker: https://docs.docker.com/build/building/best-practices/#user

tdruez added a commit that referenced this issue Jan 27, 2025
@tdruez
Contributor

tdruez commented Jan 27, 2025

@ghsa-retrieval Could you confirm that the changes at https://github.com/aboutcode-org/scancode.io/pull/1569/files are good enough for your needs?

@ghsa-retrieval
Author

@tdruez Yes, but this change should come with a big warning because it will likely cause issues for existing installations where the container image used a different UID/GID before and stored its data in a volume. This can result in a failure to start due to permission errors and would require running chown on the volume. While this could have happened unintentionally in the past as well, given the non-deterministic nature of the assignment, here it is expected to break.

For comparison, my local compose install without the patch shows the following IDs:
UID = 101
GID = 108

However, I have observed the IDs fluctuating on Kubernetes deployments, so I'm not sure if people just got lucky with their compose deployments in the past or if docker compose does some magic under the hood for this case.

@tdruez
Contributor

tdruez commented Jan 27, 2025

@ghsa-retrieval Thanks for the insight! Do you have any suggestions on how we can address this with minimal impact on existing instances?

@ghsa-retrieval
Author

ghsa-retrieval commented Jan 27, 2025

@tdruez Good question. It seems that docker compose does not provide an option to automatically modify the user permissions, unlike Kubernetes. What you could do is add a new service to the docker compose file that modifies the permissions before web, worker, and nginx start. I'm not sure if there is a cleaner solution for this, as it would only have to run once, on update to the new container image version.

Note: This is just a quick example. You would likely want to put the UID and GID in the .env file and reference the variables instead of hardcoding them, and use a properly tagged version of the alpine image. I have also not checked whether any files in these directories are owned by a user other than the "app" user. Startup performance will be impacted if there are many files to modify.

services:
  db:
    image: postgres:13
    env_file:
      - docker.env
    volumes:
      - db_data:/var/lib/postgresql/data/
    shm_size: "1gb"
    restart: always

  redis:
    image: redis
    # Enable redis data persistence using the "Append Only File" with the
    # default policy of fsync every second. See https://redis.io/topics/persistence
    command: redis-server --appendonly yes
    volumes:
      - redis_data:/data
    restart: always
    
  chown:
    image: alpine:latest
    restart: "no"
    command: sh -c "
        chown -R 1000:1000 /opt/scancodeio/.env && 
        chown -R 1000:1000 /etc/scancodeio && 
        chown -R 1000:1000 /var/scancodeio/workspace && 
        chown -R 1000:1000 /var/scancodeio/static"
    volumes:
      - .env:/opt/scancodeio/.env
      - /etc/scancodeio/:/etc/scancodeio/
      - workspace:/var/scancodeio/workspace/
      - static:/var/scancodeio/static/

  web:
    build: .
    command: wait-for-it --strict --timeout=60 db:5432 -- sh -c "
        ./manage.py migrate &&
        ./manage.py collectstatic --no-input --verbosity 0 --clear &&
        gunicorn scancodeio.wsgi:application --bind :8000 --timeout 600 --workers 8 ${GUNICORN_RELOAD_FLAG}"
    env_file:
      - docker.env
    expose:
      - 8000
    volumes:
      - .env:/opt/scancodeio/.env
      - /etc/scancodeio/:/etc/scancodeio/
      - workspace:/var/scancodeio/workspace/
      - static:/var/scancodeio/static/
    depends_on:
      chown:
         condition: service_completed_successfully
      db:
         condition: service_started

  worker:
    build: .
    # Ensure that potential db migrations run first by waiting until "web" is up
    command: wait-for-it --strict --timeout=120 web:8000 -- sh -c "
        ./manage.py rqworker --worker-class scancodeio.worker.ScanCodeIOWorker
                             --queue-class scancodeio.worker.ScanCodeIOQueue
                             --verbosity 1"
    env_file:
      - docker.env
    volumes:
      - .env:/opt/scancodeio/.env
      - /etc/scancodeio/:/etc/scancodeio/
      - workspace:/var/scancodeio/workspace/
    depends_on:
      chown:
         condition: service_completed_successfully
      redis:
         condition: service_started
      db:
         condition: service_started
      web:
         condition: service_started

  nginx:
    image: nginx:alpine
    ports:
      - "${NGINX_PUBLISHED_HTTP_PORT:-80}:80"
      - "${NGINX_PUBLISHED_HTTPS_PORT:-443}:443"
    volumes:
      - ./etc/nginx/conf.d/:/etc/nginx/conf.d/
      - /var/www/html:/var/www/html
      - static:/var/scancodeio/static/
    depends_on:
      web:
         condition: service_started
    restart: always

  clamav:
    image: clamav/clamav
    volumes:
      - clamav_data:/var/lib/clamav
      - workspace:/var/scancodeio/workspace/
    restart: always

volumes:
  db_data:
  redis_data:
  clamav_data:
  static:
  workspace:
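
The note above suggests moving the IDs into .env and pinning the alpine tag. A minimal sketch of what that chown service could look like follows. The variable names APP_UID and APP_GID and the tag 3.20 are assumptions picked for illustration; only the mounted paths come from the example above.

```yaml
# Sketch only: APP_UID / APP_GID are hypothetical variables expected in .env,
# e.g.  APP_UID=1000  and  APP_GID=1000
services:
  chown:
    image: alpine:3.20            # example of a pinned tag instead of "latest"
    restart: "no"
    # Compose substitutes ${...} before the container starts; the ":?" form
    # aborts with the given message if the variable is unset or empty.
    command: sh -c "
        chown -R ${APP_UID:?set APP_UID in .env}:${APP_GID:?set APP_GID in .env}
            /opt/scancodeio/.env
            /etc/scancodeio
            /var/scancodeio/workspace
            /var/scancodeio/static"
    volumes:
      - .env:/opt/scancodeio/.env
      - /etc/scancodeio/:/etc/scancodeio/
      - workspace:/var/scancodeio/workspace/
      - static:/var/scancodeio/static/
```

The rest of the file would stay as shown above, with web and worker still waiting on this service through `condition: service_completed_successfully`.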
