Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

psycopg connection error to DB #1523

Open
ghsa-retrieval opened this issue Jan 8, 2025 · 5 comments
Open

psycopg connection error to DB #1523

ghsa-retrieval opened this issue Jan 8, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@ghsa-retrieval
Copy link

ghsa-retrieval commented Jan 8, 2025

Describe the bug
In a deployment of ScanCode.io in Kubernetes the worker is consistently having issues with connection to the database server which it does not recover from (without restarting).

It appears that the TLS connection to the database is being forcefully closed with an RST by AWS RDS. Unfortunately, due to TLS and security features of Kubernetes, we were unable to intercept the plaintext traffic. It seems that the ScanCode worker send some message that is so out of the ordinary that the connection is being dropped.

System configuration

  • Which version of ScanCode.io are you running?
  • Are you running the app using Docker?
    • No, custom Helm chart that runs it on EKS
  • On which OS?
    • Linux
  • What inputs are you using?
    • Unclear if related to inputs, seemed to happen fairly randomly at different times
  • Which pipeline are you running?
    • load_sbom, scan_single_package

To Reproduce
Setup DejaCode with a ScanCode.io instance

  1. Create a product
  2. In the product use Action > Load packages from SBOMs
  3. Observe that once the load_sbom pipeline is finished and the worker is supposed to work on single_package_scans that it fails with connection issues to the DB (as seen in the stackt trace below).

Expected behavior
Database connection should not be lost or at least get reestablished.

Screenshots

    11:04:43 default: scanpipe.tasks.execute_pipeline_task(run_pk='fc031d48-908c-4c1d-baa2-35b6c7c04275') (fc031d48-908c-4c1d-baa2-35b6c7c04275)
    INFO Run[fc031d48-908c-4c1d-baa2-35b6c7c04275] Enter `execute_pipeline_task` Run.pk=fc031d48-908c-4c1d-baa2-35b6c7c04275
    INFO Run[fc031d48-908c-4c1d-baa2-35b6c7c04275] Run pipeline: "load_sbom" on project: "9b3ce723-6b49-4c48-bcc9-9951e13ed351"
    INFO Run[fc031d48-908c-4c1d-baa2-35b6c7c04275] Update Run instance with exitcode, output, and end_date
    INFO Delivering Webhook 2b62ae18-45d8-43c8-80c0-6516b3a1cb40
    ERROR URL has an invalid label.
    11:04:51 Successfully completed scanpipe.tasks.execute_pipeline_task(run_pk='fc031d48-908c-4c1d-baa2-35b6c7c04275') job in 0:00:08.509004s on worker cb0eed9a3c59493782acc22f850df31b
    11:04:51 default: Job OK (fc031d48-908c-4c1d-baa2-35b6c7c04275)
    11:04:51 Result is kept for 500 seconds
    11:05:12 default: scanpipe.tasks.execute_pipeline_task(run_pk='87a612c7-2e7a-4504-a951-4acb69ff970a') (87a612c7-2e7a-4504-a951-4acb69ff970a)
    INFO Run[87a612c7-2e7a-4504-a951-4acb69ff970a] Enter `execute_pipeline_task` Run.pk=87a612c7-2e7a-4504-a951-4acb69ff970a
    Job 87a612c7-2e7a-4504-a951-4acb69ff970a: error while executing failure callback
    Traceback (most recent call last):
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
        return self.cursor.execute(sql, params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/cursor.py", line 97, in execute
        raise ex.with_traceback(None)
    psycopg.OperationalError: consuming input failed: SSL error: cipher operation failed
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/rq/worker.py", line 1573, in perform_job
        rv = job.perform()
             ^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/rq/job.py", line 1322, in perform
        self._result = self._execute()
                       ^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/rq/job.py", line 1356, in _execute
        result = self.func(*self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/scanpipe/tasks.py", line 58, in execute_pipeline_task
        run = get_run_instance(run_pk)
              ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/scanpipe/tasks.py", line 37, in get_run_instance
        return Run.objects.get(pk=run_pk)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/manager.py", line 87, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 645, in get
        num = len(clone)
              ^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 382, in __len__
        self._fetch_all()
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1928, in _fetch_all
        self._result_cache = list(self._iterable_class(self))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 91, in __iter__
        results = compiler.execute_sql(
                  ^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1574, in execute_sql
        cursor.execute(sql, params)
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 79, in execute
        return self._execute_with_wrappers(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 92, in _execute_with_wrappers
        return executor(sql, params, many, context)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 100, in _execute
        with self.db.wrap_database_errors:
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/utils.py", line 91, in __exit__
        raise dj_exc_value.with_traceback(traceback) from exc_value
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
        return self.cursor.execute(sql, params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/cursor.py", line 97, in execute
        raise ex.with_traceback(None)
    django.db.utils.OperationalError: consuming input failed: SSL error: cipher operation failed
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/base/base.py", line 298, in _cursor
        return self._prepare_cursor(self.create_cursor(name))
                                    ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/utils/asyncio.py", line 26, in inner
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/postgresql/base.py", line 429, in create_cursor
        cursor = self.connection.cursor()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/connection.py", line 213, in cursor
        self._check_connection_ok()
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/_connection_base.py", line 524, in _check_connection_ok
        raise e.OperationalError("the connection is closed")
    psycopg.OperationalError: the connection is closed
    The above exception was the direct cause of the following exception:
    Traceback (most recent call last):
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/rq/job.py", line 1470, in execute_failure_callback
        self.failure_callback(self, self.connection, *exc_info)
      File "/opt/scancodeio/scanpipe/tasks.py", line 47, in report_failure
        run = get_run_instance(run_pk=job.id)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/scanpipe/tasks.py", line 37, in get_run_instance
        return Run.objects.get(pk=run_pk)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/manager.py", line 87, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 645, in get
        num = len(clone)
              ^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 382, in __len__
        self._fetch_all()
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1928, in _fetch_all
        self._result_cache = list(self._iterable_class(self))
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 91, in __iter__
        results = compiler.execute_sql(
                  ^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1572, in execute_sql
        cursor = self.connection.cursor()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/utils/asyncio.py", line 26, in inner
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/base/base.py", line 320, in cursor
        return self._cursor()
               ^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/base/base.py", line 297, in _cursor
        with self.wrap_database_errors:
             ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/utils.py", line 91, in __exit__
        raise dj_exc_value.with_traceback(traceback) from exc_value
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/base/base.py", line 298, in _cursor
        return self._prepare_cursor(self.create_cursor(name))
                                    ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/utils/asyncio.py", line 26, in inner
        return func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/django/db/backends/postgresql/base.py", line 429, in create_cursor
        cursor = self.connection.cursor()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/connection.py", line 213, in cursor
        self._check_connection_ok()
      File "/opt/scancodeio/.venv/lib/python3.12/site-packages/psycopg/_connection_base.py", line 524, in _check_connection_ok
        raise e.OperationalError("the connection is closed")
    django.db.utils.OperationalError: the connection is closed
@ghsa-retrieval ghsa-retrieval added the bug Something isn't working label Jan 8, 2025
@ghsa-retrieval
Copy link
Author

I know this is a bit of an unusual bug report in so far that it does not use the standard docker compose setup, but any help is greatly appreciated. If you have any workaround that could recover from this or additional tests we could try, it would help us a lot. This is currently a major roadblock for using ScanCode.io with DejaCode in production.

@ghsa-retrieval
Copy link
Author

It seems that this issue is caused by how ScanCode.io's rqworker handle the database connection. I'm not familiar with Django-RQ or RQ itself, but it looks like either ScanCode.io does not use the rqworker correctly in this context or this is an underlying issue in Django-RQ, similar to: rq/django-rq#650

@ghsa-retrieval
Copy link
Author

Setting the rqworker argument --max-jobs 1, which results in the worker pod terminating after each job and restarted by the cluster, seems to avoid the connection issue. It is obviously very inefficient, but perhaps relevant.

@tdruez Do you have any ideas regarding DB connection management, if there is something you can do on your side? I've seen rq.SimpleWorker suggested as worker class elsewhere, because it does not fork. However, ScanCode.io has both custom worker and queue to handle the job cleanup. Would it perhaps be viable to use the SimpleWorker as base class? If you have any ideas, it would be very much appreciated.

@ghsa-retrieval
Copy link
Author

ghsa-retrieval commented Jan 17, 2025

Switching to rq.SimpleWorker avoids the connection issues (./manage.py --worker-class rq.SimpleWorker --queue-class scancodeio.worker.ScanCodeIOQueue --verbosity 1). This would suggest that basing the ScanCodeIOWorker class on it, could be a fix. Since I'm not an expert on RQ, I don't know if better or more performant options exist though.

@ghsa-retrieval
Copy link
Author

ghsa-retrieval commented Jan 21, 2025

Using SimpleWorker as a base class for ScanCodeIOWorker seems to be working fine. In k8s we can use Horizontal Pod Autoscaling to increase parallelism and spawn multiple worker in separate pods. That avoids the connection issue completely. The only potential issue I found was that three jobs of about a hundred seemingly got stuck (not completed after 18h). One switched to success when cancelling the pipeline and the others switched to stopped when manually cancelling their pipeline. I will have to see I can access the logs, but due to the dynamic horizontal scaling the may have gotten lost. The other workers that continued to run contained log messages such as:

    INFO Synchronizing QUEUED and RUNNING Run with their related Jobs...
    INFO 3 Run to synchronize:

However, they seemingly never gotten around to synchronizing the results. I'm not sure if this is related to the SimpleWorker or an unrelated hiccup that happened. Starting another single_package_scan allowed the jobs to complete.

It would be beneficial if ScanCode would allow to configure whether we want threads or processes. Quite apparantely something goes wrong when using Worker and TLS connections. I would suggest that there could be a ScanCodeIOWorker (forks) and ScanCodeIOSimpleWorker (non-forking). That way one could select the particular worker class in the command that starts the RQ worker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant