@zyoung1212 zyoung1212 commented May 26, 2025

What problem does this PR solve?

This PR adds Gunicorn support to enable production-ready deployment of the RAGFlow framework, replacing the default development server with a robust WSGI HTTP server. This change significantly improves performance, stability, and concurrency handling in real-world deployments.

Type of change

  • New Feature (non-breaking change which adds functionality)
  • Performance Improvement

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. 💞 feature Feature request, pull request that fullfill a new feature. 📖 documentation Improvements or additions to documentation labels May 26, 2025
@KevinHuSh KevinHuSh changed the title feature: Support production deployment mode by using gunicorn. Feat: Support production deployment mode by using gunicorn. May 26, 2025
@KevinHuSh KevinHuSh added the ci Continue Integration label May 26, 2025
@JinHai-CN
Contributor

Would you please check if the test cases can run in your local environment?

@zyoung1212
Author

I'm trying.

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 4, 2025
@zyoung1212 zyoung1212 marked this pull request as draft June 4, 2025 13:02
@zyoung1212
Author

zyoung1212 commented Jun 4, 2025

Hi team, @JinHai-CN @ZhenhangTung @KevinHuSh

Apologies for the earlier PR that failed CI. I have since:

  • Re‑worked the build – all checks now pass on my own CI runner (8 vCPU / 32 GB RAM).
  • Added Gunicorn + gevent for true production deployment.
  • Fixed thread‑safety issues that surfaced with multi‑worker Gunicorn.
  • Updated Docker entrypoint, added conf/gunicorn.conf.py, a WSGI module and pinned the required gunicorn / gevent dependencies.
  • Minor tweaks: safer table creation, clearer startup warnings, distributed‑lock for DB‑init, etc.
    (See full diff in this PR.)

Although the test suite is green, these changes have not yet been exercised in a live production environment. Could you please run a thorough review and, if possible, a staging deployment to confirm everything behaves as expected?

Thank you for your time and feedback!

@zyoung1212 zyoung1212 marked this pull request as ready for review June 4, 2025 13:11
@KevinHuSh
Collaborator

Much appreciated!
It's almost done.
Excellent work.
I've found that it might bring some unknown risks, such as this, and what @ericwu0930 encountered.
Given these risks and the lack of a clear fix for them, could you please keep debug mode as the default startup mode?

@zyoung1212
Author

Sure, I will do that.

@ericwu0930

@zyoung1212 @nareshrajkumar866
I use Qwen's embedding SaaS, and I think the monkey patch may conflict with dashscope, Qwen's SDK.

2025-06-06 09:54:03,859 ERROR 113 exception:
Traceback (most recent call last):
File "/ragflow/rag/llm/embedding_model.py", line 213, in encode_queries
resp = dashscope.TextEmbedding.call(
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/embeddings/text_embedding.py", line 46, in call
return super().call(model=model,
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/client/base_api.py", line 146, in call
return request.call()
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/api_entities/http_request.py", line 84, in call
output = next(response)
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/api_entities/http_request.py", line 303, in _handle_request
raise e
File "/ragflow/.venv/lib/python3.10/site-packages/dashscope/api_entities/http_request.py", line 286, in _handle_request
response = session.post(url=self.url,
File "/ragflow/.venv/lib/python3.10/site-packages/requests/sessions.py", line 637, in post
return self.request("POST", url, data=data, json=json, **kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "/ragflow/.venv/lib/python3.10/site-packages/requests/adapters.py", line 589, in send
resp = conn.urlopen(
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
response = self._make_request(
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 464, in _make_request
self._validate_conn(conn)
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1093, in _validate_conn
conn.connect()
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 741, in connect
sock_and_verified = _ssl_wrap_socket_and_match_hostname(
File "/ragflow/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 882, in _ssl_wrap_socket_and_match_hostname
context.verify_mode = resolve_cert_reqs(cert_reqs)
File "/usr/lib/python3.10/ssl.py", line 738, in verify_mode
super(SSLContext, SSLContext).verify_mode.__set__(self, value)

[Previous line repeated 417 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

To avoid unknown risks, it would be better to start Gunicorn with sync workers.
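
A RecursionError inside the `verify_mode` setter of `ssl.py` is a commonly reported symptom of calling gevent's `monkey.patch_all()` after `ssl` (or a library that imports it, such as `requests`) has already been imported. Whether that is what happens here is an assumption, not something confirmed in this thread. Below is a minimal, standard-library-only sketch for checking import order before patching; the helper name `modules_imported_too_early` and the watch list are hypothetical.

```python
import sys

# Modules that are commonly cited as needing to be imported only after
# monkey patching; the exact list is an assumption for illustration.
WATCHED = ("ssl", "socket", "requests", "urllib3")


def modules_imported_too_early(watched=WATCHED):
    """Return the watched modules that are already loaded in this process."""
    return sorted(name for name in watched if name in sys.modules)


if __name__ == "__main__":
    early = modules_imported_too_early()
    if early:
        print("already imported before patching:", ", ".join(early))
    else:
        print("none of the watched modules are loaded yet")
```

Calling this at the very top of the WSGI entry module, just before any monkey patching, would show whether something has pulled in `ssl` earlier than expected.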

@zyoung1212
Author

Thank you very much for your feedback. I will work on some configuration optimizations to support different Gunicorn modes.

@zyoung1212
Author

@KevinHuSh @ericwu0930
Hi, I’ve just pushed some updates:

  • Added support for switching between development and production modes via a .env file (default is development).
  • If the production mode is selected, it also supports switching between sync and gevent modes.
  • All changes have passed CI checks.
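
The mode switch described in the list above could look roughly like the following. This is a minimal sketch, not the PR's actual `conf/gunicorn.conf.py`: the environment variable names `RAGFLOW_SERVER_MODE`, `GUNICORN_WORKER_CLASS` and `GUNICORN_WORKERS`, the default of 4 workers, and the helper `resolve_server_settings` are all assumptions made for illustration. Only `worker_class` and `workers` are real Gunicorn setting names.

```python
import os


def resolve_server_settings(env=None):
    """Map environment variables to a server mode and Gunicorn settings."""
    env = os.environ if env is None else env

    # Default to the development server, as requested in review.
    mode = env.get("RAGFLOW_SERVER_MODE", "development").strip().lower()
    if mode not in ("development", "production"):
        raise ValueError(f"unknown server mode: {mode!r}")
    if mode == "development":
        return {"mode": mode}

    # In production, default to sync workers and make gevent opt-in.
    worker_class = env.get("GUNICORN_WORKER_CLASS", "sync").strip().lower()
    if worker_class not in ("sync", "gevent"):
        raise ValueError(f"unsupported worker class: {worker_class!r}")

    workers = int(env.get("GUNICORN_WORKERS", "4"))
    if workers < 1:
        raise ValueError("GUNICORN_WORKERS must be at least 1")

    return {"mode": mode, "worker_class": worker_class, "workers": workers}


if __name__ == "__main__":
    print(resolve_server_settings())
```

Rejecting unknown values up front, rather than silently falling back, makes a typo in the .env file fail at startup instead of quietly selecting the wrong worker type.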

Would you mind testing again when you have a moment, especially to verify the issue you mentioned earlier? Thanks in advance! @ericwu0930

@zyoung1212 zyoung1212 force-pushed the feature/support-gunicorn branch from 4d6c232 to bcc8192 Compare June 6, 2025 06:58
@ericwu0930

@zyoung1212 @KevinHuSh
Hi all, while introducing Gunicorn can significantly improve concurrency performance, there may be unknown risks. I started the service with gunicorn --workers=20 but encountered an intermittent bug (shown below). I suspect, without being sure, that the default MySQL connection pool size is too large, and I am currently working on a fix. In short, please be careful about merging this feature.

2025-06-10 16:00:07,182 ERROR    100 (0, '')
Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3291, in execute_sql
    cursor.execute(sql, params or ())
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 153, in execute
    result = self._query(query)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 322, in _query
    conn.query(q)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 562, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 843, in _execute_command
    raise err.InterfaceError(0, "")
pymysql.err.InterfaceError: (0, '')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/ragflow/api/utils/api_utils.py", line 294, in decorated_function
    objs = APIToken.query(token=token)
  File "/ragflow/api/db/db_models.py", line 212, in query
    return [query_record for query_record in query_records]
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 7243, in __iter__
    self.execute()
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 2011, in inner
    return method(self, database, *args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 2082, in execute
    return self._execute(database)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 2255, in _execute
    cursor = database.execute(self)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3299, in execute
    return self.execute_sql(sql, params)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3289, in execute_sql
    with __exception_wrapper__:
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3059, in __exit__
    reraise(new_type, new_type(exc_value, *exc_args), traceback)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 192, in reraise
    raise value.with_traceback(tb)
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3291, in execute_sql
    cursor.execute(sql, params or ())
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 153, in execute
    result = self._query(query)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 322, in _query
    conn.query(q)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 562, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 843, in _execute_command
    raise err.InterfaceError(0, "")
peewee.InterfaceError: (0, '')
2025-06-10 16:00:13,135 ERROR    100 (0, '')
Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/peewee.py", line 3291, in execute_sql
    cursor.execute(sql, params or ())
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 153, in execute
    result = self._query(query)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/cursors.py", line 322, in _query
    conn.query(q)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 562, in query
    self._execute_command(COMMAND.COM_QUERY, sql)
  File "/ragflow/.venv/lib/python3.10/site-packages/pymysql/connections.py", line 843, in _execute_command
    raise err.InterfaceError(0, "")
pymysql.err.InterfaceError: (0, '')

@ZhenhangTung
Contributor

@ericwu0930 thanks for raising it.

@ericwu0930

@zyoung1212 @KevinHuSh @ZhenhangTung
Another problem: a login error.
After a restart, everything works normally, but after a period of time I can no longer stay logged in to RAGFlow; I am logged out immediately after logging in. The log reports the following error. Any ideas?

werkzeug.exceptions.Unauthorized: 401 Unauthorized: The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required.
2025-06-11 10:05:02,191 WARNING  3303 load_user got exception Signature b'uoCyaMaa0mZlvg8436I9_06064s' does not match
2025-06-11 10:05:02,191 ERROR    3303 401 Unauthorized: The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required.
Traceback (most recent call last):
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 880, in full_dispatch_request
    rv = self.dispatch_request()
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/app.py", line 865, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/ragflow/.venv/lib/python3.10/site-packages/flask_login/utils.py", line 285, in decorated_view
    return current_app.login_manager.unauthorized()
  File "/ragflow/.venv/lib/python3.10/site-packages/flask_login/login_manager.py", line 178, in unauthorized
    abort(401)
  File "/ragflow/.venv/lib/python3.10/site-packages/flask/helpers.py", line 272, in abort
    current_app.aborter(code, *args, **kwargs)
  File "/ragflow/.venv/lib/python3.10/site-packages/werkzeug/exceptions.py", line 863, in __call__
    raise self.mapping[code](*args, **kwargs)
werkzeug.exceptions.Unauthorized: 401 Unauthorized: The server could not verify that you are authorized to access the URL requested. You either supplied the wrong credentials (e.g. a bad password), or your browser doesn't understand how to supply the credentials required.

@zyoung1212
Author

Sorry, I've been busy with some company projects lately and haven't had much time to think through the issue. We might need a bit more detail to get a better idea—looking forward to solving it together with you!

@ericwu0930

ericwu0930 commented Jun 11, 2025

@ZhenhangTung @KevinHuSh
Hi! Could you tell me under what circumstances this error occurs, and in which direction I should investigate? I noticed a similar error in #7249. Also, this issue does not appear in the test environment; it only occurs in production.

@zyoung1212
Author

Hello, @ericwu0930
Could you please tell me the startup parameters used for the two issues above (other than the known --workers=20)? Were they running in sync mode or gevent mode?
Also, is there currently a reliable way to reproduce these two issues?

@ericwu0930

sync mode.

The first issue is triggered when I send requests to /api/v1/retrieval continuously. It was resolved by reducing MySQL's max connections to 100, although I haven't fully figured out why.
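
One way to reason about the pool-size suspicion: each Gunicorn worker is a separate process, so each one holds its own connection pool, and the worst-case number of MySQL connections grows with the worker count. A rough sketch of that arithmetic follows. The helper names and the example figures are assumptions (a per-worker pool limit of 100, and a server limit of 151, which is MySQL's documented default for `max_connections`), and nothing here confirms that this was the actual cause of the `InterfaceError`.

```python
def worst_case_connections(workers, pool_max_per_worker):
    """Upper bound on connections if every worker fills its own pool."""
    return workers * pool_max_per_worker


def max_pool_size_per_worker(workers, server_max_connections, reserved=10):
    """Largest per-worker pool that keeps the total under the server limit.

    `reserved` leaves headroom for other clients (admin sessions, other
    services); the value is a placeholder, not a recommendation.
    """
    usable = server_max_connections - reserved
    return max(usable // workers, 0)


if __name__ == "__main__":
    workers = 20  # matches the --workers=20 reported above
    print(worst_case_connections(workers, 100))    # 2000
    print(max_pool_size_per_worker(workers, 151))  # 7
```

With these example numbers, 20 workers could in principle request far more connections than the server would accept, which is at least consistent with a smaller pool limit making the errors go away.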

I have no idea about the other one. Apart from being kicked out immediately after login, which prevents me from using the RAGFlow website, everything works well.

@Colstuwjx
Contributor

Hi @ericwu0930, I ran into a similar 401 issue on my internal deployment before. The steps to reproduce on my side are:

  1. Start multiple instances with the development server and use nginx to proxy to these ragflow-01 / ragflow-02 instances.
  2. Integrate Okta SSO login and try to log in. Sometimes the login succeeds and then a 401 is thrown; sometimes it returns "invalid_state" directly.

I fixed this by setting SECRET_KEY and introducing a Flask Redis session to share the session across multiple RAGFlow backend instances. I will take some time to submit a separate PR with this fix.

@Colstuwjx
Contributor

@zyoung1212 @ericwu0930 are you able to run the RAGFlow backend service normally? I've tried this PR and ran into many issues.

For example, in sync mode (gevent has issues too), the master process hangs on exit and has to be killed with kill -9:

# environment: macOS M1 PRO, Python 3.10.13
$ gunicorn --config conf/gunicorn.conf.py  api.wsgi:application
...
[2025-06-18 15:10:41 +0800] [30246] [INFO] Worker exiting (pid: 30246)
[2025-06-18 15:10:41 +0800] [30240] [INFO] Worker exiting (pid: 30240)
[2025-06-18 15:10:44 +0800] [29891] [INFO] Shutting down: Master # will stuck on here forever, and then I must kill it by kill -9
[1]    29891 killed     gunicorn --config conf/gunicorn.conf.py api.wsgi:application

Also, when I try to add a chunk by calling the POST {{host}}/api/v1/datasets/f1f8ec244abd11f09b2076cb1995044e/documents/ecf88c024abd11f09b2076cb1995044e/chunks API, it returns 500 / 504:

objc[30573]: +[MPSGraphObject initialize] may have been in progress in another thread when fork() was called.
objc[30573]: +[MPSGraphObject initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
[2025-06-18 15:13:30 +0800] [30510] [ERROR] Worker (pid:30573) was sent SIGKILL! Perhaps out of memory?
RAGFlow worker about to be forked
[2025-06-18 15:13:30 +0800] [30581] [INFO] Booting worker with pid: 30581
RAGFlow worker spawned (pid: 30581)

@ericwu0930

@Colstuwjx @zyoung1212
The root cause has been identified. On June 11, I started 10 Gunicorn processes. One of them crashed and was restarted on June 17. RAGFlow's JWT is derived from secret_key, but the original RAGFlow does not assign a value to secret_key, so it defaults to the process's startup date. As a result, 9 processes were using a key dated 2025-06-11, while the restarted one used a key dated 2025-06-17. If a user's login request is routed to one process but subsequent API calls hit a process with a different key, the signature check fails and the user is kicked out.
[Attached image: WeCom screenshot 17502420362075]

I solved this by setting SECRET_KEY. I don't think there is any need to introduce a Flask Redis session.
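
The failure mode described above can be reproduced in isolation. The sketch below uses the standard library's `hmac` rather than RAGFlow's actual token code, and the function names and the `SECRET_KEY` environment lookup are assumptions for illustration. It shows that a token signed with a key derived from one start date is rejected by a process whose key was derived from another, and that a shared explicit key avoids this.

```python
import hashlib
import hmac
import os
from datetime import date


def load_secret_key(start_date, env=None):
    """Use an explicit SECRET_KEY if set, else fall back to the start date."""
    env = os.environ if env is None else env
    return env.get("SECRET_KEY") or start_date.isoformat()


def sign(payload, key):
    """Append an HMAC-SHA256 signature of the payload, computed with key."""
    digest = hmac.new(key.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{digest}"


def verify(token, key):
    """Check that the token's signature matches the given key."""
    payload, _, digest = token.rpartition(".")
    expected = hmac.new(key.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    # Two workers started on different days, with no SECRET_KEY configured.
    old_worker = load_secret_key(date(2025, 6, 11), env={})
    new_worker = load_secret_key(date(2025, 6, 17), env={})
    token = sign("user-42", old_worker)
    print(verify(token, old_worker))  # True
    print(verify(token, new_worker))  # False: keys differ, user is logged out

    # The same two workers with a shared, explicitly configured key.
    shared = {"SECRET_KEY": "set-me-to-a-long-random-value"}
    key_a = load_secret_key(date(2025, 6, 11), env=shared)
    key_b = load_secret_key(date(2025, 6, 17), env=shared)
    print(verify(sign("user-42", key_a), key_b))  # True
```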

@ericwu0930

ericwu0930 commented Jun 18, 2025

@KevinHuSh @zyoung1212 @Colstuwjx
By the way, after upgrading from v0.18 to v0.19, I noticed a gradual memory increase in the ragflow container. The process that was killed on June 17 was likely killed because of this. I don't believe this is caused by the introduction of Gunicorn; rather, using Gunicorn may have amplified an existing problem.

[Attached image: WeCom screenshot 17502433561248]

@ericwu0930

ericwu0930 commented Jun 18, 2025

@Colstuwjx
I didn’t use the author’s original code entirely—this is my own implementation, which can boot normally in sync mode.
main...ericwu0930:ragflow:feature/support-gunicorn

@Colstuwjx
Contributor

Hi @ericwu0930, thanks for kindly sharing. I found that the issues I mentioned before were caused by the package imports in this code. After I moved them inside the post_fork function, the Gunicorn workers started normally in my local environment (macOS, M1 Pro).

However, when I benchmark the Gunicorn workers, e.g. by continuously posting create-chunk requests to the RAGFlow HTTP endpoint, the workers crash and the benchmark client reports 502 errors like the ones below:

0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2593] Worker 34: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2637] Worker 99: Received non-200 response: 502
[Csv Row number 2636] Worker 77: Received non-200 response: 502
[Csv Row number 2595] Worker 61: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2597] Worker 7: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2603] Worker 35: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
[Csv Row number 2602] Worker 98: Request failed: Post "https://ragflow.devcorp.net/api/v1/datasets/c791f5224c3211f098255637d18efa72/documents/cfe131704c3211f0b92e5637d18efa72/chunks": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

When I switch back to the original development server, it is actually more stable, with fewer timeout errors, although the process eventually crashes as well.

@zyoung1212
Author

zyoung1212 commented Jun 19, 2025

Hi, would you be able to share your benchmark? I'll spend some more time troubleshooting the issue later, and I'm also looking forward to optimizing it together with you.

@Colstuwjx
Contributor

ragflow-perf-test.go.txt

Hi @zyoung1212, you can use this script for the benchmark. I ran my perf test with a different, internal CSV dataset, but the behavior is similar with this script.

@ericwu0930

@Colstuwjx @zyoung1212 The root cause has been identified. On June 11, I started 10 Gunicorn processes. One of them crashed and restarted on June 17. RAGFlow's JWT is derived from secret_key, but the original RAGFlow did not assign a value to secret_key, so it defaulted to the process's startup date. As a result, 9 processes used the key dated 2025-06-11, while the restarted one used the key dated 2025-06-17. If a user's login request is routed to one process but subsequent API calls hit a process with a different key, the JWT fails validation and the user is logged out.
I solved this by setting SECRET_KEY. I don't think there is any need to introduce a Flask Redis session.
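
A minimal sketch of the failure mode described above, assuming the secret falls back to the process start date when SECRET_KEY is unset. It uses the standard library's `hmac` as a stand-in for RAGFlow's actual token serializer, and the names `default_secret`, `sign`, `verify`, and the `user-42` payload are hypothetical:

```python
import hashlib
import hmac
from datetime import date


def default_secret(start_date: date) -> str:
    # Assumed fallback: the secret is the date on which the process started.
    return start_date.isoformat()


def sign(payload: str, secret: str) -> str:
    # Stand-in for token signing: HMAC-SHA256 over the payload.
    return hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()


def verify(payload: str, signature: str, secret: str) -> bool:
    # A worker accepts a token only if it can reproduce the same signature.
    return hmac.compare_digest(sign(payload, secret), signature)


old_worker = default_secret(date(2025, 6, 11))  # the nine original workers
new_worker = default_secret(date(2025, 6, 17))  # the worker restarted after the crash

token = sign("user-42", old_worker)
same_day_ok = verify("user-42", token, old_worker)
cross_day_ok = verify("user-42", token, new_worker)

# With one fixed secret shared by every worker, verification succeeds everywhere.
shared = "a-fixed-secret-from-config"
shared_ok = verify("user-42", sign("user-42", shared), shared)

print(same_day_ok, cross_day_ok, shared_ok)
```

Only the cross-date check fails, which matches the symptom of being logged out when a request lands on the restarted worker. A fixed SECRET_KEY removes the dependency on when each worker started.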

@KevinHuSh @zyoung1212 @Colstuwjx By the way, after upgrading from v0.18 to v0.19, I noticed a gradual memory increase in the ragflow container. The process that was killed on June 17 was likely a victim of this. I don't believe the issue is caused by the introduction of Gunicorn; rather, using Gunicorn may have amplified an existing problem.

Related: #7995.

@Archerzlt

Archerzlt commented Jun 24, 2025

ragflow version 0.19.0
Hi, when I tried to deploy the service with Gunicorn, I encountered some problems.
I added the following code to replace run_simple() in the script ragflow_server.py:

# start http server
try:
    logging.info("RAGFlow HTTP server start...")
    from gunicorn.app.base import BaseApplication
    class StandaloneApplication(BaseApplication):
        def __init__(self, app, options=None):
            self.options = options or {}
            self.application = app
            super().__init__()

        def load_config(self):
            config = {key: value for key, value in self.options.items()
                      if key in self.cfg.settings and value is not None}
            for key, value in config.items():
                self.cfg.set(key.lower(), value)

        def load(self):
            return self.application

    options = {
        'bind': f"{settings.HOST_IP}:{settings.HOST_PORT}",
        'workers': 10,  # recommended: CPU cores * 2 + 1
        'worker_class': 'sync',  # synchronous workers; 'gevent' would give higher concurrency
        'timeout': 60,
        'pidfile': "gunicorn.pid",  # file holding the master process PID
        'accesslog': "access.log",  # access log file ("-" would log to stdout)
        'errorlog': 'error.log',  # error log file ("-" would log to stderr)
        'max_requests': 3000,
        'max_requests_jitter': 300
    }

    StandaloneApplication(app, options).run()

    # run_simple(
    #     hostname=settings.HOST_IP,
    #     port=settings.HOST_PORT,
    #     application=app,
    #     threaded=True,
    #     use_reloader=RuntimeConfig.DEBUG,
    #     use_debugger=RuntimeConfig.DEBUG,
    # )
except Exception:
    traceback.print_exc()
    stop_event.set()
    time.sleep(1)
    os.kill(os.getpid(), signal.SIGKILL)

Problems:
1. Login: authentication problem (screenshot).
2. Clicking the chat button: MySQL error (screenshots).
3. Conversation dialogue: the HTTP request hangs and then returns a 500 Internal Server Error (screenshot).

@zyoung1212 zyoung1212 marked this pull request as draft July 4, 2025 14:03
@xujiesh0510

xujiesh0510 commented Aug 22, 2025

Based on main...ericwu0930:ragflow:feature/support-gunicorn:
Modify conf/gunicorn.conf.py and add

# Gunicorn configuration file for RAGFlow production deployment
import os
if int(os.environ.get('GUNICORN_GEVENT_MODE',0)) == 1:
    from gevent import monkey
    monkey.patch_all()

    # Import gevent for greenlet spawning
    import gevent
    from gevent import spawn
    USE_GEVENT = True
else:
    USE_GEVENT = False

to the beginning of the file. This solves the SSL recursion problem.

Then modify api/wsgi.py, changing
if os.environ.get('GUNICORN_GEVENT_MODE',0) == 1:
to
if int(os.environ.get('GUNICORN_GEVENT_MODE',0)) == 1:
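
A short sketch of why the `int(...)` conversion matters: values read from `os.environ` are always strings, so comparing one directly with the integer `1` is never true. The helper names `gevent_mode_buggy` and `gevent_mode_fixed` are hypothetical and exist only for this illustration:

```python
import os


def gevent_mode_buggy() -> bool:
    # Environment values are strings, and "1" == 1 is always False.
    return os.environ.get("GUNICORN_GEVENT_MODE", 0) == 1


def gevent_mode_fixed() -> bool:
    # Convert to int before comparing; the default 0 covers the unset case.
    return int(os.environ.get("GUNICORN_GEVENT_MODE", 0)) == 1


os.environ["GUNICORN_GEVENT_MODE"] = "1"
buggy_when_set = gevent_mode_buggy()
fixed_when_set = gevent_mode_fixed()

os.environ.pop("GUNICORN_GEVENT_MODE")
fixed_when_unset = gevent_mode_fixed()

print(buggy_when_set, fixed_when_set, fixed_when_unset)
```

With the variable set to `1`, only the converted comparison enables gevent mode. Note that `int()` raises `ValueError` for a non-numeric value such as `true`, so the variable has to hold an integer string.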

@mcj2761358

When started with Gunicorn, multiple workers are enabled (I configured 18 workers locally).

While using RAGFlow, I am redirected to the login page after one or two clicks.

After adding logs, I found that the same query with the same conditions returns results in some workers but not in others.

——————————

My temporary solution is to not use a database connection pool.


@leecj
Contributor

leecj commented Sep 11, 2025

When will this feature be released?

@zyoung1212
Author

Hello everyone, I'm very sorry that this PR has been delayed for so long. It still has many issues at the moment, so it remains in the Draft status. Due to the heavy workload at my company, I haven't been able to continue optimizing the code during this period. I hope I will have time to look into it further in the days ahead. Thanks.

@leecj
Contributor

leecj commented Sep 13, 2025

Hello everyone, I'm very sorry that this PR has been delayed for so long. It still has many issues at the moment, so it remains in the Draft status. Due to the heavy workload at my company, I haven't been able to continue optimizing the code during this period. I hope I will have time to look into it further in the days ahead. Thanks.

Thank you. In earlier versions, I achieved high availability with Gunicorn; the only change needed was moving update_process into a separate file. I wonder whether this PR should provide a simpler implementation, since that would also reduce the compatibility issues that need to be considered.

