@tgasser-nv tgasser-nv commented Sep 17, 2025

Summary

This PR adds a Mock LLM server and an example Content-Safety Guardrails configuration so the two can be used end-to-end with Guardrails. I have a follow-on PR that uses Locust to run performance benchmarks on Guardrails on a laptop, without any NVCF function calls, local GPUs, or modifications to the Guardrails code.

Description

This PR includes an OpenAI-compatible Mock LLM FastAPI app, intended to mock production LLMs for performance-testing purposes. Its configuration is loaded from a .env file, such as the one below for the Content Safety mock.

MODEL="nvidia/llama-3.1-nemoguard-8b-content-safety"
UNSAFE_PROBABILITY=0.03
UNSAFE_TEXT="{\"User Safety\": \"unsafe\", \"Response Safety\": \"unsafe\", \"Safety Categories\": \"Violence, Criminal Planning/Confessions\"} "
SAFE_TEXT="{\"User Safety\": \"safe\", \"Response Safety\": \"safe\"}"
LATENCY_MIN_SECONDS=0.5
LATENCY_MAX_SECONDS=0.5
LATENCY_MEAN_SECONDS=0.5
LATENCY_STD_SECONDS=0.0
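
The server loads these variables into a settings object at startup (the logs below print a ModelSettings instance). Here is a minimal sketch of what that loading could look like with pydantic-settings; the field types, env-file path, and loading mechanism are assumptions for illustration, not necessarily the code in this PR:

```python
# Illustrative sketch only: assumes pydantic-settings reads the .env file shown above.
from pydantic_settings import BaseSettings, SettingsConfigDict


class ModelSettings(BaseSettings):
    """Mock LLM configuration, populated from environment variables / a .env file."""

    # Placeholder path; the real server takes the config file via --config-file.
    model_config = SettingsConfigDict(env_file="configs/example.env")

    model: str
    unsafe_probability: float
    unsafe_text: str
    safe_text: str
    latency_min_seconds: float
    latency_max_seconds: float
    latency_mean_seconds: float
    latency_std_seconds: float


settings = ModelSettings()  # reads MODEL, UNSAFE_PROBABILITY, LATENCY_* from the env file
```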

The Mock LLM first decides randomly whether to return an unsafe response, with probability UNSAFE_PROBABILITY; this determines whether SAFE_TEXT or UNSAFE_TEXT is returned. It then samples the response latency from a normal distribution (parameterized by LATENCY_MEAN_SECONDS and LATENCY_STD_SECONDS) and clips it to the range [LATENCY_MIN_SECONDS, LATENCY_MAX_SECONDS]. After waiting for the sampled latency, it responds with the selected text.
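
A hedged sketch of that decision and latency logic (the function name and structure are illustrative, not the PR's actual implementation):

```python
import asyncio
import random


async def mock_completion_text(settings: ModelSettings) -> str:
    """Illustrative sketch of the behaviour described above (hypothetical helper)."""
    # Return UNSAFE_TEXT with probability UNSAFE_PROBABILITY, otherwise SAFE_TEXT.
    text = (
        settings.unsafe_text
        if random.random() < settings.unsafe_probability
        else settings.safe_text
    )

    # Sample latency from a normal distribution, then clip to [min, max].
    latency = random.gauss(settings.latency_mean_seconds, settings.latency_std_seconds)
    latency = max(settings.latency_min_seconds, min(settings.latency_max_seconds, latency))

    await asyncio.sleep(latency)  # simulate the model's response time
    return text
```

Note that with LATENCY_MIN_SECONDS equal to LATENCY_MAX_SECONDS (as in the configs above), the clipping makes the latency deterministic.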

Test Plan

This test plan shows how the Mock LLM integrates seamlessly with Guardrails. As long as we characterize our Nemoguard and application LLM latencies correctly and can represent them with a distribution, we can use this setup for performance testing.
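
The Guardrails configuration used below (configs/guardrail_configs, config id content_safety_colang1) needs to point both the main model and the content-safety model at the two mock servers. The snippet below is an assumed sketch of what such a config.yml could look like, not the file shipped in this PR:

```yaml
# Assumed sketch: wire Guardrails to the two OpenAI-compatible mock servers.
models:
  - type: main
    engine: nim
    model: meta/llama-3.3-70b-instruct
    parameters:
      base_url: http://localhost:8000/v1   # Application LLM mock (Terminal 1)
  - type: content_safety
    engine: nim
    model: nvidia/llama-3.1-nemoguard-8b-content-safety
    parameters:
      base_url: http://localhost:8001/v1   # Content Safety mock (Terminal 2)

rails:
  input:
    flows:
      - content safety check input $model=content_safety
  output:
    flows:
      - content safety check output $model=content_safety
```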

Terminal 1 (Application LLM Mock)

$ cd nemoguardrails/benchmark/mock_llm_server
$ poetry run python run_server.py --port 8000 --config-file configs/meta-llama-3.3-70b-instruct.env
2025-10-07 13:35:00 INFO: Using config file: configs/meta-llama-3.3-70b-instruct.env
2025-10-07 13:35:00 INFO: Starting Mock LLM Server on 0.0.0.0:8000
2025-10-07 13:35:00 INFO: OpenAPI docs available at: http://0.0.0.0:8000/docs
2025-10-07 13:35:00 INFO: Health check at: http://0.0.0.0:8000/health
2025-10-07 13:35:00 INFO: Serving model with config configs/meta-llama-3.3-70b-instruct.env
2025-10-07 13:35:00 INFO: Press Ctrl+C to stop the server
INFO:     Loading environment from 'configs/meta-llama-3.3-70b-instruct.env'
INFO:     Started server process [95977]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Returning ModelSettings: %s model='meta/llama-3.3-70b-instruct' unsafe_probability=0.0 unsafe_text="I can't help with that. Is there anything else I can assist you with?" safe_text="I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities." latency_min_seconds=4.0 latency_max_seconds=4.0 latency_mean_seconds=4.0 latency_std_seconds=0.0
2025-10-07 13:37:21 INFO: Request finished: 200, took 4.020 seconds
INFO:     127.0.0.1:60072 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Terminal 2 (Content Safety Mock)

$ cd nemoguardrails/benchmark/mock_llm_server
$ poetry run python run_server.py --port 8001 --config-file configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env
Returning ModelSettings: %s model='nvidia/llama-3.1-nemoguard-8b-content-safety' unsafe_probability=0.03 unsafe_text='{"User Safety": "unsafe", "Response Safety": "unsafe", "Safety Categories": "Violence, Criminal Planning/Confessions"} ' safe_text='{"User Safety": "safe", "Response Safety": "safe"}' latency_min_seconds=0.5 latency_max_seconds=0.5 latency_mean_seconds=0.5 latency_std_seconds=0.0
2025-10-07 13:37:17 INFO: Request finished: 200, took 0.524 seconds
INFO:     127.0.0.1:60070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Returning ModelSettings: %s model='nvidia/llama-3.1-nemoguard-8b-content-safety' unsafe_probability=0.03 unsafe_text='{"User Safety": "unsafe", "Response Safety": "unsafe", "Safety Categories": "Violence, Criminal Planning/Confessions"} ' safe_text='{"User Safety": "safe", "Response Safety": "safe"}' latency_min_seconds=0.5 latency_max_seconds=0.5 latency_mean_seconds=0.5 latency_std_seconds=0.0
2025-10-07 13:37:22 INFO: Request finished: 200, took 0.503 seconds
INFO:     127.0.0.1:60076 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Terminal 3 (Guardrails production code)

$ cd nemoguardrails/benchmark/mock_llm_server
$ poetry run nemoguardrails server --port 9000 --config configs/guardrail_configs --default-config-id content_safety_colang1

INFO:     Started server process [96087]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)
/Users/tgasser/projects/nemo_guardrails/nemoguardrails/server/api.py:221: UserWarning: No config_id or config_ids provided, using default config_id
  warnings.warn(
INFO:nemoguardrails.server.api:Got request for config None
Entered verbose mode.
13:37:16.148 | Registered Actions ['ClavataCheckAction', 'GetAttentionPercentageAction', 'GetCurrentDateTimeAction', 
'UpdateAttentionMaterializedViewAction', 'alignscore request', 'alignscore_check_facts', 'autoalign_factcheck_output_api', 
'autoalign_groundedness_output_api', 'autoalign_input_api', 'autoalign_output_api', 'call cleanlab api', 'call fiddler faithfulness', 
'call fiddler safety on bot message', 'call fiddler safety on user message', 'call gcpnlp api', 'call_activefence_api', 
'content_safety_check_input', 'content_safety_check_output', 'create_event', 'detect_pii', 'detect_sensitive_data', 
'injection_detection', 'jailbreak_detection_heuristics', 'jailbreak_detection_model', 'llama_guard_check_input', 
'llama_guard_check_output', 'mask_pii', 'mask_sensitive_data', 'pangea_ai_guard', 'patronus_api_check_output', 
'patronus_lynx_check_output_hallucination', 'protect_text', 'retrieve_relevant_chunks', 'self_check_facts', 'self_check_hallucination', 
'self_check_input', 'self_check_output', 'summarize_document', 'topic_safety_check_input', 'validate_guardrails_ai_input', 
'validate_guardrails_ai_output', 'wolfram alpha request']
13:37:17.122 | Event UtteranceUserActionFinished | {'final_transcript': 'what can you do for me?'}
13:37:17.123 | Event StartInternalSystemAction | {'uid': '453f...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'StartInputRails'}}, 'action_result_key': None, 'action_uid': 'e65b...', 'is_system_action': True}
13:37:17.124 | Executing action create_event
13:37:17.124 | Event StartInputRails | {'uid': '0880...'}
13:37:17.124 | Event StartInternalSystemAction | {'uid': 'ed33...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'StartInputRail', 'flow_id': '$triggered_input_rail'}}, 'action_result_key': None, 'action_uid': 'ebb1...', 'is_system_action': True}
13:37:17.125 | Executing action create_event
13:37:17.125 | Event StartInputRail | {'uid': '9a9f...', 'flow_id': 'content safety check input $model=content_safety'}
13:37:17.125 | Event StartInternalSystemAction | {'uid': 'f497...', 'action_name': 'content_safety_check_input', 'action_params': {}, 
'action_result_key': 'response', 'action_uid': '3a0a...', 'is_system_action': False}
13:37:17.125 | Executing action content_safety_check_input
13:37:17.127 | Invocation Params {'_type': 'chat-nvidia-ai-playground', 'stop': None}

LLM Prompt (4a790..) - content_safety_check_input $model=content_safety
                                                                                                                                         
<**Tim: Snipped the input-rail content-safety prompt for clarity**>

LLM Completion (4a790..)
{"User Safety": "safe", "Response Safety": "safe"}                                                                                       

13:37:17.664 | Output Stats None
13:37:17.665 | LLM call took 0.53 seconds
13:37:17.666 | Event InternalSystemActionFinished | {'uid': '0829...', 'action_uid': '3a0a...', 'action_name': 
'content_safety_check_input', 'action_params': {}, 'action_result_key': 'response', 'status': 'success', 'is_success': True, 
'return_value': {'allowed': True, 'policy_violations': []}, 'events': [], 'is_system_action': False}
13:37:17.668 | Event StartInternalSystemAction | {'uid': 'c51a...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'InputRailFinished', 'flow_id': '$triggered_input_rail'}}, 'action_result_key': None, 'action_uid': '58ff...', 'is_system_action': True}
13:37:17.668 | Executing action create_event
13:37:17.669 | Event InputRailFinished | {'uid': 'a44e...', 'flow_id': 'content safety check input $model=content_safety'}
13:37:17.670 | Event StartInternalSystemAction | {'uid': '1a35...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'InputRailsFinished'}}, 'action_result_key': None, 'action_uid': '45db...', 'is_system_action': True}
13:37:17.671 | Executing action create_event
13:37:17.672 | Event InputRailsFinished | {'uid': '68eb...'}
13:37:17.673 | Event StartInternalSystemAction | {'uid': '0df6...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'UserMessage', 'text': '$user_message'}}, 'action_result_key': None, 'action_uid': '6fbd...', 'is_system_action': True}
13:37:17.674 | Executing action create_event
13:37:17.674 | Event UserMessage | {'uid': '2626...', 'text': 'what can you do for me?'}
13:37:17.675 | Event StartInternalSystemAction | {'uid': '27cf...', 'action_name': 'generate_user_intent', 'action_params': {}, 
'action_result_key': None, 'action_uid': '8722...', 'is_system_action': True}
13:37:17.675 | Executing action generate_user_intent
13:37:17.680 | Invocation Params {'_type': 'chat-nvidia-ai-playground', 'stop': ['User:']}

LLM Prompt (79bb7..) - general
                                                                                                                                         
System                                                                                                                                   
Below is a conversation between a helpful AI assistant and a user. The bot is designed to generate human-like text based on the input 
that it receives. The bot is talkative and provides lots of specific details. If the bot does not know the answer to a question, it 
truthfully says it does not know.                                                                                                        
User                                                                                                                                     
what can you do for me?                                                                                                                  


LLM Completion (79bb7..)
I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help 
with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or 
illegal activities.                                                                                                                      

13:37:21.722 | Output Stats None
13:37:21.723 | LLM call took 4.04 seconds
13:37:21.724 | Event BotMessage | {'uid': '165a...', 'text': "I can provide information and help with a wide range of topics, from 
science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text 
summarization. However, I can't assist with requests that involve harm or illegal activities."}
13:37:21.726 | Event StartInternalSystemAction | {'uid': '952a...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'StartOutputRails'}}, 'action_result_key': None, 'action_uid': 'b7b2...', 'is_system_action': True}
13:37:21.727 | Executing action create_event
13:37:21.727 | Event StartOutputRails | {'uid': '9b33...'}
13:37:21.729 | Event StartInternalSystemAction | {'uid': '9124...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'StartOutputRail', 'flow_id': '$triggered_output_rail'}}, 'action_result_key': None, 'action_uid': '0a9e...', 'is_system_action': True}
13:37:21.730 | Executing action create_event
13:37:21.730 | Event StartOutputRail | {'uid': '84c7...', 'flow_id': 'content safety check output $model=content_safety'}
13:37:21.733 | Event StartInternalSystemAction | {'uid': 'ca6c...', 'action_name': 'content_safety_check_output', 'action_params': {}, 
'action_result_key': 'response', 'action_uid': '8ae0...', 'is_system_action': False}
13:37:21.734 | Executing action content_safety_check_output
13:37:21.736 | Invocation Params {'_type': 'chat-nvidia-ai-playground', 'stop': None}

LLM Prompt (ced97..) - content_safety_check_output $model=content_safety
                                                                                                                                         
<**Tim: Snipped the output-rail content-safety prompt for clarity**>

LLM Completion (ced97..)
{"User Safety": "safe", "Response Safety": "safe"}                                                                                       

13:37:22.248 | Output Stats None
13:37:22.248 | LLM call took 0.51 seconds
13:37:22.249 | Event InternalSystemActionFinished | {'uid': '1e53...', 'action_uid': '8ae0...', 'action_name': 
'content_safety_check_output', 'action_params': {}, 'action_result_key': 'response', 'status': 'success', 'is_success': True, 
'return_value': {'allowed': True, 'policy_violations': []}, 'events': [], 'is_system_action': False}
13:37:22.251 | Event StartInternalSystemAction | {'uid': '5428...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'OutputRailFinished', 'flow_id': '$triggered_output_rail'}}, 'action_result_key': None, 'action_uid': '2bec...', 'is_system_action': 
True}
13:37:22.251 | Executing action create_event
13:37:22.252 | Event OutputRailFinished | {'uid': '92aa...', 'flow_id': 'content safety check output $model=content_safety'}
13:37:22.254 | Event StartInternalSystemAction | {'uid': 'e3e0...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'OutputRailsFinished'}}, 'action_result_key': None, 'action_uid': '1979...', 'is_system_action': True}
13:37:22.254 | Executing action create_event
13:37:22.254 | Event OutputRailsFinished | {'uid': 'eb79...'}
13:37:22.256 | Event StartInternalSystemAction | {'uid': '15e4...', 'action_name': 'create_event', 'action_params': {'event': {'_type': 
'StartUtteranceBotAction', 'script': '$bot_message'}}, 'action_result_key': None, 'action_uid': '5e90...', 'is_system_action': True}
13:37:22.256 | Executing action create_event
13:37:22.256 | Event StartUtteranceBotAction | {'uid': 'c53b...', 'script': "I can provide information and help with a wide range of 
topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text 
summarization. However, I can't assist with requests that involve harm or illegal activities.", 'action_uid': '6587...'}
13:37:22.258 | Total processing took 5.14 seconds. LLM Stats: 3 total calls, 5.08 total time, 998 total tokens, 903 total prompt tokens, 
95 total completion tokens, [0.53, 4.04, 0.51] as latencies
INFO:     127.0.0.1:60066 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Terminal 4 (Client issuing request)

 ~ curl -X POST http://0.0.0.0:9000/v1/chat/completions \
   -H 'Accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "model": "meta/llama-3.3-70b-instruct",
      "messages": [
         {
            "role": "user",
            "content": "what can you do for me?"
         }
      ],
      "max_tokens": 16,
      "stream": false,
      "temperature": 1,
      "top_p": 1
   }' | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   602  100   334  100   268     53     43  0:00:06  0:00:06 --:--:--    79
{
  "messages": [
    {
      "role": "assistant",
      "content": "I can provide information and help with a wide range of topics, from science and history to entertainment and culture. I can also help with language-related tasks, such as translation and text summarization. However, I can't assist with requests that involve harm or illegal activities."
    }
  ]
}
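
The same request can also be issued programmatically, e.g. from a benchmark script like the Locust tests in the follow-on PR. A minimal sketch using requests, assuming the Guardrails server from Terminal 3 is listening on port 9000:

```python
import requests

# Same payload as the curl example above.
payload = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "what can you do for me?"}],
    "max_tokens": 16,
    "stream": False,
    "temperature": 1,
    "top_p": 1,
}

resp = requests.post("http://0.0.0.0:9000/v1/chat/completions", json=payload, timeout=30)
resp.raise_for_status()

# The Guardrails server responds with {"messages": [{"role": "assistant", "content": ...}]}.
print(resp.json()["messages"][0]["content"])
```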

Related Issue(s)

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@tgasser-nv tgasser-nv self-assigned this Sep 17, 2025
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.88%. Comparing base (2af64d6) to head (d9b73be).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1403      +/-   ##
===========================================
+ Coverage    71.66%   71.88%   +0.22%     
===========================================
  Files          171      174       +3     
  Lines        17020    17154     +134     
===========================================
+ Hits         12198    12332     +134     
  Misses        4822     4822              
| Flag | Coverage Δ |
|------|------------|
| python | 71.88% <100.00%> (+0.22%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| nemoguardrails/benchmark/mock_llm_server/api.py | 100.00% <100.00%> (ø) |
| nemoguardrails/benchmark/mock_llm_server/models.py | 100.00% <100.00%> (ø) |
| ...rdrails/benchmark/mock_llm_server/response_data.py | 100.00% <100.00%> (ø) |


codecov-commenter commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 80.81633% with 47 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...guardrails/benchmark/mock_llm_server/run_server.py | 0.00% | 44 Missing ⚠️ |
| nemoguardrails/benchmark/mock_llm_server/config.py | 87.50% | 3 Missing ⚠️ |

