157 changes: 157 additions & 0 deletions docs/user-guides/community/openai-moderations-api.md
@@ -0,0 +1,157 @@
# OpenAI Moderations API

NeMo Guardrails supports using the [OpenAI Moderations API](https://platform.openai.com/docs/guides/moderation) as an input or output rail out-of-the-box. You need to set the `OPENAI_API_KEY` environment variable or configure the key when initializing your OpenAI client.

## Basic Usage

```yaml
rails:
  input:
    flows:
      # The simplified version using the flagged response
      - openai moderation

      # The detailed version with individual category checks
      # - openai moderation detailed
```

The `openai moderation` flow uses OpenAI's built-in flagging system to decide if the input should be allowed or not. The `openai moderation detailed` flow checks individual violation categories with custom logic.
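
Once configured, the rail runs automatically on every user turn. Below is a minimal sketch of exercising it from Python; it assumes a `./config` directory containing the YAML above plus your usual model settings.

```python
from nemoguardrails import LLMRails, RailsConfig

# Load the guardrails configuration (YAML + Colang flows) from disk.
config = RailsConfig.from_path("./config")
rails = LLMRails(config)

# The input rail calls the OpenAI Moderations API before the LLM sees the
# message; flagged inputs receive a refusal instead of a model response.
response = rails.generate(messages=[
    {"role": "user", "content": "Hello! What can you do for me?"}
])
print(response["content"])
```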

## Using the Moderation API Endpoint

You can also use the moderation endpoint directly:

```yaml
rails:
  config:
    moderation:
      providers:
        - id: openai-moderation
          provider: openai
          model: omni-moderation-latest
          action: openai_moderation_api
          default: true
```

Then call the endpoint:
```bash
curl -X POST http://localhost:8000/v1/moderations \
  -H "Content-Type: application/json" \
  -d '{
    "input": "...text to classify goes here..."
  }'
```
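
The response mirrors OpenAI's moderation result shape. An illustrative, abbreviated response might look like the following (field values are placeholders):

```json
{
  "results": [
    {
      "flagged": false,
      "categories": {
        "harassment": false,
        "violence": false
      },
      "category_scores": {
        "harassment": 0.0012,
        "violence": 0.0003
      }
    }
  ]
}
```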

## Supported Categories

OpenAI's moderation API detects the following categories:

| Category | Description |
|----------|-------------|
| `harassment` | Content that expresses, incites, or promotes harassing language towards any target |
| `harassment/threatening` | Harassment content that also includes violence or serious harm towards any target |
| `hate` | Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste |
| `hate/threatening` | Hateful content that also includes violence or serious harm towards the targeted group |
| `illicit` | Content that gives advice or instruction on how to commit illicit acts |
| `illicit/violent` | Illicit content that also includes references to violence or procuring a weapon |
| `self-harm` | Content that promotes, encourages, or depicts acts of self-harm |
| `self-harm/intent` | Content where the speaker expresses intent to engage in acts of self-harm |
| `self-harm/instructions` | Content that encourages or gives instructions on acts of self-harm |
| `sexual` | Content meant to arouse sexual excitement or that promotes sexual services |
| `sexual/minors` | Sexual content that includes an individual who is under 18 years old |
| `violence` | Content that depicts death, violence, or physical injury |
| `violence/graphic` | Content that depicts death, violence, or physical injury in graphic detail |

## Customizing Thresholds

To customize the behavior, you can override the [default flows](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/nemoguardrails/library/openai_moderate_text/flows.co) in your config. For example, to create a custom moderation flow:

```colang
define subflow custom openai moderation
  """Custom guardrail with specific threshold logic."""
  $result = execute openai_moderation_api

  # Block if OpenAI flags it as harmful
  if $result.get("flagged", False)
    bot refuse to respond
    stop

  # Custom threshold checks on category scores
  if $result.category_scores.get("violence", 0) > 0.5
    bot inform cannot engage in violent content
    stop

define bot inform cannot engage in violent content
  "I cannot engage with violent content."
```
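
Assuming the subflow above lives in one of your config's `.co` files, you reference it by name in the rails section:

```yaml
rails:
  input:
    flows:
      - custom openai moderation
```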

## Detailed Category Handling

With OpenAI text moderation, you can handle each violation category individually. The API returns a boolean flag and a confidence score for every category. Here's an example of a detailed input moderation flow:

```colang
define flow openai input moderation detailed
  $result = execute openai_moderation_api(text=$user_message)

  if $result.categories.get("harassment", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("hate", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("sexual", False)
    bot inform cannot engage in inappropriate content
    stop

  if $result.categories.get("violence", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

define bot inform cannot engage in abusive or harmful behavior
  "I will not engage in any abusive or harmful behavior."

define bot inform cannot engage in inappropriate content
  "I will not engage with inappropriate content."
```
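
To activate this flow, reference it by name in your rails configuration (a sketch, assuming the flow above is part of your config):

```yaml
rails:
  input:
    flows:
      - openai input moderation detailed
```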

## Using with Category Scores

You can also use the confidence scores for more nuanced control:

```colang
define subflow openai score based moderation
  """Moderation based on confidence scores rather than binary flags."""
  $result = execute openai_moderation_api

  # Custom thresholds for different categories
  if $result.category_scores.get("harassment", 0) > 0.7
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.category_scores.get("hate", 0) > 0.6
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.category_scores.get("violence", 0) > 0.8
    bot inform cannot engage in abusive or harmful behavior
    stop
```
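
Lower thresholds block more aggressively; higher thresholds let more through. A quick way to pick values is to score representative inputs directly against the API. A minimal sketch (assumes `OPENAI_API_KEY` is set; the sample texts are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Score a few representative inputs and inspect the raw category scores
# to decide where your thresholds should sit.
for text in ["Sample benign input.", "Sample borderline input."]:
    resp = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    scores = resp.results[0].category_scores
    print(text, "-> harassment:", scores.harassment, "violence:", scores.violence)
```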

## Environment Setup

Make sure you have your OpenAI API key configured:

```bash
export OPENAI_API_KEY="your-api-key-here"
```

Or you can configure it in your application code when initializing the OpenAI client.
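
For example, a minimal sketch (the key value is a placeholder):

```python
from openai import OpenAI

# Pass the key explicitly instead of relying on the environment variable.
client = OpenAI(api_key="your-api-key-here")
```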

## Installation

The OpenAI moderation integration requires the `openai` package:

```bash
pip install openai
```

This is typically included when you install NeMo Guardrails with OpenAI support.
17 changes: 16 additions & 1 deletion docs/user-guides/guardrails-library.md
@@ -28,7 +28,7 @@ NeMo Guardrails comes with a library of built-in guardrails that you can easily
- [Prompt Security Protection](#prompt-security-protection)
- [Pangea AI Guard](#pangea-ai-guard)
- [Trend Micro Vision One AI Application Security](#trend-micro-vision-one-ai-application-security)
- OpenAI Moderation API - *[COMING SOON]*
- [OpenAI Moderations API](#openai-moderations-api)

4. Other
- [Jailbreak Detection](#jailbreak-detection)
@@ -937,6 +937,21 @@

For more details, check out the [Trend Micro Vision One AI Application Security](./community/trend-micro.md) page.

### OpenAI Moderations API

NeMo Guardrails supports using the [OpenAI Moderation API](https://platform.openai.com/docs/guides/moderation) for input and output content moderation. You need to have the `OPENAI_API_KEY` environment variable set.

#### Example usage

```yaml
rails:
  input:
    flows:
      - openai moderation
```

For more details, check out the [OpenAI Moderations API Integration](./community/openai-moderations-api.md) page.

## Other

### Jailbreak Detection
3 changes: 3 additions & 0 deletions nemoguardrails/library/gcp_moderate_text/actions.py
@@ -142,3 +142,6 @@ async def call_gcp_text_moderation_api(
        violations_dict[violation.name] = violation.confidence

    return {"max_risk_score": max_risk_score, "violations": violations_dict}


call_gcp_text_moderation_api.action_meta.setdefault("tags", []).append("moderation")
14 changes: 14 additions & 0 deletions nemoguardrails/library/openai_moderate_text/__init__.py
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
84 changes: 84 additions & 0 deletions nemoguardrails/library/openai_moderate_text/actions.py
@@ -0,0 +1,84 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
from typing import Optional

try:
    from openai import OpenAI
except ImportError:
    pass


from nemoguardrails.actions import action

log = logging.getLogger(__name__)


def openai_text_moderation_mapping(result: dict) -> bool:
    """Determine whether the output from the OpenAI moderation API should be blocked.

    Returns True (i.e., block the output) if any of the conditions are met.
    """
    # Block immediately if OpenAI's built-in classifier flagged the content.
    # Note: the action below returns the first moderation result already
    # flattened, so the flag and scores live at the top level of the dict.
    if result.get("flagged", False):
        return True

    # Otherwise, block if any individual category score exceeds the threshold.
    confidence_threshold = 0.8
    scores = result.get("category_scores", {})

    if any(score > confidence_threshold for score in scores.values()):
        return True

    # If none of the thresholds are exceeded, allow the output.
    return False


@action(
    name="openai_moderation_api",
    is_system_action=True,
    output_mapping=openai_text_moderation_mapping,
)
async def call_openai_text_moderation_api(
    context: Optional[dict] = None, **kwargs
) -> dict:
    try:
        from openai import OpenAI
    except ImportError:
        raise ImportError(
            "Could not import openai, please install it with `pip install openai`."
        )

    user_message = (context or {}).get("user_message")
    client = OpenAI()

    response = client.moderations.create(
        model="omni-moderation-latest",
        input=user_message,
    )

    # Flatten the first moderation result and normalize the value types so
    # the dict is easy to consume from Colang flows.
    result = response.model_dump()["results"][0]
    result["categories"] = dict(result.get("categories", {}))
    result["category_scores"] = {
        str(k): float(v) for k, v in result.get("category_scores", {}).items()
    }
    result["flagged"] = bool(result.get("flagged", False))

    return result


call_openai_text_moderation_api.action_meta.setdefault("tags", []).append("moderation")
93 changes: 93 additions & 0 deletions nemoguardrails/library/openai_moderate_text/flows.co
@@ -0,0 +1,93 @@
"""
https://platform.openai.com/docs/guides/moderation

Supported Categories:

Category Description
harassment Content that expresses, incites, or promotes harassing language towards any target.
harassment/threatening Harassment content that also includes violence or serious harm towards any target.
hate Content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste. Hateful content aimed at non-protected groups (e.g., chess players) is harassment.
hate/threatening Hateful content that also includes violence or serious harm towards the targeted group based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.
illicit Content that gives advice or instruction on how to commit illicit acts. A phrase like "how to shoplift" would fit this category.
illicit/violent The same types of content flagged by the illicit category, but also includes references to violence or procuring a weapon.
self-harm Content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.
self-harm/intent Content where the speaker expresses that they are engaging or intend to engage in acts of self-harm, such as suicide, cutting, and eating disorders.
self-harm/instructions Content that encourages performing acts of self-harm, such as suicide, cutting, and eating disorders, or that gives instructions or advice on how to commit such acts.
sexual Content meant to arouse sexual excitement, such as the description of sexual activity, or that promotes sexual services (excluding sex education and wellness).
sexual/minors Sexual content that includes an individual who is under 18 years old.
violence Content that depicts death, violence, or physical injury.
violence/graphic Content that depicts death, violence, or physical injury in graphic detail.
"""

define subflow openai moderation
  """Guardrail based on model classification of potentially harmful content."""
  $result = execute openai_moderation_api

  if $result.get("flagged", False)
    bot refuse to respond
    stop

define subflow openai moderation detailed
  """Guardrail based on individual category flags."""
  $result = execute openai_moderation_api

  if $result.categories.get("sexual", False)
    bot inform cannot engage in inappropriate content
    stop

  if $result.categories.get("sexual/minors", False)
    bot inform cannot engage in inappropriate content
    stop

  if $result.categories.get("harassment", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("harassment/threatening", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("hate", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("hate/threatening", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("illicit", False)
    bot inform cannot engage in inappropriate content
    stop

  if $result.categories.get("illicit/violent", False)
    bot inform cannot engage in inappropriate content
    stop

  if $result.categories.get("self-harm", False)
    bot inform cannot engage with sensitive content
    stop

  if $result.categories.get("self-harm/intent", False)
    bot inform cannot engage with sensitive content
    stop

  if $result.categories.get("self-harm/instructions", False)
    bot inform cannot engage with sensitive content
    stop

  if $result.categories.get("violence", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

  if $result.categories.get("violence/graphic", False)
    bot inform cannot engage in abusive or harmful behavior
    stop

define bot inform cannot engage in abusive or harmful behavior
  "I will not engage in any abusive or harmful behavior."

define bot inform cannot engage in inappropriate content
  "I will not engage with inappropriate content."

define bot inform cannot engage with sensitive content
  "I will not engage with sensitive content."