11 changes: 11 additions & 0 deletions .flake8
@@ -0,0 +1,11 @@
[flake8]
max-line-length = 79
extend-ignore = E203,W503
exclude =
.git,
__pycache__,
.venv,
venv,
build,
dist,
*.egg-info
37 changes: 37 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,37 @@
# My preferences
My coding style is to NEVER use inline comments. Code should be well written and self-explanatory. If it is not, then it should be refactored. I instead use private functions to explain the code, and use docstrings in the private functions to document what they achieve if it's not obvious from the private function's name and signature.
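
For example, under this preference a small helper reads like the following (a made-up illustration, not code from this repo):

```python
import json


def load_public_config(path: str) -> dict:
    raw = _read_json_file(path)
    return _drop_private_keys(raw)


def _read_json_file(path: str) -> dict:
    with open(path) as handle:
        return json.load(handle)


def _drop_private_keys(config: dict) -> dict:
    """Remove keys starting with an underscore so callers only see public settings."""
    return {key: value for key, value in config.items() if not key.startswith("_")}
```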

In general, I am all for the KISS approach. I write opinionated code where there should be only ONE way of doing things, and it should be very minimal without any surprises. This will very much apply to this project.

# Context
What this project wants to achieve is to build a standard for agentic execution environments. We are partnering with HuggingFace on this, so we want to land in the HuggingFace/PyTorch way of doing things (which, again, is minimal and very Pythonic).

What is an agentic execution environment? We have an opinionated answer for this: we start from the Gym/Gymnasium basic API of:

1. reset(): Start a new trajectory
2. step(): Takes an action, returns an observation (which could contain rewards)
3. render(): Renders the environment to help visualise what the agent sees; example modes are "human", "rgb_array", and "ansi" for text.
4. close(): Closes the environment; important when external software is used, e.g. pygame for rendering or databases
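
A rough sketch of that base interface, assuming nothing beyond the Gym/Gymnasium method names above (the class and type hints are placeholders, not this project's final signatures):

```python
from abc import ABC, abstractmethod
from typing import Any


class Environment(ABC):
    """Gym-style base class: the environment owns the state, not the policy."""

    @abstractmethod
    def reset(self) -> Any:
        """Start a new trajectory and return the first observation."""

    @abstractmethod
    def step(self, action: Any) -> Any:
        """Apply an action and return an observation, which may carry a reward."""

    def render(self, mode: str = "human") -> Any:
        """Visualise what the agent sees, e.g. "human", "rgb_array", or "ansi"."""

    def close(self) -> None:
        """Release external resources such as rendering backends or databases."""
```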

As a reminder, we also follow the "classical RL" approach of having the environment hold the state, NOT the policy. So for example, if your policy is playing chess, the policy will give you a knee-jerk reaction to a board state and tell you the next move, but it is the environment that hosts and updates the board state. The environment is also responsible for showing only the observable part of the state, e.g. in partial-information games like poker, or in a driving sim where a car is hidden behind buildings or other cars. Not every env will have a state, but if a state exists it will live there.
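
As a toy illustration of the environment owning the state and exposing only the observable part (the game and field names are invented for this example):

```python
import random


class ToyCardEnvironment:
    """Holds the full game state; the policy only ever sees the observation."""

    def reset(self) -> dict:
        deck = list(range(52))
        random.shuffle(deck)
        self._agent_hand = deck[:2]
        self._opponent_hand = deck[2:4]
        return self._observation()

    def step(self, action) -> dict:
        """Apply the action (elided in this toy) and return the filtered observation."""
        return self._observation()

    def _observation(self) -> dict:
        """Expose the agent's own cards, never the opponent's hidden hand."""
        return {"hand": list(self._agent_hand)}
```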

We are, however, going to make a few changes on top of it. We follow in the footsteps of TorchRL and bring over the following ideas:

1. We connect it to dataloaders. TorchRL uses the [DataCollector abstraction](https://docs.pytorch.org/rl/main/reference/collectors.html) to loop a policy and its environment together to build the "inner loop" where the env gets actions and the policy gets observations, *and* (crucially) to take a dataloader to load the tasks. This is a genius idea because it allows us to encapsulate all the complexity of dynamic data generation in online RL so that from the outside it looks like "just a fancy dataloader". The negative is that this only works for the narrow application of RL post-training of LLMs and doesn't yet extend to agents. So we are going to have to extend it.

2. We compute rewards by attaching a `transform()` function to the Env's `step()`. This is a very flexible way of doing it, following in TorchVision's footsteps; a rough sketch follows below.
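
A hedged sketch of what that could look like, reusing the reset/step names from the list above (the transform body and class names here are illustrative, not the project's API):

```python
class RewardTransform:
    """Attach a reward to each observation coming out of step()."""

    def __call__(self, observation: dict) -> dict:
        stdout = observation.get("stdout", "")
        observation["reward"] = 1.0 if "42" in stdout else 0.0
        return observation


class TransformedEnvironment:
    """Wrap an environment so every step() result passes through a transform."""

    def __init__(self, env, transform):
        self.env = env
        self.transform = transform

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.transform(self.env.step(action))
```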

Finally, we extend to fully-fledged agents. We follow the same approach as CodeAct. Read the paper in the PDF in this directory to make sure you understand it and !!CRITICAL!! ASK QUESTIONS IF YOU DO NOT. In short, every *action* is arbitrary Python code that gets executed and that can chain multiple tool calls in a single action (as opposed to the traditional JSON-style function-calling approach, which spends the whole action on a single tool call).

How does this connect to our env? We run `step()` and eval the Python code that gets sent to us, see this picture:
![agentic_loop.png](./agentic_loop.png)

We do not do this here -- this is just the big picture for you to familiarize yourself with. We assume that our customers will do this, and only this.
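
In other words, the customer-side loop we assume looks roughly like this (`policy.generate_code` is a placeholder for whatever LLM call the customer runs; it is not part of this repo):

```python
def run_episode(env, policy, max_steps: int = 10):
    """The policy writes Python code, the environment executes it via step()."""
    observation = env.reset()
    for _ in range(max_steps):
        code_action = policy.generate_code(observation)
        observation = env.step(code_action)
    return observation
```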

# What I'm giving you
I'm giving you a half-baked implementation connecting Gym and TorchRL that I built for RL post-training a while back. It's not even fully working, but it will give you an idea of what I'm doing, how I write code, what my preferences are, etc.

# What I expect from you
I want you to take my half-baked implementation and graduate it to a CodeAct implementation that can do agents while still supporting RL training (via transforms).

This should be a dynamic session: please ask questions whenever you are not sure about things and use me as your design partner in a pair design session.
89 changes: 89 additions & 0 deletions LINTING_FIXES.md
@@ -0,0 +1,89 @@
# Linting Fixes Applied

## Summary
All flake8 linting errors have been fixed to ensure production-quality code.

## Issues Fixed

### Line Length Violations (E501)
- **Problem**: Lines exceeding 79 characters
- **Solution**: Split long lines using parentheses, broke function parameters across multiple lines, and added intermediate variables where appropriate
- **Files affected**: All source files

### Missing Newlines at End of Files (W292)
- **Problem**: Files missing final newline character
- **Solution**: Added newline character to end of all source files
- **Files affected**: All source files

### Unused Imports (F401)
- **Problem**: Imported `sys` and `Observation` but never used
- **Solution**: Removed unused imports from `environment.py`

### Indentation Issues (E128)
- **Problem**: Continuation lines not properly indented
- **Solution**: Fixed indentation for multi-line function signatures and expressions

## Code Quality Improvements

### Function Signatures
**Before:**
```python
def from_exception(cls, exc: Exception, stdout: str = "", stderr: str = "") -> "ExecutionResult":
```

**After:**
```python
def from_exception(
cls, exc: Exception, stdout: str = "", stderr: str = ""
) -> "ExecutionResult":
```

### Long Expressions
**Before:**
```python
if last_line and not any(last_line.startswith(kw) for kw in ['def ', 'class ', 'if ', 'for ', 'while ', 'with ', 'try:', 'import ', 'from ']):
```

**After:**
```python
keywords = [
'def ', 'class ', 'if ', 'for ', 'while ', 'with ',
'try:', 'import ', 'from '
]
if last_line and not any(
last_line.startswith(kw) for kw in keywords
):
```

### Import Organization
**Before:**
```python
from .types import Action, CodeAction, CodeObservation, CodeState, ExecutionResult, Observation
```

**After:**
```python
from .types import (
Action,
CodeAction,
CodeObservation,
CodeState,
ExecutionResult,
)
```

## Configuration Added

Created `.flake8` configuration file with:
- Max line length: 79 characters
- Ignored rules: E203 (whitespace before ':'), W503 (line break before binary operator)
- Excluded directories: common build and cache directories

## Verification

✅ **Flake8 Check**: No errors or warnings
✅ **Python Syntax**: All files compile successfully
✅ **Functionality**: All tests and examples pass
✅ **Code Style**: Consistent formatting throughout codebase

The codebase now meets production-quality standards and is ready for deployment.
197 changes: 195 additions & 2 deletions README.md
@@ -1,2 +1,195 @@
# envtorch
An environment library for RL and beyond
# EnvTorch: Agentic Execution Environments

A unified framework for CodeAct environments that supports both agent execution and RL training, built on Gym/Gymnasium APIs with PyTorch/HuggingFace integration patterns.

## Overview

EnvTorch provides a standard for agentic execution environments following the CodeAct paradigm, where actions are arbitrary Python code that can chain multiple tool calls. The framework bridges traditional RL environments with modern agent capabilities.

### Key Features

- **CodeAct Execution**: Actions are Python code strings executed in persistent contexts
- **State Persistence**: Variables and functions persist across steps within episodes
- **Tool Integration**: MCP (Model Context Protocol) support for external capabilities
- **RL Compatibility**: Transform system for reward computation and training
- **Error Handling**: Exceptions become observations for agent learning
- **Clean APIs**: Minimal, opinionated design following KISS principles

## Quick Start

```python
from src import create_codeact_env, CodeAction

# Create environment
env = create_codeact_env()
obs = env.reset()

# Execute Python code
action = CodeAction(code="""
x = 10
y = 20
result = x * y
print(f"Result: {result}")
result # Return value
""")

obs = env.step(action)
print(f"Output: {obs.execution_result.stdout}")
print(f"Return: {obs.execution_result.return_value}")
```

## Core Components

### Actions and Observations

```python
# Actions contain arbitrary Python code
action = CodeAction(code="math.sqrt(16)")

# Observations include execution results
obs = env.step(action)
print(obs.execution_result.return_value) # 4.0
print(obs.execution_result.success) # True
print(obs.execution_result.stdout) # Any print output
```

### Tool Integration

```python
from src import create_mcp_environment

# Environment with MCP tools
env = create_mcp_environment()
obs = env.reset()

# Tools available as Python objects
action = CodeAction(code="""
content = "Hello, world!"
file_write("/tmp/hello.txt", content)
result = file_read("/tmp/hello.txt")
print(f"File contents: {result}")
""")

obs = env.step(action)
```

### RL Training with Transforms

```python
from src import create_math_env_transform

# Environment that rewards correct math solutions
transform = create_math_env_transform(expected_answer=42)
env = create_codeact_env()
env.transform = transform

# Agent gets rewarded for correct answers
action = CodeAction(code="21 * 2") # Correct answer
obs = env.step(action)
print(obs.reward) # 1.0 (success) + quality bonuses
```

## Architecture

### Type System
- `Action` / `CodeAction`: Base and concrete action types
- `Observation` / `CodeObservation`: Base and concrete observation types
- `State` / `CodeState`: Environment state with execution context
- `ExecutionResult`: Detailed code execution results (field shapes sketched below)
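
The fields exercised in the examples above suggest shapes roughly like the following; this is a sketch inferred from usage, not the authoritative definitions in `src/types.py`:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CodeAction:
    code: str


@dataclass
class ExecutionResult:
    stdout: str
    stderr: str
    return_value: Any
    success: bool


@dataclass
class CodeObservation:
    execution_result: ExecutionResult
    reward: Optional[float] = None
```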

### Core Classes
- `Environment`: Base class following Gym API
- `CodeActEnvironment`: Main environment for code execution
- `Transform`: Base class for observation modification
- `ToolRegistry`: Manages available tools and functions

### Transform Examples
- `CodeSafetyTransform`: Penalizes unsafe code patterns
- `MathProblemTransform`: Rewards correct numerical answers
- `CodeQualityTransform`: Evaluates code quality metrics
- `CompositeTransform`: Combines multiple transforms (see the sketch below)
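
A hedged sketch of composing transforms with the classes listed above (constructor arguments and import paths are assumptions; check `src/transforms.py` for the real signatures):

```python
from src import create_codeact_env
from src.transforms import (
    CodeQualityTransform,
    CodeSafetyTransform,
    CompositeTransform,
)

env = create_codeact_env()
env.transform = CompositeTransform([
    CodeSafetyTransform(),
    CodeQualityTransform(),
])
```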

## File Structure

```
src/
├── types.py # Core type definitions
├── interfaces.py # Abstract base classes
├── environment.py # Main CodeAct environment
├── transforms.py # Transform implementations
├── mcp.py # MCP integration
└── __init__.py # Clean exports
```

## Usage Patterns

### Agent Exploration
```python
env = create_codeact_env()
obs = env.reset()

# Multi-step problem solving
action1 = CodeAction(code="data = [1, 2, 3, 4, 5]")
obs = env.step(action1)

action2 = CodeAction(code="mean = sum(data) / len(data); mean")
obs = env.step(action2) # Uses persistent data from step 1
```

### RL Training Loop
```python
# Create environment with reward function
transform = create_safe_env_transform()
env = create_codeact_env()
env.transform = transform

for episode in range(100):
obs = env.reset()
action = generate_action() # From your policy
obs = env.step(action)

reward = obs.reward # Computed by transforms
# Update policy based on reward
```

### Hybrid Agent + RL
```python
# Phase 1: Agent exploration
env = create_codeact_env()
# Agent explores different solution approaches

# Phase 2: RL optimization
env.transform = optimization_transform
# Train to optimize based on exploration insights
```

## Design Principles

- **KISS Approach**: Minimal, opinionated design
- **Single Way**: One clear way to accomplish tasks
- **Pythonic**: Follows PyTorch/HuggingFace patterns
- **No Inline Comments**: Code should be self-explanatory
- **Functional Composition**: Private functions explain complex logic

## Testing

Run the test suite:
```bash
python test_unified.py
```

Run examples:
```bash
python example.py
```

## Requirements

See `requirements.txt` for dependencies. Core requirements:
- Python 3.9+
- PyTorch 2.0+
- HuggingFace datasets

## License

BSD 3-Clause License (see LICENSE file)
Binary file added agentic loop.png
Binary file added codeactpaper.pdf
Binary file not shown.