11 changes: 11 additions & 0 deletions .flake8
@@ -0,0 +1,11 @@
[flake8]
max-line-length = 79
extend-ignore = E203,W503
exclude =
.git,
__pycache__,
.venv,
venv,
build,
dist,
*.egg-info
37 changes: 37 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,37 @@
# My preferences
My coding style is to NEVER use inline comments. Code should be well written and self-explanatory. If it is not, then it should be refactored. I instead use private functions to explain the code, and use docstrings in the private functions to document what they achieve if it's not obvious from the private function's name and signature.
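
For example, under this preference a small helper reads like the following (a made-up illustration, not code from this repo):

```python
import json


def load_public_config(path: str) -> dict:
    raw = _read_json_file(path)
    return _drop_private_keys(raw)


def _read_json_file(path: str) -> dict:
    with open(path) as handle:
        return json.load(handle)


def _drop_private_keys(config: dict) -> dict:
    """Remove keys starting with an underscore so callers only see public settings."""
    return {key: value for key, value in config.items() if not key.startswith("_")}
```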

In general, I am all for the KISS approach. I write opinionated code where there should be only ONE way of doing things, and it should be very minimal without any surprises. This will very much apply to this project.

# Context
What this project wants to achieve is to build a standard for agentic execution environments. We are partnering with HuggingFace on this, so we want to land in the HuggingFace/PyTorch way of doing things (which, again, is minimal and very Pythonic).

What is an agentic execution environment? We have an opinionated answer for this: we start from the Gym/Gymnasium basic API of:

1. reset(): Start a new trajectory
2. step(): Takes an action, returns an observation (which could contain rewards)
3. render(): Renders the environment to help visualise what the agent sees; example modes are "human", "rgb_array", and "ansi" for text.
4. close(): Closes the environment; important when external software is used, e.g. pygame for rendering or databases
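
A rough sketch of that base interface, assuming nothing beyond the Gym/Gymnasium method names above (the class and type hints are placeholders, not this project's final signatures):

```python
from abc import ABC, abstractmethod
from typing import Any


class Environment(ABC):
    """Gym-style base class: the environment owns the state, not the policy."""

    @abstractmethod
    def reset(self) -> Any:
        """Start a new trajectory and return the first observation."""

    @abstractmethod
    def step(self, action: Any) -> Any:
        """Apply an action and return an observation, which may carry a reward."""

    def render(self, mode: str = "human") -> Any:
        """Visualise what the agent sees, e.g. "human", "rgb_array", or "ansi"."""

    def close(self) -> None:
        """Release external resources such as rendering backends or databases."""
```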

As a reminder, we also follow the "classical RL" approach of having the environment hold the state, NOT the policy. So for example, if your policy is playing chess, the policy will give you a knee-jerk reaction to a board state and tell you the next move, but it is the environment that hosts and updates the board state. The environment is also responsible for showing only the observable part of the state, e.g. in partial-information games like poker, or in a driving sim where a car is hidden behind buildings or other cars. Not every env will have a state, but if a state exists it will live there.
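
As a toy illustration of the environment owning the state and exposing only the observable part (the game and field names are invented for this example):

```python
import random


class ToyCardEnvironment:
    """Holds the full game state; the policy only ever sees the observation."""

    def reset(self) -> dict:
        deck = list(range(52))
        random.shuffle(deck)
        self._agent_hand = deck[:2]
        self._opponent_hand = deck[2:4]
        return self._observation()

    def step(self, action) -> dict:
        """Apply the action (elided in this toy) and return the filtered observation."""
        return self._observation()

    def _observation(self) -> dict:
        """Expose the agent's own cards, never the opponent's hidden hand."""
        return {"hand": list(self._agent_hand)}
```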

We are, however, going to make a few changes on top of it. We follow in the footsteps of TorchRL and bring over the following ideas:

1. We connect it to dataloaders. TorchRL uses the [DataCollector abstraction](https://docs.pytorch.org/rl/main/reference/collectors.html) to loop a policy and its environment together to build the "inner loop" where the env gets actions and the policy gets observations, *and* (crucially) to take a dataloader to load the tasks. This is a genius idea because it allows us to encapsulate all the complexity of dynamic data generation in online RL so that from the outside it looks like "just a fancy dataloader". The negative is that this only works for the narrow application of RL post-training of LLMs and doesn't yet extend to agents. So we are going to have to extend it.

2. We compute rewards by attaching a `transform()` function to the Env's `step()`. This is a very flexible way of doing it, following in TorchVision's footsteps; a rough sketch follows below.
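
A hedged sketch of what that could look like, reusing the reset/step names from the list above (the transform body and class names here are illustrative, not the project's API):

```python
class RewardTransform:
    """Attach a reward to each observation coming out of step()."""

    def __call__(self, observation: dict) -> dict:
        stdout = observation.get("stdout", "")
        observation["reward"] = 1.0 if "42" in stdout else 0.0
        return observation


class TransformedEnvironment:
    """Wrap an environment so every step() result passes through a transform."""

    def __init__(self, env, transform):
        self.env = env
        self.transform = transform

    def reset(self):
        return self.env.reset()

    def step(self, action):
        return self.transform(self.env.step(action))
```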

Finally, we extend to fully-fledged agents. We follow the same approach as CodeAct. Read the paper in the PDF in this directory to make sure you understand it and !!CRITICAL!! ASK QUESTIONS IF YOU DO NOT. In short, every *action* is arbitrary Python code that gets executed and that can chain multiple tool calls in a single action (as opposed to the traditional JSON-style function-calling approach, which spends the whole action on a single tool call).

How does this connect to our env? We run `step()` and eval the Python code that gets sent to us, see this picture:
![agentic_loop.png](./agentic_loop.png)

We do not do this here -- this is just the big picture for you to familiarize yourself with. We assume that our customers will do this, and only this.
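
In other words, the customer-side loop we assume looks roughly like this (`policy.generate_code` is a placeholder for whatever LLM call the customer runs; it is not part of this repo):

```python
def run_episode(env, policy, max_steps: int = 10):
    """The policy writes Python code, the environment executes it via step()."""
    observation = env.reset()
    for _ in range(max_steps):
        code_action = policy.generate_code(observation)
        observation = env.step(code_action)
    return observation
```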

# What I'm giving you
I'm giving you a half-baked implementation connecting Gym and TorchRL that I built for RL post-training a while back. It's not even fully working, but it will give you an idea of what I'm doing, how I write code, what my preferences are, etc.

# What I expect from you
I want you to take my half-baked implementation and graduate it to a CodeAct implementation that can do agents while still supporting RL training (via transforms).

This should be a dynamic session: please ask questions whenever you are not sure about things and use me as your design partner in a pair design session.
89 changes: 89 additions & 0 deletions LINTING_FIXES.md
@@ -0,0 +1,89 @@
# Linting Fixes Applied

## Summary
All flake8 linting errors have been fixed to ensure production-quality code.

## Issues Fixed

### Line Length Violations (E501)
- **Problem**: Lines exceeding 79 characters
- **Solution**: Split long lines using parentheses, broke function parameters across multiple lines, and added intermediate variables where appropriate
- **Files affected**: All source files

### Missing Newlines at End of Files (W292)
- **Problem**: Files missing final newline character
- **Solution**: Added newline character to end of all source files
- **Files affected**: All source files

### Unused Imports (F401)
- **Problem**: Imported `sys` and `Observation` but never used
- **Solution**: Removed unused imports from `environment.py`

### Indentation Issues (E128)
- **Problem**: Continuation lines not properly indented
- **Solution**: Fixed indentation for multi-line function signatures and expressions

## Code Quality Improvements

### Function Signatures
**Before:**
```python
def from_exception(cls, exc: Exception, stdout: str = "", stderr: str = "") -> "ExecutionResult":
```

**After:**
```python
def from_exception(
cls, exc: Exception, stdout: str = "", stderr: str = ""
) -> "ExecutionResult":
```

### Long Expressions
**Before:**
```python
if last_line and not any(last_line.startswith(kw) for kw in ['def ', 'class ', 'if ', 'for ', 'while ', 'with ', 'try:', 'import ', 'from ']):
```

**After:**
```python
keywords = [
'def ', 'class ', 'if ', 'for ', 'while ', 'with ',
'try:', 'import ', 'from '
]
if last_line and not any(
last_line.startswith(kw) for kw in keywords
):
```

### Import Organization
**Before:**
```python
from .types import Action, CodeAction, CodeObservation, CodeState, ExecutionResult, Observation
```

**After:**
```python
from .types import (
Action,
CodeAction,
CodeObservation,
CodeState,
ExecutionResult,
)
```

## Configuration Added

Created `.flake8` configuration file with:
- Max line length: 79 characters
- Ignored rules: E203 (whitespace before ':'), W503 (line break before binary operator)
- Excluded directories: common build and cache directories

## Verification

✅ **Flake8 Check**: No errors or warnings
✅ **Python Syntax**: All files compile successfully
✅ **Functionality**: All tests and examples pass
✅ **Code Style**: Consistent formatting throughout codebase

The codebase now meets production-quality standards and is ready for deployment.
197 changes: 195 additions & 2 deletions README.md
@@ -1,2 +1,195 @@
# envtorch
An environment library for RL and beyond
# EnvTorch: Agentic Execution Environments

A unified framework for CodeAct environments that supports both agent execution and RL training, built on Gym/Gymnasium APIs with PyTorch/HuggingFace integration patterns.

## Overview

EnvTorch provides a standard for agentic execution environments following the CodeAct paradigm, where actions are arbitrary Python code that can chain multiple tool calls. The framework bridges traditional RL environments with modern agent capabilities.

### Key Features

- **CodeAct Execution**: Actions are Python code strings executed in persistent contexts
- **State Persistence**: Variables and functions persist across steps within episodes
- **Tool Integration**: MCP (Model Context Protocol) support for external capabilities
- **RL Compatibility**: Transform system for reward computation and training
- **Error Handling**: Exceptions become observations for agent learning
- **Clean APIs**: Minimal, opinionated design following KISS principles

## Quick Start

```python
from src import create_codeact_env, CodeAction

# Create environment
env = create_codeact_env()
obs = env.reset()

# Execute Python code
action = CodeAction(code="""
x = 10
y = 20
result = x * y
print(f"Result: {result}")
result # Return value
""")

obs = env.step(action)
print(f"Output: {obs.execution_result.stdout}")
print(f"Return: {obs.execution_result.return_value}")
```

## Core Components

### Actions and Observations

```python
# Actions contain arbitrary Python code
action = CodeAction(code="math.sqrt(16)")

# Observations include execution results
obs = env.step(action)
print(obs.execution_result.return_value) # 4.0
print(obs.execution_result.success) # True
print(obs.execution_result.stdout) # Any print output
```

### Tool Integration

```python
from src import create_mcp_environment

# Environment with MCP tools
env = create_mcp_environment()
obs = env.reset()

# Tools available as Python objects
action = CodeAction(code="""
content = "Hello, world!"
file_write("/tmp/hello.txt", content)
result = file_read("/tmp/hello.txt")
print(f"File contents: {result}")
""")

obs = env.step(action)
```

### RL Training with Transforms

```python
from src import create_math_env_transform

# Environment that rewards correct math solutions
transform = create_math_env_transform(expected_answer=42)
env = create_codeact_env()
env.transform = transform

# Agent gets rewarded for correct answers
action = CodeAction(code="21 * 2") # Correct answer
obs = env.step(action)
print(obs.reward) # 1.0 (success) + quality bonuses
```

## Architecture

### Type System
- `Action` / `CodeAction`: Base and concrete action types
- `Observation` / `CodeObservation`: Base and concrete observation types
- `State` / `CodeState`: Environment state with execution context
- `ExecutionResult`: Detailed code execution results (field shapes sketched below)
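
The fields exercised in the examples above suggest shapes roughly like the following; this is a sketch inferred from usage, not the authoritative definitions in `src/types.py`:

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CodeAction:
    code: str


@dataclass
class ExecutionResult:
    stdout: str
    stderr: str
    return_value: Any
    success: bool


@dataclass
class CodeObservation:
    execution_result: ExecutionResult
    reward: Optional[float] = None
```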

### Core Classes
- `Environment`: Base class following Gym API
- `CodeActEnvironment`: Main environment for code execution
- `Transform`: Base class for observation modification
- `ToolRegistry`: Manages available tools and functions

### Transform Examples
- `CodeSafetyTransform`: Penalizes unsafe code patterns
- `MathProblemTransform`: Rewards correct numerical answers
- `CodeQualityTransform`: Evaluates code quality metrics
- `CompositeTransform`: Combines multiple transforms (see the sketch below)
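
A hedged sketch of composing transforms with the classes listed above (constructor arguments and import paths are assumptions; check `src/transforms.py` for the real signatures):

```python
from src import create_codeact_env
from src.transforms import (
    CodeQualityTransform,
    CodeSafetyTransform,
    CompositeTransform,
)

env = create_codeact_env()
env.transform = CompositeTransform([
    CodeSafetyTransform(),
    CodeQualityTransform(),
])
```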

## File Structure

```
src/
├── types.py # Core type definitions
├── interfaces.py # Abstract base classes
├── environment.py # Main CodeAct environment
├── transforms.py # Transform implementations
├── mcp.py # MCP integration
└── __init__.py # Clean exports
```

## Usage Patterns

### Agent Exploration
```python
env = create_codeact_env()
obs = env.reset()

# Multi-step problem solving
action1 = CodeAction(code="data = [1, 2, 3, 4, 5]")
obs = env.step(action1)

action2 = CodeAction(code="mean = sum(data) / len(data); mean")
obs = env.step(action2) # Uses persistent data from step 1
```

### RL Training Loop
```python
# Create environment with reward function
transform = create_safe_env_transform()
env = create_codeact_env()
env.transform = transform

for episode in range(100):
obs = env.reset()
action = generate_action() # From your policy
obs = env.step(action)

reward = obs.reward # Computed by transforms
# Update policy based on reward
```

### Hybrid Agent + RL
```python
# Phase 1: Agent exploration
env = create_codeact_env()
# Agent explores different solution approaches

# Phase 2: RL optimization
env.transform = optimization_transform
# Train to optimize based on exploration insights
```

## Design Principles

- **KISS Approach**: Minimal, opinionated design
- **Single Way**: One clear way to accomplish tasks
- **Pythonic**: Follows PyTorch/HuggingFace patterns
- **No Inline Comments**: Code should be self-explanatory
- **Functional Composition**: Private functions explain complex logic

## Testing

Run the test suite:
```bash
python test_unified.py
```

Run examples:
```bash
python example.py
```

## Requirements

See `requirements.txt` for dependencies. Core requirements:
- Python 3.9+
- PyTorch 2.0+
- HuggingFace datasets

## License

BSD 3-Clause License (see LICENSE file)
Binary file added agentic loop.png
Binary file added codeactpaper.pdf
Binary file not shown.