docs: improve documentation (All-Hands-AI#3425)
* add imports to eval harness

* update out-dated custom sandbox guide

* Update docs/modules/usage/evaluation_harness.md

* remove llm pasta

* update od doc

---------

Co-authored-by: Engel Nyst <[email protected]>
xingyaoww and enyst authored Aug 16, 2024
1 parent 3f20e4e commit 5d92048
Showing 3 changed files with 32 additions and 38 deletions.
docs/modules/usage/custom_sandbox_guide.md: 29 changes (1 addition, 28 deletions)
@@ -90,34 +90,7 @@ Congratulations!

## Technical Explanation

The relevant code is defined in [ssh_box.py](https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/runtime/docker/ssh_box.py) and [image_agnostic_util.py](https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/runtime/docker/image_agnostic_util.py).

In particular, `ssh_box.py` checks the config object for `config.sandbox_container_image` and then attempts to retrieve the image using [get_od_sandbox_image](https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/runtime/docker/image_agnostic_util.py#L72), which is defined in `image_agnostic_util.py`.

The first time a custom image is used it will not be found, so it is built; on subsequent runs the previously built image is found and returned.

The custom image is built using [_build_sandbox_image()](https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/runtime/docker/image_agnostic_util.py#L29), which creates a Dockerfile using your custom image as the base and then configures the environment for OpenDevin, like this:

```python
dockerfile_content = (
    f'FROM {base_image}\n'
    'RUN apt update && apt install -y openssh-server wget sudo\n'
    'RUN mkdir -p -m0755 /var/run/sshd\n'
    'RUN mkdir -p /opendevin && mkdir -p /opendevin/logs && chmod 777 /opendevin/logs\n'
    'RUN echo "" > /opendevin/bash.bashrc\n'
    'RUN if [ ! -d /opendevin/miniforge3 ]; then \\\n'
    ' wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh" && \\\n'
    ' bash Miniforge3-$(uname)-$(uname -m).sh -b -p /opendevin/miniforge3 && \\\n'
    ' chmod -R g+w /opendevin/miniforge3 && \\\n'
    ' bash -c ". /opendevin/miniforge3/etc/profile.d/conda.sh && conda config --set changeps1 False && conda config --append channels conda-forge"; \\\n'
    ' fi\n'
    'RUN /opendevin/miniforge3/bin/pip install --upgrade pip\n'
    'RUN /opendevin/miniforge3/bin/pip install jupyterlab notebook jupyter_kernel_gateway flake8\n'
    'RUN /opendevin/miniforge3/bin/pip install python-docx PyPDF2 python-pptx pylatexenc openai\n'
).strip()
```
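
Putting the pieces together, the lookup-then-build flow looks roughly like the sketch below, written against the Docker Python SDK (`docker-py`). The helper name and the `od_sandbox:` naming scheme are illustrative assumptions; the authoritative logic lives in `image_agnostic_util.py`.

```python
# Illustrative sketch only -- not the actual OpenDevin implementation.
import tempfile
from pathlib import Path

import docker
from docker.errors import ImageNotFound


def get_sandbox_image(base_image: str, client: docker.DockerClient) -> str:
    # Derive the name of the OpenDevin-flavoured image (naming scheme is assumed).
    new_image_name = f"od_sandbox:{base_image.replace('/', '_').replace(':', '_')}"
    try:
        # Built on a previous run: reuse it.
        client.images.get(new_image_name)
    except ImageNotFound:
        # First run with this base image: build it from the generated Dockerfile.
        with tempfile.TemporaryDirectory() as tmpdir:
            dockerfile = f'FROM {base_image}\n'  # plus the RUN lines shown above
            (Path(tmpdir) / 'Dockerfile').write_text(dockerfile)
            client.images.build(path=tmpdir, tag=new_image_name, rm=True)
    return new_image_name


# Usage: get_sandbox_image('ubuntu:22.04', docker.from_env())
```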

> Note: the name of the image is modified via [_get_new_image_name()](https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/runtime/docker/image_agnostic_util.py#L63) and it is the modified name that is searched for on subsequent runs.
Please refer to the [custom docker image section of the runtime documentation](https://docs.all-hands.dev/modules/usage/runtime#advanced-how-opendevin-builds-and-maintains-od-runtime-images) for more details.

## Troubleshooting / Errors

docs/modules/usage/evaluation_harness.md: 36 changes (30 additions, 6 deletions)
@@ -84,9 +84,35 @@ To integrate your own benchmark, we suggest starting with the one that most clos

## How to create an evaluation workflow


To create an evaluation workflow for your benchmark, follow these steps:

1. Create a configuration:
1. Import relevant OpenDevin utilities:
```python
import agenthub
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from opendevin.controller.state.state import State
from opendevin.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from opendevin.core.logger import opendevin_logger as logger
from opendevin.core.main import create_runtime, run_controller
from opendevin.events.action import CmdRunAction
from opendevin.events.observation import CmdOutputObservation, ErrorObservation
from opendevin.runtime.runtime import Runtime
```

2. Create a configuration:
```python
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
config = AppConfig(
@@ -103,15 +129,15 @@
return config
```
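
The elided body above is where the LLM, agent, and sandbox settings get filled in. A rough sketch of what it might look like follows; the field names are assumptions drawn from the existing benchmarks, so verify them against `opendevin.core.config` before copying.

```python
# Hypothetical sketch -- field names are assumptions; check opendevin.core.config.
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
    config = AppConfig(
        default_agent=metadata.agent_class,      # agent under evaluation
        max_iterations=metadata.max_iterations,  # cap on agent steps per instance
        sandbox=SandboxConfig(
            container_image='python:3.11-bookworm',  # whatever image your benchmark needs
            enable_auto_lint=True,
        ),
    )
    config.set_llm_config(metadata.llm_config)   # LLM settings from the eval metadata
    return config
```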

2. Initialize the runtime and set up the evaluation environment:
3. Initialize the runtime and set up the evaluation environment:
```python
async def initialize_runtime(runtime: Runtime, instance: pd.Series):
# Set up your evaluation environment here
# For example, setting environment variables, preparing files, etc.
pass
```
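
For example, a benchmark that needs files or packages in place before the agent starts can issue shell commands through the runtime. A small sketch, assuming the `run_action` pattern used by the existing benchmarks (await it if your version exposes it as a coroutine) and an assumed `instance_id` column:

```python
# Sketch of a possible setup step; adapt the commands to your benchmark.
async def initialize_runtime(runtime: Runtime, instance: pd.Series):
    # Create a working directory for this instance.
    action = CmdRunAction(command='mkdir -p /workspace/data')
    logger.info(action, extra={'msg_type': 'ACTION'})
    obs = runtime.run_action(action)
    assert isinstance(obs, CmdOutputObservation) and obs.exit_code == 0

    # Expose task-specific information to the sandbox (instance_id is an assumed column).
    obs = runtime.run_action(CmdRunAction(command=f'export TASK_ID={instance.instance_id}'))
    assert obs.exit_code == 0
```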

3. Create a function to process each instance:
4. Create a function to process each instance:
```python
async def process_instance(instance: pd.Series, metadata: EvalMetadata) -> EvalOutput:
config = get_config(instance, metadata)
@@ -141,7 +167,7 @@
)
```
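
The elided middle of `process_instance` is where most of the work happens: build the runtime from the config, run your setup, hand the task instruction to the controller, and convert the final state into an `EvalOutput`. A sketch of that flow is below; keyword names such as `task_str` and the exact `EvalOutput` fields are assumptions, and `get_instruction` / `evaluate_final_state` are placeholders you supply, so mirror an existing benchmark for the current API.

```python
# Sketch only -- verify helper signatures against an existing benchmark before use.
async def process_instance(instance: pd.Series, metadata: EvalMetadata) -> EvalOutput:
    config = get_config(instance, metadata)
    reset_logger_for_multiprocessing(logger, instance.instance_id, metadata.eval_output_dir)

    runtime = create_runtime(config)            # may need `await` depending on the version
    await initialize_runtime(runtime, instance)

    instruction = get_instruction(instance, metadata)   # your prompt builder (placeholder)
    state: State | None = await run_controller(
        config=config,
        task_str=instruction,                   # assumed keyword; check run_controller
        runtime=runtime,
        fake_user_response_fn=your_user_response_function,
    )

    # Convert the final State into an EvalOutput; fields depend on your scoring needs.
    return EvalOutput(
        instance_id=str(instance.instance_id),
        test_result=evaluate_final_state(runtime, instance, state),  # placeholder scorer
        metadata=metadata,
    )
```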

4. Run the evaluation:
5. Run the evaluation:
```python
metadata = make_metadata(llm_config, dataset_name, agent_class, max_iterations, eval_note, eval_output_dir)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
@@ -162,8 +188,6 @@ Remember to customize the `get_instruction`, `your_user_response_function`, and

By following this structure, you can create a robust evaluation workflow for your benchmark within the OpenDevin framework.



## Understanding the `user_response_fn`

opendevin/README.md: 5 changes (1 addition, 4 deletions)
@@ -48,8 +48,5 @@ flowchart LR
```

## Runtime
The `Runtime` class is abstract and has several implementations:

* `LocalRuntime` runs commands and edits files directly on the user's machine.
* `DockerRuntime` runs commands inside a Docker sandbox and edits files directly on the user's machine.
* `E2BRuntime` uses [e2b.dev containers](https://github.com/e2b-dev/e2b) to sandbox file and command operations.
Please refer to the [documentation](https://docs.all-hands.dev/modules/usage/runtime) to learn more about `Runtime`.
