# Use provided SentinelBench URL (has default value)
+    base_website_path = args['sentinelbench_url']
+    # Ensure URL ends with slash
+    if not base_website_path.endswith('/'):
+        base_website_path += '/'
 
     benchmark = SentinelBenchBenchmark(
         data_dir=data_dir,
@@ -277,7 +275,7 @@ def main(
     difficulty: Annotated[Optional[str], typer.Option(help="⚡ Filter tasks by difficulty level or multiple levels separated by commas (e.g., 'easy,medium')", rich_help_panel="🛡️ SentinelBench Options")] = None,
     use_test_variants: Annotated[bool, typer.Option(help="🧪 Use test variants for SentinelBench (smaller set)", rich_help_panel="🛡️ SentinelBench Options")] = False,
     use_full_variants: Annotated[bool, typer.Option(help="🎛️ Use full variants for SentinelBench (all combinations)", rich_help_panel="🛡️ SentinelBench Options")] = False,
SentinelBench: A benchmark for evaluating AI agents on monitoring and long-term observation tasks.
-
 This benchmark focuses on testing AI agents' capabilities in persistent monitoring, state change detection, and task completion under varying complexity and noise levels.
 
-The benchmark includes 18 interactive web-based tasks designed around monitoring scenarios, from simple button pressing to complex social media monitoring.
-
-## Task Characterization
-
-Each task includes several dimensions for analysis:
-- **difficulty**: easy, medium, hard
-- **base_task**: underlying task type (e.g., reactor, animal-mover, button-presser)
-- **duration**: Short, Medium, Long
-- **criteria**: Objective, Subjective, Mixed
-- **activity**: Active (requires user interaction), Passive (monitoring/waiting)
-- **noise**: Low, Medium, High
-- **realism**: Playful, Realistic
-
 ## Usage
 
 To run SentinelBench evaluations:
@@ -23,6 +8,12 @@ To run SentinelBench evaluations:
 python experiments/eval/run.py --current-dir . --dataset SentinelBench --split test --run-id 1 --simulated-user-type none --parallel 1 --config experiments/endpoint_configs/config.yaml --mode run
 ```
 
+**Note**: The above command uses the default SentinelBench URL (`https://sentinel-bench.vercel.app/`). If you're hosting SentinelBench locally or at a different URL, specify it with `--sentinelbench-url`:
+
+```bash
+python experiments/eval/run.py --current-dir . --dataset SentinelBench --split test --run-id 1 --simulated-user-type none --parallel 1 --config experiments/endpoint_configs/config.yaml --mode run --sentinelbench-url http://YOUR_HOST_IP:5173/
+```
+
 ### Task Filtering
 
 SentinelBench supports filtering tasks to run specific subsets:
python experiments/eval/run.py --current-dir . --dataset SentinelBench --split test --run-id 1 --simulated-user-type none --parallel 1 --config experiments/endpoint_configs/config.yaml --mode run --base-task animal-mover --difficulty medium
 ```
 
-## Local Hosting
+*For all examples above, add `--sentinelbench-url http://YOUR_HOST_IP:5173/` if using a custom URL instead of the default.*
+
+## URL Configuration
 
-SentinelBench is designed to be hosted locally during development and testing. The default configuration expects the benchmark website to be running at `http://172.25.159.193:5173/`.
+SentinelBench evaluations use `https://sentinel-bench.vercel.app/` by default. For local development or custom deployments, you can override this URL.
 
+### Using Default (Production) URL
+No additional configuration needed - just run the commands as shown above.
+
+### Using Custom/Local URL
+Add `--sentinelbench-url` parameter to specify your custom URL:
2. Install dependencies and start the development server with: `npm run dev -- --host 0.0.0.0`
-3. Ensure it's accessible at the expected URL (http://172.25.159.193:5173/)
+1. Clone the MagenticUI repository
+2. Navigate to the SentinelBench/ directory
+3. Install dependencies and start the development server with: `npm run dev -- --host 0.0.0.0`
+4. Note the IP address and port where the server is running (typically shown in the terminal output)
+5. Use this URL with the `--sentinelbench-url` parameter
+
+**Common local URLs:**
+- Local development: `http://localhost:5173/` or `http://127.0.0.1:5173/`
+- Network accessible: `http://YOUR_MACHINE_IP:5173/` (replace YOUR_MACHINE_IP with your actual IP)
+- Docker/VM: Check your container/VM's IP address
+
+## Running Analysis
+
+We provide all analysis scripts in the tools/ subdirectory. This subdirectory also contains a README.md file that explains the order in which the tools should be run and how to make better use of them.
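
The URL handling this change describes (fall back to the default URL, then make sure the base path ends with a slash) can be sketched in isolation as follows. This is a minimal illustration, not the code in `run.py`: the names `resolve_base_url` and `DEFAULT_SENTINELBENCH_URL` are hypothetical, and the assumption that an empty or missing value falls back to the default is mine; the diff only shows that the option "has default value".

```python
from typing import Optional

# Default taken from the README text in this change.
DEFAULT_SENTINELBENCH_URL = "https://sentinel-bench.vercel.app/"


def resolve_base_url(url: Optional[str] = None) -> str:
    """Return the base URL to use, always ending with a single trailing slash.

    Hypothetical helper: falls back to the default when no URL is given
    (an assumption), then applies the same trailing-slash check as the diff.
    """
    base = url if url else DEFAULT_SENTINELBENCH_URL
    # Ensure URL ends with slash so task paths can be appended directly.
    if not base.endswith("/"):
        base += "/"
    return base


if __name__ == "__main__":
    print(resolve_base_url())
    print(resolve_base_url("http://localhost:5173"))
```

Normalizing once at the entry point means later code can append task paths to the base URL without checking for a separator each time.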