SentinelBench implementation #385

matheusmaldaner · 2025-09-28T17:12:52Z

Adds the SentinelBench implementation for testing the long-running monitoring capabilities of modern AI agents. This PR includes:

SentinelBench/: the entire implementation of the SentinelBench benchmark
...benchmarks/sentinelbench/...: specifications for running MagenticUI on SentinelBench
...benchmarks/sentinelbench/tools/...: Python files to analyze and plot the results after running the tasks
many other QOL changes
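
For reviewers unfamiliar with what a "long-running monitoring" task asks of an agent, here is a minimal sketch of the general pattern: poll a condition, sleep between checks, and give up at a time limit. It is not code from this PR. The function `monitor_until` and all of its parameters are hypothetical; `sleep_duration` and the time limit only echo terms used in the commit titles, and the PR's actual handling of them may differ.

```python
import time
from typing import Callable


def monitor_until(
    check: Callable[[], bool],
    sleep_duration: float,
    time_limit: float,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> tuple[bool, int]:
    """Poll `check` until it returns True or the time limit would be exceeded.

    Hypothetical helper for illustration only. Returns a pair
    (succeeded, number_of_checks_performed).
    """
    if sleep_duration <= 0:
        # A non-positive sleep would turn the loop into a busy-wait.
        raise ValueError("sleep_duration must be positive")

    start = now()
    checks = 0
    while True:
        checks += 1
        if check():
            return True, checks
        # Stop if the next sleep would carry us past the time limit.
        if now() - start + sleep_duration > time_limit:
            return False, checks
        sleep(sleep_duration)
```

The `sleep` and `now` parameters are injectable so the loop can be exercised with a fake clock instead of waiting in real time.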

…crosoft#317) Co-authored-by: root <root@LAPTOP-AL1099TT> Co-authored-by: Matheus Kunzler Maldaner <[email protected]> Co-authored-by: Matheus Kunzler Maldaner <[email protected]> Co-authored-by: Hussein Mozannar <[email protected]>

matheusmaldaner and others added 30 commits August 3, 2025 01:20

Added browser_local flag to --magentic-cli to enable seeing browser when running with docker

47f0bfd

Minor change from using argparse to Typer

0e573e1

Added citation to README

53157ee

rough UI integration of sentinel steps

19bf0d6

Updates prompt to ensure URL is passed in the details field

21ae496

Dynamic sleep_duration

0d65087

Lock file

f4ae04d

Poe check

2dba654

Fixed dataset download

bf85f38

Added sentinel-tasks and use-local-browser flags

133f93c

Revert the changes made locally to run.py

8fec221

Remove current session tab (microsoft#318)

b814b9b

Mcp server list (microsoft#319)

18aa63f

Addresses Issue microsoft#137 (microsoft#339)

f8285ff

Revert "Add GitHub Actions workflow for automatic GitHub Pages deployment" (microsoft#341)

2f881c1

Add github pages (microsoft#342)

f9512c3

Rough testing changes

d7ffa49

Last pushes

f02ff73

SentinelBench fixes

a300e45

Enables file upload tool and print plan to console to validate sentinel step usage

2428f8b

Enables task or difficulty specific runs for SentinelBench

a87680a

Create function to compare results with and without sentinel tasks

7c2d738

Code to save partial results during eval when a task is canceled

36a7de8

Updated pricing

0dc805d

Data exploration for WebGames

cf10c5f

Instructions to run

290c52a

Introduced time limit flag and variants file for sentinel bench eval

79d05ec

Fixed pathing

ea8b68d

Fixed critical error in sleep_duration validation

72d61a0

matheusmaldaner added 22 commits September 28, 2025 13:30

Updated task naming

35337a2

Removed duplicate test.jsonl

47fa512

Added --pretty-output flag

bd87dc6

Support to run multiple tasks at once from command line

3315dfa

Improve --help flag display

e54defc

Hardcoded way of forcing timeout values for specific tasks, will get back to this

23f73ef

Fix sentinel task prompt

a9822f7

Small prompt fix

d1178eb

Fix final answer extraction when running parallel runs

83de1d1

Add --sentinelbench-url flag to change the base url

a858173

Delete unused files

aacfcb3

Poe check

fd7425e

Init

15aef43

Remove plots

a7cc3f7

File to test task variants

56eb345

Poe check

0685420

SentinelBench utility tools

7834fc9

Task variants

1e48e53

Poe check

2420120

Small changes on README and comments

3ec1441

Small fixes and poe checks

ed7b34d

Fixes absolute paths and hardcoded values

05da06a

matheusmaldaner force-pushed the sentinel/evals branch from 271471b to 05da06a Compare September 28, 2025 17:30

matheusmaldaner and others added 4 commits September 28, 2025 13:46

uv.lock

042d8af

Merge branch 'main' into sentinel/evals

e073ff3

All SentinelBench tasks and router configs

69ea146

SentinelBench dump

40d41f0

matheusmaldaner marked this pull request as ready for review September 28, 2025 19:07

matheusmaldaner added 2 commits September 30, 2025 14:31

Ignores SentinelBench env variables and typo fix

bb67372

Better exception handling + uv.lock file

1836bc0
