Adding internet and research paper tools for open world search tasks#75
Adding internet and research paper tools for open world search tasks#75mr-sarthakgupta wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request expands the agentic capabilities of the codebase by introducing several new tools: web_search (via DuckDuckGo and Semantic Scholar), fetch_webpage (for downloading and parsing webpages), research_papers (for discovering and reading ML papers), and run_command (for executing sandboxed shell commands). The review feedback highlights critical security and robustness issues, including a safety check bypass in run_command when unsafe is enabled, backwards rate-limiting logic for unauthenticated Semantic Scholar requests, and opportunities to improve HTML parsing, timeout handling, and arXiv ID normalization.
| if not unsafe: | ||
| safety_err = _check_command_safety(command) | ||
| if safety_err: | ||
| return f"Error: {safety_err}", False |
There was a problem hiding this comment.
When unsafe is True, the safety check is completely bypassed, allowing arbitrary command execution (including blocked executables like rm, curl, etc.). To prevent this security bypass, _check_command_safety should always be called, passing the unsafe flag to enforce the appropriate safety constraints.
safety_err = _check_command_safety(command, unsafe=unsafe)
if safety_err:
return f"Error: {safety_err}", False| # Rate limit only when authenticated (1 req/s for search, 10 req/s for others) | ||
| if S2_API_KEY: | ||
| min_interval = 1.0 if "search" in path else 0.1 | ||
| elapsed = time.monotonic() - _s2_last_request | ||
| if elapsed < min_interval: | ||
| await asyncio.sleep(min_interval - elapsed) |
There was a problem hiding this comment.
The rate-limiting logic is backwards: it only throttles requests when S2_API_KEY is present (authenticated). Unauthenticated requests have much stricter rate limits on Semantic Scholar and will immediately trigger 429 Too Many Requests if not throttled. We should always enforce a rate limit, defaulting to a safer 1.0 second interval when unauthenticated.
# Rate limit always, defaulting to 1.0s when unauthenticated
min_interval = 1.0 if not S2_API_KEY or "search" in path else 0.1
elapsed = time.monotonic() - _s2_last_request
if elapsed < min_interval:
await asyncio.sleep(min_interval - elapsed)| def _validate_arxiv_id(args: dict) -> str | None: | ||
| """Return arxiv_id or None if missing.""" | ||
| return args.get("arxiv_id") |
There was a problem hiding this comment.
LLMs frequently extract arXiv IDs with prefixes like arxiv: or arXiv:. If these are passed directly to the APIs or URLs, the requests will fail. Normalizing the ID by stripping any case-insensitive arxiv: prefix in _validate_arxiv_id makes the tool much more robust.
| def _validate_arxiv_id(args: dict) -> str | None: | |
| """Return arxiv_id or None if missing.""" | |
| return args.get("arxiv_id") | |
| def _validate_arxiv_id(args: dict) -> str | None: | |
| """Return arxiv_id or None if missing.""" | |
| val = args.get("arxiv_id") | |
| if val and isinstance(val, str): | |
| return re.sub(r"^(arxiv:)+", "", val, flags=re.IGNORECASE).strip() | |
| return val |
| except httpx.RequestError: | ||
| continue |
There was a problem hiding this comment.
If _parse_paper_html raises an unexpected parsing exception (e.g., AttributeError or TypeError due to malformed HTML), it will propagate out of the loop and fail the entire tool call. Catching general Exception alongside httpx.RequestError ensures we robustly fall back to the next URL or the abstract fallback.
| except httpx.RequestError: | |
| continue | |
| except (httpx.RequestError, Exception): | |
| continue |
| def html_to_text(html_content: str) -> str: | ||
| """Convert HTML to clean plain text.""" | ||
| extractor = _TextExtractor() | ||
| try: | ||
| extractor.feed(html_content) | ||
| except Exception: | ||
| # If the parser chokes, fall back to stripping all tags. | ||
| return re.sub(r"<[^>]+>", " ", html_content) | ||
| return extractor.get_text() |
There was a problem hiding this comment.
Using a custom HTMLParser subclass is highly sensitive to mismatched or unclosed tags in real-world HTML, which can cause the parser to skip large portions of the webpage. Since beautifulsoup4 is already a dependency of the project, we should use it to decompose unwanted tags and extract text reliably.
| def html_to_text(html_content: str) -> str: | |
| """Convert HTML to clean plain text.""" | |
| extractor = _TextExtractor() | |
| try: | |
| extractor.feed(html_content) | |
| except Exception: | |
| # If the parser chokes, fall back to stripping all tags. | |
| return re.sub(r"<[^>]+>", " ", html_content) | |
| return extractor.get_text() | |
| def html_to_text(html_content: str) -> str: | |
| """Convert HTML to clean plain text.""" | |
| try: | |
| from bs4 import BeautifulSoup | |
| soup = BeautifulSoup(html_content, "html.parser") | |
| for element in soup(_SKIP_TAGS): | |
| element.decompose() | |
| return soup.get_text(separator="\n").strip() | |
| except Exception: | |
| return re.sub(r"<[^>]+>", " ", html_content) |
| resp = requests.get( | ||
| hit.url, | ||
| headers={"User-Agent": USER_AGENT}, | ||
| timeout=15, | ||
| allow_redirects=True, | ||
| ) |
There was a problem hiding this comment.
A timeout of 15 seconds per page fetch can make the web_search tool extremely slow and unresponsive if multiple top hits are slow or hanging. Reducing the timeout to 5 seconds ensures the search returns quickly while still capturing responsive pages.
| resp = requests.get( | |
| hit.url, | |
| headers={"User-Agent": USER_AGENT}, | |
| timeout=15, | |
| allow_redirects=True, | |
| ) | |
| resp = requests.get( | |
| hit.url, | |
| headers={"User-Agent": USER_AGENT}, | |
| timeout=5, | |
| allow_redirects=True, | |
| ) |
|
@mr-sarthakgupta nice work, this is a really useful addition. I pulled it down and ran the tools, web_search and fetch_webpage both worked with no key needed, and the safe-mode design for run_command makes sense. A few things before the merge is approved:
None of these are hard to fix and the feature itself is great. This should be good to merge once the lint and the path/env bits are handled. Thanks! |
|
Hi Shubham, thanks for the feedback! Yes, these features do have some holes right now, I mostly duct-taped them for my own use. Would be fixing the issues in the coming days within this PR. Also, I might open another PR for symbolic regression soon! PS: Some time ago I had sent an email to you with some questions about sky-discover, while a few of them have been answered with my own experiments but would still really appreciate if you could take a look if you find any time! (I sent it using the ID mrsarthakgupta@gmail.com) |
|
Also, another small problem I faced (not related to this PR or topic), I wasn't able to find a way to run EvoX and Ada-Evolve with each other, they seem like alternatives by looking at the code structure but are at separate layers in terms of search logic and I believe I also saw some results on the website or someplace with both being used |
Adding internet and research paper tools for open world search tasks
the tools allow the agent to search papers and understand existing methods before trying introducing alterations to the solution set.
I've used these tools on my own project for determining protein aggregation laws using a skydiscover derivative and found them really useful!
These tools are heavily inspired by huggingface's https://github.com/huggingface/ml-intern