Skip to content

Adding internet and research paper tools for open world search tasks#75

Open
mr-sarthakgupta wants to merge 2 commits into
skydiscover-ai:mainfrom
mr-sarthakgupta:add-more-tools
Open

Adding internet and research paper tools for open world search tasks#75
mr-sarthakgupta wants to merge 2 commits into
skydiscover-ai:mainfrom
mr-sarthakgupta:add-more-tools

Conversation

@mr-sarthakgupta

@mr-sarthakgupta mr-sarthakgupta commented Jun 1, 2026

Copy link
Copy Markdown

Adding internet and research paper tools for open world search tasks

the tools allow the agent to search papers and understand existing methods before trying introducing alterations to the solution set.

I've used these tools on my own project for determining protein aggregation laws using a skydiscover derivative and found them really useful!

These tools are heavily inspired by huggingface's https://github.com/huggingface/ml-intern

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request expands the agentic capabilities of the codebase by introducing several new tools: web_search (via DuckDuckGo and Semantic Scholar), fetch_webpage (for downloading and parsing webpages), research_papers (for discovering and reading ML papers), and run_command (for executing sandboxed shell commands). The review feedback highlights critical security and robustness issues, including a safety check bypass in run_command when unsafe is enabled, backwards rate-limiting logic for unauthenticated Semantic Scholar requests, and opportunities to improve HTML parsing, timeout handling, and arXiv ID normalization.

Comment on lines +297 to +300
if not unsafe:
safety_err = _check_command_safety(command)
if safety_err:
return f"Error: {safety_err}", False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

When unsafe is True, the safety check is completely bypassed, allowing arbitrary command execution (including blocked executables like rm, curl, etc.). To prevent this security bypass, _check_command_safety should always be called, passing the unsafe flag to enforce the appropriate safety constraints.

    safety_err = _check_command_safety(command, unsafe=unsafe)
    if safety_err:
        return f"Error: {safety_err}", False

Comment on lines +75 to +80
# Rate limit only when authenticated (1 req/s for search, 10 req/s for others)
if S2_API_KEY:
min_interval = 1.0 if "search" in path else 0.1
elapsed = time.monotonic() - _s2_last_request
if elapsed < min_interval:
await asyncio.sleep(min_interval - elapsed)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The rate-limiting logic is backwards: it only throttles requests when S2_API_KEY is present (authenticated). Unauthenticated requests have much stricter rate limits on Semantic Scholar and will immediately trigger 429 Too Many Requests if not throttled. We should always enforce a rate limit, defaulting to a safer 1.0 second interval when unauthenticated.

        # Rate limit always, defaulting to 1.0s when unauthenticated
        min_interval = 1.0 if not S2_API_KEY or "search" in path else 0.1
        elapsed = time.monotonic() - _s2_last_request
        if elapsed < min_interval:
            await asyncio.sleep(min_interval - elapsed)

Comment on lines +510 to +512
def _validate_arxiv_id(args: dict) -> str | None:
"""Return arxiv_id or None if missing."""
return args.get("arxiv_id")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

LLMs frequently extract arXiv IDs with prefixes like arxiv: or arXiv:. If these are passed directly to the APIs or URLs, the requests will fail. Normalizing the ID by stripping any case-insensitive arxiv: prefix in _validate_arxiv_id makes the tool much more robust.

Suggested change
def _validate_arxiv_id(args: dict) -> str | None:
"""Return arxiv_id or None if missing."""
return args.get("arxiv_id")
def _validate_arxiv_id(args: dict) -> str | None:
"""Return arxiv_id or None if missing."""
val = args.get("arxiv_id")
if val and isinstance(val, str):
return re.sub(r"^(arxiv:)+", "", val, flags=re.IGNORECASE).strip()
return val

Comment on lines +709 to +710
except httpx.RequestError:
continue

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If _parse_paper_html raises an unexpected parsing exception (e.g., AttributeError or TypeError due to malformed HTML), it will propagate out of the loop and fail the entire tool call. Catching general Exception alongside httpx.RequestError ensures we robustly fall back to the next URL or the abstract fallback.

Suggested change
except httpx.RequestError:
continue
except (httpx.RequestError, Exception):
continue

Comment on lines +138 to +146
def html_to_text(html_content: str) -> str:
"""Convert HTML to clean plain text."""
extractor = _TextExtractor()
try:
extractor.feed(html_content)
except Exception:
# If the parser chokes, fall back to stripping all tags.
return re.sub(r"<[^>]+>", " ", html_content)
return extractor.get_text()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using a custom HTMLParser subclass is highly sensitive to mismatched or unclosed tags in real-world HTML, which can cause the parser to skip large portions of the webpage. Since beautifulsoup4 is already a dependency of the project, we should use it to decompose unwanted tags and extract text reliably.

Suggested change
def html_to_text(html_content: str) -> str:
"""Convert HTML to clean plain text."""
extractor = _TextExtractor()
try:
extractor.feed(html_content)
except Exception:
# If the parser chokes, fall back to stripping all tags.
return re.sub(r"<[^>]+>", " ", html_content)
return extractor.get_text()
def html_to_text(html_content: str) -> str:
"""Convert HTML to clean plain text."""
try:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "html.parser")
for element in soup(_SKIP_TAGS):
element.decompose()
return soup.get_text(separator="\n").strip()
except Exception:
return re.sub(r"<[^>]+>", " ", html_content)

Comment on lines +213 to +218
resp = requests.get(
hit.url,
headers={"User-Agent": USER_AGENT},
timeout=15,
allow_redirects=True,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

A timeout of 15 seconds per page fetch can make the web_search tool extremely slow and unresponsive if multiple top hits are slow or hanging. Reducing the timeout to 5 seconds ensures the search returns quickly while still capturing responsive pages.

Suggested change
resp = requests.get(
hit.url,
headers={"User-Agent": USER_AGENT},
timeout=15,
allow_redirects=True,
)
resp = requests.get(
hit.url,
headers={"User-Agent": USER_AGENT},
timeout=5,
allow_redirects=True,
)

@shubham3-ucb

shubham3-ucb commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

@mr-sarthakgupta nice work, this is a really useful addition. I pulled it down and ran the tools, web_search and fetch_webpage both worked with no key needed, and the safe-mode design for run_command makes sense. A few things before the merge is approved:

  1. Lint is failing. black wants to reformat 5 of the new files (agentic_generator.py, run_command_tool.py, web_search_tool.py, papers_tool.py, fetch_webpage_tool.py). A quick black skydiscover/ and a push should turn CI green.

  2. In safe mode, run_command can still read files outside the codebase root. The docstring says absolute paths outside the root are blocked, but there is no actual path check, so cat/ls/head/find/grep happily take an absolute path. I tested it with codebase_root set to a temp dir:

    cat /etc/passwd   -> exit 0, returns the file
    ls /Users/<me>    -> lists my home dir
    

    Could we resolve the path args and reject anything outside codebase_root? Or if that is out of scope for now, just drop the "absolute paths are blocked" line so the docstring matches the behavior.

  3. env is on the safe allowlist, so the agent can dump the whole environment (including OPENAI_API_KEY) straight into the model context. I confirmed it prints the key. I would pull env and printenv out of the allowlist.

  4. Small one: fetch_webpage correctly blocks non-http(s) schemes (file:// is rejected), but there is no host check, so localhost and 169.254.169.254 (cloud metadata) still get hit. I confirmed both are attempted rather than refused. A guard for loopback, link-local, and private ranges would be good if this ever runs on shared infra.

None of these are hard to fix and the feature itself is great. This should be good to merge once the lint and the path/env bits are handled. Thanks!

@mr-sarthakgupta

mr-sarthakgupta commented Jun 6, 2026

Copy link
Copy Markdown
Author

Hi Shubham, thanks for the feedback!

Yes, these features do have some holes right now, I mostly duct-taped them for my own use. Would be fixing the issues in the coming days within this PR. Also, I might open another PR for symbolic regression soon!

PS: Some time ago I had sent an email to you with some questions about sky-discover, while a few of them have been answered with my own experiments but would still really appreciate if you could take a look if you find any time! (I sent it using the ID mrsarthakgupta@gmail.com)

@mr-sarthakgupta

Copy link
Copy Markdown
Author

Also, another small problem I faced (not related to this PR or topic), I wasn't able to find a way to run EvoX and Ada-Evolve with each other, they seem like alternatives by looking at the code structure but are at separate layers in terms of search logic and I believe I also saw some results on the website or someplace with both being used

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants