6 changes: 3 additions & 3 deletions README.md
@@ -9,7 +9,7 @@
<h3 align="center">The open source taint analysis engine for the AI era</h3>

<p align="center">
Formal inter-procedural taint analysis — finds what pattern matching engines miss, enacts what LLM agents discover as rules, scales where neither can alone.
Formal inter-procedural taint analysis — finds what AST-pattern matchers miss, enacts what LLM agents discover as rules, scales where neither can alone.
</p>

<p align="center">
@@ -104,8 +104,8 @@ LLM security agents find vulnerabilities humans miss, burn tokens on every file,

The more AI writes code, the more you need formal methods underneath.

- **Find what pattern matching engines miss.** The inter-procedural dataflow engine tracks untrusted data across function boundaries, persistence layers, aliases, and async code.
- **One finding becomes total coverage.** Code-native rules let you enact every uncovered vulnerability as a rule with the engine applying it across the entire codebase, deterministically, in minutes of CPU.
- **Find what AST-pattern matchers miss.** The inter-procedural dataflow engine tracks untrusted data across function boundaries, persistence layers, aliases, and async code.
- **One finding becomes total coverage.** AST-pattern rules let you enact every uncovered vulnerability as a rule with the engine applying it across the entire codebase, deterministically, in minutes of CPU.
- **Open source, batteries included.** Engine, rules, CI integrations — the entire stack ships under Apache 2.0 and MIT. No paid tier to unlock taint tracking, no gates on writing your own rules.

## Quick Start
6 changes: 3 additions & 3 deletions docs/README.md
@@ -29,19 +29,19 @@

AI generates production code faster than today's security tooling can keep up with. The code looks production-ready — yet it buries vulnerabilities in data flows that are fundamentally hard to catch. These include untrusted input winding through framework abstractions, cross-controller interactions with persistence layers, and async code. At the rate AI produces it, humans can't review this code at the depth it requires.

The tools meant to help aren't keeping up either — pattern matching engines catch surface-level issues but struggle to follow data flow across function and file boundaries, LLM agents burn tokens on every file and still produce inconsistent results, and enterprise analyzers that go further gate their analysis behind a paywall, with rule sets that rarely cover your stack.
The tools meant to help aren't keeping up either — AST-pattern matchers catch surface-level issues but struggle to follow data flow across function and file boundaries, LLM agents burn tokens on every file and still produce inconsistent results, and enterprise analyzers that go further gate their analysis behind a paywall, with rule sets that rarely cover your stack.

The more AI writes code, the more you need formal methods underneath.

### Find what pattern matching engines miss
### Find what AST-pattern matchers miss

The engine runs IFDS-with-abduction — formal inter-procedural dataflow analysis. It tracks untrusted data from HTTP inputs to dangerous APIs across endpoints, persistence layers, object fields, aliased references, and async code. That includes multi-hop attack paths — cross-endpoint flows, stored injections, data through object fields and aliases — at monorepo scale. 100+ rules across 20+ vulnerability classes.
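As a rough illustration of the kind of flow described above, the following plain-Java sketch moves an untrusted value through a helper method, an object field, and an aliased reference before it reaches a string-built SQL query. It does not use Spring or OpenTaint. The class and method names (`InterProceduralFlowDemo`, `SearchForm`, `readParam`, `bind`, `buildQuery`, `handle`) are hypothetical, and the `Map` is only a stand-in for an HTTP request. No single method contains both the source and the sink, which is why a check limited to one function body would have little to match on.

```java
import java.util.Map;

// Hypothetical example: every name here is illustrative, not part of OpenTaint or Spring.
final class InterProceduralFlowDemo {

    // The untrusted value travels through this object field.
    static final class SearchForm {
        String name;
    }

    // Source: stand-in for reading an HTTP request parameter (untrusted input).
    static String readParam(Map<String, String> request, String key) {
        return request.getOrDefault(key, "");
    }

    // Hop 1: a separate method stores the untrusted value in a field.
    static SearchForm bind(Map<String, String> request) {
        SearchForm form = new SearchForm();
        form.name = readParam(request, "name");
        return form;
    }

    // Hop 2: another method reads the field through an alias and
    // concatenates it into SQL. This concatenation is the dangerous sink.
    static String buildQuery(SearchForm form) {
        SearchForm alias = form;
        return "SELECT * FROM users WHERE name = '" + alias.name + "'";
    }

    // Entry point: the source and the sink are two calls apart.
    static String handle(Map<String, String> request) {
        return buildQuery(bind(request));
    }

    public static void main(String[] args) {
        System.out.println(handle(Map.of("name", "x' OR '1'='1")));
    }
}
```

Following the value from `readParam` to `buildQuery` requires reasoning across three methods, one field write, and one alias, which is the inter-procedural part of the analysis.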

Models Spring data flow and the full Boot ecosystem, analyzing Java and Kotlin at bytecode level. More languages and frameworks ahead.

### One finding becomes total coverage

LLM security agents find things — but at token cost per file, with results that shift each run, and no guarantee of complete coverage. Code-native rules turn their findings into leverage. Every vulnerability an agent uncovers can be enacted as a rule — a source, a sink, and the data flow between them — which the agent can write itself. The engine applies that rule across the entire codebase, deterministically, in minutes of CPU. When a finding is a false positive, a sanitizer can be added to the rule — the refinement propagates to every match, permanently. One discovery compounds across the entire codebase.
LLM security agents find things — but at token cost per file, with results that shift each run, and no guarantee of complete coverage. AST-pattern rules turn their findings into leverage. Every vulnerability an agent uncovers can be enacted as a rule — a source, a sink, and the data flow between them — which the agent can write itself. The engine applies that rule across the entire codebase, deterministically, in minutes of CPU. When a finding is a false positive, a sanitizer can be added to the rule — the refinement propagates to every match, permanently. One discovery compounds across the entire codebase.
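The source, sink, and sanitizer vocabulary can be made concrete with a toy model. The sketch below is an assumption-heavy simplification and is neither OpenTaint's rule format nor its algorithm: a "rule" is reduced to three sets of method names, and a "data flow" to the ordered list of calls one value passes through. The names `ToyTaintRule`, `Rule`, `fires`, and `escapeSql` are hypothetical. It only shows why adding a sanitizer to a rule silences every matching flow at once.

```java
import java.util.List;
import java.util.Set;

// Toy model only: not OpenTaint's rule format and not its analysis algorithm.
final class ToyTaintRule {

    // A rule reduced to its essentials: one source, one sink, optional sanitizers.
    static final class Rule {
        final String source;
        final String sink;
        final Set<String> sanitizers;

        Rule(String source, String sink, Set<String> sanitizers) {
            this.source = source;
            this.sink = sink;
            this.sanitizers = sanitizers;
        }
    }

    // path: the ordered calls that a single value flows through.
    // Returns true if the value is still tainted when it reaches the sink.
    static boolean fires(Rule rule, List<String> path) {
        boolean tainted = false;
        for (String call : path) {
            if (call.equals(rule.source)) {
                tainted = true;            // value becomes untrusted
            } else if (rule.sanitizers.contains(call)) {
                tainted = false;           // sanitizer clears the taint
            } else if (call.equals(rule.sink) && tainted) {
                return true;               // untrusted value reached the sink
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Rule rule = new Rule("getParameter", "executeQuery", Set.of("escapeSql"));
        System.out.println(fires(rule, List.of("getParameter", "trim", "executeQuery")));
        System.out.println(fires(rule, List.of("getParameter", "escapeSql", "executeQuery")));
    }
}
```

In this toy, the first flow is reported and the second is not, because `escapeSql` is listed as a sanitizer. Removing it from the set would make both flows fire again, with no other change.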

The entire system is designed to work with AI agents. Formal analysis produces reproducible results agents can act on without introducing uncertainty. Rules read like code, not a proprietary DSL — so agents write and tune them the same way humans do.

Expand Down
24 changes: 14 additions & 10 deletions docs/faq.md
@@ -2,36 +2,40 @@

**What is OpenTaint?**

OpenTaint is an open source taint analysis engine built for the AI coding era. It performs inter-procedural dataflow analysis on Java and Kotlin bytecode — cross-endpoint flow tracking, persistence layer modelling, alias analysis, and asynchronous code analysis. Code-native rules find real vulnerabilities in web applications. Finds what pattern matching engines miss, enacts what LLM agents discover as permanent rules, scales where neither can alone.
OpenTaint is an open source taint analysis engine built for the AI era. It runs inter-procedural dataflow analysis to track untrusted data across function boundaries, persistence layers, aliases, and async code. For Java and Kotlin, the analysis works on bytecode. Rules are written in a readable AST-pattern format expressive enough to describe both vulnerable and safe patterns, letting the engine analyze deeply and precisely. It catches what AST-pattern matchers miss, turns LLM agent findings into reusable rules, and scales beyond what either can do alone.

**What vulnerabilities does OpenTaint detect?**

OpenTaint detects 20+ vulnerability types including SQL injection, XSS, SSRF, SpEL injection, open redirects, path traversal, command injection, and more. It tracks untrusted data from entry points through your application to dangerous APIs. It currently offers deep Spring Boot ecosystem support: every finding is automatically mapped to its HTTP endpoint, so you know exactly which APIs are affected.
It detects over 20 classes of vulnerability, including SQL injection, XSS, SSRF, SpEL injection, open redirects, path traversal, and command injection. For each finding, the report walks the full path (from the HTTP source, through method calls, async boundaries, and JPA persistence, down to the dangerous call) and ties it back to the Spring endpoint where the data entered.

**What are code-native rules?**
**What are AST-pattern rules?**

Rules that look like code. Readable, writable, and tunable by humans and AI agents alike. The engine translates each rule into a full taint configuration — sources, sinks, sanitizers, and propagators connected by typed taint marks. When a rule produces a false positive, you refine the rule directly. No query language to learn, no black box to work around.
Two layers. AST-pattern rules describe the shape of vulnerable code — the same rule format Semgrep and ast-grep use, readable by humans and AI agents alike. Whole-program taint analysis is what reads them: the engine models data flow across the entire program — through function boundaries, fields, async code, and persistence layers — and follows each rule's metavariables as values moving through that flow. AST-pattern matchers stop at the syntactic match; OpenTaint keeps tracing the data through them. When a rule fires on safe code, you refine it directly — the rule format is the same one you'd write for Semgrep or ast-grep.

**Why not just use an LLM agent for security scanning?**

LLM agents offer no formal guarantees. Run the same prompt twice and you may get different results — no determinism, no reproducibility. An LLM agent scanning a large codebase burns through token budgets and still can't guarantee full coverage. OpenTaint scans the same codebase in minutes of CPU compute — deterministically. AI agents can read and write OpenTaint's code-native rules, so you get the best of both: AI flexibility with formal analysis underneath.
You can, but LLM agents don't come with formal guarantees. Run the same prompt twice and the results may differ — there's no determinism and no way to argue about coverage. And on large codebases the bill adds up quickly: an agent burns through tokens on every file, the cost scales with codebase size, and it still can't promise it looked everywhere. OpenTaint scans the same codebase in minutes of CPU, deterministically, every run. Since AI agents can read and write its AST-pattern rules, you don't have to choose: the agent discovers, the engine applies.

**What languages and frameworks are supported?**

Java and Kotlin, analyzed at the bytecode level to precisely understand inheritance, generics, and library interactions. Deep Spring Boot support including Spring MVC, Spring Data, and related libraries. More languages ahead.
Java and Kotlin today. The engine works on bytecode, which gives it precise resolution of inheritance, generics, and library calls — including the standard library and any third-party JARs in the build classpath. Spring Boot is supported deeply, including Spring MVC, Spring Data, and the surrounding libraries. Python and Go are next on the roadmap.

**Why is OpenTaint the most thorough taint analyzer for Spring apps?**

It does inter-procedural data-flow analysis, following tainted data across method boundaries and through async constructs — Reactor, Spring WebFlux, and Kotlin coroutines are all modeled via data-flow approximations. Out of the box, it also models JPA persistence layers, so it catches stored injections where untrusted input arrives at one endpoint, gets saved to the database, and reappears in a completely different request later. Most engines treat the persistence layer as an opaque boundary; OpenTaint models it as part of the flow, linking writes in one request to reads in another.
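The stored-injection scenario can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Spring or JPA code: the in-memory `HashMap` is a stand-in for a database-backed repository, and `StoredInjectionDemo`, `saveComment`, and `renderComment` are hypothetical names playing the role of two separate request handlers. The write and the read never appear in the same call chain, so the only link between them is the stored value itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: the map stands in for a JPA-backed table.
final class StoredInjectionDemo {

    private final Map<Long, String> comments = new HashMap<>();

    // Request 1: untrusted input arrives and is persisted unchanged.
    void saveComment(long id, String body) {
        comments.put(id, body);
    }

    // Request 2: a different handler reads the stored value and
    // writes it into HTML without escaping (the dangerous sink).
    String renderComment(long id) {
        return "<p>" + comments.getOrDefault(id, "") + "</p>";
    }

    public static void main(String[] args) {
        StoredInjectionDemo app = new StoredInjectionDemo();
        app.saveComment(1L, "<script>alert(1)</script>");
        System.out.println(app.renderComment(1L));
    }
}
```

An analysis that treats the store as opaque sees `renderComment` reading a value of unknown origin. Connecting it to `saveComment` requires modeling the write and the read as two ends of one flow.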

**How does OpenTaint compare to Semgrep?**

Semgrep's open-source engine includes intra-procedural taint analysis — it tracks data within a single function. Its Pro engine adds inter-procedural taint analysis behind a paid tier. OpenTaint ships full inter-procedural dataflow analysis — cross-endpoint flows, persistence layers, stored injections — under Apache 2.0. Rules use a code-native format that the engine translates into complete taint configurations. Semgrep rule syntax is supported as a migration path.
Semgrep's open-source engine does intra-procedural taint analysis — it tracks data within a single function. Inter-procedural analysis lives in the Pro engine, which is closed source and paid. OpenTaint ships full inter-procedural dataflow analysis — cross-endpoint flows, persistence layers, stored injections — under Apache 2.0, and it's free for any codebase, including commercial closed-source projects. Rules are written in an AST-pattern format that the engine translates into full taint configurations, and existing Semgrep rule syntax is supported so you can migrate gradually.

**How does OpenTaint compare to CodeQL?**

CodeQL requires learning QL — a specialized query language that AI agents can't easily write. OpenTaint delivers formal inter-procedural dataflow analysis with code-native rules any developer or AI agent can read, write, and refine. No proprietary licensing, no paywall. Full taint analysis out of the box.
CodeQL does inter-procedural taint analysis too, but it's proprietary — free for open source projects, and gated behind a paid GitHub Advanced Security license for closed-source code. Its rules are written in QL, a domain-specific query language with its own semantics to learn. OpenTaint is fully open source with no paywall on private code, and its rules are written in an AST-pattern format that any developer or AI agent can read, write, and refine. Full inter-procedural taint analysis comes out of the box.

**Is OpenTaint free to use?**

Yes. The core engine is licensed under [Apache 2.0](../LICENSE.md). The CLI, CI integrations, and rules are licensed under [MIT](../cli/LICENSE).
Yes. The core engine is licensed under [Apache 2.0](../LICENSE.md), and the CLI, CI integrations, and rules are licensed under [MIT](../cli/LICENSE). Free to use on any codebase, including commercial closed-source projects.

**Can I use existing Semgrep rules?**

OpenTaint supports Semgrep rule syntax, so existing rules work as a starting point. The engine adds inter-procedural dataflow analysis on top, and you can migrate to code-native rules at your own pace for full control over taint configurations.
OpenTaint supports Semgrep's rule format, with some restrictions and a few extensions (e.g. a taint-style join mode). The engine interprets metavariables as data values — not just syntactic placeholders — and propagates them through inter-procedural dataflow. Because of that semantic difference, the same rule can produce different findings in OpenTaint than in Semgrep.