feat: Go SDK with examples, CI/CD, and release-please#80

Open
0xdeafcafe wants to merge 31 commits into main from feat/go-sdk

Collaborator

0xdeafcafe commented Jun 30, 2025

Summary

Full Go SDK for scenario-based AI agent testing, aligned with the existing JS and Python SDKs.

Core SDK (30 commits)

  • Provider-agnostic agent testing framework with AgentAdapter interface
  • Built-in UserSimulatorAgent and JudgeAgent with LLM-powered evaluation
  • Script DSL: User(), Agent(), Judge(), Proceed(), Succeed(), Fail()
  • LLM provider adapters: OpenAI, Anthropic, Gemini, AWS Bedrock
  • LangWatch integration with event reporting and OpenTelemetry tracing
  • Verbose and Metadata fields on ScenarioConfig (wired into runner + events)

API alignment with JS/Python

  • AgentRole values changed to Title case ("Agent", "User", "Judge")
  • LastAssistantMessage() renamed to LastAgentMessage()
  • ScenarioResult uses MetCriteria/UnmetCriteria (matches JS)

Bug fix

  • Fixed swapped ToolMessage(content, toolCallID) arguments in OpenAI provider

Example test suite (go/examples/)

10 test files ported from the JS example suite:

  • weather_agent_test.go — tool calling + HasToolCall() assertions
  • vegetarian_recipe_agent_test.go — multi-turn with judge checkpoint criteria
  • travel_agent_test.go — multi-tool agent, recursive execution
  • false_assumptions_test.go — hardcoded messages + Proceed(WithProceedTurns, WithProceedOnTurn)
  • grouping_scenarios_test.go — echo agent, SetID, Succeed()
  • error_handling_test.go — agent error propagation
  • simple_tool_mocking_test.go — mocked tool execution, parameter verification
  • custom_judge_test.go — custom judge with direct LLM structured output
  • multiturn_10_scripted_test.go — fully scripted 10-turn conversation
  • mocked_weather_agent_tool_test.go — hardcoded tool call/result injection

CI/CD

  • go-ci.yml — vet, test, provider checks, example tests with secrets
  • go-publish.yml — verify + warm Go module proxy on release tag
  • Release-please configured for go component (release-type: go, starting at v0.1.0)

Test plan

  • All 10 example tests pass against live OpenAI API
  • go vet ./... clean on core SDK, examples, and providers
  • Existing internal unit tests pass (ksuid, ptr)
  • CI workflow runs on this PR

rogeriochaves force-pushed the main branch 2 times, most recently from 77a92af to 9fdb87c, on December 16, 2025 15:54
0xdeafcafe marked this pull request as ready for review on February 24, 2026 09:05
Extract LLM inference interface and message types into dedicated files.
Update agent interfaces and user simulator with refined API.

- Prefix event types with SCENARIO_ (SCENARIO_RUN_STARTED, etc.)
- Add ScenarioRunStatus type with uppercase values (SUCCESS, ERROR, FAILED)
- Rewrite eventalert with tmpdir file coordination across processes
- Show greeting banner only when API key is missing
- Use path-based watch URL format ({setUrl}/{batchRunId})
- Add SCENARIO_HEADLESS env var to suppress browser open
- Add SCENARIO_DISABLE_SIMULATION_REPORT_INFO env var to suppress banners
- Scope watch message per scenarioSetId
- Cache batch run ID per process with sync.Once

Add go.opentelemetry.io/otel, otel/sdk, otel/trace, otel/attribute,
otel/codes and github.com/langwatch/langwatch/sdk-go for full OTel
tracing integration.

- SpanCollector: implements sdktrace.SpanProcessor to collect spans,
  filters by thread ID with parent chain walking
- SpanDigestFormatter: renders spans as plain-text hierarchy with
  timestamps, durations, attributes, events, and error sections
- setupObservability: creates LangWatch exporter, TracerProvider, and
  SpanCollector; wires into global OTel provider
- Remove TracedInference and instrumentBuiltInAgents (replaced by OTel)
- Execution: create per-turn and per-agent spans with tracer, end spans
  at all exit paths
- JudgeAgent: build transcript from messages, include OTel trace digest
  in judge prompt alongside conversation transcript
- Runner: init observability when API key present, wire span collector
  to judge agents and execution, default endpoint to app.langwatch.ai,
  default SetID to "default", shutdown observability after run

Multi-provider inference abstraction supporting OpenAI, Anthropic,
Google Gemini, and AWS Bedrock with tool/function calling conversion.

…-please

API alignment:
- AgentRole values: lowercase → Title case to match JS/Python
- Rename LastAssistantMessage → LastAgentMessage for consistency
- Add Verbose and Metadata fields to ScenarioConfig
- Wire Verbose (prints failure details) and Metadata (sent in events)

Bug fix:
- Fix swapped ToolMessage arguments in OpenAI provider (content/toolCallID)

Examples (10 test files in go/examples/):
- weather-agent, vegetarian-recipe, travel-agent, false-assumptions,
  grouping-scenarios, error-handling, simple-tool-mocking, custom-judge,
  multiturn-10-scripted, mocked-weather-agent-tool

CI/CD:
- go-ci.yml: vet, test, provider checks, example tests with secrets
- go-publish.yml: verify + warm Go module proxy on release
- release-please: add go component (release-type: go, v0.1.0)
- version.go, CHANGELOG.md for release-please integration
Copilot AI review requested due to automatic review settings April 9, 2026 17:34
Contributor

github-actions bot commented Apr 9, 2026

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR's diff exceeds the size limit for automated low-risk evaluation. Manual review required.

This PR requires a manual review before merging.

0xdeafcafe changed the title from "feat: go sdk" to "feat: Go SDK with examples, CI/CD, and release-please" on Apr 9, 2026

Copilot AI left a comment

Pull request overview

Introduces an initial Scenario Go SDK (core runner/execution engine, DSL helpers, LangWatch event reporting + OTel tracing) plus provider adapters (OpenAI/Anthropic/Gemini/Bedrock), examples, and Go CI/publish workflows.

Changes:

  • Add Go SDK core types (agents, messages, execution, script DSL) and LangWatch event reporting / OTel tracing utilities.
  • Add provider-specific Inference adapters for OpenAI, Anthropic, Gemini, and AWS Bedrock.
  • Add Go examples/tests, release-please config entries, and GitHub Actions workflows for Go CI and publishing.

Reviewed changes

Copilot reviewed 66 out of 72 changed files in this pull request and generated 18 comments.

File Description
go/version.go SDK version constant.
go/utils.go Criterion param-name normalization + message role reversal utilities.
go/tracing.go Tracing notes / legacy placeholder.
go/tracing_setup.go OTel + LangWatch exporter setup and handle.
go/tracing_digest.go Plain-text span digest formatter for judge evaluation.
go/tracing_collector.go OTel span processor/collector to attach spans to judge.
go/script.go Script DSL helper functions (User/Agent/Judge/Proceed/...).
go/runner.go Run() entrypoint, option handling, reporter + observability wiring.
go/execution.go Core scenario execution engine (turn loop, agent calls, events, spans).
go/executionstate.go Execution state tracking and helper queries (tool calls, last messages).
go/message.go Provider-agnostic message / tool-call domain types.
go/llm.go Provider-agnostic Inference interface + tool schema types.
go/domain.go Public interfaces/types for scripts, execution, state, results, options.
go/config.go ScenarioConfig definition.
go/ids.go KSUID-based IDs for thread/scenario/batch/run.
go/events.go Event types emitted during execution.
go/eventbus.go In-process channel-based event bus.
go/eventreporter.go HTTP reporter posting events to LangWatch API.
go/eventalert.go Console banner + “follow live” URL + coordination-file logic.
go/agent.go Agent roles, inputs/returns, config, and judge option types.
go/agent_user_simulator.go Built-in user simulator agent (role reversal + LLM call).
go/agent_judge.go Built-in judge agent (criteria tools, transcript + OTel digest).
go/README.md Go SDK documentation and usage examples.
go/CHANGELOG.md Initial Go SDK changelog entry.
go/go.mod Go module definition for SDK.
go/go.sum Go module dependency lockfile for SDK.
go/internal/judge_agent_tools.go Judge tool-argument parsing helpers.
go/internal/libraries/ptr/ptr.go Small pointer helper library.
go/internal/libraries/ptr/ptr_test.go Tests for ptr helpers.
go/internal/libraries/ksuid/README.md Internal KSUID library docs.
go/internal/libraries/ksuid/LICENSE Internal KSUID library license.
go/internal/libraries/ksuid/base62.go KSUID base62 decode implementation.
go/internal/libraries/ksuid/id.go KSUID ID type, parsing, encoding, JSON/db integration.
go/internal/libraries/ksuid/id_test.go KSUID ID tests/benchmarks.
go/internal/libraries/ksuid/instance_id.go Instance ID generation (docker/hardware/random).
go/internal/libraries/ksuid/node.go KSUID node generator.
go/internal/libraries/ksuid/node_test.go KSUID node benchmark.
go/providers/openai/openai.go OpenAI provider adapter implementing Inference.
go/providers/openai/convert.go Scenario<->OpenAI message/tool conversion helpers.
go/providers/openai/go.mod Provider module definition.
go/providers/openai/go.sum Provider dependency lockfile.
go/providers/anthropic/anthropic.go Anthropic provider adapter implementing Inference.
go/providers/anthropic/convert.go Scenario<->Anthropic message/tool conversion helpers.
go/providers/anthropic/go.mod Provider module definition.
go/providers/anthropic/go.sum Provider dependency lockfile.
go/providers/gemini/gemini.go Gemini provider adapter implementing Inference.
go/providers/gemini/convert.go Scenario<->Gemini message/tool conversion helpers.
go/providers/gemini/go.mod Provider module definition.
go/providers/gemini/go.sum Provider dependency lockfile.
go/providers/bedrock/bedrock.go Bedrock provider adapter implementing Inference.
go/providers/bedrock/convert.go Scenario<->Bedrock message/tool conversion helpers.
go/providers/bedrock/go.mod Provider module definition.
go/providers/bedrock/go.sum Provider dependency lockfile.
go/examples/.gitignore Ignore local env files for examples.
go/examples/.env.example Example env variables for running examples.
go/examples/go.mod Examples module definition.
go/examples/go.sum Examples dependency lockfile.
go/examples/helpers_test.go Example helper agents + tool mocking helpers.
go/examples/weather_agent_test.go Example scenario: weather tool calling.
go/examples/travel_agent_test.go Example scenario: multi-tool travel agent + judge criteria.
go/examples/vegetarian_recipe_agent_test.go Example scenario: multi-turn judge checkpoints.
go/examples/simple_tool_mocking_test.go Example scenario: tool mocking + parameter assertion.
go/examples/multiturn_10_scripted_test.go Example scenario: fully scripted 10-turn conversation + judge.
go/examples/mocked_weather_agent_tool_test.go Example scenario: injecting tool call/result messages.
go/examples/grouping_scenarios_test.go Example scenario: grouping via SetID.
go/examples/false_assumptions_test.go Example scenario: proceed options + bias criteria.
go/examples/error_handling_test.go Example scenario: agent error propagation into result.
go/examples/custom_judge_test.go Example scenario: fully custom LLM judge agent.
.release-please-manifest.json Adds go component version tracking.
.release-please-config.json Adds release-please config for Go component.
.github/workflows/go-ci.yml Go CI workflow (vet/test + providers + examples).
.github/workflows/go-publish.yml Go publish/indexing workflow triggered on releases.

Comment on lines +14 to +26
func ParseJudgeAgentFinishTestToolArguments(arguments string) (*JudgeAgentFinishTestToolArguments, error) {
	var resp *JudgeAgentFinishTestToolArguments
	if err := json.Unmarshal([]byte(arguments), &resp); err != nil {
		return nil, fmt.Errorf("failed to parse judge agent finish tool arguments: %w", err)
	}

	if resp.Verdict == "" {
		resp.Verdict = "inconclusive"
	}
	if resp.Reasoning == "" {
		resp.Reasoning = "No reasoning provided"
	}
	if resp.Criteria == nil {

Copilot AI Apr 9, 2026

json.Unmarshal is being done into a *JudgeAgentFinishTestToolArguments pointer (var resp *... then Unmarshal(&resp)). If the LLM returns null, resp remains nil and the subsequent field accesses will panic. Consider unmarshaling into a value struct (non-pointer) or explicitly handling the resp == nil case after unmarshal before setting defaults.
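
One way to act on this, sketched under assumptions: `finishArgs` is a hypothetical stand-in for `JudgeAgentFinishTestToolArguments`, and the JSON tags and the type of `Criteria` are guesses because the quoted hunk does not show them. The point is only that decoding into a value keeps the default-filling code safe when the input is a literal `null`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finishArgs is a hypothetical stand-in for JudgeAgentFinishTestToolArguments.
// Field names follow the quoted hunk; the JSON tags and the Criteria type are assumed.
type finishArgs struct {
	Verdict   string          `json:"verdict"`
	Reasoning string          `json:"reasoning"`
	Criteria  map[string]bool `json:"criteria"`
}

// parseFinishArgs decodes into a value rather than a nil pointer. For a
// struct target, a JSON null is a no-op, so resp is always safe to read.
func parseFinishArgs(arguments string) (*finishArgs, error) {
	var resp finishArgs
	if err := json.Unmarshal([]byte(arguments), &resp); err != nil {
		return nil, fmt.Errorf("failed to parse judge agent finish tool arguments: %w", err)
	}

	// Fill defaults for anything the model left out.
	if resp.Verdict == "" {
		resp.Verdict = "inconclusive"
	}
	if resp.Reasoning == "" {
		resp.Reasoning = "No reasoning provided"
	}
	if resp.Criteria == nil {
		resp.Criteria = map[string]bool{}
	}
	return &resp, nil
}
```

With this shape, `parseFinishArgs("null")` returns a struct carrying the defaults instead of panicking on a nil dereference.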

Comment on lines +53 to +67
func getHardwareAddr(ctx context.Context) (net.HardwareAddr, error) {
	addrs, err := net.Interfaces()
	if err != nil {
		return nil, err
	}

	for _, addr := range addrs {
		// only return physical interfaces (i.e. not loopback)
		if len(addr.HardwareAddr) >= 6 {
			return addr.HardwareAddr, nil
		}
	}

	return nil, fmt.Errorf("%w: %w", ErrNoHardwareAddress, err)
}

Copilot AI Apr 9, 2026

err is guaranteed to be nil here (all earlier error returns have already happened), so fmt.Errorf("%w: %w", ErrNoHardwareAddress, err) either adds a confusing <nil> wrap or relies on multi-%w behavior. Prefer returning ErrNoHardwareAddress directly (or include a concrete context error) to avoid wrapping nil.

Comment on lines +38 to +53
// NewNode returns a ID generator for the current machine.
func NewNode(environment string, instanceID InstanceID) *Node {
	return &Node{
		InstanceID: instanceID,
	}
}

// Generate returns a new ID for the machine and resource configured.
func (n *Node) Generate(ctx context.Context, resource string) (id ID) {
	if strings.ContainsRune(resource, '_') {
		panic(fmt.Errorf("ksuid resource contains underscore: %s", resource))
	}

	id.Environment = Production
	id.Resource = resource
	id.InstanceID = n.InstanceID

Copilot AI Apr 9, 2026

NewNode(environment, instanceID) ignores the environment argument, and Generate hard-codes id.Environment = Production, making it impossible to generate non-prod/environment-prefixed KSUIDs despite the API/docs implying it. Consider storing environment on Node and using it when populating id.Environment.
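
A rough sketch of storing the environment on the node. All types here are trimmed-down, hypothetical stand-ins for the ksuid package (the real `ID`, `InstanceID`, and `Production` are not shown in the hunk), and the timestamp, sequence, and `ctx` handling are omitted. Defaulting an empty environment to `Production` is an assumption, not something the PR states.

```go
package main

import (
	"fmt"
	"strings"
)

// Production is a placeholder value; the real constant lives in the ksuid package.
const Production = "prod"

// InstanceID and ID are simplified, hypothetical versions of the ksuid types.
type InstanceID [8]byte

type ID struct {
	Environment string
	Resource    string
	InstanceID  InstanceID
}

// Node remembers the environment it was created for.
type Node struct {
	Environment string
	InstanceID  InstanceID
}

// NewNode keeps the environment argument instead of discarding it.
// An empty environment falls back to Production (assumed behavior).
func NewNode(environment string, instanceID InstanceID) *Node {
	if environment == "" {
		environment = Production
	}
	return &Node{Environment: environment, InstanceID: instanceID}
}

// Generate stamps the node's environment onto the ID rather than a constant.
func (n *Node) Generate(resource string) (id ID) {
	if strings.ContainsRune(resource, '_') {
		panic(fmt.Errorf("ksuid resource contains underscore: %s", resource))
	}
	id.Environment = n.Environment
	id.Resource = resource
	id.InstanceID = n.InstanceID
	return id
}
```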

Comment on lines +79 to +97
case scenario.MessageRoleTool:
	var responseData map[string]any
	if msg.Content != "" {
		if err := json.Unmarshal([]byte(msg.Content), &responseData); err != nil {
			responseData = map[string]any{"result": msg.Content}
		}
	}
	result = append(result, &genai.Content{
		Role: "user",
		Parts: []*genai.Part{
			{
				FunctionResponse: &genai.FunctionResponse{
					Name: msg.ToolCallID,
					Response: responseData,
				},
			},
		},
	})


Copilot AI Apr 9, 2026

For Gemini tool results, FunctionResponse.Name should match the function/tool name, but this code sets it to msg.ToolCallID (an OpenAI-style call ID). This will break tool-call flows whenever ToolCallID != tool name (e.g. the mocked tool-call example uses IDs like call_mock_001). Consider building a toolCallID -> toolName map by scanning prior assistant messages' ToolCalls, and use that mapped tool name when creating FunctionResponse.
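
The lookup described above could be sketched as follows. `Message` and `ToolCall` are hypothetical, trimmed-down versions of the SDK's message types, and both helper names are invented for illustration. Falling back to the raw ID when no matching call is found is an assumption that preserves today's behavior for that case.

```go
package main

// ToolCall and Message are hypothetical, simplified versions of the
// scenario package's types; only the fields needed here are included.
type ToolCall struct {
	ID        string
	Name      string
	Arguments string
}

type Message struct {
	Role       string
	Content    string
	ToolCallID string
	ToolCalls  []ToolCall
}

// toolNamesByCallID scans the conversation once and records which function
// each tool-call ID refers to.
func toolNamesByCallID(messages []Message) map[string]string {
	names := make(map[string]string)
	for _, msg := range messages {
		for _, tc := range msg.ToolCalls {
			names[tc.ID] = tc.Name
		}
	}
	return names
}

// functionResponseName resolves the function name for a tool result,
// falling back to the raw ID when no matching call was seen.
func functionResponseName(names map[string]string, toolCallID string) string {
	if name, ok := names[toolCallID]; ok {
		return name
	}
	return toolCallID
}
```

The converter would build the map once per conversion and use `functionResponseName` where it currently passes `msg.ToolCallID`.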

Comment on lines +121 to +134
if part.FunctionCall != nil {
	args := "{}"
	if part.FunctionCall.Args != nil {
		b, err := json.Marshal(part.FunctionCall.Args)
		if err == nil {
			args = string(b)
		}
	}
	msg.ToolCalls = append(msg.ToolCalls, scenario.ToolCall{
		ID: part.FunctionCall.Name,
		Name: part.FunctionCall.Name,
		Arguments: args,
	})
}

Copilot AI Apr 9, 2026

Gemini responses don’t appear to provide a unique tool-call ID, but this conversion sets ToolCall.ID to part.FunctionCall.Name. If the model emits multiple calls to the same function, IDs will collide and downstream tool-result correlation via ToolCallID becomes ambiguous. Consider generating a deterministic unique ID per tool call (e.g., call_1, call_2, …) while preserving Name for the function name.
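
A small sketch of positional IDs, assuming the calls of one response are available as a slice. `functionCall` and `toToolCalls` are hypothetical names, and `ToolCall` is a simplified stand-in for `scenario.ToolCall`.

```go
package main

import "fmt"

// ToolCall is a simplified stand-in for scenario.ToolCall.
type ToolCall struct {
	ID        string
	Name      string
	Arguments string
}

// functionCall is a hypothetical stand-in for the name/args pair carried by
// a Gemini response part, with the args already marshalled to JSON.
type functionCall struct {
	Name string
	Args string
}

// toToolCalls assigns a deterministic, position-based ID to every call so
// that repeated calls to the same function stay distinguishable.
func toToolCalls(calls []functionCall) []ToolCall {
	out := make([]ToolCall, 0, len(calls))
	for i, c := range calls {
		out = append(out, ToolCall{
			ID:        fmt.Sprintf("call_%d", i+1),
			Name:      c.Name,
			Arguments: c.Args,
		})
	}
	return out
}
```

IDs generated this way are only unique within a single response. If tool results are correlated across turns, the counter would need to span the whole conversation or include a per-response prefix.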

Comment on lines +64 to +71
func showWatchMessage(setURL, scenarioSetID string) {
	if isGreetingDisabled() {
		return
	}

	if !createCoordinationFile("watch-" + scenarioSetID) {
		return
	}

Copilot AI Apr 9, 2026

scenarioSetID is concatenated into fileType and used to build a temp-file path. Because SetID is user-controlled, values containing path separators (e.g., ../ or /) can change the resulting path and cause unexpected failures or collisions. Consider sanitizing scenarioSetID (e.g., replace non [A-Za-z0-9._-] chars) before using it in a filename.
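
A minimal sketch of the sanitization, using the character set named above. `sanitizeForFilename` is a hypothetical helper name; replacing with an underscore is one choice among several.

```go
package main

import "regexp"

// unsafeFileChars matches every character outside the allow-list
// [A-Za-z0-9._-], including path separators.
var unsafeFileChars = regexp.MustCompile(`[^A-Za-z0-9._-]`)

// sanitizeForFilename replaces each disallowed character with an underscore
// so the result can be used as a single path component.
func sanitizeForFilename(s string) string {
	return unsafeFileChars.ReplaceAllString(s, "_")
}
```

The call site would then become `createCoordinationFile("watch-" + sanitizeForFilename(scenarioSetID))`. Distinct set IDs can map to the same sanitized name (for example `a/b` and `a_b`), so appending a short hash of the original ID may be worth considering if that matters.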

Comment on lines +8 to +28
// ValueOrNil returns the value of the pointer if it is not nil, otherwise it returns the
// zero value of the type.
func ValueOrNil[T any](v *T) T {
	if v == nil {
		var zero T
		return zero
	}

	return *v
}

// ValueOrZero returns the value of the pointer if it is not nil, otherwise it returns
// the zero value of the type.
func ValueOrZero[T any](v *T) T {
	if v == nil {
		var zero T
		return zero
	}

	return *v
}

Copilot AI Apr 9, 2026

ValueOrNil and ValueOrZero have identical implementations and semantics (both return the zero value when nil). Keeping both increases API surface without adding behavior. Consider removing one of them or changing one to provide distinct semantics.

Comment on lines +1 to +5
# ksuid

ksuid is a Go library that generated prefixed, k-sorted globally unique identifiers.

Each KSUID has a resource type and optionally an environment prefix (no environment prefix is for prod use only). They are roughly sortable down to per-second resolution.

Copilot AI Apr 9, 2026

Grammar: “ksuid is a Go library that generated …” should be “ksuid is a Go library that generates …”.

Comment on lines +164 to +187
func toGeminiSchema(params map[string]any) *genai.Schema {
	if params == nil {
		return nil
	}

	schema := &genai.Schema{
		Type: genai.TypeObject,
	}

	if props, ok := params["properties"].(map[string]any); ok {
		schema.Properties = make(map[string]*genai.Schema)
		for name, propDef := range props {
			schema.Properties[name] = convertPropertyToSchema(propDef)
		}
	}

	if req, ok := params["required"].([]any); ok {
		for _, r := range req {
			if s, ok := r.(string); ok {
				schema.Required = append(schema.Required, s)
			}
		}
	}


Copilot AI Apr 9, 2026

toGeminiSchema only reads required when it’s typed as []any, but callers commonly provide JSON schema required as []string (as in the examples). This means required fields will be silently dropped for Gemini tool definitions. Consider accepting both []string and []any (string elements) when populating schema.Required.
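
A sketch of a shared helper that handles both shapes. `requiredFields` is a hypothetical name; the same function could serve any provider converter that reads `required` from a `map[string]any` schema.

```go
package main

// requiredFields extracts the JSON-schema "required" list from params.
// It accepts []string (typical for schemas built in Go code) as well as
// []any (typical for schemas decoded from JSON), skipping non-string items.
func requiredFields(params map[string]any) []string {
	switch req := params["required"].(type) {
	case []string:
		// Copy so callers cannot mutate the schema through the result.
		return append([]string(nil), req...)
	case []any:
		out := make([]string, 0, len(req))
		for _, r := range req {
			if s, ok := r.(string); ok {
				out = append(out, s)
			}
		}
		return out
	default:
		// Missing key or unsupported type: no required fields.
		return nil
	}
}
```

In the hunk above, the loop over `params["required"]` would collapse to `schema.Required = requiredFields(params)`.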

Comment on lines +137 to +156
func toAnthropicTools(tools []scenario.ToolDefinition) []anthropic.ToolUnionParam {
	result := make([]anthropic.ToolUnionParam, 0, len(tools))
	for _, tool := range tools {
		tp := &anthropic.ToolParam{
			Name: tool.Name,
			Description: anthropic.String(tool.Description),
			InputSchema: anthropic.ToolInputSchemaParam{
				Properties: tool.Parameters["properties"],
			},
		}
		if req, ok := tool.Parameters["required"].([]any); ok {
			reqStrings := make([]string, 0, len(req))
			for _, r := range req {
				if s, ok := r.(string); ok {
					reqStrings = append(reqStrings, s)
				}
			}
			tp.InputSchema.Required = reqStrings
		}
		result = append(result, anthropic.ToolUnionParam{OfTool: tp})

Copilot AI Apr 9, 2026

toAnthropicTools only reads required when it’s typed as []any, but callers commonly provide JSON schema required as []string (as in the examples). This means required fields will be silently dropped in the Anthropic tool schema. Consider accepting both []string and []any (string elements) when building tp.InputSchema.Required.
