feat: Go SDK with examples, CI/CD, and release-please #80
0xdeafcafe wants to merge 31 commits into main.
Branch updated from 77a92af to 9fdb87c.
Branch updated from 61dfaae to 86db52d.
Extract LLM inference interface and message types into dedicated files. Update agent interfaces and user simulator with refined API.
- Prefix event types with SCENARIO_ (SCENARIO_RUN_STARTED, etc.)
- Add ScenarioRunStatus type with uppercase values (SUCCESS, ERROR, FAILED)
- Rewrite eventalert with tmpdir file coordination across processes
- Show greeting banner only when API key is missing
- Use path-based watch URL format ({setUrl}/{batchRunId})
- Add SCENARIO_HEADLESS env var to suppress browser open
- Add SCENARIO_DISABLE_SIMULATION_REPORT_INFO env var to suppress banners
- Scope watch message per scenarioSetId
- Cache batch run ID per process with sync.Once
Add go.opentelemetry.io/otel, otel/sdk, otel/trace, otel/attribute, otel/codes and github.com/langwatch/langwatch/sdk-go for full OTel tracing integration.
- SpanCollector: implements sdktrace.SpanProcessor to collect spans, filters by thread ID with parent chain walking
- SpanDigestFormatter: renders spans as plain-text hierarchy with timestamps, durations, attributes, events, and error sections
- setupObservability: creates LangWatch exporter, TracerProvider, and SpanCollector; wires into global OTel provider
- Remove TracedInference and instrumentBuiltInAgents (replaced by OTel)
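A minimal sketch of thread filtering with parent-chain walking. It uses a hypothetical flattened SpanInfo type instead of OTel's sdktrace.ReadOnlySpan so it runs with the standard library only, and it assumes the nearest span in the chain that carries a thread ID decides which thread a span belongs to.

```go
package main

// SpanInfo is a hypothetical, simplified view of a finished span.
type SpanInfo struct {
	ID       string
	ParentID string // empty for root spans
	ThreadID string // value of a thread-ID attribute, empty if not set
	Name     string
}

// SpansForThread returns the spans that belong to threadID, either because the
// span carries that thread ID itself or because its nearest tagged ancestor does.
func SpansForThread(spans []SpanInfo, threadID string) []SpanInfo {
	byID := make(map[string]SpanInfo, len(spans))
	for _, s := range spans {
		byID[s.ID] = s
	}

	var out []SpanInfo
	for _, s := range spans {
		// Walk up the parent chain until a thread ID is found, the chain ends
		// (parent not collected), or a cycle is detected.
		seen := map[string]bool{}
		cur, ok := s, true
		for ok && !seen[cur.ID] {
			if cur.ThreadID != "" {
				if cur.ThreadID == threadID {
					out = append(out, s)
				}
				break
			}
			seen[cur.ID] = true
			cur, ok = byID[cur.ParentID]
		}
	}
	return out
}
```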
- Execution: create per-turn and per-agent spans with tracer, end spans at all exit paths
- JudgeAgent: build transcript from messages, include OTel trace digest in judge prompt alongside conversation transcript
- Runner: init observability when API key present, wire span collector to judge agents and execution, default endpoint to app.langwatch.ai, default SetID to "default", shutdown observability after run
Multi-provider inference abstraction supporting OpenAI, Anthropic, Google Gemini, and AWS Bedrock with tool/function calling conversion.
…-please API alignment:
- AgentRole values: lowercase → Title case to match JS/Python
- Rename LastAssistantMessage → LastAgentMessage for consistency
- Add Verbose and Metadata fields to ScenarioConfig
- Wire Verbose (prints failure details) and Metadata (sent in events)
Bug fix:
- Fix swapped ToolMessage arguments in OpenAI provider (content/toolCallID)
Examples (10 test files in go/examples/):
- weather-agent, vegetarian-recipe, travel-agent, false-assumptions, grouping-scenarios, error-handling, simple-tool-mocking, custom-judge, multiturn-10-scripted, mocked-weather-agent-tool
CI/CD:
- go-ci.yml: vet, test, provider checks, example tests with secrets
- go-publish.yml: verify + warm Go module proxy on release
- release-please: add go component (release-type: go, v0.1.0)
- version.go, CHANGELOG.md for release-please integration
Automated low-risk assessment
This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk. It requires a manual review before merging.
Pull request overview
Introduces an initial Scenario Go SDK (core runner/execution engine, DSL helpers, LangWatch event reporting + OTel tracing) plus provider adapters (OpenAI/Anthropic/Gemini/Bedrock), examples, and Go CI/publish workflows.
Changes:
- Add Go SDK core types (agents, messages, execution, script DSL) and LangWatch event reporting / OTel tracing utilities.
- Add provider-specific Inference adapters for OpenAI, Anthropic, Gemini, and AWS Bedrock.
- Add Go examples/tests, release-please config entries, and GitHub Actions workflows for Go CI and publishing.
Reviewed changes
Copilot reviewed 66 out of 72 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| go/version.go | SDK version constant. |
| go/utils.go | Criterion param-name normalization + message role reversal utilities. |
| go/tracing.go | Tracing notes / legacy placeholder. |
| go/tracing_setup.go | OTel + LangWatch exporter setup and handle. |
| go/tracing_digest.go | Plain-text span digest formatter for judge evaluation. |
| go/tracing_collector.go | OTel span processor/collector to attach spans to judge. |
| go/script.go | Script DSL helper functions (User/Agent/Judge/Proceed/...). |
| go/runner.go | Run() entrypoint, option handling, reporter + observability wiring. |
| go/execution.go | Core scenario execution engine (turn loop, agent calls, events, spans). |
| go/executionstate.go | Execution state tracking and helper queries (tool calls, last messages). |
| go/message.go | Provider-agnostic message / tool-call domain types. |
| go/llm.go | Provider-agnostic Inference interface + tool schema types. |
| go/domain.go | Public interfaces/types for scripts, execution, state, results, options. |
| go/config.go | ScenarioConfig definition. |
| go/ids.go | KSUID-based IDs for thread/scenario/batch/run. |
| go/events.go | Event types emitted during execution. |
| go/eventbus.go | In-process channel-based event bus. |
| go/eventreporter.go | HTTP reporter posting events to LangWatch API. |
| go/eventalert.go | Console banner + “follow live” URL + coordination-file logic. |
| go/agent.go | Agent roles, inputs/returns, config, and judge option types. |
| go/agent_user_simulator.go | Built-in user simulator agent (role reversal + LLM call). |
| go/agent_judge.go | Built-in judge agent (criteria tools, transcript + OTel digest). |
| go/README.md | Go SDK documentation and usage examples. |
| go/CHANGELOG.md | Initial Go SDK changelog entry. |
| go/go.mod | Go module definition for SDK. |
| go/go.sum | Go module dependency lockfile for SDK. |
| go/internal/judge_agent_tools.go | Judge tool-argument parsing helpers. |
| go/internal/libraries/ptr/ptr.go | Small pointer helper library. |
| go/internal/libraries/ptr/ptr_test.go | Tests for ptr helpers. |
| go/internal/libraries/ksuid/README.md | Internal KSUID library docs. |
| go/internal/libraries/ksuid/LICENSE | Internal KSUID library license. |
| go/internal/libraries/ksuid/base62.go | KSUID base62 decode implementation. |
| go/internal/libraries/ksuid/id.go | KSUID ID type, parsing, encoding, JSON/db integration. |
| go/internal/libraries/ksuid/id_test.go | KSUID ID tests/benchmarks. |
| go/internal/libraries/ksuid/instance_id.go | Instance ID generation (docker/hardware/random). |
| go/internal/libraries/ksuid/node.go | KSUID node generator. |
| go/internal/libraries/ksuid/node_test.go | KSUID node benchmark. |
| go/providers/openai/openai.go | OpenAI provider adapter implementing Inference. |
| go/providers/openai/convert.go | Scenario<->OpenAI message/tool conversion helpers. |
| go/providers/openai/go.mod | Provider module definition. |
| go/providers/openai/go.sum | Provider dependency lockfile. |
| go/providers/anthropic/anthropic.go | Anthropic provider adapter implementing Inference. |
| go/providers/anthropic/convert.go | Scenario<->Anthropic message/tool conversion helpers. |
| go/providers/anthropic/go.mod | Provider module definition. |
| go/providers/anthropic/go.sum | Provider dependency lockfile. |
| go/providers/gemini/gemini.go | Gemini provider adapter implementing Inference. |
| go/providers/gemini/convert.go | Scenario<->Gemini message/tool conversion helpers. |
| go/providers/gemini/go.mod | Provider module definition. |
| go/providers/gemini/go.sum | Provider dependency lockfile. |
| go/providers/bedrock/bedrock.go | Bedrock provider adapter implementing Inference. |
| go/providers/bedrock/convert.go | Scenario<->Bedrock message/tool conversion helpers. |
| go/providers/bedrock/go.mod | Provider module definition. |
| go/providers/bedrock/go.sum | Provider dependency lockfile. |
| go/examples/.gitignore | Ignore local env files for examples. |
| go/examples/.env.example | Example env variables for running examples. |
| go/examples/go.mod | Examples module definition. |
| go/examples/go.sum | Examples dependency lockfile. |
| go/examples/helpers_test.go | Example helper agents + tool mocking helpers. |
| go/examples/weather_agent_test.go | Example scenario: weather tool calling. |
| go/examples/travel_agent_test.go | Example scenario: multi-tool travel agent + judge criteria. |
| go/examples/vegetarian_recipe_agent_test.go | Example scenario: multi-turn judge checkpoints. |
| go/examples/simple_tool_mocking_test.go | Example scenario: tool mocking + parameter assertion. |
| go/examples/multiturn_10_scripted_test.go | Example scenario: fully scripted 10-turn conversation + judge. |
| go/examples/mocked_weather_agent_tool_test.go | Example scenario: injecting tool call/result messages. |
| go/examples/grouping_scenarios_test.go | Example scenario: grouping via SetID. |
| go/examples/false_assumptions_test.go | Example scenario: proceed options + bias criteria. |
| go/examples/error_handling_test.go | Example scenario: agent error propagation into result. |
| go/examples/custom_judge_test.go | Example scenario: fully custom LLM judge agent. |
| .release-please-manifest.json | Adds go component version tracking. |
| .release-please-config.json | Adds release-please config for Go component. |
| .github/workflows/go-ci.yml | Go CI workflow (vet/test + providers + examples). |
| .github/workflows/go-publish.yml | Go publish/indexing workflow triggered on releases. |
func ParseJudgeAgentFinishTestToolArguments(arguments string) (*JudgeAgentFinishTestToolArguments, error) {
	var resp *JudgeAgentFinishTestToolArguments
	if err := json.Unmarshal([]byte(arguments), &resp); err != nil {
		return nil, fmt.Errorf("failed to parse judge agent finish tool arguments: %w", err)
	}

	if resp.Verdict == "" {
		resp.Verdict = "inconclusive"
	}
	if resp.Reasoning == "" {
		resp.Reasoning = "No reasoning provided"
	}
	if resp.Criteria == nil {
json.Unmarshal is being done into a *JudgeAgentFinishTestToolArguments pointer (var resp *... then Unmarshal(&resp)). If the LLM returns null, resp remains nil and the subsequent field accesses will panic. Consider unmarshaling into a value struct (non-pointer) or explicitly handling the resp == nil case after unmarshal before setting defaults.
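A minimal sketch of the first suggestion (unmarshal into a value), assuming the JSON tags shown and a map type for Criteria. The original excerpt is cut off at the Criteria check, so the empty-map default below is hypothetical. encoding/json leaves a struct unchanged and returns no error when it decodes null into it, so the defaults always apply.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JudgeAgentFinishTestToolArguments mirrors the fields used in the excerpt;
// the JSON tags and the Criteria type are assumptions.
type JudgeAgentFinishTestToolArguments struct {
	Verdict   string            `json:"verdict"`
	Reasoning string            `json:"reasoning"`
	Criteria  map[string]string `json:"criteria"`
}

func ParseJudgeAgentFinishTestToolArguments(arguments string) (*JudgeAgentFinishTestToolArguments, error) {
	// Decode into a value, not a nil pointer: a JSON null then leaves resp as
	// its zero value instead of nil, so the field accesses below cannot panic.
	var resp JudgeAgentFinishTestToolArguments
	if err := json.Unmarshal([]byte(arguments), &resp); err != nil {
		return nil, fmt.Errorf("failed to parse judge agent finish tool arguments: %w", err)
	}

	if resp.Verdict == "" {
		resp.Verdict = "inconclusive"
	}
	if resp.Reasoning == "" {
		resp.Reasoning = "No reasoning provided"
	}
	if resp.Criteria == nil {
		resp.Criteria = map[string]string{} // hypothetical default
	}
	return &resp, nil
}
```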
func getHardwareAddr(ctx context.Context) (net.HardwareAddr, error) {
	addrs, err := net.Interfaces()
	if err != nil {
		return nil, err
	}

	for _, addr := range addrs {
		// only return physical interfaces (i.e. not loopback)
		if len(addr.HardwareAddr) >= 6 {
			return addr.HardwareAddr, nil
		}
	}

	return nil, fmt.Errorf("%w: %w", ErrNoHardwareAddress, err)
}
err is guaranteed to be nil here (all earlier error returns have already happened), so fmt.Errorf("%w: %w", ErrNoHardwareAddress, err) either adds a confusing <nil> wrap or relies on multi-%w behavior. Prefer returning ErrNoHardwareAddress directly (or include a concrete context error) to avoid wrapping nil.
// NewNode returns a ID generator for the current machine.
func NewNode(environment string, instanceID InstanceID) *Node {
	return &Node{
		InstanceID: instanceID,
	}
}

// Generate returns a new ID for the machine and resource configured.
func (n *Node) Generate(ctx context.Context, resource string) (id ID) {
	if strings.ContainsRune(resource, '_') {
		panic(fmt.Errorf("ksuid resource contains underscore: %s", resource))
	}

	id.Environment = Production
	id.Resource = resource
	id.InstanceID = n.InstanceID
NewNode(environment, instanceID) ignores the environment argument, and Generate hard-codes id.Environment = Production, making it impossible to generate non-prod/environment-prefixed KSUIDs despite the API/docs implying it. Consider storing environment on Node and using it when populating id.Environment.
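A sketch of carrying the environment through from NewNode to Generate. Environment, Production, ID and Node are simplified stand-ins for the internal ksuid types (InstanceID is reduced to a string, the ctx parameter and the timestamp/sequence fields are dropped), and treating an empty environment as Production is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

type Environment string

// The underlying value of Production is an assumption.
const Production Environment = "prod"

type ID struct {
	Environment Environment
	Resource    string
	InstanceID  string
}

type Node struct {
	Environment Environment
	InstanceID  string
}

// NewNode returns an ID generator for the current machine and remembers the
// requested environment instead of discarding it.
func NewNode(environment string, instanceID string) *Node {
	return &Node{Environment: Environment(environment), InstanceID: instanceID}
}

// Generate returns a new ID for the configured environment and resource.
func (n *Node) Generate(resource string) ID {
	if strings.ContainsRune(resource, '_') {
		panic(fmt.Errorf("ksuid resource contains underscore: %s", resource))
	}
	env := n.Environment
	if env == "" {
		env = Production // assumption: no environment means production
	}
	return ID{Environment: env, Resource: resource, InstanceID: n.InstanceID}
}
```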
case scenario.MessageRoleTool:
	var responseData map[string]any
	if msg.Content != "" {
		if err := json.Unmarshal([]byte(msg.Content), &responseData); err != nil {
			responseData = map[string]any{"result": msg.Content}
		}
	}
	result = append(result, &genai.Content{
		Role: "user",
		Parts: []*genai.Part{
			{
				FunctionResponse: &genai.FunctionResponse{
					Name:     msg.ToolCallID,
					Response: responseData,
				},
			},
		},
	})
For Gemini tool results, FunctionResponse.Name should match the function/tool name, but this code sets it to msg.ToolCallID (an OpenAI-style call ID). This will break tool-call flows whenever ToolCallID != tool name (e.g. the mocked tool-call example uses IDs like call_mock_001). Consider building a toolCallID -> toolName map by scanning prior assistant messages' ToolCalls, and use that mapped tool name when creating FunctionResponse.
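A sketch of the suggested toolCallID -> tool name index. Message and ToolCall are minimal stand-ins for the scenario types, and falling back to the raw ID when no match is found is an assumption; the returned name is what would go into FunctionResponse.Name in place of msg.ToolCallID.

```go
package main

// Minimal stand-ins for the scenario message types in the excerpt.
type ToolCall struct{ ID, Name, Arguments string }

type Message struct {
	Role       string
	Content    string
	ToolCalls  []ToolCall
	ToolCallID string
}

// toolNamesByCallID maps every tool-call ID emitted by earlier assistant
// messages to the function name it invoked. Build it once before converting.
func toolNamesByCallID(msgs []Message) map[string]string {
	names := make(map[string]string)
	for _, m := range msgs {
		for _, tc := range m.ToolCalls {
			names[tc.ID] = tc.Name
		}
	}
	return names
}

// functionResponseName resolves the name for a tool-result message.
func functionResponseName(names map[string]string, toolMsg Message) string {
	if name, ok := names[toolMsg.ToolCallID]; ok {
		return name
	}
	return toolMsg.ToolCallID // fallback: previous behaviour when no call matches
}
```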
if part.FunctionCall != nil {
	args := "{}"
	if part.FunctionCall.Args != nil {
		b, err := json.Marshal(part.FunctionCall.Args)
		if err == nil {
			args = string(b)
		}
	}
	msg.ToolCalls = append(msg.ToolCalls, scenario.ToolCall{
		ID:        part.FunctionCall.Name,
		Name:      part.FunctionCall.Name,
		Arguments: args,
	})
}
Gemini responses don’t appear to provide a unique tool-call ID, but this conversion sets ToolCall.ID to part.FunctionCall.Name. If the model emits multiple calls to the same function, IDs will collide and downstream tool-result correlation via ToolCallID becomes ambiguous. Consider generating a deterministic unique ID per tool call (e.g., call_1, call_2, …) while preserving Name for the function name.
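A sketch of generating position-based IDs (call_1, call_2, ...) while keeping Name as the function name. functionCall is a hypothetical stand-in for the Gemini function-call part. If the Gemini SDK version in use does populate an ID on function calls, preferring that value would be the more robust option.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type ToolCall struct{ ID, Name, Arguments string }

// functionCall is a hypothetical stand-in for a Gemini function-call part.
type functionCall struct {
	Name string
	Args map[string]any
}

// toToolCalls converts the function calls of one response, numbering them in
// order so that two calls to the same function never share an ID.
func toToolCalls(fcs []functionCall) []ToolCall {
	out := make([]ToolCall, 0, len(fcs))
	for i, fc := range fcs {
		args := "{}"
		if fc.Args != nil {
			if b, err := json.Marshal(fc.Args); err == nil {
				args = string(b)
			}
		}
		out = append(out, ToolCall{
			ID:        fmt.Sprintf("call_%d", i+1),
			Name:      fc.Name,
			Arguments: args,
		})
	}
	return out
}
```

These IDs are unique only within one response; if several responses end up in the same conversation history, a per-turn prefix would be needed to keep them unique across it.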
func showWatchMessage(setURL, scenarioSetID string) {
	if isGreetingDisabled() {
		return
	}

	if !createCoordinationFile("watch-" + scenarioSetID) {
		return
	}
scenarioSetID is concatenated into fileType and used to build a temp-file path. Because SetID is user-controlled, values containing path separators (e.g., ../ or /) can change the resulting path and cause unexpected failures or collisions. Consider sanitizing scenarioSetID (e.g., replace non [A-Za-z0-9._-] chars) before using it in a filename.
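A sketch of the suggested sanitization using the allow-list from the comment. The helper name and the fallback for empty or dot-only names are assumptions. Distinct SetIDs such as a/b and a_b map to the same file name; if that matters, appending a short hash of the original SetID is one option.

```go
package main

import "regexp"

// unsafeFileChars matches every rune outside [A-Za-z0-9._-].
var unsafeFileChars = regexp.MustCompile(`[^A-Za-z0-9._-]`)

// sanitizeFileComponent makes a user-supplied SetID safe to embed in a file
// name: path separators and other unsafe runes become underscores.
func sanitizeFileComponent(s string) string {
	s = unsafeFileChars.ReplaceAllString(s, "_")
	if s == "" || s == "." || s == ".." {
		// Hypothetical fallback so the name never collapses to a directory reference.
		return "_"
	}
	return s
}
```

Usage would be createCoordinationFile("watch-" + sanitizeFileComponent(scenarioSetID)).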
// ValueOrNil returns the value of the pointer if it is not nil, otherwise it returns the
// zero value of the type.
func ValueOrNil[T any](v *T) T {
	if v == nil {
		var zero T
		return zero
	}

	return *v
}

// ValueOrZero returns the value of the pointer if it is not nil, otherwise it returns
// the zero value of the type.
func ValueOrZero[T any](v *T) T {
	if v == nil {
		var zero T
		return zero
	}

	return *v
}
ValueOrNil and ValueOrZero have identical implementations and semantics (both return the zero value when nil). Keeping both increases API surface without adding behavior. Consider removing one of them or changing one to provide distinct semantics.
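One option is to keep ValueOrZero and replace ValueOrNil with a helper whose behaviour actually differs, such as a caller-supplied fallback. ValueOr is hypothetical and not part of the PR; package main is used only so the sketch compiles standalone (the real code lives in internal/libraries/ptr).

```go
package main

// ValueOrZero returns *v, or the zero value of T when v is nil.
func ValueOrZero[T any](v *T) T {
	if v == nil {
		var zero T
		return zero
	}
	return *v
}

// ValueOr returns *v, or fallback when v is nil (hypothetical replacement for
// ValueOrNil with distinct semantics).
func ValueOr[T any](v *T, fallback T) T {
	if v == nil {
		return fallback
	}
	return *v
}
```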
# ksuid

ksuid is a Go library that generated prefixed, k-sorted globally unique identifiers.

Each KSUID has a resource type and optionally an environment prefix (no environment prefix is for prod use only). They are roughly sortable down to per-second resolution.
Grammar: “ksuid is a Go library that generated …” should be “ksuid is a Go library that generates …”.
func toGeminiSchema(params map[string]any) *genai.Schema {
	if params == nil {
		return nil
	}

	schema := &genai.Schema{
		Type: genai.TypeObject,
	}

	if props, ok := params["properties"].(map[string]any); ok {
		schema.Properties = make(map[string]*genai.Schema)
		for name, propDef := range props {
			schema.Properties[name] = convertPropertyToSchema(propDef)
		}
	}

	if req, ok := params["required"].([]any); ok {
		for _, r := range req {
			if s, ok := r.(string); ok {
				schema.Required = append(schema.Required, s)
			}
		}
	}
toGeminiSchema only reads required when it’s typed as []any, but callers commonly provide JSON schema required as []string (as in the examples). This means required fields will be silently dropped for Gemini tool definitions. Consider accepting both []string and []any (string elements) when populating schema.Required.
func toAnthropicTools(tools []scenario.ToolDefinition) []anthropic.ToolUnionParam {
	result := make([]anthropic.ToolUnionParam, 0, len(tools))
	for _, tool := range tools {
		tp := &anthropic.ToolParam{
			Name:        tool.Name,
			Description: anthropic.String(tool.Description),
			InputSchema: anthropic.ToolInputSchemaParam{
				Properties: tool.Parameters["properties"],
			},
		}
		if req, ok := tool.Parameters["required"].([]any); ok {
			reqStrings := make([]string, 0, len(req))
			for _, r := range req {
				if s, ok := r.(string); ok {
					reqStrings = append(reqStrings, s)
				}
			}
			tp.InputSchema.Required = reqStrings
		}
		result = append(result, anthropic.ToolUnionParam{OfTool: tp})
toAnthropicTools only reads required when it’s typed as []any, but callers commonly provide JSON schema required as []string (as in the examples). This means required fields will be silently dropped in the Anthropic tool schema. Consider accepting both []string and []any (string elements) when building tp.InputSchema.Required.
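Both the Gemini and the Anthropic comments can be addressed with one shared helper that accepts either shape of "required". This is a sketch; the name requiredFields is hypothetical. It would replace the []any type assertion in toGeminiSchema (schema.Required = requiredFields(params)) and in toAnthropicTools (tp.InputSchema.Required = requiredFields(tool.Parameters)).

```go
package main

// requiredFields extracts the JSON-schema "required" list whether the caller
// built it as []string (typical in Go literals) or []any (typical after
// json.Unmarshal). Non-string elements are skipped; a missing key yields nil.
func requiredFields(params map[string]any) []string {
	switch req := params["required"].(type) {
	case []string:
		return append([]string(nil), req...)
	case []any:
		out := make([]string, 0, len(req))
		for _, r := range req {
			if s, ok := r.(string); ok {
				out = append(out, s)
			}
		}
		return out
	default:
		return nil
	}
}
```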
Summary
Full Go SDK for scenario-based AI agent testing, aligned with the existing JS and Python SDKs.
Core SDK (30 commits)
- AgentAdapter interface
- UserSimulatorAgent and JudgeAgent with LLM-powered evaluation
- Script DSL: User(), Agent(), Judge(), Proceed(), Succeed(), Fail()
- Verbose and Metadata fields on ScenarioConfig (wired into runner + events)
API alignment with JS/Python
- AgentRole values changed to Title case ("Agent", "User", "Judge")
- LastAssistantMessage() renamed to LastAgentMessage()
- ScenarioResult uses MetCriteria/UnmetCriteria (matches JS)
Bug fix
- Fix swapped ToolMessage(content, toolCallID) arguments in OpenAI provider
Example test suite (go/examples/)
10 test files ported from the JS example suite:
- weather_agent_test.go: tool calling + HasToolCall() assertions
- vegetarian_recipe_agent_test.go: multi-turn with judge checkpoint criteria
- travel_agent_test.go: multi-tool agent, recursive execution
- false_assumptions_test.go: hardcoded messages + Proceed(WithProceedTurns, WithProceedOnTurn)
- grouping_scenarios_test.go: echo agent, SetID, Succeed()
- error_handling_test.go: agent error propagation
- simple_tool_mocking_test.go: mocked tool execution, parameter verification
- custom_judge_test.go: custom judge with direct LLM structured output
- multiturn_10_scripted_test.go: fully scripted 10-turn conversation
- mocked_weather_agent_tool_test.go: hardcoded tool call/result injection
CI/CD
- go-ci.yml: vet, test, provider checks, example tests with secrets
- go-publish.yml: verify + warm Go module proxy on release tag
- release-please: add go component (release-type: go, starting at v0.1.0)
Test plan
- go vet ./... clean on core SDK, examples, and providers