fix: eval plugin docs (#1814)
ssbushi authored Feb 4, 2025
1 parent dcacadc commit ed8439b
Showing 2 changed files with 30 additions and 34 deletions.
30 changes: 14 additions & 16 deletions docs/evaluation.md
@@ -42,8 +42,8 @@ This section explains how to perform inference-based evaluation using Genkit.

### Setup
<ol>
- <li>Use an existing Genkit app or create a new one by following our [Getting
- started](get-started) guide.</li>
+ <li>Use an existing Genkit app or create a new one by following our [Get
+ started](get-started.md) guide.</li>
<li>Add the following code to define a simple RAG application to evaluate. For
this guide, we use a dummy retriever that always returns the same documents.

@@ -52,7 +52,6 @@ import { genkit, z, Document } from "genkit";
import {
googleAI,
gemini15Flash,
- gemini15Pro,
} from "@genkit-ai/googleai";

// Initialize Genkit
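
The rest of this sample is collapsed in the diff view. For context, a minimal dummy retriever of the kind the guide describes might look like the sketch below; the retriever name and document texts are assumptions for illustration, not the file's exact code.

```ts
const dummyRetriever = ai.defineRetriever(
  { name: "dummyRetriever" },
  async () => ({
    // Always return the same documents, regardless of the query.
    documents: [
      "Dogs have four legs.",
      "Cats are independent animals.",
    ].map((text) => Document.fromText(text)),
  })
);
```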
@@ -163,7 +162,7 @@ to open the Datasets page.
c. Repeat steps (a) and (b) a couple more times to add more examples. This
guide adds the following example inputs to the dataset:

- ```
+ ```none {:.devsite-disable-click-to-copy}
"Can I give milk to my cats?"
"From which animals did dogs evolve?"
```
@@ -173,8 +172,8 @@
### Run evaluation and view results
- To start evaluating the flow, click the `Evaluations` tab in the Dev UI and
- click the **Run new evaluation** button to get started.
+ To start evaluating the flow, click the **Run new evaluation** button on your
+ dataset page. You can also start a new evaluation from the `Evaluations` tab.
1. Select the `Flow` radio button to evaluate a flow.
@@ -233,7 +232,7 @@ and is only enforced if a schema is specified on the target flow.
control for advanced use cases (e.g. providing model parameters, message
history, tools, etc). You can find the full schema for `GenerateRequest` in
our [API reference
- docs](https://js.api.genkit.dev/interfaces/genkit._.GenerateRequest.html).
+ docs](https://js.api.genkit.dev/interfaces/genkit._.GenerateRequest.html){: .external}.
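
As an aside for readers of this hunk: for model targets, each dataset example is a `GenerateRequest`-shaped object. A minimal hand-written entry might look like the following; the field values are illustrative, not taken from the docs.

```json
{
  "messages": [
    { "role": "user", "content": [{ "text": "Can I give milk to my cats?" }] }
  ],
  "config": { "temperature": 0.2 }
}
```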
Note: Schema validation is a helper tool for editing examples, but it is
possible to save an example with invalid schema. These examples may fail when
@@ -244,7 +243,7 @@ the running an evaluation.
### Genkit evaluators
Genkit includes a small number of native evaluators, inspired by
- [RAGAS](https://docs.ragas.io/en/stable/), to help you get started:
+ [RAGAS](https://docs.ragas.io/en/stable/){: .external}, to help you get started:
* Faithfulness -- Measures the factual consistency of the generated answer
against the given context
@@ -256,7 +255,7 @@ harm, or exploit
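
For readers wiring these up: the native evaluators ship in the `@genkit-ai/evaluator` plugin. A configuration sketch, assuming the `genkitEval` plugin and `GenkitMetric` enum from that package plus a Google AI judge model, could look like this:

```ts
import { genkit } from "genkit";
import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator";
import { googleAI, gemini15Flash, textEmbedding004 } from "@genkit-ai/googleai";

const ai = genkit({
  plugins: [
    googleAI(),
    genkitEval({
      judge: gemini15Flash, // LLM that scores each response
      metrics: [GenkitMetric.MALICIOUSNESS, GenkitMetric.ANSWER_RELEVANCY],
      embedder: textEmbedding004, // answer relevancy needs an embedder
    }),
  ],
});
```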
### Evaluator plugins
Genkit supports additional evaluators through plugins, like the Vertex Rapid
- Evaluators, which you access via the [VertexAI
+ Evaluators, which you can access via the [VertexAI
Plugin](./plugins/vertex-ai#evaluators).
## Advanced use
@@ -316,7 +315,7 @@ for evaluation. To run on a subset of the configured evaluators, use the
`--evaluators` flag and provide a comma-separated list of evaluators by name:

```posix-terminal
- genkit eval:flow qaFlow --input testInputs.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
+ genkit eval:flow qaFlow --input testInputs.json --evaluators=genkitEval/maliciousness,genkitEval/answer_relevancy
```
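
Here `testInputs.json` is assumed to be a plain JSON array of flow inputs, for example:

```json
[
  "Can I give milk to my cats?",
  "From which animals did dogs evolve?"
]
```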
You can view the results of your evaluation run in the Dev UI at
`localhost:4000/evaluate`.
@@ -393,9 +392,8 @@ export const qaFlow = ai.defineFlow({
const factDocs = await ai.retrieve({
retriever: dummyRetriever,
query,
- options: { k: 2 },
});
- const factDocsModified = await run('factModified', async () => {
+ const factDocsModified = await ai.run('factModified', async () => {
// Let us use only facts that are considered silly. This is a
// hypothetical step for demo purposes, you may perform any
// arbitrary task inside a step and reference it in custom
@@ -408,7 +406,7 @@ export const qaFlow = ai.defineFlow({
const llmResponse = await ai.generate({
model: gemini15Flash,
prompt: `Answer this question with the given context ${query}`,
- docs: factDocs,
+ docs: factDocsModified,
});
return llmResponse.text;
}
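
The named `factModified` step matters because the collapsed text around this hunk covers custom extractors. As a sketch, assuming the `genkit-tools.conf.js` format described there, an extractor can point the `context` field at that step's output:

```js
// genkit-tools.conf.js (sketch; field names assume the custom-extractor
// config format described in the surrounding guide)
module.exports = {
  evaluators: [
    {
      actionRef: "/flow/qaFlow",
      extractors: {
        context: { outputOf: "factModified" },
      },
    },
  ],
};
```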
@@ -482,7 +480,7 @@ Here is an example flow that uses a PDF file to generate potential user
questions.

```ts
- import { genkit, run, z } from "genkit";
+ import { genkit, z } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";
import { chunk } from "llm-chunk"; // npm i llm-chunk
import path from "path";
@@ -515,9 +513,9 @@ export const synthesizeQuestions = ai.defineFlow(
async (filePath) => {
filePath = path.resolve(filePath);
// `extractText` loads the PDF and extracts its contents as text.
- const pdfTxt = await run("extract-text", () => extractText(filePath));
+ const pdfTxt = await ai.run("extract-text", () => extractText(filePath));

- const chunks = await run("chunk-it", async () =>
+ const chunks = await ai.run("chunk-it", async () =>
chunk(pdfTxt, chunkingConfig)
);
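
Once defined, such a flow can be run from the CLI to synthesize an input file for later evaluation. A plausible invocation, assuming the standard `genkit flow:run` interface (the PDF path and output file name are placeholders):

```posix-terminal
genkit flow:run synthesizeQuestions '"docs/my-handbook.pdf"' --output testQuestions.json
```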

34 changes: 16 additions & 18 deletions docs/plugin-authoring-evaluator.md
@@ -61,23 +61,22 @@ function getDeliciousnessPrompt(ai: Genkit) {
output: {
schema: DeliciousnessDetectionResponseSchema,
- }
},
- `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.
+ prompt: `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.
- Examples:
- Output: Chicken parm sandwich
- Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }
+ Examples:
+ Output: Chicken parm sandwich
+ Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }
- Output: Boston Logan Airport tarmac
- Response: { "reason": "Not edible.", "verdict": "no" }
+ Output: Boston Logan Airport tarmac
+ Response: { "reason": "Not edible.", "verdict": "no" }
- Output: A juicy piece of gossip
- Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }
+ Output: A juicy piece of gossip
+ Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }
- New Output: {% verbatim %}{{ responseToTest }} {% endverbatim %}
- Response:
- `
- );
+ New Output: {% verbatim %}{{ responseToTest }} {% endverbatim %}
+ Response:
+ `
+ });
}
```
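
The `DeliciousnessDetectionResponseSchema` referenced above is defined in a collapsed part of this file; a sketch consistent with the prompt's three verdicts would be:

```ts
import { z } from "genkit";

// Structured verdict the judge LLM must return (sketch, not the file's
// exact definition).
const DeliciousnessDetectionResponseSchema = z.object({
  reason: z.string(),
  verdict: z.enum(["yes", "no", "maybe"]),
});
```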

@@ -91,7 +90,7 @@ responsibility of the evaluator to validate that all fields required for
evaluation are present.

```ts
- import { ModelArgument, z } from 'genkit';
+ import { ModelArgument } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
@@ -100,6 +99,7 @@ import { BaseEvalDataPoint, Score } from 'genkit/evaluator';
export async function deliciousnessScore<
CustomModelOptions extends z.ZodTypeAny,
>(
+ ai: Genkit,
judgeLlm: ModelArgument<CustomModelOptions>,
dataPoint: BaseEvalDataPoint,
judgeConfig?: CustomModelOptions
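
The body of `deliciousnessScore` is collapsed here. In outline, hedged as a sketch rather than the file's exact code, it renders the prompt, calls the judge model, and maps the structured verdict to a `Score`:

```ts
// Sketch of the collapsed function body.
if (!dataPoint.output) {
  throw new Error("Output is required for deliciousness detection");
}
const deliciousnessPrompt = getDeliciousnessPrompt(ai);
// Run the prompt against the judge LLM with any caller-supplied config.
const response = await deliciousnessPrompt(
  { responseToTest: dataPoint.output as string },
  { model: judgeLlm, config: judgeConfig }
);
const parsed = response.output;
if (!parsed) {
  throw new Error("Judge returned no structured output");
}
return {
  score: parsed.verdict,
  details: { reasoning: parsed.reason },
};
```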
@@ -141,8 +141,7 @@ export async function deliciousnessScore<
The final step is to write a function that defines the `EvaluatorAction`.

```ts
- import { Genkit, z } from 'genkit';
- import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';
+ import { EvaluatorAction } from 'genkit/evaluator';

/**
* Create the Deliciousness evaluator action.
@@ -162,7 +161,7 @@ export function createDeliciousnessEvaluator<
isBilled: true,
},
async (datapoint: BaseEvalDataPoint) => {
- const score = await deliciousnessScore(judge, datapoint, judgeConfig);
+ const score = await deliciousnessScore(ai, judge, datapoint, judgeConfig);
return {
testCaseId: datapoint.testCaseId,
evaluation: score,
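
To make the evaluator visible to the Dev UI and CLI, it is typically registered through a plugin. A sketch, assuming `genkitPlugin` from `genkit/plugin` and a Gemini judge; the plugin name and factory arguments are illustrative:

```ts
import { Genkit } from "genkit";
import { genkitPlugin } from "genkit/plugin";
import { gemini15Flash } from "@genkit-ai/googleai";

// Sketch: expose the custom evaluator via a plugin so evaluation tooling
// can discover it. Argument shape of createDeliciousnessEvaluator is
// assumed from the hunk above (ai, judge, optional judge config).
export function deliciousness() {
  return genkitPlugin("deliciousness", async (ai: Genkit) => {
    createDeliciousnessEvaluator(ai, gemini15Flash);
  });
}
```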
@@ -245,7 +244,6 @@ As with the LLM-based evaluator, define the scoring function. In this case,
the scoring function does not need a judge LLM.

```ts
- import { EvalResponses } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
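
The regex value and the rest of this section are collapsed. A heuristic scoring function built on `US_PHONE_REGEX` needs no judge LLM; a sketch consistent with the `Score` type used above:

```ts
// Sketch: deterministic scorer that checks whether the output contains a
// US phone number. No LLM involved.
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const output = dataPoint.output as string;
  if (!output || output.trim() === "") {
    throw new Error("No output provided");
  }
  const matches = US_PHONE_REGEX.test(output);
  return {
    score: matches,
    details: {
      reasoning: matches
        ? "Output matched US_PHONE_REGEX"
        : "Output did not match US_PHONE_REGEX",
    },
  };
}
```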
