This is an example repo showing how to build a web3 transaction-prompt evaluator with LangSmith. It uses an LLM to judge the result of each agent eval.
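For orientation, here's a minimal sketch of what an LLM-as-judge evaluator looks like with the LangSmith Python SDK. The dataset name `tx-prompts`, the `my_agent` stub, the judge model, and the grading prompt are all placeholders, not this repo's actual code:

```python
# Sketch of an LLM-as-judge eval; my_agent, model, and prompts are assumptions.
from langsmith.evaluation import evaluate
from openai import OpenAI

judge = OpenAI()

JUDGE_SYSTEM = (
    "You grade a web3 transaction agent. Given the user's prompt and the agent's "
    "output, reply with exactly '1' if the output fulfills the prompt, else '0'."
)

def my_agent(inputs: dict) -> dict:
    # Placeholder for this repo's actual agent: takes {"prompt": ...}, returns {"tx": ...}.
    return {"tx": "0x..."}

def llm_judge(run, example):
    # Custom LangSmith evaluator: one judge call per agent run.
    resp = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM},
            {"role": "user", "content": f"Prompt: {example.inputs}\nOutput: {run.outputs}"},
        ],
    )
    verdict = resp.choices[0].message.content.strip()
    return {"key": "tx_correctness", "score": 1 if verdict.startswith("1") else 0}

evaluate(
    my_agent,
    data="tx-prompts",      # LangSmith dataset of synthetic prompts
    evaluators=[llm_judge],
)
```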
There are some bugs and areas for improvement, but it works.
Make sure to add your model API keys and LangSmith secrets to a `.env` file. If you want to use this as-is, you'll also need a matsnet private key.
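For example, a `.env` along these lines (the `OPENAI_API_KEY` / `LANGSMITH_*` variables are standard; `PRIVATE_KEY` is a guess at what this repo expects):

```dotenv
OPENAI_API_KEY=...     # or whichever model provider you use
LANGSMITH_API_KEY=...
LANGSMITH_TRACING=true
PRIVATE_KEY=...        # wallet key the agent signs txs with (name is a guess)
```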
It's also an example of how to use synthetic data for evals: all of the evaluated prompts are LLM-generated. I only made 10, but you could do 1,000.
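A sketch of that generation step, assuming OpenAI as the generator and an arbitrary dataset name (`tx-prompts`); bump `n` to scale the dataset up:

```python
# Generate synthetic prompts with an LLM and upload them as a LangSmith dataset.
from langsmith import Client
from openai import OpenAI

ls_client = Client()
llm = OpenAI()

n = 10  # number of synthetic prompts to generate
resp = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": f"Write {n} distinct user requests asking an agent to send "
                   "an on-chain transaction. One per line, no numbering.",
    }],
)
prompts = [line.strip() for line in resp.choices[0].message.content.splitlines() if line.strip()]

# Upload as a dataset the eval run can reference by name.
dataset = ls_client.create_dataset("tx-prompts")
ls_client.create_examples(
    inputs=[{"prompt": p} for p in prompts],
    dataset_id=dataset.id,
)
```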
There is probably room for improvement in the judge's system instructions.
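One possible direction is making the grading criteria explicit rather than open-ended, e.g. something like this (just a sketch, not the repo's current prompt):

```text
You grade a web3 transaction agent. Score 1 only if ALL of the following hold,
otherwise score 0:
- the recipient address matches the user's request
- the amount and token/asset match
- the target chain matches
Reply with exactly one digit: 1 or 0.
```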