Changelog

New features, improvements, and fixes in Agenta.

17 November 2025 (v0.62.3)

Jinja2 Template Support in the Playground

You can now use Jinja2 templates in your prompts. Jinja2 is available in both the Playground and in prompt management.
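To give a sense of what Jinja2 adds over plain variable substitution, here is a minimal sketch that renders a prompt locally with the `jinja2` Python library. The variable names (`product`, `tone`, `questions`) are made up for illustration, and the snippet does not show how Agenta itself stores or renders templates.

```python
from jinja2 import Template

# Hypothetical prompt template: the variable names (product, tone, questions)
# are illustrative and not taken from Agenta.
prompt_template = Template(
    "You are a support assistant for {{ product }}.\n"
    "{% if tone == 'formal' %}Use a formal tone.{% else %}Keep it casual.{% endif %}\n"
    "Answer the following questions:\n"
    "{% for q in questions %}- {{ q }}\n{% endfor %}"
)

# Render the template with concrete values, as a prompt would be filled in
# before being sent to a model.
rendered = prompt_template.render(
    product="Agenta",
    tone="formal",
    questions=["How do I deploy a prompt?", "Where can I see my traces?"],
)
print(rendered)
```

Conditionals and loops like these are the main reason to reach for Jinja2 when a prompt needs to change shape depending on its inputs.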

Learn more in our blog post or check the documentation.

13 November 2025

Agenta Core is Now Open Source

We're open sourcing the core of Agenta under the MIT license. All functional features are now available to the community. This includes the evaluation system, prompt playground and management, observability, and all core workflows.

Development moves back to the public repository. We're building in public again. Only enterprise collaboration features like RBAC, SSO, and audit logs remain under a separate license.

Get started with the self-hosting guide. View the code and contribute on GitHub. Read why we made this decision at agenta.ai/blog/commercial-open-source-is-hard-our-journey.

12 November 2025 (v0.62.0)

Evaluation SDK

You can now run programmatic evaluations of complex AI agents and workflows directly from code. The Evaluation SDK gives you full control over test data and evaluation logic. It works with agents built using any framework.

The SDK lets you create test sets in code or fetch them from Agenta. You can use built-in evaluators like LLM-as-a-Judge, semantic similarity, or regex matching. You can also write custom Python evaluators. The SDK evaluates end-to-end workflows or specific spans in execution traces. Evaluations run on your own infrastructure; results display in the Agenta dashboard.
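The flow described above (a test set defined in code, an application under test, and a custom Python evaluator) can be sketched in plain Python. This is an illustration of the idea only, not the Agenta SDK API: `run_agent`, `regex_match_evaluator`, and `evaluate` are hypothetical names, and the agent is a stub that returns canned answers.

```python
import re

# Hypothetical test set created in code: each case pairs an input with a
# regex pattern the output is expected to match.
test_set = [
    {"input": "What is the capital of France?", "expected_pattern": r"\bParis\b"},
    {"input": "What is 2 + 2?", "expected_pattern": r"\b4\b"},
]


def run_agent(question: str) -> str:
    """Stand-in for your agent or workflow, built with any framework."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What is 2 + 2?": "I am not sure.",
    }
    return canned.get(question, "")


def regex_match_evaluator(output: str, case: dict) -> bool:
    """Custom evaluator: passes when the output matches the expected pattern."""
    return re.search(case["expected_pattern"], output) is not None


def evaluate(cases, app, evaluator):
    """Run the app on every case and record the evaluator's verdict."""
    results = []
    for case in cases:
        output = app(case["input"])
        results.append(
            {"input": case["input"], "output": output, "passed": evaluator(output, case)}
        )
    return results


results = evaluate(test_set, run_agent, regex_match_evaluator)
pass_rate = sum(r["passed"] for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```

Refer to the Evaluation SDK documentation for the actual function names and for how results are sent to the dashboard.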

Check out the Evaluation SDK documentation to get started.

11 November 2025 (v0.62.0)

Online Evaluation

You can now automatically evaluate every request to your LLM application in production. Online Evaluation helps you catch hallucinations and off-brand responses as they happen. You no longer need to discover problems through user complaints.

You can configure evaluators like LLM-as-a-Judge with custom prompts. Set sampling rates to control costs. Create evaluations with filters for specific spans in your traces. All evaluated requests appear in one dashboard. You can filter traces by evaluation scores to understand issues. You can also add problematic cases to test sets for continuous improvement.
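As a rough illustration of how a sampling rate and a span filter combine to limit evaluation cost, consider the sketch below. It is a conceptual model, not Agenta's implementation: the span dictionaries, `matches_filter`, and `should_evaluate` are hypothetical.

```python
import random


def should_evaluate(sampling_rate: float, rng: random.Random) -> bool:
    """Return True for roughly `sampling_rate` of calls (rate between 0.0 and 1.0)."""
    return rng.random() < sampling_rate


def matches_filter(span: dict, span_name: str) -> bool:
    """Hypothetical filter: only spans with the given name are candidates."""
    return span.get("name") == span_name


# Simulated production traffic: 1000 generation spans and 1000 retrieval spans.
spans = [{"name": "generate_answer", "id": i} for i in range(1000)]
spans += [{"name": "retrieve", "id": i} for i in range(1000)]

rng = random.Random(0)  # seeded so the sketch is reproducible
selected = [
    s for s in spans
    if matches_filter(s, "generate_answer") and should_evaluate(0.1, rng)
]
print(f"{len(selected)} of {len(spans)} spans would be sent to the evaluator")
```

With a 10% rate applied only to the filtered spans, the evaluator runs on a small fraction of total traffic, which is what keeps LLM-as-a-Judge costs predictable.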

Setting up online evaluation takes just a couple of minutes. It provides immediate visibility into production quality.

10 November 2025 (v0.62.0)

Customize LLM-as-a-Judge Output Schemas

The LLM-as-a-Judge evaluator now supports custom output schemas. Create multiple feedback outputs per evaluator with any structure you need.

You can configure output types (binary, multiclass), include reasoning to improve prediction quality, or provide a raw JSON schema with any structure you define. Use these custom schemas in your evaluations to capture exactly the feedback you need.
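For a sense of what a raw JSON schema with several feedback outputs might look like, here is a sketch combining a reasoning field, a binary output, and a multiclass output. The field names (`reasoning`, `is_correct`, `tone`) and the class labels are assumptions made for this example. The `conforms` helper is a deliberately small check for the keywords used here, not a full JSON Schema validator and not part of Agenta.

```python
import json

# Hypothetical output schema: reasoning text, a binary verdict, and a
# multiclass label. All field names and labels are illustrative.
judge_output_schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},
        "is_correct": {"type": "boolean"},
        "tone": {"type": "string", "enum": ["on_brand", "neutral", "off_brand"]},
    },
    "required": ["reasoning", "is_correct", "tone"],
    "additionalProperties": False,
}


def conforms(output: dict, schema: dict) -> bool:
    """Check an output against the subset of JSON Schema keywords used above."""
    props = schema["properties"]
    if any(key not in output for key in schema["required"]):
        return False
    if not schema.get("additionalProperties", True) and any(k not in props for k in output):
        return False
    type_map = {"string": str, "boolean": bool}
    for key, value in output.items():
        spec = props.get(key)
        if spec is None:
            continue  # extra key, allowed only when additionalProperties is true
        if not isinstance(value, type_map[spec["type"]]):
            return False
        if "enum" in spec and value not in spec["enum"]:
            return False
    return True


# Example judge response, parsed from JSON as it would arrive from a model.
sample = json.loads(
    '{"reasoning": "The answer cites the refund policy correctly.",'
    ' "is_correct": true, "tone": "on_brand"}'
)
print(conforms(sample, judge_output_schema))
```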

Learn more in the LLM-as-a-Judge documentation.

3 November 2025 (v0.59.10)

Documentation Architecture Overhaul

We've restructured our documentation around a new architecture. This is one of the largest documentation updates we've made, involving a near-complete rewrite of existing content.

Key improvements include:

Diataxis Framework: Organized content into Tutorials, How-to Guides, Reference, and Explanation sections for better discoverability
Expanded Observability Docs: Added missing documentation for tracing, annotations, and observability features
JavaScript/TypeScript Support: Added code examples and documentation for JavaScript developers alongside Python
Ask AI Feature: Ask the documentation questions directly and get instant answers

24 October 2025 (v0.58.1)

Vertex AI Provider Support

We've added support for Google Cloud's Vertex AI platform. You can now use Gemini models and other Vertex AI partner models in the Playground, configure them in the Model Hub, and access them through the Gateway using invoke endpoints.

Check out the documentation for configuring Vertex AI models.

14 October 2025 (v0.58.0)

Filtering Traces by Annotation

You can now filter and search traces based on their annotations. This helps you find traces with low scores or bad feedback quickly.

We rebuilt the filtering system in observability with a simpler dropdown and more options. You can now filter by span status, input keys, app or environment references, and any key within your span.

The new annotation filtering lets you find:

Spans evaluated by a specific evaluator
Spans with user feedback like success=True

This enables powerful workflows: capture user feedback from your app, filter to find traces with bad feedback, add them to test sets, and improve your prompts based on real user data.
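The feedback-to-test-set workflow can be pictured with a short sketch. The span structure, the `user_feedback` evaluator name, and the `has_annotation` helper are all hypothetical. In Agenta the filtering happens in the observability UI, so this only models the logic.

```python
# Hypothetical spans with annotations attached (structure is illustrative).
spans = [
    {
        "span_id": "a1",
        "inputs": {"question": "How do I reset my password?"},
        "outputs": "Click 'Forgot password' on the login page.",
        "annotations": [{"evaluator": "user_feedback", "data": {"success": True}}],
    },
    {
        "span_id": "b2",
        "inputs": {"question": "Can I export my data?"},
        "outputs": "I don't know.",
        "annotations": [{"evaluator": "user_feedback", "data": {"success": False}}],
    },
    {
        "span_id": "c3",
        "inputs": {"question": "What plans do you offer?"},
        "outputs": "We offer free and paid plans.",
        "annotations": [],
    },
]


def has_annotation(span: dict, evaluator: str, **expected) -> bool:
    """True if the span has an annotation from `evaluator` whose data matches `expected`."""
    for annotation in span.get("annotations", []):
        if annotation["evaluator"] != evaluator:
            continue
        if all(annotation["data"].get(k) == v for k, v in expected.items()):
            return True
    return False


# Step 1: find spans where users reported a failure.
bad_spans = [s for s in spans if has_annotation(s, "user_feedback", success=False)]

# Step 2: turn them into test cases for the next round of prompt iteration.
new_test_cases = [
    {"inputs": s["inputs"], "observed_output": s["outputs"]} for s in bad_spans
]
print(new_test_cases)
```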

26 September 2025 (v0.54.0)

New Evaluation Results Dashboard

We've completely redesigned the evaluation results dashboard. You can analyse your evaluation results more easily and understand performance across different metrics.

Here's what's new:

Metrics plots: We've added plots for all the evaluator metrics. You can now see the distribution of the results and easily spot outliers.
Side-by-side comparison: You can now compare multiple evaluations simultaneously. You can compare both the plots and the individual outputs.
Improved test cases view: The results are now displayed in a tabular format that works for both small and large datasets.
Focused detail view: A new focused drawer lets you examine individual data points in more detail. This is especially helpful when your data is large.
Configuration view: See exactly which configurations were used in each evaluation.
Evaluation run naming and descriptions: Add names and descriptions to your evaluation runs to keep them organized.

24 September 2025 (v0.53.0)

Deep URL Support for Shareable Links

URLs across Agenta now include workspace context, making them fully shareable between team members. Previously, URLs would always point to the default workspace, causing issues when refreshing pages or sharing links.

Now you can deep link to almost anything in the platform (prompts, evaluations, and more) in any workspace. Share links directly with team members and they'll see exactly what you intended, regardless of their default workspace settings.