The Citadel AI team is excited to release LangCheck Studio – our free, public playground for experimenting with LLM-as-a-judge techniques.
Over the past two years, we’ve seen LLM-as-a-judge become an indispensable technique for ensuring the quality, safety, and security of LLM applications. It easily adapts to your business requirements and requires no coding expertise, enabling non-technical domain experts (e.g. doctors, lawyers, support agents) to collaborate with engineers to refine LLM behavior.
With LangCheck Studio, we wanted to make it easy to try the LLM-as-a-judge workflow in your browser, no code required. It showcases the core principle of Eval-Centric AI: LLM-as-a-judge validated against human judgement.
LangCheck Studio is built on our team’s experience developing LangCheck, the leading open-source library for multilingual LLM evaluation, and Lens for LLMs, our commercial product for evaluating, monitoring, and governing AI systems.
Evaluation Workflow in LangCheck Studio
Let’s walk through LangCheck Studio’s workflow for evaluating LLMs. First, select one or two LLMs that you’d like to evaluate on a task, such as answering customer support tickets.

Next, create a dataset of questions to evaluate both LLMs on. LangCheck Studio can automatically generate this dataset from a short description of your task.
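Under the hood, this kind of dataset generation is essentially one prompt to an LLM. Here's a rough sketch of the idea in Python using the OpenAI SDK; the prompt wording and model name are illustrative assumptions, not LangCheck Studio's actual implementation:

```python
# Hypothetical sketch: generate evaluation questions from a task description.
# The prompt and model choice are illustrative, not LangCheck Studio's internals.
from openai import OpenAI

client = OpenAI()

task_description = "Answer customer support tickets for an online clothing store."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any capable chat model works here
    messages=[{
        "role": "user",
        "content": (
            "Write 10 realistic user questions for the following task, "
            f"one per line, with no numbering:\n{task_description}"
        ),
    }],
)

# Split the model's output into a list of questions for the evaluation dataset
questions = [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]
print(questions)
```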

Then, customize the system prompt, which shapes the LLM's responses, and the judge prompt, which defines the judge's evaluation criteria (e.g. see the refund policy rules in the judge prompt). After that, just kick off the evaluations!
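Conceptually, each evaluation pairs the system prompt (which shapes the candidate LLM's answer) with a judge prompt that scores that answer against your criteria. The sketch below shows one way you could wire this up yourself; the prompts, model names, and PASS/FAIL output format are illustrative assumptions rather than LangCheck Studio's internals:

```python
# Hypothetical sketch of a single LLM-as-a-judge evaluation.
# Prompts, models, and the PASS/FAIL convention are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a support agent for an online clothing store. Be concise and polite."
JUDGE_PROMPT = (
    "You are evaluating a support agent's reply.\n"
    "Criteria: the reply must be polite, and must only promise refunds "
    "within 30 days of purchase.\n"
    "Answer PASS or FAIL, followed by a one-sentence reason.\n\n"
    "Question: {question}\nReply: {reply}"
)

def generate_reply(question: str, model: str) -> str:
    """Ask a candidate LLM to answer a customer question under the system prompt."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

def judge_reply(question: str, reply: str, judge_model: str = "gpt-4o") -> str:
    """Ask the judge LLM to score the reply against the evaluation criteria."""
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, reply=reply),
        }],
    )
    return resp.choices[0].message.content

# Compare two candidate LLMs on the same question
question = "Can I get a refund on a sweater I bought two months ago?"
for model in ["gpt-4o-mini", "gpt-4o"]:
    reply = generate_reply(question, model)
    print(model, "->", judge_reply(question, reply))
```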

On the evaluations page, you’ll see the responses from both LLMs side-by-side, and the judge’s automated evaluation of both responses.

Mark whether you agree or disagree with each of the judge's decisions, and finally, on the results page, measure how closely the LLM judge aligns with your preferences.
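One simple way to quantify that alignment is the agreement rate between the judge's verdicts and your own labels. Here's a minimal sketch with toy data (LangCheck Studio's exact scoring may differ):

```python
# Hypothetical sketch: measure how often the LLM judge agrees with human labels.
judge_verdicts = ["PASS", "FAIL", "PASS", "PASS", "FAIL"]  # from the judge LLM
human_verdicts = ["PASS", "FAIL", "FAIL", "PASS", "FAIL"]  # your agree/disagree choices

matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
agreement = matches / len(human_verdicts)
print(f"Judge-human agreement: {agreement:.0%}")  # 80% in this toy example
```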

That’s it! You’ve successfully completed one round of the Eval-Centric AI workflow.
Bring LLM-as-a-judge to your team
LangCheck Studio guides you through one round of the LLM-as-a-judge workflow. In a real application, you'd usually go back and iterate on the judge prompt until its decisions closely mirror your preferences.
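A practical way to drive that iteration is to pull out the cases where the judge and the human reviewer disagreed, then tighten the judge prompt's criteria based on them. A minimal sketch with hypothetical records:

```python
# Hypothetical sketch: surface judge-human disagreements to guide the next judge-prompt revision.
records = [
    {"question": "Can I get a refund after 45 days?", "judge": "PASS", "human": "FAIL"},
    {"question": "Do you ship internationally?", "judge": "PASS", "human": "PASS"},
]

disagreements = [r for r in records if r["judge"] != r["human"]]
for r in disagreements:
    print(f"Review: {r['question']} (judge said {r['judge']}, human said {r['human']})")
```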
To try out LLM-as-a-judge with your own team, we recommend Lens for LLMs, our commercial product that operationalizes the full Eval-Centric AI workflow, helping you ensure the quality, safety, and security of your LLM applications.

To learn more about Lens for LLMs, check out this demo video, or contact Citadel AI for a demo and trial.