We’re excited to announce the official launch of Lens for LLMs – Citadel AI’s new product to evaluate and monitor LLM applications.
In the past several months, we’ve been working with our community of beta users (from startups to major enterprise companies) to expand the evaluation and monitoring capabilities of Lens, and integrate Lens into our users’ LLM development workflows.
In this post, we’ll highlight how Lens can help you evaluate, monitor, and improve your LLM applications, and share stories from our beta users. To learn more about Lens, check out the beta announcement or contact us at any time for a demo.
Automated Red Teaming for Safety and Security
Many beta users wanted to red team their LLM applications against safety and security risks. For example, a chatbot developer wanted to ensure that their chatbot didn’t respond with inappropriate messages even when exposed to various jailbreaks.
We’re excited to introduce Lens’ industry-leading safety and security evaluation for both English and Japanese. You can now use Lens to easily evaluate the safety of inputs and responses, detect jailbreaks with built-in and custom metrics, and simulate adversarial attacks on your LLM application.
Evaluating LLM Applications with Custom Metrics
Lens has an extensive suite of built-in metrics, allowing you to automatically evaluate factual consistency, answer relevancy, toxicity, jailbreak detection, and more.
In addition to these built-in metrics, you can now create custom, prompt-based metrics to have full control over the evaluation criteria. Our beta users have created custom metrics for off-topic detection, question classification, safety evaluation, and more.
We used Lens to create Custom Metrics based on our own evaluation criteria, including safety evaluation. It was easy to create these evaluation prompts by using the built-in prompt templates.
In addition, by using custom prompts, we were able to create evaluation metrics that would be hard to program otherwise, and obtained quantitative results for these metrics. Moving forward, I believe that these metrics will be important to enable quantitative decision-making for companies to evaluate various LLMs.
Satoru Katsumata, Researcher, Retrieva
Scaling Manual Evaluation
The built-in and custom automated metrics in Lens allow you to quickly run evaluations across large datasets of test cases.
In addition to automated evaluation, our beta users found it useful to occasionally conduct small-scale human evaluations to double check the automated evaluations, as well as to “vibe check” the style and tone of LLM responses.
Lens enables you to quickly collect a small set of human feedback (even from non-technical users), which are used to align the automated metrics to efficiently scale up the human feedback to the full dataset. For more details, see this blog post.
By creating Manual Metrics in Lens, we were able to clearly visualize the similarities and differences between automated evaluators and human evaluators. Lens was essential for enabling us to conduct in-depth evaluations of the output quality of a single LLM through a combination of human and automated evaluation.
Additionally, through the Pairwise Comparison feature, we were able to compare two different LLMs and determine which one is more user-friendly and provides more useful responses to our users.
Yuki Hamaguchi, Software Engineer, Suntory
Lens for LLMs helps improve the reliability of LLM system evaluations by combining automated and manual evaluations. In addition to a variety of built-in metrics, including hallucination detection, Lens also allows users to create their own Custom Metrics, so we believe it can be used flexibly for various use cases at different companies. We expect that the use of such tools will lead to safer use of LLM systems.
Naoki Kitou, Security Consultant, AI Security Business Development Team, NRI Secure
Performance Monitoring
New LLMs are being released every few days, and most of our beta users tried to improve their applications by upgrading their LLMs, system prompts, and RAG knowledge bases.
In order to measure if these changes are actually improvements, many beta users intentionally monitored the performance of their systems across many different versions.
In Lens, you can now monitor the performance over time of your LLM applications, allowing you to easily visualize progress on key metrics on a single page.
Monitoring the performance of different LLM versions is essential to ensure ongoing improvements in AI systems. Lens has been an invaluable tool for tracking key metrics and comparing versions, providing clear, actionable insights for optimizing both our own and our clients’ applications.
In particular, I find the performance monitoring feature a great way to see how changes affect the applications over time, making it easier to assess the impact of updates to LLM models, system prompts, and RAG knowledge bases.
Francisco Soares, CEO, Furious Green
If you’re interested to see how Lens for LLMs can help you evaluate, monitor, and improve your LLM applications, please contact us at any time.