Evaluation
In Epsilla, AI agent evaluation is designed as a continuous performance assessment framework for testing and improving agent response quality over time. The evaluation system runs predefined scenarios that simulate real-world interactions, allowing AI agent builders and operations teams to monitor how agents perform across a variety of situations. The evaluation process uses large language models (LLMs) to compare AI-generated responses against human-labeled answers, scoring them on metrics such as accuracy, relevance, and coverage.
This approach is conceptually similar to Continuous Integration/Continuous Delivery (CI/CD) practices, where the goal is to iteratively test and improve the system. It combines human-labeled answers with LLM-based scoring to provide ongoing feedback on agent performance, helping ensure that agents continue to meet a high quality bar as they are updated and refined over time.
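To make the scoring step concrete, the sketch below illustrates the general LLM-as-judge pattern behind this kind of evaluation: for each predefined scenario, the agent's response is sent to an LLM together with the human-labeled reference answer and a scoring rubric. This is not Epsilla's internal implementation; the scenario data, rubric wording, metric scale, and use of the OpenAI client are assumptions made purely for illustration.

```python
# Illustrative LLM-as-judge sketch -- not Epsilla's actual evaluation API.
# Assumes an OpenAI-compatible client; scenario data and rubric are made up.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the candidate answer against the reference answer on three metrics, "
    "each from 0 to 10: accuracy, relevance, coverage. "
    'Reply with JSON only, e.g. {"accuracy": 8, "relevance": 9, "coverage": 7}.'
)

def judge(question: str, reference: str, candidate: str) -> dict:
    """Ask an LLM to grade one agent response against its human-labeled answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n"
                    f"Reference answer: {reference}\n"
                    f"Candidate answer: {candidate}"
                ),
            },
        ],
        temperature=0,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# One predefined scenario: the question, the human-labeled answer,
# and the response produced by the AI agent under evaluation.
scores = judge(
    question="What is the evaluation feature for?",
    reference="It continuously tests agent responses against labeled answers.",
    candidate="It lets you run scenarios that score agent answers over time.",
)
print(scores)  # e.g. {"accuracy": 9, "relevance": 9, "coverage": 8}
```

Aggregating per-scenario scores like these across agent revisions is what produces the CI/CD-style feedback loop described above.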
On the navigation bar, click on the Evaluations tab.
This will lead you to the page where you can create and manage all your evaluations.