# RAG Benchmarking

Flagship Article 15 tool. Plug in any RAG system (LangChain, LlamaIndex, or custom) and benchmark it against classic and agentic-era metrics: faithfulness, answer relevancy, retrieval precision, and four agentic metrics for multi-step agents. Measured faithfulness of 0.958 on the 50-sample golden dataset.
## Quick Start

```bash
pip install rag-benchmarking
```

```python
from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# Works with LangChain
result = my_chain.invoke({"query": "What is RAG?"})
sample = RagEval.from_langchain(result)

# Or any dict with question / contexts / answer
sample = {
    "question": "What is RAG?",
    "contexts": ["RAG stands for Retrieval-Augmented Generation."],
    "answer": "RAG combines retrieval with LLM generation.",
}

report = client.evaluate([sample], metrics=["faithfulness", "answer_relevancy"])
print(report["metrics"])
# {"faithfulness": 0.95, "answer_relevancy": 0.81}
```

## Benchmark Results
Measured on the 50-sample golden dataset using gemini-2.5-flash as judge at temperature=0.0.
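Faithfulness here is an LLM-as-judge metric: roughly, the fraction of claims in the answer that are supported by the retrieved contexts. As an illustration of the idea only (not this tool's implementation, which uses the configured judge model), here is a naive lexical approximation:

```python
import re

def _tokens(text: str) -> list[str]:
    """Lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def naive_faithfulness(answer: str, contexts: list[str]) -> float:
    """Fraction of answer sentences whose content words mostly appear
    in the contexts. A crude stand-in for LLM-as-judge faithfulness,
    for illustration only."""
    context_words = set(_tokens(" ".join(contexts)))
    sentences = [s for s in re.split(r"[.!?]", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = [w for w in _tokens(sentence) if len(w) > 3]
        if words and sum(w in context_words for w in words) / len(words) >= 0.5:
            supported += 1
    return supported / len(sentences)
```

A real judge model evaluates semantic entailment rather than word overlap, which is why judge quality matters (see Known Limitations below).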
## Features
- Framework-agnostic — works with LangChain, LlamaIndex, or any custom RAG system
- Classic metrics: faithfulness, answer relevancy, context precision/recall
- Retrieval metrics: Precision@K, Recall@K, MRR, NDCG
- Agentic metrics: agent faithfulness, tool call accuracy, source attribution, retrieval necessity
- REST API + Python SDK with LangChain and LlamaIndex adapters
- Run history with comparison across configurations
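The retrieval metrics above follow their standard information-retrieval definitions over binary relevance judgments. A minimal self-contained sketch of those formulas (illustrative, not this library's code):

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(doc in relevant for doc in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return sum(doc in relevant for doc in retrieved[:k]) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Discounted cumulative gain at k, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```

In practice these are computed by `client.evaluate` when the corresponding metric names are requested.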
## EU AI Act Context
Provides systematic accuracy testing and documentation for high-risk AI systems under Article 15.
## Known Limitations
- Benchmark datasets are English-only; no multilingual evaluation support.
- Custom dataset integration requires manual formatting to the expected JSONL schema.
- Accuracy metrics only — latency and throughput are not measured.
- LLM-as-judge metrics depend on the configured judge model quality.
- Rate limiting is in-memory and resets on server restart.
For the most current status, see GitHub issues.
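Since custom datasets must be hand-formatted to the expected JSONL schema, a minimal sketch of producing such a file, assuming one JSON object per line with the same `question` / `contexts` / `answer` fields used in the Quick Start (the exact required schema may include additional fields; check the project docs):

```python
import json

samples = [
    {
        "question": "What is RAG?",
        "contexts": ["RAG stands for Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
    },
]

# JSONL: one JSON object per line, no enclosing array.
with open("golden_dataset.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```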
## Contributing
Contributions are welcome — Apache 2.0 licensed. See the contributing guide and open issues.
## License
Licensed under the Apache License 2.0.
## The Compound Moat
One tool is a start. The chain is the moat.
Each AiExponent tool produces structured evidence the next tool consumes. Browse the full toolchain — from Article 5 screening through Article 72 post-market monitoring.
See all tools →