RAG Benchmarking

Article 15 · Flagship · v1.0.0-rc1

Plug in any RAG system — LangChain, LlamaIndex, or custom — and benchmark it against classic and agentic-era metrics. Faithfulness, answer relevancy, retrieval precision, and four agentic metrics for multi-step agents. Measured faithfulness of 0.958 on the 50-sample golden dataset.

Quick Start

```bash
pip install rag-benchmarking
```

```python
from app.sdk.client import RagEval

client = RagEval(api_url="http://localhost:5001", api_key="your-key")

# Works with LangChain
result = my_chain.invoke({"query": "What is RAG?"})
sample = RagEval.from_langchain(result)

# Or any dict with question / contexts / answer
sample = {
    "question": "What is RAG?",
    "contexts": ["RAG stands for Retrieval-Augmented Generation."],
    "answer": "RAG combines retrieval with LLM generation.",
}

report = client.evaluate([sample], metrics=["faithfulness", "answer_relevancy"])
print(report["metrics"])
# {"faithfulness": 0.95, "answer_relevancy": 0.81}
```
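For batch evaluation, the same sample shape can be stored as JSONL, one JSON object per line. A minimal sketch, assuming only the `question` / `contexts` / `answer` fields shown above (the file name and round-trip check are illustrative, not part of the SDK):

```python
import json

# Samples use the question / contexts / answer fields from the Quick Start dict
samples = [
    {
        "question": "What is RAG?",
        "contexts": ["RAG stands for Retrieval-Augmented Generation."],
        "answer": "RAG combines retrieval with LLM generation.",
    },
]

# JSONL: one JSON object per line
with open("my_eval_set.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s) + "\n")

# Round-trip check: each line parses back into the original dict
with open("my_eval_set.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["question"])  # What is RAG?
```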

Benchmark Results

Measured on the 50-sample golden dataset using gemini-2.5-flash as judge at temperature=0.0.

  • faithfulness: 96% (Excellent)
  • answer_relevancy: 81% (Good)

Features

  • Framework-agnostic — works with LangChain, LlamaIndex, or any custom RAG system
  • Classic metrics: faithfulness, answer relevancy, context precision/recall
  • Retrieval metrics: Precision@K, Recall@K, MRR, NDCG
  • Agentic metrics: agent faithfulness, tool call accuracy, source attribution, retrieval necessity
  • REST API + Python SDK with LangChain and LlamaIndex adapters
  • Run history with comparison across configurations
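The retrieval metrics listed above have standard information-retrieval definitions. A minimal sketch of how they are computed for a binary-relevance ranking (textbook formulas, not RAG-Bench internals):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of the ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0

retrieved = ["d3", "d1", "d7", "d2"]   # ranked retriever output
relevant = {"d1", "d2"}                # gold relevant documents
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(mrr(retrieved, relevant))                # 0.5
```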

Regulatory Foundation

Article 15 · Accuracy, robustness and cybersecurity · Application date 2026-08-02 · Upcoming

Read the full pillar: EU AI Act Article 15 explainer →

What the regulation requires

1. High-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness, and cybersecurity, and that they perform consistently in those respects throughout their lifecycle.

3. The levels of accuracy and the relevant accuracy metrics of high-risk AI systems shall be declared in the accompanying instructions of use.

4. High-risk AI systems shall be as resilient as possible regarding errors, faults or inconsistencies that may occur within the system or the environment in which the system operates, in particular due to their interaction with natural persons or other systems. Technical and organisational measures shall be taken in this regard. The robustness of high-risk AI systems may be achieved through technical redundancy solutions, which may include backup or fail-safe plans.

15(1) · 15(3) · 15(4)

What you face if you don't comply

Article 15 becomes enforceable on 2 August 2026 for high-risk AI systems under Annex III. Providers must declare accuracy metrics in the instructions for use and demonstrate consistent performance across the lifecycle; non-compliance via the Article 16 provider-obligation chain is sanctionable up to €15M or 3% of global annual turnover under Article 99(4). For RAG-based high-risk systems, "appropriate accuracy" is not a self-asserted figure — it is a metric declared on the label and defensible against post-market evidence.

Up to €15M or 3% of global annual turnover
Article 99(4) · Penalties

How RAG Benchmarking addresses this

  • 15(1)Reproducible accuracy benchmarks for RAG pipelines (retrieval recall, answer faithfulness, citation precision) with versioned eval sets
  • 15(3)Generates the accuracy-metrics block for the Article 13 instructions for use, with confidence intervals and eval-set provenance
  • 15(4)Robustness suite: input perturbations, noisy-context, adversarial-passage, and OOD query stress tests with pass/fail thresholds
  • 15(4)Lifecycle drift monitoring — replays the declared eval set against the live system on a schedule and alerts on metric regression
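The drift-monitoring step above reduces to a threshold comparison between declared metrics and a fresh replay of the eval set. A minimal sketch, assuming hypothetical names (`DriftCheck`, `regressions`) that are not the tool's actual API:

```python
from dataclasses import dataclass

@dataclass
class DriftCheck:
    """Compare a fresh metric run against declared thresholds (illustrative)."""
    thresholds: dict  # metric name -> minimum acceptable score

    def regressions(self, latest: dict) -> dict:
        """Return the metrics that fell below their declared threshold."""
        return {m: score for m, score in latest.items()
                if m in self.thresholds and score < self.thresholds[m]}

# Thresholds declared in the instructions-for-use block (Article 15(3))
check = DriftCheck(thresholds={"faithfulness": 0.95, "answer_relevancy": 0.75})

# A scheduled replay of the declared eval set produced these live scores
latest = {"faithfulness": 0.91, "answer_relevancy": 0.80}

failing = check.regressions(latest)
if failing:
    print(f"ALERT: metric regression detected: {failing}")
```

In practice the replay would run on a schedule against the live system; the alert path (pager, dashboard, ticket) is deployment-specific.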

Source: eur-lex.europa.eu/…/CELEX:32024R1689

Frequently asked questions

Direct answers to common questions about RAG Benchmarking and Article 15. Regulatory citations reference EUR-Lex CELEX:32024R1689.

What does EU AI Act Article 15 require?
High-risk AI systems must achieve appropriate accuracy, robustness, and cybersecurity throughout their lifecycle. Accuracy metrics must be declared in the instructions for use (Article 15(3)), and the system must be resilient to errors, faults, and inconsistencies (Article 15(4)). Source: Regulation (EU) 2024/1689 Article 15(1), 15(3), 15(4).
When does Article 15 become enforceable?
Article 15 obligations for high-risk AI systems become enforceable on 2 August 2026, per Article 113. Source: Regulation (EU) 2024/1689 Article 113.
Does RAG-Bench cover the cybersecurity leg of Article 15?
No. RAG-Bench covers accuracy and robustness — measuring faithfulness, retrieval precision, agentic metrics, and adversarial-passage robustness. The cybersecurity leg (prompt injection resistance, jailbreak defence, model integrity) requires a runtime AI security control such as AgentShield. Pair the two for full Article 15 coverage.
What metrics does RAG-Bench measure?
Classic metrics (faithfulness, answer relevancy, context precision/recall), retrieval metrics (Precision@K, Recall@K, MRR, NDCG), and four agentic metrics (agent faithfulness, tool-call accuracy, source attribution, retrieval necessity).
Is RAG-Bench framework-agnostic?
Yes. RAG-Bench works with LangChain, LlamaIndex, or any custom RAG system that returns a sample with `question`, `contexts`, and `answer` fields. SDK adapters for LangChain and LlamaIndex are included; custom integrations use the JSONL schema directly.
What is the measured faithfulness on the golden dataset?
0.958 on the published 50-sample golden dataset (rated "Excellent"), with 0.810 answer relevancy ("Good"). These are the actual numbers from the v1.0.0-rc1 release benchmark — not aspirational targets.
Can I bring my own evaluation dataset?
Yes. RAG-Bench accepts custom datasets in JSONL format with the expected schema. The bundled golden dataset is English-only; multilingual evaluation is not supported in v1.0.
Is RAG-Bench free?
Yes. Apache 2.0 licensed. The harness itself runs locally; LLM-as-judge metrics depend on whichever judge model you configure (which may have its own usage cost).
What is the penalty for Article 15 non-compliance?
Up to €15M or 3% of global annual turnover, whichever is higher, under Article 99(4). The provider-obligation chain via Article 16 routes Article 15 failures through this penalty band.
How does drift monitoring work?
You declare an evaluation set version and a metric threshold. RAG-Bench replays the eval set against the live system on a schedule and alerts on metric regression — supporting the lifecycle-consistent-performance requirement of Article 15(1).

Known Limitations

  • Benchmark datasets are English-only; no multilingual evaluation support.
  • Custom dataset integration requires manual formatting to the expected JSONL schema.
  • Accuracy metrics only — latency and throughput are not measured.
  • LLM-as-judge metrics depend on the configured judge model quality.
  • Rate limiting is in-memory and resets on server restart.

For the most current status, see GitHub issues.

Contributing

Contributions are welcome — Apache 2.0 licensed. See the contributing guide and open issues.

License

Licensed under the Apache License 2.0. Not legal advice. Not a notified body.

The Compound Moat

One tool is a start. The chain is the moat.

Each AiExponent tool produces structured evidence the next tool consumes. Browse the full toolchain — from Article 5 screening through Article 72 post-market monitoring.

See all tools →