About the Impunity Index

An NLP-powered research tool analyzing publicly released Epstein case documents to surface patterns of evidence density versus legal accountability.

Why We Created This

Plenty of incredible work has been done mapping who shows up in the Epstein files. Journalists, researchers, and open-source communities have built searchable archives, entity graphs, and document indexes. But nobody had built a way for people to actually measure the gap between evidence and accountability. That is what the Impunity Index does.

It takes the documentary footprint of every named individual in the corpus and cross-references it against whether they ever faced real consequences. The result is a single, corpus-derived metric that quantifies impunity: high evidence plus low accountability equals a high impunity score. We built this because the data was public but the pattern was not visible. We wanted to make it visible.

What is the Impunity Index?

The Impunity Index is an NLP-powered tool that analyzes publicly available Epstein court documents and government records to surface patterns of evidence density versus legal accountability.

Each individual in the dataset receives an Evidence Index (0–10) based on how frequently and severely they appear across the document corpus. This is multiplied by a Consequence Modifier based on whether they faced legal consequences, producing a final Impunity Index score.

The goal is simple: make it easier to see who had the most evidence against them and whether anything happened as a result. The gap between evidence and consequences is what we call impunity.
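The core relationship can be sketched in a few lines. The ×0.7 conviction modifier comes from the worked Epstein example below; the Tier 0 value of 1.0 is an assumption, and displayed scores may be further normalized for presentation, so treat this as the conceptual core rather than the production formula.

```python
# Sketch of the core Impunity Index formula. Only the Tier 2 (convicted)
# modifier of 0.7 is stated on this page; the other tier values here are
# illustrative placeholders.
CONSEQUENCE_MODIFIERS = {
    0: 1.0,  # Tier 0: no consequence -> evidence translates directly into impunity
    2: 0.7,  # Tier 2: convicted -> consequence pulls the score down
}

def impunity_score(evidence_index: float, tier: int) -> float:
    """Evidence Index (0-10) scaled by the consequence modifier for the tier."""
    return evidence_index * CONSEQUENCE_MODIFIERS[tier]

print(impunity_score(9.5, 2))  # roughly 6.65, in line with Epstein's 6.7
print(impunity_score(9.4, 0))  # 9.4: no reduction for Tier 0
```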

How to Read the Impunity Index

The Impunity Index does not measure guilt. It measures the gap between documentary evidence and legal accountability. A high score means the evidence outpaced the consequences. Here is how that works in practice.

Jeffrey Epstein: Impunity Index 6.7
Evidence: 9.5 / 10 | Tier 2: Convicted

Epstein has one of the highest evidence footprints in the entire corpus. But he was arrested, charged, and died in federal custody. The conviction modifier (×0.7) pulls his score down significantly. His score is not zero because the evidence density is extremely high, but the consequence meaningfully reduces his impunity.

Bill Clinton: Impunity Index 10.0
Evidence: 9.4 / 10 | Tier 0: No consequence

Clinton appears frequently in the documents with significant evidence signals: 26+ documented flight legs, black book entry, and high keyword co-occurrence. He has faced no criminal charges or legal consequences related to the Epstein case. With no consequence modifier reduction, his raw evidence score translates almost directly into impunity.

It is counterintuitive that Epstein's score is lower than some of his associates'. That is exactly the point: the Impunity Index measures the gap between evidence and consequences, not guilt.

How It Works

1. Document Ingestion

We process over 1.4 million pages from DOJ/EFTA releases, court filings, depositions, and publicly available Epstein case documents. Text is extracted programmatically from PDFs and organized into a searchable corpus.

2. Entity Extraction

Named Entity Recognition (spaCy) identifies mentions of individuals across the document corpus. Each person is linked to the specific documents where they appear, along with contextual information about the nature of each mention.
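The linking step can be illustrated with a simplified stand-in. The real pipeline uses spaCy's statistical NER models rather than the exact string matching shown here; the document IDs and text are hypothetical.

```python
# Simplified stand-in for the spaCy NER linking step: map each person to
# the documents where their name appears. The production pipeline uses
# statistical NER, not exact string matching.
from collections import defaultdict

def link_entities(documents: dict[str, str], people: list[str]) -> dict[str, list[str]]:
    """Map each person to the IDs of documents that mention them."""
    mentions = defaultdict(list)
    for doc_id, text in documents.items():
        lowered = text.lower()
        for person in people:
            if person.lower() in lowered:
                mentions[person].append(doc_id)
    return dict(mentions)

docs = {
    "EFTA-001": "Deposition of Jane Doe mentioning Jeffrey Epstein.",
    "EFTA-002": "Flight log listing Jeffrey Epstein among the passengers.",
}
print(link_entities(docs, ["Jeffrey Epstein"]))
# {'Jeffrey Epstein': ['EFTA-001', 'EFTA-002']}
```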

3. Classification

Three model families produce independent evidence signals: Logistic Regression on tabular NLP features, Random Forest with TF-IDF text features, and Sentence Transformers (all-MiniLM-L6-v2) combined with Support Vector Classification for semantic analysis. Legal-BERT was evaluated but failed due to insufficient training data — that failure is documented as a finding.
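The page says the ML Evidence Signals represent consensus across the three model families but does not specify how consensus is computed. A simple majority vote over per-model flags is one plausible scheme, shown here purely for illustration; the model names in comments are the families listed above.

```python
# Hedged sketch of consensus across three model families: a signal counts
# only if a majority of models agree. The actual consensus rule is not
# documented on this page.
def consensus(signals: dict[str, bool]) -> bool:
    """True if a majority of models flag the evidence signal."""
    votes = sum(signals.values())
    return votes > len(signals) / 2

model_flags = {
    "logreg_tabular": True,   # Logistic Regression on tabular NLP features
    "rf_tfidf": True,         # Random Forest with TF-IDF text features
    "minilm_svc": False,      # MiniLM embeddings + SVC semantic classifier
}
print(consensus(model_flags))  # True: 2 of 3 models agree
```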

4. Scoring

The Evidence Index combines six features using log-scaled, percentile-capped normalization: email/EFTA document count, DOJ corpus mentions, keyword co-occurrence with incriminating terms, flight log entries, person-to-person connections, and black book presence. The ML Evidence Signals shown on each profile represent consensus across three model families, not a single model’s output.
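A hedged sketch of the normalization described above, applied to one raw feature (e.g. DOJ corpus mention counts). The 95th-percentile cap and the 0–10 rescaling are assumptions; the page states neither the exact percentile nor how the six normalized features are weighted when combined.

```python
# Log-scaled, percentile-capped normalization of one raw feature,
# assuming a 95th-percentile cap and a 0-10 output range.
import math

def normalize(values: list[float], cap_pct: float = 95.0) -> list[float]:
    """Log-scale raw counts, cap extreme values at a percentile, rescale to 0-10."""
    logged = [math.log1p(v) for v in values]               # compress heavy tails
    ranked = sorted(logged)
    cap = ranked[min(len(ranked) - 1, int(len(ranked) * cap_pct / 100))]
    capped = [min(v, cap) for v in logged]                 # percentile cap
    top = max(capped) or 1.0                               # avoid divide-by-zero
    return [10 * v / top for v in capped]                  # scale to 0-10

# Raw mention counts for five hypothetical individuals: the outlier (5000)
# no longer dwarfs everyone else after log scaling.
print(normalize([0, 3, 12, 120, 5000]))
```

The log scaling is what keeps one individual with thousands of mentions from flattening everyone else's scores toward zero, which is the "avoid arbitrary score inflation" property mentioned in the ethics statement.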

5. Semantic Relevance

Documents are scored for relevance using semantic similarity (sentence-transformer embeddings), not just keyword matching. A document containing detailed allegations scores higher than a passing name mention. Document summaries shown on profiles are extractive (first sentences of the source document) and should be verified against original source documents.
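The relevance idea reduces to cosine similarity between embedding vectors. The real pipeline embeds text with all-MiniLM-L6-v2; the tiny hand-made vectors below are stand-ins for illustration only.

```python
# Sketch of semantic relevance scoring via cosine similarity. The 3-d
# vectors here are toy stand-ins for 384-d sentence-transformer embeddings.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

query = [0.9, 0.1, 0.4]                  # embedding of the relevance query
detailed_allegation = [0.8, 0.2, 0.5]    # semantically close document
passing_mention = [0.1, 0.9, 0.1]        # shares a name, little else

# A detailed allegation embeds closer to the query than a passing mention.
assert cosine(query, detailed_allegation) > cosine(query, passing_mention)
```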

Data Sources

Note on document links

Some document links in the app may not resolve. Our DOJ document reference numbers (EFTA numbers) are accurate and can be used to look up the original documents on the DOJ Epstein files page.

Important Disclaimers

Inclusion does not imply guilt

Many individuals referenced in the Epstein files appeared as witnesses, were mentioned in passing, or had legitimate professional relationships. Being named in a document does not indicate involvement in criminal activity.

AI-generated analysis

Summaries, scores, and classifications are produced by machine learning models and may contain errors or misinterpretations. Always refer to the original source documents for authoritative information. The Impunity Index is a research tool, not a factual determination.

Not legal advice

This site is for informational and academic purposes only and does not constitute legal advice or official legal analysis.

Data limitations

All text was extracted programmatically from PDFs. Handwritten documents, image-embedded text, and certain file formats may not be fully captured. The dataset is not comprehensive — it represents the publicly released portion of Epstein case files.

Model inconsistencies

Multiple models were used across the pipeline, and their outputs can disagree. While consensus scoring mitigates this, scores are not perfect and should be interpreted as directional indicators, not precise measurements.

Inspiration & Acknowledgments

This project was built for Duke AIPI by Lindsay Gross, Shreya Mendi, and Andrew Jin.

We built on the shoulders of others who believed this information should be accessible to the public: the journalists, researchers, and open-source communities whose searchable archives, entity graphs, and document indexes made this corpus navigable.

Ethics Statement

Our goal is transparency about accountability, not accusation.

We believe the public has a right to understand patterns in publicly released legal documents. These are court records, government filings, and depositions that have already been made public through legal proceedings and FOIA requests. We are not revealing private information — we are making existing public information more accessible and analyzable.

We designed the scoring methodology to be evidence-based and reproducible, not sensationalized. Every score traces back to specific document features that can be independently verified. We chose log-scaled, percentile-capped normalization specifically to avoid arbitrary score inflation.

We welcome scrutiny of our methodology and corrections to our data. If you find an error in our analysis or believe a score is miscalibrated, we want to know. The code is open-source and the methodology is documented.