Large Language Model Benchmarking

LLM BENCHMARKS

Large Language Model Benchmarking for Extractive, Classification, and Predictive Tasks

Welcome to the ultimate resource for comparing large language models. Here, we meticulously analyze and present the accuracy, speed, and cost-effectiveness of leading models across critical tasks such as information extraction, clause classification, and summarization.

Indico Data has been a guiding force in the AI industry since its inception, consistently emphasizing practical AI applications and real customer outcomes amidst a landscape often clouded by overhype. Indico was the first in the industry to deploy a large language model-based application inside the enterprise and the first to integrate explainability and auditability directly into its products, setting a standard for transparency and trust.

While the vast majority of LLM benchmarking is focused on chatbot-related tasks, Indico recognized the need to understand the performance of large language models for more deterministic tasks such as extraction and classification, and further to understand the performance and costs based on assumptions related to context length and task complexity.

Leaderboard

Indico Data runs a monthly benchmarking exercise across providers (LLama, Azure OpenAI, Google, AWS Bedrock, and Indico trained discriminative standard language models RoBERTa and DeBERTa), datasets (e.g. cord and CUAD), and capabilities (text classification, key information extraction, and generative summarization). The table below ranks the accuracy (F1 score) of these models for each capability averaged over datasets and prompt styles. The "Accuracy" page contains the same information at a much more granular level of detail.

Accuracy Comparisons

Full details of this month's benchmarking run across models, capabilities, prompt styles. This information is meant to facilitate decision making when trying to decide the best model for a given task. For example, if missed information in your process is expensive, then
you should choose a model with high recall.

Clear All

Green means better than average, red means worse than average, orange is average.
The size is how far above/below average the model is.

Cost of Ownership

Gain insights into not just how well each model performs, but how fast and cost-efficiently they do it.

Plotted below are the tradeoffs between accuracy (F1 score) and cost and accuracy and response time by model for all capabilities, datasets, and prompt styles.

Extraction
Classification

Last Updated August 2024

What changed with the most recent set of benchmarks?

We made sure to test the most recent models from OpenAI, Anthropic, Google, and Mistral.
We removed the summarization benchmark. Providing a fair comparison of how well models perform on summarization tasks requires significant nuance and is difficult to assess with a human evaluator. We recommend looking into resources like: Summarization Leaderboard and the Hallucination Leaderboard.
For one of our extraction tasks (Charities), LLM-based solutions have higher precision but significantly lower recall than the equivalent Discriminative solution.
For the NDA task, some LLMs outperform the discriminative solution for the first time. Gemini 1.5 Pro, Gemini 1.5 Flash lead the pack, followed by GPT-4-1106-preview and GPT-4-turbo.
There is still a large gap between Discriminative AI and Gen AI on the CORD receipt extraction task, which includes quite granular categories. Even after breaking up the extraction task into smaller pieces for the GenAI models (with separate prompts for line items, summary information, and overall receipt metadata) there is still a gap of nearly 40 F1 points on CORD.
Gemini 1.5 Flash and GPT4o-mini both have impressive showings for their weight class, trailing their larger counterparts by less than 5 F1 points on the extraction tasks. Gemini’s new pricing comes in at less than 1/100th the cost of the most expensive models in our benchmark, GPT4o-mini at less than 1/50th.
While progress seems to have slowed down at the upper end of model sizes, small models continue to become more competitive and cost continues to drop, meaning “intelligence per dollar” is accelerating rapidly. These smaller models are also lower latency, so we expect these smaller models (not GPT4) to become the workhorses of the enterprise.
Claude 3.5 Sonnet is the best overall performer if price is no object, coming in 1st on the classification benchmark and a close 3rd on the extraction benchmark.
All major providers are competitive, with the lead OpenAI once held having all but dissipated.
We identified an issue with the last round of benchmarks that meant LLM performance was underestimated, but have corrected the issue in this round of benchmarks.
Providers are starting to differentiate based on feature/function rather than pure response quality. OpenAI is doubling down on “function calling,” Anthropic on better chat UX through their Projects and Artifacts features, and Gemini on cost-savings via context caching and aggressive pricing.

What hasn’t changed?

In aggregate, discriminative models still lead the pack for Extraction tasks, and the gap is still significant: around 20 F1 points, due largely to the poor performance of LLMs.

Extraction Datasets

CORD

COnsolidated Receipt Dataset for post-OCR parsing.

Original source: https://github.com/clovaai/cord

From the authors:
…The dataset consists of thousands of Indonesian receipts, which contains images and box/text annotations for OCR, and multi-level semantic labels for parsing...

Kleister NDA

Non-disclosure agreements, published by Applica AI.

Original source: https://github.com/applicaai/kleister-nda

From the authors:
Extract the information from NDAs (Non-Disclosure Agreements) about the involved parties, jurisdiction, contract term, etc...

Charity Reports

Charity financial reports, published by Applica AI.

Original source: https://github.com/applicaai/kleister-charity

From the authors:
The goal of this task is to retrieve charity address (but not other addresses), charity number, charity name and its annual income and spending in GBP in PDF files published by British charities...

Classification Datasets

CUAD

Contract Understanding Atticus Dataset

Original source: https://www.atticusprojectai.org/cuad

From the authors:
...a corpus of 13,000+ labels in 510 commercial legal contracts that have been manually labeled under the supervision of experienced lawyers to identify 41 types of legal clauses that are considered important in contact review...

Resource Contracts

ResourceContracts is a repository of publicly available oil, gas, and mining contracts

Original source: https://www.resourcecontracts.org/

Indico retrieved hundreds of contracts from this repository and labeled key information including names, organizations, section orders, and full clauses (used in this classification task).

Contract NLI

Legal language classification into three classes: Entailment, Contradiction or NotMentioned

Original source: https://stanfordnlp.github.io/contract-nli/

From the authors:
ContractNLI is the first dataset to utilize NLI for contracts and is also the largest corpus of annotated contracts (as of September 2021)...

Prompt Styles: Extraction

Documents are first split into overlapping chunks of roughly 1200 tokens and then those chunks are injected into the following prompt structure:

System prompt: You are a skilled human knowledge worker whose task is to extract key information from text.
Extraction instructions: """Instructions:

Find the data elements in the document that match the instructions for the fields provided.
Do not calculate or infer anything. Answers should be copied directly from the Document with no modification or formatting changes.
Answer "N/A" if no perfect matches are available. (Note, the majority of responses will be N/A)
Output your answer(s) as a bulleted list. - DO NOT number your answers."""

Finally, the LLM is fed a document chunk c and descriptions of the fields to be extracted.

Prompt Styles: Classification

There are six distinct classification prompt styles applied to each chunk c and class list, but they all share a common backbone:

Extraction instructions: """Instructions:

You are a skilled human knowledge worker whose task is classify text.
Please classify this text:\n------------------\n{c}\n--------------\ninto one of these categories: {class_list}\
Respond with one category only."

The variants:

No description of the classes
With description of the classes
Prompted to include rationale (with and without descriptions): Show your workings and then answer using the format 'Answer:
Prompted under duress with rationale (with and without descriptions): I am under serious pressure to get this right and may lose my job if I don't. Please help me.

Featured LLM Resources

Large language models are the driving force behind the generative AI boom of 2023. However, they've been around for a while - and we know a thing or two about them.

Since our founding in 2014, Indico has been on the forefront of innovation in unstructured data and intelligent document processing, with a leadership team that brings years of experience deep expertise in artificial intelligence and machine learning-powered solutions.

Blog

Future-proofing insurance: Navigating AI, data science, and regulatory landscapes with Apollo 1971’s Joe Curry

Webinars

Leveraging LLMs and automation in insurance: A webinar recap

Webinars

How carriers are leveraging large language models (LLMs) and automation to drive better decisions

Blog

Simpler labeling means faster time to value for insurance intelligent automation models

Blogs

A sure fix for a vexing insurance automation problem: importing data from table

Blogs

The risks and rewards inherent in ChatGPT and generative AI

Blogs

Unleashing efficiency: AI-powered document intake for managing general agents

Better data.
Better decisions.

Subscribe to our newsletter

Ask Indico

We help carriers make faster, smarter decisions across underwriting and claims — ask me how.

LLM BENCHMARKS