Sumo AI

500,000 Hours of Real Human Conversation

The single largest commercially available dataset of its kind, sourced from the U.S. legal services industry.

Human VERIFIED

Human-In-The-Loop Transcriptions:

• Word Level Timestamps

• Fully Diarized

• Dual Channel

• PII Redacted

Real WORLD

Not scripted, not synthetic. Authentic calls:

• Natural disfluencies

• Interruptions

• Emotional variance

• Unpredictability

Metadata INCLUDED

The context around the call, not just the call itself:

• CSAT scores

• Geographic location

• Outcomes and summaries

• Topic & sentiment labels

Legal DOMAIN

Complex conversations across the full spectrum of consumer legal services:

• Business formations

• Wills & Trusts

• Trademarks & Copyrights

• Taxes

What you can build with real conversation data

Frontier models are trained on clean data. Production environments aren't clean.

Robust ASR in Noisy Conditions

Train models that actually work in production

Real calls include background noise, overlapping speech, and heavy accents - the exact conditions where clean-data-trained ASR models degrade. Fine-tune on authentic acoustic environments to close the gap between benchmark WER and production WER.

Speech Recognition

Accent Adaptation

Noise Robustness

Emotion & Tone Recognition

Go beyond surface-level sentiment

Customer service calls are one of the richest natural sources of emotionally dynamic conversation. Train models to detect frustration masked by calm speech, sarcasm, escalation patterns, and tonal shifts that carry more signal than words alone.

Sentiment Analysis

Paralinguistics

Affective Computing

Pragmatic & Indirect Speech

Interpret what people mean, not what they say

"I guess I'll just figure it out myself" isn't a plan - it's a complaint. Customer calls are dense with indirect speech acts and implicit requests that frontier models still take at face value.

NLU

Intent Detection

Pragmatics

Turn-Taking & Conversation Flow

Build voice agents that don't talk over people

Real dialogue involves interruptions, backchannels, long pauses, and implicit cues about when a speaker is done vs. thinking. Train on natural patterns to build systems that handle conversational flow without awkward collisions.

Voice Agents

Dialogue Systems

Real-Time Processing
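
As a rough illustration of how word-level timestamps on separate agent and caller channels can be used to study turn-taking, the sketch below measures how many seconds the two speakers talk at the same time. The `(start, end)` tuple layout and the function names are assumptions made for this example; the dataset's actual transcript schema is not described here.

```python
from typing import List, Tuple

# Hypothetical representation: one (start_seconds, end_seconds) pair per word.
Interval = Tuple[float, float]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that touch or overlap."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overtalk_seconds(agent: List[Interval], caller: List[Interval]) -> float:
    """Total time during which both speakers have a word in progress."""
    a, b = merge(agent), merge(caller)
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total
```

The same two-pointer sweep can be adapted to collect the overlapping spans themselves, for example to label interruptions versus short backchannels by duration.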

Code-Switching & Multilingual Mixing

Handle language the way people actually speak it

Diverse customer bases produce natural intra-sentential code-switching - Spanglish, Hinglish, Cantonese-English blends. Models handle each language in isolation but break at the seams. Train on real multilingual speech.

Multilingual

Code-Switching

Language ID

Noisy Transcript Comprehension

Extract meaning from imperfect ASR output

Production pipelines produce run-on, unpunctuated, error-filled transcripts. Fine-tune downstream models (summarization, entity extraction, routing) to be robust to the disfluent, messy input they'll actually encounter.

Post-ASR NLP

Entity Extraction

Summarization

Technical Specs

Audio Format

• Dual-channel audio (agent + caller separated)
• 8 kHz / 16-bit PCM
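
For readers who want to work with each speaker separately, here is a minimal sketch of splitting a dual-channel, 16-bit PCM WAV file into its two tracks using only the Python standard library. It assumes the audio is delivered as a standard stereo WAV container; the container format and which channel carries the agent versus the caller are assumptions to confirm against the sample data.

```python
import array
import sys
import wave


def split_channels(path):
    """Return (channel_0, channel_1, sample_rate) from a stereo 16-bit WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 2-channel, 16-bit PCM audio")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Samples are interleaved: ch0, ch1, ch0, ch1, ...
    return samples[0::2], samples[1::2], sample_rate
```

Each returned track is an `array` of signed 16-bit samples, so it can be handed to a feature extractor or written back out as a mono file with `wave`.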

Text Format

• Fully diarized with speaker-aligned timestamps
• Human-In-The-Loop golden transcripts (<2% WER)
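
Word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn a hypothesis transcript into the reference, divided by the number of reference words, so a figure under 2% means fewer than two such errors per hundred reference words. The sketch below shows that standard calculation; the whitespace tokenization and the absence of any text normalization are simplifying assumptions, not a description of how the figure above was measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In practice, both transcripts are usually normalized first (casing, punctuation, number formats) so that formatting differences are not counted as recognition errors.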

Native Annotations (Included)

• Word-Level Timestamps
• PII Redaction
• Geography
• CSAT
• Intent
• Sentiment
• Overtalk

Human Annotations (Available)

• Emotions
• Accent & Dialect
• Code Switching
• Disfluency
• Social Dynamics
• Cultural Patterns

Fully Compliant
Fully De-Identified

All personally identifiable information has been removed from both audio and transcripts. The data is fully compliant with CCPA, GDPR, and applicable privacy regulations - ready for use in model training, evaluation, and research without additional redaction or legal review.

Data Origins

Every hour of audio originates from U.S. legal services companies that help millions of customers with business formation, registered agent services, trademarks, IP filings, wills, trusts, and tax advisory.

Access Our Sample
