CUSTOMER SERVICE AUDIO DATASET

500,000 hours of real human conversation

The single largest commercially available dataset of its kind.
Sourced from the U.S. legal services industry

Accurate

TRANSCRIPT

High-Fidelity & Aligned

Dual

CHANNEL

Agent & Caller Separated

Fully

DIARIZED

Speaker-Aligned Transcripts

PII

REDACTED

Audio & Text, CCPA / GDPR

Largest of its Kind

No other commercially available dataset of customer service audio comes close in scale, depth, or provenance.

Real-World Conversations

Not scripted, not synthetic. Authentic calls with natural disfluencies, interruptions, emotional variance, and unpredictability.

Native Metadata

Every call ships with native call metadata such as geographic location, routing category, and CSAT scores (when available), alongside programmatic enrichments such as overtalk detection, turn-taking patterns, and speech rate.
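
As a rough illustration of what an overtalk enrichment measures, the sketch below sums the time during which agent and caller speak simultaneously. The input format (lists of `(start, end)` tuples in seconds, one list per speaker, non-overlapping within a speaker) and the function name `overtalk_seconds` are assumptions made for this example, not the dataset's documented schema.

```python
def overtalk_seconds(agent, caller):
    """Total seconds during which both speakers are talking.

    `agent` and `caller` are hypothetical lists of (start, end) tuples in
    seconds. Segments are assumed not to overlap within a single speaker.
    """
    agent = sorted(agent)
    caller = sorted(caller)
    i = j = 0
    total = 0.0
    while i < len(agent) and j < len(caller):
        # Intersection of the two current segments, if any
        start = max(agent[i][0], caller[j][0])
        end = min(agent[i][1], caller[j][1])
        if end > start:
            total += end - start
        # Advance whichever segment finishes first
        if agent[i][1] < caller[j][1]:
            i += 1
        else:
            j += 1
    return total


agent_turns = [(0.0, 5.0), (8.0, 12.0)]
caller_turns = [(4.0, 9.0), (11.0, 15.0)]
print(overtalk_seconds(agent_turns, caller_turns))
```

Dividing the result by call duration would give an overtalk ratio that is comparable across calls of different lengths.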

Domain Density

Business formation, IP, estate planning, tax - high-complexity conversations across the full spectrum of consumer legal services.

USE CASES

What you can build with real conversation data

Frontier models are trained on clean data. Production environments aren't clean.

Robust ASR in Noisy Conditions

Train models that actually work in production

Real calls include background noise, overlapping speech, and heavy accents - the exact conditions where clean-data-trained ASR models degrade. Fine-tune on authentic acoustic environments to close the gap between benchmark WER and production WER.

Speech Recognition

Accent Adaptation

Noise Robustness
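
For readers less familiar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal, dependency-free version; the function name and the simple lowercase-and-split normalization are assumptions for illustration, and real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words
print(word_error_rate("i need to file a trademark", "i need to file trademark"))
```

Scoring the same model on a clean benchmark set and on a held-out set of real calls gives the two numbers whose gap the paragraph above refers to.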

Emotion & Tone Recognition

Go beyond surface-level sentiment

Customer service audio is one of the richest natural sources of emotionally dynamic conversation. Train models to detect frustration masked by calm speech, sarcasm, escalation patterns, and tonal shifts that carry more signal than words alone.

Sentiment Analysis

Paralinguistics

Affective Computing

Pragmatic & Indirect Speech

Interpret what people mean, not what they say

"I guess I'll just figure it out myself" isn't a plan - it's a complaint. Customer calls are dense with indirect speech acts and implicit requests that frontier models still take at face value.

NLU

Intent Detection

Pragmatics

Turn-Taking & Conversation Flow

Build voice agents that don't talk over people

Real dialogue involves interruptions, backchannels, long pauses, and implicit cues about when a speaker is done vs. thinking. Train on natural patterns to build systems that handle conversational flow without awkward collisions.

Voice Agents

Dialogue Systems

Real-Time Processing

Code-Switching & Multilingual Mixing

Handle language the way people actually speak it

Diverse customer bases produce natural intra-sentential code-switching - Spanglish, Hinglish, Cantonese-English blends. Models handle each language in isolation but break at the seams. Train on real multilingual speech.

Multilingual

Code-Switching

Language ID

Noisy Transcript Comprehension

Extract meaning from imperfect ASR output

Production pipelines produce run-on, unpunctuated, error-filled transcripts. Fine-tune downstream models - summarization, entity extraction, routing - to be robust to the disfluent, messy input they'll actually encounter.

Post-ASR NLP

Entity Extraction

Summarization

Technical Specs

Format

Dual-channel audio (agent + caller separated)

Sample Rate

8 kHz / 16-bit PCM

Transcripts

Fully diarized with speaker-aligned timestamps
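
A minimal sketch of reading a dual-channel, 16-bit PCM recording into separate per-speaker sample arrays, using only the Python standard library. It assumes the audio is delivered as a WAV file; the container format, the function name `split_channels`, and which channel carries the agent versus the caller are assumptions for illustration and should be checked against the actual delivery spec.

```python
import array
import sys
import wave


def split_channels(path: str):
    """Return (channel_0, channel_1, sample_rate) for a stereo 16-bit WAV."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 2-channel, 16-bit PCM audio")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()      # WAV sample data is little-endian

    # Stereo PCM is interleaved: ch0, ch1, ch0, ch1, ...
    # Which channel is the agent is an assumption to verify per delivery.
    return samples[0::2], samples[1::2], rate
```

Keeping the channels separate lets each speaker be transcribed or analyzed independently, without first running diarization on a mixed signal.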

Native Annotations
(Included)

Word-Level Timestamps

PII Redaction

Geography

CSAT

Intent

Sentiment

Overtalk

Human Annotations
(Available)

Emotions

Accent & Dialect

Code-Switching

Disfluency

Social Dynamics

Cultural Patterns

Privacy & Compliance

Fully compliant, fully de-identified

All personally identifiable information has been removed from both audio and transcripts. The data is fully compliant with CCPA, GDPR, and applicable privacy regulations - ready for use in model training, evaluation, and research without additional redaction or legal review.

Provenance

Data Origins

Every hour of audio originates from U.S. legal services companies that help millions of customers with business formation, registered agent services, trademarks, IP filings, wills, trusts, and tax advisory.

Stay up-to-date on new datasets

©️ 2026 Sumo AI
