CUSTOMER SERVICE AUDIO DATASET
500,000 hours of real human conversation
The single largest commercially available dataset of its kind.
Sourced from the U.S. legal services industry
Accurate
TRANSCRIPT
High-Fidelity & Aligned
Dual
CHANNEL
Agent & Caller Separated
Fully
DIARIZED
Speaker-Aligned Transcripts
PII
REDACTED
Audio & Text, CCPA / GDPR
Largest of its Kind
No other commercially available dataset of customer service audio comes close in scale, depth, or provenance.
Real-World Conversations
Not scripted, not synthetic. Authentic calls with natural disfluencies, interruptions, emotional variance, and unpredictability.
Native Metadata
Domain Density
Business formation, IP, estate planning, tax - high-complexity conversations across the full spectrum of consumer legal services.
USE CASES
What you can build with real conversation data
Frontier models are trained on clean data. Production environments aren't clean.
Robust ASR in Noisy Conditions
Train models that actually work in production
Real calls include background noise, overlapping speech, and heavy accents - the exact conditions where clean-data-trained ASR models degrade. Fine-tune on authentic acoustic environments to close the gap between benchmark WER and production WER.
Speech Recognition
Accent Adaptation
Noise Robustness
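For reference, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The sketch below is a minimal illustration in Python, assuming lowercase whitespace tokenization; real evaluations usually apply more careful text normalization. The function name and sample sentences are hypothetical and not part of the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: two substitutions over seven reference words.
wer = word_error_rate("i would like to form an llc",
                      "i would like to form a lc")
print(f"WER: {wer:.3f}")
```

Scoring the same model on a clean benchmark and on held-out real calls with a function like this is one way to quantify the benchmark-to-production gap.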
Emotion & Tone Recognition
Go beyond surface-level sentiment
Customer service audio is one of the richest natural sources of emotionally dynamic conversation. Train models to detect frustration masked by calm speech, sarcasm, escalation patterns, and tonal shifts that carry more signal than words alone.
Sentiment Analysis
Paralinguistics
Affective Computing
Pragmatic & Indirect Speech
Interpret what people mean, not what they say
"I guess I'll just figure it out myself" isn't a plan - it's a complaint. Customer calls are dense with indirect speech acts and implicit requests that frontier models still take at face value.
NLU
Intent Detection
Pragmatics
Turn-Taking & Conversation Flow
Build voice agents that don't talk over people
Real dialogue involves interruptions, backchannels, long pauses, and implicit cues about when a speaker is done vs. thinking. Train on natural patterns to build systems that handle conversational flow without awkward collisions.
Voice Agents
Dialogue Systems
Real-Time Processing
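As an illustration of how speaker-aligned timestamps can be used to study turn-taking, the sketch below totals the time during which agent and caller speak at once. It assumes each speaker's turns are available as sorted, non-overlapping (start, end) pairs in seconds; that layout and the function name are hypothetical, not the dataset's documented schema.

```python
def overtalk_seconds(agent_turns, caller_turns):
    """Total seconds during which both speakers are talking.

    Each argument is a list of (start, end) tuples in seconds, sorted by
    start time and non-overlapping within the same speaker (assumed).
    """
    total = 0.0
    i = j = 0
    while i < len(agent_turns) and j < len(caller_turns):
        a_start, a_end = agent_turns[i]
        c_start, c_end = caller_turns[j]
        # Length of the intersection of the two intervals, if any.
        overlap = min(a_end, c_end) - max(a_start, c_start)
        if overlap > 0:
            total += overlap
        # Advance whichever turn finishes first.
        if a_end <= c_end:
            i += 1
        else:
            j += 1
    return total


# Made-up turns for illustration only.
agent = [(0.0, 4.0), (6.0, 9.0)]
caller = [(3.5, 6.5), (8.0, 10.0)]
print(overtalk_seconds(agent, caller))
```

The two-pointer sweep runs in linear time over the turns, which keeps it practical on long calls.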
Code-Switching & Multilingual Mixing
Handle language the way people actually speak it
Diverse customer bases produce natural intra-sentential code-switching - Spanglish, Hinglish, Cantonese-English blends. Models handle each language in isolation but break at the seams. Train on real multilingual speech.
Multilingual
Code-Switching
Language ID
Noisy Transcript Comprehension
Extract meaning from imperfect ASR output
Production pipelines produce run-on, unpunctuated, error-filled transcripts. Fine-tune downstream models - summarization, entity extraction, routing - to be robust to the disfluent, messy input they'll actually encounter.
Post-ASR NLP
Entity Extraction
Summarization
Technical Specs
Format: Dual-channel audio (agent + caller separated)
Sample Rate: 8 kHz / 16-bit PCM
Transcripts: Fully diarized with speaker-aligned timestamps
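To make the audio format concrete, here is a minimal Python sketch that splits a two-channel, 16-bit PCM recording into one sample array per channel using only the standard library. It assumes the audio is delivered in a WAV container, which this page does not state; which channel carries the agent and which the caller is also an assumption to confirm against the delivery documentation.

```python
import array
import sys
import wave


def split_channels(path):
    """Return (sample_rate, channel_0, channel_1) for a stereo 16-bit WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 2-channel, 16-bit PCM audio")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Stereo PCM frames are interleaved: ch0, ch1, ch0, ch1, ...
    return sample_rate, samples[0::2], samples[1::2]
```

A call such as `rate, first, second = split_channels("call.wav")` (the file name is a placeholder) would yield two arrays that can be passed to separate per-speaker pipelines.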
Native Annotations
(Included)
Word-Level Timestamps
PII Redaction
Geography
CSAT
Intent
Sentiment
Overtalk
Human Annotations
(Available)
Emotions
Accent & Dialect
Code Switching
Disfluency
Social Dynamics
Cultural Patterns
Privacy & Compliance
All personally identifiable information has been removed from both audio and transcripts. The data is fully compliant with CCPA, GDPR, and applicable privacy regulations - ready for use in model training, evaluation, and research without additional redaction or legal review.
Provenance
Every hour of audio originates from U.S. legal services companies that help millions of customers with business formation, registered agent services, trademarks, IP filings, wills, trusts, and tax advisory.

