Now Entering Pipeline · Early Access
Sumo AI · Consumer Technology Support Dataset
Seven hundred fifty thousand hours of real technical support.
Customer support audio from a global consumer electronics company. Real customers, real devices, real troubleshooting, recorded at enterprise scale.
750K
HOURS
6.6M+
CALLS
43 TB
RAW AUDIO
EN · ES
85% / 15%
WHY THIS DATA
The dialogue structure models can't fake
Most customer service audio is about accounts and transactions. This corpus is about diagnosis. A customer describes a symptom. An agent forms a hypothesis, gives an instruction, waits while the customer tries it, hears what happened, and adjusts. The next step depends entirely on the answer.
That loop (multi-turn, grounded, contingent on real-world feedback) is the structure underneath every assistant that walks a human through a task. It barely exists in web text, and it cannot be synthesized convincingly. Here it runs hundreds of thousands of hours deep, in the casual register of real consumers working through hardware, software, and everything between.
It is also acoustically unforgiving in the best way. Model numbers, serial numbers, firmware versions, error codes: dense alphanumeric speech that breaks weak transcription systems and trains strong ones.
THE DATASET
Built like every Sumo corpus
Single enterprise source, full provenance, and the same human-in-the-loop pipeline behind our legal and financial services datasets.
01
Real World
Genuine customers with malfunctioning devices, real deadlines, and real emotion. Background noise, interruptions, hold queues, accents, and the full arc from frustration to resolution. Nothing scripted, nothing acted, nothing synthetic.
02
Procedural Dialogue
Multi-turn troubleshooting at scale: symptom, hypothesis, instruction, attempt, outcome, next step. Device setup, connectivity, software and firmware issues, warranty triage, subscriptions and billing. The training signal for assistants that guide humans through tasks.
03
Bilingual at Consumer Register
85% English, 15% Spanish, spoken the way consumers actually talk: informal, idiomatic, emotional, and full of product vocabulary and alphanumerics that stress-test any speech system.
04
Entering Our Pipeline
43 TB of raw audio now moving through the same process as every Sumo dataset: human-in-the-loop transcription, PII de-identification, and voice anonymization. Early access partners receive data as it clears the pipeline and help set processing priorities.
Regulated domains taught the models formal language under high stakes. Consumer electronics is the other half: informal speech, technical vocabulary, and dialogue that has to actually solve something before the call ends.
