Now Entering Pipeline · Early Access
Sumo AI · Consumer Technology Support Dataset
Seven hundred fifty thousand hours of real technical support.
Customer support audio from a global consumer electronics
company. Real customers, real devices, real troubleshooting,
recorded at enterprise scale.
Request Early Access
Speak with Our Team
750K Hours
6.6M+ Calls
43 TB Raw Audio
EN · ES: 85% / 15%
Why This Data
The dialogue structure models can't fake
Most customer service audio is about accounts and transactions. This corpus is about
diagnosis. A customer describes a symptom. An agent forms a hypothesis, gives an
instruction, waits while the customer tries it, hears what happened, and adjusts. The
next step depends entirely on the answer.
That loop (multi-turn, grounded, contingent on real-world feedback) is the structure
underneath every assistant that walks a human through a task. It barely exists in web
text, and it cannot be synthesized convincingly. Here it runs hundreds of thousands
of hours deep, in the casual register of real consumers working through hardware,
software, and everything between.
It is also acoustically unforgiving in the best way. Model numbers, serial numbers,
firmware versions, error codes: dense alphanumeric speech that breaks weak
transcription systems and trains strong ones.
The Dataset
Built like every Sumo corpus
Single enterprise source, full provenance, and the same human-in-the-loop pipeline
behind our legal and financial services datasets.
01 · Real World
Genuine customers with malfunctioning devices, real
deadlines, and real emotion. Background noise,
interruptions, hold queues, accents, and the full arc from
frustration to resolution. Nothing scripted, nothing
acted, nothing synthetic.
02 · Procedural Dialogue
Multi-turn troubleshooting at scale: symptom,
hypothesis, instruction, attempt, outcome, next step.
Device setup, connectivity, software and firmware
issues, warranty triage, subscriptions and billing. The
training signal for assistants that guide humans through
tasks.
03 · Bilingual at Consumer Register
85% English, 15% Spanish, spoken the way consumers
actually talk: informal, idiomatic, emotional, and full of
product vocabulary and alphanumerics that stress-test
any speech system.
04 · Entering Our Pipeline
43 TB of raw audio now moving through the same
process as every Sumo dataset: human-in-the-loop
transcription, PII de-identification, and voice
anonymization. Early access partners receive data as it
clears the pipeline and help set processing priorities.
Regulated domains taught the models formal language under high
stakes. Consumer electronics is the other half: informal speech, technical
vocabulary, and dialogue that has to actually solve something before the
call ends.