Sumo AI

750,000 Hours of Real Technical Support.

Customer support audio from a global consumer electronics company. Real customers, real devices, real troubleshooting, recorded at enterprise scale.

750K HOURS

6.6M+ CALLS

43 TB RAW AUDIO

EN · ES 85% / 15%

WHY THIS DATA

The dialogue structure models can't fake

Most customer service audio is about accounts and transactions. This corpus is about diagnosis. A customer describes a symptom. An agent forms a hypothesis, gives an instruction, waits while the customer tries it, hears what happened, and adjusts. The next step depends entirely on the answer.

That loop, multi-turn, grounded, contingent on real-world feedback, is the structure underneath every assistant that walks a human through a task. It barely exists in web text, and it cannot be synthesized convincingly. Here it runs hundreds of thousands of hours deep, in the casual register of real consumers working through hardware, software, and everything between.

It is also acoustically unforgiving in the best way. Model numbers, serial numbers, firmware versions, error codes: dense alphanumeric speech that breaks weak transcription systems and trains strong ones.

THE DATASET

Built like every Sumo corpus

Single enterprise source, full provenance, and the same human-in-the-loop pipeline behind our legal and financial services datasets.

Real World

Genuine customers with malfunctioning devices, real deadlines, and real emotion. Background noise, interruptions, hold queues, accents, and the full arc from frustration to resolution. Nothing scripted, nothing acted, nothing synthetic.

Procedural Dialogue

Multi-turn troubleshooting at scale: symptom, hypothesis, instruction, attempt, outcome, next step. Device setup, connectivity, software and firmware issues, warranty triage, subscriptions and billing. The training signal for assistants that guide humans through tasks.
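For readers who think in data structures, the six-stage loop above can be sketched roughly as follows. This is an illustration only, not the corpus's delivery format: the `Stage`, `Turn`, and `Call` names, the field layout, and the sample dialogue are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Hypothetical labels mirroring the loop described above.
    SYMPTOM = "symptom"
    HYPOTHESIS = "hypothesis"
    INSTRUCTION = "instruction"
    ATTEMPT = "attempt"
    OUTCOME = "outcome"
    NEXT_STEP = "next_step"


@dataclass
class Turn:
    speaker: str  # "customer" or "agent" (assumed convention)
    stage: Stage
    text: str


@dataclass
class Call:
    language: str  # e.g. "en" or "es"
    turns: list[Turn] = field(default_factory=list)

    def completed_loops(self) -> int:
        """Count instruction -> attempt -> outcome cycles, in order."""
        expected = [Stage.INSTRUCTION, Stage.ATTEMPT, Stage.OUTCOME]
        position = 0
        count = 0
        for turn in self.turns:
            if turn.stage == expected[position]:
                position += 1
                if position == len(expected):
                    count += 1
                    position = 0
            elif turn.stage == Stage.INSTRUCTION:
                # A fresh instruction restarts the cycle.
                position = 1
        return count


# Invented dialogue, for illustration only.
call = Call(
    language="en",
    turns=[
        Turn("customer", Stage.SYMPTOM, "It won't connect to Wi-Fi."),
        Turn("agent", Stage.HYPOTHESIS, "Could be a stale network profile."),
        Turn("agent", Stage.INSTRUCTION, "Forget the network and rejoin."),
        Turn("customer", Stage.ATTEMPT, "Okay, doing that now."),
        Turn("customer", Stage.OUTCOME, "Still nothing."),
        Turn("agent", Stage.NEXT_STEP, "Let's try a restart instead."),
        Turn("agent", Stage.INSTRUCTION, "Hold the power button for ten seconds."),
        Turn("customer", Stage.ATTEMPT, "Holding it."),
        Turn("customer", Stage.OUTCOME, "It's back online."),
    ],
)

print(call.completed_loops())
```

The point of the sketch is the contingency: each instruction only makes sense given the outcome before it, which is why a single call can contain several such cycles.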

Bilingual at Consumer Register

85% English, 15% Spanish, spoken the way consumers actually talk: informal, idiomatic, emotional, and full of product vocabulary and alphanumerics that stress-test any speech system.

Entering Our Pipeline

43 TB of raw audio now moving through the same process as every Sumo dataset: human-in-the-loop transcription, PII de-identification, and voice anonymization. Early access partners receive data as it clears the pipeline and help set processing priorities.

©️ 2026 Sumo AI

