Sumo AI · Datasets
Real human conversation, at frontier scale.
Enterprise customer service audio, licensed at the source, transcribed by our human-in-the-loop pipeline, and delivered ready for training.
68M CALLS · 8M+ HOURS · 29 LANGUAGES · 200+ COUNTRIES
THE CATALOG
Four ways into the corpus
Domain-dense collections for depth. The language corpus for breadth. Every dataset single-source, fully licensed, with provenance on every record.
Domain Dataset
Legal Services
500k+ hours · English
Customer conversations from the leading consumer legal platform in the United States. Dense in business formation, trademarks and IP, estate planning, and tax, with native metadata and human annotations.
View dataset →
Domain Dataset
Financial Services
6M+ hours · 29 languages
Multilingual customer conversations from a global financial platform serving 200+ countries. Transfers, fraud and disputes, identity verification, and account services, verified by native speakers.
View dataset →
Language Dataset
The Language Corpus
8M+ hours · 29 languages · 60+ accents
The full corpus organized for language coverage: every major region of the world, the accents and dialects each language is actually spoken in, and a wide distribution of subject matter.
View dataset →
Domain Dataset
EARLY ACCESS
Consumer Technology
750k+ hours · English & Spanish
Technical support conversations from a global consumer electronics company. Multi-turn procedural troubleshooting in the casual register of real consumers, now entering our pipeline.
Request Early Access →
Every dataset is licensed, de-identified, and voice-anonymized.
Transcribed by our in-house ASR or human-in-the-loop pipeline.

©️ 2026 Sumo AI


