Sumo AI

Sumo AI · Datasets

Real human conversation, at frontier scale.

Enterprise customer service audio, licensed at the source, transcribed by our human-in-the-loop pipeline, and delivered ready for training.

68M calls · 8M+ hours · 29 languages · 200+ countries

THE CATALOG

Four ways into the corpus

Domain-dense collections for depth. The language corpus for breadth. Every dataset single-source, fully licensed, with provenance on every record.

Domain Dataset

Legal Services

500k+ hours · English

Customer conversations from the leading consumer legal platform in the United States. Dense in business formation, trademarks and IP, estate planning, and tax, with native metadata and human annotations.

View dataset →

Domain Dataset

Financial Services

6M+ hours · 29 languages

Multilingual customer conversations from a global financial platform serving 200+ countries. Transfers, fraud and disputes, identity verification, and account services, verified by native speakers.

View dataset →

Language Dataset

The Language Corpus

8M+ hours · 29 languages · 60+ accents

The full corpus organized for language coverage: every major region of the world, the accents and dialects each language is actually spoken in, and a wide distribution of subject matter.

View dataset →

Domain Dataset

EARLY ACCESS

Consumer Technology

750k+ hours · English & Spanish

Technical support conversations from a global consumer electronics company. Multi-turn procedural troubleshooting in the casual register of real consumers, now entering our pipeline.

Request Early Access →

Every dataset is licensed, de-identified, and voice-anonymized.
Transcribed by our in-house ASR or human-in-the-loop pipeline.

Request Access