Largest Financial Services
Dataset in the World.

30 Languages. 70 Million Conversations. 200 Countries.

Human

VERIFIED

Plug directly into your training pipeline.

• Human-In-The-Loop Transcriptions

• Word Level Timestamps

• Fully Diarized

• Dual Channel

• PII Redacted

Multilingual

30 languages across every major market.

• English, Spanish, French

• Arabic, Korean, Japanese

• Hindi, Urdu, Mandarin

• 20+ more

Native

METADATA

The context around the call, not just the call itself.

• Call Resolution

• Geographic location

• Topic Categorization

• Sentiment Labels

Financial

DOMAIN

Complex conversation

Consumer financial services

• Cross border transfers

• Fraud and disputes

• Account and transactions

• Banking, FX, Credit Cards

What you can build with real conversation data

Frontier models are trained on clean data. Production environments aren't clean.

Robust ASR in Noisy Conditions

Train models that actually work in production

Real calls include background noise, overlapping speech, and heavy accents - the exact conditions where clean-data-trained ASR models degrade. Fine-tune on authentic acoustic environments to close the gap between benchmark WER and production WER.

Speech Recognition

Accent Adaptation

Noise Robustness
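
For readers who want to measure the benchmark-versus-production gap on their own data, word error rate (WER) can be computed as the word-level edit distance divided by the number of reference words. The sketch below is illustrative only: the function name and the lowercase-and-split normalization are assumptions, not part of the dataset tooling, and real evaluations usually apply stricter text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    golden = "send money to my account"
    asr_output = "send money to account"
    print(f"WER: {word_error_rate(golden, asr_output):.2f}")
```

Running the same function against a clean benchmark set and a sample of real calls gives a rough view of how far the two numbers diverge.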

Emotion & Tone Recognition

Go beyond surface-level sentiment

Customer service audio is one of the richest natural sources of emotionally dynamic conversation. Train models to detect frustration masked by calm speech, sarcasm, escalation patterns, and tonal shifts that carry more signal than words alone.

Sentiment Analysis

Paralinguistics

Affective Computing

Pragmatic & Indirect Speech

Interpret what people mean, not what they say

"I guess I'll just figure it out myself" isn't a plan - it's a complaint. Customer calls are dense with indirect speech acts and implicit requests that frontier models still take at face value.

NLU

Intent Detection

Pragmatics

Turn-Taking & Conversation Flow

Build voice agents that don't talk over people

Real dialogue involves interruptions, backchannels, long pauses, and implicit cues about when a speaker is done vs. thinking. Train on natural patterns to build systems that handle conversational flow without awkward collisions.

Voice Agents

Dialogue Systems

Real-Time Processing

Code-Switching & Multilingual Mixing

Handle language the way people actually speak it

Diverse customer bases produce natural intra-sentential code-switching - Spanglish, Hinglish, Cantonese-English blends. Models handle each language in isolation but break at the seams. Train on real multilingual speech.

Multilingual

Code-Switching

Language ID

Noisy Transcript Comprehension

Extract meaning from imperfect ASR output

Production pipelines produce run-on, unpunctuated, error-filled transcripts. Fine-tune downstream models - summarization, entity extraction, routing - to be robust to the disfluent, messy input they'll actually encounter.

Post-ASR NLP

Entity Extraction

Summarization

Technical Specs

Audio Format

• Dual-channel audio (agent + caller separated)
• 8 kHz / 16-bit PCM
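
As a rough illustration of working with dual-channel 16-bit PCM, the sketch below splits an interleaved stereo WAV file into two mono sample arrays using only the Python standard library. The WAV container, the function names, and which channel carries the agent versus the caller are all assumptions here; check the delivered files before relying on any of them.

```python
import sys
import wave
from array import array


def deinterleave(frames: bytes) -> tuple[array, array]:
    """Split interleaved 16-bit stereo PCM bytes into two mono arrays."""
    samples = array("h")  # signed 16-bit integers, native byte order
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV PCM data is little-endian
    # Even indices are the first channel, odd indices the second.
    return samples[0::2], samples[1::2]


def read_dual_channel(path: str) -> tuple[int, array, array]:
    """Return (sample_rate, first_channel, second_channel) for a WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 2-channel, 16-bit PCM audio")
        rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    first, second = deinterleave(frames)
    return rate, first, second
```

A call such as read_dual_channel("call.wav") (hypothetical file name) would be expected to return a sample rate of 8000 for audio matching the spec above.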

Text Format

• Fully diarized with speaker-aligned timestamps
• Human-In-The-Loop golden transcripts (<2% WER)
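
To show what speaker-aligned word timestamps make possible, the sketch below estimates overtalk (seconds where both parties speak at once) from a list of timed words. The Word record, its field names, and the "agent" / "caller" labels are hypothetical; the actual transcript schema may differ.

```python
from typing import NamedTuple


class Word(NamedTuple):
    speaker: str  # hypothetical labels: "agent" or "caller"
    text: str
    start: float  # seconds from the start of the call
    end: float


def overtalk_seconds(words: list[Word]) -> float:
    """Total time during which agent and caller words overlap."""
    # Assumes words from the same speaker do not overlap each other.
    agent = sorted((w.start, w.end) for w in words if w.speaker == "agent")
    caller = sorted((w.start, w.end) for w in words if w.speaker == "caller")

    total = 0.0
    i = j = 0
    while i < len(agent) and j < len(caller):
        start = max(agent[i][0], caller[j][0])
        end = min(agent[i][1], caller[j][1])
        if end > start:
            total += end - start
        # Advance whichever interval finishes first.
        if agent[i][1] <= caller[j][1]:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    sample = [
        Word("agent", "one", 0.0, 0.5),
        Word("agent", "moment", 0.5, 1.0),
        Word("caller", "but", 0.75, 1.25),
    ]
    print(overtalk_seconds(sample))
```

The same interval sweep can be extended to flag interruptions or measure response latency between turns.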

Native Annotations
(Included)

Word-Level Timestamps

PII Redaction

Geography

CSAT

Intent

Sentiment

Overtalk

Human Annotations
(Available)

Emotions

Accent & Dialect

Code Switching

Disfluency

Social Dynamics

Cultural Patterns

Fully Compliant
Fully De-Identified

• All personally identifiable information has been removed from both audio and transcripts.
• The data is fully compliant with CCPA, GDPR, and applicable privacy regulations.
• Ready for use in model training, evaluation, and research without additional redaction or legal review.

Data
Origins

• Sourced from the global financial services industry, serving millions of customers across 200+ countries.
• Spans account inquiries, transaction disputes, fraud resolution, compliance verification, and multilingual support interactions.
• Data ownership and licensing rights are legally vetted.

Access Sample Data

Try Our Data
