Combined Corpus · Language & Topic Distribution

68 million conversations.
Twenty-nine languages.

Over 8 million hours of real customer calls, spoken in the accents of every region the world does business in, covering nearly everything people call about.

68M Calls
8M+ Hours
29 Languages
200+ Countries

Language Coverage

Spoken the way the world speaks

Twenty-nine languages spanning every major region, each delivered at training-relevant volume. Not a token sample of each: every language below ships with more than 1,300 hours of audio, and 22 of the 29 exceed 10,000 hours.

Global English

Spoken in every market served

English: 4,804,274 hrs

Iberia & Latin America

4 languages

Spanish: 1,374,309 hrs
Portuguese: 116,003 hrs
Haitian Creole: 30,022 hrs
Catalan: 4,214 hrs

Western & Central Europe

4 languages

French: 352,303 hrs
German: 304,288 hrs
Italian: 171,776 hrs
Dutch: 51,348 hrs

Northern Europe

3 languages

Swedish: 54,929 hrs
Norwegian: 17,248 hrs
Danish: 12,966 hrs
Eastern & Southern Europe

5 languages

Romanian: 35,750 hrs
Polish: 13,990 hrs
Russian: 12,091 hrs
Bulgarian: 1,725 hrs
Greek: 1,614 hrs

Middle East & North Africa

2 languages

Turkish: 119,255 hrs
Arabic: 93,745 hrs

South Asia

3 languages

Urdu: 23,095 hrs
Hindi: 12,521 hrs
Bengali: 2,328 hrs

East Asia

3 languages

Chinese: 53,390 hrs
Japanese: 19,942 hrs
Korean: 4,494 hrs

Southeast Asia

4 languages

Indonesian: 22,996 hrs
Tagalog: 15,920 hrs
Thai: 5,158 hrs
Vietnamese: 1,347 hrs

Exact hours shown per language. Distributions current as of June 2026.
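
For readers who want to work with these figures directly, here is a minimal Python sketch that tallies the listed hours by region and reports each region's share of the listed total. The hour counts are copied from the list above; the dictionary layout and the names HOURS_BY_REGION, region_totals, and shares are illustrative assumptions, not part of any published tooling for this corpus.

```python
# Hours per language, copied from the figures listed above.
# The structure and all names here are hypothetical, chosen for illustration.
HOURS_BY_REGION = {
    "Global English": {"English": 4_804_274},
    "Iberia & Latin America": {
        "Spanish": 1_374_309, "Portuguese": 116_003,
        "Haitian Creole": 30_022, "Catalan": 4_214,
    },
    "Western & Central Europe": {
        "French": 352_303, "German": 304_288,
        "Italian": 171_776, "Dutch": 51_348,
    },
    "Northern Europe": {
        "Swedish": 54_929, "Norwegian": 17_248, "Danish": 12_966,
    },
    "Eastern & Southern Europe": {
        "Romanian": 35_750, "Polish": 13_990, "Russian": 12_091,
        "Bulgarian": 1_725, "Greek": 1_614,
    },
    "Middle East & North Africa": {"Turkish": 119_255, "Arabic": 93_745},
    "South Asia": {"Urdu": 23_095, "Hindi": 12_521, "Bengali": 2_328},
    "East Asia": {"Chinese": 53_390, "Japanese": 19_942, "Korean": 4_494},
    "Southeast Asia": {
        "Indonesian": 22_996, "Tagalog": 15_920,
        "Thai": 5_158, "Vietnamese": 1_347,
    },
}


def region_totals(hours_by_region):
    """Return {region: summed hours} across the languages in each region."""
    return {
        region: sum(languages.values())
        for region, languages in hours_by_region.items()
    }


def shares(totals):
    """Return each entry's fraction of the sum of all entries."""
    grand_total = sum(totals.values())
    return {name: hours / grand_total for name, hours in totals.items()}


if __name__ == "__main__":
    totals = region_totals(HOURS_BY_REGION)
    by_share = sorted(shares(totals).items(), key=lambda kv: kv[1], reverse=True)
    # Print regions from largest to smallest share of the listed hours.
    for region, share in by_share:
        print(f"{region:<28} {totals[region]:>10,} hrs  {share:6.1%}")
```

The same two helpers work at the language level: pass a flat {language: hours} dictionary to shares to see how the listed hours are distributed across individual languages.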

Accent & Dialect Diversity

One language is never one voice

These calls come from platforms serving customers in 200+ countries. Each language arrives in the full range of regional accents and dialects it is actually spoken in, not a single studio-clean variety. Representative coverage below.

English

General American, Southern US, New York, Midwestern, British RP, Scottish, Irish, Indian, Pakistani, Nigerian, Ghanaian, Kenyan, South African, Filipino, Singaporean, Caribbean, Australian, Canadian

Spanish

Mexican, Caribbean, Central American, Colombian, Andean, Chilean, Rioplatense, Castilian, Canarian, US bilingual

French

Metropolitan, West African, Central African, Maghrebi, Canadian, Belgian, Swiss, Caribbean

Arabic

Egyptian, Levantine, Gulf, Iraqi, Maghrebi, Sudanese, Modern Standard

Portuguese

Brazilian, European, Angolan, Mozambican

German

Standard German, Austrian, Swiss, Northern, Bavarian-influenced

Chinese

Mainland Mandarin, Taiwanese Mandarin, Cantonese-accented, Overseas communities

Hindi & Urdu

Delhi, Mumbai, Punjabi-influenced, Karachi, Lahori, Diaspora

Topic Distribution

Subject matter without a center of gravity

Every call carries routing categories, topic labels, and human annotations. The conversations run from international transfers to estate planning to firmware updates, each with its own vocabulary, stakes, and emotional register.

Money & Payments

International transfers, Payment failures, Refunds, Chargebacks, Fees & exchange rates, Billing disputes, Receipts & records

Identity & Security

Identity verification, Fraud reports, Account recovery, Suspicious activity, Compliance holds, Document review, Password resets

Legal & Business

Business formation, Trademarks & IP, Estate planning, Wills & trusts, Tax questions, Contracts, Annual filings, Registered agent

Devices & Technical

Troubleshooting, Setup & pairing, Firmware updates, App support, Compatibility, Repairs, Accessories

Orders & Logistics

Shipping & delivery, Returns & exchanges, Warranty claims, Order changes, Lost packages, Customs & international

Accounts & Relationships

Onboarding, Plan changes, Subscriptions & renewals, Cancellations, Escalations, Complaints, Retention & win-back, Profile updates

Most speech corpora go deep on one subject or wide on languages. This one does both: twenty-nine languages, more than sixty regional accents and dialects, and subject matter that runs the full range of why people pick up the phone.

©️ 2026 Sumo AI
