Combined Corpus · Language & Topic Distribution
68 million conversations.
Twenty-nine languages.
Over 8 million hours of real customer calls, spoken in the accents of every region where business is done, covering nearly everything people call about.
Language Coverage
Spoken the way the world speaks
Twenty-nine languages spanning every major region, each delivered at training-relevant volume. This is not a token sample of each: every language listed with hours below ships with more than 1,000 hours of audio.
Global English
Spoken in every market served
Iberia & Latin America
4 languages
Spanish: 1,374,309 hrs
Portuguese: 116,003 hrs
Haitian Creole: 30,022 hrs
Catalan: 4,214 hrs
Western & Central Europe
4 languages
French: 352,303 hrs
German: 304,288 hrs
Italian: 171,776 hrs
Dutch: 51,348 hrs
Northern Europe
3 languages
Swedish: 54,929 hrs
Norwegian: 17,248 hrs
Danish: 12,966 hrs
Eastern & Southern Europe
5 languages
Romanian: 35,750 hrs
Polish: 13,990 hrs
Russian: 12,091 hrs
Bulgarian: 1,725 hrs
Greek: 1,614 hrs
Middle East & North Africa
2 languages
Turkish: 119,255 hrs
Arabic: 93,745 hrs
South Asia
3 languages
Urdu: 23,095 hrs
Hindi: 12,521 hrs
Bengali: 2,328 hrs
East Asia
3 languages
Chinese: 53,390 hrs
Japanese: 19,942 hrs
Korean: 4,494 hrs
Southeast Asia
4 languages
Indonesian: 22,996 hrs
Tagalog: 15,920 hrs
Thai: 5,158 hrs
Vietnamese: 1,347 hrs
Exact hours shown per language. Distributions current as of June 2026.
Accent & Dialect Diversity
One language is never one voice
These calls come from platforms serving customers in 200+ countries. Each language arrives in the full range of regional accents and dialects it is actually spoken in, not a single studio-clean variety. Representative coverage below.
English
General American, Southern US, New York, Midwestern, British RP, Scottish, Irish, Indian, Pakistani, Nigerian, Ghanaian, Kenyan, South African, Filipino, Singaporean, Caribbean, Australian, Canadian
Spanish
Mexican, Caribbean, Central American, Colombian, Andean, Chilean, Rioplatense, Castilian, Canarian, US bilingual
French
Metropolitan, West African, Central African, Maghrebi, Canadian, Belgian, Swiss, Caribbean
Arabic
Egyptian, Levantine, Gulf, Iraqi, Maghrebi, Sudanese, Modern Standard
Portuguese
Brazilian, European, Angolan, Mozambican
German
Standard German, Austrian, Swiss, Northern, Bavarian-influenced
Chinese
Mainland Mandarin, Taiwanese Mandarin, Cantonese-accented, Overseas communities
Hindi & Urdu
Delhi, Mumbai, Punjabi-influenced, Karachi, Lahori, Diaspora
Topic Distribution
Subject matter without a center of gravity
Every call carries routing categories, topic labels, and human annotations. The conversations run from international transfers to estate planning to firmware updates, each with its own vocabulary, stakes, and emotional register.
Money & Payments
International transfers, Payment failures, Refunds, Chargebacks, Fees & exchange rates, Billing disputes, Receipts & records
Identity & Security
Identity verification, Fraud reports, Account recovery, Suspicious activity, Compliance holds, Document review, Password resets
Legal & Business
Business formation, Trademarks & IP, Estate planning, Wills & trusts, Tax questions, Contracts, Annual filings, Registered agent
Devices & Technical
Troubleshooting, Setup & pairing, Firmware updates, App support, Compatibility, Repairs, Accessories
Orders & Logistics
Shipping & delivery, Returns & exchanges, Warranty claims, Order changes, Lost packages, Customs & international
Accounts & Relationships
Onboarding, Plan changes, Subscriptions & renewals, Cancellations, Escalations, Complaints, Retention & win-back, Profile updates
Most speech corpora go deep on one subject or wide on languages. This one does both: twenty-nine languages, more than sixty regional accents and dialects, and subject matter that runs the full range of why people pick up the phone.