Teaching a Machine to Speak Nepali

मेसिनलाई नेपाली बोल्न सिकाउँदा

A from-the-trenches account of fine-tuning Chatterbox v3, a voice-cloning text-to-speech model, for Nepali. The tokenizer bugs, the phantom sentences, the overfitting hunt, and the audio that came out the other side.

Base: Chatterbox v3 (Hindi) · Method: LoRA · rank 32 · Data: 4,138 clips · Lang: ne · नेपाली

01 · The mission — one model, any voice, in Nepali

Give it a few seconds of someone’s voice and a line of Nepali text, and it speaks that line in their voice. No per-speaker training. That is the goal.

Chatterbox is an open multilingual TTS model from Resemble AI. The v3 family ships a dedicated Hindi checkpoint, and Hindi shares the Devanagari script and most of its phoneme inventory with Nepali. That overlap is the whole bet. Rather than train a Nepali model from scratch, we graft Nepali onto the Hindi model with a small, cheap LoRA adapter and a corpus of roughly 4,000 Nepali recordings.

It sounds clean on paper. In practice, the path was a chain of subtle bugs, each one invisible until you actually listened to the output. This is that story.

2,455 tokenizer vocab · 4,138 training clips · 210 LoRA layers · 24 kHz output audio

02 · Under the hood — how the sound gets made

Text does not become audio in one leap. It flows through a pipeline, and every bug in this post lives at one of these boundaries: text not tokenizing right, tokens not conditioning right, audio not stitching right.

Nepali text (देवनागरी): नमस्ते, तपाईंलाई…
→ MTLTokenizer: adds the [ne] token · NFC · decimals · graphemes
→ T3 transformer: 30 layers · LoRA · text → speech tokens
→ S3Gen: flow + vocoder · tokens → waveform
→ Audio: 24 kHz · 🔊 cloned voice

We only fine-tune the T3 transformer, and even then only a LoRA adapter injected into its attention and MLP projections. The vocoder (S3Gen) and voice encoder stay frozen. That keeps the trainable footprint tiny: 210 adapted projections (30 layers × 7 target modules), each a pair of small low-rank matrices, instead of a two-billion-parameter rewrite.


TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
LORA_RANK    = 32
LORA_ALPHA   = 64
LORA_DROPOUT = 0.05
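To make the footprint concrete, here is a rough sketch of the arithmetic behind the adapter. The layer count, module list, rank and alpha come from this post. The hidden and MLP widths are assumptions for illustration only, as is the assumption that every attention projection is full width, so treat the parameter total as an order-of-magnitude estimate rather than a measured figure.

```python
# Back-of-envelope count of LoRA adapters and their trainable parameters.
TARGET_MODULES = ["q_proj", "v_proj", "k_proj", "o_proj",
                  "gate_proj", "up_proj", "down_proj"]
NUM_LAYERS = 30       # from the pipeline description
LORA_RANK = 32
LORA_ALPHA = 64

HIDDEN = 1024         # ASSUMED hidden size, not confirmed by the post
INTERMEDIATE = 4096   # ASSUMED MLP width, not confirmed by the post

# (in_features, out_features) for each projection under the assumed sizes.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * in_features + out_features * rank


adapted_modules = NUM_LAYERS * len(TARGET_MODULES)
trainable_params = NUM_LAYERS * sum(
    lora_param_count(*SHAPES[name], LORA_RANK) for name in TARGET_MODULES
)
scaling = LORA_ALPHA / LORA_RANK  # factor applied to the low-rank update

print(adapted_modules)   # 210
print(scaling)           # 2.0
print(trainable_params)  # depends entirely on the assumed widths
```

The 210 figure falls straight out of 30 layers times 7 modules. The parameter total moves with the assumed widths, which is why it is printed without a claimed value.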

03 · The first bug — the tokenizer that spoke in fragments

Every generation starts by telling the model what language it is speaking, using a single special token like [hi] for Hindi. So we passed [ne] for Nepali. Reasonable. Except [ne] was never in the vocabulary.

The bug — [ne] shattered into three tokens. The tokenizer did not recognise [ne] as a unit. It fell back to byte-level splitting and produced three tokens: [, ne, ] — none of which mean “this is Nepali.” The model was being handed punctuation soup as its language signal.

The fix was to register [ne] as a proper single special token in the tokenizer, mirroring how [hi] (ID 722) was defined. It landed at ID 2111:


# before: [ne] -> [1706, 1712, 1720]  (garbage)
# after:  [ne] -> [2111]              (one clean token)
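The mechanism is easier to see in a toy splitter. The sketch below is not the MTLTokenizer API. It is a hypothetical stand-in (make_splitter and its word-or-symbol fallback are invented for illustration) showing why an unregistered tag falls apart and a registered one survives as a single piece.

```python
import re


def make_splitter(special_tokens):
    """Build a toy splitter that keeps registered special tokens whole."""
    specials = set(special_tokens)
    pattern = None
    if specials:
        # Longest first so overlapping tags resolve predictably.
        ordered = sorted(specials, key=len, reverse=True)
        pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")

    def fallback(text):
        # Toy fallback: runs of word characters, else single symbols.
        return re.findall(r"\w+|\S", text)

    def split(text):
        if pattern is None:
            return fallback(text)
        pieces = []
        for part in pattern.split(text):
            if not part:
                continue
            if part in specials:
                pieces.append(part)           # registered tag stays one piece
            else:
                pieces.extend(fallback(part))
        return pieces

    return split


before = make_splitter(["[hi]"])
after = make_splitter(["[hi]", "[ne]"])

print(before("[ne]hello"))  # ['[', 'ne', ']', 'hello']
print(after("[ne]hello"))   # ['[ne]', 'hello']
```

The real tokenizer's fallback is a learned subword model rather than a regex, but the failure has the same shape: anything not matched as a special token first is handed to the general-purpose splitter.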

But fixing the language tag uncovered a deeper problem in the preprocessing.

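The bug — NFKD normalization was quietly mangling Devanagari. The pipeline ran Unicode NFKD normalization on all text. For Latin scripts that is mostly harmless. For Devanagari it is destructive: NFKD is a decomposing form, so it breaks precomposed characters into a base letter plus combining marks (ऩ becomes न followed by a nukta) and applies compatibility mappings. The code-point sequence reaching the tokenizer can then differ from the text as written, including in the marks that distinguish one Nepali word from another. The model was training on subtly corrupted text.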

The fix — skip NFKD for Nepali, use NFC instead. We special-cased the ne path to use NFC (which composes marks correctly) and added a Nepali normalizer for the things that actually needed fixing: decimal expansion, and standardizing inconsistent nasal and vowel-length spellings across the corpus.


import re
import unicodedata

def nepali_normalize(text: str) -> str:
    text = unicodedata.normalize("NFC", text)        # NOT NFKD
    # Expand decimals written in ASCII or Devanagari digits.
    text = re.sub(r"([0-9०-९]+)\.([0-9०-९]+)",
                  r"\1 दशमलव \2", text)                # 12.5 -> 12 दशमलव 5
    # standardize anusvara/chandrabindu + vowel length (not shown here)
    return text
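The difference between the two normalization forms can be checked directly with Python's unicodedata module. This is a small sketch using one precomposed Devanagari letter, ऩ (U+0929), chosen only because it shows the mechanism cleanly. Whether the training corpus contains this particular character is not something the post states.

```python
import unicodedata


def codepoints(text: str) -> list:
    """Render each character as a U+XXXX label for easy comparison."""
    return [f"U+{ord(ch):04X}" for ch in text]


precomposed = "\u0929"  # DEVANAGARI LETTER NNNA, a single code point

decomposed = unicodedata.normalize("NFKD", precomposed)
recomposed = unicodedata.normalize("NFC", decomposed)

print(codepoints(precomposed))  # ['U+0929']
print(codepoints(decomposed))   # ['U+0928', 'U+093C']  base letter + nukta
print(codepoints(recomposed))   # ['U+0929']
```

The text renders identically in all three cases, which is what makes this class of bug hard to spot by eye. Only the code points, and therefore the token IDs, change.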

We verified every fix before training: [ne] is one token, decimals expand, chandrabindu survives.


[ne] ID: 2111
sample encode (25 tokens): [1706, 1712, 1720, 1740, 1702, 1734, 7, 2]...
decimal test decode: '१२ दशमलव ५ किलो'
chandrabindu test (8 tokens): OK
Tokenizer: all checks passed

04 · The phantom sentence — the danda, the decimal, and a sentence that was not there

TTS models degrade on long inputs, so long paragraphs get split into chunks, synthesized separately, and stitched back together. Nepali ends sentences with a danda, ।, not a period. Simple plan: replace । with . and split there.

The bug — it invented a sentence in the middle of the audio. Random clauses were being read as if they were their own sentences, with wrong intonation and phantom pauses. The naive “replace and split on every dot” rule was also splitting on dots that were never sentence boundaries: decimal points and abbreviations. १२.५ became “twelve. five.”

The fix — expand decimals first, then split on a sentinel. Two-step fix. First expand decimals to words (१२.५ becomes १२ दशमलव ५) so no numeric dots survive. Then, instead of converting । to . and losing the distinction, swap । for a private sentinel character, split on the sentinel plus ? and !, and only then restore real dots. Pre-existing dots are never touched.


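import re

_DANDA_SENTINEL = '\x00'  # private marker, assumed never to appear in input text

def split_sentences(text: str) -> list[str]:
    """Split on danda, ? and !, leaving pre-existing dots untouched."""
    text = text.replace('।', _DANDA_SENTINEL)                       # danda to sentinel
    # Zero-width lookbehind keeps each terminator attached to its sentence
    # (splitting on a possibly empty match needs Python 3.7+).
    parts = re.split(rf'(?<=[{_DANDA_SENTINEL}?!])\s*', text)
    chunks = [p.strip() for p in parts if p.strip()]
    return [c.replace(_DANDA_SENTINEL, '.') for c in chunks]        # restore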
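To see why the order of operations matters, here is a minimal reproduction of the original failure. The sample sentence is made up, and naive_split and expand_decimals are hypothetical helper names. The decimal regex is the same one used in the normalizer earlier in this post.

```python
import re


def naive_split(text: str) -> list:
    """The original plan: turn every danda into a dot, then split on dots."""
    return [p.strip() for p in text.replace("।", ".").split(".") if p.strip()]


def expand_decimals(text: str) -> str:
    """Spell out the decimal point so no numeric dot survives."""
    return re.sub(r"([0-9०-९]+)\.([0-9०-९]+)", r"\1 दशमलव \2", text)


sample = "मूल्य १२.५ रुपैयाँ हो। यो सस्तो छ।"

print(naive_split(sample))
# ['मूल्य १२', '५ रुपैयाँ हो', 'यो सस्तो छ']   <- phantom sentence

print(naive_split(expand_decimals(sample)))
# ['मूल्य १२ दशमलव ५ रुपैयाँ हो', 'यो सस्तो छ']
```

Expanding decimals removes the numeric dots, but abbreviations and other legitimate dots would still trip the naive splitter, which is the reason for splitting on a sentinel instead.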

Sentences still longer than about 40 seconds get sub-chunked at commas and pause markers, and any fragment below a minimum length gets merged back into its neighbour, because the model repeats and hallucinates when handed a two-word stub with no context.
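The merge step can be sketched in a few lines. This is an illustration rather than the shipped code: merge_short_chunks is a hypothetical name, and the 20-character floor is an assumed placeholder, since the post does not state the actual minimum length.

```python
def merge_short_chunks(chunks, min_chars=20):
    """Fold fragments shorter than min_chars into a neighbouring chunk.

    min_chars is an assumed threshold for illustration only.
    """
    merged = []
    for chunk in chunks:
        previous_is_short = bool(merged) and len(merged[-1]) < min_chars
        if merged and (len(chunk) < min_chars or previous_is_short):
            # Either this fragment is a stub, or the one before it still is:
            # join them so the model never sees a stub on its own.
            merged[-1] = f"{merged[-1]} {chunk}"
        else:
            merged.append(chunk)
    return merged
```

A short fragment in the middle or at the end is appended to the chunk before it. A short fragment at the very start has no predecessor, so the following chunk is folded into it instead.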

The bug — noise between chunks. Each chunk carried a tail of low-level noise the model emitted after finishing speech. Concatenated, these tails added up to long dead-air gaps. Fix: an RMS-based trailing-silence trimmer (−40 dB threshold, scans backwards, keeps a 60 ms natural decay) runs on every chunk before stitching.
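A trimmer along those lines might look like the sketch below. It is written in plain Python for clarity and rests on assumptions the post does not spell out: samples are floats in the range -1 to 1, the threshold is measured against full scale, and RMS is computed over 10 ms frames. The function name and the frame size are hypothetical.

```python
import math


def trim_trailing_silence(samples, sample_rate, threshold_db=-40.0,
                          keep_ms=60, frame_ms=10):
    """Drop the quiet tail of a chunk, keeping keep_ms after the last loud frame.

    Assumes float samples in [-1, 1]; frame_ms is an illustrative choice.
    """
    frame = max(1, int(sample_rate * frame_ms / 1000))
    threshold = 10 ** (threshold_db / 20)  # dB to linear amplitude
    end = len(samples)

    # Scan backwards one frame at a time until a frame exceeds the threshold.
    while end > 0:
        start = max(0, end - frame)
        window = samples[start:end]
        rms = math.sqrt(sum(x * x for x in window) / len(window))
        if rms > threshold:
            break
        end = start

    keep = int(sample_rate * keep_ms / 1000)  # short natural decay
    return samples[:min(len(samples), end + keep)]
```

In a real pipeline the same logic would normally run on a NumPy array rather than a list. Note that a chunk that is silent throughout comes back as just the keep_ms decay window.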

05 · The smaller demons — the seed that never seeded

A reproducibility bug hiding in plain sight. The seed defaulted to 0, and the guard was:


if args.seed:          # 0 is falsy in Python, so set_seed() never runs
    set_seed(args.seed)

Gotcha — 0 is falsy. With the default seed of 0, if args.seed: evaluates to False, so the seed was never actually set. Every run was non-deterministic, and “set seed 0” silently did nothing. The default changed to 42 and the guard now checks explicitly. Small bug, hours of “why is this different every time.”
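A sketch of the corrected guard, under stated assumptions: the argument parsing is reconstructed with argparse, apply_seed is a hypothetical wrapper, and set_seed here only seeds Python's own random module. A real training script would seed NumPy and torch as well.

```python
import argparse
import random


def set_seed(seed: int) -> None:
    """Seed Python's RNG (stand-in for the project's fuller set_seed)."""
    random.seed(seed)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)


def apply_seed(args) -> bool:
    """Seed whenever a value is present, including an explicit 0."""
    if args.seed is not None:   # compare against None, not truthiness
        set_seed(args.seed)
        return True
    return False
```

With this guard, passing --seed 0 seeds the run like any other value, and the only way to skip seeding is for the seed to be None.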

06 · The key insight — do not start Nepali from noise. Start it from Hindi.

After adding [ne] to the tokenizer, its embedding row existed in the model, but it was random noise. We measured it against the trained [hi] embedding:


[ne] embedding norm:        0.385   # random init
[hi] embedding norm:        0.553   # trained, meaningful
[ne] vs [hi] cosine sim:    0.058   # essentially orthogonal
[ne] vs random token sim:   0.041   # it IS just noise

Insight — warm-start the [ne] embedding from [hi]. Nepali and Hindi share Devanagari and most of their phonemes. The [hi] vector already encodes “this is a Devanagari language,” a far better starting point than noise. So before training, we copy [hi]’s embedding into the [ne] slot. The model still learns a distinct Nepali representation (the embedding stays trainable), it just converges from a sensible prior instead of from scratch.


import torch
from torch.optim import AdamW

# model, lora_params and args come from the surrounding training script
NE_ID, HI_ID = 2111, 722
with torch.no_grad():
    model.t3.text_emb.weight.data[NE_ID] = \
        model.t3.text_emb.weight.data[HI_ID].clone()   # warm start

# keep the embedding table trainable at a gentler LR
model.t3.text_emb.weight.requires_grad_(True)
optimizer = AdamW([
    {"params": lora_params},
    {"params": [model.t3.text_emb.weight], "lr": args.lr * 0.1},
], lr=args.lr)

It is the language-model equivalent of teaching Portuguese to a Spanish speaker. You do not hand them a blank slate; you hand them everything they already know that transfers.
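The before-and-after check on the [ne] row comes down to a norm and a cosine similarity. Here is a dependency-free sketch of that check on a toy table. The four-dimensional vectors are made-up values for illustration, not the model's embeddings, and the helper names are hypothetical.

```python
import math


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def warm_start(table, dst, src):
    """Copy the source row into the destination slot (a copy, not an alias)."""
    table[dst] = list(table[src])


# Toy rows with invented values; real embedding rows are far wider.
table = {
    "hi": [0.3, -0.2, 0.4, 0.1],
    "ne": [-0.1, 0.3, 0.05, -0.2],
}

before = cosine(table["ne"], table["hi"])
warm_start(table, "ne", "hi")
after = cosine(table["ne"], table["hi"])

print(round(after, 6))  # 1.0
```

Because the row is copied rather than shared, later updates to the [ne] slot leave [hi] untouched, which is what lets the two representations drift apart during training.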

07 · The investigation — why was it overfitting so fast?

An earlier run seemed to overfit almost immediately. The instinct was “too little data,” but the real culprit was the validation set itself.

Finding — the val set was the first half of the dataset. The validation split was not random. It was a contiguous slice (sentences 25 to 1983) covering only the first half of the corpus. A val set that is not representative gives a loss signal you cannot trust. What looked like overfitting was partly distribution mismatch between a biased val slice and the training data.

The fix — pool everything, split randomly. We now pool all 4,138 clips, shuffle with a fixed seed, and carve a random 90/10 split. The val curve below is finally trustworthy, and the train/val gap is roughly half what it was before.
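A pooled, seeded split of this kind takes only a few lines. The sketch below uses integer clip IDs as stand-ins for the real manifest. The seed value of 42 is an assumption, since the post says only that the seed is fixed, and split_clips is a hypothetical name.

```python
import random


def split_clips(clip_ids, val_fraction=0.1, seed=42):
    """Shuffle all clips with a fixed seed, then carve off the validation share."""
    ids = list(clip_ids)
    random.Random(seed).shuffle(ids)   # local RNG, so global state is untouched
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]    # (train, val)


train_ids, val_ids = split_clips(range(4138))
print(len(train_ids), len(val_ids))    # 3724 414
```

Using a dedicated random.Random instance means the split stays identical across runs regardless of what else in the script consumes random numbers.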

Chart: train loss and val loss by epoch (image not included)

The picture is clean. Validation loss falls steadily and bottoms out around epoch 10 (val 3.2016), then begins a slow drift upward while training loss keeps falling. That divergence is the textbook overfitting signal, and it tells us exactly which checkpoint to keep. Epoch 10 is the model.
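Picking the checkpoint can be automated instead of read off the chart. A minimal sketch, assuming validation losses are logged as a mapping from epoch number to loss. The function names and the patience rule are illustrative, not the project's actual training loop.

```python
def best_epoch(val_losses):
    """Return (epoch, loss) for the lowest validation loss.

    val_losses maps epoch number -> validation loss.
    Ties go to whichever epoch was logged first.
    """
    epoch = min(val_losses, key=val_losses.get)
    return epoch, val_losses[epoch]


def should_stop(val_losses, patience=3):
    """True once `patience` epochs have passed without a new best."""
    epoch, _ = best_epoch(val_losses)
    return max(val_losses) - epoch >= patience
```

Called after every epoch, should_stop ends the run a few epochs past the minimum, and best_epoch names the checkpoint to keep.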

08 · Listen for yourself — the audio

Every bug above was found by ear. Here is what each stage actually sounds like. Start with the reference, the voice we clone from a single short clip.

input: Reference voice

🔊 audio sample — transcript below (clip not included)

Spoken text

राजपुत्रले सभा विसर्जत गरे। त्यहाँ एकत्रित नरनारीहरू हृदयमा भविष्यका प्रति

The model we shipped

The epoch-10 checkpoint: best validation loss, cleanest output.

ship it: Epoch 10 · best checkpoint

🔊 audio sample — transcript below (clip not included)

Spoken text

यस्तै, २०८३ वैशाखमा मात्र २ खर्ब ५७ अर्ब ४९ करोड रुपैयाँ रेमिट्यान्स भित्रिएको छ । यो रकम मासिक रूपमा हालसम्मकै बढी हो । यसअघि चैतमा २ खर्ब ९ अर्ब रुपैयाँ रेमिट्यान्स भित्रिएको थियो ।

overfit: Epoch 15 · five epochs past best

🔊 audio sample — transcript below (clip not included)

Spoken text

Long-form, chunked and stitched

A full multi-paragraph article run through the sentence splitter, silence trimmer and stitcher: the payoff of the chunking work in section 04.

chunked: Multi-paragraph article

🔊 audio sample — transcript below (clip not included)

Spoken text

यस्तै, २०८३ वैशाखमा मात्र २ खर्ब ५७ अर्ब ४९ करोड रुपैयाँ रेमिट्यान्स भित्रिएको छ । यो रकम मासिक रूपमा हालसम्मकै बढी हो । वैशाखमा अमेरिकी डलरमा रेमिट्यान्स आप्रवाह ३३ प्रतिशतले वृद्धि भई १३ अर्ब २६ करोड पुगेको छ ।

09 · The reference trap — a longer reference clip makes it worse, not better

It is tempting to feed the cloner a long sample of the target voice, on the theory that more audio means a better clone. The opposite is true. The model only looks at the beginning of the reference and discards the rest:


ENC_COND_LEN = 6  * S3_SR      # T3 speech-token prompt: first 6 seconds
DEC_COND_LEN = 10 * S3GEN_SR   # S3Gen decoder reference: first 10 seconds

Hand it a 30-second clip and roughly 20 seconds are thrown away. Worse, if those first few seconds happen to hold a breath, a pause, or a non-representative phrase, that is the entire basis for the cloned voice. A long clip also tends to start with lead-in audio that is the least clean part of the recording.
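The arithmetic is worth spelling out. In the sketch below the two sample rates are assumptions (16 kHz for the T3 prompt path, 24 kHz for S3Gen), and used_and_discarded is a hypothetical helper. The rates cancel in the result, so the conclusion does not depend on them being exact.

```python
S3_SR = 16_000      # ASSUMED sample rate for the T3 speech-token prompt
S3GEN_SR = 24_000   # ASSUMED sample rate for the S3Gen decoder reference

ENC_COND_LEN = 6 * S3_SR       # first 6 seconds
DEC_COND_LEN = 10 * S3GEN_SR   # first 10 seconds


def used_and_discarded(clip_seconds, cond_len, sample_rate):
    """Return (seconds used, seconds discarded) for a reference clip."""
    total = int(clip_seconds * sample_rate)
    used = min(total, cond_len)    # only the head of the clip is read
    return used / sample_rate, (total - used) / sample_rate


print(used_and_discarded(30, ENC_COND_LEN, S3_SR))      # (6.0, 24.0)
print(used_and_discarded(30, DEC_COND_LEN, S3GEN_SR))   # (10.0, 20.0)
print(used_and_discarded(8, DEC_COND_LEN, S3GEN_SR))    # (8.0, 0.0)
```

For a 30-second clip, T3 ignores 24 seconds and S3Gen ignores 20. An 8-second clip is used in full by the decoder, with only its last 2 seconds falling outside the T3 prompt window.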

The same text and model, changing only the reference clip:

raw: 31-second reference

🔊 audio sample — transcript below (clip not included)

Spoken text

काठमाडौँ — नेपाल उद्योग वाणिज्य महासंघका एसोसिएट उपाध्यक्ष प्रवलजंग पाण्डेले श्रम बजारका विद्यमान परिवर्तन र चुनौतीहरूको सामना गर्न व्यावहारिक नीतिगत लचकता, उद्यम उत्पादकत्व वृद्धि र सुदृढ त्रिपक्षीय सहकार्यको आवश्यकतामा जोड दिएका छन्।

clean: 8-second reference (trimmed)

🔊 audio sample — transcript below (clip not included)

Spoken text (same as the 31-second sample above)

Finding — short and clean beats long and raw. The fix is a short, continuous, clean clip of the target speaker, roughly 6 to 10 seconds, with no leading silence. Trimming a 31-second reference down to a clean 8-second window noticeably tightened the voice match and cut the background noise, with no change to the model or the text.

One more detail surfaced here. The dateline dash in काठमाडौँ — नेपाल was being voiced as a sound. A dash is a visual separator, not a phoneme, so the normalizer now converts every dash (em, en, and double-hyphen) into a comma-length pause before tokenizing.
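A dash rule like this is a single regular expression. The sketch below is one possible version, with dashes_to_pauses as a hypothetical name. It treats a comma as the "comma-length pause" and deliberately leaves single hyphens alone so compound words survive.

```python
import re

# Em dash (U+2014), en dash (U+2013), or a run of two or more hyphens,
# together with any whitespace around it.
_DASH = re.compile(r"\s*(?:\u2014|\u2013|-{2,})\s*")


def dashes_to_pauses(text: str) -> str:
    """Replace separator dashes with a comma so they are read as a pause."""
    return _DASH.sub(", ", text)


print(dashes_to_pauses("काठमाडौँ \u2014 नेपाल उद्योग वाणिज्य महासंघ"))
# काठमाडौँ, नेपाल उद्योग वाणिज्य महासंघ
```

One edge case to watch: a dash at the very start of a line becomes a leading comma, so a production version would likely strip that afterwards.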

10 · What’s next — the next experiment

The shipped model is fine-tuned from the Hindi base. A second bet is worth testing: the generic 23-language multilingual base. It has seen far more diverse scripts and prosody, so it may generalize better to a new language, even if it is less Devanagari-specialized than the Hindi checkpoint.

The training script is built and waiting. Same LoRA recipe, same tokenizer fixes, same [ne] warm-start from [hi], just a different set of starting weights. Both are Chatterbox v3; only the pretrained T3 differs. A head-to-head is the cleanest way to settle which base transfers better to Nepali.

Finding — the checklist that got us here

✓ [ne] as a single token ✓ NFC, not NFKD ✓ sentinel-based danda splitting ✓ decimal expansion ✓ trailing-silence trim ✓ minimum chunk size ✓ [ne] warm-started from [hi] ✓ random 90/10 val split ✓ keep epoch 10 ✓ short clean reference clip ✓ dashes to pauses.

मेसिनले नेपाली बोल्यो 🗣 — Chatterbox v3 · Nepali LoRA fine-tune · Oshara AI engineering log. Every bug in this post was caught by listening.