Building your own Indic Transliterator
I find transliteration to be one of the most fascinating aspects of Natural Language Processing. It’s simple yet so useful that we likely use it every day without even realizing it. Basically, transliteration is the process of converting text from one script to another by systematically swapping letters in predictable ways.
For example, the Nepali sentence म घर जान्छु transliterates to ma ghar jaanchhu.
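To make "swapping letters" concrete, here's a toy sketch in Python that maps each Devanagari character to a rough Latin equivalent (the mapping is purely illustrative). Its imperfect output also hints at why simple rules alone, as discussed below, fall short.
# A toy character-map transliterator (illustrative mapping, not a real scheme).
CHAR_MAP = {
    "म": "ma", "घ": "gha", "र": "ra", "ज": "ja", "न": "na",
    "छ": "chha", "ा": "aa", "ु": "u", "्": "",
}

def naive_transliterate(text: str) -> str:
    # Characters not in the map (spaces, punctuation) pass through unchanged.
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)

print(naive_transliterate("म घर जान्छु"))
# Prints "ma ghara jaaanachhau" -- recognizable, but not "ma ghar jaanchhu",
# because per-character rules ignore vowel signs, conjuncts, and the inherent vowel.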
While transliteration seems straightforward, the accuracy and efficiency of the process can vary depending on the system being used. Let’s take a look at the current state of Nepali transliteration systems:
State of Existing Nepali Transliteration Systems
- Rule-based Transliteration: While fast, it is highly error-prone, with accuracy often below 50%.
- Google Transliteration: Though more accurate, Google’s API is outdated and slow, with single sentence inference taking 4–5 seconds.
- IndicXlit: Faster and more accurate than rule-based systems, but it struggles with Nepali transliterations where informal language variations are common.
For example, “manxe” and “hunxa” are often used for “मान्छे” (manche) and “हुन्छ” (huncha), or “garnaw” and “padnaw” instead of “garna” (गर्न) and “padna” (पढ्न).
Such inconsistencies exist in other languages as well. The solution? Fine-tune a transliterator to suit your specific needs.
For my Nepali transliteration, I chose IndicXlit, as it’s the most robust option available. IndicXlit provides ample resources in its GitHub repository, along with a workshop to explain the system in detail. In this article, I’ll explain how I fine-tuned it from my perspective. You can find my training code here. If you just want to use the NepaliXlit transliterator, find the app version and the CLI version here.
1. Initial Installations
Start by installing the necessary dependencies, including the IndicNLP library and fairseq.
# Clone the IndicNLP library and resources
!git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
!git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
# Install the necessary libraries
!pip3 install sacremoses pandas mock sacrebleu tensorboardX pyarrow indic-nlp-library xformers triton
# Install fairseq from source
!git clone https://github.com/pytorch/fairseq.git
%cd fairseq
!pip install --editable ./
%cd ..
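If you want to confirm that the installations went through before moving on, a quick sanity check (run in the same Python environment) is:
# Verify that the key dependencies import correctly.
import fairseq
import indicnlp

print("fairseq version:", fairseq.__version__)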
2. Create a folder for fine-tuning
Organize your workspace by creating a dedicated folder for the fine-tuning process.
mkdir Finetuning
cd Finetuning
3. Download the pretrained model
!wget https://github.com/Supriya090/NepaliXlit/releases/download/v1.0/nepalixlit-en-ne.zip
!unzip nepalixlit-en-ne.zip
In the above snippet, I’ve provided a link to the model I fine-tuned using my own dataset. You can further fine-tune it on your own data as needed. If you visit the official IndicXlit GitHub repository, you’ll find pretrained models that support transliteration for 22 Indic languages, given in the snippet below. I started with the following model for my project, and you can easily do the same if you’re looking to fine-tune a model for any of those languages.
# download the IndicXlit models
wget https://github.com/AI4Bharat/IndicXlit/releases/download/v1.0/indicxlit-en-indic-v1.0.zip
unzip indicxlit-en-indic-v1.0.zip
4. Download the Unigram Probability Dictionaries
Next, download unigram probability dictionaries for re-ranking the English-to-Indic model, which helps improve transliteration accuracy.
!wget https://github.com/AI4Bharat/IndicXlit/releases/download/v1.0/word_prob_dicts.zip
!unzip word_prob_dicts.zip
5. Prepare your training corpus
This is the most crucial part of the process, as it determines the quality of the words your model will be able to transliterate. You'll need to build a parallel corpus of source and target words: each line in both files should contain one word with its characters separated by spaces, and line n of the source file must correspond to line n of the target file. A short script for producing this format is shown after the examples below.
For example, if you’re creating an English-to-Nepali transliterator, your source corpus might look like this:
m a n d i n a
s a m v a n d h i
a a u n x u
k a s x a
a a m a
Your target corpus might look like this:
म ा न ् द ि न
स म ् व न ् ध ी
आ उ ँ छ ु
क स ् छ
आ म ा
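If your raw data is a plain list of word pairs, the snippet below is a minimal sketch for producing this character-spaced format. It assumes a hypothetical tab-separated file word_pairs.tsv with one romanized–Devanagari pair per line; Python iterates over Unicode code points, so matras and the virama become separate tokens, exactly as in the examples above.
# Convert tab-separated word pairs into the space-separated character format.
with open("word_pairs.tsv", encoding="utf-8") as f, \
     open("corpus_ne.en", "w", encoding="utf-8") as src_out, \
     open("corpus_ne.ne", "w", encoding="utf-8") as tgt_out:
    for line in f:
        if not line.strip():
            continue
        roman, devanagari = line.strip().split("\t")
        src_out.write(" ".join(roman) + "\n")
        tgt_out.write(" ".join(devanagari) + "\n")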
Once your corpus is ready, create a folder named corpus and split your data into training, validation, and testing sets. I divided my data in a 70:15:15 ratio. It's important to ensure not only that the data is split proportionally, but also that the type and quality of the data are evenly distributed across these sets.
mkdir corpus
The files should be arranged in the folder structure like this:
# corpus/
# ├── train_ne.en
# ├── train_ne.ne
# ├── valid_ne.en
# ├── valid_ne.ne
# ├── test_ne.en
# └── test_ne.ne
Even with a small amount of data, you can achieve satisfactory results. For my project, I fine-tuned the model using 3,679 parallel word pairs, which I divided into training, testing, and validation sets.
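Here is a minimal sketch of such a split, assuming the character-spaced files from the earlier sketch (corpus_ne.en and corpus_ne.ne) and the corpus/ layout shown above:
# Shuffle aligned word pairs and write a 70:15:15 train/valid/test split.
import os
import random

with open("corpus_ne.en", encoding="utf-8") as f:
    src = f.read().splitlines()
with open("corpus_ne.ne", encoding="utf-8") as f:
    tgt = f.read().splitlines()

pairs = list(zip(src, tgt))
random.seed(42)  # reproducible split
random.shuffle(pairs)

n = len(pairs)
splits = {
    "train": pairs[: int(0.70 * n)],
    "valid": pairs[int(0.70 * n): int(0.85 * n)],
    "test": pairs[int(0.85 * n):],
}

os.makedirs("corpus", exist_ok=True)
for name, subset in splits.items():
    with open(f"corpus/{name}_ne.en", "w", encoding="utf-8") as f_en, \
         open(f"corpus/{name}_ne.ne", "w", encoding="utf-8") as f_ne:
        for s, t in subset:
            f_en.write(s + "\n")
            f_ne.write(t + "\n")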
6. Binarize the Corpus
Binarize the corpus with fairseq-preprocess so that fairseq can train on it:
CUDA_VISIBLE_DEVICES="1" fairseq-preprocess \
--trainpref corpus/train_ne --validpref corpus/valid_ne --testpref corpus/test_ne \
--srcdict corpus-bin/dict.en.txt \
--tgtdict corpus-bin/dict.ne.txt \
--source-lang en --target-lang ne \
--destdir corpus-bin
I performed the fine-tuning on a GPU, using CUDA_VISIBLE_DEVICES="1". If you're training on a CPU, you can remove this part.
7. Check lang_list.txt
Make sure you have a file named lang_list.txt that includes all the languages supported by IndicXlit. If you've followed the previous steps, this file will already be part of the cloned repository. If not, you can create the file and add the necessary contents from the repository here.
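A quick way to sanity-check the file (assuming it lists one language code per line):
# Check that Nepali ("ne") appears in lang_list.txt.
with open("lang_list.txt", encoding="utf-8") as f:
    codes = {line.strip() for line in f if line.strip()}

print("ne" in codes)  # should print True before you start fine-tuning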
8. Start Fine-tuning
You can modify the hyperparameters to suit your system’s specifications. Refer to the fairseq documentation to explore the available options for customizing your training.
CUDA_VISIBLE_DEVICES="1" fairseq-train corpus-bin \
--save-dir transformer \
--arch transformer --layernorm-embedding \
--task translation_multi_simple_epoch \
--sampling-method "temperature" \
--sampling-temperature 1.5 \
--encoder-langtok "tgt" \
--lang-dict lang_list.txt \
--lang-pairs en-ne \
--decoder-normalize-before --encoder-normalize-before \
--activation-fn gelu --adam-betas "(0.9, 0.98)" \
--batch-size 16 \
--decoder-attention-heads 4 --decoder-embed-dim 256 --decoder-ffn-embed-dim 1024 --decoder-layers 6 \
--dropout 0.5 \
--encoder-attention-heads 4 --encoder-embed-dim 256 --encoder-ffn-embed-dim 1024 --encoder-layers 6 \
--lr 0.001 --lr-scheduler inverse_sqrt \
--max-epoch 3 \
--optimizer adam \
--num-workers 0 \
--warmup-init-lr 0 --warmup-updates 4000 \
--keep-last-epochs 2 \
--patience 5 \
--restore-file transformer/checkpoint_best.pt \
--reset-lr-scheduler \
--reset-meters \
--reset-dataloader \
--reset-optimizer
9. Generate Results
Once fine-tuning is complete, you can generate the transliterated outputs as follows (create the output folder first, since the command writes into it):
CUDA_VISIBLE_DEVICES="1" fairseq-generate corpus-bin \
--path transformer/checkpoint_best.pt \
--task translation_multi_simple_epoch \
--gen-subset test \
--beam 4 \
--nbest 4 \
--source-lang en \
--target-lang ne \
--batch-size 16 \
--encoder-langtok "tgt" \
--lang-dict lang_list.txt \
--num-workers 0 \
--lang-pairs en-ne > output/en_ne.txt
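fairseq-generate interleaves source, reference, and hypothesis lines in the same file. The snippet below is a minimal sketch for collecting the predictions, assuming fairseq's usual output format where each hypothesis is a line of the form H-<id><tab><score><tab><space-separated tokens>:
# Collect the n-best predictions per test sample from the generated output.
predictions = {}
with open("output/en_ne.txt", encoding="utf-8") as f:
    for line in f:
        if line.startswith("H-"):
            tag, _score, tokens = line.rstrip("\n").split("\t")
            sample_id = int(tag[2:])
            # Tokens are single characters, so removing spaces rebuilds the word.
            predictions.setdefault(sample_id, []).append(tokens.replace(" ", ""))

for sample_id in sorted(predictions):
    print(sample_id, predictions[sample_id])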
To run inference on a new batch of samples:
bash interactive.sh ne 'source/source.txt' 5 5
python3 generate_result_files_txt.py ne 1
Here, ne represents the target language (Nepali), source/source.txt is the input file, and the final two numbers are the beam size and the number of best results (nbest).
10. Evaluation
Finally, evaluate your model’s performance using metrics such as accuracy, depending on your needs.
lang_abr=ne
python3 evaluate_result_with_rescore_option.py \
-i output/translit_result_$lang_abr.xml \
-t output/translit_test_$lang_abr.xml \
-o output/evaluation_details_$lang_abr.csv \
--acc-matrix-output-file output/matrix_score_$lang_abr.txt \
--correct-predicted-words-file output/correct_predicted_words_$lang_abr.txt \
--wrong-predicted-words-file output/wrong_predicted_words_$lang_abr.txt
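If you just want a quick number outside of the script above, here's a minimal top-1 exact-match accuracy computation, assuming hypothetical plain-text files with one predicted word and one reference word per line, in the same order:
# Word-level top-1 accuracy over two aligned files of predictions and references.
with open("predictions.txt", encoding="utf-8") as f:
    preds = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

correct = sum(p == r for p, r in zip(preds, refs))
print(f"Top-1 accuracy: {correct / len(refs):.2%}")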
It’s highly recommended to run all of this on Google Colab for GPU convenience. Once your model is ready, you can package it into an application. I’ve done this for NepaliXlit, which you can check out here. It includes both an app version adapted from IndicXlit and a CLI version.

You can find all the code and my training dataset here. Everything is adapted from IndicXlit, so be sure to explore their resources and read the fairseq documentation to fine-tune your hyperparameters effectively.
Happy Coding!