Attention Blog: English-to-Hindi Translation

Rushikesh Darge
6 min read · Jul 9, 2022
Image ref: https://videotranslator.ai/news/how-to-get-the-most-out-of-your-ai-english-to-hindi-video-translation/

Content

Approach
Performance metric
Data
Preprocessing
Dataset split
Data Generator
Model
Encoder-Decoder Model
Attention model
Each Block of Attention Layer
Summary
Future Work
Bibliography

Approach

Hindi obviously has a lot more words. It is easy enough to notice the lack of gender specificity in many English words, while Hindi is specific about such things: kanga and kangi (a comb), a non-living thing, still has masculine and feminine names, which English lacks. At the same time, relations within Hindi are highly specific; most relations have a proper name, instead of just attaching "in-law" to everything and calling everyone your aunt and uncle. [ref.]

Hindi is a very rich and complex language. Unlike English, it has a huge number of words, with many variations among them; to capture that whole essence we would need a lot of data and a far more complex model, like GPT and other advanced models that have billions of parameters and are trained on datasets spanning the whole internet.

Instead, we try to build a model on a subset of the dataset and understand how a language model works, how data is preprocessed, and much more. We definitely cannot use this model in production, but we can learn a lot about the workings and architecture of sequence-to-sequence and attention models, just as we learn from the MNIST and Titanic datasets.

Performance metric

As a performance metric we are using the BLEU score. BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. The BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high-quality reference translations. A value of 0 means that the machine-translated output has no overlap with the reference translations (low quality), while a value of 1 means there is perfect overlap (high quality).

The approach works by counting matching n-grams between the candidate translation and the reference text, where a 1-gram (unigram) is each token and a bigram comparison is each word pair. The comparison is made regardless of word order.
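For example, here is a minimal sketch of computing BLEU with NLTK; the sentences are made-up, not from our dataset:

```python
# A minimal BLEU sketch using NLTK; the tokens are illustrative only.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["मैं", "घर", "जा", "रहा", "हूँ"]]]  # one list of references per candidate
candidates = [["मैं", "घर", "जा", "रहा", "हूँ"]]    # tokenized model outputs
score = corpus_bleu(references, candidates,
                    smoothing_function=SmoothingFunction().method1)
print(score)  # 1.0 here, since the candidate matches the reference exactly
```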

Data

For the dataset we are using the IIT Bombay English-Hindi Translation Dataset from Kaggle. It contains 1,561,840 instances of Hindi-English translation, meaning we have roughly 1.5 million datapoints; it is a very large corpus, a CSV file of about 400 MB. When I tried to train on 5 lakh (500,000) datapoints, completing one epoch took more than 35 minutes on a GPU.


To reduce training time we need to compromise on accuracy: while training the model we take only 50k datapoints.

Preprocessing

First we read the dataset CSV file.

After that we check for null values; we have some, so we remove them.
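A minimal sketch of these two steps with pandas; the file name and the column names (english_sentence, hindi_sentence) are assumptions and may differ in the actual Kaggle CSV:

```python
# Load the corpus and drop rows with missing values (column names assumed).
import pandas as pd

df = pd.read_csv("hindi_english_parallel.csv")
print(df.shape)           # expect roughly 1.56M rows
print(df.isnull().sum())  # how many nulls per column
df = df.dropna().reset_index(drop=True)
```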

The data is not clean: lots of punctuation marks, HTML tags, some English words in the Hindi column, and other special characters are present, so we need to remove those and clean the data.
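A rough cleaning sketch with regular expressions (the exact rules are an assumption; the original code may differ):

```python
import re

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"<.*?>", " ", text)        # strip HTML tags
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation/special chars (\w keeps Devanagari)
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace

df["english_sentence"] = df["english_sentence"].apply(clean_text)
df["hindi_sentence"] = df["hindi_sentence"].apply(clean_text)
```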

Now our data looks clean; take a look:

In some places in our dataset we have English text in the Hindi column and vice versa. To check for that we use a fastText pretrained model to identify the language; there are alternatives like spaCy and Google's language detection, but fastText is the faster one.

After loading the fastText model, we pass each row through it and predict its language.
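A sketch using fastText's pretrained lid.176 language-identification model (the filtering logic here is illustrative):

```python
# Language identification with fastText; download lid.176.bin from
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext

lid_model = fasttext.load_model("lid.176.bin")

def detect_language(text):
    labels, _ = lid_model.predict(text)
    return labels[0].replace("__label__", "")  # e.g. "en" or "hi"

df["hindi_lang"] = df["hindi_sentence"].apply(detect_language)
df["english_lang"] = df["english_sentence"].apply(detect_language)
mismatched = (df["hindi_lang"] != "hi") | (df["english_lang"] != "en")
df = df[~mismatched]
```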

We found ~5k datapoints whose languages are in the wrong column; since we have enough data, we simply remove them.

After that we look at the number of words in each sentence.

We find that in both English and Hindi, the 93rd-95th percentile of sentence length is less than or equal to 10 words.
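Something like this is enough to check the percentiles (a sketch, reusing the assumed column names):

```python
import numpy as np

for col in ["english_sentence", "hindi_sentence"]:
    lengths = df[col].str.split().apply(len)
    print(col, {p: np.percentile(lengths, p) for p in (90, 93, 95)})
```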

Dataset split

We split the dataset in an 80:20 ratio.
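For instance, with scikit-learn (a sketch; the original split may have been done differently):

```python
from sklearn.model_selection import train_test_split

train_df, val_df = train_test_split(df, test_size=0.2, random_state=42)
```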

We create a tokenizer for each language.

We convert the text to numbers and then pad the sequences.
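A sketch of both steps with the Keras preprocessing utilities (the maximum length of 10 comes from the percentile analysis above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 10  # chosen from the sentence-length percentiles

eng_tokenizer = Tokenizer(filters="")
eng_tokenizer.fit_on_texts(train_df["english_sentence"])
hin_tokenizer = Tokenizer(filters="")
hin_tokenizer.fit_on_texts(train_df["hindi_sentence"])

def to_padded(tokenizer, texts):
    # Map words to integer ids, then zero-pad every sequence to MAX_LEN.
    return pad_sequences(tokenizer.texts_to_sequences(texts),
                         maxlen=MAX_LEN, padding="post")

eng_train = to_padded(eng_tokenizer, train_df["english_sentence"])
hin_train = to_padded(hin_tokenizer, train_df["hindi_sentence"])
```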

A sample of one sentence:

Data Generator

In the data generator we use yield, a Python keyword that returns from a function without discarding the state of its local variables; when the generator is invoked again, execution resumes right after the last yield statement.
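A minimal generator sketch (argument names are illustrative):

```python
def data_generator(encoder_input, decoder_input, decoder_output, batch_size=64):
    n = len(encoder_input)
    while True:  # Keras expects the generator to loop indefinitely
        for i in range(0, n, batch_size):
            yield ([encoder_input[i:i + batch_size],
                    decoder_input[i:i + batch_size]],
                   decoder_output[i:i + batch_size])
```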

Model

Encoder-Decoder Model

credits: https://medium.com/@kriz17

We got a BLEU score of 0.24 from the encoder-decoder model; now let's see what score attention gives.

Attention model

First we visualize the attention model architecture, then we start the coding part.

There are three attention mechanisms (scoring functions) for computing the context vector; their formulas are sketched after the list:

  1. Dot method
  2. Concat method
  3. General method
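For reference, the three scores (as in Luong et al., 2015, where h_t is the decoder hidden state, h̄_s an encoder output, and W_a, v_a learned weights) can be written as:

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
  h_t^{\top} \bar{h}_s & \text{(dot)} \\
  h_t^{\top} W_a \bar{h}_s & \text{(general)} \\
  v_a^{\top} \tanh\!\left(W_a [h_t; \bar{h}_s]\right) & \text{(concat)}
\end{cases}
\qquad
\alpha_{ts} = \operatorname{softmax}_s\bigl(\mathrm{score}(h_t, \bar{h}_s)\bigr), \quad
c_t = \sum_s \alpha_{ts} \, \bar{h}_s
```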

Each Block of Attention Layer

Encoder

In the call method of the encoder, we define a function that takes an input sequence and the initial states of the encoder. The input sequence is passed through the embedding layer, and the output of the embedding layer is fed into the encoder LSTM. The call function returns all the outputs of the encoder as well as the last time step's hidden and cell states.
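A sketch of such an encoder as a Keras subclassed layer (names and hyperparameters are illustrative, not the exact code from the repo):

```python
import tensorflow as tf

class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(enc_units,
                                         return_sequences=True,
                                         return_state=True)

    def call(self, input_sequence, states):
        # Embed the token ids, then run the LSTM from the given initial states.
        x = self.embedding(input_sequence)
        enc_output, enc_h, enc_c = self.lstm(x, initial_state=states)
        # All time-step outputs plus the last hidden and cell states.
        return enc_output, enc_h, enc_c
```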

Attention

The constructor of the Attention block mainly takes two arguments: scoring_function, which is one of the three mechanisms we have already seen (dot, concat, general), and att_units, the number of units in the Dense layers of the attention block. The call function takes the hidden state of the decoder and all the outputs of the encoder. If it is the dot attention mechanism, we find the score (alpha) by multiplying the encoder outputs with the decoder hidden state.

If it is the concat attention mechanism, we find alpha by passing the encoder outputs and the decoder hidden state through Dense layers, adding them together, and passing the sum through a tanh activation.

If it is the general attention mechanism, we pass the encoder outputs through a Dense layer and then multiply with the decoder hidden state to get alpha.

After getting alpha we pass it through a softmax activation function; the result is the attention weights. If we multiply these weights with the encoder outputs (and sum over the time steps), we get the context_vector.
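Putting the three cases together, an attention block could look roughly like this (a sketch; for the dot and general mechanisms the encoder and decoder unit sizes must match):

```python
class Attention(tf.keras.layers.Layer):
    def __init__(self, scoring_function, att_units):
        super().__init__()
        self.scoring_function = scoring_function
        if scoring_function == "concat":
            self.W_enc = tf.keras.layers.Dense(att_units)
            self.W_dec = tf.keras.layers.Dense(att_units)
            self.V = tf.keras.layers.Dense(1)
        elif scoring_function == "general":
            self.W = tf.keras.layers.Dense(att_units)

    def call(self, dec_hidden, enc_output):
        # dec_hidden: (batch, units) -> (batch, 1, units) so it broadcasts over time.
        dec_hidden = tf.expand_dims(dec_hidden, 1)
        if self.scoring_function == "dot":
            score = tf.reduce_sum(enc_output * dec_hidden, axis=-1, keepdims=True)
        elif self.scoring_function == "general":
            score = tf.reduce_sum(self.W(enc_output) * dec_hidden,
                                  axis=-1, keepdims=True)
        else:  # concat
            score = self.V(tf.nn.tanh(self.W_enc(enc_output) + self.W_dec(dec_hidden)))
        att_weights = tf.nn.softmax(score, axis=1)               # (batch, T, 1)
        context_vector = tf.reduce_sum(att_weights * enc_output, axis=1)
        return context_vector, att_weights
```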

One Step Decoder

The one-step decoder is modified in such a way that it returns the necessary weights. It returns predicted_out, dec_h_state, dec_c_state, att_weights, and context_vector. Let's understand them one by one.

We get att_weights and context_vector by passing the decoder hidden state and encoder_output to the Attention block.

After concatenating the embedding_vector and context_vector, we pass the result to the LSTM layer and get dec_h_state, dec_c_state, and dec_output. Passing dec_output to the dense layer gives us predicted_out.
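In code, roughly (a sketch following the flow above):

```python
class OneStepDecoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, dec_units,
                 scoring_function, att_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.attention = Attention(scoring_function, att_units)
        self.lstm = tf.keras.layers.LSTM(dec_units,
                                         return_sequences=True, return_state=True)
        self.dense = tf.keras.layers.Dense(vocab_size)

    def call(self, dec_input, enc_output, dec_h, dec_c):
        embedding_vector = self.embedding(dec_input)  # (batch, 1, embedding_dim)
        context_vector, att_weights = self.attention(dec_h, enc_output)
        # Concatenate the context vector with the embedded target token.
        lstm_input = tf.concat([tf.expand_dims(context_vector, 1),
                                embedding_vector], axis=-1)
        dec_output, dec_h_state, dec_c_state = self.lstm(
            lstm_input, initial_state=[dec_h, dec_c])
        predicted_out = self.dense(tf.squeeze(dec_output, axis=1))
        return predicted_out, dec_h_state, dec_c_state, att_weights, context_vector
```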

Decoder

The Decoder initializes an empty TensorArray that will store the outputs at each and every time step. We call the one-step decoder for each token in the decoder_input and store its output in the TensorArray.
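A sketch of that loop (assuming a statically known target length):

```python
class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, embedding_dim, dec_units,
                 scoring_function, att_units):
        super().__init__()
        self.one_step_decoder = OneStepDecoder(vocab_size, embedding_dim,
                                               dec_units, scoring_function, att_units)

    def call(self, decoder_input, enc_output, dec_h, dec_c):
        steps = decoder_input.shape[1]
        outputs = tf.TensorArray(tf.float32, size=steps)
        for t in range(steps):
            token = decoder_input[:, t:t + 1]  # one target token per step
            predicted_out, dec_h, dec_c, _, _ = self.one_step_decoder(
                token, enc_output, dec_h, dec_c)
            outputs = outputs.write(t, predicted_out)
        # (steps, batch, vocab) -> (batch, steps, vocab)
        return tf.transpose(outputs.stack(), [1, 0, 2])
```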

Encoder Decoder

Here we combine the overall process, tying together the encoder and the decoder.
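A sketch of the combined model:

```python
class EncoderDecoder(tf.keras.Model):
    def __init__(self, enc_vocab, dec_vocab, embedding_dim, units,
                 scoring_function, att_units):
        super().__init__()
        self.units = units
        self.encoder = Encoder(enc_vocab, embedding_dim, units)
        self.decoder = Decoder(dec_vocab, embedding_dim, units,
                               scoring_function, att_units)

    def call(self, inputs):
        encoder_input, decoder_input = inputs
        # Start the encoder from zero hidden and cell states.
        batch = tf.shape(encoder_input)[0]
        init_state = [tf.zeros((batch, self.units)), tf.zeros((batch, self.units))]
        enc_output, enc_h, enc_c = self.encoder(encoder_input, init_state)
        return self.decoder(decoder_input, enc_output, enc_h, enc_c)
```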

Custom Loss Function

In this loss function we ignore the loss at the padded zeros, i.e. when the target is zero we do not need to worry about what the output is. These padded zeros were added by us during preprocessing to make all sentences equal length.
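A sketch of such a masked loss:

```python
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction="none")

def custom_loss(y_true, y_pred):
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)  # 0 marks padding
    loss = loss_object(y_true, y_pred) * mask            # zero-out padded positions
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)     # average over real tokens
```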

Model Fitting

Then we fit the model variants one by one, each with a different attention mechanism.
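An illustrative training call for one variant; the hyperparameters are assumptions, not the exact values used, and in practice start/end tokens would also be added to the Hindi sentences (omitted here for brevity):

```python
# Teacher forcing: the decoder input is the target shifted right by one step.
hin_train_in, hin_train_out = hin_train[:, :-1], hin_train[:, 1:]

model = EncoderDecoder(enc_vocab=len(eng_tokenizer.word_index) + 1,
                       dec_vocab=len(hin_tokenizer.word_index) + 1,
                       embedding_dim=100, units=256,
                       scoring_function="concat", att_units=256)
model.compile(optimizer="adam", loss=custom_loss)
model.fit(data_generator(eng_train, hin_train_in, hin_train_out, batch_size=64),
          steps_per_epoch=len(eng_train) // 64, epochs=10)
```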

We get the best BLEU score with the concat attention mechanism: 0.31, which is higher than the plain seq2seq model.

Summary

All attention models give better results than the plain seq2seq model. The seq2seq model gives us a 0.24 BLEU score, whereas concat attention gives 0.31, a significant difference.

Future Work

  • Train the model on more data
  • Add more layers and improve the BLEU score

Bibliography

Dataset: https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus
Reference 1: https://www.tensorflow.org/guide/keras/custom_layers_and_models
Reference 2: https://fasttext.cc/

For the full implementation of this case study, refer to this GitHub repo.

Feel free to connect with me on LinkedIn, GitHub, or Kaggle.
