Deep learning is most important when applied to data that is sparse in training data. For example, if we have our standard 25 x 50 dataset and we want to train a neural network to automatically label the objects, we can do that using the Google BERT model.
BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations that delivers state-of-the-art results on a wide range of tasks related to natural language processing (NLP).
NLP operates in a simplistic way by observing and recognizing how we communicate as people, in terms of language, ways to express and chat. That’s where BERT gets a better understanding of the nature of our search terms, in fact. As mentioned, Google BERT considers the meaning of words in a sentence or expression.
We use a new BERT model which is the result of an improvement in the pre-processing code.
In all, our new system proposed a new estimation for the word length for the literature dataset. This term-length estimation is not precise enough for the original task, but it significantly outperformed the original estimates and made use of all of the data we could grab. In this article, we’ll present how we came to this estimate and then explore the insight it offers.
In the original pre-processing code, we randomly select WordPiece tokens to mask.
But they also contain a monotonically increasing number of masked Words.
The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.
This improved BERT algorithm will not use the previous realize algorithm for testing masked words.
The training is identical – we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too ‘easy’ for words that had been split into multiple WordPieces.
See how the BERT system got better?
Note that when we detect more than one word class with an identical OR in the training set, we still propose separate classes and groupings, not merge and overlap. This is so that the resulting classification, when normalized to a scale of 0 to 1, is equal to the original. Also, this is so that a model output produced by merging multiple similar groupings such as the one shown above would be treated as an “identical OR”.
If we train on a specific part of the text (rather than just the whole thing), we can use different types of transformers, and/or add additional text to the translation. If we train on new categories, we can use a deeper BERT encoder and a different pre-trained state representation, or even use another pre-trained model and try to train on a different topic.
Finally, imagine if you want to train on something other than Latin. For example, people in South America have different dialects of Spanish. You would train on that specific subset of language in another network.
If the subject of your project is something you can learn from the Internet, you can always drop the text as background into a Neural Turing Machine (TensorFlow) training dataset.