A pre-trained model is a model that was previously trained on a large dataset and saved for direct use or fine-tuning. In this tutorial, you will learn how you can train BERT (or any other transformer model) from scratch on your custom raw text dataset with the help of the Huggingface transformers library in Python.
Pre-training a transformer is done with self-supervised tasks. Below are two of the most popular tasks used to pre-train BERT:
- Masked Language Modeling (MLM): randomly mask some of the tokens in the input and train the model to predict them. This is the task we'll use in this tutorial.
- Next Sentence Prediction (NSP): given a pair of sentences, train the model to predict whether the second sentence follows the first one in the original text.
To get started, we need to install 3 libraries:
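Assuming the three are the transformers, datasets, and tokenizers libraries used throughout this tutorial, they can be installed with pip:

```
pip install transformers datasets tokenizers
```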
If you want to follow along, open up a new notebook or Python file and import the necessary libraries:
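As a sketch, the imports below cover everything used in the rest of this tutorial (adjust them to the cells you actually run):

```python
import os
import json

from datasets import load_dataset
from tokenizers import BertWordPieceTokenizer
from transformers import (
    BertTokenizerFast,
    BertConfig,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
    pipeline,
)
```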
If you're willing to pre-train a transformer, then you most likely have a custom dataset. But for demonstration purposes in this tutorial, we're going to use the cc_news dataset from the huggingface datasets library. If you want to load your own custom dataset instead, follow this link to learn how to get it loaded into the library.
CC-News dataset contains news articles from news sites all over the world. It contains 708,241 news articles in English published between January 2017 and December 2019.
Downloading and preparing the dataset:
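A minimal sketch of the download step with the datasets library (the variable name dataset is mine):

```python
# download and prepare the cc_news dataset (it only ships a "train" split)
dataset = load_dataset("cc_news", split="train")
```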
There is only one split in the dataset, so we need to split it into training and testing sets ourselves. You can also pass the seed parameter to the train_test_split() method so that you get the same split every time you run the code, as shown below:
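A sketch, assuming the dataset object from the previous step is named dataset; the 10% test size and the seed value are illustrative:

```python
# split the single "train" split into training and testing sets (90%/10%)
# the seed makes the split reproducible across runs
d = dataset.train_test_split(test_size=0.1, seed=42)
```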
Output:
Let's see what it looks like:
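For example (assuming the split dictionary from above is named d), something like:

```python
# print the first few characters of a couple of training samples
for t in d["train"]["text"][:3]:
    print(t[:100])
    print("=" * 50)
```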
Output (stripped):
As mentioned previously, if you have your own custom dataset, you can either follow the link above to set it up for loading, or you can use the LineByLineTextDataset class if your custom dataset is a text file where all sentences are separated by a new line.
However, a better way to set up a custom dataset is to split your text file into several chunk files using the split command (or any other Python code) and load them using load_dataset() as we did above, like this:
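A sketch of what this could look like; the chunk file names below are placeholders:

```python
# load several plain-text chunk files as one dataset;
# each line in the files is treated as one sample
data_files = ["chunk_0.txt", "chunk_1.txt"]  # placeholder file names
dataset = load_dataset("text", data_files=data_files, split="train")
```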
If your custom data is one massive file, you should divide it into a handful of text files (for example, using the split command on Linux or Colab) before loading them with the load_dataset() function, as the runtime will crash if the data exceeds the available memory.
Next, we need to train our tokenizer. To do that, we need to write our dataset into text files, as that's what the tokenizers library requires the input to be:
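A sketch of one way to do that; the helper name dataset_to_text() and the output file names are my own:

```python
def dataset_to_text(dataset, output_filename="data.txt"):
    """Write the text column of a dataset to a plain-text file, one sample per line."""
    with open(output_filename, "w") as f:
        for t in dataset["text"]:
            print(t, file=f)

# save the training set to train.txt and the testing set to test.txt
dataset_to_text(d["train"], "train.txt")
dataset_to_text(d["test"], "test.txt")
```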
The main purpose of the above code cell is to save the dataset object as text files. If you already have your dataset as text files, then you should skip this step. Next, let's define some parameters:
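Here is a sketch of those parameters. The 512-token maximum length matches the chunk size mentioned later in the tutorial; the 30,522-token vocabulary (BERT's default) and training the tokenizer only on the train file are assumptions:

```python
# files to pass to the tokenizer for training (here, only the training file)
files = ["train.txt"]
# vocabulary size of tokens (BERT's default, assumed here)
vocab_size = 30_522
# maximum sequence length
max_length = 512
# whether to truncate samples longer than max_length,
# or to group them and re-split them into max_length chunks
truncate_longer_samples = False
```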
The files list is the list of files to pass to the tokenizer for training, vocab_size is the size of the token vocabulary, and max_length is the maximum sequence length. truncate_longer_samples is a boolean indicating whether to truncate samples longer than max_length. If it's set to False, we don't truncate the samples; instead, we group them together and split them into chunks of max_length, so all the resulting sequences have a length of max_length.
Let's train the tokenizer now:
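A sketch of the training call; the special tokens listed here are the five described in the next section:

```python
# BERT's special tokens
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

# initialize the WordPiece tokenizer and train it on our text files
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=files, vocab_size=vocab_size, special_tokens=special_tokens)
# enable truncation up to the maximum sequence length
tokenizer.enable_truncation(max_length=max_length)
```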
Since this is BERT, the default tokenizer is WordPiece. Therefore, we initialize the BertWordPieceTokenizer() tokenizer class from the tokenizers library and use its train() method to train it, which will take several minutes to finish. Let's save it now:
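A sketch of the saving step; the pretrained-bert output directory name is an assumption, and the configuration keys mirror the special tokens described below (do_lower_case and model_max_length are extra assumptions):

```python
model_path = "pretrained-bert"  # assumed output directory
os.makedirs(model_path, exist_ok=True)

# save the vocabulary file (vocab.txt) into that path
tokenizer.save_model(model_path)

# manually save some tokenizer configuration, such as the special tokens
with open(os.path.join(model_path, "config.json"), "w") as f:
    tokenizer_cfg = {
        "do_lower_case": True,
        "unk_token": "[UNK]",
        "sep_token": "[SEP]",
        "pad_token": "[PAD]",
        "cls_token": "[CLS]",
        "mask_token": "[MASK]",
        "model_max_length": max_length,
    }
    json.dump(tokenizer_cfg, f)
```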
The tokenizer.save_model() method saves the vocabulary file into that path. We also manually save some tokenizer configurations, such as the special tokens:
- unk_token: A special token that represents an out-of-vocabulary token. Even though the tokenizer is a WordPiece tokenizer, unknown tokens are not impossible, just rare.
- sep_token: A special token that separates two different sentences in the same input.
- pad_token: A special token used to fill out sentences that do not reach the maximum sequence length (since the arrays of tokens must all be the same size).
- cls_token: A special token representing the class of the input.
- mask_token: The mask token we use for the Masked Language Modeling (MLM) pre-training task.

After the training of the tokenizer is complete, let's load it:
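A sketch of the loading step; I'm assuming the tokenizer is reloaded with BertTokenizerFast from the directory saved above:

```python
# reload the trained tokenizer as a transformers tokenizer
tokenizer = BertTokenizerFast.from_pretrained(model_path)
```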
Of course, if you want to use the tokenizer multiple times, you don't have to train it again; simply load it using the above cell.
Now that we have the tokenizer ready, the code below is responsible for tokenizing the dataset:
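A sketch of that step; the two helper names and the remove_columns choice are mine, but the encode() callback and its dependence on truncate_longer_samples follow the explanation below:

```python
def encode_with_truncation(examples):
    """Tokenize and truncate sentences that exceed max_length."""
    return tokenizer(examples["text"], truncation=True, padding="max_length",
                     max_length=max_length, return_special_tokens_mask=True)

def encode_without_truncation(examples):
    """Tokenize without truncation; samples get grouped and re-split later."""
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# pick the callback depending on truncate_longer_samples
encode = encode_with_truncation if truncate_longer_samples else encode_without_truncation

# tokenize both splits in batches, dropping the raw columns so that
# only the token columns remain
train_dataset = d["train"].map(encode, batched=True, remove_columns=d["train"].column_names)
test_dataset = d["test"].map(encode, batched=True, remove_columns=d["test"].column_names)

if truncate_longer_samples:
    # samples already have a fixed length, so tensors can be used directly
    train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
    test_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
```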
The encode() callback that we use to tokenize the dataset depends on the truncate_longer_samples boolean variable. If it's set to True, we truncate sentences that exceed the maximum sequence length (the max_length parameter); otherwise, we don't.
Next, in the case of setting truncate_longer_samples to False, we need to join our untruncated samples together and cut them into fixed-size vectors, since the model expects a fixed-size sequence during training:
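Here is a sketch of that grouping step, modeled on the logic mentioned in the next paragraph (adapted from run_mlm.py); the variable names follow the earlier sketches:

```python
from itertools import chain

def group_texts(examples):
    """Concatenate all tokenized samples, then split them into max_length chunks."""
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # drop the small remainder so every chunk has exactly max_length tokens
    if total_length >= max_length:
        total_length = (total_length // max_length) * max_length
    return {
        k: [t[i : i + max_length] for i in range(0, total_length, max_length)]
        for k, t in concatenated.items()
    }

if not truncate_longer_samples:
    train_dataset = train_dataset.map(group_texts, batched=True,
                                      desc=f"Grouping texts in chunks of {max_length}")
    test_dataset = test_dataset.map(group_texts, batched=True,
                                    desc=f"Grouping texts in chunks of {max_length}")
    # convert to torch tensors now that every sample has a fixed length
    train_dataset.set_format("torch")
    test_dataset.set_format("torch")
```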
Most of the above code was adapted from the run_mlm.py script in the huggingface transformers examples, so it is what the library itself uses.
If you don't want to concatenate all texts and then split them into chunks of 512 tokens, then make sure you set truncate_longer_samples to True, so each line is treated as an individual sample regardless of its length. Note that if you set truncate_longer_samples to True, the above code cell won't be executed at all.
Output:
For this tutorial, we're picking BERT, but feel free to pick any of the transformer models supported by the huggingface transformers library, such as RobertaForMaskedLM or DistilBertForMaskedLM:
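A sketch of the initialization described next, using the parameters defined earlier:

```python
# initialize the model config with our vocabulary size and maximum sequence length
model_config = BertConfig(vocab_size=vocab_size, max_position_embeddings=max_length)
# initialize a fresh (untrained) BERT model for masked language modeling
model = BertForMaskedLM(config=model_config)
```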
We initialize the model config using BertConfig, and pass the vocabulary size as well as the maximum sequence length. We then pass the config to BertForMaskedLM to initialize the model itself.
Before we start pre-training our model, we need a way to randomly mask tokens in our dataset for the Masked Language Model (MLM) task. Luckily, the library makes this easy for us by simply constructing a DataCollatorForLanguageModeling object:
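For example, with the 20% masking probability described next:

```python
# randomly replace 20% of the tokens with [MASK] for the MLM task
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.2
)
```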
We pass the tokenizer, set mlm to True, and also set mlm_probability to 0.2 so that each token is replaced by the [MASK] token with a probability of 20%.
Next, let's initialize our training arguments:
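A sketch of the training arguments. The logging/saving interval of 1000 steps, 8 gradient accumulation steps, and per-device batch size of 10 are the numbers quoted later in this tutorial; the remaining values are assumptions you can tune:

```python
training_args = TrainingArguments(
    output_dir=model_path,            # directory to save checkpoints to
    overwrite_output_dir=True,
    evaluation_strategy="steps",      # evaluate every logging_steps
    num_train_epochs=10,              # assumed; training is stopped early anyway
    per_device_train_batch_size=10,   # training batch size per device
    gradient_accumulation_steps=8,    # accumulate gradients over 8 steps
    per_device_eval_batch_size=64,    # assumed evaluation batch size
    logging_steps=1000,               # log and evaluate every 1000 steps
    save_steps=1000,                  # save a checkpoint every 1000 steps
    # load_best_model_at_end=True,    # optionally load the best checkpoint at the end
)
```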
Each argument is explained in the comments; refer to the TrainingArguments docs for more details. Let's make our trainer now:
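A sketch of the trainer construction (variable names follow the earlier sketches):

```python
# initialize the trainer with the model, arguments, data collator, and datasets
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
```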
We pass our training arguments to the Trainer, as well as the model, data collator, and the training and evaluation sets. We simply call train() now to start training:
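Starting the training is then a single call:

```python
# start pre-training; this will run for hours to days depending on your setup
trainer.train()
```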
The training will take several hours to several days, depending on the dataset size, the training batch size (i.e., increase it as much as your GPU memory allows), and your GPU speed.
As you can see in the output, the model is still improving and the validation loss is still decreasing. You usually have to cancel the training once the validation loss stops decreasing or decreases very slowly.
Since we have set logging_steps and save_steps to 1000, the trainer will evaluate and save the model after every 1000 steps (i.e., after training on steps x gradient_accumulation_steps x per_device_train_batch_size = 1000 x 8 x 10 = 80,000 samples). As a result, I canceled the training after about 19 hours, or 10,000 steps (that is about 1.27 epochs, or 800,000 samples), and started to use the model. In the next section, we'll see how we can use the model for inference.
Before we use the model, let's assume we don't have the model and tokenizer variables in the current runtime. Therefore, we need to load them again:
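A sketch of reloading both; the checkpoint folder name is an assumption based on the 10,000 steps mentioned above (use whichever checkpoint folder you kept):

```python
# load the tokenizer saved earlier and the model from a chosen checkpoint
model_path = "pretrained-bert"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForMaskedLM.from_pretrained(os.path.join(model_path, "checkpoint-10000"))
```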
If you're on Google Colab, you have to save your checkpoints to Google Drive for later use. You can do that by setting model_path to a Drive path instead of a local path as we did here; just make sure you have enough space there.
Alternatively, you can push your model and tokenizer to the huggingface hub; check this useful guide to do it.
Let's use our model now:
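A sketch using the fill-mask pipeline, the task used for MLM inference:

```python
# the fill-mask pipeline wires the model and tokenizer together for inference
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
```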
We use the simple pipeline API and pass both the model and the tokenizer. Let's predict some examples:
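The example sentences below are illustrative placeholders; any sentence containing the [MASK] token will work:

```python
# predict the masked token in a couple of example sentences
examples = [
    "The [MASK] was cloudy yesterday, but today it's sunny.",
    "I went to the [MASK] to buy some groceries.",
]
for example in examples:
    for prediction in fill_mask(example):
        print(f"{prediction['sequence']}, confidence: {prediction['score']:.3f}")
    print("=" * 50)
```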
Output:
That's impressive! I canceled the training early, and the model is still producing interesting results. If your model does not make good predictions, that's a good indicator that it wasn't trained enough.
And there you have it: complete code for pre-training BERT or other transformers using Huggingface libraries. Below are some tips:

- Set load_best_model_at_end to True if you don't want to keep track of the loss yourself, as it will load the best weights (in terms of loss) when the training ends.
- If you set truncate_longer_samples to False, then the code assumes you have longer text on one sentence (i.e., line). You will notice that it takes much longer to process, especially if you set a large batch_size on the map() method. If processing takes many hours, you can either set truncate_longer_samples to True so that sentences exceeding max_length tokens are truncated, or you can save the dataset after processing using the save_to_disk() method, so you process it once and load it several times.
- There is also auto_find_batch_size in the TrainingArguments() class; you can pass it as True so it'll find the optimal batch size for your GPU, avoiding out-of-memory errors. Make sure you have the accelerate library installed: pip install accelerate.

If you're interested in fine-tuning BERT for a downstream task such as text classification, then this tutorial guides you through it.
Other related tutorials:
Check the full code here.
Happy learning ♥