This punctuation predictor trains a bi-directional LSTM to automatically punctuate a sentence. The set of operations it learns includes comma, period, and question mark.
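Concretely, the task can be framed as per-token tagging, where each word is labeled with the punctuation mark (if any) that should follow it. A minimal sketch of that framing (the label names here are illustrative, not the repository's actual encoding):

```python
# Each word is paired with the punctuation that should follow it.
sentence = "hello how are you i am fine"
labels = ["COMMA", "NONE", "NONE", "QUESTION", "NONE", "NONE", "PERIOD"]
# Reinserting the marks yields: "hello, how are you? i am fine."
for word, label in zip(sentence.split(), labels):
    print(word, label)
```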
The Blog Authorship Corpus - http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
| punctuation   | precision | recall | f1-score |
|---------------|-----------|--------|----------|
| None          | 0.93      | 0.95   | 0.94     |
| Comma         | 0.26      | 0.26   | 0.26     |
| Period        | 0.41      | 0.33   | 0.36     |
| Question mark | 0.14      | 0.11   | 0.12     |
Requirements:
- Python 3.x
- NumPy
- Keras
tar zxvf input.tar.gz
to extract the data
Training is done on the blog data (XML files) stored in `input/`.
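The corpus files are only loosely XML-formatted, so a tolerant loader helps. A minimal sketch of one way to pull post text out of a file (the regex-based extraction and the file name are illustrative assumptions, not the repository's actual loader):

```python
import re
from pathlib import Path

def read_posts(xml_path):
    """Extract the text of each <post>...</post> block from one corpus file.

    The blog files are not always well-formed XML, so a simple regex is
    more forgiving here than a strict XML parser.
    """
    raw = Path(xml_path).read_text(encoding="latin-1", errors="ignore")
    return [p.strip() for p in re.findall(r"<post>(.*?)</post>", raw, re.S)]

posts = read_posts("input/some_blog_file.xml")  # hypothetical file name
```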
- Parameters
  - `--output_directory`: output directory name, default = "output"
  - `--checkpoint_name`: checkpoint file name, default = "blstm"
  - `--vectorizer_name`: vectorizer file name, default = "blstm"
  - `--sequence_length`: sequence length for punctuating, default = 50
  - `--file_number`: number of files used for training, default = 350
  - `--batch_size`: default = 128
  - `--epochs`: default = 25
python train.py
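For reference, a word-level bidirectional LSTM tagger along these lines can be sketched in Keras. This is a minimal sketch, assuming four output classes (none, comma, period, question mark); the vocabulary size, embedding dimension, and layer width are placeholders, not the repository's actual values:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

SEQUENCE_LENGTH = 50   # matches --sequence_length
VOCAB_SIZE = 20000     # placeholder; set from the fitted vectorizer
NUM_CLASSES = 4        # none, comma, period, question mark

model = Sequential([
    # Map word indices to dense vectors.
    Embedding(VOCAB_SIZE, 128, input_length=SEQUENCE_LENGTH),
    # Read the sequence in both directions so each word also sees its
    # right context, which helps disambiguate periods from commas.
    Bidirectional(LSTM(128, return_sequences=True)),
    # Predict a punctuation class for every position in the sequence.
    TimeDistributed(Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```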
- Parameters
  - `--input`: input string, which must consist only of lower-case letters and spaces
  - `--model_path`: default = "output/blstm.h5"
  - `--vectorizer_path`: default = "output/blstm.pickle"
  - `--sequence_length`: must match the training sequence length, default = 50
python predictor.py --input "this is a string of text with no punctuation this is a new sentence"
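End to end, prediction looks roughly like this. A minimal sketch, assuming the pickled vectorizer is a word-to-index mapping and that class 0 means "no punctuation" (the label order and lookup logic are assumptions, not the script's actual API):

```python
import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

PUNCT = {0: "", 1: ",", 2: ".", 3: "?"}  # assumed label order

model = load_model("output/blstm.h5")
with open("output/blstm.pickle", "rb") as f:
    word_index = pickle.load(f)  # assumed: word -> integer id

words = "this is a string of text with no punctuation".split()
ids = [word_index.get(w, 0) for w in words]
x = pad_sequences([ids], maxlen=50)          # same length as in training

probs = model.predict(x)[0]                  # shape (50, 4)
tags = probs.argmax(axis=-1)[-len(words):]   # labels for the real words
print(" ".join(w + PUNCT[int(t)] for w, t in zip(words, tags)))
```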
I began with a single uni-directional LSTM, but it confused commas with periods. After switching to a bi-directional LSTM, performance on periods improved substantially.
- Hidden layer initialization - in most tasks, the network produces good results when starting from a zero initial state.
- Chunking the data differently - I chunk each article into fixed-size windows, so a single sentence can be split across two chunks; this likely makes the punctuation of a chunk's last word harder to predict (see the sketch after this list).
- Try different pretrained embeddings (GloVe, Wang2vec, etc.) to capture semantic relatedness.
- Try different models (CNN, CRF).
- Self-attention.
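To make the chunking issue concrete, here is a minimal sketch of fixed-size windowing (an assumed reconstruction, not the repository's code) showing how a sentence boundary can land mid-chunk:

```python
def chunk(tokens, size=50):
    """Split a token stream into fixed-size, non-overlapping windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = ("word " * 48 + "end of sentence next one starts here").split()
for c in chunk(tokens):
    print(len(c), c[-3:])
# The first chunk ends mid-sentence, so the label of its last word
# depends on context that only appears in the next chunk.
```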