This punctuation predictor trains a bi-directional LSTM to automatically punctuate a sentence. The set of operations it learns includes comma, period, and question mark.
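Concretely, the task can be framed as per-token tagging, where each word is labeled with the punctuation mark (if any) that should follow it. A minimal sketch of that framing (the label names here are illustrative, not the repository's actual encoding):

```python
# Each word is paired with the punctuation that should follow it.
sentence = "hello how are you i am fine"
labels = ["COMMA", "NONE", "NONE", "QUESTION", "NONE", "NONE", "PERIOD"]
# Reinserting the marks yields: "hello, how are you? i am fine."
for word, label in zip(sentence.split(), labels):
    print(word, label)
```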
The Blog Authorship Corpus - http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
| punctuation   | precision | recall | f1-score |
|---------------|-----------|--------|----------|
| None          | 0.93      | 0.95   | 0.94     |
| Comma         | 0.26      | 0.26   | 0.26     |
| Period        | 0.41      | 0.33   | 0.36     |
| Question mark | 0.14      | 0.11   | 0.12     |
Requirements:
- Python 3.x
- NumPy
- Keras
tar zxvf input.tar.gz
to extract the data
Training is done on the blog data (XML files) stored in `input/`.
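The corpus files are only loosely XML-formatted, so a tolerant loader helps. A minimal sketch of one way to pull post text out of a file (the regex-based extraction and the file name are illustrative assumptions, not the repository's actual loader):

```python
import re
from pathlib import Path

def read_posts(xml_path):
    """Extract the text of each <post>...</post> block from one corpus file.

    The blog files are not always well-formed XML, so a simple regex is
    more forgiving here than a strict XML parser.
    """
    raw = Path(xml_path).read_text(encoding="latin-1", errors="ignore")
    return [p.strip() for p in re.findall(r"<post>(.*?)</post>", raw, re.S)]

posts = read_posts("input/some_blog_file.xml")  # hypothetical file name
```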
- Parameters
  - `--output_directory`: output directory name, default = "output"
  - `--checkpoint_name`: checkpoint file name, default = "blstm"
  - `--vectorizer_name`: vectorizer file name, default = "blstm"
  - `--sequence_length`: sequence length for punctuating, default = 50
  - `--file_number`: number of files used for training, default = 350
  - `--batch_size`: default = 128
  - `--epochs`: default = 25
python train.py
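For reference, a word-level bidirectional LSTM tagger along these lines can be sketched in Keras. This is a minimal sketch, assuming four output classes (none, comma, period, question mark); the vocabulary size, embedding dimension, and layer width are placeholders, not the repository's actual values:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, TimeDistributed, Dense

SEQUENCE_LENGTH = 50   # matches --sequence_length
VOCAB_SIZE = 20000     # placeholder; set from the fitted vectorizer
NUM_CLASSES = 4        # none, comma, period, question mark

model = Sequential([
    # Map word indices to dense vectors.
    Embedding(VOCAB_SIZE, 128, input_length=SEQUENCE_LENGTH),
    # Read the sequence in both directions so each word also sees its
    # right context, which helps disambiguate periods from commas.
    Bidirectional(LSTM(128, return_sequences=True)),
    # Predict a punctuation class for every position in the sequence.
    TimeDistributed(Dense(NUM_CLASSES, activation="softmax")),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```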
- Parameters
  - `--input`: input string, which must consist only of lower-case letters and spaces
  - `--model_path`: default = "output/blstm.h5"
  - `--vectorizer_path`: default = "output/blstm.pickle"
  - `--sequence_length`: must match the training sequence length, default = 50
python predictor.py --input "this is a string of text with no punctuation this is a new sentence"
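End to end, prediction looks roughly like this. A minimal sketch, assuming the pickled vectorizer is a word-to-index mapping and that class 0 means "no punctuation" (the label order and lookup logic are assumptions, not the script's actual API):

```python
import pickle
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

PUNCT = {0: "", 1: ",", 2: ".", 3: "?"}  # assumed label order

model = load_model("output/blstm.h5")
with open("output/blstm.pickle", "rb") as f:
    word_index = pickle.load(f)  # assumed: word -> integer id

words = "this is a string of text with no punctuation".split()
ids = [word_index.get(w, 0) for w in words]
x = pad_sequences([ids], maxlen=50)          # same length as in training

probs = model.predict(x)[0]                  # shape (50, 4)
tags = probs.argmax(axis=-1)[-len(words):]   # labels for the real words
print(" ".join(w + PUNCT[int(t)] for w, t in zip(words, tags)))
```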
I began with a single uni-directional LSTM, but it confused commas with periods. After switching to a bi-directional LSTM, performance on periods improved substantially.
- Hidden layer initialization - in most tasks, the network produces good results when starting from a zero initial state.
- Chunking the data differently - I chunk each article into fixed-size windows, so a single sentence can be split across two chunks; this likely makes the punctuation of a chunk's last word harder to predict (see the sketch after this list).
- Try different pretrained embeddings (GloVe, Wang2vec, etc.) to capture semantic relatedness.
- Try different models (CNN, CRF).
- Self-attention.
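To make the chunking issue concrete, here is a minimal sketch of fixed-size windowing (an assumed reconstruction, not the repository's code) showing how a sentence boundary can land mid-chunk:

```python
def chunk(tokens, size=50):
    """Split a token stream into fixed-size, non-overlapping windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = ("word " * 48 + "end of sentence next one starts here").split()
for c in chunk(tokens):
    print(len(c), c[-3:])
# The first chunk ends mid-sentence, so the label of its last word
# depends on context that only appears in the next chunk.
```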