Authors
Laurence Dyer¹ | L.J.Dyer@wlv.ac.uk / ljdyer@gmail.com | GitHub
Anthony Hughes¹ | A.J.Hughes2@wlv.ac.uk | GitHub
Dhwani Shah | D.R.Shah2@wlv.ac.uk | GitHub
Burcu Can | B.Can@wlv.ac.uk
¹Laurence Dyer and Anthony Hughes contributed equally to this work as first authors.
Code repositories
Data cleaning
Text Data Cleaner
repo (GitHub) | demo (Colab)
Clean text data for use in machine learning and natural language processing applications
Evaluation
Feature Restoration Evaluator
repo (GitHub) | demo (Colab)
Quantitative and qualitative evaluation of ML models for restoration of textual features
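As a rough illustration of what quantitative evaluation of space restoration can involve, the sketch below scores a hypothesis string against a reference by comparing where spaces occur. It is not the Feature Restoration Evaluator's API: the function names `space_positions` and `space_prf` are hypothetical, and the choice of precision, recall and F-score over space positions is an assumption made for this example.

```python
def space_positions(text):
    """Return the set of offsets at which a space occurs.

    An offset is the number of non-space characters preceding the space,
    so positions are comparable between differently spaced strings.
    """
    positions = set()
    idx = 0
    for ch in text:
        if ch == " ":
            positions.add(idx)
        else:
            idx += 1
    return positions


def space_prf(reference, hypothesis):
    """Precision, recall and F1 of restored spaces (hypothetical helper)."""
    if reference.replace(" ", "") != hypothesis.replace(" ", ""):
        raise ValueError("reference and hypothesis differ in non-space characters")
    ref = space_positions(reference)
    hyp = space_positions(hypothesis)
    true_positives = len(ref & hyp)
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One of the two reference spaces was restored, and nothing spurious was added.
precision, recall, f1 = space_prf("the cat sat", "the catsat")
```

The same position-based comparison could in principle be extended to other features such as punctuation marks or capitalization, with one score per feature.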
Models
Naive Bayes Space Restorer
repo (GitHub) | demo (Colab)
Train Naive Bayes-based statistical machine learning models for restoring spaces to unsegmented sequences of input characters. (Referred to in the paper as NB.)
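To give a feel for the general idea, here is a minimal sketch of space restoration under a unigram word model, where word probabilities are treated as independent (the "naive" assumption) and the most probable segmentation is found by dynamic programming. It is an illustration under stated assumptions, not the repository's implementation: `train_unigram`, `restore_spaces`, the toy corpus, and the length-based penalty for unseen words are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def train_unigram(corpus_words):
    """Count word frequencies in a list of training words."""
    counts = Counter(corpus_words)
    total = sum(counts.values())
    return counts, total


def restore_spaces(text, counts, total, max_word_len=20):
    """Return the most probable segmentation of `text` under a unigram model."""

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed smoothing: unseen words are penalized more the longer they are.
        return math.log(1.0 / (total * 10 ** len(word)))

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + log_prob(text[j:i])
            if score > best[i]:
                best[i] = score
                back[i] = j

    # Follow the back-pointers from the end of the string to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))


counts, total = train_unigram("the cat sat on the mat".split())
restored = restore_spaces("thecatsat", counts, total)
```

A practical model would be trained on a far larger corpus and would likely use a more careful treatment of unseen words; the toy corpus here only keeps the example self-contained.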
BiLSTM Char Feature Restorer
repo (GitHub) | demo (Colab)
Train character-level BiLSTM models for restoration of features such as spaces, punctuation, and capitalization to unformatted texts. (Referred to in the paper as BiLSTMCharSpace/BiLSTMCharE2E.)
CRF for Punctuation Restoration
repo (GitHub)
A Conditional Random Fields model for punctuation restoration. (Referred to in the paper as CRF.)
Punctuation Restoration using Transformer Models
repo (GitHub)
An extension of the paper "Punctuation Restoration using Transformer Models for High-and Low-Resource Languages", accepted at the EMNLP workshop W-NUT 2020. (Referred to in the paper as BERTBiLSTM.)