
Comparison of Token- and Character-Level Approaches to Restoration of Spaces, Punctuation, and Capitalization in Various Languages

A portal to GitHub repositories associated with the paper

Authors

Laurence Dyer¹ | L.J.Dyer@wlv.ac.uk / ljdyer@gmail.com | GitHub
Anthony Hughes¹ | A.J.Hughes2@wlv.ac.uk | GitHub
Dhwani Shah | D.R.Shah2@wlv.ac.uk | GitHub
Burcu Can | B.Can@wlv.ac.uk

¹Laurence Dyer and Anthony Hughes contributed equally to this work as first authors.

Code repositories

Data cleaning

Text Data Cleaner repo (GitHub) | demo (Colab)
Clean text data for use in machine learning and natural language processing applications

Evaluation

Feature Restoration Evaluator repo (GitHub) | demo (Colab)
Quantitative and qualitative evaluation of ML models for restoration of textual features
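
As an illustration of the quantitative side, the sketch below scores restored spaces with precision, recall, and F-score, one common way to evaluate feature restoration. It is not the Feature Restoration Evaluator's actual API: the function names `space_positions` and `space_prf` are hypothetical, and the sketch assumes the reference and hypothesis contain the same characters once spaces are removed.

```python
def space_positions(text):
    """Return the set of offsets (counted in non-space characters) that are followed by a space."""
    positions = set()
    offset = 0
    for ch in text:
        if ch == " ":
            positions.add(offset)
        else:
            offset += 1
    return positions


def space_prf(reference, hypothesis):
    """Precision, recall, and F-score of the spaces in `hypothesis` against `reference`.

    Hypothetical helper for illustration only.
    """
    if reference.replace(" ", "") != hypothesis.replace(" ", ""):
        raise ValueError("reference and hypothesis differ in non-space characters")
    ref = space_positions(reference)
    hyp = space_positions(hypothesis)
    true_positives = len(ref & hyp)
    precision = true_positives / len(hyp) if hyp else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# One of the two reference spaces is restored, and no spurious space is added.
print(space_prf("the cat sat", "the catsat"))
```

The same position-based comparison could be extended to punctuation or capitalization by recording a label per character offset instead of a single set of offsets.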

Models

Naive Bayes Space Restorer repo (GitHub) | demo (Colab)
Train Naive Bayes-based statistical machine learning models for restoring spaces to unsegmented sequences of input characters. (Referred to in the paper as NB.)
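
To give a feel for the general idea, here is a minimal sketch of space restoration under a naive (unigram) independence assumption, using dynamic programming to find the most probable segmentation. It is not the repository's implementation or API: `train_unigram`, `restore_spaces`, the `max_word_len` parameter, and the penalty for unseen strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram(words):
    """Count word frequencies; returns (counts, total number of tokens)."""
    counts = Counter(words)
    return counts, sum(counts.values())


def restore_spaces(text, counts, total, max_word_len=20):
    """Return the most probable segmentation of `text`, treating words as independent."""

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Assumed penalty for unseen strings: grows harsher with length.
        return math.log(1.0 / (total * 10 ** len(word)))

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j] + log_prob(text[j:i])
            if score > best[i]:
                best[i] = score
                back[i] = j

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return " ".join(reversed(words))


counts, total = train_unigram("the cat sat on the mat".split())
print(restore_spaces("thecatsatonthemat", counts, total))
```

With a toy vocabulary like this one, every word in the input is known, so the search simply recovers the original word boundaries; a realistic model would be trained on a large corpus and would need a more careful treatment of unseen words.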

BiLSTM Char Feature Restorer repo (GitHub) | demo (Colab)
Train character-level BiLSTM models for restoration of features such as spaces, punctuation, and capitalization to unformatted texts. (Referred to in the paper as BiLSTMCharSpace/BiLSTMCharE2E.)

CRF for Punctuation Restoration repo (GitHub)
A Conditional Random Fields model for punctuation restoration. (Referred to in the paper as CRF.)

Punctuation Restoration using Transformer Models repo (GitHub)
An extension of the paper "Punctuation Restoration using Transformer Models for High-and Low-Resource Languages", accepted at the EMNLP workshop W-NUT 2020. (Referred to in the paper as BERTBiLSTM.)