Chapter 11 Resources and Benchmarks for NLP

Authors: Nico Hahn

Supervisor: Daniel Schalk

Frameworks such as TensorFlow or Keras allow users to train a wide range of different models for different tasks. Let us assume that two models for a simple question-answer system are trained, one with attention and one without attention. How can these models be evaluated in order to find the model better suited to the task? Quite simply, through benchmarking. This section looks at some of the most commonly used benchmarking datasets and at pre-training resources.

11.1 Metrics

For many of the benchmarking datasets in natural language processing, a leaderboard exists in which different models are compared with each other. Depending on the task, the models are evaluated with different metrics. In this section we will introduce those used for the benchmarking datasets presented later.

Exact match (EM): The percentage of predictions that match any one of the answers exactly.

(Macro-averaged) F1 score (F1): Each answer and prediction is tokenized into words. For every answer to a given question, the overlap between the prediction and each answer is calculated and the maximum F1 is chosen. This score is then averaged over all the questions. Formally speaking:

\[ \begin{aligned} F1 &= \frac{2 \cdot \hbox{precision}\cdot\hbox{recall}}{\hbox{precision}+\hbox{recall}} \\ \hbox{precision} &= \frac{\hbox{number of same tokens}}{\hbox{length(predicted tokens)}} \\ \hbox{recall} &= \frac{\hbox{number of same tokens}}{\hbox{length(labeled tokens)}} \end{aligned} \]

Perplexity: Perplexity is a measurement of how well a probability model predicts a sample. A low perplexity indicates the probability distribution is good at predicting the sample. In NLP, perplexity is a way of evaluating language models. A model of an unknown probability distribution \(p\), may be proposed based on a training sample that was drawn from \(p\). Given a proposed probability model \(q\), one may evaluate \(q\) by asking how well it predicts a separate test sample \(x_1, x_2, ..., x_N\) also drawn from \(p\). The perplexity of the model \(q\) is defined as \[ b^{-\frac{1}{N}\sum_{i=1}^N\log_bq(x_i)} \] where \(b\) is customarily \(2\). (Martinc, Pollak, and Robnik-Šikonja 2019)

BLEU: BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Scores are calculated for individual translated segments—generally sentences—by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality. Intelligibility or grammatical correctness are not taken into account. (Papineni et al. 2002)

Accuracy: Accuracy is the ratio of number of correct predictions to the total number of input samples.


Matthews correlation coefficient: The MCC is used as a measure of quality of binary classifications. It takes true and false positives and negatives into account and is regarded as a balanced measure which can be used even if the classes are imbalanced. The MCC can be calculated directly from the confusion matrix using the following formula:

\[\hbox{MCC}=\frac{\hbox{TP}\cdot\hbox{TN}-\hbox{FP}\cdot\hbox{FN}}{\sqrt{(\hbox{TP}+\hbox{FP})(\hbox{TP}+\hbox{FN})(\hbox{TN}+\hbox{FP})(\hbox{TN}+\hbox{FN})}} \] (Boughorbel, Jarray, and El-Anbari 2017)

11.2 Benchmark Datasets

11.2.1 SQuAD

The first Version of the Stanford Question Answering Dataset was released in 2016. The dataset was created with the aim of advancing the field of reading comprehension. Reading text and answering questions about it is a demanding task for machines and requires large data sets of high quality. Most of the datasets before the release of the first version of SQuAD were either of high quality or of large size, but not both.

With the help of crowdworkers, 107.785 question-answer pairs were created for 536 Wikipedia articles. For each question, the answer is a segment of text, or span, from the corresponding reading passage. Pairs were collected in a two-step process. In the first step the crowdworkers were asked to generate five questions and their answers per paragraph.

In the second step, each crowdworker was shown only the questions along with the paragraphs of the corresponding article and was asked to choose the shortest span in the paragraph that answered the question. As a result of this process, questions in the dev-set multiple answers.

The goal of this procedure was to get a more robust evaluation and to obtain an indicator of human performance on SQuAD.

One shortcoming of reading comprehension systems is that they tend to make unreliable guesses on questions to which no correct answer is possible. With this in mind, the second version of SQuAD was released in 2018. In addition to the approximately 100.000 questions from the first version, 53.775 new, unanswerable questions on the same paragraphs are contained in this dataset.

The accuracy of models trained on SQuAD is evaluated using two different metrics, exact match and (Macro-averaged) F1 score, both ignoring punctuation and articles.

To evaluate human performance, the second answer to each question is treated as the human prediction. (Rajpurkar et al. 2016; Rajpurkar, Jia, and Liang 2018)

Humans achieve an EM score of 86.831 and a F1 score of 89.452.

Currently, the best performing model achieves an EM score of 90.386 and a F1 score of 92.777.

Examples of SQuAD and the leaderboard and can be viewed here:

11.2.2 CoQA

CoQA is a dataset for building Conversational Question Answering systems. Humans are capable of gathering information through conversations that include several interrelated questions and answers. The aim of CoQA is to enable machines to answers conversational questions.

The data set is made up of 127k Q/A pairs, covering seven different domains such as Children’s Stories or Reddit. Five of these domains are used for in-domain evaluation, meaning models have already seen questions from these domains, and two are used for out-of-domain evaluation., meaning models have not seen any questions from these domains. To create the Q/A pairs, two people received a text passage, with one person asking the other person questions about the text and the other person answering. Using multiple annotators has a few advantages:

  1. A natural flow of conversation is created.
  2. If one person gives an incorrect answer or a vague questions is asked, the other person can raise a flag. Thus bad annotators can easily be identified.
  3. If there is a disagreement, the two annotators can discuss it via a chat window.

Similar to SQuAD, three additional answers are collected for each question. However, since the answers influence the flow of the conversation, the next question always depends on the answer to the previous question. For this reason, two different answers to the same question can lead to two different follow-up questions. In order to avoid incoherent discussions, annotators are shown a question that they must answer first. After answering, they are shown the original answer, and they must then confirm that their answer has an identical meaning.

Compared to SQuAD 2.0, there is a greater variety of question types in CoQA. While almost half of the questions in the SQuAD start with what, less than a quarter of the questions in the CoQA begin with this token. Another major difference is that questions in CoQA are on average 5.5 words long, compared to an average length of 10.1 in SQuAD. It is also worth mentioning that about 10% of the answers in CoQA are either yes or no, whereas there are no such answers in SQuAD.

Like SQuAD, trained models are evaluated using a macro-average F1 score. Models are evaluated separately on the in-domain dataset and the out-of-domain dataset. (Reddy, Chen, and Manning 2018)

Humans achieve a F1 score of 89.4 for in-domain and a F1 score of 87.4 for out-of-domain.

Currently, the best performing model achieves a F1 score of 91.4 for in-domain and a F1 score of 89.2 for out-of-domain.

Examples of CoQA and the leaderboard and can be viewed here:

11.2.3 (Super)GLUE

Most models in NLP are designed to solve a specific task, such as answering questions from a particular domain. This limits the use of models for understanding natural language. In order to process language in a way that is not limited to a specific task, genre, or dataset, models should be able to solve a variety of tasks well.

The General Language Understanding Evaluation benchmark dataset is a collection of tools created with this in mind. It is designed to encourage and favor models that share common linguistic knowledge across tasks. These tasks include textual entailment, sentiment analysis and question answering. Some tasks come with a lot of training data, others with less. Common to all datasets is that they were not created specifically for GLUE, but are existing datasets. Models that are evaluated on GLUE only need to have the ability to process single-sentence and sentence-pair inputs and make appropriate predictions. This test suite contains a total of nine sentence or sentence-pair NLU tasks, built on established annotated datasets. There are three distinct types of tasks in GLUE: Single-Sentence Tasks, Similarity and Paraphrase Tasks and Inference Tasks.

Single-Sentence Tasks:

The first single-sentence task is CoLA, the Corpus of Linguistic Acceptability, which consists of English acceptability judgments derived from books and journal articles on linguistic theory. Each datapoint consists of a sequence of words and an annotation as to whether this sequence is a grammatical English sentence. Matthews correlation coefficient is used as the evaluation metric.

The Stanford Sentiment Treebank task consists of sentences from movie reviews and the corresponding sentiment (positive/negative). Accuracy is used for evaluation.

Similarity and Paraphrase Tasks

The Microsoft Research Paraphrase Corpus consists of pairs of sentences and the goal is to predict whether two sentences are semantically equivalent. For evaluation, F1 score and accuracy is used.

Quora Question Pairs is similar to MRP in that the aim is to predict whether two questions are semantically equivalent and F1 and accuracy is used for evaluation.

The Semantic Textual Similarity Benchmark consists of sentence pairs human-annotated with a similarity score from 1 to 5. The goal is to predict these scores. Pearson and Spearman correlation coefficients are used for evaluation.

Inference Tasks:

The Multi-Genre Natural Language Inference Corpus is a collection of sentence pairs with textual entailment annotations. Based on a premise sentence and a hypothesis sentence, the aim is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither of the two. Models are evaluated using accuracy.

Recognizing Textual Entailment is akin to MNLI, only this time with a two-class split.

The Winograd Schema Challenge is a reading comprehension task in which a system must read a sentence containing a pronoun and pick the speaker of that pronoun from a list of choices. To transform this task into a classification problem, pairs of sentences are constructed by replacing the ambiguous pronoun with any possible referent. The task is to predict if the sentence with the pronoun substituted is entailed by the original sentence. Evaluation is done using accuracy.

The last task in GLUE is based on the first version of SQuAD. The task is converted into a sentence pair classification task by pairing each question and each sentence in the respective context, with the aim of determining whether or not a sentence contains the answer to the question. The task is evaluated on accuracy.

The models are scored separately for each task and then a macro-average of these scores is calculated to determine a system’s position on the ranking. If a task has multiple metrics, an unweighted average of these metrics is used as the score for the task when calculating the overall macro average. (Wang et al. 2018)

The human baseline score is 87.1, while the best model score is currently 90.6.

Roughly one year after the release of GLUE, models surpassed human performance. In response to this, a new benchmark, SuperGLUE, was introduced. It follows the same principles as GLUE, however the tasks included are more challenging. The two hardest tasks in GLUE, Recognizing Textual Entailment and the Winograd Schema Challenge, remain, the rest were selected based on difficulty for current NLP approaches. There are a total of eight different tasks in this benchmark.

Boolean Questions consists of a text passage together with a corresponding yes/no question. Models are evaluated using accuracy.

Commitment Bank is a three-class textual entailment task. Accuracy and F1 is used for evaluation, where for multi-class F1 the unweighted average of the F1 per class is calculated.

Choice of Plausible Answers is a causal reasoning task in which a model is given a premise sentence, a question and two possible answers. It must then decide which answer is the correct one. Accuracy is used for the evaluation.

Multi-Sentence Reading Comprehension is a QA task where each example consists of a paragraph, a question and a list of answers. Models must predict which answers are correct. Evaluation metrics are F1 over all answer choices and the exact match of the answer set of each question.

Reading Comprehension with Commonsense Reasoning Dataset is a multiple choice QA task. Each data point consists of a paragraph, a fill-in-the-gap sentence in which an entity is masked, and a list of possible entities to choose from. The entities can be expressed using several different surface forms, all of which are considered correct. Models are evaluated using max (over all mentions) token-level F1 and exact match.

Words in Context is a word sense disambiguation task in which a model is given two sentences and a polysemous word. Models must decide whether the word is used with the same meaning in both sentences. Accuracy is used for evaluation.

(Wang et al. 2019)

For SuperGLUE, the human baseline score is 89.8, which is above the best model score, presently 89.3.

More information about the tasks and the leaderboard for both GLUE and SuperGLUE is available here:

11.2.4 AQuA-Rat

One task that most people know from their time at school is solving algebraic word problems. For humans, this task can be easy, depending on a person’s mathematical abilities, since we only have to perform a series of arithmetic operations. However, since programs can be endlessly complicated, it is a considerable challenge to induce them directly from question-answer pairs. The Algebra Question Answering with Rationales dataset attempts to make this task more feasible for machines by providing not only the correct answer but also step-by-step instructions for deriving the correct answer, the so-called rationale. Models trained on AQuA-Rat must not only predict the correct answer, but also the rationale.

The dataset contains over 100.000 questions, and each question has five different options as to what the correct answer is. It also contains the answer rationale and the correct option. The problems cover a wide range of topics, for instance probability theory or calculus, with a variety of difficulty levels. To create the dataset, examples of exams such as the GMAT (Graduate Management Admission Test) and GRE (General Test) were taken from the Internet. This part of the dataset is called the seed dataset. Besides, crowdsourcing was used to generate further questions. For this users were presented with five questions from the seed dataset and asked to select one of these questions and write a similar question. Users were also forced to rephrase the rationales and answers to avoid paraphrasing the original questions. These created questions were then passed to another user for quality control.

The rationales are evaluated using average sentence level perplexity and the BLEU score. If a model is unable to generate a token for perplexity computation, an unknown token is predicted. The correctness of the answers is evaluated by calculating the accuracy of the predictions.

This is a relatively new dataset and as of now there is no online leaderboard for it. The authors of the original paper used an attention-based sequence to sequence model as their baseline method. The authors generated a program containing both instructions that generate output and instructions that simply generate intermediate values used by following instructions. The program uses a latent predictor network which generates an output sequence conditioned on an arbitrary number of input functions and staged back-propagation to save memory. Going into further depth about this program would be beyond this book so I’d advise to have a look at the original paper. The program outperformed the baseline model and achieved a perplexity of 28.5, a BLEU score of 27.2 and has and accuracy of 36.4. (Ling et al. 2017)

The paper and examples of the dataset can be found here:

11.2.5 SNLI

When it comes to understanding natural language, the understanding of entailment and contradiction is essential. The characterization and use of these relationships in computational systems is called natural language inference and is fundamental for tasks such as commonsense reasoning and information retrieval. The Stanford Natural Language Inference Corpus is a collection of sentence pairs that are labeled either as entailment, contradiction or semantic independence. While there are other datasets that try to accomplish this particular task, they all have problems of size, quality and vagueness.

SNLI consists of about 570k record pairs. Again, crowdworkers were used to construct the dataset. For this purpose they were shown the caption of a photo but not the photo itself, and asked to write three alternative captions: One that is definitely a true description of the photo, one that could be a true description of the photo, and one caption that is definitely a false description of the photo. By not showing the photo, the authors wanted to ensure that each pair of sentences could be reconstructed based on the available text alone. To quantify the quality of the corpus, about 10% of all created sentence pairs were validated. For this purpose, each crowdworker was shown five pairs of sentences and asked to mark them with one of the three labels. Each set was shown to a total of five crowdworkers. For each pair, a gold label was awarded if at least three of the five annotators chose the same label. About 98% of all sentence pairs received a gold label, the rest were given a placeholder label. (Bowman et al. 2015)

The models are evaluated again with the accuracy of the predicted label. There is no measurement of human performance for the SNLI corpus. At present, the most accurate model is a semantics-aware BERT (SemBERT) with an accuracy of 91.9. The paper and examples of the dataset can be found here:

11.2.6 Overview

Below is a brief overview of all of the datasets discussed in this chapter, including some other interesting datasets. If you would like to learn more about one of the datasets, for each dataset the corresponding paper is linked.

Name Task Size Description
SQuAD 2.0 Question Answering, Reading Comprehension 150,000 Paragraphs w questions and answers
CoQA Question Answering, Reading Comprehension 127,000 Answering interconnected questions
GLUE General Language Understanding Nine different NLU tasks
SuperGLUE General Language Understanding Eight different NLU tasks
AQuA-Rat Question Answering, Reading Comprehension, Mathematical Reasoning 100,000 Solving algebraic word problems
SNLI Natural Language Inference 570,000 Understanding entailment and contradiction
Irony Sarcasm Analysis Corpus Classification, Sentiment Analysis 33,000 Ironic, sarcastic, regular and figurative tweets
WikiText-103 & 2 Language Modelling 100M+ Word and character level tokens from Wikipedia
WMT 14 English-German Language Translation 4.5M Sentence pairs for translation
VOiCES Speech Recognition 3,900 Voices in complex environmental settings. 15h material

11.3 Pre-Trained Models

In the last chapters we’ve already heard quite a bit about pre-trained models like BERT or GPT-3. But exactly on what data are they trained on? Let’s find out.

11.3.1 BERT

The pre-training corpus used for BERT consists of the BookCorpus and the entirety of the English Wikipedia.

The BookCorpus: This dataset was released in 2015. To create the corpus, 11,038 free books were collected from the Internet. All of these were written by authors who have not yet been published. To be included, a book had to have more than 20,000 words to filter out the shorter stories that might be noisy. The dataset includes over 16 different genres, for example Romance, Science Fiction or Fantasy. In total there are about 1 billion words, 1.3 million unique words and 74 million sentences with an average sentence length of 13 words in the BookCorpus. (Zhu et al. 2015)

English Wikipedia: For Wikipedia only text passages were extracted, while lists, tables and headings were ignored. In total, this dataset contains 2.5 billion words.

According to the authors, it is crucial to use a document-level corpus rather than a shuffled sentence-level corpus like the Billion Word benchmark dataset to extract long sentences. (Devlin et al. 2018)

11.3.2 OpenAI GPT-3

The dataset used for pre-training GPT-3 consists of a filtered version of the Common Crawl dataset and multiple curated high quality datasets, including an extended version of WebText, two books corpora and the English language Wikipedia.

Common Crawl: The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata and text extracts. To improve the quality of Common Crawl, two techniques are used: (1) filtering Common Crawl and (2) fuzzy deduplication.

  1. To improve the quality, the original WebText was used as a proxy for high quality documents. A classifier was trained to distinguish these documents from the raw text in the Common Crawl. This classifier was used to re-sample Common Crawl by prioritizing documents for which higher quality was predicted. A logistic regression was trained for the classifier using characteristics from the standard Spark and HashingTF tokenizer. A document was kept in the dataset, if

\[\hbox{np.random.pareto}(\alpha) > 1 - \hbox{document\_score}.\] A value of 9 was chosen for \(\alpha\) in order to obtain both high and low scoring, but mostly high scoring documents.

  1. To prevent overfitting, documents were fuzzily deduplicated using Spark’s MinHashLSH implementation with 10 hashes. WebText was also fuzzily removed from Common Crawl. This decreased dataset size by around 10%.

WebText: The WebText dataset is a dataset created by web scraping that emphasizes the quality of the documents. Only websites that have been curated/filtered by humans have been scrapped. To simplify this task, all outbound links from Reddit, a social media platform, which received at least 3 Karma, were used. The resulting dataset contains the text subset of these 45 million links. Fuzzy deduplication was also used here.

Books1 and Books2: These are two internet-based books corpora on which fuzzy deduplication was performed. Nothing more is known about these datasets.

The datasets used to train GPT-3 are shown in the table below.

Dataset Quantity (tokens) Weight in training mix
Common Crawl (filtered) 410 billion 60%
WebText2 19 billion 22%
Books1 12 billion 8%
Books2 55 billion 8%
Wikipedia 3 billion 3%

(Brown et al. 2020)

11.3.3 Google 5T

Google 5T also uses a dataset based on Common Crawl for pre-training their model, called the “Colossal Clean Crawled Corpus” (C4). To improve the quality of Common Crawl, the following heuristics were used:

  • Only keep lines that end in a period, exclamation mark, question mark, or closing quotation mark.
  • Remove any page that contains a word from the “List of Dirty, Naughty, Obscene or Otherwise Bad Words”.
  • Remove any line containing the word Javascript to remove warnings about enabling Javascript.
  • Remove any page containing the phrase “lorem ipsum”.
  • Remove all pages that contain “{” because some pages may have accidentally contained code.
  • To deduplicate the dataset, discard all but one on any three-sentence span occurring more than once in the dataset.

Furthermore, langdetect was used to filter out any pages that were not classified as English with a probability of at least 99%.

(Raffel et al. 2019)

11.4 Resources for Resources

If you are interest in further NLP tasks or dataset, there are two websites worth checking out.

Papers With Code highlights trending Machine Learning research and the code to implement it. Their mission is to create a free and open resource with ML papers, code and evaluation tables. Anyone can contribute by downloading data, training their own model and comparing their model to others.

To see the newest trends in NLP, check out the link below.

If you want to refine your natural language processing (NLP) skills, finding accessible and relevant datasets can be one of the biggest bottlenecks. A lot of time can be spent searching for accessible datasets for the learning task at hand or trying to curate your own data instead. This is where The Big Bad NLP Database, managed by Quantum Stat, comes in. It is a central location for NLP datasets. Currently there are over 500 data entries for general NLP tasks, such as question answering or language modeling. While most of the datasets are in English, there are also a number of datasets in other languages. Just have a look for yourself!


Boughorbel, Sabri, Fethi Jarray, and Mohammed El-Anbari. 2017. “Optimal Classifier for Imbalanced Data Using Matthews Correlation Coefficient Metric.” PloS One 12 (6). Public Library of Science San Francisco, CA USA: e0177678.

Bowman, Samuel R, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. “A Large Annotated Corpus for Learning Natural Language Inference.” arXiv Preprint arXiv:1508.05326.

Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.”

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” CoRR abs/1810.04805.

Ling, Wang, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. “Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems.” arXiv Preprint arXiv:1705.04146.

Martinc, Matej, Senja Pollak, and Marko Robnik-Šikonja. 2019. “Supervised and Unsupervised Neural Approaches to Text Readability.” arXiv Preprint arXiv:1907.11779.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “BLEU: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 311–18. Association for Computational Linguistics.

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” arXiv Preprint arXiv:1910.10683.

Rajpurkar, Pranav, Robin Jia, and Percy Liang. 2018. “Know What You Don’t Know: Unanswerable Questions for Squad.”

Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.”

Reddy, Siva, Danqi Chen, and Christopher D. Manning. 2018. “CoQA: A Conversational Question Answering Challenge.” CoRR abs/1808.07042.

Wang, Alex, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. “Superglue: A Stickier Benchmark for General-Purpose Language Understanding Systems.” In Advances in Neural Information Processing Systems, 3261–75.

Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. “Glue: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv Preprint arXiv:1804.07461.

Zhu, Yukun, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. “Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books.” CoRR abs/1506.06724.