Numbers in NLP: a Survey
This article is based on the following paper “Representing Numbers in NLP: a Survey and a Vision”: https://arxiv.org/pdf/2103.13136.pdf.
Numeracy is the understanding of numbers and how to use them. Numbers let us count and reason about quantities, which is crucial to understanding the world. Today, numbers in NLP are typically treated as ordinary words or ignored entirely, and given no special consideration, even though the human brain represents words and numbers differently. We will cover the importance of numbers, the problems and limitations faced today, methods for improving numeracy in NLP, and general recommendations to follow.
Importance of Numbers
There are around 150 million numbers in the 6 million pages of the English Wikipedia. Numbers are an important part of text and carry meaning of their own. Consider the sentence "I woke up at 11": to understand it, we need knowledge of words, but also numeracy. Our brain decodes the string 11, recognizes that it denotes an hour of the day, and concludes that the person woke up late. Take another example, "I earn $11 a month": here 11 has a different meaning. If we replace 11 with 10, the sentence still reads naturally, yet there is a real difference between earning $11 and $10 a month. We see that context is key to understanding numbers.
Currently in NLP, numbers are typically filtered out during preprocessing or mapped to an UNK (unknown) token. Subword tokenizers split them into arbitrary pieces; for example, "1234" might be split into "12" and "34", or into "1" and "234". Unfortunately, these representations cause models to perform poorly on numbers. For example, BERT performs about five times worse when the answer to a question is a number rather than a word.
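As a quick illustration of how a standard subword tokenizer handles numbers, here is a minimal sketch using the Hugging Face transformers library (assumed installed); the exact splits depend on the tokenizer's vocabulary, so the output is illustrative rather than definitive.

```python
# Minimal sketch: how a standard subword tokenizer splits numbers.
# Exact splits depend on the vocabulary, so treat the output as illustrative.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for text in ["I woke up at 11", "The invoice total is 1234 dollars"]:
    tokens = tokenizer.tokenize(text)
    print(text, "->", tokens)
    # Numbers outside the vocabulary get split into arbitrary subword pieces
    # (e.g. "1234" may come out as something like ["123", "##4"]); in older
    # word-level pipelines, rare numbers were simply mapped to UNK.
```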
Better number representations would benefit number-heavy domains such as scientific articles and financial documents. They would also help with detecting sarcasm and modeling dialogue that involves price negotiations.
Numeracy Tasks
There are various tasks involving numeracy. We will look at tasks that fall along the following two categories: granularity and units.
Granularity: whether the encoding of a number is exact (e.g., "birds have two legs") or approximate (e.g., "the boy is about 160 cm tall").
Units: whether numbers are abstract or grounded. Abstract numbers do not specify a unit; grounded numbers do (e.g., 1 apple), and grounded tasks require understanding the context the unit provides.
Within these two categories, there are seven standard tasks (a few illustrative examples follow the list):
Simple Arithmetic: Applying basic arithmetic operations such as addition and subtraction over numbers alone.
Numeration: Decoding the numerical value of a string, e.g., mapping the string "19" to the value 19.0.
Magnitude Comparison: Comparing two or more numbers to determine which is largest.
Arithmetic Word Problems: The math word problems found in school textbooks; these are generally grounded versions of simple arithmetic.
Exact Facts: Numbers embedded in commonsense knowledge. Knowing that "a dice has 6 sides" lets a model judge the statement "a dice has 5 sides" as false.
Measurement Estimation: Approximately estimating the measurements of objects along some dimension, e.g., how many seeds a watermelon has.
Numerical Language Modeling: Making numeric predictions over completely unlabeled, naturally occurring text.
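To make the task definitions concrete, here is a small hand-written sketch of how a few of them can be framed as input/target pairs; the examples simply mirror the ones above and are not taken from any benchmark.

```python
# Hand-written (input, target) examples for a few numeracy tasks.
# Purely illustrative; not drawn from any benchmark dataset.
numeration = [("19", 19.0), ("80", 80.0)]                 # string -> numeric value
magnitude_comparison = [((23, 32), 32), ((7, 3), 7)]      # pick the larger number
exact_facts = [("a dice has 6 sides", True),
               ("a dice has 5 sides", False)]             # judge commonsense statements
measurement_estimation = [("seeds in a watermelon", "hundreds")]  # rough, order-of-magnitude answer

# A trivial check that the comparison targets are consistent:
assert all(max(pair) == answer for pair, answer in magnitude_comparison)
```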
Methods for Number Representations
There are many ways to represent numbers, and the representations can be grouped in several ways. For this article, we focus on methods for encoding numbers (number -> embedding) and/or decoding them (embedding -> number). There are two broad types of representations: string-based and real-based.
String-based representations treat numbers the same way as words: each number is assigned a token id, and its embedding is looked up in the model's embedding table. A few tweaks can improve on this default, namely notation, tokenization, and pooling.
Notation changes the surface form of a number before tokenization. For example, "80" can be rewritten in scientific notation (8e1) or in English words (eighty).
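As a small illustration of the notation tweak, the sketch below rewrites a number in scientific notation and in English words; it assumes the third-party num2words package for the word form, and any equivalent conversion would do.

```python
# Minimal sketch of the "notation" tweak: rewrite a number's surface form
# before tokenization. `num2words` is a third-party package (pip install num2words).
from num2words import num2words

def to_scientific(n: float) -> str:
    # "80" -> "8e+01"; models like NumBERT train on a form like this.
    return f"{n:.0e}"

def to_words(n: int) -> str:
    # "80" -> "eighty"
    return num2words(n)

print(to_scientific(80))  # 8e+01
print(to_words(80))       # eighty
```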
Tokenization changes how the text is tokenized, for example moving from word-level to character- or digit-level tokenization. An issue with word-level tokenization is that numbers frequently get mapped to UNK tokens.
Pooling determines how a number that has been split across multiple tokens gets a single embedding. For example, 100 might be tokenized as 10 and 0, or as 1, 0, and 0, and the token embeddings can then be pooled into one vector. By default, since numbers are treated as words, no special pooling is used.
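To make the tokenization and pooling tweaks concrete, here is a minimal PyTorch sketch assuming digit-level tokenization and simple mean pooling; the embedding size is an arbitrary choice, and the DigitRNN/DigitCNN methods described later replace the mean with an RNN or CNN.

```python
import torch
import torch.nn as nn

# Digit-level tokenization: one token per character of the number.
def digit_tokenize(number: str) -> list[int]:
    return [int(ch) for ch in number]          # "100" -> [1, 0, 0]

digit_embedding = nn.Embedding(num_embeddings=10, embedding_dim=16)

def embed_number(number: str) -> torch.Tensor:
    ids = torch.tensor(digit_tokenize(number))
    vectors = digit_embedding(ids)             # (num_digits, 16)
    return vectors.mean(dim=0)                 # pooled into a single (16,) vector

print(embed_number("100").shape)  # torch.Size([16])
```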
Real-based representations treat numbers by their numerical value and use that value directly in the computation. They are usually expressed as encoders and/or decoders, and surveying real-based methods comes down to three choices: direction (encode, decode, or both), scale (linear vs. log), and discretization (binning vs. continuous values).
Direction specifies whether the method encodes numbers, decodes them, or does both.
Scale specifies how the numerical value is scaled, generally by a linear or logarithmic factor.
Discretization specifies whether numbers are binned, with an embedding learned per bin, or kept as continuous values. The bins themselves can be on a linear or log scale.
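To make the scale and discretization choices concrete, here is a minimal sketch; the base-10 log and the bin boundaries are arbitrary choices for illustration, not anything prescribed by the surveyed methods.

```python
import math

# Log-scale a number's value (the "scale" choice).
def log_scale(value: float) -> float:
    return math.log10(value)

# Discretize into log-spaced bins (the "discretization" choice).
# Bin i covers [10**i, 10**(i+1)); boundaries here are arbitrary.
def log_bin(value: float, num_bins: int = 8) -> int:
    return min(int(math.log10(value)), num_bins - 1) if value >= 1 else 0

for v in [3, 42, 1_500, 2_000_000]:
    print(v, "log:", round(log_scale(v), 2), "bin:", log_bin(v))
# A model would then learn one embedding per bin instead of one per surface form.
```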
Survey of Existing NLP Methods
String-Based Representation Methods:
The following methods are prominent string-based methods used in previous works.
Word Vectors & Contextualized Embeddings: The typical NLP treatment of numbers. These act as baselines for comparison with the other methods.
GenBERT: A model built on top of pre-trained BERT that tokenizes numbers at the digit level; it works well on word problems and simple arithmetic, and acts as both an encoder and a decoder.
NumBERT: A BERT model trained from scratch on a corpus in which all numbers are converted to scientific notation. It uses subword tokenization and no pooling.
DigitRNN, DigitCNN: Methods that pool single-digit embeddings to represent full numbers, using an RNN or a CNN for the pooling (a minimal sketch follows this list).
DigitRNN-sci, Exponent: Similar to DigitRNN but operating on scientific notation; the Exponent embedding simply learns a lookup table over the exponent.
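As noted above, here is a minimal PyTorch sketch of the DigitRNN idea: embed each digit and pool the digit embeddings with a recurrent network to get one vector per number. This is a simplified reconstruction under assumed layer sizes, not the authors' implementation; DigitCNN would swap the GRU for a 1-D convolution.

```python
import torch
import torch.nn as nn

class DigitRNN(nn.Module):
    """Pool single-digit embeddings into one number embedding with a GRU.

    Simplified reconstruction of the DigitRNN idea; hidden sizes are arbitrary.
    """

    def __init__(self, embedding_dim: int = 32, hidden_dim: int = 64):
        super().__init__()
        self.digit_embedding = nn.Embedding(10, embedding_dim)
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, number: str) -> torch.Tensor:
        ids = torch.tensor([[int(ch) for ch in number]])   # (1, num_digits)
        vectors = self.digit_embedding(ids)                # (1, num_digits, embedding_dim)
        _, last_hidden = self.gru(vectors)                 # (1, 1, hidden_dim)
        return last_hidden.squeeze()                       # (hidden_dim,)

encoder = DigitRNN()
print(encoder("1234").shape)  # torch.Size([64])
```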
Real-Based Representation Methods:
The following methods are prominent real-based methods used in previous works.
DICE: An encoder that preserves relative magnitude: the closer two numbers are in value, the closer their embeddings.
Value Embedding: A parameterized encoder for real numbers that feeds the scalar magnitude of the number through a shallow network (sketched, together with MCC, after this list).
Log Value: The same idea, but encoding the log-scaled value of the number. There is also an analogous log-scaled decoder called RGR (regression).
Log Laplace: A decoder based on a log-scaled Laplace distribution over the number's value.
Flow Laplace: A more expressive decoder that learns its own density mapping; since the scaling is learned by the model, it can in principle be anything.
MCC (multi-class classification): A decoder that outputs a distribution over log-scaled bins of numbers.
Discrete Latent Exponent (DExp): A decoder in which the model parameterizes a multinomial distribution over the exponent.
GMM: A Gaussian mixture model whose means and variances are pretrained over the numbers in the training corpus.
GMM-prototype: The same as GMM, except the model learns prototype embeddings and is encoder-only.
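To make a couple of the real-based methods concrete, here is a minimal PyTorch sketch of a Value/Log Value style encoder and an MCC-style decoder head. The layer sizes, the base-10 log, and the number of bins are assumptions for illustration, not the original implementations.

```python
import math
import torch
import torch.nn as nn

class ValueEncoder(nn.Module):
    """Value / Log Value embedding: feed the (optionally log-scaled)
    scalar magnitude through a shallow network to get an embedding."""

    def __init__(self, dim: int = 32, use_log: bool = True):
        super().__init__()
        self.use_log = use_log
        self.net = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, value: float) -> torch.Tensor:
        x = math.log10(value) if self.use_log else value
        return self.net(torch.tensor([[x]], dtype=torch.float32)).squeeze(0)

class MCCDecoder(nn.Module):
    """MCC-style decoder: predict a distribution over log-scaled bins
    of numbers from a context vector."""

    def __init__(self, context_dim: int = 32, num_bins: int = 8):
        super().__init__()
        self.head = nn.Linear(context_dim, num_bins)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        return self.head(context).softmax(dim=-1)   # probability per bin

encoder, decoder = ValueEncoder(), MCCDecoder()
embedding = encoder(1500.0)
print(decoder(embedding.unsqueeze(0)).shape)  # torch.Size([1, 8])
```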
Modeling Recommendations
For string-based representations, using scientific notation instead of decimal notation yields better results, and character- or digit-level tokenization beats word- or subword-level tokenization.
For real-based representations, using a log scale rather than a linear scale improves performance, and combining it with binning helps further.
For abstract tasks such as numeration and magnitude comparison, the DICE, Value, and Log embeddings perform exceptionally well.
Real-based methods work well on approximation tasks, such as measurement estimation and numerical language modeling.
Conclusion
In this article, we summarized recent advances and ongoing work on numeracy in NLP. We divided numeracy into seven general tasks and looked at methods for solving them. Numeracy in NLP is still in its infancy, with many open research questions and experiments left to explore. Improving the numeracy of NLP models will let them use numbers to understand the world better.