Grey swans, and what a bit of linguistics can teach you about machine learning
Grammar and inductive bias in the statistical modeling of natural language.
Inductive bias in machine learning
“All the swans we’ll ever see will be white!”
How confident are you in this prediction?
If you had seen only one white swan, you shouldn't be as confident as if you had seen thousands.
Also, what if you have seen several swans with varying shades of white, with some being on the borders of whiteness?
Here too, you should be less confident that there aren't any darker swans out there than if all the swans you’d seen were uniformly chalk white.
So statistical prediction has nuance. The nuances are inductive biases. The inductive bias of a machine learning algorithm is the set of assumptions that must be combined with the training data to transform the algorithm's outputs into logical deductions.
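To put rough numbers on the swan intuition, here is a minimal sketch using Laplace's rule of succession. The choice of a uniform prior over the fraction of white swans is my assumption for illustration, and it is itself an inductive bias. Note that the function estimates the chance that the next swan is white, which is a weaker claim than "all swans are white."

```python
from fractions import Fraction


def prob_next_white(white_seen, non_white_seen=0):
    """Posterior predictive probability that the next swan is white.

    Assumes a uniform Beta(1, 1) prior over the fraction of white swans
    (Laplace's rule of succession). The prior is an illustrative assumption.
    """
    total = white_seen + non_white_seen
    return Fraction(white_seen + 1, total + 2)


# Confidence grows with the number of white swans observed.
for n in (1, 10, 1000):
    print(n, float(prob_next_white(n)))
```

With one white swan the estimate is 2/3; with a thousand it is 1001/1002. A different prior would give different numbers from the same data, which is the point.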
While philosophers have been thinking about the problem of induction for centuries, Thomas Mitchell was one of the first to frame the problem in terms of machine learning. He showed in rigorous terms that machine learning requires inductive bias to work.
However, inductive bias involves assumptions that may be wrong. If those assumptions are wrong, then the prediction algorithm will perform poorly. For example, a set of assumptions that is true for the training data may not hold when we use the algorithm to make decisions in the real world. It is necessary for people who are implementing machine learning algorithms to be aware of those algorithms' inductive biases and understand when those biases may fail.
Inductive bias in natural language processing
Natural language processing (NLP) is an umbrella term for tasks related to how computers process and analyze natural language data. Natural language understanding (NLU) is a subfield of NLP concerned with programs comprehending text. We distinguish this from speech recognition, the task of converting spoken language to text.
Inductive bias appears in text preprocessing.
Text preprocessing refers to all the munging you do to a corpus before applying more canonical NLU modeling to it. Even this preprocessing involves inductive bias.
A token is some fundamental unit of natural language text, for example, a word, a phrase, a stem of a word, a phoneme, or a character. A corpus is a dataset of natural language text. A common preprocessing step is segmentation or tokenization, which converts raw natural language input into tokens. Many algorithms introduce "special tokens," meaning tokens that do not come from natural language words directly but instead add some useful information to NLU algorithms. For example, we may elect to use a separation token that indicates the end of sentences instead of punctuation because punctuation is ambiguous. Similarly, often we may introduce an "unknown" token to catch unrecognized words.
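As a rough sketch of what special tokens look like in practice, the toy tokenizer below marks sentence ends with a separation token and maps out-of-vocabulary words to an unknown token. The token strings "<eos>" and "<unk>", the function name, and the tiny vocabulary are all hypothetical choices for illustration, not the convention of any particular library.

```python
import re

EOS = "<eos>"  # hypothetical end-of-sentence token
UNK = "<unk>"  # hypothetical unknown-word token


def tokenize(text, vocab):
    """Split text into lowercase word tokens, append EOS after each
    sentence, and replace out-of-vocabulary words with UNK."""
    tokens = []
    # Naive sentence split on ., !, ? (this is exactly where punctuation
    # ambiguity, e.g. abbreviations, would cause mistakes).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        tokens.extend(w if w in vocab else UNK for w in words)
        tokens.append(EOS)
    return tokens


vocab = {"the", "dog", "chased", "rabbit"}
print(tokenize("The dog chased the rabbit. The hippo slept!", vocab))
# ['the', 'dog', 'chased', 'the', 'rabbit', '<eos>', 'the', '<unk>', '<unk>', '<eos>']
```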
Segmentation/tokenization itself is a non-trivial inductive inference problem. For example, German writes compound nouns without spaces: "computational linguistics" combines the German word for "computer" and the German word for "linguistics" to get "Computerlinguistik."
Chinese takes this problem to the extreme. Written Chinese does not separate words with spaces, as if sentences themselves were "compound words." For example, the phrase "只有河马在那里" means "there are only hippos there." The phrase "一只有身孕的河马在那里" means "there is a pregnant hippo there." Both phrases contain the character pair "只有." In the first phrase, "只有" is a single word meaning "only." In the second phrase, "只" and "有" are separate words: "只" is a measure word for animals (here, the hippo), and "有" is part of the adjective phrase that means "pregnant" ("有身孕的").
Tokenization algorithms need an inductive bias to decide whether inputs like "只有" and "Computerlinguistik" become one token or two. One common bias is to take the longest match from a large, manually segmented vocabulary. Because tokenization is an inductive inference problem, there will be mistakes, and a consistent, unique tokenization is never guaranteed.
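The longest-match bias can be sketched as greedy forward maximum matching. This is a minimal illustration with a hypothetical toy vocabulary, not a production segmenter; the pregnant-hippo phrase is written here with the numeral 一 at the front.

```python
def longest_match_tokenize(text, vocab):
    """Greedy forward maximum matching: at each position take the longest
    vocabulary entry that matches, falling back to a single character."""
    max_len = max(map(len, vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in vocab:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary covering the two hippo phrases.
vocab = {"只有", "河马", "在", "那里", "一", "只", "有", "身孕", "的"}

print(longest_match_tokenize("只有河马在那里", vocab))
# ['只有', '河马', '在', '那里']
print(longest_match_tokenize("一只有身孕的河马在那里", vocab))
# ['一', '只有', '身孕', '的', '河马', '在', '那里']
```

The first phrase comes out right. In the second, the longest-match bias glues "只" and "有" together into "只有," which is the kind of mistake the bias makes when its assumption does not hold.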
Another approach is to abandon word-based indexing and index only short subsequences of characters, regardless of whether a particular sequence crosses a word boundary. This approach is an appealing solution for Chinese (and other written languages that use glyphs) because, unlike letters in an alphabet, Chinese characters have stand-alone meaning. The meanings of many (though not all) Chinese words are compositions of the meanings of their characters. For example, Chinese speakers have tended to name novel things using compounds of character primitives. "Hippo" combines the characters for "river" and "horse" into "river horse." Lobster is "dragon shrimp," and onions are "foreign shallots."
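Indexing by short character subsequences can be sketched as overlapping character n-grams. The function name and the default of n = 2 are my choices for illustration.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams, ignoring word boundaries."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("只有河马在那里"))
# ['只有', '有河', '河马', '马在', '在那', '那里']
```

Some of these bigrams are real words ("只有," "河马," "那里") and some straddle word boundaries ("有河," "马在"). The approach makes no attempt to tell them apart and leaves that to the downstream model.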
Semantics vs. Syntax
In the context of natural language, syntax refers to the study of the rules of grammar that determine whether the way the words in an utterance are combined is valid or invalid. Semantics refers to the study of the actual meaning of an utterance.
For example, consider the following sentences.
The dog through the field the rabbit chased.
The rug believes steel chocolate.
I didn't rob the bank, the bank robbed me!
“The dog through the field the rabbit chased” is syntactically invalid, meaning it violates the grammatical rules of English. A syntactically valid alternative would be, "The dog chased the rabbit through the field." "The rabbit chased the dog through the field" would also be syntactically valid. However, these are semantically different. Imagine that this was a grammatical error made by a non-native speaker, and you were tasked with correcting the text. Which one of these syntactically valid alternatives would you choose as the correct form?
This proofreading task is an inference task. You would employ an inductive bias, and it would derive from your understanding of real-world concepts of dog and rabbit and their potential interactions.
The second sentence is syntactically valid; it does not violate the rules of English. However, semantically speaking, the sentence is meaningless.
The third sentence is syntactically valid. Further, a syntactic analysis tells us that the subject and the object in the first part of the sentence switched roles in the second part of the sentence. However, a semantic analysis is much richer. The first part of the sentence suggests a criminal act, perhaps robbing at gunpoint or wire fraud. The second part of the sentence suggests possibly a metaphorical crime rather than a literal one. Perhaps the speaker is speaking in frustration about high interest rates, or a foreclosure, or high fees. That said, it is possible they mean an actual crime, such as the Wells Fargo account fraud scandal. Just like in the first example, a human's understanding of a sentence's semantics relies on the concepts they learned by living a life.
Syntax determines the statistical patterns that NLU models learn.
The syntax of a natural language is a set of rules. However, if you have ever learned a second language, you'll remember that the hard part is not learning the basic grammatical rules but learning all the exceptions and special cases. Throughout the history of AI, researchers' efforts to write these rules as program logic have famously failed to capture the richness of natural language. Further, those rules are in constant flux as the culture evolves.
In contrast, language models in NLU excel at encoding complex syntactic variety. They do so by learning statistical regularities between tokens. In particular, deep neural network language models can learn syntax that is far too complex to code manually, given enough data.
The inductive biases of a language model will bias the model towards certain syntactic features of a corpus. A syntactic feature is a way of characterizing the syntax of a target corpus. Some syntactic features include:
incorporation - when verbs form compounds with their direct object or adverbial modifier
grammatical case - how a noun phrase's form depends on its case
agreement - when a word's form depends on which words it co-occurs with
word order - the ordering of the subject, verb, and object in a phrase and how close they are to one another
dative construction - whether or not the subject and the object can switch places for a given verb without altering the verb's structure
verbal intransitivity - whether a verb takes a subject but no object (an intransitive verb)
Many syntactic features directly affect the statistical patterns between tokens that language models learn. Syntactic features determine statistical regularities such as:
co-occurrence between tokens
the orderings of tokens that co-occur
the distance between (number of tokens between) tokens that co-occur
the absolute position of tokens and sets of co-occurring tokens within a phrase, sentence, or longer utterance, as well as position relative to other tokens and co-occurring tokens
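The first three regularities in this list can be made concrete with a small counting sketch. The function name, the window size, and the two-sentence corpus are hypothetical; real language models learn these patterns implicitly rather than through explicit counts like these.

```python
from collections import Counter, defaultdict


def pair_statistics(sentences, window=4):
    """Count ordered token pairs that co-occur within `window` tokens
    and record the distance between them each time."""
    counts = Counter()
    distances = defaultdict(list)
    for tokens in sentences:
        for i, left in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                pair = (left, tokens[j])  # order matters: left comes first
                counts[pair] += 1
                distances[pair].append(j - i)
    return counts, distances


corpus = [
    "the dog chased the rabbit".split(),
    "the dog chased the cat through the field".split(),
]
counts, distances = pair_statistics(corpus)

print(counts[("dog", "chased")], counts[("chased", "dog")])  # 2 0
print(distances[("dog", "rabbit")])  # [3]
```

In this toy corpus "dog" always precedes "chased" and never follows it, which is the kind of ordering regularity that English subject-verb-object syntax produces.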
Syntactic features impact the performance of a learning algorithm.
Consider the example of word order. 86.5% of languages use either subject-object-verb or subject-verb-object order, 9% use verb-subject-object order, and object-verb-subject and object-subject-verb languages are extremely rare (Tomlin 2014).
Ravfogel et al. 2019 analyzed the sensitivity of recurrent neural networks (a deep neural network architecture commonly used for language models) on a prediction task and found that performance was higher in subject-verb-object ordering than in subject-object-verb ordering.
Higher performance on subject-verb-object orderings might occur because recurrent neural networks have an inductive bias that assumes words that are close together are more likely to be related. Word order affects how close related words are in the text. For example, Japanese has a subject-object-verb ordering, and much can happen between the subject and the verb. That distance may make it hard for statistical learning algorithms to learn the relationship.
Another example of a syntactic feature that impacts statistical learning is verbal intransitivity; intransitive verbs have a subject but no object. Many prediction tasks will depend on whether a noun phrase is correctly identified as a subject or object, and the presence or absence of an intransitive verb is a helpful clue.
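A toy illustration of the distance effect: the sketch below lays out the same English constituents in two orders and counts the tokens that fall between the subject's head noun and the verb. It assumes the head is the last token of the subject phrase and that the verb is a single token. Both are simplifications of mine, and the reordered English is synthetic rather than real Japanese.

```python
def linearize(subject, verb, obj, order):
    """Concatenate constituents according to an order string such as 'SVO'."""
    parts = {"S": subject, "V": verb, "O": obj}
    return [token for role in order for token in parts[role]]


def subject_verb_gap(subject, verb, obj, order):
    """Number of tokens between the subject head and the verb.

    Assumes the subject head is the last token of the subject phrase
    (an illustrative simplification).
    """
    parts = {"S": subject, "V": verb, "O": obj}
    start = {}
    offset = 0
    for role in order:
        start[role] = offset
        offset += len(parts[role])
    subject_head = start["S"] + len(subject) - 1
    verb_position = start["V"]
    return abs(verb_position - subject_head) - 1


subject = ["the", "dog"]
verb = ["chased"]
obj = ["the", "small", "grey", "rabbit"]

for order in ("SVO", "SOV"):
    print(order, " ".join(linearize(subject, verb, obj, order)),
          subject_verb_gap(subject, verb, obj, order))
# SVO the dog chased the small grey rabbit 0
# SOV the dog the small grey rabbit chased 4
```

Under subject-object-verb order, every token added to the object pushes the verb further from its subject, so a model biased toward nearby words has more to overcome.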
Why think about all this when we can just use cutting-edge deep learning?
The cutting edge of natural language modeling is currently transformer networks. In the next post, I’ll look at why examining inductive bias is essential to understanding what transformer networks can and can’t do.
Go Deeper
Ravfogel, S., Goldberg, Y., & Linzen, T. (2019). Studying the inductive biases of RNNs with synthetic variations of natural languages. arXiv preprint arXiv:1903.06400.
Tomlin, R. S. (2014). Basic Word Order (RLE Linguistics B: Grammar): Functional Principles. Routledge.