Parameterization and its structure

Another macro trend you need in order to understand the arc of AI is parameterization. This roughly corresponds to how many pieces of information are used to formulate a solution to a problem.

Imagine an automated bartender. A single-parameter AI might consider only whether the bar is open when deciding whether to sell a beer.

  • If the bar is open, sell the beer.

But this isn't good enough to get a liquor license. You also have to make sure the customer is of legal drinking age. So we pursue a two-parameter solution:

  • If the bar is open and the customer's age ≥ 21, sell the beer.

This clears the regulatory hurdle, but it's unneighborly to let someone continue drinking if they've had too much (more parameters). For business reasons we'll need to decline absurd orders and forbid service to known troublemakers (more parameters). And we should remember to withhold sales of special bottles being saved for the holiday party (even more parameters).

Some of these decisions themselves require a complex bundle of parameters. We might detect whether someone has had too much to drink based on their age, gender, weight, and order history (add another ten or so parameters). Or analyze their gait and speech using security cameras (add a few million parameters).
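
To make that growth concrete, here is a minimal sketch of such a rule-based bartender in Python. The thresholds, names, and checks are hypothetical, invented purely for illustration; the point is that every new business concern shows up as another hand-written condition, i.e. another parameter.

```python
# A hypothetical rule-based bartender. Each condition is, roughly, one parameter.
LEGAL_DRINKING_AGE = 21               # parameter 2
MAX_DRINKS_PER_VISIT = 4              # parameter 3 (a crude "had too much" check)
MAX_BEERS_PER_ORDER = 10              # parameter 4 (decline absurd orders)
TROUBLEMAKERS = {"Murph"}             # parameter 5 (known troublemakers)
RESERVED_BOTTLES = {"holiday-stout"}  # parameter 6 (saved for the holiday party)

def sell_beer(bar_open, age, drinks_so_far, order_size, name, bottle):
    """Return True if the sale should go through under our hand-written rules."""
    if not bar_open:                          # parameter 1
        return False
    if age < LEGAL_DRINKING_AGE:
        return False
    if drinks_so_far >= MAX_DRINKS_PER_VISIT:
        return False
    if order_size > MAX_BEERS_PER_ORDER:
        return False
    if name in TROUBLEMAKERS:
        return False
    if bottle in RESERVED_BOTTLES:
        return False
    return True

# Example: a sober 30-year-old ordering two ordinary beers gets served.
print(sell_beer(True, 30, 1, 2, "Alice", "house-lager"))  # True
```

Six hand-written conditions are manageable; the gait-and-speech analysis from the security cameras clearly isn't something we could spell out this way.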

As we approach the nuance humans use in the real world, the parameterization required to truly solve a problem grows and grows and grows. So too has the parameterization of our models grown over time.

The arc of parameterization

There are three rough checkpoints you should know about to understand the trends happening today.

Rule Systems: Human-sized Parameterization

The parameterization of rule systems is tightly coupled to the domain knowledge being represented. Each rule written down is roughly one parameter in the system, and the rules are intentionally arranged so that they're not independent of one another: some rules are contingent on others. In theory, careful structuring of rule dependencies could allow humans to add arbitrarily many special cases that only trigger at the right moment. But in practice, writing and maintaining these special cases at scale is too hard for humans to do well.
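
As a hedged illustration of what "contingent" means here, the toy rule below continues the bartender example with an invented manager-approval exception; its special case is only ever evaluated after the more general rules above it have been settled.

```python
# Contingent rules: the inner checks only matter once the outer ones have fired.
def can_pour_special_bottle(bar_open, is_regular, bottle_reserved, manager_approved):
    if not bar_open:
        return False      # base rule: nothing happens when the bar is closed
    if not bottle_reserved:
        return True       # ordinary bottles need no special case at all
    # Special case, contingent on the two rules above: a reserved bottle may
    # still be poured for a regular, but only with the manager's sign-off.
    if is_regular and manager_approved:
        return True
    return False
```

Every additional exception has to be threaded through this chain by hand, which is exactly the maintenance burden that caps how far rule systems can scale.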

Machine Learning: Large but Unwieldy Parameterization

In the machine learning era, model parameterization increased dramatically via a practice called feature engineering. Engineers would write code to transform an input into every atomic descriptor that could be applied to it. A simple text document might be transformed into a list of:

  • every word in the document (each word is called a unigram)
  • every pair of words in the document (each pair is called a bigram)
  • every word stem of the above (the word stem of running is run-)
  • every word shape of the above (the word shape of Disney World man is Aaa Aaa aaa)
  • all of those, combined with the page number they were on
  • all of those, combined with the paragraph number they were in

And so on. This process would result in documents represented as arrays with millions of features. The power of this approach is that it represents a document in a form computers can manipulate using math instead of language understanding. The computer can learn, for example, that first names on a PDF form (Feature #4,304,884) tend to come immediately after the bigram *First Name* (Feature #12,443,129).
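
Below is a minimal sketch of this style of feature extraction, assuming a whitespace tokenizer, a toy suffix-stripping stemmer, and an uncollapsed word-shape function (unlike the abbreviated Aaa shorthand above); real pipelines used far richer tooling. In a real system each distinct string produced here would be assigned its own feature index, which is how the counts climb into the millions.

```python
def word_shape(token):
    """Map each character to A (uppercase), a (lowercase), or 0 (digit)."""
    return "".join(
        "A" if c.isupper() else "a" if c.islower() else "0" if c.isdigit() else c
        for c in token
    )

def crude_stem(token):
    """A toy stemmer: strip a few common English suffixes ('running' -> 'run-')."""
    for suffix in ("ning", "ing", "ed", "s"):
        if token.lower().endswith(suffix) and len(token) > len(suffix) + 2:
            return token.lower()[: -len(suffix)] + "-"
    return token.lower()

def extract_features(text, page, paragraph):
    """Expand a text into the kind of flat feature list described above."""
    tokens = text.split()
    features = []
    features += [f"unigram={t}" for t in tokens]
    features += [f"bigram={a}_{b}" for a, b in zip(tokens, tokens[1:])]
    features += [f"stem={crude_stem(t)}" for t in tokens]
    features += [f"shape={word_shape(t)}" for t in tokens]
    # Cross everything so far with its location, multiplying the count again.
    features += [f"page{page}|{f}" for f in list(features)]
    features += [f"para{paragraph}|{f}" for f in list(features)]
    return features

print(extract_features("First Name John", page=1, paragraph=3)[:5])
# ['unigram=First', 'unigram=Name', 'unigram=John', 'bigram=First_Name', 'bigram=Name_John']
```

Crossing every feature with page and paragraph numbers multiplies the feature count yet again, which is why even a short document explodes into a huge, sparse array.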

But this approach had a serious problem: overfit. Programmers were expanding the parameter space far beyond their ability to meaningfully engineer (via problem representation) or learn (via parameter tuning) any structure governing how these features should or shouldn't interact with each other. And so there was often nothing stopping the model from simply memorizing all the first names in the training set, because they were present in the problem representation as well (John Doe - Feature #1,123,124; Jane Smith - Feature #6,222,132).

Deep Learning: Jaw-droppingly Enormous Parameterization

In the deep learning era, model parameterization has grown once more, making classic models look small by comparison. Parameter counts for the world's best language models are now in the tens of billions.
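
To get a feel for where numbers like that come from, here is a back-of-the-envelope count for a hypothetical transformer-style language model. The dimensions are invented for illustration (not any specific released model), and the per-layer approximation (four attention projection matrices plus a two-matrix feed-forward block with a 4x hidden width) ignores biases, normalization, and output layers.

```python
# Rough parameter count for a hypothetical transformer-style language model.
vocab_size = 50_000       # assumed vocabulary size
d_model    = 7_168        # assumed width of each layer
n_layers   = 48           # assumed number of layers

embedding_params   = vocab_size * d_model          # token embedding table
attention_params   = 4 * d_model * d_model         # Q, K, V, and output projections
feedforward_params = 2 * d_model * (4 * d_model)   # up- and down-projection matrices
per_layer = attention_params + feedforward_params  # roughly 12 * d_model**2

total = embedding_params + n_layers * per_layer
print(f"{total / 1e9:.1f} billion parameters")     # roughly 30 billion
```

Almost all of the parameters sit in dense matrices whose sizes grow with the square of the model width, which is why the totals climb so quickly as models get wider and deeper.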

When these monster-sized models were first introduced, many machine learning practitioners recoiled: wasn't this a recipe for overfit disaster? But it soon became apparent that something else was going on. Because deep learning learns a problem's representation alongside its solution, the representation can adopt nested structures that fight overfit in the solution, almost like a rule system hidden within it.

This is a drastic simplification, and there are many other things at play, like large datasets and overfit-reducing training strategies, but in general the takeaway you should have is this: deep learning models are capable of incorporating breathtaking volumes of information into their decision-making process. This is fantastic for performance (greater nuance for our automated bartender) but also a cause for caution and oversight (unwanted correlations creeping into those decisions).