algorithm: Nonlinear Function
Created: November 29, 2022
Modified: November 29, 2022

algorithm

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

Fundamentally an algorithm is any computational procedure: something that takes in data and spits out some function of that data.

Computer science draws a formal distinction between an algorithm and the function that it computes. For example, quicksort, merge sort, bubble sort, and bogosort are all sorting algorithms. They all compute the same function of their input, but differ in their internal structure and performance characteristics. Knowing that we want to sort a list of numbers, we can consider which algorithm to use in terms of purely computational tradeoffs: runtime, memory usage, parallelizability, etc.
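
As a concrete illustration of the distinction, here is a minimal Python sketch (my own, not from any particular textbook): two structurally different sorting algorithms that compute exactly the same function of their input.

```python
# Two different algorithms, one function: both return the sorted list.
def merge_sort(xs):
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):  # merge the two sorted halves
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

def bubble_sort(xs):
    xs = list(xs)
    for end in range(len(xs), 1, -1):  # repeatedly bubble the max toward the end
        for k in range(end - 1):
            if xs[k] > xs[k + 1]:
                xs[k], xs[k + 1] = xs[k + 1], xs[k]
    return xs

data = [5, 3, 8, 1, 9, 2]
# Same function, different algorithms: O(n log n) vs O(n^2) comparisons.
assert merge_sort(data) == bubble_sort(data) == sorted(data)
```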

By contrast, in popular discourse the word 'algorithm' is often used to refer to the function itself. When people criticize the 'Twitter algorithm', they are referring to the actual rankings of tweets, not the specific pattern of computations that Twitter performs to produce those rankings. Perhaps people should speak more precisely about the 'Twitter ranking function' in such cases. However, the laziness of the discourse is not entirely unwarranted, since (unlike sorting a list of integers) there is no Platonic definition of the Twitter ranking function, separate from its implementation: the algorithm defines the function. It is logically possible that we could find another algorithm to compute the same function, but it's not clear how we would go about looking for such a thing, and indeed there are so many possible functions relative to the space of short algorithms^[Say there are roughly $2^7 = 128$ characters (including emojis, etc.) commonly used in English-language tweets. Then there are $128^{280}$ possible tweets, and $(128^{280})!$ possible rankings of those tweets. Suppose the Twitter ranking algorithm is represented by 1 GiB $= 2^{33}$ bits of weights, so there are $2^{2^{33}}$ computations representable within this space. A very rough application of Stirling's approximation estimates that the first quantity has logarithm

$$\log (128^{280}!) \approx 128^{280} \cdot (280 \log 128 - 1) \approx 128^{281} = 2^{1967},$$

while the second has logarithm approximately $2^{33}$ (at this level of approximation, the base of the logarithm isn't particularly relevant). The former quantity is so unimaginably large compared with the second that it's vastly unlikely that any given ranking function is implementable at all, and given a single implementation, vastly unlikely that there exists a second. Even allowing the model to be specified by a terabyte ($2^{43}$ bits) or even a petabyte ($2^{53}$ bits) of weights barely affects the comparison.] that it's unlikely in general that a random function will have multiple short algorithms (this gestures at the hypothesis that computation is important).
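
As a sanity check on the footnote's arithmetic, here is a small Python sketch (my own; the constants 128, 280, and $2^{33}$ are the footnote's assumptions). Done slightly less roughly, the Stirling estimate lands near $2^{2^{1971}}$ rankings rather than $2^{2^{1967}}$, but the comparison with $2^{2^{33}}$ programs is just as lopsided.

```python
import math

# The factorial of 128**280 is far too large to compute directly, so we work
# with logarithms of logarithms instead.
log2_num_tweets = 280 * math.log2(128)  # log2(128^280) = 1960

# Stirling: ln(n!) ~ n(ln n - 1), i.e. log2(n!) ~ n(log2 n - log2 e).
# Taking log2 once more keeps everything within float range.
log2_log2_rankings = log2_num_tweets + math.log2(log2_num_tweets - math.log2(math.e))

print(f"#possible rankings  ~ 2^(2^{log2_log2_rankings:.1f})")  # about 2^(2^1971)
print("#1 GiB programs     = 2^(2^33)")
```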

This is generally true of computations implemented by machine learning models: there is usually no short specification of the function to compute (akin to the task of sorting), beyond the model weights themselves. The closest we get is a specification of the task in terms of the loss function and data used to train the model. This specification is generally, but not always, larger than the trained model. We can then talk about whether a given task admits a solution at a given level of test loss by a model at a given level of complexity (number of weights, FLOPs, etc.), analogous to the classical analysis of the complexity of algorithms that implement a given function.
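
To make the analogy concrete, here is a toy sketch (my own setup, not from the note): the task specification is just a loss function plus data, and we ask how complex a model must be to reach a given test loss, mirroring the classical question of how much time or memory an algorithm for a fixed function requires.

```python
import numpy as np

# Task "specification": squared-error loss plus noisy samples of sin(3x).
rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 200)
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=x_train.size)
x_test = rng.uniform(-1, 1, 200)
y_test = np.sin(3 * x_test) + 0.1 * rng.normal(size=x_test.size)

def test_loss(degree):
    """Fit a polynomial of the given degree and return its test MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)  # "training"
    pred = np.polyval(coeffs, x_test)
    return np.mean((pred - y_test) ** 2)

# How many parameters does this task demand at a given level of test loss?
for degree in (1, 3, 5, 9):
    print(f"{degree + 1:2d} parameters -> test MSE {test_loss(degree):.4f}")
```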


Separately, even within formal computer science there is some variation in the level of abstraction at which algorithms are defined. We generally agree that algorithms exist at a higher level than actual code: merge sort instantiated in Rust, Python, or JavaScript is the same algorithm in each case, but a different program. Some 'algorithms' exist at even higher levels of abstraction: expectation-maximization is a kind of meta-algorithm, a recipe for statistical learning consisting of abstract steps that can be instantiated (or approximated) in a variety of ways for different problems.
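
As an illustration of the meta-algorithm point, here is a minimal sketch (my own, assuming known and equal unit variances for simplicity) of the abstract expectation-maximization recipe instantiated for one concrete problem, a two-component 1-D Gaussian mixture. The E-step/M-step structure is the reusable recipe; everything inside each step is problem-specific.

```python
import numpy as np

# Synthetic data: a 30/70 mixture of unit-variance Gaussians at -2 and 3.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

mu = np.array([-1.0, 1.0])       # initial guesses for the component means
weights = np.array([0.5, 0.5])   # initial mixing proportions

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (the Gaussian normalizing constant cancels because variances are equal).
    dens = weights * np.exp(-0.5 * (data[:, None] - mu) ** 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate means and mixing proportions from responsibilities.
    mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    weights = resp.mean(axis=0)

print(mu, weights)  # should land near means (-2, 3) and proportions (0.3, 0.7)
```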