teaching machine learning: Nonlinear Function
Created: April 16, 2020
Modified: January 25, 2022

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
  • Rob wants to firm up his foundations. He wants to understand relevant stats, probabilistic models, and inference, and maybe we'll work our way up to deep graphical models. Stuff for Rob:
    • What are the foundations? (first I'll list topics. but then think of SCAFFOLDING)
      • bayesian vs frequentist. different QUESTIONS.
      • Bayesian inference
        • bayes rule
        • prior, likelihood, posterior, predictive distribution.
        • utilities and decision-making.
          • EXERCISE: make Rob work this through for a simple example. like?
            • making a decision based on a single parameter. maybe COVID Rt.
        • normalizing constants. why are they important? why is finding them hard?
        • exponential families
        • conjugacy
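        • A minimal sketch of these pieces together, using a Beta-Bernoulli coin model of my own choosing with hypothetical data: conjugacy gives the posterior and posterior predictive in closed form, and the normalizing constant never has to be computed explicitly.

          ```python
          import numpy as np

          # Beta(a, b) prior on the coin's heads-probability theta.
          a, b = 2.0, 2.0

          # Hypothetical observed flips (1 = heads).
          flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])
          heads = flips.sum()
          tails = len(flips) - heads

          # Conjugacy: Beta prior x Bernoulli likelihood -> Beta posterior.
          post_a, post_b = a + heads, b + tails

          # Posterior mean; for a Bernoulli likelihood this is also the
          # posterior predictive P(next flip = heads).
          predictive_heads = post_a / (post_a + post_b)

          print(f"posterior: Beta({post_a:.0f}, {post_b:.0f}), "
                f"P(next flip = heads) = {predictive_heads:.3f}")
          ```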
      • The Statistical Rethinking intro chapters are mostly conceptual, but really good: http://xcelab.net/rmpubs/sr2/statisticalrethinking2_chapters1and2.pdf
      • Concept of probabilistic programming. We could use TFP!!! But honestly we should start with Stan, because its inference is more turnkey and we can't be fucking around with inference problems at first. (For learning it might be good to understand why inference is hard, but not at first.) Let's move to TFP when we get to discussing inference.
        • TODO: look for examples of courses taught with PPLs
        • A program is a sampling process.
          • Examples:
            • random-effects regression
            • A hidden markov model
            • Sigvisa
          • Parameters vs latent variables. Latent variables are part of the process. Parameters are inputs.
        • Inference as rejection sampling. It's so good. (Sketch below.)
        • Back off to graphical models: a graphical model is just a program where the control flow is always the same. It's the common notation in the literature.
        • A 'sampling program' is the most general concept you can have, because programs are the most general structure there is. But to do inference, we sometimes need to take advantage of model-specific structure: we may need to zoom in and reason about graph structure, or exploit particular properties of particular distributions, etc.
        • Black-box vs white-box inference:
          • black-box inference methods are ones that can be mechanically applied to any model.
          • a white-box inference method is anything that requires specific derivation.
            • examples: Gaussian integrals, discrete graphical models, anything involving conjugacy.
          • in reality all inference methods exploit structure. It's just a question of the API they expect.
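        • A minimal sketch of both ideas at once (my own toy example): the model is literally a Python function that samples a possible world, and rejection sampling does inference by keeping only the worlds whose simulated data match the observations.

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          def sample_world():
              """One run of the generative program: a coin with unknown bias."""
              theta = rng.beta(2, 2)            # latent variable
              flips = rng.random(8) < theta     # observable variables
              return theta, flips

          observed = np.array([1, 0, 1, 1, 0, 1, 1, 1], dtype=bool)

          # Rejection sampling: run the program many times and keep only the runs
          # whose simulated flips exactly match the observed flips.
          accepted = []
          for _ in range(200_000):
              theta, flips = sample_world()
              if np.array_equal(flips, observed):
                  accepted.append(theta)

          print(f"accepted {len(accepted)} worlds, "
                f"posterior mean of theta ~= {np.mean(accepted):.3f}")
          ```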
      • Models: the goal here is to reinforce the previous concepts through examples. What does the sampling process look like? What variables would we condition on? What is the predictive distribution? (Linear-regression sketch at the end of this list.)
        • simple probabilistic models in genomics?
        • Linear / polynomial regression
          • hierarchical / mixed-effects regression
        • State-space models
        • Dimensionality reduction (PCA)
        • Matrix factorization
        • Matchbox: adding bells and whistles to the store
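        • To make those questions concrete for the simplest model on this list, a sketch (my own toy setup) of linear regression written as a sampling process: inference would condition on the observed y values, and the predictive distribution is over y at new x values.

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          def sample_regression_world(x, noise_sd=0.5):
              """Generative view of linear regression: sample parameters, then data."""
              slope = rng.normal(0.0, 1.0)       # parameters drawn from priors
              intercept = rng.normal(0.0, 1.0)
              y = slope * x + intercept + rng.normal(0.0, noise_sd, size=x.shape)
              return (slope, intercept), y       # latent world plus simulated observations

          x = np.linspace(0.0, 1.0, 10)
          (slope, intercept), y = sample_regression_world(x)
          print(slope, intercept, y[:3])
          ```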
      • Inference
        • exact vs approximate inference
        • Optimization:
          • MLE
          • MAP by gradient ascent
        • Variational optimization. instead of fitting a point mass, fit a Gaussian. (or whatever).
          • EM is the degenerate case where we fit by coordinate ascent on parameters and latents, with the parameters represented by a point estimate and the latents by a distribution.
        • MCMC:
          • not sure of the best way to cover this; definitely want to get to HMC.
          • math fundamentals if Rob wants: markov chains, stationary distributions, mixing, ergodicity. how do we build a Markov chain with a given stationary dist? MH!
          • Also Gibbs as an example of a "non-MH" approach (at least in the mechanism; the theory justifies it as MH).
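          • A minimal Metropolis-Hastings sketch (my own toy example): a symmetric random-walk proposal targeting an unnormalized Beta(8, 4) density, so the normalizing constant is never needed.

            ```python
            import numpy as np

            rng = np.random.default_rng(0)

            def log_target(theta):
                """Unnormalized log density of Beta(8, 4); MH never needs the constant."""
                if theta <= 0.0 or theta >= 1.0:
                    return -np.inf
                return 7 * np.log(theta) + 3 * np.log(1 - theta)

            theta = 0.5                                    # arbitrary starting state
            samples = []
            for _ in range(50_000):
                proposal = theta + rng.normal(0.0, 0.1)    # symmetric random-walk proposal
                if np.log(rng.random()) < log_target(proposal) - log_target(theta):
                    theta = proposal                       # accept
                samples.append(theta)                      # on rejection, repeat the current state

            samples = np.array(samples[5_000:])            # drop burn-in
            print(f"MCMC mean ~= {samples.mean():.3f} (exact Beta(8, 4) mean = {8 / 12:.3f})")
            ```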
      • TODO: deep learning. deep graphical models. BNNs. normalizing flows.
    • Scaffolding:
      • models as sampling programs. generators of 'possible worlds'.
      • inference by rejection sampling.
      • the hard part: how do you develop intuition for what a model will infer?
  • For an intro ML class:
    • Start with classification:
      • Nearest neighbors. Initial exploration of overfitting and generalization. (k-NN sketch below.)
      • Perceptron.
      • Logistic regression.
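      • A minimal k-nearest-neighbors classifier on hypothetical toy data, handy for that first overfitting conversation: k=1 memorizes the training set, while larger k smooths the decision boundary.

        ```python
        import numpy as np

        def knn_predict(X_train, y_train, X_test, k=3):
            """Classify each test point by majority vote among its k nearest neighbors."""
            preds = []
            for x in X_test:
                dists = np.linalg.norm(X_train - x, axis=1)           # Euclidean distances
                nearest = np.argsort(dists)[:k]                       # indices of the k closest points
                preds.append(np.bincount(y_train[nearest]).argmax())  # majority vote
            return np.array(preds)

        # Hypothetical data: two 2-D Gaussian blobs.
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal([0, 0], 0.7, size=(50, 2)),
                       rng.normal([2, 2], 0.7, size=(50, 2))])
        y = np.array([0] * 50 + [1] * 50)

        print(knn_predict(X, y, np.array([[0.1, 0.2], [1.9, 2.1]]), k=5))  # -> [0 1]
        ```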
    • The (supervised) learning problem. Empirical risk minimization. generalization.
      • The Computational Learning Theory book (Kearns) has some really nice discussion of generalization, the tradeoff between the size of your hypothesis class and generalization performance, etc. I copied some of this under the no free lunch theorem.
    • Regression: nearest neighbors, linear regression, basis functions, kernelized regression.
    • Kernels. (NO SVMs. kernelize the perceptron maybe)
    • Unsupervised learning. Generative and discriminative models. Learning as compression. GMMs. PCA.
    • Bayesian inference. Epistemic and aleatoric uncertainty. Variational inference. Bayesian linear regression and/or probabilistic PCA. (kernelize it!).
    • Model selection. Cross-validation. Bayesian model selection. Decision theory.
    • Optimization. Gradient descent. SGD. Subgradients.
    • Deep learning basics. MLPs.
    • Differentiable programming. Autodiff, forward and reverse mode.
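      • A minimal dual-number sketch of forward mode (my own illustration, not any particular library's API): push (value, derivative) pairs through arithmetic, and every function computed this way returns its derivative for free. Reverse mode instead records the computation and sweeps backward.

        ```python
        import math
        from dataclasses import dataclass

        @dataclass
        class Dual:
            val: float   # function value
            dot: float   # derivative with respect to the chosen input

            def __add__(self, other):
                other = other if isinstance(other, Dual) else Dual(other, 0.0)
                return Dual(self.val + other.val, self.dot + other.dot)

            def __mul__(self, other):
                other = other if isinstance(other, Dual) else Dual(other, 0.0)
                return Dual(self.val * other.val,
                            self.val * other.dot + self.dot * other.val)  # product rule

        def sin(x):
            return Dual(math.sin(x.val), math.cos(x.val) * x.dot)         # chain rule

        def f(x):
            return sin(x * x) + x      # f(x) = sin(x^2) + x

        out = f(Dual(1.5, 1.0))        # seed derivative dx/dx = 1
        print(out.val, out.dot)        # f(1.5) and f'(1.5) = 2 * 1.5 * cos(2.25) + 1
        ```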
    • Architecture. Convnets, RNNs, transformers.
    • Properties of high-dimensional loss surfaces. Saddle points. Resnets.
    • Deep generative models. Autoregressive models. VAEs. normalizing flows. Self-supervised learning.
      • Homework assignment: implement a VAE on MNIST. (hw will need to explain monte carlo VI and the reparameterization trick)
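        • Not the homework itself, but a minimal numpy sketch (my own toy example) of the two ideas the write-up needs to explain: fit a Gaussian q(z) = N(mu, sigma^2) to a toy N(3, 1) target by gradient ascent on a Monte Carlo ELBO estimate, with gradients taken through the reparameterization z = mu + sigma * eps (gradients written by hand here, since there is no autodiff).

          ```python
          import numpy as np

          rng = np.random.default_rng(0)

          # Toy target: log p(z) = -0.5 * (z - 3)^2 + const, i.e. the true posterior is N(3, 1).
          def grad_log_p(z):
              return -(z - 3.0)

          mu, log_sigma = 0.0, 0.0   # variational parameters of q(z) = N(mu, sigma^2)
          lr = 0.05
          for step in range(2000):
              sigma = np.exp(log_sigma)
              eps = rng.standard_normal(64)     # base noise
              z = mu + sigma * eps              # reparameterization: z depends on (mu, sigma)
              # Monte Carlo gradients of ELBO = E_q[log p(z)] + entropy(q)
              g_mu = grad_log_p(z).mean()
              g_log_sigma = (grad_log_p(z) * sigma * eps).mean() + 1.0   # +1 from the entropy term
              mu += lr * g_mu
              log_sigma += lr * g_log_sigma

          print(f"q(z) ~= N({mu:.2f}, {np.exp(log_sigma) ** 2:.2f})")    # should approach N(3, 1)
          ```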
    • Ensembling. SGD model averaging. Bootstrap. Bagging.
    • Quick survey of other topics:
      • RL.
      • Active learning / bandits / bayes opt.
      • GANs
      • Collaborative filtering.
      • Graphical models.
      • structured prediction
      • Causal inference
  • Memes, illustrations, and links:

practical things that students should learn/know:

  • einsum notation
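    • For instance, a quick numpy illustration: name the axes, and repeated indices get summed over.

      ```python
      import numpy as np

      A = np.random.rand(3, 4)
      B = np.random.rand(4, 5)
      v = np.random.rand(4)

      print(np.einsum('ij,jk->ik', A, B).shape)    # matrix multiply: sum over the shared index j
      print(np.einsum('ij,j->i', A, v).shape)      # matrix-vector product
      print(np.einsum('ii->', np.eye(3)))          # trace: repeated index on a single operand
      print(np.einsum('bi,bj->bij',                # batched outer product
                      np.ones((8, 3)), np.ones((8, 4))).shape)
      ```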