Math explanations

Creation date: 2017-12-25
Last substantive revision date: 2018-10-10
Last modification date: 2024-04-18
Generated on: 2025-03-16
Completion status: notes

I’ve been thinking more about math explanations again (something I used to think a lot about when I was first learning abstract math) because I’m trying to get a good understanding of machine learning. In particular I’m thinking about what makes explanations good or bad. This is an ignorant thinking page for now, so don’t expect me to know anything about the topic.

In my mind some things that make explanations good are:

Writing for the right background. I think many books do not state the assumed background with sufficient specificity. They name general topics like “basic probability” without clarifying what that means, even though the meaning changes depending on location and generation.
Anticipating questions. It really frustrates me when a writer explains something, then doesn’t explain some follow-up point that is important to me. On the other hand, it feels satisfying when I ask a question in my mind, and immediately in the next sentence or paragraph the writer answers it. I feel like this is especially important to do with written content. With in-person explanations or interactive discussions, it might be less important to anticipate everything beforehand. And anticipating everything is hard because there are a lot of things someone could ask.
Giving good motivations. Failing to do this looks like giving a definition without writing about how it is thought of in practice, or without giving alternative formulations and things like that.
Giving good examples. I feel like this has something to do with writing good unit tests in software engineering. Like you want to get the good edge cases covered.
(Maybe) Giving a list of examples first before giving the solutions to those examples, and inviting the reader to do the examples themselves. I noticed this in Boolos, Burgess, and Jeffrey’s Computability and Logic. This style of writing essentially creates extra exercises for the reader.
Filling in a lot of details. This is sort of similar to the anticipating questions one, but I get frustrated when a lot of steps are skipped. Because in writing the book, the author had to take the same mental steps, and yet they are not recording those steps! It’s like they encode their steps into this really strangely terse format, and then when you read the text you have to do the decoding. Why not just write it out verbosely and save us a step?
Notation. Maybe this is just me, but I generally don’t like things like not explicitly bounding bound variables (e.g. writing $\sum_{i} f(x_{i})$ instead of $\sum_{i \in S} f(x_{i})$; without the $i \in S$, the set from which $i$ takes its values is implicit, which means more work for the reader to keep track of such things), writing expressions but calling them functions (e.g. “let $f(x)$ be a function …”), not introducing each variable (e.g. using variables like $f$ without saying “let $f : X \to Y$ be a function” first, or using $n$ without introduction even though it usually means a whole number), not exposing types (e.g. when using random variables and saying things like $P(X = x)$, what is the type of $=$?).
For “tutorial” type material, an embedded/built-in “spaced repetition” type process where material from earlier chapters quietly shows up in later chapters to remind readers, possibly in slightly altered form to solidify understanding.
When using terms, saying explicitly whether a term is commonly used in the field or whether it is made up on the spot or only used in some sub-communities. This is similar to how when explaining a controversial topic, one should be able to state the other side’s views clearly (i.e. pass the Ideological Turing Test for other views).
When giving an abstraction (e.g. definition), say it in multiple ways, and also give the reason you’re defining it that way.
Seriously, why don’t all textbooks have multiple choice questions and error-spotting exercises mixed in with the exposition?
Explaining why naive versions one might come up with cannot be true. Example.
Purposely going down wrong paths (while warning readers that it is a wrong path) that a naive reader would want to go down, just to illustrate why it doesn’t work. Terence Tao mentions this in this comment. You could think of this as a kind of imaginary Ideological Turing Test where one earnestly explores a mistaken path to show why it might seem reasonable, but then shows why it isn’t actually the right approach.

An example from Tim Gowers (go to the part in the post that says “Here’s what might have happened if I had struggled on with the sentence I was in the middle of writing”).

I think it might be difficult to make this interesting, because there are so many wrong paths that one could go down. What seems like a “natural” wrong path to go down for one person might seem absurd to another person (who has a different background or personality). It seems bad if these “going down wrong paths” in texts started to seem just tedious to go through (even the correct paths can seem tedious at times!).

Another example is Michael Nielsen’s bitcoin blog post, where he iteratively builds up the ideas of bitcoin by considering naive approaches to a digital currency, then fixing the problems in them.
For exercises, I think it’s important to explain somewhere what the exercise is intended to teach. Textbooks often warn that not doing the exercises will mean not actually learning the material (or similar). If this is truly the case, it should be straightforward to justify the importance of each exercise, but I never see this done in practice. My personal experience has been that the vast majority of exercises are not that interesting or enlightening, and I have a hard time figuring out what the exercise was for, other than an application of some trick. If explaining the point of the exercise spoils the exercise, it can be explained in some place other than immediately near the exercise.
Sticking to standard terminology. While there are some imperfections in the standard terminology, it is often too confusing to switch between different terminology, so my preference is that explainers stick to the standard ones, or use defensive/fool-proof terminology/notation. Another option is to explain all the variations in terminology so that readers are not confused when they go read other books (Peter Smith’s Gödel book does a great job of this), or to have some resource that all readers can use to find out the variations in terminology (I wish Wikipedia could be such a resource, but it’s often very incomplete, even on common undergraduate topics).

To give an example, Boolos, Burgess, and Jeffrey’s Computability and Logic uses “function” to mean “total or partial function”, and “total function” to mean “function”. This sort of thing is confusing because now there are two “claims” on the meaning of “function”. (This actually becomes sort of a problem in the book because when introducing characteristic functions, the book does not clarify that these must be total.) A defensive writing style would avoid the word “function” and use “total or partial function” and “total function”. Similarly in set theory, $\subset$ sometimes means “subset” and sometimes means “proper subset”. A defensive style of writing would avoid this and use $\subseteq$ and $\subsetneq$.
Given two concepts that could be mistaken for each other, there are four possibilities with four different ways to help the reader: $X \Leftrightarrow Y$ (prove that all $X$ are $Y$ and that all $Y$ are $X$); $X \Rightarrow Y$ and $Y \not\Rightarrow X$ (prove that all $X$ are $Y$, and give an example of something that is $Y$ but not $X$); $Y \Rightarrow X$ and $X \not\Rightarrow Y$ (similar to the previous case); and $X \not\Rightarrow Y$ and $Y \not\Rightarrow X$ (give an example of something that is $X$ but not $Y$, and separately an example of something that is $Y$ but not $X$). For the last of these, I like the example given by Terence Tao in Analysis I to illustrate how “disjoint” and “distinct” are entirely different concepts: $\{1, 2, 3\}$ and $\{2, 3, 4\}$ are distinct but not disjoint, and $\emptyset$ and $\emptyset$ are disjoint but not distinct.

Another example from logic: complete set of formulas vs maximal consistent set of formulas. These seem to be equivalent in the standard formal systems we work with, but are there systems in which they are different? Is one necessarily stronger than the other?
There is a kind of paternalism trade-off in explanations when deciding e.g. whether to ask the reader to prove a theorem before reading the proof in the book. Some authors will (1) just assume that the reader has enough “math culture” in them to know that they should try to prove the theorem before reading the proof; other authors will (2) gently prod the reader to prove the theorem before reading on (usually in the book’s preface); still others will (3) just not have any sort of opinion on this, or say the reader ought to know best what to do. In terms of paternalism, I would say (2) is most paternalistic, followed by (1), and then (3) is least paternalistic.

My current feeling is that books aimed at beginners (undergraduate level) and self-studiers should probably lean toward being paternalistic. For self-studiers, it is difficult sometimes to pick up on “math culture” without reading many different books or reading up on discussion threads (on MathOverflow, Reddit, math blogs, etc.).

Just to give some examples:
- Michael Nielsen’s neural networks book has a page that explains his stance on exercises. I guess this doesn’t fit into any of the (1)–(3) I listed above.
- Tim Gowers has advice in his Cambridge teaching posts to the effect of (2), but maybe in his official lecture notes he emphasizes this less (so closer to (1)).
- Computability and Logic does some of (2), where they say “we’ll show all the examples first, then the proofs later, so the reader can do them”.
- I like how in Goldrei’s logic book, he mixes in exercises throughout the text, and often includes answers. I guess this is an instance of (2).
Explaining the mental images/mental pictures that the author has.
Anticipating common misconceptions and errors.
When something introduced at an “introductory level” is slightly inaccurate or is a simplification, this should be mentioned (to avoid the learner memorizing definitions that will be superseded).

Example: Tadelis’s Game Theory: An Introduction slowly introduces ideas like IESDS and rationalizability—first in the context of static games of complete information without mixed strategies, then with mixed strategies (“IESDS and Rationalizability Revisited”).

Example: “Some authors, particularly in introductory textbooks, initially define the extensive-form game as being just a game tree with payoffs (no imperfect or incomplete information), and add the other elements in subsequent chapters as refinements.”

Thing that might look like an example that I don’t think is an example: teaching classical physics before relativity.
Using lots of pictures (Pugh’s analysis book emphasizes this). I find many verbal/formula-based explanations much harder to follow (it’s often easier to just write down the formula myself).
Giving full solutions to exercises. How should we distinguish “examples”/“theorems” from “exercises”? Normally we tend to think of “examples”/“theorems” as the proofs that come with solutions, and “exercises” as the proofs that the reader must fill in. But I think a better way to do this is for examples/theorems to be the ones that the reader can’t be expected to give (examples that emphasize a certain point, a theorem that uses a particular new trick), and exercises to be everything else. I like the way Terence Tao leaves many things to exercises in his Analysis I.

For solutions, I think it’s additionally important to present multiple solutions for each problem. At least one of the AoPS books (Intermediate Counting & Probability) does this to some extent, which I like. It’s pretty annoying to work hard on an exercise, and think that I have got it, only to find the solution in the book does it some different way (which is a learning opportunity to be sure, but also means I can’t really check my solution).
Making the reader do a lot of work: Leary & Kristiansen’s logic book uses a phrase that goes something like “the unique readability theorem is one of those results that’s important to know and good to prove oneself, but boring to read”. Actually I think many results are like this, at least in the weaker sense that they are interesting to read but more interesting if you prove them yourself.

Also see generation/testing/pre-testing effect.
Presenting multiple organizations of the material: the only way to do this right now is to just look at multiple books. But I think a good textbook should take the initiative and present multiple ways of doing things (both at the low level of multiple proofs of a theorem and at the high level of multiple ways of organizing the subject). See also reverse mathematics and John Stillwell’s books.
I think when people try to explain mathematics, by default they ask something like “how do I talk about this subject so as to cover all the main points?” But I think the important question is more like “how can I create a transformation such that, if I apply this transformation to other minds, they become competent in the subject?”
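
The “disjoint” versus “distinct” item in the list above can also be checked mechanically. The following is only a rough sketch, assuming a small finite universe; the helper names (`subsets`, `counterexample`, `tao_overlap`, `tao_empty`) are made up for illustration and do not come from any of the books mentioned. It searches all pairs of subsets of a four-element set for a witness that one concept does not imply the other.

```python
from itertools import combinations, product


def subsets(universe):
    """All subsets of a finite collection, as frozensets (smallest first)."""
    items = list(universe)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def disjoint(a, b):
    """Two sets are disjoint when they share no element."""
    return len(a & b) == 0


def distinct(a, b):
    """Two sets are distinct when they are not equal."""
    return a != b


def counterexample(p, q, pairs):
    """Return the first pair satisfying p but not q, or None if p implies q on `pairs`."""
    for a, b in pairs:
        if p(a, b) and not q(a, b):
            return (a, b)
    return None


# Hypothetical search space: all ordered pairs of subsets of {1, 2, 3, 4}.
pairs = list(product(subsets({1, 2, 3, 4}), repeat=2))

# The two witnesses quoted from Analysis I.
tao_overlap = (frozenset({1, 2, 3}), frozenset({2, 3, 4}))
tao_empty = (frozenset(), frozenset())

print(counterexample(disjoint, distinct, pairs))  # a pair that is disjoint but not distinct
print(counterexample(distinct, disjoint, pairs))  # a pair that is distinct but not disjoint
```

Both searches return a witness, so neither concept implies the other, which is the fourth of the four cases. The search also suggests why the empty set is the natural witness: a set equal to itself can only be disjoint from itself if it has no elements.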

Tutorial vs reference style: some explanations are written in tutorial style, where there’s a lot of context throughout the whole document, and you should just start at the beginning and walk through the document. There are also more reference style documents that depend on less immediate context. Some supposedly tutorial style writing can start to feel like reference style writing when it lists a lot of theorems/proofs without much motivation.

Personally I find a lot of tutorial-based explanations difficult to follow because I can’t keep a lot of things in my head at once, unless I’m the one generating the thoughts (for instance, when I’m programming I can keep many variables fresh in my mind, but when I’m reading someone else’s code I find it difficult to do the same). I wish more variable/term tables were given in math explanations so that if one forgets the context one can periodically look it up again without scrolling all over the place.

Actually though, I suspect it’s not the tutorial format that is the problem, but rather the tutorial-writer’s too-high expectations of the tutorial-reader. If the writer took the effort to break things down more and to have e.g. Anki cards available to help the reader, then I think a tutorial format would still be the best way to learn.

Being clear about ontology helps, I think, to avoid confusing exposition. Something I don’t like is when $\frac{d y}{d x}$ is treated both as a function and as a variable. In my world, variables cannot change once you assign them a value. It doesn’t make sense to talk about “let $x = 3$ and see what happens as $x$ increases”, because everything is static. If you want to talk about changes, you define a static lookup table, i.e. a function. And a function is not some machine that computes outputs from inputs; it is just a graph. Of course, once you learn the subject, you can freely go between various mental images (including ones that would be confusing to a beginner).

Maybe it makes sense for some people to think of variables as changing, but the fact remains that you can formalize this in logic without any moving parts.
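
Here is a minimal Python sketch of this “no moving parts” view, assuming a small finite domain so that the lookup table can be written out in full. The names `is_function`, `apply_graph`, `square`, and `square_prime` are hypothetical, chosen only for this illustration. A function is stored as a set of (input, output) pairs, and “seeing what happens as $x$ increases” becomes reading off several rows of a table that never changes.

```python
def is_function(graph, domain):
    """True if `graph` pairs each element of `domain` with exactly one output."""
    inputs = [x for x, _ in graph]
    return set(inputs) == set(domain) and len(inputs) == len(set(inputs))


def apply_graph(graph, x):
    """Look up the unique output paired with x (raises ValueError otherwise)."""
    (y,) = [out for inp, out in graph if inp == x]
    return y


domain = range(-3, 4)

# f(x) = x^2 on the domain, as a static set of pairs.
square = frozenset((x, x * x) for x in domain)

# The derivative f'(x) = 2x is a second static table, not a variable.
square_prime = frozenset((x, 2 * x) for x in domain)

# "As x increases" just means reading the table at successive inputs.
print([apply_graph(square, x) for x in domain])
```

Nothing is mutated anywhere in the sketch: `square` and `square_prime` are immutable values, and the only thing that varies is which row of the table is being looked at.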

On page 21 of these notes the notational confusion of $\nabla f (A x)$ (where $A$ is an $m$ by $n$ matrix and $f : R^{m} \to R$ is a function) is mentioned.
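
To make the ambiguity concrete, here is a small worked case. It assumes a scalar-valued $f : R^{m} \to R$ given by $f(z) = z^{\top} z$, so that $\nabla f(z) = 2z$; this particular $f$ and the helper name $g$ are illustrative choices, not necessarily what the notes use. The two readings of $\nabla f (A x)$ then give vectors of different dimensions:

```latex
\begin{align*}
\text{Reading 1 (gradient of $f$, evaluated at the point $Ax$):}\quad
  & (\nabla f)(Ax) = 2Ax \in \mathbb{R}^{m}, \\
\text{Reading 2 (gradient of $g(x) = f(Ax)$ with respect to $x$):}\quad
  & \nabla g(x) = A^{\top} (\nabla f)(Ax) = 2A^{\top}Ax \in \mathbb{R}^{n}.
\end{align*}
```

Writing the subscripted forms $\nabla_{z} f(z)\big|_{z = Ax}$ and $\nabla_{x} f(Ax)$, or naming the composite function as above, is one way to say which of the two is meant.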

It seems like a lot of people complain about poor notation but then they just get used to it. Whereas I have something like a gag reflex to confusing notation and have difficulty understanding explanations until they use good notation.

My MATH 334 page talks about the confusing notation of the chain rule too.

This guide to backpropagation by Michael J. C. Gordon is interesting, especially since he spends a lot of time reviewing basic calculus results and understands functional programming (likes functions more than expressions, and defines types!). Unfortunately I find some other notational issues, like the difficulty of distinguishing between multiplication and function application (the latter is a space like in Haskell, but it’s sort of hard to tell the difference between an explicit space and the kerning).

Since in my experience most exposition is horrible, I am a big fan of shopping around to find the really good books. Not sure I’ve really succeeded in doing this for ML though.

External links

Reflections on working through Tao’s Analysis