This is excellent. Amazing (equation + picture)/text ratio.

Complaint: you don't define your notion of "space". In chapter 1 it's some informal notion that you use to motivate the definition of a set (??), in 1.3 and 1.4 it becomes clear that by space you mean "set". Then later you start talking about dimension of spaces, implying not only do they come with a topology but now they have a well-defined dimension, so a locally Euclidean Hausdorff space or something - but maybe you just mean R^n.

Comment for other commentators in this thread: not every exposition is tailored for the masses. That a piece of pedagogical literature does not appeal to your background doesn't mean it's not good. There's a very clear need for exposition on basic structures in probability theory and this fits there.

It's like the definition of set defined here should really be a subset, and the definition of space should be a set. Maybe just say a set is any collection of objects?

I'm not really sure who the intended audience is here. There's a lot of material covered very briefly in a very short space, and not enough details that anyone who doesn't already know it would be able to pick up anything substantive.

The minimum audience would be those who've taken an introductory point set topology course, an introductory analysis course, and a probability/statistics course, or who've read extensively on the subject. If you do not have that background, you are not qualified to understand this material, no matter how much the author attempts to dumb it down.

As a self-taught programmer, I appreciate learning about mathematical topics from the bottom up (from a « pure mathematics » point of view), after having gained some intuition. It is easier for me to grasp because it relates very much to my daily life of programming, with its type systems, class transformations, mappings. And the author is right in that probability theory is often employed in even looser terms than other areas of mathematics. It feels to me like building up something in java/c++/haskell vs building up the same thing in python/javascript. For a lot of people, python is simpler to handle, but I usually have to go back to my c++ to feel reasonably safe that I’m applying my functions to the right objects.

Do you know of anything which can help people get over the hurdle to know enough to use this content?

For me it's only worked when colleagues have explained concepts to me when they were needed. After several occurrences of this, everything finally started to make sense and I could then make use of material like this.

There are no shortcuts with math. If you really want to learn it, you must be willing to put in a large number of hours over a long period of time in order to master it. Are you willing to do that?

I have always found Russian math book writers to be on point, not going too much over your head, also respecting the reader's intelligence. If you like it, then you will love his calculus book, that one is also a real gem.

Thanks for the reference anyway! I downloaded the Russian version in form of PDF document and enjoy reading it:-) What was great about USSR is its level of popularization of science. All these books written for kids or high school students were amazing.

Echoing other comments here, this seems like a hard way to start learning probability. It sounds like the goal is to make probability easier to understand based on what you say here (https://betanalpha.github.io/writing)

> In this case study I attempt to untangle this pedagogical knot to illuminate the basic concepts and manipulations of probability theory and how they can be implemented in practice

But I think this is too hard. I really loved "Probability For The Enthusiastic Beginner" http://a.co/2kp5PZd

"For Scientists and Engineers" sounds to me like it's targeting people who already have a strong background in more advanced mathematics, but not necessarily probability and measure theory. If so I think this is a decent way to go about it.

I'm an engineer and sometimes mathematician that works with fairly in-depth probability theory related things and this looks to be a condensed version of a lot of the basic stuff I had to self learn when I was getting into what I work on now. I'm in a niche area though, and I do wonder if this really is that useful to most scientists and engineers.

Depends on your goal, statistics and probability theory are separate (though of course related) fields with different applications. For me I really needed the measure-theoretic bits because I was (am) working on modeling ergodic processes. This article honestly doesn't go into enough detail to be especially useful but I like the direction the author approaches it from.

I'm familiar with the text you mentioned, it's certainly good and would be better than this article for most, but comparing a textbook to a short web article isn't exactly fair.

The goal is worthy, but the product is inadequate to say the least. This thing is littered with typos, and enough of the exposition is sufficiently irrelevant or incorrect to be unintuitive. That said, I like the graphics and layout.

For example, when he discusses power sets in order to introduce sigma algebras, he implies that a sigma algebra is a better-behaved alternative to a power set. However, a power set is always itself a sigma algebra (after all, even a power set of an uncountable set still is closed under complements and countable unions).
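The finite case of that claim is easy to check by machine. Below is a minimal sketch in Python; the helper names `power_set` and `is_sigma_algebra` are made up for illustration, and a finite space only hints at the uncountable case, where the argument has to be made on paper.

```python
from itertools import chain, combinations

def power_set(space):
    """Return every subset of `space` as a set of frozensets."""
    items = list(space)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

def is_sigma_algebra(space, family):
    """Check the sigma-algebra axioms on a finite space.

    On a finite space countable unions reduce to finite unions, so
    closure under pairwise union is enough.
    """
    space = frozenset(space)
    if space not in family:
        return False
    closed_under_complement = all(space - a in family for a in family)
    closed_under_union = all(a | b in family for a in family for b in family)
    return closed_under_complement and closed_under_union

space = {1, 2, 3}
print(is_sigma_algebra(space, power_set(space)))  # True
# A family that is missing the complement of {1} fails the check.
print(is_sigma_algebra(space, {frozenset(), frozenset({1}), frozenset(space)}))  # False
```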

Later, when discussing probability distributions, he writes:

> [W]e want this allocation [of a conserved quantity] to be self-consistent – the allocation to any collection of disjoint sets, A_n ∩ A_m=0, n≠m, should be the same as the allocation to the union of those sets,
> ℙπ[∪(n=1 to N) A_n]=∑(n=1 to N)ℙπ[A_n].

The condition `A_n ∩ A_m=0, n≠m` is actually incorrect, since A_n and A_m are sets and 0 is an integer. The author means the empty set, but typo'd.

He frequently uses words like "conserved" or "well-defined" without giving us a clue as to what these mean. In what context are probabilities "conserved"? What distinguishes "well-defined" from "not well-defined"?

I'm a software engineer. A non-trivial amount of my time is devoted to reading code and finding bugs. Sloppy reasoning, inconsistencies and outright errors like that are big red flags to me. It doesn't help that the whole section on sigma algebras is somewhat irrelevant, since he doesn't really explore measure theory as the basis for modern probability.

IMO a better resource is the series of "Probability Primer" videos from mathematicalmonk on YouTube[1]. He does an excellent job (IMO) of covering all pertinent pre-requisites and being mostly rigorous without necessarily proving every single fact or exhaustively covering all edge and corner cases. He also makes a good effort to recommend advanced (and rigorous) treatments of the subject (and ancillary ones like measure theory). A readable version of this YouTube series would be a great resource, and if Michael Betancourt is reading, I'd encourage him to pursue that in his next iteration of this product.

> It doesn't help that the whole section on sigma algebras is somewhat irrelevant, since he doesn't really explore measure theory as the basis for modern probability.

Christ, practically all of measure theory is irrelevant for applied work, in much the same way that an engineer shouldn't care about the definition of a real number.

There's a model of real analysis due to Solovay that uses the axiom of dependent choice instead of the full axiom of choice. In the Solovay model, all sets of reals are Lebesgue measurable. Thus any result that needs the measurable/non-measurable distinction inherently depends on the axiom of choice.

I'd be worried if I was relying on Choice as an applied scientist.

Edit: same goes for Lebesgue vs. Riemann integration. To quote Richard Hamming:

> Does anyone believe that the difference between the Lebesgue and Riemann integrals can have physical significance, and that whether say, an airplane would or would not fly could depend on this difference? If such were claimed, I should not care to fly in that plane.

It matters by convention, because the textbooks are written that way.

My point is that you don't need that level of formal rigour to do applied work. You can derive the Feynman-Kac formula via a scaling limit of discrete-time Markov chains. Add some Lévy processes (e.g. compound Poisson processes) and you're basically done.

If you want to be ultra-rigorous in your definitions, then you need measure theory, yes. But even Einstein didn't need that for his description of Brownian motion. If a scaling limit is good enough for him, it's good enough for me.

I confused this with the book "Probability and Statistics for Engineers and Scientists" by Anthony Hayter and I got excited.

I am kind of a beginner in Machine Learning and was struggling badly with basic probability and statistics concepts. I went through so many resources and somehow none of them clicked. Then I stumbled upon this book and I realized this is exactly the kind of book I needed. It assumes no prior knowledge and is very heavy on examples. Other books just dive into jargon/symbol-laden theory without giving simple examples or building concepts from the ground up.

I mentioned this because I feel someone might benefit from this suggestion.

Wow, this seems like a particularly hard way to learn probability.

One thing I noticed about myself as I did more and more work with probability is that I started thinking in terms of distributions a lot more.

These days I find it very difficult to think without using them. In just about everything I do now I tend to think about moving probability mass around.

Here's my non-standard, nutshell, IMHO advice in using probability theory:

(1) Random Variables. Go outside. Observe a number. Then that is the value of a random variable. To have a random variable, that the number be random in the sense of unpredictable is not needed. For the phrase and/or criterion "truly random", mostly f'get about it, but we return to that for the subject of random number generation below. So, net, your data, all your data, are the values of random variables.

(2) Distributions. Sure, each random variable has a distribution. And there are the Gaussian, uniform, binomial, exponential, Poisson, etc. distributions.

Sometimes in practice can use some assumptions to conclude that a random variable has such a known distribution; this is commonly the case for exercises about flipping coins, rolling dice, shuffling cards.

For another example, suppose customers are arriving at your Web site. Well maybe the number of arrivals since noon has stationary (over time) independent increments -- maybe you can confirm this just intuitively. Then, presto, bingo, the arrivals are a Poisson process, and the times between arrivals are independent, identically distributed exponential random variables -- see E. Cinlar, Introduction to Stochastic Processes. Further, since you might be willing to assume that the arrivals are from many users acting independently, the renewal theorem says that the arrivals will be approximately Poisson, more accurately for more users -- see W. Feller's second volume.
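The link between exponential gaps and Poisson counts can be seen in a small simulation. This is a sketch under stated assumptions: the rate of 3 arrivals per unit time, the horizon of 2 time units, the seed, and the function name `count_arrivals` are all made up for illustration. For a Poisson count both the mean and the variance should land near rate * horizon.

```python
import random

def count_arrivals(rate, horizon, rng):
    """Count arrivals in [0, horizon] when the gaps between arrivals
    are i.i.d. exponential random variables with the given rate."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(rate)  # time until the next arrival
        if t > horizon:
            return n
        n += 1

rng = random.Random(0)  # fixed seed so the run is repeatable
rate, horizon, runs = 3.0, 2.0, 20000
counts = [count_arrivals(rate, horizon, rng) for _ in range(runs)]
mean = sum(counts) / runs
var = sum((c - mean) ** 2 for c in counts) / runs
# Both numbers should be close to rate * horizon = 6.
print(mean, var)
```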

Sometimes the central limit theorem can be used to justify a Gaussian assumption.

Still, net, in practice, mostly we don't and can't know the distribution. To have much detail on a distribution of one variable takes a lot of data; the joint distribution on several variables takes much more data; the amount of data needed explodes exponentially with the number of joint variables. So, net, don't expect to know or find the distribution.

Often you will be able to estimate mean and variance, etc. but not the whole distribution. So, usually need to proceed without knowing distributions. In simple terms: Distributions -- they exist? Yup. We can find them? Nope!

(3) Independence. Probability theory is, sure, part of math, but, really, the hugely important, unique feature is the concept of independence.

One of the main techniques in applied math is divide and conquer. Well, being able to make an independence assumption is what lets you so divide.

Independence? A simple criterion for practice is, suppose you are given random variables X and Y. You are even given their probability distributions (but NOT their joint probability distribution). Then X and Y are independent if and only if knowing the value of one of them tells you nothing more than you already know about the value of the other one.

The hope here is that often in practice you can check this criterion just intuitively from what you know about the real situation. E.g., does a butterfly flapping its wings in Tokyo tell you more about weather tomorrow in NYC? My intuitive guess is that this is a case of independence which means that for predicting weather of NYC tomorrow, we can just f'get about that butterfly.

(4) Conditioning. For random variables X and Y, can have the conditional expectation of Y given X, E[Y|X]. Such conditioning is the main way X tells you about Y. Then there is a function f(X) = E[Y|X], and f(X) is the best non-linear least squares estimate of Y. Note that E[E[Y|X]] = E[Y] which means that E[Y|X] is an unbiased estimate of Y.
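The identity E[E[Y|X]] = E[Y] can be checked exactly on a small discrete example. The joint distribution below is hypothetical, chosen only so the arithmetic is easy to follow, and the helper names are mine. Fractions keep the comparison exact rather than approximate.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X = x, Y = y).
joint = {
    (0, 1): F(1, 8), (0, 3): F(3, 8),
    (1, 2): F(1, 4), (1, 6): F(1, 4),
}

def marginal_x(joint):
    """Return {x: P(X = x)}."""
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0) + p
    return px

def cond_exp_y_given_x(joint):
    """Return f with f[x] = E[Y | X = x]."""
    px = marginal_x(joint)
    f = {}
    for (x, y), p in joint.items():
        f[x] = f.get(x, 0) + y * p / px[x]
    return f

px = marginal_x(joint)
f = cond_exp_y_given_x(joint)
e_y = sum(y * p for (_, y), p in joint.items())  # E[Y]
e_f = sum(f[x] * px[x] for x in px)              # E[E[Y|X]]
print(f[0], f[1], e_y, e_f)  # 5/2 4 13/4 13/4
```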

(5) Correlation. If you don't have independence, then likely use the Pearson correlation -- it works like the cosine of an angle. If random variables X and Y are independent, then their Pearson correlation coefficient is 0 -- proof is an easy exercise just from the basic definition and properties of independence.
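One caution is that the implication only runs one way: zero correlation does not imply independence. Here is a small exact check, with two hypothetical joint distributions and helper names of my own choosing. Covariance is used instead of the Pearson coefficient because the coefficient is 0 exactly when the covariance is 0 (given non-zero variances).

```python
from fractions import Fraction as F

def covariance(joint):
    """Cov(X, Y) for a finite joint distribution {(x, y): probability}."""
    ex = sum(x * p for (x, _), p in joint.items())
    ey = sum(y * p for (_, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    return exy - ex * ey

def is_independent(joint):
    """Check P(X=x, Y=y) == P(X=x) * P(Y=y) for every pair of values."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0) + p
        py[y] = py.get(y, 0) + p
    return all(joint.get((x, y), 0) == px[x] * py[y] for x in px for y in py)

# Independent case: a fair coin X and a fair three-sided die Y.
independent = {(x, y): F(1, 6) for x in (0, 1) for y in (1, 2, 3)}
# Dependent case: X uniform on {-1, 0, 1} and Y = X**2.
dependent = {(x, x * x): F(1, 3) for x in (-1, 0, 1)}

print(is_independent(independent), covariance(independent))  # True 0
print(is_independent(dependent), covariance(dependent))      # False 0
```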

(6) The Classic Limit Theorems. Pay close attention to the central limit theorem (CLT) and the weak and strong laws of large numbers (LLN). The CLT is the main reason we get a Gaussian distribution, and the LLN is the main reason we take averages.

(7) Random Number Generation. A sequence of random numbers is meant to look, for some practical purposes, like a sequence of random variables that are all independent and have uniform distribution on [0,1]. Are they "truly random"? Maybe not. But if they are, then they are independent and identically distributed (i.i.d.) on [0,1] -- and that's all there is to it, and don't have to struggle to say or understand more.

Probability theory expositions, especially for [software] engineers, would be better served if they were well typed. What is the type of a random variable, E[Y|X], E[E[Y|X]]? Hint, a random variable is not a scalar, but rather a function, the probability distribution.

Hmm, a random variable (in the sense of measure theory, as in OP) is indeed a function - but it's not a probability distribution.

An R.V. is a measurable function from the sample space into the reals. A probability distribution is a function assigning probabilities to measurable sets, formally, a function from the sigma-algebra into [0,1].

So in particular, an R.V. (like a Gaussian) can take on negative values. A probability distribution cannot.

Also, the domain of the R.V. is the sample space. But the domain of the probability distribution is the sigma-algebra over that sample space.
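Since the question was about types, here is how the distinction might look with Python type hints. This is only a sketch: the two-coin sample space, the uniform measure, and every name in it are assumptions for illustration, not anything taken from the article.

```python
from fractions import Fraction
from typing import Callable, FrozenSet

Outcome = str                                     # a point w of the sample space
Event = FrozenSet[Outcome]                        # a measurable set of outcomes
RandomVariable = Callable[[Outcome], float]       # sample space -> reals
ProbabilityMeasure = Callable[[Event], Fraction]  # events -> [0, 1]

SAMPLE_SPACE: Event = frozenset({"HH", "HT", "TH", "TT"})

def prob(event: Event) -> Fraction:
    """Uniform measure for two fair coin flips (an assumption of this sketch)."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

def heads_minus_tails(w: Outcome) -> float:
    """A random variable; note that it can take negative values."""
    return float(w.count("H") - w.count("T"))

def event_leq(X: RandomVariable, x: float) -> Event:
    """The event usually written 'X <= x', i.e. {w | X(w) <= x}."""
    return frozenset(w for w in SAMPLE_SPACE if X(w) <= x)

print(heads_minus_tails("TT"))                # -2.0
print(prob(event_leq(heads_minus_tails, 0)))  # 3/4
```

The random variable eats outcomes and the measure eats events, so passing one where the other is expected is exactly the kind of mistake a type checker would flag.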

A distribution is a real valued function of a real variable. The domain of the function is the whole real line.

Note: Below, borrowing from D. Knuth's TeX, we use the underscore character '_' to denote the start of a subscript.

Details: For real valued random variable X, probability measure P, and the set of real numbers R, the cumulative distribution of X is the function F_X: R --> R where, for x in R, F_X(x) = P(X <= x).

If F_X is differentiable, then the probability density function of X is the real valued function of a real variable f_X: R --> R where, for all x in R, f_X(x) = d/dx F_X(x) where d/dx is the calculus derivative.
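As a quick numerical illustration of f_X = d/dx F_X, take an exponential random variable with rate 1 (my choice of example, not one from the thread) and compare the closed-form density with a finite-difference derivative of the cumulative distribution. The points are chosen away from x = 0, where this particular F_X has a kink.

```python
import math

def cdf(x, rate=1.0):
    """F_X(x) = P(X <= x) for an exponential random variable."""
    return 1.0 - math.exp(-rate * x) if x >= 0 else 0.0

def pdf(x, rate=1.0):
    """f_X(x) = d/dx F_X(x), written out in closed form."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def numeric_derivative(f, x, h=1e-6):
    """Central difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (0.5, 1.0, 2.0):
    # The last two columns should agree to several decimal places.
    print(x, pdf(x), numeric_derivative(cdf, x))
```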

For the connections with sigma algebras, that is more advanced than most engineers care about, but here are some of the details:

For real numbers a and b with a < b, there is the open interval

(a,b) = {x|a < x < b}

A topology on R is a collection of subsets regarded as open and that satisfy the axioms for a topology -- the sets in a topology are closed under finite intersections and arbitrary unions and both R and the empty set are open. The usual topology on R is the smallest (a short argument shows that this "smallest" is well defined) topology that has each open interval an element of the topology -- right, the topology regards the open intervals as open.

The usual reason to discuss a topology is to have a means of defining continuous functions, a means more general than from the usual "for each epsilon greater than zero, there exists a delta greater than zero such that ..." or in terms of limits of sequences. Indeed, there are advanced situations where we can use topologies to define continuous functions where epsilon and delta and where converging sequences don't work. If curious, look up Moore-Smith convergence, nets, and filters or just Kelley, General Topology.

Well, a sigma algebra is like a topology, that is, is a collection of subsets: A sigma algebra is closed under countable unions and relative complements. Right, we avoid uncountable unions because otherwise we will get stuck in a big mud hole. It is an early exercise that there are no countably infinite sigma algebras.
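On a finite space the "smallest sigma algebra containing some given sets" can be computed by brute force, which makes the closure rules concrete. A sketch, with a made-up function name and made-up generating sets:

```python
def generate_sigma_algebra(space, generators):
    """Smallest family containing `generators` that is closed under
    complement and union. On a finite space countable unions are just
    finite unions, so this is the generated sigma algebra."""
    space = frozenset(space)
    family = {frozenset(), space} | {frozenset(g) for g in generators}
    while True:
        new = {space - a for a in family}                # complements
        new |= {a | b for a in family for b in family}   # pairwise unions
        if new <= family:                                # nothing new: closed
            return family
        family |= new

space = range(6)
sigma = generate_sigma_algebra(space, [{0, 1}, {1, 2, 3}])
print(len(sigma))  # 16
```

The two generators split the space into four indivisible blocks, {0}, {1}, {2, 3} and {4, 5}, and the sigma algebra consists of all 2^4 = 16 unions of blocks. Finite sigma algebras always have a power-of-two size for the same reason.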

The reason for sigma algebras is to permit defining a measurable function, that is, one where we can apply the Lebesgue integration theory. The integral of calculus is due to B. Riemann and is the Riemann integral. W. Rudin, Principles of Mathematical Analysis shows that a continuous real valued function with domain a compact set (closed and bounded, where closed is the complement of an open set) has a Riemann integral. Well in this case, the Lebesgue integral gives the same numerical answer -- same thing. The advantage of the Lebesgue approach is that the function can be even bizarre and its domain can be much more general. Indeed, in probability theory, expectation is just the Lebesgue integral. In simple terms, Riemann partitioned on the X axis, and Lebesgue partitioned on the Y axis.
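The "X axis versus Y axis" picture can be sketched numerically for f(x) = x^2 on [0, 1]. The second function below uses the layer-cake formula (integrate the length of the set where f exceeds each level), which is one standard way to picture the Lebesgue approach for non-negative functions, not its formal definition. The function names and the choice of f are mine.

```python
import math

def riemann_sum(f, a, b, n):
    """Partition the x axis into n slices and add up slice areas
    (midpoint rule)."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

def lebesgue_style_sum(measure_above, y_max, n):
    """Partition the y axis into n levels and add up
    (level thickness) * (length of the set where f exceeds the level)."""
    dy = y_max / n
    return sum(measure_above((k + 0.5) * dy) for k in range(n)) * dy

f = lambda x: x * x
# For f(x) = x^2 on [0, 1], the set {x : f(x) > t} is (sqrt(t), 1],
# which has length 1 - sqrt(t).
measure_above = lambda t: 1.0 - math.sqrt(t)

# Both sums should be close to the exact integral 1/3.
print(riemann_sum(f, 0.0, 1.0, 1000))
print(lebesgue_style_sum(measure_above, 1.0, 1000))
```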

Well, given the usual topology on R, we can ask for the smallest sigma algebra on R that has the topology as a subset. That sigma algebra is the Borel sets of R. Uh, Lebesgue was a student of E. Borel. In Rudin will find the Heine-Borel theorem.

So, in probability theory, we have a sample space. Each point in the sample space is a trial, i.e., essentially a real world experimental trial (note: really our attitude is that in all the universe we see only one such trial -- if this seems far out, then blame the Russians, e.g., A. Kolmogorov, E. Dynkin, etc.!). Well, an event is a subset of the sample space, that is, a set of trials. So, flip a coin. Let H be the event, that is, the set of all trials where the coin comes up heads.

Well, to apply Lebesgue's theory of integration, we want the set of all events to be a sigma algebra.

Then a probability measure is a measure in the sense of Lebesgue's measure theory, that is, a real valued function, in the case of probability taking values in [0,1], and with domain the sigma algebra of events. So, for the event H, we can ask for the probability of H, that is, P(H), which is a number in [0,1]. For a fair coin tossed by an honest member of the FBI we have P(H) = 1/2.

Then a real valued random variable X is just a real valued function with domain the sample space and also measurable: This part about being measurable is that for each Borel set A, a subset of R, the set of all trials w so that X(w) is in A is an event, that is, an element of the sigma algebra on the set of trials. That is, the inverse image under X of the Borel sets are events, elements of the sigma algebra on the sample space.
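On a finite sample space the measurability condition can be checked directly. In the sketch below the four-point space, the coarse sigma algebra, and the helper names are all hypothetical. Because X takes only finitely many values, the inverse image of any Borel set is a finite union of inverse images of single values, so checking single values is enough.

```python
def preimage(X, space, values):
    """The set of trials w with X(w) in `values`."""
    return frozenset(w for w in space if X(w) in values)

def is_measurable(X, space, sigma_algebra):
    """Check that the inverse image of each value of X is an event."""
    return all(
        preimage(X, space, {v}) in sigma_algebra
        for v in {X(w) for w in space}
    )

space = frozenset({1, 2, 3, 4})
# Hypothetical sigma algebra that cannot tell 1 from 2, or 3 from 4.
coarse = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), space}

print(is_measurable(lambda w: 10 if w <= 2 else 20, space, coarse))  # True
print(is_measurable(lambda w: w, space, coarse))                     # False
```

The second function fails because it distinguishes trials that the sigma algebra lumps together: {w | X(w) = 1} = {1} is not an event.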

So, with X measurable in this way, we have a near perfect shot at defining the expectation of X, E[X]. For this we have a little two step dance:

First we look at X^+ ('^' denotes a superscript), which is X where X >= 0 and 0 otherwise. So, X^+ is the positive part of X. Similarly X^- is the negative part of X, that is, -X where X < 0 and 0 otherwise. So both X^+ and X^- are >= 0 and we have X = X^+ - X^-.

Well, we can use Lebesgue's theory to integrate X^+ and X^-. Biggie stuff: The X need only be measurable, and that admits lots of really wildly bizarre functions. We've got great generality, and that's good to have in various limiting arguments. Uh, we like limiting arguments because that is our main way to approximate, which is our main way to being healthy, wealthy, and wise!

So the Lebesgue integral of X^+ we write as E[X^+]. Similarly for X^-. Now no way do we want to be subtracting one infinity from another since permitting that would trash the usual laws of arithmetic.

So, for our second step, with X^- >= 0 as above, if at least one of E[X^+] and E[X^-] is finite, then we define E[X] = E[X^+] - E[X^-].
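The two step dance is easy to act out on a finite sample space, where every sum is finite and the "infinity minus infinity" problem cannot arise. A sketch, with a fair die and the payoff X(w) = w - 3 as made-up examples and helper names of my own:

```python
from fractions import Fraction as F

def positive_part(x):
    """X^+ : equals x where x >= 0 and 0 otherwise."""
    return max(x, 0)

def negative_part(x):
    """X^- : equals -x where x < 0 and 0 otherwise, so it is >= 0."""
    return max(-x, 0)

def expectation(X, prob):
    """E[X] = E[X^+] - E[X^-] on a finite sample space {w: P({w})}."""
    e_plus = sum(positive_part(X(w)) * p for w, p in prob.items())
    e_minus = sum(negative_part(X(w)) * p for w, p in prob.items())
    return e_plus - e_minus

prob = {w: F(1, 6) for w in range(1, 7)}  # fair die
X = lambda w: w - 3                       # payoff, negative for w < 3
print(expectation(X, prob))               # 1/2
```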

Now we've defined expectation ("average") of a real random variable X. Our definition is just the Lebesgue integral. For the Lebesgue integral, we wanted the sigma algebras.

On the real line we can consider the sigma algebra of Lebesgue measurable sets; that's larger than the Borel sets. Then we just ask, assume, assert, believe, ..., that our random variables are measurable with respect to the sigma algebra of Lebesgue sets and the sigma algebra of the events. Uh, right, Lebesgue measure on R assigns Lebesgue measure b - a to interval (a,b) and extends from there. Fine details are in various texts by Rudin, Royden, etc.

That's the beginnings of the role of sigma algebras in advanced approaches to probability, statistics, and stochastic processes. It turns out, the sigma algebra approach is much nicer for several parts of what we want in probability, e.g., for defining independence and conditional expectation. E.g., if we want to know that some set of uncountably infinitely many random variables are independent, we can. Same for conditioning on uncountably infinitely many random variables, e.g., the past history of a stochastic process.

below. In short, the answer is that a real valued random variable X is a real valued function. The domain of the function is a set of trials. So, for a trial w (usually written as lower case omega), X(w) is a real number.

Then the event for real number x

X <= x

is really shorthand notation for

{w|X(w) <= x}

So, typically we don't mention the w.

Moreover, typically for all but grad school mathematicians taking a course in "graduate probability" we don't mention that X is a function. Instead we just say something like, X is the number we get from running an experimental trial, one of all the numbers we "might have gotten" considering the probability distribution of X.

You are correct: You sense some mushy ground under the foundations of probability theory, and you are not nearly the first to so sense.

For a long time the answer was, "it works great in practice", which doesn't make the mush any more firm.

Well, in 1933 A. Kolmogorov gave a solid mathematical foundation for probability theory. That's the usual foundation for advanced work in probability, statistics, and stochastic processes. My post

Some of the consequences are surprising, but I omit those. And we end up assuming that in all the universe all we ever see is just some one trial and don't say anything about the other trials but imagine about them a lot. That point may be hard to swallow.

IIRC, one line of argument is just that in probability there are lots of possibilities we just don't distinguish. E.g., maybe the police have long since concluded that nearly everyone driving a car with custom installed, hidden compartments is a drug dealer and then conclude that a person with such compartments is "likely" a drug dealer. Well, of course, actually, they might not be a drug dealer and have the car and its compartments for some other reason. So, the police are putting all owners of cars with hidden compartments in a box and refusing to distinguish them, insisting that they all be treated the same until there is evidence otherwise. It may be that more can be said. For now, make of such lines of thought what you will.

I don't see the point of introducing sigma-algebras if you're not doing probability based on measure theory.

As others have said I wouldn't suggest this exposition to someone learning probability for the first time, but it's not as bad if you're familiar with the material and need a quick review.

> The set of all sets in a space, X, is called the power set, P(X). The power set is massive and, even if the space X is well-behaved, the corresponding power set can often contain some less mathematically savory elements. Consequently when dealing with sets we often want to consider a restriction of the power set that removes unwanted sets.

I wish people could teach math in plain English. I don't know why the math and physics world refuses to write for the reader. I took this class before, and I still don't know what the author means by "less mathematically savory elements".

Here's how you explain things to humans:

> There is a set called the power set that contains all the sets in a space. This set is huge, and it contains [less mathematically savory elements]. This is why we usually use a restricted version that removes the unwanted sets.

Seriously, there's no point to this sort of fancy language. Math is already hard. No need to make it harder.

"There is a set called the power set that contains all the sets in a space"

I don't think he's the best expositor and some of his terminology is crappy, but I understood the author from what I've read so far. I literally have no idea what you're trying to say; this has no meaning.

I find that this guide unhelpfully conflates probability and inference in a few places. Probability theory on its own is interesting but not terribly useful without the infrastructure of estimation.

NO NO NO!!! Don’t start with Venn diagrams, sets, and other such fluff. Reminds me of the thin, little book they tried sticking on us in my probability class; undergrad EE. It was meant for math majors.

There is a book “Probability and Statistics for Engineers and Scientists” by Ronald Walpole. That book is excellent. Rolling dice and pulling colored marbles from jars is how you teach probability.

I studied probability during my undergrad (and high school) using dice, coins and other such things. It made sense to me but there was a dark area in my understanding. It felt like a blind spot and I could never get into it. In the final year of engineering, we had someone do a quick refresher on probability as a prelude to a longer course on pattern recognition and he described the whole thing using set theory (Venn diagrams, functions mapping from one space to another etc.) and I felt that the blind spot was illuminated. So, I don't know if starting from there would make sense but I do think it's useful, at least sometime in your studies, to look at the whole system through this lens.

I've been working through http://www.greenteapress.com/thinkbayes/ and am quite enjoying it. My only complaint is that he, as intended, teaches using programs and a computer and I learn better by doing stuff by hand. He also has a think stats book at http://www.greenteapress.com/thinkstats/ which people might find interesting.

There is a good connection between probability and Venn diagrams: Both are about area. Probability is about area where the area of everything under consideration is 1. So, there is a set of trials. It has area 1. Each subset of the set of trials is an event and has an area, its probability. Then we can move on to random variables, distributions of random variables, independence of events and random variables, the event that a random variable has value <= some real number x, etc.

In pure math, since H. Lebesgue in about 1900, the usual good theory of area is Lebesgue's measure theory. The ordinary ideas of area we learned in grade school, plane geometry, and calculus are all special cases. But Lebesgue's theory of area handles some bizarre, pathological, extreme cases. And we can show that there can be no really perfect theory of area -- e.g., there have to be some bizarre subsets of the real line to which no nice theory of area can assign a length. But, once we have the Lebesgue theory, the usual way to show that there is a subset of the real line without an area uses the axiom of choice.

Well, in 1933, A. Kolmogorov wrote a paper showing how Lebesgue's theory of area would make a solid foundation for probability, and that approach is the standard one for advanced work in probability, statistics, and stochastic processes.

I agree that to build fundamental intuition dice and marbles are great. They only take you so far, though, and it would be terribly wasteful not to utilize mathematical machinery that already exists. Practically applied mathematics is a difficult tool to wield but incredibly powerful. I.e. you need to know when and how to apply it, but when it's used correctly it's immensely practical.
