Last year I actually made an applied-CoreML app to solve sudoku puzzles where MNIST came in very handy.
I wrote about it here: https://blog.prototypr.io/behind-the-magic-how-we-built-the-...
600,000?!? Even divided by 81 that's over 7000! How long did this take?
I just hacked into my app's flow to upload a "scan" of the isolated puzzle to my server instead of slicing it and sending the component images to CoreML.
Then I sat there and flipped through page after page of Sudoku puzzles and scanned them from a few different angles each, sliced them in bulk on the server, and voila: data!
The app already had the code for "isolate the puzzle and do perspective correction" so the uploaded images all looked something like this: https://magicsudoku.com/example-uploaded-image.png
By "slicing in bulk" I mean the server was the one that split that out into 81 smaller images rather than the app doing the slicing and uploading 81 small images.
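For anyone curious what that server-side slicing might look like, here is a minimal sketch. It assumes the uploaded scan is already a square, perspective-corrected image loaded as a NumPy array, and that the grid divides evenly into 9x9 cells; the function name `slice_puzzle` and the 252x252 size are made up for illustration and are not from the actual app.

```python
import numpy as np

def slice_puzzle(image: np.ndarray, grid: int = 9) -> list:
    """Split a perspective-corrected puzzle image into grid*grid cell images.

    Hypothetical helper: assumes the cells are equally sized and ignores
    any leftover pixels when the image size is not a multiple of `grid`.
    """
    height, width = image.shape[:2]
    cell_h, cell_w = height // grid, width // grid
    cells = []
    for row in range(grid):
        for col in range(grid):
            top, left = row * cell_h, col * cell_w
            cells.append(image[top:top + cell_h, left:left + cell_w])
    return cells

# Stand-in for an uploaded scan: 252 = 9 * 28, so each cell is 28x28.
scan = np.zeros((252, 252), dtype=np.uint8)
cells = slice_puzzle(scan)
print(len(cells), cells[0].shape)
```

A real pipeline would probably also trim the grid lines from each cell before training, which this sketch does not attempt.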
I scanned them from different angles because the perspective correction adds distortions that I didn't want my model to be sensitive to.
Another way to formulate this question: "given training data that only tells you about digits, how do you know whether something is a digit or not?" Given that the training data never actually defines what isn't a digit, how can we ensure that the model actually sees a digit at test time? If we cannot ensure this (e.g. an adversary or the real world supplies inputs), how can we "filter out" bad inputs?
A quick hack that works well in practice is to examine the "predictive distribution" across digit classes. Researchers have empirically found that its entropy tends to be higher (i.e. the distribution is flatter, closer to uniform) when the model sees an OoD (out-of-distribution) input. However, the OoD problem is not fully solved.
Here's a nice survey paper on the topic: https://arxiv.org/abs/1809.04729
Note that methods that tie OoD to the task at hand (classification) are not actually solving OoD; they are solving "predictive uncertainty" of the task.
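The entropy check mentioned above can be sketched in a few lines. This is only an illustration: the function names and the threshold of 1.0 nats are assumptions, and in practice the threshold would have to be tuned on held-out data.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def looks_out_of_distribution(probs, threshold=1.0):
    """Flag an input whose predictive distribution is suspiciously flat.

    `threshold` is an assumed value chosen for this example only.
    """
    return predictive_entropy(probs) > threshold

# A confident prediction: most of the mass on one of ten classes.
confident = [0.91] + [0.01] * 9
# A maximally uncertain prediction: uniform over ten classes.
uniform = [0.1] * 10

print(looks_out_of_distribution(confident))
print(looks_out_of_distribution(uniform))
```

The uniform case has the largest possible entropy for ten classes, ln(10), so it is flagged, while the confident case is not. As noted, this measures the classifier's uncertainty rather than detecting OoD inputs directly.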
1) Integrated. Represent 'no number' as class number 11 in the original model. Retrain it with this additional class (needs additional training data).
2) Cascading. Train a dedicated model for 'number' versus 'no number' (binary classifier), and use that in front of the original model.
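As a rough sketch of the cascading option (2): the two models are passed in as plain callables, and the stand-ins used below are hypothetical placeholders rather than trained classifiers.

```python
from typing import Callable, Optional, Sequence

Image = Sequence[Sequence[float]]

def classify_cell(
    image: Image,
    is_digit: Callable[[Image], bool],
    which_digit: Callable[[Image], int],
) -> Optional[int]:
    """Run the binary 'number vs. no number' gate before the digit model.

    Returns None when the gate rejects the input, so the digit classifier
    never sees something it was not trained on.
    """
    if not is_digit(image):
        return None
    return which_digit(image)

# Hypothetical stand-ins for the two trained models.
def fake_gate(image: Image) -> bool:
    # Treat any cell containing ink as "probably a digit".
    return any(pixel > 0 for row in image for pixel in row)

def fake_digit_model(image: Image) -> int:
    # Placeholder: a real model would return its predicted class here.
    return 7

blank_cell = [[0.0] * 28 for _ in range(28)]
inked_cell = [[0.0] * 28 for _ in range(28)]
inked_cell[14][14] = 1.0

print(classify_cell(blank_cell, fake_gate, fake_digit_model))
print(classify_cell(inked_cell, fake_gate, fake_digit_model))
```

The integrated option (1) would instead add the extra class to the digit model's output and drop the gate.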
Note that the MNIST data comes already extracted from the original image and centered in fixed-size images of 28x28 pixels. In a practical ML application these steps would also need to be done before classification can be performed.