Remember DSEE / ClearCase? They had all sorts of complicated virtual file systems to deliver tagged and branched contents of source code repositories. But drive space expanded with a Moore's-law style curve and now we have "git pull". Far simpler. System administrators don't hate our guts for adopting git.
Remember PHIGS? We needed display lists for graphics engines because host machines were too slow. Silicon Graphics took the other approach, and now we have GL.
Remember terminal concentrators like Digital's LAT? You don't? Good. I wish I didn't. (Handling 9600 baud interrupts was too big a load for a host machine. Really.)
Remember optical typesetting machines? The digital outlines / images for creating nice-looking letters used to be too big and complex to use for creating the images for actual pages. You want to use Univers or Gill Sans to set a document? Fine. Buy a Selectric typeball. Or go pay Linotype or Monotype a bundle for a little optical thingy with images of all the letters on it. Take the lid off your typesetting machine and put that thingy into it. You want to set Japanese? Too bad for you. Apple, Adobe, Chuck Bigelow and Kris Holmes, and Matt Carter, and Donald Knuth, decided to ride the exponential rocket and the rest is history.
The bright side: Margaret Hamilton and her team on the Apollo moonshot project used simple, reliable, radiation-hardened, and redundant computers with bug-free software to get those guys to the moon and back.
Let's be careful: Generals always fight the last war. Before starting new valiant efforts, we should carefully assess whether the appropriate technology for the planned delivery date is JMOS -- Just a Matter of Software. Sometimes it might not be the case. But most often it will be.
Let's take computer vision. Alex Krizhevsky et al destroyed the ImageNet competition with a neural network in 2012, kicking off the current AI hype cycle. Essentially everything in their model had been known about since the late 80s. But we also didn't know how to train deep networks much before this (it turned out how you initialise the neural network was important), and we also didn't have a big enough dataset to train such a deep model on until ImageNet. Since then, we have built models that perform another order of magnitude better than the 2012 model, mainly because of improvements to the architectures (a combination of ingenuity and a lot of trial and error).
So compute is necessary, but it isn't enough. I don't buy that we've 'brute forced' image recognition in the same way as chess.
Many of these things have required the giant leaps in compute, but still wouldn't work at all without the concurrent improvements in algorithms.
Along these lines, here's a classic blog post:
"Grötschel, an expert in optimization, observes that a benchmark production planning model solved using linear programming would have taken 82 years to solve in 1988, using the computers and the linear programming algorithms of the day. Fifteen years later — in 2003 — this same model could be solved in roughly 1 minute, an improvement by a factor of roughly 43 million. Of this, a factor of roughly 1,000 was due to increased processor speed, whereas a factor of roughly 43,000 was due to improvements in algorithms! Grötschel also cites an algorithmic improvement of roughly 30,000 for mixed integer programming between 1991 and 2008."
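The arithmetic in that quote is easy to check. A minimal sketch using only the rounded figures quoted above (the variable names are mine, not Grötschel's):

```python
# Rounded figures taken from the quote; names are illustrative.
MINUTES_PER_YEAR = 365.25 * 24 * 60

time_1988_minutes = 82 * MINUTES_PER_YEAR  # "82 years", expressed in minutes
time_2003_minutes = 1.0                     # "roughly 1 minute"

total_speedup = time_1988_minutes / time_2003_minutes
hardware_factor = 1_000      # attributed to processor speed
algorithm_factor = 43_000    # attributed to better algorithms

print(f"82 years / 1 minute   ~ {total_speedup:,.0f}")
print(f"hardware x algorithms = {hardware_factor * algorithm_factor:,}")
```

Both numbers land at about 43 million, so the two factors multiply out to the total speedup, with algorithms contributing the larger share.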
I'll agree that the author emphasizes compute power, but his real point still holds. Monte Carlo search may not be classic brute force, and neural networks guiding it may also not be standard, but the two just let you effectively search on a massive scale.
Yes. The point of the author is that it doesn't do this symbolically.
Don't get confused with the terms "brute force", "neural net", etc.
The main idea of the author is that AI that uses brute force, simpler statistical methods, NN, etc, wins over AI that tries to implement some deeper reasoning about the problem domain the way humans do (when thinking about it consciously).
A NN doesn't work with the domain objects directly and abstractly (e.g. considering a face, facial features, smiles, etc as first class things and doing some kind of symbolic manipulation at that level).
It crunches numbers that encode patterns capturing those things, but its logic is all about numbers, links between one layer and another, and so on -- it's not a program dealing with high level abstract entities.
To put it in another way, it's the difference between teaching, say, Prolog to identify some concept and a NN to do the same.
E.g. from the link "The most successful form of symbolic AI is expert systems, which use a network of production rules. Production rules connect symbols in a relationship similar to an If-Then statement. The expert system processes the rules to make deductions and to determine what additional information it needs, i.e. what questions to ask, using human-readable symbols."
A NN does nothing like that (not in any immediate, first class, way, where the rules are expressed as plain rules given by the programmer, like "foo is X", "bar has the Y property", etc).
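To make "plain rules given by the programmer" concrete, here is a toy forward-chaining sketch in Python. The facts and rules are invented for illustration and are not from the linked article; a real expert system would also decide which questions to ask.

```python
# Hypothetical rules and facts, for illustration only.
# Each rule reads: IF all conditions are known THEN conclude the fact.
RULES = [
    ({"has_feathers", "lays_eggs"}, "is_bird"),
    ({"is_bird", "cannot_fly", "swims"}, "is_penguin"),
]

def forward_chain(facts, rules):
    """Apply if-then rules repeatedly until no new fact can be deduced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known

derived = forward_chain(
    {"has_feathers", "lays_eggs", "cannot_fly", "swims"}, RULES
)
print(sorted(derived))
```

Every deduction here can be traced back to a named rule over human-readable symbols, which is exactly the property a trained network's weights don't give you directly.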
Here's another way to see it: how you'd solve a linear equation with regular algebra (the steps and transformation etc), and how a NN would encode the same.
A symbolic algebra system will let you express an equation in symbolic form (more or less like a mathematician would write it), and even show you all the intermediate steps you'd take until the solution.
A NN trained to solve the same type of equations doesn't do that (and can't). It just tells you the answer (or an approximation thereof).
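The symbolic side of that comparison can be sketched in a few lines. This is a toy for equations of the form a*x + b = c only, with a function name I made up; a real computer algebra system is far more general.

```python
from fractions import Fraction

def solve_linear_with_steps(a, b, c):
    """Solve a*x + b = c exactly, recording each algebraic step."""
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    if a == 0:
        raise ValueError("a must be nonzero for a unique solution")
    steps = [f"{a}*x + {b} = {c}"]
    rhs = c - b
    steps.append(f"{a}*x = {rhs}")   # subtract b from both sides
    x = rhs / a
    steps.append(f"x = {x}")         # divide both sides by a
    return x, steps

x, steps = solve_linear_with_steps(3, 4, 19)
for line in steps:
    print(line)
```

The intermediate lines are the point: the program manipulates the equation itself, where a trained network would only map the numbers (3, 4, 19) to something close to 5.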
'compute beats clever'? 'fast > fancy'? 'better big than bright'? 'in the end, brute force wins'?
Or at least call it "AI's Bitter Lesson" or something!
> One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.
> The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries.
From what I know about brain development, "search and learning" are key mechanisms. Plus massive overproduction and selection, which is basically learning. Maybe that's the main takeaway from biology.
Also, as in evolution, ~random variations occur during neuronal proliferation, so there's also selection on epigenetic differences. The same sort of process occurs in the immune system.
In this way, organisms can transcend limitations of their genetic sequences. There's learning at levels of both structure and function.
But from the AI researcher's view, "the other" doesn't require time; someone else is advancing the hardware, which is outside of the AI researcher's area. The general method to-be-run on better hardware is known today; it doesn't have to be researched. So should the AI researcher just twiddle their thumbs, waiting for the hardware to improve?
In games like chess, it has long been known that if you have a big enough database, an optimal game can be played. For each board configuration, the entry in the database supplies the optimal move.
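Here is the same idea on a game small enough to tabulate completely. This is a sketch with a toy subtraction game I picked for illustration (take 1 to 3 stones, whoever takes the last stone wins); a full table for chess would be far too large to build this way.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # a player may take 1, 2, or 3 stones; taking the last stone wins

@lru_cache(maxsize=None)
def is_winning(stones):
    """True if the player to move can force a win from this position."""
    return any(t <= stones and not is_winning(stones - t) for t in MOVES)

def build_table(max_stones):
    """Map every position to an optimal move, or None if every move loses."""
    table = {}
    for stones in range(1, max_stones + 1):
        table[stones] = next(
            (t for t in MOVES if t <= stones and not is_winning(stones - t)),
            None,
        )
    return table

TABLE = build_table(20)
print(TABLE[10])  # optimal play is now a single dictionary lookup
```

Once the table exists, playing optimally needs no reasoning at all, just a lookup per position.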
Momentum lends itself to easy statistical support, and paradigm shifts are notoriously difficult to predict with any degree of confidence.
No matter how long in the tooth a particular trend might be, no matter how certain you are that a reversal is imminent, it's hard to push against the weight of trend-line evidence.
Maybe as someone who hasn't been steeped in AI for the past several decades, I'm not able to appreciate the depth of emotion behind Sutton's statements. I find this kind of vague pontificating to be boring. It seems aimed more at convincing the author of a position, than doing a critical analysis and convincing the reader.
Will compute+data solve AI? Will structured algorithms solve AI? Will neuroscience provide key breakthroughs? Who knows? We'll find out when we find out. This article provides little value beyond historical reminiscing.
In the meanwhile, there are many interesting problems begging for attention where data or compute is limited.
I'll believe it when these techniques can solve real problems in a robust way. More importantly, when they can solve real problems in a robust way, opinions won't matter! Proponents won't need to go around trying to convince people, almost in the manner of superstitious belief.
I'm not sure that, had this been written 10-20 years ago, "learning" would have figured so prominently. Who's to say there isn't a third such big method?
Also, while the lesson fits the facts (easy in hindsight), it will hold... until it doesn't anymore. The end of Moore's law has long been heralded, and we're starting to enter this era. Progress can probably still be made, but transistors can't get any tinier, and you can only put so many cores on one chip. Hardware may continue to provide "free gains", but those will likely be an order of magnitude (or more?) smaller than before.
I think Moore's law is interesting. Technically Moore's law is about transistor density/integration, but in effect it became about CPU performance, and similar phenomena were seen in disk and network performance. Just now we are seeing a move in general architecture away from spinning rust and towards chip-based storage (SSDs and Optane, or just huge DRAM), which has been much slower than I thought but is still happening. There will be more progress as we wring out the opportunities in architecture and network devices, but overall you are right - no more Moore's.
Also there's been a wave of progress funded by excitement - it's really hard to see how Google justified the spend on DeepMind's TPU infrastructure, but they did - in contrast to a rational investment from a research council, which would never have bought into AlphaZero and the rest.
There's opportunity to do more - big gaps in datasets, evaluation metrics, refinement of techniques (mac nets, adversarials etc) - but it's back to hardscrabble now, and I'm interested to see if this is a Warren Buffett moment. After all, you only see who's wearing shorts when the tide goes out!
As you should be, but anti-hindsight bias (or hindsight anti-bias?) is even worse. Not accusing you of that; just making a general observation. Hindsight should inform, not bias in either direction.
It's the best kind of bias.
I clearly need my morning coffee.
In the minds of many business executives and government officials, "explainable AI" means, quite literally, "show it to me as a linear combination of a small number of features" (sometimes called "drivers" or "factors") that have monotonic relationships with measurable outcomes.
I would go further: most people are understandably scared of, and worried about, intelligence that arises from scalable search and learning by self-play.
It seems self-evident to me that 'brute force' is the most general strategy there is. Any (computable) problem is theoretically solvable by just coding the simplest, most obvious solution, which is usually pretty easy. The run-time of brute force is sometimes an issue, but that just means you need more of it!
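As a sketch of what "the simplest, most obvious solution" looks like, here is brute-force subset sum in Python (the problem choice and function name are mine, picked purely as an example):

```python
from itertools import combinations

def subset_sum_brute_force(numbers, target):
    """Try every subset, smallest first; return the first one that sums to target."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if sum(subset) == target:
                return subset
    return None

print(subset_sum_brute_force([3, 34, 4, 12, 5, 2], 9))  # (4, 5)
```

It is obviously correct and took a minute to write. It also examines up to 2^n subsets, which is where "you need more of it" comes in.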
I'd like to remind everyone that science is in the business of understanding, of making things less opaque and less magic, and engineering benefits from both.
A human is significantly better on game 11 than game 1 (I recently got into Starcraft). Current ML systems are not. It's up for discussion how to take the human's previous experience into account, but the total amount of experience is significantly less than the computer's.
Sutton takes an even more extreme point of view, suggesting that most human feature engineering is similarly a waste of time. It's hard to argue with if you know the history: some of the best computer vision algorithms use exactly two mathematical operations, convolution (which itself only requires addition and multiplication) and the max(a,b) function. (This is true because both ReLU and MaxPool can be implemented with max(), and because a fully connected layer is a special case of a convolution.) A similar story occurred in speech recognition, with human-designed features like phonemes and MFCCs giving way to end-to-end learning. Indeed, even general purpose fully connected neural networks started to work much better once the biologically-motivated sigmoid() and tanh() were replaced with the much simpler ReLU function, which is just ReLU(x) = max(x, 0). What really made the difference was leveraging GPUs, using more data, automating hyperparameter selection, and so on.
I'm not sure if there's really a lesson there, or if this trend will hold indefinitely, and I'm not sure why the lesson would be "bitter" even if it holds. Certainly opinions are mixed. On the one hand, many researchers such as Andrew Ng are big proponents of end-to-end learning; on the other hand, no one can currently conceive of training a self-driving car that way. But avoiding domain-specific, human-engineered features may be a viable guiding philosophy for making big, across-the-board advances in machine learning.
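The parenthetical about max() and convolution can be spelled out in plain Python. This is a 1-D sketch with helper names of my own; real frameworks work on batched multi-channel tensors, but the reduction is the same.

```python
def relu(x):
    """ReLU is a single call to max()."""
    return max(x, 0.0)

def max_pool_1d(values, window):
    """Non-overlapping max pooling built from repeated max(a, b)."""
    pooled = []
    for start in range(0, len(values) - window + 1, window):
        best = values[start]
        for v in values[start + 1:start + window]:
            best = max(best, v)
        pooled.append(best)
    return pooled

def conv1d_valid(signal, kernel):
    """'Valid' sliding dot product (cross-correlation, which is what
    deep learning libraries usually call convolution)."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

def fully_connected(inputs, weight_rows):
    """Each output unit is a kernel as long as the input, so the
    'convolution' has exactly one valid position."""
    return [conv1d_valid(inputs, row)[0] for row in weight_rows]

x = [1.0, -2.0, 3.0, -4.0]
print([relu(v) for v in x])
print(max_pool_1d(x, 2))
print(fully_connected(x, [[1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]]))
```

Everything above is built from addition, multiplication, and max(a, b), which is the whole of the claim.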
In fact, wasn't there an article posted here recently saying that they'd had good results with using learned features to feed traditional non-NN-based machine learning?
They anticipate that real advances will be made by massively scaling up the compute power they throw at any given problem. That’s driving their fundraising efforts.
If the past 5 years are anything to go by, they’re right.
These aren't discarded, they are part of ML vision networks today. Edges are one of the 3x3 convolutions that a network can learn, SIFT/etc are the dense / clustering nets, I'll admit I just googled Generalized Cylinders (very interesting). There are others like SLAM as well.
to make this more readable on a desktop...