Uber cracked two 80s video games by giving an AI algorithm a new type of memory (www.technologyreview.com)
75 points by rbanffy 5 months ago | 40 comments

Before I read the article, I thought it was about a different type of cracking, one which is often applied to 80s video games.

Same ;)

From the article: lots of behaviors that are necessary to advance within the game do not help increase the score until much later

I found this passage very interesting, as it reads like a definition of the benefits that can come from delayed gratification: optimizing for a local maximum isn't always the best way to reach the global maximum. Maybe algorithms that can encapsulate that will be useful in getting corporations away from the habit of chasing the highest short-term gains at the expense of long-term viability.

Another way to look at it is that these games have less obvious causal relationships. If the game is Pac-Man, it is immediately obvious that eating a dot makes the score go up, because the effect always and immediately follows the cause. The further the effect is separated from the cause, the more possibilities the machine must entertain. Imagine, in a different game, you activate three switches and a door opens. Why did the door open? Maybe all three switches must be activated for the door to open. Maybe the final switch controls the door and the other switches control something else. Maybe you must always activate the switches in a particular order as a security measure. The game probably gives context (e.g. the third switch is labeled 'open door') that a human can use to eliminate the many possibilities, but the machine must experiment before knowing what is relevant. When you separate cause and effect in time, the machine must deal with many possible 'switches'.

Go-Explore is weird to me. The main reasons for its success all rest on very specific things that barely transfer to anything else, such as the ability to save a state and restart from it at any time (in this case using features of the emulator). It looks almost like an engineering project.
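To make the save-and-restart idea concrete, here is a toy sketch. It is not Uber's implementation: the `Corridor` environment, its `save`/`restore` methods, and `explore_from_archive` are all hypothetical names made up for illustration, and the "cell" is simply the agent's position. It only shows the loop of picking a remembered state, restoring it, and exploring randomly from there.

```python
import random


class Corridor:
    """Hypothetical toy environment: reward only at the far right end."""

    def __init__(self, length=30):
        self.length = length
        self.pos = 0

    def save(self):
        # Stands in for an emulator save state.
        return self.pos

    def restore(self, snapshot):
        self.pos = snapshot

    def step(self, action):
        # action is -1 (left) or +1 (right); walls clamp the position.
        self.pos = max(0, min(self.length, self.pos + action))
        return 1.0 if self.pos == self.length else 0.0


def explore_from_archive(env, iterations=20000, burst=5, seed=0):
    """Repeatedly restore a remembered cell and take a few random steps."""
    rng = random.Random(seed)
    archive = {env.pos: env.save()}  # cell -> snapshot that reaches it
    for _ in range(iterations):
        cell = rng.choice(list(archive))   # pick any remembered cell
        env.restore(archive[cell])         # jump straight back to it
        for _ in range(burst):
            reward = env.step(rng.choice((-1, 1)))
            archive.setdefault(env.pos, env.save())  # remember new cells
            if reward > 0:
                return archive
    return archive
```

The whole thing depends on `restore` being available, which is exactly the objection above: outside a simulator or emulator, there is usually no such call.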

I don't like bad-mouthing the work of others, so I kind of feel bad for saying this, and maybe someone can prove me wrong, but is this not borderline "cheating" (in quotes because obviously there are no agreed-upon rules)?

The goal never was to solve Montezuma's Revenge just for the sake of it. We could have done that long ago with hard engineering. The interesting thing about this game is that it reveals inherent flaws in our reinforcement learning approach. And so, if you manage to solve this game using a domain-transferable approach, surely that means you have made some significant progress in RL. That doesn't seem to be the case here?

I thought so too at first, but here's my new thinking:

The most efficient way to learn real world tasks (for say a robot) is to learn in a realistic simulated environment, which ideally can be controlled by the agent, e.g. rewinding state after a failure to correct that particular behaviour. Ideally the simulation is regularly calibrated against real experiments, including confidence intervals on the real and simulated sensor inputs.

Given that model of learning, Uber's approach here is very reasonable.

Good point! Uber definitely has simulated road environments, so it does make sense. Thanks for sharing.

I had the same thought after reading the article; thanks for a good explanation of it.

This solution does not seem to bring us much closer to the possibly unsolvable problem of finding an ultimate reward for RL, as it is very specific to the game.

If they manage to get that behavior to emerge from something not so explicit, though, that would be a huge achievement.

From Alex Irpan's response:

I was sour on the results themselves, because they smelled too much like PR, like a result that was shaped by PR, warped in a way that preferred flashy numbers too much and applicability too little

Harsh! OpenReview's double blind system seems to work quite well in this regard as a peer review mechanism.

Quick Opinions on Go-Explore


Of course it's harsh when you pull the quote out of context:

> Like Go-Explore, this post had interesting ideas that I hadn’t seen before, which is everything you could want out of research. And like Go-Explore, I was sour on the results themselves, because they smelled too much like PR, like a result that was shaped by PR, warped in a way that preferred flashy numbers too much and applicability too little.

That little bit is actually talking about a different, earlier AI game-playing effort. The bulk of Irpan's blog post is describing Go-Explore, and it is not as harsh as the grandparent of this comment suggests.

I think the article's author makes it very clear what he dislikes: that the research is twisted by PR into making inappropriate claims and comparisons, even though it contains interesting and novel ideas. That is in no way harsh on the actual research or the validity of the results themselves.

Link to Uber Engineering page on this: https://eng.uber.com/go-explore/

From the linked page:

To enable the community to benefit from Go-Explore and help investigate its potential, source code and a full paper describing Go-Explore will be available here shortly.

Thanks, was looking for that link.

Tangent: I notice I find it really annoying whenever an internet article talks about a blog post or other article, yet doesn't link to the source on the spot. Take this sentence from technologyreview's article:

> The approach leads to some interesting practical applications, Clune and his team write in a blog post released today

There is no excuse to have a sentence like this and not have "a blog post" be a hyperlink. It feels rude somehow, like it's breaking internet etiquette.

I agree, it's rude and it breaks internet etiquette. It also makes it much harder for search crawlers to map the links between pages.

In the article, Uber says "Surprisingly, despite considerable research effort, so far no algorithm has obtained a score greater than 0 on Pitfall."

I played Pitfall as a kid and it seems quite straightforward for a computer to solve... jump over the puddle. I'd like it if someone could talk more about this game in particular, specifically why it's so hard for AI to solve. Any interesting papers or links on the subject?

I don't know Pitfall, but generally speaking, the games that are hard to solve are the ones with sparse rewards. In the game, do you get points when you overcome some obstacles? If not and you only get points for successfully completing the entire level, then it becomes very hard for an agent to learn: How can it know if it is improving within the level itself?

I would do the following.

1) Train a model to predict what happens next given an input
2) Each frame, predict the next frames given all possible inputs
3) Choose the input that maximizes uncertainty

I would expect this to learn to avoid deaths relatively quickly. It doesn’t need to be good at knowing what will happen next, just better at recognizing specific dead ends (e.g. spikes or holes).
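A rough sketch of the three steps above, under heavy simplifying assumptions: states are hashable labels rather than frames, the "model" is just a table of observed transitions, and uncertainty is the entropy of the predicted next state plus a term that shrinks with experience. `TabularModel` and `pick_action` are hypothetical names, not from any library.

```python
import math
from collections import Counter, defaultdict


class TabularModel:
    """Predicts next states by counting what followed each (state, action)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # (state, action) -> next-state counts

    def update(self, state, action, next_state):
        # Step 1: learn from an observed transition.
        self.counts[(state, action)][next_state] += 1

    def uncertainty(self, state, action):
        # Step 2: score how unsure the model is about this input.
        seen = self.counts.get((state, action))
        if not seen:
            return math.inf  # never tried: maximally uncertain
        total = sum(seen.values())
        entropy = -sum((c / total) * math.log(c / total) for c in seen.values())
        return entropy + 1.0 / math.sqrt(total)


def pick_action(model, state, actions):
    # Step 3: choose the input whose outcome is least predictable.
    return max(actions, key=lambda a: model.uncertainty(state, a))
```

With real frames you would need a learned predictor instead of a table, and one known caveat is that truly random elements on screen stay unpredictable forever, so they can attract this kind of agent indefinitely.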

You can reward moving as far to the right as possible or something like that.

Ah, that would be it. There's no way to tell you are heading to victory.

Shouldn’t the AI be rewarded for exploring places it hasn’t explored before? It seems like that would help it progress in Pitfall.

This is touched on in the article:

"AI researchers have typically tried to get around the issues posed by Montezuma’s Revenge and Pitfall! by instructing reinforcement-learning algorithms to explore randomly at times, while adding rewards for exploration—what’s known as “intrinsic motivation.”

But the Uber researchers believe this fails to capture an important aspect of human curiosity. “We hypothesize that a major weakness of current intrinsic motivation algorithms is detachment,” they write. “Wherein the algorithms forget about promising areas they have visited, meaning they do not return to them to see if they lead to new states."
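For anyone unfamiliar with the "rewards for exploration" idea in that quote, one common form is a count-based bonus. This is a minimal sketch, assuming states can be hashed and counted directly; `NoveltyBonus` is a made-up name and the 1/sqrt(count) schedule is just one conventional choice.

```python
import math
from collections import Counter


class NoveltyBonus:
    """Adds a bonus that shrinks each time the same state is visited."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.visits = Counter()

    def reward(self, state, extrinsic=0.0):
        self.visits[state] += 1
        # Bonus decays as 1 / sqrt(visit count) for this state.
        return extrinsic + self.scale / math.sqrt(self.visits[state])
```

Note that the bonus only says "this place is new"; nothing in it brings the agent back to a promising frontier it has wandered away from, which is the detachment problem the quote describes.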

Yes, it's an active topic of research: https://openreview.net/pdf?id=rJNwDjAqYX

Couldn't you use negative rewards, penalizing deaths so that it's better to avoid them?

Wouldn't the optimal strategy be just to stand still then?
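The arithmetic behind that worry, as a small sketch. It assumes standing still is risk-free and earns exactly 0, and that each step of movement carries an independent chance of death; both are simplifications, and `expected_return` is a hypothetical helper.

```python
def expected_return(p_death_per_step, steps, death_penalty=-1.0, goal_reward=0.0):
    """Expected return of moving for `steps` steps with a per-step death risk."""
    p_survive = (1.0 - p_death_per_step) ** steps
    return (1.0 - p_survive) * death_penalty + p_survive * goal_reward


STAND_STILL = 0.0  # assumed: no risk, no reward

# Death penalty only: any movement has negative expected return.
moving_no_goal = expected_return(0.1, 10)

# Same risk, but with a large enough reward at the end, moving wins.
moving_with_goal = expected_return(0.1, 10, goal_reward=5.0)
```

So with a death penalty alone, moving always scores below `STAND_STILL`; the penalty only helps once there is some positive signal for the risk to be traded against.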

This sounds like the plot of WarGames:

An AI tries to figure out how to win at nuclear war, then concludes that:

"Nuclear war is a strange game in which the only winning move is not to play."

I too have never scored greater than 0 on Pitfall! Harder than Ghosts 'n Goblins.

"The problem with both Montezuma’s Revenge and Pitfall! is that there are few reliable reward signals. Both titles involve typical scenarios: protagonists explore blockish worlds filled with deadly creatures and traps. But in each case, lots of behaviors that are necessary to advance within the game do not help increase the score until much later."

Opinion: Equally true in non-incubator-assisted entrepreneurship... that is, real entrepreneurship...

Can someone explain how this might help in optimising vehicle routes as opposed to existing combinatorial algorithms made expressly for this purpose?

Given that they name "robot learning" as an application, the target domain is probably self-driving cars, not route optimization.

“Better reinforcement-learning algorithms could ultimately prove useful for things like autonomous driving and optimizing vehicle routes”

Ah, I had missed that line. Unless I missed it in the Uber post too, that does seem to be a claim the TR writer added?

This kind of research should be done by universities or already profitable companies. Using someone else's VC money for this seems questionable at best.

It is extremely useful for recruiting talent. Top talent wants to do basic research, so this is a powerful lure for keeping them at the company rather than in academia. I'm at NeurIPS this week, and many companies are here and they heavily advertise how many papers they have in the conference as part of their recruitment efforts for both (applied) AI research and AI engineering positions.

Exactly. This is one thing I always ask when interviewing for a new job, and any company that cannot give me a decent answer goes to the bottom of my list.

This is nonsense. What if this is research around problems they need to solve in order to improve their self-driving car? Because clearly that's the goal here.

Should they tell investors to patiently wait while some academic comes up with a solution to their problem?

If I were a VC with a stake in Uber and they weren't researching this area, I would be pretty mad.

Yet again, research and breakthroughs are not made by profitable companies. Research money most often comes from government, with little to no incentive for ROI.

>> Yet again, research and breakthroughs are not made by profitable companies.

I'm pretty confident that this is flat-out wrong. Just within the tech sphere you have Intel, Microsoft, Apple, Google, IBM, Samsung, and Baidu, all of which have historically put out large amounts of research. Outside of the tech sphere, look at Pfizer, Johnson & Johnson, GM, Toyota.

Here's a slightly dated summary of research spending by large (profitable) companies https://www.recode.net/2017/9/1/16236506/tech-amazon-apple-g....
