Hacker News
Show HN: This Question Does Not Exist (stackroboflow.com)
274 points by yeldarb 2 months ago | 141 comments

I created this site using a Fast.ai trained language model using the Stack Overflow data dump.

Full writeup available here: https://stackroboflow.com/about/index.html

Interesting things I’ve noticed so far:

* It does a remarkably good job of context switching between programming languages based on the semantics of the question! If the question is about SQL it often includes SQL in < code > tags. If it’s about JavaScript it will include JavaScript! The syntax isn’t perfect due to the tokenizer mangling some things but it’s pretty close!

* The English grammar isn’t perfect but it’s pretty good.

* It doesn’t seem to lose track of closing tags and quotes.

* It's learned to sometimes pre-emptively thank people for their answers and to "edit" in "updates" at the end of the post.

If you find any interesting ones you can share them with the permalink! Use the "Fresh Question" button to load a new one.

I found this part of the about page interesting:

> Originally, I wanted to predict the number of upvotes and views questions would get (which, intuitively, I thought would be a good proxy for their quality). Unfortunately, after working on this for about a week straight I've come to the conclusion that there is no correlation between question content and upvotes/views.

> I tried several different models (including adapting an AWD_LSTM classifier, a random forest on a bag of words, and using Google's AutoML) and none of them produced anything better than random noise.

> I also tried using myself as a "human classifier" and given two random questions from StackOverflow I can't predict which one will be more popular.

> Answering Questions Right now the model only generates questions. In version 2 I want to train it to answer questions. If I could get this working it'd actually become a useful tool instead of a fun toy.

Looking forward to that part :D

I mean, those answers are probably not going to be correct, but I wonder how close they will be to something useful.

Yes, many times the questioner does not actually need an answer to the question, he just needs to look a little closer to the situation, which is potentially able to be automated. But one should not disguise such automation as an 'answer': more like a query autocheck but more tooled-up.

I wonder what percentage of questions just need a correctly working example because the questioner is unsure of how to use a given API. Automation of this I imagine could actually be doable.

Thanks, brilliant work! Some questions are downright hilarious (see a suite of automated packaging techniques [0]), and the broken English just adds extra ESL credibility to the questions.

[0] >I want to write an update statement with a sequence of values I can run through a database. i've written the below code to broken up my character string into the columns. All of the articles i've read seem to suggest that I 'll need a suite of automated packaging techniques for my environment to all update the database.

What s the best way to update the column ids?


I wouldn't doubt it was written by a human if I saw it on stackoverflow.

Thanks for including permalinks to questions, that's great for sharing!

How does this compare to the gpt by openai https://github.com/openai/gpt-2 ?

It’s a different model, the AWD_LSTM. The inventor of this model spoke about GPT-2 on this podcast and talked a bit about the differences: https://overcast.fm/+Goog4jsR8

"I have creating a PNG Image file where I am printing out the image's with different colors and Image Types. Now I am sure I am drawing properly, but what I'm seeing is that the image is not differently jpeg (ie FF or Chrome) and Safari (for Firefox) is different from the one in Firefox. "

As a bit of a connoisseur of babblebots over the decades, one of the interesting things about this generation is that it is producing text that has a very interesting effect in my mind. There is a part of the parsing process where the above text went down smooth; yup, that's what Stack Overflow questions from early developers tend to look like. That part of my brain issues no objection. But the next layer up screams bloody murder about how nonsensical that is. And it's not just "that's a bad question but I still see the order under it", but nonsense.

It's a combination I've not experienced before. Previous generation babblebots could often produce a lot of fun text, but every processing level above raw word processing has always been able to tell it's computer garbage, even when it blundered onto a particularly entertaining piece of garbage. We've actually successfully moved up a level here.

As I'm describing subjective experience, YMMV.

The experience you are describing reminds me of the comparative illusion [0], which is a grammatical illusion where certain sentences seem grammatically correct when you read them, but upon further reflection actually make no sense, example:

"More people have been to Berlin than I have."

[0] https://en.wikipedia.org/wiki/Comparative_illusion

Fascinating. There's a sentence I picked up from a friend in childhood, "Although the Moon is only an eighth the size of the Earth, it is much farther away," which seems to be similar, but not quite a CI, if I'm reading the Wikipedia article correctly. Thanks for the link.

This is like a mental tarpit, where you waste time reading trying to understand what the person is saying only to realize they were a bot and all your effort was for nothing, that is time you will never get back now.

This is a terrifying way to destroy an online community if a person floods it with nonsense content like this.

Terrifying -> inevitable. Imagine a botnet full of fake AI users trained with a corpus of legit HN posts. Let them loose commenting on random articles, beginning slowly but ramping up until they’re 99% of all comments.

In a few years, the standard Silicon Valley “Growth Hacking” job description will include using AI to deploy fake content to your competitors’ sites, destroying their user community.

Two potential solutions: Reputation and a new account fee.

Nonsense flooding will make it more difficult for people to establish their identities on a network, but once it's established, they'll be in the clear. If someone has to pay to have their first thread or two reviewed, it will take serious money to flood a site to death.

(A similar solution to email spam has been waiting to happen for decades -- charge a fraction of a penny per email, and nobody is harmed but spammers. Maybe allow exceptions for officially recognized organizations that have to send a lot of messages, like political campaigns.)

I was with you until you said you wanted to exempt the politicians. I would charge them double.

I share your concern.

This would be even better: Email recipients grant free access to whoever they want. A tiny price would be charged only when sending to someone who has not granted such access.

I like the idea.

Unfortunately, even though snail mail has associated costs I still get a ton of junk.

Is this a future? How much content on twitter/fb is autogenerated, auto-liked and auto-shared?

Some is deliberately auto-generated, like https://twitter.com/choochoobot ; but yes, there is definitely an awful lot of auto-liked auto-shared fake engagement out there.

If the answer is not "zero" then the answer is "too much."

I'm positively sure these tactics are already being deployed as a weapon in order to shut down debate of certain inconvenient topics and disrupt problematic communities.

Ugh. Like SV hasn't yet made the internet suck enough...

Indeed, there’s something almost unsettling about text that initially appears to follow a sort of internal logic, yet doesn’t. Some of the results read like a programming fever dream:

“I set each thread * pointer, adding a new thread and in a loop inside this function. The thread would be immediately on the thread, but the thread resulted in the exception. If I return the thread to the first thread and finally the thread is left, the thread doesn't hang, and I couldn't kill thread # 1 - because the thread method made first thread calls the native thread. But, the thread is waiting for the thread blocking and all the other threads to be started. In other words, the thread is always destroyed.”

Unsettling is exactly the word. https://stackroboflow.com/#!/question/16662 leaves me trying to work out what on earth the poster's really trying to do - even though I know full well that there is no poster...

A few times, I've come across Stackoverflow questions on technical topics I'm not very familiar with, and the question makes no sense to me (there are clear spelling, grammatical, and consistency errors). But there's an answer, and a comment exchange that seemed to resolve the question. So, I conclude, it's just that my unfamiliarity prevents me from seeing through those errors.

A related phenomenon is seeing fundamental errors in a newspaper article on a topic you're expert in... but believing articles on topics you're not familiar with.

This can operate as a partial turing test: a gradient for iteration.

You might like this post from last week:

Humans who are not concentrating are not general intelligences


On the other hand, cherrypicked excerpts can be terrifyingly convincing: "What is the best way to login to my Ruby application in a browser via Perl?"

We've all been asked a question like that and had a cold dread creep over us as we try to formulate a response...

It's the Uncanny Valley of text synthesis.

This one is golden:

"What's the best way to indeed start a process on an OS x machine?

What is the best way to start a process on Mac OS x Snow Leopard?

There I just need to be able to run the OS x.exe from the command line and it's working fine (make it available in Windows). But I'm on an Mac and I haven't figured out how to do this for a Linux machine.

Another reason I ask is that I only have a Unix shell running with the Python process in it (it's my an Ubuntu machine, nothing didn't work in the shell).

Thank you in advance"

https://stackroboflow.com/#!/question/16993 Thanks Steve

Without too much sarcasm I have received support requests that were far too close to this.


Laughing way too hard, but is it having a stroke? https://stackroboflow.com/#!/question/14791

This reminds me of Pozzo's famous line in Beckett's Waiting for Godot.

I propose a new kind of Turing Test.

Gather equal numbers of the least intelligible questions from SO (possibly using a metric based on low views/upvotes/comments/answers over long time) and a random selection from stackroboflow.

Present human judges with both sets of questions and ask them to tell the difference.

Having read numerous SO questions from newbie developers whose grasp of English was tenuous at best, I doubt I could tell the difference.

The next step up: the same test, but with mathematics or scientific papers judged by non-experts in the field.

We may actually be there already - I'm not sure.

All of which makes me wonder when we'll reach the point where the bar has been raised so high that the comparison will need to be against the best SO questions and scientific/mathematics papers judged by subject matter experts.

This is the most prototypical stackoverflow question I have ever seen:


"What the heck what does $ (' # ') do?" https://stackroboflow.com/#!/question/7492

But is it web scale?

"View from View to View, Need to open this new View in View"


A common problem for everyone, i'm sure.

"I come from a C # background, this is my first Silverlight project, but I'm new to Windows."

Now that is comedy!

I got a surprisingly comprehensible (and similarly recursive) ListView question: https://stackroboflow.com/#!/question/15327.

Similarly this one: https://stackroboflow.com/#!/question/19297

Sounds like jumpstarting your userbase would be easy, once you allow users to define other users :)

https://stackroboflow.com/#!/question/22733 I hate when fellow coders do that.

Another pearl: Creating a PDF from PDF. The situation is as follows: We have a video file hosted by Google Map.

It's like reading a doco-satire about my life.

I can claim to have experience [0] with generating funny nonsense based on Stackoverflow data (what a weird thing to say :))

Seems like you beat me to my plan to make a Neural Network based variant and I really like the results (especially that they stay on topic instead of totally drifting off into fun nonsense like my Markov Chains).

Have you tried also using other Stackexchange sites as a source? In my experience they result in more fun questions as they have more "human" interactions (especially the more personal advice based sites) which creates things like:

- Do Greeks driving affect the whaling industry?
- Essential windsurfing equipment to fish?
- Do mountaineers eat grass?
- Can I toast

[0] https://news.ycombinator.com/item?id=16947038

I haven't yet! It's on my list of things I'd like to try.

I reviewed 1600 edits at StackOverflow. And I can say that some of the automatically generated questions are more intelligible than the average SO question. For example, this one looks fine to me: https://stackroboflow.com/#!/question/11235

It's so close to being intelligible, but I still can't quite parse it, like so many actual SO posts.

Likewise, I'm not sure I'd think anything was strange if I came across https://stackroboflow.com/#!/question/12110.

Fascinating. I wonder if our current discussion boards on the interwebs can survive the coming influx of content like this and the next generations of it that follow.

There are a lot of SO questions posted by very weak non-native speakers of English and some of these are hard to distinguish from those. Kind of scary!

What possible positive outcomes do you see for this kind of (admittedly inevitable) capability?

AI will render political discussion between honest human strangers impossible on the open internet.

At some point this technology will extend into what's left of print, then talk radio, then TV. An endless supply of Markov punditry.

"Markov punditry" Nice coinage there.

:) I doubt it's original; it reminds me of the character Markov Chaney from Robert Anton Wilson's books.

I am actually a bit worried that I’m already starting to see search engine traffic coming in...

I hope that the good will outweigh the bad. I’d love to create an answer generator, for example.

Once enough questions are generated I’m going to try creating a classifier to see if a neural net can differentiate between real questions and fake ones.

> I am actually a bit worried that I’m already starting to see search engine traffic coming in...

Brilliant, now you only need to come up with a way to use this for good and keep (at least slightly) ahead of the cost in the long run.

You could put up a robots.txt denying all search engines.
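For anyone curious what "denying all search engines" looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The two-line robots.txt is the usual deny-all form; the crawler name `ExampleBot` and the URL being checked are only illustrative assumptions, and well-behaved crawlers are the only ones that honor the file.

```python
from urllib.robotparser import RobotFileParser

# A deny-all robots.txt: every user agent is asked not to crawl any path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name, used only to exercise the rules.
blocked = not parser.can_fetch(
    "ExampleBot", "https://stackroboflow.com/browse/index.html"
)
print(blocked)
```

This only checks how a compliant crawler would interpret the rules; it does nothing against crawlers that ignore robots.txt.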

But the issue is not just what he could do, but what malicious content generation systems could do.

The issue is, as stated:

> I am actually a bit worried that I’m already starting to see search engine traffic coming in...

We can discuss hypothetical systems that could maliciously flood us with generated content. The creator of this particular service which is being discussed here and now could also begin taking steps to ensure that his creation does not inadvertently create a problem for some hapless Google user.

Well, no. You have to read further up the thread to see the issue I was referring to.

>I wonder if our current discussion boards on the interwebs can survive the coming influx of content like this and the next generations of it that follow.

Yes the robots.txt is a good and trivial step he could take to ensure well behaved robots do not pick up his content. So your comment suggesting robots.txt is a good comment in its narrow frame, but one that missed the larger picture. That minor problem is solved. The interesting problem is of a different nature.

I was thinking the same, as my immediate reaction was to attempt to understand the question and formulate a solution

I was clicking through a couple funny ones but as soon as I got one[1] that fell into this uncanny valley I immediately forgot this was generated and tried to understand it, getting super confused.

[1] https://stackroboflow.com/#!/question/13535

I was cycling through some answers, when suddenly the following, completely unrelated text shows up in a random code block:

I can feel the admin is different

You sure you didn't just accidentally create a self-aware AI? Forgot to permalink sadly

There is immense value in training these to synthesize test data sets for sensitive information you can't safely put in a preprod environment.

Health information would be the main case I can think of now.

Having synthesized data for testing new services in govt would be a huge improvement.

De-identification is basically impossible and there are a bunch of companies who will lie to you if you pay them to, but synthesized data covers many use cases for de-identification and for homomorphic encryption.

Reminds me of https://git-man-page-generator.lokaltog.net/, which I always found hilarious :)

> This is NOT real git documentation!


Awesome. Can you create a neural net that arbitrarily closes questions as off topic or non constructive? ;)

No need for a neural network, you can just use Math.random (or your respective language's RNG).

Sidenote, would this technically count as a neural network with only one weight (which is randomly initialized)?

It’s something I’m interested in!

Unfortunately I’ve come to the conclusion that upvotes on Stackoverflow aren’t correlated with question content (or I’m not skilled enough to be able to differentiate between “good” and “bad” questions). Check out the linked write up for more detailed info.

> arbitrarily

I think the original comment is being sarcastic and suggesting that the actual humans that close discussions and mark them as "off topic" don't understand the question and perform these actions at random. This is a sentiment shared by many who don't "live" in those types of forums.

Ah, missed the operative word there. I could definitely do that! ;D

Could you not also train a classifier that correlates question content with mod decision? Questions like "what is the best X? " that are obviously subjective , for example.

Maybe even some kind of crazy generative model that learns to post questions that aren't closed by the AI moderator!

I think we've all been here before:

"i've been asked to use Json to call a webservice. I don't modify a JSON object at all. However, when calling JSON returned by the Json object, it fails because the object life isn't array!"


Excellent! It's great when it tries to generate code: https://stackroboflow.com/#!/question/8138 (the last line here made me laugh)

This one looks like a genuine java program.


``` Cat cat = new Cat(); Cat cat = new 2nd Cat ");" ```

I heard you like paths, so I put some paths in your path:


Very fun. But how am I supposed to help Charset solve their urgent problem? I'll just answer here.

Q: How to use a JSON string in a funky way https://stackroboflow.com/#!/question/49913

A: Dear Charset, I hope this might resolve your issue.

window.location = JSON.parse('[{"use": "https://www.youtube.com/embed/0ROzGihgCj8?rel=0&amp;autoplay...

https://stackroboflow.com/#!/question/12875 "It works fine in GCC but it does not work in GCC / GCC."

Absolute gold: "Is there a animal out there that someone can apply to do the sort of thing I'm looking for?"

Not sure what happened to the title that time.


(perhaps op is a vim user)

Try a Python or a Pony, or maybe even an OCaml.

Probably OCaml: that looks like an ML-style function signature.

Oh wow, this is amazing. My favorite so far is: https://stackroboflow.com/#!/question/22890

>How can I do this software?

Although it sounds more like a question from Quora.

having spent some time in the triage & edit queues, this 100% sounds like stackoverflow.

This is great... “I'm getting errors with Line 1, Line 39, Column million” lol

When trying to pack your code into a one-liner goes too far...

Nice! This is also what every question looked like when I was new to programming.

I'm voting to close as duplicate.

Right, after adding tags and answers, comments need to be added as well...

[Sorry, this question has been protected by a moderator.]

(^.^ this “so-SO” comment made me smile.)

Oh, great, now I know what my clueless questions look like to a knowledgeable person! Example:

>"I need to create an image from a imported wav file (for a user - friendly format find enough header for the cookie). I looked for a solution, but that didn't work either."

> ... Thinking I have to use the first two but it's not possible to use Jquery.

> So: Is it recommended to use a Perl function

This is just like the real thing.

I presume this is due to tokenization or something, but there's a lot of extra whitespace in the code samples that make them look very unrealistic:

  def _ _ init__(self, default): 
  " " " 
  See if the default value for the field on a view is 
  " " " 

  < select > 
  < option > value < / option > 
  < option > value < / option > 
  < / select >

And indentation is also missing completely. Maybe you need to use another NN to guess which language the fake code is in and autoformat it accordingly!

It is, the tokenizer isn't reversible (and it adds spaces all over the place).

But a lot of these I should be able to add to my regex that converts the output back into more human readable format (in the raw output, there's a space before every punctuation mark so I already remove those extraneous spaces from periods, commas, etc).

I just haven't gotten around to adding in any heuristics specifically for code but adding a bit more post-processing is on my to-do list.
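A rough sketch of the kind of regex cleanup described above, assuming the raw output really does have a space before each punctuation mark. The function name `tidy_punctuation` and the sample sentence are made up for illustration; this is not the site's actual post-processing code.

```python
import re

def tidy_punctuation(text: str) -> str:
    """Remove the whitespace a tokenizer inserted before closing punctuation."""
    # " ," -> ","   " ." -> "."   " ?" -> "?"   and so on.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Collapse any run of spaces left behind into a single space.
    return re.sub(r" {2,}", " ", text).strip()

raw = "Thank you in advance , I looked for a solution , but that did not work ."
print(tidy_punctuation(raw))
```

A rule like this is only safe for prose: applied blindly to code it would also rewrite things like `a ? b : c`, which is presumably why code needs its own heuristics.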

I updated my regexes to clean up some of the tokenizer noise last night. So much of the formatting in the code snippets should look a bit more natural now.

Congratulations, you have simulated a million monkeys at typewriters with a million monkeys at typewriters. Has anyone really been far even as decided to use even go want to do look more like?

This one boggles my mind, it even has code:


Final question didn’t end in a question mark - perfect!

“I want to do something like this

$ _ -1 = object();”

We all do....

Now I see that the virtual question is different every time. Great work. It reads better than most SO questions.

This one is clearly written by a broken agent: https://stackroboflow.com/#!/question/1450

Reminds me of those online chatbots I used to torture back 10 or 15 years ago. One I started asking about personal information about its creator. It was remarkably evasive, constantly attempting to switch the subject.

Another gem: https://stackroboflow.com/#!/question/17035

"I have got a big "someone" who will be going to be using the asp.net site.

I have a black box and a background in firefox, where they have a width of 100%.

They will never know of a color.

They come from a background color."

Film starring Liam Neeson?

Every good invention can be terrifying if it falls in the hands of bad guys (Nuclear technology for example). It's true for AI also. I am sure bad guys must be training similar AI agents by only feeding fake news, conspiracy theories etc. and it's easy to build AI agents as there is so much Open Source material online about AI.

I'm trying to imagine a productive use case for this? Maybe in reverse for attempting to answer questions?

Think things like election meddling. Propagating truly fake news to cater to the emotions of what people simply want to be true. Humans are weak against Confirmation Bias, ten minutes on Facebook will show you for sure.

Yes. That was the rationale OpenAI made just a few weeks ago to not release their new language models:


use case is to spread misconceptions in the society (that's what bad guys want right?) in an automated way. especially during elections.

I think it would still be considered a "Does Not Exist"-valid website if the generated questions would have some auto-formatter for the code. Main issue I see is extra spaces everywhere, often in a syntax breaking way (and missing spaces for formatting) (not that all SO questions have those).

Yeah this is a shortcoming of the tokenizer. It splits things up in ways that are not 1:1 mappable back to their source unfortunately.

I did a bit of post-processing to get it formatted a bit better (re-combining the “would“ and “n’t” tokens and changing html tags to markdown for example) but there’s still room for improvement.

Spacing specifically is different based on the context. Outside of code blocks you want a space after a period. Inside you probably don’t. But since the tokenizer has one in both places there’s no opportunity for the neural net to learn this (it can’t see any difference). And my naive formatter doesn’t know the difference either. (If you’re curious you can find it in the JS file)
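One way a formatter could tell the two contexts apart, sketched in Python rather than the site's JS. It assumes code is still wrapped in literal `<code>...</code>` tags at this stage, which may not match the real pipeline; all function names here are hypothetical.

```python
import re

# Capturing group so that re.split keeps the code spans in the result.
CODE_SPAN = re.compile(r"(<code>.*?</code>)", re.DOTALL)

def format_prose(text: str) -> str:
    # Re-join split contractions: "would n't" -> "wouldn't".
    text = re.sub(r"(\w) (n['’]t)\b", r"\1\2", text)
    # Drop the space before punctuation but keep the one after it.
    return re.sub(r" ([.,!?])", r"\1", text)

def format_code(text: str) -> str:
    # Inside code, drop spaces on both sides of a period: "foo . bar" -> "foo.bar".
    return re.sub(r" ?\. ?", ".", text)

def format_post(text: str) -> str:
    parts = CODE_SPAN.split(text)
    # With one capturing group, odd indices are the <code> spans.
    return "".join(
        format_code(p) if i % 2 else format_prose(p) for i, p in enumerate(parts)
    )

print(format_post("It would n't run . <code>foo . bar()</code> Help ."))
```

The code rule is deliberately crude: it would also squash a period inside a string literal, so a real version would likely need per-language handling.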

I updated my regexes to clean up some of the tokenizer noise last night. So much of the formatting in the code snippets should look a bit more natural now.

Huh, in this question, there are a lot of words that get repeated five consecutive times: https://stackroboflow.com/#!/question/13733

Is there a reason why? (I don't know anything about AI.)

The way the language model is trained is by rewarding it for correctly predicting the next word in a sequence.

The output of the model is a predicted probability distribution of the next word and a “state” — the next iteration takes the state output of the previous iteration and generates another word and state (and this process repeats many times).

Since there’s a probabilistic dimension, what may have happened in this case is that it happened to repeat once by chance and the model had learned that if something repeats 2x it’s likely that it will repeat a third, fourth, and fifth time.

Basically it’s just trying to game the loss function which rewarded it for predicting the next word in the sequence correctly.
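To make the loop concrete, here is a toy sketch of that generate-and-feed-back process. `toy_model` is a hand-written stand-in, not the AWD_LSTM: its "state" is just the last word and how many times it has repeated, and the rule that repeats become more likely the longer a run gets is an assumption chosen to mimic the behaviour described above.

```python
import random

VOCAB = ["the", "thread", "is", "waiting"]

def toy_model(word, state):
    """Hypothetical stand-in for a trained model: returns (probs, new_state)."""
    last_word, run = state
    run = run + 1 if word == last_word else 1
    new_state = (word, run)
    # The longer the current run, the more likely another repeat (capped at 0.9).
    p_repeat = min(0.9, 0.1 + 0.4 * (run - 1))
    others = [w for w in VOCAB if w != word]
    probs = {w: (1 - p_repeat) / len(others) for w in others}
    probs[word] = p_repeat
    return probs, new_state

def generate(first_word, n_words, seed=0):
    rng = random.Random(seed)
    word, state, out = first_word, (None, 0), [first_word]
    for _ in range(n_words - 1):
        # Each step consumes the previous word and state, then samples the next word.
        probs, state = toy_model(word, state)
        words = list(probs)
        word = rng.choices(words, weights=[probs[w] for w in words])[0]
        out.append(word)
    return out

print(" ".join(generate("the", 20)))
```

Because the next word is sampled rather than chosen greedily, one chance repeat is enough to push the toy model into a state where further repeats dominate.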

Thanks for the explanation. Your description superficially reminds me of a Markov chain (https://en.wikipedia.org/wiki/Markov_chain). Is this related or is it totally different?

I haven't read the paper the work is based on, but if the RNN outputs a probability distribution for the next letter/word then they form Markov Chains (since then they only depend on the current state and not the previous state)!

RNNs are just fancy parametric functions that take a (state, input)-pair and return a new (state', output)-pair.
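A minimal sketch of that (state, input) -> (state', output) shape, using a single scalar unit so it runs with only the standard library. The weights and names are arbitrary assumptions for illustration; a real RNN uses weight matrices and learned values.

```python
import math

def rnn_step(state, x, w_state, w_input, w_out):
    """One step of a scalar Elman-style RNN: (state, input) -> (new_state, output)."""
    new_state = math.tanh(w_state * state + w_input * x)
    output = w_out * new_state
    return new_state, output

def run(inputs, state=0.0, w_state=0.5, w_input=1.0, w_out=2.0):
    outputs = []
    for x in inputs:
        # The only thing carried between steps is `state`.
        state, y = rnn_step(state, x, w_state, w_input, w_out)
        outputs.append(y)
    return state, outputs

final_state, outputs = run([1.0, 0.0, -1.0])
print(final_state, outputs)
```

Since `rnn_step` sees nothing but the current state and input, everything the network "remembers" about earlier words has to be packed into that state, which is the sense in which the Markov comparison applies.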

From what I understand, repeated output is a common failure state with neural net stuff in particular, though I don't know why.

Because it needs to determine if a string is a '????? '''''?????, of course.

This desperately needs some AI-generated expert answers!

This is similar to

thispersondoesnotexist.com thisresumedoesnotexist.com

Yes, I was heavily inspired by them :) Glad someone made the connection!

I actually hadn't seen thisresumedoesnotexist.com yet; but I loved https://thiscatdoesnotexist.com and https://thisrentaldoesnotexist.com


Oh my god, some of these look terrifying, pure nightmare fuel.

You might enjoy https://www.thiswaifudoesnotexist.net

Like the Airbnb, it's StyleGAN+GPT-2 (finetuned in this case on anime plot synopses+summaries: https://www.gwern.net/TWDNE#gpt-2-anime-plot-synopses-for-gp... ).

I'm currently training an improved 'portrait' anime StyleGAN to fix up some of the faces' issues.

Just added a browsable archive of all of the questions it has generated thus far: https://stackroboflow.com/browse/index.html

And now, tooltip previews on that page for browsing convenience.

This looks absolutely amazing! I would be very curious to know how you went about conceptualizing the project and the AI beneath. Do you have a blogpost on it or planning to write one?

Yep, there's a writeup on the site: https://stackroboflow.com/about/index.html

This is very refreshing!

"I'm starting a new website using VB. I make a migration file and save it to a local Azure database"

This comment is worth about as much as this website.

The software that wrote this comment does not exist.

It has better grammar than the real one anyway.

I really would like a bot like this to produce ideas of things to create with programming in general.

Any ideas on the possible dataset?

Can we please get a Jon Skeet neural network to provide answers?

Giggles Love it.

Funny, yet terrifying at the same time. How can I be sure that HN isn't just a really well trained Neural Net?

How can I be sure that i'm not just a really well trained Neural Net?

You are a very well trained neural net... The concept is based off of actual Neurons in our brain. Can't tell if you're serious or trolling though lol.

Isn't a brain just that?

A friend of mine actually suggested trying to generate a hacker-news-comment language model next.. sounds like a fun project.

I’ll have to look and see if there’s an archive of them available.

An archive of HN comments? This is it.

Wait, you're saying it's not?

How can I be sure that I’m not in a coma? Nothing external can be believed.


This is lame.

I protest getting -4 points. They used github and stackoverflow. Wrote a function to connect the two based on tags and then randomly generate a question off of that. It's lame. Do something useful or cool.
