Hacker News
To show how easy it is for plagiarized news sites to get ad revenue, I made one (www.cnbc.com)
127 points by airstrike 13 days ago | 37 comments

Don't mean to be snarky, but this is not how easy it is to get ad revenue. It's how easy it is to get approved for ad networks. She didn't even get AdSense. I find it completely unremarkable that anyone could set up a non-adult site that has human-generated content and get ads placed. The traffic needed for real ad revenue is a different story. I bet that site gets close to zero traffic (not even enough to cover hosting). The trick, IMO, is SEO (black-hat or whatever), not getting ad revenue. Plagiarized new domains get no weight in the Google engine.

Honestly, it'd be pretty interesting to see an article like this which continues after the 'approved for ad networks' part and shows how such a site could rank in Google, do well on social media sites, etc.

Could be interesting to see how scammers are doing that, and lead to some potentially interesting insights about black hat SEO, social media marketing, targeted ads, etc.

Because yeah, as you said, getting approved by an ad network is only part of the story, and not very much of it at that.

It may be a low bar, but being able to automate this (scraping website Ansible playbook?) makes the effort required near nil. They only have to clear $50 to pay for a lot of domains and hosting.

Sure but you need thousands of legitimate-seeming pageviews to get that $50 back, and the networks - even or especially the bottom tier ones - are likely to be hotter on click fraud than scraped content.

You would have to get ~16,000 pageviews to make $50 (assuming a $3.00 CPM, which would be low for AdSense but not for these second-tier networks).

And 16,000 page views without unique content and links might happen, but most likely not within a year or five.

If you're lucky, you trigger something in Google's black box and they rank your site better than others at the same level, but you'll still only do long tail, and even on long tail, you'll compete with the original source of the article, which has a billion links pointing to its domain. Since you'll also need to go for quantity, you'll have a giant number of pages as well, which will not help you even with niche rankings.

I doubt that the site would pull in 10 actual, human visitors per day on average with just scraped content.
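
For anyone who wants to check the CPM arithmetic in this subthread, here is a minimal sketch. The function names are made up for illustration, and it assumes a flat CPM with one pageview per visitor, which real ad networks will not guarantee.

```python
import math


def pageviews_needed(target_revenue: float, cpm: float) -> int:
    """Pageviews required to earn target_revenue at a given CPM.

    CPM is revenue per 1,000 pageviews, so revenue = pageviews / 1000 * cpm.
    """
    if cpm <= 0:
        raise ValueError("cpm must be positive")
    return math.ceil(target_revenue / cpm * 1000)


def days_to_target(pageviews: int, daily_pageviews: int) -> int:
    """Days needed to accumulate `pageviews` at a steady daily rate (assumed constant)."""
    if daily_pageviews <= 0:
        raise ValueError("daily_pageviews must be positive")
    return math.ceil(pageviews / daily_pageviews)


needed = pageviews_needed(50, 3.00)   # $50 target at a $3.00 CPM
days = days_to_target(needed, 10)     # 10 pageviews/day, as guessed above
print(needed, days, round(days / 365, 1))
```

At a $3.00 CPM the exact figure is 16,667 pageviews, which matches the rough ~16,000 quoted above. At 10 pageviews a day that works out to about 1,667 days, or roughly four and a half years.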


It's really not easy to get the ad revenue. Scraped websites can't rank well or get enough visitors. Of course, one can build a click-fraud system, but anyone who can operate at that level can easily find more interesting things to do than chase peanuts.

If the pirate bay can run on ad money, then so can plagiarized news sites, I suppose.

It’s a dirty business, no cost is too great for eyeballs.

In Turkish, whatever you search, the first few pages of results are from the largest Turkish news outlets because they SEO’ed for everything and Google doesn’t care.

Do you want to learn how to renew your driver license? Good luck with that because your search results will bring you a wall of text articles that are almost the same for every search term.

“Lately people started to ask themselves how to renew their driver's license. But do they consider the risks of renewing drivers licenses? Experts agree that renewing the driver's license can be a complicated thing. Now strap on and get ready to learn how to renew your driver's license”

Imagine having pages like that on CNN, BBC and others. They are the top result for so many searches.

Plagiarism of news, on the other hand, is more nuanced IMHO. There’s nothing stopping you from saying “NBC reports that” anyway. As per the article, you cannot use their assets but you can create or even generate articles about the news based on the news.

The ad business is dirty. I’m almost proud of blocking ads.

In the U.K. it’s supermarket opening times, especially near holidays.

Google for “Aldi opening times Easter Sunday” and you’ll get articles from the lower quality newspaper websites.

It’s pathetic.

Oh, definitely. Especially in these pandemic days, that was something that I tried and failed at. On the Turkish web, apparently the news outlets gave up any hope of respect and now every single one of them is doing it. The biggest ones, the leftie ones, the right-wing ones, the ones cozy with the government. All of them.

No way Google isn't aware of this, there's a local Google office in Turkey, they have a large presence and full Turkish language support on most of the products.

Maybe it's simply part of the business model now. If a supermarket wants people to find their opening times, maybe they should buy an ad placement. There's no money in the high-quality organic search results I guess.

I'm watching this exact problem for SEO in Turkey and I can say that it was in full force way before the pandemic. Google doesn't care. Instead, they're busy flagging pages discussing "penisilin (penicillin in Turkish) application for kids" as an AdSense Policy Violation since the page contains "penis".
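
To illustrate the kind of false positive described here: a minimal sketch of naive substring matching versus whole-word matching. This is not how AdSense policy checks are known to work; the blocklist and both function names are hypothetical.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_TERMS = ["penis"]


def naive_flag(text: str) -> bool:
    """Flag text if any blocked term appears anywhere, even inside a longer word."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def whole_word_flag(text: str) -> bool:
    """Flag text only if a blocked term appears as a standalone word."""
    lowered = text.lower()
    return any(
        re.search(rf"\b{re.escape(term)}\b", lowered) is not None
        for term in BLOCKED_TERMS
    )


page_title = "penisilin application for kids"
print(naive_flag(page_title))       # True: substring match inside "penisilin"
print(whole_word_flag(page_title))  # False: "penisilin" is a different word
```

Whole-word matching avoids this particular case, but it is only a sketch: agglutinative languages like Turkish attach suffixes to words, so a real filter would need language-aware tokenization.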

I suspect one of the hard parts of this for Google is that many news sites legitimately publish the same articles because of wire services and correspondence arrangements, like AP and Reuters. Hard to tell whether the new site is plagiarizing or syndicating.

Don't they always include the source as AP or Reuters in the body of text somewhere?

You can copy that text too...

Doesn't say how much she made from it. Guess: very close to zero, if not zero.

She did say: “I didn’t want to be taking ad revenue from legitimate advertisers, so I only briefly activated advertisements from the partners to see what surfaced and to take a few screenshots.”

Including hosting? I'm sure it's actually in the red.

If you include the hit to your professional reputation from actually plagiarizing a news site for revenue and getting blacklisted from the industry, then what did it cost? Everything.

Or you get placed as a CTO with a fat raise.

My dream job.

I'm not. I knew a developer who, in his spare time, developed a clever scraper. He scraped the top stories and results from Google, then scraped similar content based on Google's own ranking, then submitted that content to his own aggregator sites (all resolving to the same server). He ran ads on it. He got plenty of traffic and was net <$500/month in 2006.

It's not that expensive to run a site and the right advertising partners (cough Taboola cough) pay nicely.

I remember in 2004 or so when an agency I worked for had a Wikipedia clone running with AdSense and tons of SEO, which made 20k per month and basically kept the company afloat. I was a young junior dev and while I was impressed by it, it never felt right to me (which it obviously wasn't in many ways). As far as I remember, this only worked for about a year at best until Google penalised those sites more and more.

Unethical Continuation of This Idea:

Scrape existing news sites, and use machine learning to paraphrase everything so Google doesn't detect plagiarism.

The US military has already been working on this for ~ a decade. There was a contract out of Redstone Arsenal where they were writing "story spinners" to scrape and re-word war-time propaganda.

I wonder if Mechanical Turk is still cheap enough to have humans paraphrase and insert SEO keywords.

In this new dystopia, generating content for the machines to read could be a decent job for a human.

How easy is it to get distribution over social media sites like Facebook or Twitter? The distribution costs would be close to zero there, right?

>"It all underscores the fact that the ad tech space is so convoluted, it’s easy to make money from legitimate advertisers just by setting up a web page.

That means there’s significant incentive to create sites with not just with low-quality clickbait or A.I.-generated nonsense, but sites filled with outright plagiarized content."

It's really easy to get any content onto the internet but really hard to verify whether it is plagiarized. Basically anyone can place some ads on their website, but if the site posts nothing but copied content, I doubt it will last.

> These firms mostly sold “popunder” ads, which pop up a new link in a browser tab when you click something

Who is buying ads on these networks? There cannot possibly be any returns, can there?

It might be a "victim filter": some scams are designed to avoid wasting time on people smart enough not to fall for the scam in the following steps.

Often it's just affiliate fraud: you load casino, AliExpress and whatnot affiliate links and hope for the payout. That is why they redirect like crazy, to hide their tracks, since the sites offering affiliate services don't want it.

We're working on another way for disseminating news. It might make plagiarism a little more difficult, while also working a little better for our audience: https://blog.nillium.com/what-can-napster-teach-local-news/

How does that help prevent plagiarism?

Because it isn't full articles -- just updates as they happen coming straight from the newsroom, more like tweets. It's not to say that people can't plagiarize, but it wouldn't be as easy or make as much sense as just copy and pasting an article.

There's not much in the post so I'm gonna guess it's a form of content fingerprinting like we see with YouTube's Content ID, plus whatever is used in plagiarism-detection software used in schools and universities.
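
For what a generic text fingerprint could look like, here is a minimal sketch using word shingles and Jaccard similarity. This is an assumption for illustration, not what the linked post, Content ID, or any particular plagiarism checker actually does; the function names and the shingle size `k` are made up.

```python
import re


def shingles(text: str, k: int = 5) -> set:
    """Return the set of overlapping k-word sequences (shingles) in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return set()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(original: str, candidate: str, k: int = 5) -> float:
    """Score from 0.0 (no shared shingles) to 1.0 (identical shingle sets)."""
    return jaccard(shingles(original, k), shingles(candidate, k))


article = "The council voted on Tuesday to approve the new transit budget."
print(similarity(article, article, k=3))                           # 1.0
print(similarity(article, "Completely unrelated words here", k=3))  # 0.0
```

A copied article scores near 1.0 against its source, while paraphrased or machine-translated text scores much lower. That is why the respinning tricks mentioned elsewhere in this thread are a problem for this kind of check.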

OK first try. But needs more work.

Not much proven so far.

Many sites seem to translate to language X and back to English to clean the data.

Research this.

Anyone using GANs yet?

How do you stop sites blocking your scraper?

There's money for the ad companies to allow you to plod along then steal your hard earned money because you are breaking the rules. Are they?
