We Used Broadband Data We Shouldn’t Have (fivethirtyeight.com)
284 points by Amorymeltzer 4 months ago | 94 comments





Nice to see they’re being so transparent about setting the record straight. Mistakes like this happen in the journalism business, but few outlets make a serious effort to correct misreporting beyond burying corrections where nobody sees them.

Overall they do such a good job of that. They seem to really value self-improvement and reflection. It's part of what keeps me a regular reader. One editor even does a yearly public review of the work he did that year, what he did wrong, and how he plans on improving it.

https://fivethirtyeight.com/tag/mea-culpa/


Harry Enten is probably my favorite personality at FiveThirtyEight, mainly because of his humility, despite the fact that, from what I can tell, he's very good at his job.

Yeah, most news outlets will edit the original article with a small correction. Props to them for a full postmortem article.

I am usually not one who favors government regulation of journalistic outlets, however I do have a proposal.

Media outlets should be required to publish a correction of a story at least as visible and for at least as long as the initial incorrect version.

This would prevent a site from first claiming something and then burying the correction somewhere where nobody finds it.


In Germany, people covered by factual statements in the news have a right (under certain conditions) to demand a "Gegendarstellung" presenting their reply (consisting only of factual statements, not opinions), and that reply has to be in an equally prominent location/size/etc in the same media outlet.

https://en.wikipedia.org/wiki/Right_of_reply

https://de.wikipedia.org/wiki/Gegendarstellung


Of course factual statements need not be true. There was a famous case in Italy where a politician denied having been incarcerated, while sending his reply from a state prison.

Does a "Gegendarstellung" not require a court order?

I would very much love it if media outlets agreed on self-imposed "best practices for corrections".

That should ultimately benefit those who comply, because they can advertise it and be more credible.


The right exists without a court order (and I'd wager it is more often granted than not by established outfits, with notable exceptions like tabloids - like Germany's largest newspaper Bild - and the yellow press). However, some will have to resort to the courts if the publishing entity does not comply. That takes some effort, since it is a civil matter - e.g. paying court fees up front (and claiming them back in the same suit, which leaves the applicant bearing the risk if the opposing party has no money).

I think a specific subdomain or directory for all corrections would also be helpful. Visibility should be long term without an expiration or redirect.

A separate list of corrections would definitely be a good counterbalance, but it would probably only matter to those who are already trying to verify the correctness of the story in the first place and not to those who took it at face value.

This would require the government to determine "the truth", which is probably a really bad idea.

I'm curious, how would you punish those who did not comply?


Would it? Or would it leave that up to the publisher, but say: IF AND WHEN you issue a retraction/correction, here is how you must do it? I don't know that I agree with the proposal per se, but I think this specific criticism is misguided/wrong.

Say the fed implemented this as you described: Why would any news company bother issuing corrections then?

Publishers would just silently edit the article, forcing the federal government to monitor all news outlets' content for changes. In the face of this daunting task, some naive senator would suggest that all news first be submitted to the fed for review.

The only solution to this problem is the nationalization of the news media, which would be a very bad thing. Sometimes the only way to win is not to play.


Yea, 538 is one of the only outlets I can think of that quietly lives up to the standards that most media entities seem to think they hold themselves to.

They're like Vox, minus the self-back-patting about intellectual honesty and plus the _actual_ intellectual honesty.


Yeah, this is wonderful. Some of the original data just didn’t make sense. For example, Loudoun County, which is home to the Dulles technology corridor and has the highest median income of any county in the US, with a 3.5% poverty rate, has only 50% home broadband uptake?

I live a few miles outside of Loudoun County, still in DC metro, and I have NO terrestrial internet access beyond dialup. It is quite possible that large sections of Loudoun don't have real broadband, just DSL. The county doesn't even have 100% cell phone coverage. There's Dulles/Leesburg, and everywhere else.

Where is a few miles outside of Loudoun? Clarke County? That’s closer to WV than DC. And incredibly sparsely populated. Clarke is 85 people per square mile and Fauquier is 105. Fairfax County is 2,800.

Parts of Loudoun are very sparsely populated; indeed, it has the most unpaved roads of any county in VA. But since 1980, the population has grown from 60,000 to 385,000. Almost all of that growth has been in Dulles/Leesburg/Ashburn, etc. So in terms of the percentage of households who have broadband access, the other parts of the county don’t have much of an impact on the overall number.


Most people don't realize how dismal even cellular coverage is in many parts of the country. They worry about broadband, yet we can't even guarantee you have coverage with any mobile device. Worse, people fight against it too.

Why is DSL not "real broadband"?

Because the current US govt definition of "broadband" is a minimum of 25mbps down, and most DSL can't achieve those speeds.

According to footnote 11, the FCC data set uses 10/1 as the definition of "broadband".

I have two DSL lines, one 1 Mbps, the other 2 Mbps thanks to a patented range extender that is no longer manufactured because the patent holder got acquired.

I agree. This error has a pretty happy ending, but I think more political and ideological diversity in academia and journalism would help to spot these errors sooner and improve the overall quality of science and science journalism. In particular, in this case, a progressive research firm was trusted by academics who were trusted by journalists. Mistakes will happen regardless, but I wonder if a more skeptical eye somewhere in the pipeline would have caught this error sooner. Fortunately, there are bipartisan organizations like [Heterodox Academy][0] which are working to correct this.

[0]: https://heterodoxacademy.org


The major ISPs have a ridiculous amount of resources and are free to publish datasets answering this question at any moment. In fact, I'd be astounded if they don't already have this data, both for themselves and for their competitors. So until ISPs share that data, I'm terminally unsympathetic to any sort of bias in the public record on this particular topic.

Also, the selection of left-leaning people doing publicly accessible research is, AFAICT, a self-selection. The problem isn't that our institutions are biased, but that conservatives (in general) choose not to pursue scientific careers in the same numbers as liberals. This suggests a fundamental limit to your proposed solution because there aren't enough "heterodox" collaborators to ensure viewpoint diversity. One of the studies published on the website you mentioned even suggests this underlying cause.

Also, I'll note that in the vast majority of academia, internal politics (what research methodologies, maths, etc. someone likes to use) are far more of a problem than external politics (of the sort you'd typically call "politics" in general conversation). E.g. I'm not at all worried about "liberal bias" in physics or chemistry or CS; if it's a problem at all, then it's a minor one compared to other problems.


Very frequently, people call up a broadband provider, get told there is broadband at their address, then get told later that, in fact, there is not. Thus it's not clear that providers have a real handle on this at an operational level, never mind a synoptic level.

As for left v right, it's true that adjunct faculty who teach English are rabid leftists because they are working four jobs that are 60 miles apart, have no health insurance, qualify for food stamps, etc.

One reason you see so few rightists in conventional academia is that they can go work for places like the Manhattan Institute, Hoover Institution, etc. and get paid much more, not have to teach, and not deal with the rat race. There is a continuous drumbeat of industry-funded reports claiming there is no broadband problem in the U.S. -- these don't get much attention outside the trade press because normal people know that students camp outside the school some nights because they don't have broadband at home.


> ...people call up a broadband provider, get told there is broadband at their address, then get told later that, in fact, there is not. Thus it's not clear that providers have a real handle on this at an operational level, never mind a synoptic level.

The provider (or the provider's CSR) is being lazy. They absolutely do have the data.

In AT&T's case, they have a Java software interface ("CPSOS") into which you can plug in an address and get back the distance in thousands of feet from the address to the nearest RT or CO (major network interconnection). The distance between those points gives you an estimate for the maximum DSL speed that can likely be supported; you can guarantee up to 6 Mbps down at up to 5k feet if I remember right, and it tapers off from there to 10k feet, at which point you start saying, "we might be able to get you 512kbps but we're not sure, we'd have to try installing it and see what happens."
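
A minimal sketch of what that lookup amounts to, assuming the breakpoints I half-remember above (roughly 6 Mbps guaranteeable out to ~5,000 ft, tapering toward ~512 kbps at ~10,000 ft); real qualification tools use carrier-specific tables and actual line measurements, so treat the numbers as illustrative:

    from typing import Optional

    def estimate_max_dsl_mbps(loop_length_ft: float) -> Optional[float]:
        """Guess a guaranteeable downstream rate from loop length in feet."""
        if loop_length_ft <= 5_000:
            return 6.0
        if loop_length_ft <= 10_000:
            # Linear taper from 6 Mbps at 5k ft down to 0.512 Mbps at 10k ft.
            fraction = (10_000 - loop_length_ft) / 5_000
            return round(0.512 + (6.0 - 0.512) * fraction, 2)
        return None  # "we'd have to try installing it and see what happens"

    for feet in (3_000, 7_500, 12_000):
        print(feet, "ft ->", estimate_max_dsl_mbps(feet), "Mbps")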

The sense you get, both as a consumer and as a service provider that gets a little more of an inside view into these companies, is that they absolutely do have the data and the capability, but that isn't where their priorities are. They just want to make a buck with as little capital investment as possible, and everything that they do reflects that.


Whether or not the ISPs have this information is a fundamentally separate issue from whether ideological homogeneity biases science. There is no dichotomy here.

The degree to which self-selection plays a role is an interesting question. Firstly, there is a certain amount of overt harassment--for example, professors in the humanities and social sciences openly admit that they would be inclined to reject a candidate if they knew he or she was conservative. There are also lots of other data, anecdotal and quantitative. Much more is needed, however.

Importantly, the proportion of conservatives to liberals has been falling since the 90s, so something is driving the trend. It's also possible that conservatives are selecting out of a career that they (evidently correctly) perceive to be hostile to them or otherwise corrupt (more interested in furthering an agenda than seeking truth). The point is that even if it is self-selection, we may be able to improve the ratio enough to improve the science. And it's not like the ratio needs to be 50:50--probably 30:70 would suffice. Right now many fields are less than 10:90, and intuition suggests that the effect is nonlinear. If only 10% of faculty are conservative, intimidation causes them to suppress their already marginal voice, but 30% might be enough to allow them to feel secure in providing the sort of criticism those fields seem to desperately need. Of course, I'm picking on the academy, but the same probably holds true for journalism.

Again, the point isn't to assign blame, but to improve the science.


> Whether or not the ISPs have this information is a fundamentally separate issue from whether ideological homogeneity biases science. There is no dichotomy here.

I'm sort of making two only vaguely related claims here.

Claim 1: In this particular case, I don't think bias is even a problem, because we shouldn't even be doing this leg work on behalf of ISPs in the first place.

My point is that ISPs choose not to contribute to the quality of publicly available information. When entire industries (or governments) willfully hide data that's necessary for a populace to make an informed decision, I default to a worst-case assumption on that data. Not because I think that's most likely, but out of contempt for the lack of transparency and as a forcing function for greater transparency.

In other words, these "biased" researchers are already a hell of a lot more charitable toward the ISPs' position than I think we ought to be. If ISPs aren't willing to share granular data on network accessibility, then assume it's a major problem until they muster up the will to share that data.

>... the humanities and social sciences...

Claim 2: On this issue, I think it's helpful to discuss "critical studies" and related departments separately from the rest of "academia" and separately from the natural and mathematical sciences in particular. I don't think that the liberal bias in natural/math sciences has the same underlying causes OR the same long-term effects.

I.e., I don't think that chemistry departments are hiring based upon political beliefs, and I also don't think that political homogeneity affects the quality of chemistry research.

You also have to explain why political diversity is even important in the face of other forms of homogeneity. E.g., I'd rather a very liberal, very theoretical physics department hire a flaming liberal experimentalist than yet another theorist who happens to be conservative. The importance of political diversity in humanities and social sciences is obvious, but I don't see why science would suffer if all algebraists were libertarians...

My assertion is that if you want a politically diverse chemistry department, you have to 1) justify why that diversity is even necessary / more significant than other major problems; and then 2) probably also solve different underlying causes.

> Of course, I'm picking on the academy, but the same probably holds true for journalism.

And also literally every other profession, from clergy to CEOs to career criminals. There's even a trite quote about it.


> Claim 1

I respect that your opinion is that ISPs have a social responsibility to disclose this information so think tanks and academies don't have to find it. I'm curious if anyone thought to ask them, but that's neither here nor there. I don't generally expect businesses to provide data beyond that which is required by law, but I'll concede the point since it's unrelated to my claims.

> Claim 2: On this issue, I think it's helpful to discuss "critical studies" and related departments separately from the rest of "academia" and separately from the natural and mathematical sciences in particular. I don't think that the liberal bias in natural/math sciences has the same underlying causes OR the same long-term effects.

I agree, and I intended to. I should have done better, but I quickly authored that comment from my phone in the waiting room at the vet clinic.

> You also have to explain why political diversity is even important in the face of other forms of homogeneity. E.g., I'd rather a very liberal, very theoretical physics department hire a flaming liberal experimentalist than yet another theorist who happens to be conservative. The importance of political diversity in humanities and social sciences is obvious, but I don't see why science would suffer if all algebraists were libertarians...

Again, I don't mean to apply my "political diversity" quip to apolitical fields.

> My assertion is that if you want a politically diverse chemistry department, you have to 1) justify why that diversity is even necessary / more significant than other major problems; and then 2) probably also solve different underlying causes.

I agree, and I don't think political diversity is important in chemistry, since few chemistry topics are politically polarized.

> "Of course, I'm picking on the academy, but the same probably holds true for journalism." And also literally every other profession, from clergy to CEOs to career criminals. There's even a trite quote about it.

Probably some degree of political diversity is good for all fields, but it's uniquely critical for politically polarized epistemological fields.


Out of curiosity, what's the quote you mentioned?

Unfortunately they don't discuss two of the biggest disparities I've seen in reporting on broadband internet access:

1) How do they define "Access"? Does it mean actual subscriptions? Does it mean the building/home is connected? Or does it mean a line passes the household, but there's actually no way to connect to it? (Look up New York City's lawsuit against Verizon's FIOS rollout.)

2) How do they define "Broadband"? In 2010 the FCC defined it as 4 Mbit/s down, 1 Mbit/s up. In 2015 they redefined it as 25 Mbit/s down, 3 Mbit/s up. Currently I have 50Mbit/s down and 1Mbit/s up which Comcast absolutely defines as "Broadband", but it doesn't meet the FCC definition.
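
To make the definitional point concrete, here's a tiny sketch that checks a plan against the thresholds quoted in this thread (the 4/1 and 25/3 figures above, plus the 10/1 cut that, per the article's footnote, the FCC data set uses); my 50/1 plan fails the 2015 definition on upload alone:

    # Thresholds quoted in this thread: FCC 2010 (4/1), the 10/1 cut the
    # article's footnote says the FCC data set uses, and FCC 2015 (25/3).
    DEFINITIONS = {
        "FCC 2010 (4/1)": (4, 1),
        "FCC data set (10/1)": (10, 1),
        "FCC 2015 (25/3)": (25, 3),
    }

    def meets(down_mbps, up_mbps, name):
        min_down, min_up = DEFINITIONS[name]
        return down_mbps >= min_down and up_mbps >= min_up

    # My 50/1 plan:
    for name in DEFINITIONS:
        print(name, "->", meets(50, 1, name))
    # Clears 4/1 and 10/1, fails 25/3 on upload alone.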


Subscriptions are not generally a very useful metric. I would expect "Access" to mean that you would be able to sign up for and receive service, which might involve just a package arriving that you plug in, but might mean people show up with a digger and spend a day running cable.

As illustration: The UK has some community owned projects which supply Internet access to otherwise under-served rural areas. The government ensures commercial suppliers don't run "spoiler" projects (e.g. when a community project announces they're coming to some village, the incumbent Telco can't suddenly remember they decided to run ultra-fast Internet to that village with a free introductory offer)

B4RN (Broadband For the Rural North, but pronounced "Barn") is the most famous. They run dedicated fibre near all the properties in a rural area; if you own some cottage in a little country village they're serving, the fibre probably runs in the grass outside your fence.

But people (probably volunteers since it's B4RN) need to come dig up your lawn and run cable to your cottage before you can actually use the service. B4RN charges a fair amount of money for this setup, although it's waived if you instead pay even more money to become a shareholder in their corporate entity. But still, the cottage would have "Access" once that fibre is outside, because B4RN have a commitment to do the install if you pay for it.

It's silly to say one cottage in a row doesn't have "Access" to B4RN's 1000Mbps symmetric Internet just because the old lady who lives there doesn't need the Internet. If you bought that cottage you could fork over the price of a Nintendo Switch and get Gigabit within a week or so.

Subscription _could_ be a good metric if we were arguing that Internet service is unacceptably expensive in specific regions and thus prices out the lower middle classes or working classes, but that's clearly not the focus of 538's investigations.

You're right that it's important to classify "Broadband" and give some idea how fast that actually is. Certainly that's something that needs to be in the footnotes for such an article.


> Subscription _could_ be a good metric if we were arguing that Internet service is unacceptably expensive in specific regions and thus prices out the lower middle classes or working classes, but that's clearly not the focus of 538's investigations.

Affordability is actually the main point of 538's second article: "Lots Of People In Cities Still Can’t Afford Broadband" ( https://fivethirtyeight.com/features/lots-of-people-in-citie... )

However, that article is centered around the data for Washington D.C., and their retraction specifically talks about the data for D.C. being incorrect.


... but then, there is no place on earth where you couldn't "sign up and receive service" if you have a few billion to spare.

From who?

I admit that "a few billion" - which is clearly beyond what I was suggesting - might open doors, but not by just signing up somewhere.

For example, the cable TV company offers Internet service to many people in my city, and indeed on my road. They put unsolicited advertisements through my door every week. But, despite the insistence that I could have "Super fast fibre-based Internet" tomorrow, in fact I can't: my building isn't connected, and the owners of the building (an opaque off-shore property holding corporation) would need to sign off on work to change that, which they won't do even if I wanted it.

Now sure, I could buy all the flats in the building, then leverage control of those flats to force the current owner to sell me the building, and then I'd have the right to get that work done. That means now I'm paying maybe £2-5M to get their merely "super-fast" broadband to my flat in a major city. Short of your "billions" but not by so many orders of magnitude...

Or another example, for a few years the incumbent telco technically offered FTTP to any customer who could get their existing FTTC product, for a fee they'd work out what it cost them to run the extra cables and so on, add a percentage and if you agreed you'd pay that to get FTTP. The service was weakly advertised but immediately over-subscribed and it was effectively cancelled altogether without more than a handful of deployments. A few billion wouldn't do much about that I think, you'd need tens of billions to buy that incumbent telco and "change their minds" about the priority of such bespoke services that way.


> From who?

From someone?

> I admit that "a few billion" which is clearly beyond what I was suggesting might open doors, but not by just signing up somewhere.

Well, but what does "sign up somewhere" really mean? If you called a major telco and made it clear that you have a few billion to spend and you wanted a gigabit internet link, you don't think they would find a way? I mean, think Bill Gates calling ... you think he would have to do much more than order what he wants to get it delivered in a reasonable time frame?

> Now sure, I could buy all the flats in the building, then leverage control of those flats to force the current owner to sell me the building, and then I'd have the right to get that work done. That means now I'm paying maybe £2-5M to get their merely "super-fast" broadband to my flat in a major city. Short of your "billions" but not by so many orders of magnitude...

And really, you wouldn't have to do any of that. For one, you don't have to buy it; you simply have to tell them that you'll pay them a billion bucks to sign off on things. And secondly, the telco will happily do that for you.

But also, they wouldn't have to sign off on anything. The telco could just as well install an LTE cell just for you, or a laser link from across the street, ...

> A few billion wouldn't do much about that I think, you'd need tens of billions to buy that incumbent telco and "change their minds" about the priority of such bespoke services that way.

Erm ... what's the yearly profits of that telco? What's a few billion in comparison to that? You don't think they'd take a substantial increase in profits for a few days of work for some of their technicians?


You don't need to make such an extreme argument. I'm not sure what dollar threshold I would use but I don't think I'd get much of an argument saying that, if it costs $10K to get a hookup somewhere, the average household in that area doesn't really have access to broadband.

We both know this is a disingenuous argument, because average households don't have a few billion to spare.

And neither do many households have a few thousand or even a few hundred bucks to spare, which is precisely why the argument "available access is if you could sign up" that I was responding to doesn't work. There is no place where you couldn't get broadband internet, and in most places within developed countries for a lot less than a billion, but that doesn't mean households could actually reasonably afford it. You would at least have to specify some price limit for availability.

> How do they define "Broadband"?

There's a footnote in the article which says that the FCC data uses the 2016 redefinition of broadband back down to 10mbps:

> The FCC has two data sets on broadband connections. For our analysis, we used the data specifying connections with at least 10 Mbps downstream and 1 Mbps upstream.


small anecdote on down/up speeds:

I recently signed with Comcast (it was that or DSL) for their 150Mbit/s down plan, and didn't even check the "up" speed, like a fool.

I just assumed around 50Mbit/s, maybe 25Mbit/s at the worst.

It's 5Mbit/s... so I'm downloading gigabytes of huge datasets in a minute, making a small change, then uploading... and uploading... and uploading... better get a coffee....
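
A quick back-of-envelope, with a hypothetical 4 GB dataset and ignoring protocol overhead, shows how lopsided that split is:

    # Hypothetical 4 GB dataset on a 150 Mbit/s down / 5 Mbit/s up plan,
    # assuming the link actually sustains its rated speed.
    def transfer_minutes(size_gb, rate_mbps):
        bits = size_gb * 8e9  # decimal gigabytes to bits
        return bits / (rate_mbps * 1e6) / 60

    size_gb = 4
    print(f"download: {transfer_minutes(size_gb, 150):.1f} min")  # ~3.6 min
    print(f"upload:   {transfer_minutes(size_gb, 5):.1f} min")    # ~106.7 min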


Plus, your internet packets get de-prioritized behind phone and TV on the consumer internet plans. During peak usage hours, you won't even reach those high download speeds.

I recently got a 75/15 package with Comcast Business for $150 monthly. The guaranteed bandwidth seems to be worth it.


I don't think this is correct. From what I understand about DOCSIS, the video-related channels and Internet-related channels are completely separate frequency ranges on the coax line. The reason your bandwidth is reduced during peak usage hours is because you're sharing something like a 10gbps fiber line with hundreds of other people.

> ...the video-related channels and Internet-related channels are completely separate frequency ranges on the coax line...

Yes, that's why they can use band pass filters to enable and disable specific services and packages. i.e., specialty channels that are part of a premium package are grouped together in certain frequency ranges, and they use equipment to block or allow signals at varying frequencies on each consumer connection.


But doesn't their VOD service use a DOCSIS packet connection?

I thought that in an average residential setup, TCP traffic costs something like 1kB up for every 10kB down, just to manage the TCP connection and send requests from the client, etc. Maybe UDP/video streaming uses less upload though.

FYI: A gigabit Comcast connection is commonly only half the advertised speed. The upload speed included with their gigabit plan is 40 Mbit/s. 40 Mbit/s. And the upload is always 40 Mbit/s in my tests.

I'm just waiting for them to start advertising total speed. So your 50/5 connection would be advertised as 55Mbps!

>In 2015 they redefined it as 25 Mbit/s down, 3 Mbit/s up.

How can that be? That's the same time we achieved net neutrality. That's not net neutral. That's fast lane on download, slow lane on upload. Asymmetric bandwidth is a prime reason why we all can't host our own clouds and social networks.


I have never seen anyone define net neutrality as including symmetric bandwidth. I am also skeptical that lack of upload bandwidth has much to do with the lack of success of self-hosted social networks.

I've heard two theories about capped upload bandwidth: making it possible to upsell premium hosting services, and P2P damage control.

I think the simplest explanation is that the transport uses a fixed allocation of bandwidth between upstream and downstream flows, and they've chosen an allocation that matches typical workloads. It looks like full-duplex DOCSIS (data-over-coax protocol) was just announced last year:

https://en.wikipedia.org/wiki/DOCSIS
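
As a rough illustration of the fixed-allocation point, using commonly cited DOCSIS 3.0 approximations (~38 Mbps usable per downstream channel, ~27 Mbps per upstream channel) and a hypothetical 16x4 bonding group; real channel plans vary by operator:

    # Per-channel rates are commonly cited DOCSIS 3.0 approximations
    # (~38 Mbps usable per 6 MHz downstream channel, ~27 Mbps per upstream
    # channel); the 16x4 bonding group is hypothetical.
    DOWN_MBPS_PER_CHANNEL = 38
    UP_MBPS_PER_CHANNEL = 27

    def plant_capacity(down_channels, up_channels):
        return (down_channels * DOWN_MBPS_PER_CHANNEL,
                up_channels * UP_MBPS_PER_CHANNEL)

    down, up = plant_capacity(16, 4)
    print(f"{down} Mbps down / {up} Mbps up, shared across a service group")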


That's begging the question. The up/down ratios they choose determine what workloads are possible/typical. More upstream would enable other workloads.

Network neutrality has nothing to do with bandwidth.

Network neutrality means if I purchase a pipe with x Mbps of potential bandwidth, my communication on that pipe to any destination should have equal priority.

In other words, an ISP shouldn't be able to add rules that say, "When our pipe is congested and we need to slow or drop packets, Google has the highest priority, followed by Amazon, then x, y, z." You cannot shape or throttle traffic based on where it comes from and goes.

You could still have congestion. Maybe the route to AWS has a shorter round trip than to Digital Ocean because of which backbones the ISP has decided to pay for and connect to. But the point of Network Neutrality is that ISPs cannot prioritize one packet type over another. They must try to deliver each payload from each customer evenly and, if they can't, slow or block those payloads evenly and without prejudice.

The one exception to this is if the customer is on IPv6 and uses the prioritization available in the protocol. But even then, the customer is sending an indication to the ISP saying "this packet is not as important" or "this packet is very important."


But different technologies support different speeds. Fiber can do symmetric much easier than cable can as an example, and cable is much more ubiquitous.

So if I'm reading this right, there are three data sources here:

1) US Census, which is based on surveying households "do you have broadband, Y/N"

2) FCC data, which is based on ISPs self-reporting (In a footnote the article says they're using Pai's new slower definition of broadband, 10Mbps, not the 2015 definition of 25Mbps.)

3) ASU/Iowa, which depends on a derived variable in commercially-purchased data which "denotes interest in ‘high tech’ products and/or services [including] personal computers and internet service providers" as a proxy for broadband ownership

...and the first two roughly match each other, while the third doesn't. The academics claim the company that sold them the data told them it was a reasonable proxy for broadband, the company says they didn't say that.


I just wanted to comment that I think this is brilliant and the kind of analysis and general skepticism of data we should see more of.

Just for context, if it's not obvious, I work with data. Both putting it together and analyzing it. And one of my chief frustrations with academia (and biggest lessons to people I advise) is a kind of "cultural reverence for the data set".

Just because data is collected in no way assures that it's right or suitable, even if a reputable name says it is.

Be skeptics. Private suppliers have an incentive to sell you data. Private industries have incentives to keep data from you (it constitutes competitive advantage). Government data has political interference on what is collected, even if you're lucky enough to live in a world where the actual collection is independent and rigorous. Responses to surveys and interviews may be inaccurate even when people think they're being honest, and on socially contentious topics you usually don't even get that honesty.

And even if you managed to avoid all that, it doesn't mean your data isn't problematic. Our census in my country, for example, is done in the winter time. How good is that at tracking information in seasonal towns?

Proper data collection is some of the hardest work you can do, and proper analysis comes from measuring, corroborating, justifying, hypothesising on the data you have. It does not involve just calculating a stat or, god forbid, just testing things for statistical significance just because it's on your data set.

For all those reasons, I highly commend this article. We need more of it.


I'm really confused by this: surely if you want to make a data set that has Internet speed binned by county, the way to do that is as follows (a rough sketch of the join is below the list):

1. Go to a large Internet services provider (Amazon, Google, Akamai, Netflix).

2. Ask them to statistically sample the TCP flow rate observed in client traffic, by source IP address.

3. Get a data set that geolocates IP addresses to ZIP code (Amazon for example has this data).

4. Join the two.
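
As a minimal sketch of steps 2-4, assuming you somehow already had (a) per-IP flow-rate samples and (b) an IP-to-county table (file and column names are hypothetical; getting a provider to hand over sampled flow rates is the unsolved part):

    # Steps 2-4 in miniature. File names and column names are hypothetical.
    import pandas as pd

    flows = pd.read_csv("flow_samples.csv")  # columns: ip, mbps
    geo = pd.read_csv("ip_to_county.csv")    # columns: ip, county_fips

    merged = flows.merge(geo, on="ip", how="inner")
    by_county = (merged.groupby("county_fips")["mbps"]
                       .agg(["median", "mean", "count"])
                       .rename(columns={"count": "sampled_ips"}))
    print(by_county.sort_values("median").head())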


I've learned that getting a good data set is often the most challenging part of data analysis. You think "it should be easy to get this data" and usually it's not, unless someone else has already done the work.

One flaw I can already identify with your data set is that it doesn't differentiate between residential traffic, workplace traffic, and cellphone traffic. A source like Netflix could be a heavily biased set since it relies on self-selected subscribers; there is no skipping research into whether their user base provides an adequate sample at the county level. Asking a huge company like Google and getting them to pay attention is another challenge.

The article writes about how the FCC does have more granular data, but "the commission is wary of “one carrier learning about another carrier’s market share or where their customers are,” Rosenberg said." So you are at the mercy of competitive businesses willingly releasing data that competitors might find useful.


They want data on Internet use (and specifically broadband Internet use) by county. That isn't a trivial ask.

You can get reasonable rates at a larger scale using survey approaches (like OxIS, the British survey series from the Oxford Internet Institute) but these aren't granular enough to compare small areas. Otherwise you're using some kind of proxy data.

Your approach has two issues. First, people without Internet access are not in the sampling frame at all, which is a critical problem if you're interested in network access. Secondly, the geolocation in step 3 is also not particularly reliable (mine tends to think I'm 100 miles away from where I am) so you'll get the same kinds of problems they found here in that the data look kinda plausible but aren't actually reliable enough to draw defensible conclusions from.


How many people are represented by a single IP address? How do you deal with changing IP addresses (most residential ISPs assign dynamic addresses)? How do you distinguish between residential and commercial traffic? How do you deal with cellular devices (for which GeoIP data is much less accurate)? How do you account for people who don't use the service you got the data from? (Netflix in particular has a limited subscriber base, but all the providers would exclude, say, elderly folks who do nothing but email).

Getting good data at nationwide scale is never as easy as it sounds in your head, unfortunately.


All good questions:

1. Typically a single IP address represents a single connection. Yes there are providers who NAT multiple subscribers onto one IP but they are rare (because if they do that then they have to maintain NAT logs in order to identify criminal subscribers to law enforcement -- easier to just have 1:1 IP to subscriber).

2. Residential addresses are in fact not really dynamic. Yes they can change from time to time but for the most part they don't (see #1).

3. Cellular traffic can be identified because the cell carriers use specific identifiable netblocks (a rough filter along those lines is sketched after this list).

4. It doesn't matter if not everyone uses the sampling service because that's the point of sampling.
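
Here's a rough version of the filter in point 3, using the stdlib ipaddress module; the CIDR blocks are placeholders (documentation ranges), not real carrier allocations:

    # Dropping samples whose source IP falls in a known cellular netblock.
    # The CIDR blocks below are placeholders, not real carrier allocations.
    import ipaddress

    CELLULAR_NETBLOCKS = [
        ipaddress.ip_network("198.51.100.0/24"),  # placeholder "carrier A"
        ipaddress.ip_network("203.0.113.0/24"),   # placeholder "carrier B"
    ]

    def is_cellular(ip_str):
        ip = ipaddress.ip_address(ip_str)
        return any(ip in net for net in CELLULAR_NETBLOCKS)

    samples = ["198.51.100.17", "192.0.2.5", "203.0.113.200"]
    print([ip for ip in samples if not is_cellular(ip)])  # ['192.0.2.5']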


Google do this with YouTube already: https://www.google.com/get/videoqualityreport/m/

This methodology would completely miss the point of the 538 story, which was to report on access.

It makes a reasonable assumption that everyone with access has a similar usage profile, at the level of county averages.

That assumption isn't helpful, and even if it were, isn't necessarily reasonable. But I'll just fwd you to pmyteh's post instead of rehashing those claims here.

Responsible adults; I approve.

Notice we applaud the careful postmortem on a research report in exactly the same way we applaud a post-outage report.


Personally - there is something that I would like:

A monitor of exactly how much traffic is used by ads vs content.

So if I load a page - say, an article that is just text - what % of the bandwidth consumed is the content I am interested in vs the ads surrounding it?

The reason why this number is important is Mobile.

So a user signs up for "3 gigs of data" - how much of that 3 gigs is consumed by ads and shit they don't want/need?

Actually - it would be good to have a standard on reporting for any given page "this page weighs in at 50KB for content and 500KB for ads..."
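
The closest I can get today is to export a HAR file from the browser dev tools for a page load and split the transferred bytes by whether the request host looks like an ad/tracking domain - something like this rough sketch (the domain hints are a tiny illustrative stand-in, not a real blocklist):

    # Split a page load's transferred bytes into "content" vs "ads" from a
    # HAR export. The ad-domain hints are illustrative only.
    import json
    from urllib.parse import urlparse

    AD_HOST_HINTS = ("doubleclick", "googlesyndication", "adservice", "adsystem")

    def split_ad_bytes(har_path):
        with open(har_path) as f:
            entries = json.load(f)["log"]["entries"]
        ads = content = 0
        for e in entries:
            host = urlparse(e["request"]["url"]).netloc
            size = max(e["response"].get("bodySize", 0), 0)  # -1 means unknown
            if any(hint in host for hint in AD_HOST_HINTS):
                ads += size
            else:
                content += size
        return content, ads

    # content, ads = split_ad_bytes("pageload.har")
    # print(f"content: {content/1024:.0f} KB, ads: {ads/1024:.0f} KB")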

Does this exist?


Large providers do scrape this kind of information, but I'm not sure anyone discloses it.

I don't recall which FCC action it was last year, but as I recall, the large providers are no longer required to at least show they attempted to offer broadband to all households.

Previously they had to show state and fed gov't this info. Now they get to concentrate on providing access to the most profitable while ignoring the less profitable.


Kudos to FiveThirtyEight on being transparent and analyzing what happened. But also...this was a series of mistakes, some of them pretty scary.

FiveThirtyEight's biggest mistake seems to be trusting an academic dataset when they had no idea how it was collected. This is understandable, especially when the data was published on Arizona State University's Center for Policy Informatics data portal. (You can go there right now and download the bad data - scroll to CATALIST DATA here https://policyinformatics.asu.edu/broadband-data-portal/data...) A university should be a trusted source. But FiveThirtyEight took an unbelievable outlier from this dataset and wrote an entire post about it (https://fivethirtyeight.com/features/lots-of-people-in-citie...). The dataset claims that only 29% of Washington D.C.'s adults have broadband. (The real number, according to the other datasets FiveThirtyEight looked at in the new post, is closer to 70%.) They even point out in the article's histogram how extreme the Washington D.C. datapoint is, as the only large county with such a low percentage. That should be a clue to question your data.
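
For what it's worth, even a quick robustness check across comparable large counties would have flagged that point. Here's a toy sketch, with made-up values standing in for the other counties and a median/MAD "modified z-score":

    # Made-up values stand in for the other large counties; 29% is the
    # suspect D.C. figure from the dataset.
    import statistics

    broadband_pct = {
        "Cook County, IL": 73, "Harris County, TX": 69, "King County, WA": 81,
        "Maricopa County, AZ": 74, "Washington, D.C.": 29,
    }

    values = list(broadband_pct.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)

    for name, v in broadband_pct.items():
        z = 0.6745 * (v - med) / mad if mad else 0.0
        flag = "  <-- check against a second source" if abs(z) > 3.5 else ""
        print(f"{name}: {v}% (modified z = {z:+.1f}){flag}")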

What I find worse is that the academic researchers published this dataset. They bought behavioral marketing data and trusted a salesperson that the variable HTIA (“Denotes interest in ‘high tech’ products and/or services as reported via Share Force. This would include personal computers and internet service providers. Blended with modeled data.”) was a good proxy for broadband access. To be clear, HTIA includes modeled data, which means they took demographics, voting records, and whatever other individual data they could grab (maybe they have records of your purchases, I'm just guessing), and predicted whether each adult in the US was interested in tech. This is the kind of data companies buy for ad campaigns, figuring that if they advertise to these adults, it might be better than random. There's no reason to think the aggregates of these numbers would be accurate or calibrated correctly, especially for an entirely different purpose (broadband vs high tech).

It's disturbing that these sort of datasets are floating out there in academia and really makes you wonder what other bad data is being blindly trusted to write blog posts, research papers, and news articles.


Does anyone here know if there is a way to opt-out of being in the Catalist dataset?

A few things I don't like about this:

"After further reporting, we can no longer vouch for the academics’ data set. The preponderance of evidence we’ve collected has led us to conclude that it is fundamentally flawed.... The idea behind the stories was to demonstrate that broadband is not ubiquitous in the U.S. today, even as more of our lives and the economy go online. We stand by this sentiment and the on-the-ground reporting in the two stories even though we have lost confidence in the data set."

If the data you used to reach a conclusion is fundamentally flawed, it's pretty disingenuous to claim you stand by the sentiment. So they started with a conclusion, set out to prove it, later found the data they used to prove it was flawed, but still believe it's true.

The second thing I don't like is it seems readers are very confused between access and usage and their sloppy wording often conflates the two. It appears they were studying usage (actual subscriptions) not access (availability of a high speed connection).

Lastly, they also seem to disregard an LTE wireless connection as usage of broadband, when I would have assumed it would clearly be considered. If LTE wireless is more commonly used as a form of access to broadband internet in certain areas (i.e. rural areas where density can't justify running the fiber, or dense metros where the LTE is so good there's no need for a wire), then it's not surprising you'll find broadband "usage" is low in those areas, even if those households are absolutely using broadband internet through an LTE hotspot.


> So they started with a conclusion, set out to prove it, later found the data they used to prove it was flawed, but still believe it's true.

Nate Silver and 538 are fairly hardcore Bayesians, and this is how pretty much all Bayesian thinking works.

You start out with some prior "sentiment" (a.k.a. prior belief), and then use data to update your "sentiment".

In turn, invalid data would mean that you'd revert back to your original prior sentiment, and when you get new data you'd start Bayesian inference once again.
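
For what it's worth, a toy beta-binomial sketch of that workflow (all numbers invented):

    # Toy beta-binomial version of that workflow; all numbers are invented.
    def posterior_mean(prior_a, prior_b, yes, no):
        return (prior_a + yes) / (prior_a + prior_b + yes + no)

    prior_a, prior_b = 7, 3  # prior "sentiment": roughly 70% have broadband
    print("prior mean:", posterior_mean(prior_a, prior_b, 0, 0))    # 0.70

    # A data set arrives claiming 29 of 100 sampled households have broadband:
    print("updated   :", posterior_mean(prior_a, prior_b, 29, 71))  # ~0.33

    # The data set turns out to be flawed and is discarded -> revert to the
    # prior and wait for better data before updating again.
    print("reverted  :", posterior_mean(prior_a, prior_b, 0, 0))    # 0.70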

Edit: Looking at the on-the-ground investigative reporting and the other sources and studies they've cited in the related articles, I actually agree that they have decent evidence to support their belief without these data sets.[0][1]

I mean, ignoring the data sets, would you argue against the idea that many people in cities can't afford broadband or that many places in rural America have crummy internet infrastructure? Certainly, I'm less confident than I would be with the additional data, but my confidence is still relatively high without it.

I do agree with your other points. Mobile internet access--with or without hotspots--is unreasonably ignored. The conflation between access and usage was a lesser concern since I managed to navigate the articles well enough.

[0]: https://fivethirtyeight.com/features/the-worst-internet-in-a...

[1]: https://fivethirtyeight.com/features/lots-of-people-in-citie...


>> If the data you used to reach a conclusion is fundamentally flawed, it's pretty disingenuous to claim you stand by the sentiment. So they started with a conclusion, set out to prove it, later found the data they used to prove it was flawed, but still believe it's true.

It's just a sentiment. I have a strong sentiment that LTE reception in the middle of the Sahara is terrible. I haven't collected any data whatsoever to support that sentiment, yet I fully stand by it. There's nothing disingenuous about that.


Read a different way, they have a hypothesis they set out to prove or disprove. Their experiment was flawed and the data thrown out, so the hypothesis is neither confirmed nor disproved. It's still a hypothesis, and one backed by more anecdotal evidence. It would be disingenuous to suggest it's false without evidence supporting same. I think they were transparent and logical in their reporting.

Well, the other data they introduced and which isn't disputed supports the idea that broadband penetration is uneven. Just not at the widespread granular and detailed level that the faulty data purported to show.

You do make a reasonable point about usage and access. Which is more relevant depends on what you're trying to show. Access is whether you can get it at all while usage could well bring in economics as well.

It's a fair point about LTE. They don't seem to have included satellite either, which definitely gets used in rural areas where nothing else is available. I suspect though that neither of these would affect overall conclusions that much.


Your point reminds me of an anecdote from a talk [1] shared on HN a few days ago [2] where the speaker quoted Saul Bellow on FDR's fireside chats as saying people were "not considering but affirming" their content. The point of that article was to express similarities between the adoption of radio and that of the internet, and that particular anecdote showed that one of the dangers of radio was its ability to communicate with, and persuade, people by emotion and sentiment more than facts. Poignant, then, to see an example like this where the internet likewise sways by sentiment rather than, or even in place of, facts.

Although, in some ways, what FiveThirtyEight is saying is worse than the Bellow quotation. They have not just "not considered," but have re-considered and decided to disregard the facts, while affirming what they originally wanted those facts to demonstrate. Just take their example of D.C., where they used a number that said there was 28.8% coverage, when in reality the coverage is 70%+. How can they then justify the sentiment that even metropolitan places like D.C. lack coverage?

The discrepancy, in their words, comes from totally different definitions: "We looked into it and found that the data set we used had a fundamentally different understanding of broadband access than other sources did." That FiveThirtyEight would stand by a sentiment derived from one definition of broadband coverage, and apply it to a definition that yields entirely different, indeed contradictory, data, seems irresponsible.

[1] http://idlewords.com/talks/ancient_web.htm [2] https://news.ycombinator.com/item?id=16106331


I'm probably being too harsh but...

Good on them for writing this; it's important to admit when you're wrong. However, I feel like this outlet has a larger responsibility to be actual data analysts as well as journalists (compared to, say, a more traditional journalist at a more traditional news outlet). As such, why was the analysis done in the postmortem article not done prior to publishing the original articles? A good analyst is one you can trust, and trust in an analyst is built by drawing conclusions from highly defensible data, and highly defensible data is data which has undergone severe scrutiny by the analyst before conclusions are drawn, not after. Also, they should probably update the now erroneous articles with a disclaimer indicating that much of the research is now invalid.


It's true that we have lost trust in their analysis, but there is an important thing to remember here: fivethirtyeight is not a peer reviewed journal. You should treat their articles in roughly the same category as a CNN article on something Trump tweeted, not in the category as an article published in Nature. While they do tend to do better data analysis than the Associated Press, 538 articles are not refereed and should not be treated as such. The assumption of a "highly defensible" data analysis is a little strong for them. This is very apparent on their sports section, which should give you a clue about the actual rigor of their analyses. Don't cite a 538 article in your academic research unless the 538 article is directly citing peer-reviewed research (and even then you should probably just cite the original research).

If you think of them as a news organization that uses data as its gimmick to sell page views, you will be less surprised at events like this (disappointed, yes - surprised, no). They have the same incentive to sensationalize things as a regular news organization. Their mission is not to increase the knowledge of the human race; their mission is to bring in page views to sell ads and make money.

It may sound like I'm coming down harsh on fivethirtyeight. I genuinely enjoy reading some of their articles, but I make sure to remember what kind of organization they actually are and don't fall for the trap of thinking they are a think tank staffed with postdocs.


This anchors motives too much in economic determinism. It's a useful mental model to remember this is a factor, and it can come to dominate and cause problems, but it is not always the one true factor to rule them all through which we should view all motives.

You basically end up arguing that it is all about money, and real journalism cannot happen under profit-seeking organizations. It also trivializes that journalism's big challenge right now is how to do real journalism when the big tech players have vacuumed up their ad revenue.

I also wouldn't treat 538 as the same as CNN writing about a thing Trump tweeted. They actively talk about and discuss their journalistic goals. They try to be openly self-critical about what and how they cover topics. They are trying to compete by not doing the same thing as other organizations.


I think that real journalism can happen under profit seeking organizations, because people find value in real journalism. However, real journalists don't do original research, they compile insights from experts. Journalists have a tendency to get it wrong when they try to add original insight in an article. Just about every person who is an expert in a field has a story about when their field made it into the news cycle and the journalists butchered some important concept.

I'm a statistician, and we always work with investigators that are experts in their field- unless we are researching statistical methods, then we act as our own experts. The statisticians handle the data analysis and make sure that the investigators don't make silly data mistakes. The investigators handle the reasoning and mechanisms behind the research. When they work together they can collect good data that they are familiar with and know how to interpret correctly. When they work separately they are prone to mistakes.

Fivethirtyeight (and also the ASU researchers) fell into this trap. They were not involved in the data collection, so they don't really know what each variable means, and just took the word of someone who (back to the economic motives) has an incentive to sell their dataset and not to tell the researchers to go look elsewhere.

I'll admit I was too harsh in comparing fivethirtyeight to an article about Trump's tweet; 538 is typically decent investigative journalism. However, I maintain that it isn't on the level of peer-reviewed academic research. Articles on 538 don't go through peer review. They aren't submitted and don't languish for months with revisions and follow-up questions. I'm not saying that this mistake would have been caught by a referee, but in the peer-review process a referee could have caught it by asking questions about the analysis process and whether it's valid to use a variable named "X" as a proxy for a variable named "broadband access."


>I think that real journalism can happen under profit seeking organizations, because people find value in real journalism. However, real journalists don't do original research, they compile insights from experts. Journalists have a tendency to get it wrong when they try to add original insight in an article. Just about every person who is an expert in a field has a story about when their field made it into the news cycle and the journalists butchered some important concept.

It's much much worse than this. I was once interviewed about my research by an NPR reporter. He had already decided what his story was about and tried a variety of tricks to get me to say some pithy quote he had devised so he could use it on air. The problem was that my research actually debunked, rather than supported, the story he wanted to run, and the pithy quote was scientifically unsupportable.

I often wonder whether some of the quotes I hear on NPR are edited versions of the interviewees saying "no it wouldn't be accurate to say x" cut down to "x".


I think this is fair and is a sound refutation to my stance on how rigorous they should be. Your reply reads like a debearding-of-santa, which is maybe what I needed in this case.

I expect way more of FiveThirtyEight articles than I expect of CNN articles, and I value FiveThirtyEight articles much more than I value CNN articles.

I think the idea is that we all make mistakes. The best we can do is admit them and learn from them, just like any other postmortem.

Also, they did update the erroneous articles:

https://fivethirtyeight.com/features/lots-of-people-in-citie...

https://fivethirtyeight.com/features/the-worst-internet-in-a...


> Also, they did update the erroneous articles:

Ah so they did, my mistake.

> I think the idea is that we all make mistakes.

I am in total agreement; we all do make mistakes, and it's refreshing when people own up to them. I guess my main question still stands though. As a top-tier data journalism outlet, why was data QA not performed prior to analysis, or at least before publishing the results? And if data QA was performed, where did it fail? It just seems that they got bit by a time crunch to publish content amid the FCC repeal, and sloppy analysis was the result.


Everyone can do better. It's impossible to research every story from first principles of physics.

They probably just didn't think of this possible flaw in data. I mean, it's not like the university researchers caught wind of any problem, either.



