PostgreSQL used fsync incorrectly for 20 years (fosdem.org)

Years ago (2010 iirc), I reported to the Postgres list, with a patch, that (one of) the reason Postgres “didn’t work on NFS” was because it couldn’t deal with short writes. I got told the usual “you’re holding it wrong” instead of an acknowledgement of PG not sticking to the spec.

I patched my own systems (wrap the calls in a loop as per standard practice) and then proceeded to run literally hundreds of thousands of PostgreSQL instances on NFS for many more years with no problems.

The patch was eventually integrated it seems but I never found out why because I lost interest in trying to work with the community after that experience.


This saddens me.

Looking at this thread - the patch is obviously right. It doesn't matter about NFS, Linux, or any host of random crap people brought up. (I unfortunately suspect if you hadn't mentioned any of that, and just said "write doesn't guarantee full writes, this handles that" it may have gone better).

The documentation (literally on every system) is quite clear:

"The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written."

(see https://www.gnu.org/software/libc/manual/html_node/I_002fO-P...)

Literally all documentation you can find about write on all systems that implement it clearly mentions it doesn't have to write all the bytes you ask it to.
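
For reference, here's a minimal sketch of the loop the documentation is asking for (the helper name and the errno choice for a zero return are mine, not from the actual patch). It retries on EINTR, advances past short writes, and treats a zero return as an error so it can't spin forever:

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Keep calling write(2) until the whole buffer is written.
     * Returns 0 on success, -1 on error with errno set. */
    static int write_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted before writing anything: retry */
                return -1;         /* real error: report to caller */
            }
            if (n == 0) {          /* defensive: don't spin on a zero return */
                errno = EIO;       /* assumption: surface it as an I/O error */
                return -1;
            }
            buf += n;              /* short write: advance and write the rest */
            len -= (size_t) n;
        }
        return 0;
    }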

Instead of saying "yeah, that's right", people go off in every random direction, with people arguing about your NFS mount options, what things can cause write to do that and whether they are possible[1] or they should care about them.

Sigh.

[1] spoiler alert: it doesn't matter. The API allows for this, you need to handle it.


I'm a bit confused by this perception of the thread. There was one person doubtful that the patch was the right approach (Tom), but he still offered concrete review comments (of actual issues that'd at the very least need to be documented in code comments). Others agreed that it's something that needs to be done.

> Instead of saying "yeah, that's right", people go off in every random direction, with people arguing about your NFS mount options, what things can cause write to do that and whether they are possible[1] or they should care about them.

Given that changing the mount option solved the immediate problem for the OP, I fail to see how it's random.

> [1] spoiler alert: it doesn't matter. The API allows for this, you need to handle it.

Sure, it's not like the return value was ignored before though. I agree retrying is the better response, but erroring out is a form of handling (and will also lead to retries in several places, just with a loop that's not tightly around the write()).


> Given that changing the mount option solved the immediate problem for the OP, I fail to see how it's random.

It's random because pretty much anyone could volunteer that opinion. By diagnosing the problem, making a patch and reaching out, the author has already made their intention of fixing the problem clear. Unless it has been established that the problem isn't valid, a workaround isn't really relevant.

Effective communication and exchange of ideas needs to have a good ratio between work and value. I used to frequent meetups where people would present projects with thousands of hours of work and priorities behind them. There would almost always be someone with less experience stating their "ideas" on what should be done instead. Eventually those people would end up talking among themselves where their ideas could flow freely without any restriction of actual work being done.

Experienced people provide value. They try to understand the problem, add their own experience to it and validate the work that has already been done. They make the problem smaller and closer to a solution. They don't, or shouldn't, casually increase the scope for little reason.


> It's random because pretty much anyone could volunteer that opinion. By diagnosing the problem, making a patch and reaching out, the author has already made their intention of fixing the problem clear. Unless it has been established that the problem isn't valid, a workaround isn't really relevant.

I'm baffled by this. Even if the fix had been immediately committed, the workaround of using nointr still would have been valuable, because a fixed version of postgres wouldn't immediately have been released.

You seem to argue in a way that entirely counteracts your own later comments.


If someone spends x amount of work hours on something, that is what they want feedback on. They aren't looking for quick suggestions on other paths to explore. It isn't a brainstorming session. It is work done, being represented by an e-mail, code or a product. You are being presented with their theory for a solution. At some point something else might be relevant, but that isn't something to assume. The assumption should be that the person presenting has made their choices based on their situation.

It is often the same with software. Good feedback on software isn't random ideas, suggestions or feature requests that add hundreds of hours of work on a whim. It is feedback that considers the work that has already been done. Anyone can come up with something else, especially in theory and with a blank slate. It doesn't really require anything other than an opinion. Hacker News certainly is proof of that.


> I agree retrying is the better response, but erroring out is a form of handling

In this case it is “a form”, but that specific handling is provably wrong. That is important: anybody could have tried it and proven it again. Therefore just dismissing the correct handling and keeping the wrong one is, let me repeat, also provably wrong.


They were not review comments. They were arguments about where it ends. He's totally unwilling, and says you shouldn't run off of NFS. But then other people are, and are talking about mount options, when the problem clearly lies in the code not following the spec.

Why are:

> 1. If writes need to be retried, why not reads? (No, that's not an invitation to expand the scope of the patch; it's a question about NFS implementation.)

and

> 4. As coded, the patch behaves incorrectly if you get a zero return on a retry. If we were going to do this, I think we'd need to absorb the errno-munging currently done by callers into the writeAll function.

not review comments?

> when the problem clearly lies in the code not following the specs.

There were plenty of questions about how exactly the fix should look, below:

https://www.postgresql.org/message-id/AD4A13A5-8778-4D94-BBB...

How's that not review?


You had me until you tried to justify bubbling the error instead of handling it at its source while possible.

That can be a correct response, for instance, if you have some sort of async I/O in place and don't want to wait for a write that's not ready, because you'll block.

Hm? I explicitly said that retries would be better?

If “write” expects you to always call it in a loop, I wonder why it wasn’t simply implemented so that it calls itself in a loop in the first place. Or at least provide a “writeAll” wrapper function.

What if the disk is literally unplugged after a partial write? (or the network cable unplugged, etc.)

Callers would still need to be informed that a partial write occurred and would need to do something about it.


When I call write(), it is typically already inside a bigger loop. If the system can't write my whole buffer right now, then I am happy to get the CPU back so I can do more work before trying again (and now I might have more data for it too).
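
That pattern might look roughly like this (a hedged sketch with made-up names, assuming a non-blocking descriptor driven by poll()/epoll): keep an offset into the pending buffer, write what the kernel will take, and hand control back on EAGAIN:

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Hypothetical per-connection output buffer for an event loop. */
    struct outbuf {
        const char *data;
        size_t      len;   /* total bytes to send */
        size_t      off;   /* bytes already sent */
    };

    /* Flush as much as the kernel will take right now.
     * Returns 1 when everything is sent, 0 to retry later (would block),
     * -1 on a real error with errno set. */
    static int flush_some(int fd, struct outbuf *ob)
    {
        while (ob->off < ob->len) {
            ssize_t n = write(fd, ob->data + ob->off, ob->len - ob->off);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    return 0;       /* give the CPU back; poll() says when to retry */
                return -1;
            }
            ob->off += (size_t) n;  /* partial write: remember how far we got */
        }
        return 1;
    }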

This is literally the case study from worse is better :) https://www.jwz.org/doc/worse-is-better.html

Actually, as great as Postgres is and as generally approachable as the community is, my experience was the same a few times, and I've read about it on the mailing list happening to others:

Someone comes along with a patch or idea. Bunch of big Postgres people come knock it and it dies right there.

Happened to me when I suggested a more multi-tenant approach back around 2010 and today we have Citus. I was told that (paraphrased) no users were asking for that sort of thing.

I see it kind of happening with the Foreign Key Array patch that the author asked for help to rebase and no one bothered to reply.

Someone suggested replacing the Postmaster with a threaded approach so it could scale better (and showed benchmarks of their implementation handling more connections). Community response was there were already 3rd party connection pools that do the job. An outsider looking in considers this nuts - most people would not run additional components if they did not need to!

Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.

Could name a few other examples too I’ve noticed over the years and just sighed at. I don’t expect every idea to be accepted or even welcomed - but there is that sense of bias against change.


While this can and does suck when it happens to you, this is exactly what it takes to keep products focused so they don't die death by a thousand cuts (or feature requests). For every awesome feature embarked upon, there's an opportunity cost of bug fixes, stability updates, tech debt reductions, and other inglorious but necessary work. Aggressively de-scoping is the difficult but necessary work of keeping a product alive in a competitive marketplace. And yes, it's a marketplace even if the product is open source.

Yep, I think it’s an unfortunate side effect of dealing with an onslaught of bug reports, many of which are user error or someone else’s problem. It’s common in any kind of user support, you start seeing the same report over and over.

I even saw something similar when I went to the ER recently. Even doctors will pick up on one thing you say and draw conclusions from that and dismiss everything else.


"Even doctors will pickup on one thing you say and draw conclusions from that and dismiss everything else."

This pattern seems really common, and is what scares me about the future in general. The 'experts' concentrate on the stuff they understand / are best at, to the detriment of where the actual focus needs to be. In a lot of cases, this is despite the insistence of the supposed non-expert who is the one suffering as a result.

Some of the worst cases are as a child, where you get in trouble twice. First for wasting adults time because you didn't tell them properly, and then again for pointing out that you did.


I've had this experience with doctors so often it is chilling.

Elderly relatives writhing in pain, only to have doctors say it's indigestion (it was a perforated ulcer, and an uncle had previously died from a wrongly diagnosed ulcer perforating). My partner was misdiagnosed with flu when it was pneumonia, which then developed into pleurisy (I'd never seen either of the latter, but was telling the doctor that's what the symptoms looked like - 15 years later he still suffers pain from the pleurisy). I had an arm paralyzed through severe pain and the consultant doctor planned an operation "to cut the nerve" - I said I thought it was a frozen shoulder and that such a procedure was unnecessary; 6 months later the paralysis began to subside and the consultant agreed it was a frozen shoulder. Another relative died of bowel cancer that was said to be back pain (she died in the hospital where she worked). I know of several people who were telling the doctor they had cancer, only to have the doctors dismiss it as trivial, with most of these people dying because of their untreated cancer. As a child I had joint pains for years that were diagnosed as "growing pains" but turned out to be a hip disease (younger cousins ended up with the same condition, and because I'd already had it, they were more readily diagnosed by family members).

In both directions (treating trivial as serious and treating serious as trivial) I've seen so many mistakes. I'd be much happier to see a doctor google the symptoms rather than jump to a conclusion about what is wrong.

There's a famous anecdote where junior doctors are taught the importance of observation, by senior doctors tricking them into tasting urine. It doesn't seem to be a lesson they learn. Even when their own objective test results are contra-indicative of their pre-judgement I've seen doctors scratch their heads but stick with their incorrect pre-judgement.

When doctors I know have a family member go into hospital, you should see how attentive my doctor friends get concerning what is being said and done to their relatives. Some doctors will not even allow relatives to go into hospital for non-emergency treatment at certain times of year (because of timetabling there can be very inexperienced doctors on duty at certain times of the year).


>frozen shoulder

As someone who suffered through that on _both_ shoulders I can sympathize. For me, the doctor missed it. The chiropractor I was sent to, took one look and said it was 'frozen shoulder'. I have never even heard of such a thing before. It took nearly two years to get full movement on my right shoulder. Then the left froze :-(


> coolestuk

I'm just curious. Are these experiences in the UK?

I have seen and heard of some similar things, but my experience is only with the US healthcare system.


I’m a complete outsider to PostgreSQL specifics, but: if that is the reason, then this is a case for improving messaging.

If I’d been in the patch submitter’s shoes, I wouldn’t have thought twice about writing off that community.

If I had gotten your reply, instead: 100% fair play, thank you for your consideration, and send my love to the dev team.

Messaging matters.


I somehow responded with this to the wrong post earlier:

Honestly, if someone's spending time working on a high value open source project (which PG absolutely is), I'd rather they spend less time (than I do) crafting their internet comments to sound nice and more time contributing to society. And I hope people who actually use the product feel the same way, understand why every single use case can't be carefully considered every time it comes up, and don't take it personally.


But messaging influences public perception, which influences the inclination of future potential contributors to participate, which influences the quality of the software. As we see here.

People like to think they can escape politics. They can’t. Any group of >1 humans will involve politics.

Learning how to be respectful and polite is like learning how to touch type: it’s a small, painful price to pay once, for a lifetime of copious reward.


I disagree that this is required. You can see from Linus Torvalds' backtracking on decades of abrasive behavior that it was never an important part of Linux after all, so an abrasive experience for people trying to help other open source projects is probably going to be superfluous too. You can still reject ideas without disregarding them or the person.

"We don't support running Postgres on NFS" isn't the same thing as "fuck you Intel ACPI team; you're dumber than a chimpanzee". Equating disagreement and criticism with Linus-isms is why the relationship between users and developers is such a mess to begin with. Being a maintainer requires you to say "no" sometimes, but it doesn't require you to be a jerk.

No, it's worse.

Linus was trying to make things work, with profanity. Postgres couldn't be bothered.

Sure, performative profanity isn't everyone's cup of tea, but milquetoast passive-aggressive dismissals of people like OP who ARE TOTALLY RIGHT aren't actually nice.


> You can see from Linus Torvalds' backtracking on decades of abrasive behavior that it was never an important part of Linux after all

It's ten years too soon to conclude that Torvalds backing away from abrasive behaviour didn't kill Linux.


I’m not sure if that logic holds. Who’s to say Linux would not have been more or less successful if Linus had behaved differently? For all we know, Linux may have succeeded despite his behavior, rather than because of it.

That said, I feel that a strong and positive community around a project is always an asset. I’ve seen many more projects fail due to community interaction being bad than I have from it being good.


Are you sure you're disagreeing with the post you responded to? It sounds like you entirely agree with them, but you open as though you disagree...

The patch author is doing the work. Telling them “no” isn’t going to make them focus instead on the project leadership’s other priorities like it would in a corporate team.

It's not as simple as that.

The patch author does the "initial" work. But then it's up to the team to learn the patch, understand it and keep maintaining it.

Every line of code is baggage.

If there is no demand for something at the time, it makes sense for maintainers to reject that. It's up to them to maintain that patch from now on.


There is another scenario: I submitted a Pull Request to an OSS project, the authors discussed it and rejected it, and then implemented it in the exact same way as I did. That was hurtful.

I can't upvote your comment for some reason, but exactly this

>Telling them “no” isn’t going to make them focus instead on the project leadership’s other priorities like it would in a corporate team.

No one is asking them to. An open source project is not a corporation; it has no shareholders who require growth at all costs. So someone doesn't contribute to the project - if there are enough other contributors to keep it healthy, then who cares? No need to try and get every single potential contributor to contribute code to the project.


You can’t pay an opportunity cost on an opportunity you don’t have. Whatever the reason for rejecting the patch in this case, it is not a missed opportunity to work on bugs and tech debt as suggested by the parent.

Honestly, if someone's spending time working on a high value open source project (which PG absolutely is), I'd rather they spend less time (than I do) crafting their internet comments to sound nice and more time contributing to society. And I hope people who actually use the product feel the same way, understand why every single use case can't be carefully considered every time it comes up, and don't take it personally.

> crafting their internet comments to sound nice

It's not hard.


Yea, that's definitely something that happens. Partially due to some conservatism, partially due to having way more patches than review bandwidth, ...

> Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.

I think if it were easy, it'd immediately be fixed, but there's a fair bit of complexity in fixing it nicely. For partially historical and partially good reasons object names are allocated with NAMEDATALEN space, even for shorter names. So just increasing it would further waste memory... We're going to have to fix this properly one of these days.


Re NAMEDATALEN specifically - you at least acknowledge it eventually needs fixing. On the mailing list there were a lot of important Postgres people who didn't seem to agree.

Agreeing that there is something worthy of fixing is a first step. It should have happened with this NFS patch and imo some other stuff. The considerations for how and when should be dealt with separately.


> Agreeing that there is something worthy of fixing is a first step. It should have happened with this NFS patch and imo some other stuff. The considerations for how and when should be dealt with separately.

But there were like multiple people agreeing that it needs to be changed. Including the first two responses that the thread got. And there were legitimate questions around how interrupts need to be handled, about how errors ought to be signaled if writes partially succeed, and everything. Do you really expect us to integrate patches without thinking about that? And then the author vanished...


> Do you really expect us to integrate patches without thinking about that?

Whoa! You made huge leap there. At what point was it suggested that patches be recklessly applied?

That didn't happen. Your quote actually suggests a reasonable progression and at no point is there any suggestion, implied or otherwise, that changes be integrated without due consideration.

Not irrationally dismissing criticism != abandoning sound design and development.


Well, that's the thing - changing NAMEDATALEN is a seemingly small change, but it'll require much more work than just increasing the value. Increasing the value does not seem like a great option, because (a) how long before people start complaining about the new one and (b) it wastes even more memory. So I assume we'd switch to variable-length strings, which however affects memory management, changes a lot of other stuff from fixed-length to variable-length, etc. So testing / benchmarking needed and all of that.

Which is why people are not enthusiastic about changing it, when there are fairly simple workarounds (assuming keeping the names short is considered to be a workaround).


> (a) how long before people start complaining about the new one

Very likely many years, or even never. People don't use large names because they like it, they always prefer small ones.

How much memory are we talking about?


> Could name a few other examples too I’ve noticed over the years and just sighed at. I don’t expect every idea to be accepted or even welcomed - but there is that sense of bias against change.

not just bias against change. while there are some very talented and friendly people in the pg community, there are a few bad actors that are openly hostile and aggressive, that feel that they must be part of every discussion. it gets worse when it is a change that they made that is causing an issue, as ego takes a front seat.

unfortunately, these few bad actors make dealing with the pg mailing lists in general very unpleasant, and have made myself (a popular extension maintainer) and others try to keep interaction to an absolute minimum. that's not good for the community.


I'm honestly curious who are those bad actors, and examples of such behavior. I'm not trolling you, I really am curious - because that does not match my experience with the community at all.

I'm sure there were cases of inappropriate / rude communication, but AFAICS those were rare one-off incidents. While you're apparently talking about long-term and consistent behavior. So I wonder who you're talking about? I can't think of anyone who'd consistently behave like that - particularly not among senior community members.


I know that you read the mailing lists, just spending a week on bugs, or following any major or minor discussions on hackers should be enough for you to figure a couple of them out, but I'm not going to personally call anyone out by name on hacker news.

my comment, which is based on the multiple interactions I've had with the community, stands as is: some fantastic people, and a few aggressive bad actors that spoil things.


I don't think this is really limited to Postgres. In any relatively big open-source project, the core committers tend to do what they do. I submitted a patch to a pretty big open-source project and pinged the person assigned to the area. That person pretty much wrote his own version of my patch. I guess it got fixed, so that was good. I guess the good thing about open source is that if you want, you can just build your own version with whatever you want. However, smaller projects are more receptive since they want all the attention and help they can get.

I think the "big project" angle is that maintainability is of higher priority than for smaller project. So more minor things need to to be addressed than in smaller projects. And that then either leads to being very nitpicky in reviews (which costs both sides cycles), or polishing patches pretty liberally (which costs the submitter experience, but reduces the time required for both sides).

> Someone comes along with a patch or idea. Bunch of big Postgres people come knock it and it dies right there.

This is not unique to Postgres. I've seen this behavior on many development mailing lists (e.g. Mutt-dev).


Well, just proposing things and waiting for the team to implement them is easy.

The core team has also to prioritize, deal with already planned features, and shoot down tons of people with inane ideas as well (not just good ones).


Yes, the 63-byte identifier limit is a real pain, especially for constraint names, where a long name can convey a lot of valuable info. NOT IN FKs, i.e. the opposite of the current FKs' IS IN, would be useful for XOR constraints; I would guess they could be implemented quite easily using existing FK infrastructure.

Not the worst thing in the world. I have trouble reading code with very long identifiers. Conventionally, a line of code should not exceed 80 characters or so. That's pretty hard if identifiers are 60+ characters by themselves. If your project has standard abbreviations for things, "63 characters ought to be enough for anybody"

I had a project last year where DB names were autogenerated per tenant, with the tenant ID being a UUID, so you're left with 63-36=27 characters. Starts to feel narrow.

(Putting tenant_id as a column was not an option because for each tenant, a third-party software was started that wanted to have its own DB.)


> and today we have Citus.

After reading this, I have been wondering if the other requests/ideas are not startup ideas


I'm sure some of them are. And Citus is not the only startup pushing PostgreSQL in a way the community did not want to.

If you look at https://wiki.postgresql.org/wiki/PostgreSQL_derived_database..., it's a pretty long list - some of the products are successful, some are dead. And then there are products adding some extra sauce on top of PostgreSQL (e.g. TimescaleDB).


I kind of wish the pg community came and read this. I guess this is one of the reasons why MySQL was much more dominant in the early days and pg had relatively small usage.

IMHO there were other / more important reasons why MySQL was initially more successful.

The relevant thread: https://www.postgresql.org/message-id/46ED7555-9383-45D2-A5C...

There's literally one person saying that NFS is a crapshoot anyway, and several +1s for retries. And the former was accompanied by several questions about the concrete implementation.


One of the messages told me to go lobby NFS vendors.

Many others were as you note bikeshedding the commit.

Only one, if I recall, questioned why it was a controversial patch at all.

If you check the patch that was committed eventually, IIRC it’s identical to what I proposed.

As I said, I don’t look back particularly favourably on my interactions with the community.


I feel this is a little unfair to the community and I encourage others to read the thread before forming a negative opinion.

It seems like a fairly good example of an engineering discussion. Even if the patch is correct and conforming and solves a problem, that doesn't mean there is no use for further discussion, and the surrounding discussion does seem to have merit.

Things like:

* Are there closely-related problems still left unsolved (e.g. retrying reads)?

* Is something not configured according to best practices known by other community members?

* Expressing an opinion that the use case is dangerous, so that nobody (including other people reading the thread in the future) will take it as an endorsement for running postgres on NFS.

* Some legitimate-sounding questions about the specifics of the patch (around zero returns, what writeAll handles versus its callers, etc.).

* At least one person agreed with you that it's a reasonable thing to do.

I don't think HN is really the right place to assess the technical merits of a postgres patch, but the discussion itself seems well within reasonable bounds.


Which of the messages on the thread do you consider as bikeshedding? I personally don't see any - that's not to say I agree with everything said on the thread, but overall it seems like a reasonable discussion.

IMHO it's a bit strange to assume everyone will agree with your patch from the very beginning. People may lack proper understanding of the issue, or they may see it from a different angle, and so on. Convincing others that your solution is the right one is an important part of getting a patch done. I don't see anything wrong with that.


Looking at your submission, it was very good: a clean way to reproduce, reference to the specification that tells users of a syscall how to use it. I'm perplexed that you got so much pushback on this when it seemed pretty straight-forward. I'd guess that this was your first patch to postgres and the default reaction is to be defensive and decline the patch.

Thinking about the social structure for a minute, I honestly think you might have done better to leave out some of the content of your patch submission, let people push back on the more obvious stuff, and have ready answers. There's a phenomenon where curators feel they must push back, and they feel weird when there's nothing to criticize - you can get around this by giving them something to criticize, with a ready-at-hand response!

Sorry this happened.


Somewhat OT, but that is generally referred to as a duck[1], and it _definitely_ has its uses.

> The artist working on the queen animations for Battle Chess... did the animations for the queen the way that he felt would be best, with one addition: he gave the queen a pet duck. He animated this duck through all of the queen’s animations, had it flapping around the corners. He also took great care to make sure that it never overlapped the “actual” animation.

> As expected, he was asked to remove the duck, which he did, without altering the real Queen animation.

[1]: https://softwareengineering.stackexchange.com/a/122148


I had an American friend who lived in Germany for many years who had a similar approach to dealing with immigration officials. He learned quickly that no matter how thorough he was when applying to extend his stay, his packet would always be "missing" some form. So he started leaving a few forms out of his packet, that way what they asked for was something he already had.

It’s still wrong to suggest to every contributor to put “the duck” in their contribution, when the fact is that it’s the people who handle the contribution who are the problem.

I've seen references to something similar in home remodeling - required paint and trim are stored in a closet during the work, but that closet itself never gets repainted. When the homeowner says "hey, you missed this!" they get the option of a discount and doing it themselves or the contractor sending someone to use the materials already on site. People who complete it themselves have a higher feeling of accomplishment because they were involved rather than simply hiring it done.

Haha cool, I've been around a while but I've never heard of this. I wonder if this concept can be combined with confessional debugging, aka rubber-ducking? :)

This is good advice and something I’ve learned over the past years.

One of the interesting points I’ve reflected on over time has been how _my_ issue was solved with the patch, however for others it wasn’t.

When I think about this I of course recognise that Postgres owes me nothing, however we both had similarly aligned objectives (fix bugs, do good), but because of our inability to successfully communicate, we didn’t get to a productive outcome and so a fixable bug sat in the release for some time.

I do wonder how we go about improving that kind of circumstance.


It's hard. I'm pretty sure that in the general case, the problem isn't solvable.

In Morgan Llywelyn's Finn Mac Cool ( https://www.amazon.com/dp/0312877374/ ), Finn is a man with no family from an inferior tribe, and he's a troop leader in a band of scummy soldiers of no social standing whatever. On the other hand, his potential is obvious -- he will eventually work his way up to self-made king.

While stationed in the capital, he falls in love with a respectable, middle-class blacksmith's daughter. She won't give him the time of day because of the difference in social class, but over time, as his respectability steadily climbs, she warms up to him.

They have a failed sexual encounter, and everything falls apart. She feels too awkward to approach him anymore. He believes (incorrectly) that they've become married at a level inappropriate to her class, and eventually comes to her home to propose a much higher grade of marriage. But she doesn't know how to accept without -- in her own eyes -- suffering a loss of dignity. A painful scene follows in which it's obvious that he wants to marry her, she wants to marry him, her mother wants her to marry him, but somehow none of them can see how to actually get to that point.

People really don't go for innovation in social protocols, even when the protocols they know are failing them.


I know that as ‘Take out the duck’.

http://mamchenkov.net/wordpress/2012/07/23/a-duck/


> you can get around this by giving them something to criticize, with a ready-at-hand response!

So you suggest that the patches should be obviously “worse” in order to have better chance to be accepted?


He is (so much was plain). If those sub-optimal patches are accepted against expectations, a follow up patch with improvements can, in due time, be submitted.

This reply from Tom Lane reminded me of the PC losering issue from the “worse is better” essay [1].

> 2. What is the rationale for supposing that a retry a nanosecond later will help? If it will help, why didn't the kernel just do that?

[1] https://www.jwz.org/doc/worse-is-better.html


I have found the Postgres community extremely helpful and friendly. Perhaps this part of the interaction explains why they dropped the issue for a while?

https://www.postgresql.org/message-id/223BF4DE-B274-4428-832...


Do you have some examples where you disagreed with them or pointed out their mistakes? Obviously everyone is "extremely helpful and friendly" when you agree with them or don't point out their mistakes.

The "asshole reviewer" is definitely a thing.


I don't think I've encountered "asshole reviewer" in postgres community, but maybe my "pain threshold" is just higher than yours. More often than not it's a valid disagreement about technical matters.

Review in the PostgreSQL community does tend to be on the "harsh" side.

I've only submitted one patch (\pset linestyle unicode for psql), and with a few rounds of review and revision it made it in. Overall I found this process nitpicky but productive. However, there is a point at which the harshly critical review can be detrimental, and from the other comments mentioned here, it sounds like this has been the case in the past.

With regard to the semantics of write(2) and fwrite(3) these are clearly documented in SUS and other standards, and the "valid disagreement" in this case may have been very counter-productive if it killed off proper review and acceptance of a genuine bug. There are, of course, problems with fsync(2) which have had widespread discussion for years.


> Review in the PostgreSQL community does tend to be on the "harsh" side.

Can you share an example of a review that you consider harsh? (feel free to share privately, I'm simply interested in what you consider harsh)

I admit some of the reviews may be a bit more direct, particularly between senior hackers who know each other, and from time to time there are arguments. It's not just rainbows and unicorns all the time, but in general I find the discussion extremely civilized (particularly for submissions from new contributors). But maybe that's just survivor bias, and the experience is much worse for others ...

> I've only submitted one patch (\pset linestyle unicode for psql), and with a few rounds of review and revision it made it in. Overall I found this process nitpicky but productive. However, there is a point at which the harshly critical review can be detrimental, and from the other comments mentioned here, it sounds like this has been the case in the past.

Oh, 2009 - good old days ;-) Thanks for the patch, BTW.

There's a fine line between nitpicking and attention to detail. We do want to accept patches, but OTOH we don't want to make the code worse (not just by introducing bugs, even code style matters too). At some point the committer may decide the patch is in "good enough" shape and polish it before the commit, but most of that should happen during the review.


Harsh? All reviews I have got have been nice. The only annoying thing is how some issues can be bikeshedded to death.

I think your patch was too perfect. Always leave some small thing for the gatekeepers to latch onto, then be all "oh, yes, totally overlooked this, thank you, I'll fix it ASAP".

I think one of the Dilbert books had this concept. Make one bullet point ridiculous so your boss can have some input and ask you to remove the “Be involved in a land war in Asia” step in your ten point plan.

This is the first I'm hearing of the "take out the duck" concept. I've run software development teams for quite some time now and I have zero problem taking patches that are perfect without nitpicking them.

What the hell is wrong with people?


People have insecurities and egos. Sometimes you come across someone who needs to feel useful, or needs to feel important, or has a need to assert their power. Or who simply doesn't trust you, and assumes that if they don't see a problem, something is lurking under the surface.

With a lot of people you will never need this. But especially in larger projects where any number of people might latch on and review submissions, the chances for someone to find something to complain about goes up rapidly.

And when one person has found something to complain about, all kinds of other social dysfunction quickly becomes an issue too.


We’re taking barely-evolved apes whose brains are made for running down antelope and picking bugs out of each other’s hair, and putting them in charge of incredibly complex machines that have only existed for a short time. There are bound to be some problems.

It's something that emerges in big open source projects. Not really related to software development teams that all answer to the same boss.

It's by far not limited to only open source projects. There are many situations where some fault, no matter how small or irrelevant, has to be found in order to sate the ego of the person doing the review or consideration of a problem or piece of work.

I don't think it's limited to open source projects either. But open source projects are unusual in that often, even when commit rights etc. are controlled, being able to opine on submitted patches is open, and a way of building social status within the project. And while that often works well, it also does attract people who may even mean well, but will be overly critical because it's their way of feeling they're contributing.

In most corporate projects, odds are you're dealing with a much smaller pool of reviewers.


Did you actually read the thread and conclude it to be unreasonable gatekeeping, or are you assuming that it was?

Claiming that there is no problem is in some ways better than gatekeeping and in some ways worse. But you're right, I should not have called this gatekeeping, since the motivations are muddled.

That's an excellent patch. Partial writes (and reads) are a known behavior which is often overlooked. It rarely (if ever) occurs on local file systems, and even the network ones manifest it only from time to time, usually when some kind of congestion/segmentation is going on.

It's a pity a lot of developers out there do not have an ingrained awareness of this.


Everyone in that thread knows what a partial write is. That's why the patch author saw this error message:

    > 2011-07-31 22:13:35 EST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
    > 2011-07-31 22:13:35 EST ERROR: could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
    > 2011-07-31 22:13:35 EST HINT: Check free disk space.
    > 2011-07-31 22:13:35 EST CONTEXT: writing block 1 of relation global/2671
    > 2011-07-31 22:13:35 EST [unknown] [unknown] LOG: connection received: host=[local]

If they didn't know what a partial write was, they wouldn't have identified it specifically as an error case with its own informative error message.

The proposed patch retries rather than throwing an error.


By "short writes" you mean partial writes?

Yes, thank you.


In any large organizations, fiefdoms are bound to develop. If you don't have principles for dealing with it, what ends up happening is that new ideas will be crowded out and claimed by one of the leaders of the fiefdom. The Linux model has formalized this process. I think more large open source projects should use this model.

The same pattern emerges in big commercial projects too. Why do you think it took 20 years to fix console resizing in Windows?

Ballmer and his ilk wouldn't allow it.

This is why we have Facebook.

Try committing to a Microsoft open source project or working with a lot of ex-Microsoft employees. You’ll want to kill yourself.

It sounds like you have an interesting anecdote to share.

If you can just give us the anecdote rather than making a vague, sweeping critique of MS OSS projects, you might avoid the downvotes.


<off-topic>I committed to a Microsoft open source project and worked with (ex-)Microsoft people. The experience was great :)</off-topic>

I've had this experience when I've tried to contribute to ASP.NET Core and .NET Core - in the case of ASP.NET Core, they weren't in the slightest bit interested in fixing something that's obviously broken. In the case of .NET Core, it was made clear that adding something new to the encryption library was going to take a long time, possibly years, even though there was demand for it.

I've given up on both :/


> In short, PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean. So a read of the blocks that were just written will likely return something other than the data that was written.

> Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.

https://lwn.net/Articles/752063/

A real-life example can be found at https://stackoverflow.com/questions/42434872/writing-program...
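
To make the failure mode concrete: because a failed writeback can leave the pages marked clean, a second fsync() on the same descriptor can return success without the data ever having reached the disk. Here's a hedged sketch of the conservative interpretation (roughly the direction PostgreSQL later took by treating fsync failure as fatal); the helper name and the exit-on-error policy are illustrative, not anyone's actual code:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Illustrative only: once fsync() reports failure, the dirty pages may
     * already have been dropped and marked clean, so retrying fsync() and
     * getting success proves nothing. Treat the data as lost and recover
     * from a durable source (WAL, original input, ...) instead. */
    static void durable_write(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0) { perror("open"); exit(1); }

        size_t off = 0;                 /* loop over short writes, as discussed above */
        while (off < len) {
            ssize_t n = write(fd, buf + off, len - off);
            if (n < 0 && errno == EINTR) continue;
            if (n <= 0) { perror("write"); exit(1); }
            off += (size_t) n;
        }

        if (fsync(fd) != 0) {
            /* Do NOT retry-and-carry-on: the failed pages may be gone. */
            fprintf(stderr, "fsync failed: %s -- treating data as lost\n",
                    strerror(errno));
            exit(1);                    /* recover from the durable source on restart */
        }
        if (close(fd) != 0) { perror("close"); exit(1); }
    }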


The linked LWN article (from April 2018) is a great summary of the problem, its cause, and potential solutions:

Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error occurs; in short, the most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else. So those pages cannot be kept if the user wants the system to remain usable after such an event.


That justification is bogus, however. There's already separate logic for the case where the entire underlying device vanishes.

> entire underlying device vanishes

And that would then fail because of the hardware layer's bugs with reporting a device disconnect correctly. I mean, if the user follows the rules and pulls the stick out of a host port or a powered hub, sure, it's likely going to work per spec. But if it's on a daisy-chained 2003-era USB2 hub connected to a cheap USB3 hub? Yeah, good luck.


Is that really justification for incurring unsignalled data loss? If that's actually common enough, count the number of uncleared errors on a per-mount basis, and shut down the filesystem if the memory pressure gets too high while significant portions of memory are taken by dirty buffers that can't be cleaned due to IO errors.

Honestly it would be simpler just to make the "mark clean on write error" behavior a tunable flag rather than try to finesse this. Having the block layer not starve the system on bad hardware as a default behavior seems correct to me.

Also USB bus resets are not unheard of. Or moving devices from one port to another. If the device comes back within a minute or two you probably shouldn't throw out those writes.

If someone quickly pulls a USB drive, plugs it into another system, and then plugs it back into the original system, then flushing writes could cause massive data corruption if those writes are relative to an outdated idea of what's on the block device. Sounds like a misfeature to me.

> If someone quickly pulls a USB drive, plugs it into another system, and then plugs it back into the original system, then flushing writes could cause massive data corruption

That's user error, though. The kernel should react to removable media being pulled by sending a wall message to the appropriate session/console, stating something similar to "Please IMMEDIATELY place media [USB_LABEL] back into drive!!", with [Retry] and [Cancel] options. That way, the user knows what to expect -- OS's used to do this as a matter of course when removable media was in common use. In fact, you could even generalize this, by asking the user to introduce some specific media (identified by label) when some mountpoint is accessed, even if no operations were actually in progress.


The drive was in a corrupt state the first time it got unplugged. And it's nothing to shrug off; it might have been in the middle of rewriting a directory and lost all the contents.

So what are the odds that A) you get it back into a non-corrupt state B) the sectors affected by finishing the write will re-corrupt it C) you do this in one minute?


> Also USB bus resets are not unheard of.

They're initiated by the host, not by a USB device.

> Or moving devices from one port to another. If the device comes back within a minute or two you probably shouldn't throw out those writes.

This would be a nice feature. Although these writes would need to be buffered. Probably also throttled. There'd also be some risk with devices that have identical serial numbers. Some manufacturers give all of their USB disks / memory sticks the same serial number...


> They're initiated by the host, not by a USB device.

Or by a power flicker. Which can be caused by plugging in other devices too.

> Although these writes would need to be buffered. Probably also throttled.

You don't necessarily have to allow new writes, the more important part is preserving writes the application thinks already happened. But that could be useful too.

> Some manufacturers give all of their USB disks / memory sticks same serial number...

You have the partition serial number too, usually.


> There'd also be some risk with devices that have identical serial numbers. Some manufacturers give all of their USB disks / memory sticks the same serial number...

On most OSes the HW serial number of the disk is now usually supplemented in the disk management logic with the GPT “Disk GUID”, if available. Most modern disks (including removable ones like USB sticks) are GPT-formatted, since they rely on filesystems like ExFAT that assume GPT formatting. And those that aren’t are effectively already on a “legacy mode” code-path (because they’re using file systems like FAT, which also doesn’t support xattrs, or many types of filenames, or disk labels containing lower-case letters...) so users already expect an incomplete feature-set from them.

Plus: SD cards, the main MBR devices still in existence, don’t even get write-buffered by any sensible OS to begin with, precisely because you’re likely to unplug them at random. So, in practice, everything that needs write-buffering (and will ever be plugged into a computer running a modern OS) does indeed have a unique disk label at some level.


Does it always work? I had issues with encrypted filesystems that would stay mounted after the device itself disconnected and required a forced unmount before I could use them again.

Did I just get downvoted for giving an example where the OS doesn't seem to handle the disconnect of the physical device at all?

Write error is EIO, device disappearing is (or should be) ENXIO. Kernel should be able to tell the difference instead of just ignoring write failures.

Bigtable, and presumably Spanner, goes beyond that. Data chunks have checksums and they're immediately re-read and verified after compactions, because it turns out that errors do happen and corruption also happens, even when you use ECC — roughly once every 5PB compacted, per Jeff Dean's figures.
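
As a rough illustration of that verify-after-write idea (not Bigtable's actual code; the function name and file layout are made up, and note that a plain read-back normally comes from the page cache, so real systems pair this with O_DIRECT or a read on another node): checksum the chunk, write it, read it back, and compare, e.g. with zlib's crc32():

    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>   /* crc32(); link with -lz */

    /* Write a chunk, read it back, and verify its CRC-32 so corruption
     * introduced along the I/O path is caught immediately rather than
     * on some much later read. Returns 0 on success, -1 otherwise. */
    static int write_and_verify(const char *path, const unsigned char *chunk, size_t len)
    {
        uLong want = crc32(0L, chunk, (uInt) len);

        FILE *f = fopen(path, "wb");
        if (!f || fwrite(chunk, 1, len, f) != len || fclose(f) != 0)
            return -1;

        unsigned char *back = malloc(len);
        f = fopen(path, "rb");
        if (!back || !f || fread(back, 1, len, f) != len) {
            free(back);
            if (f) fclose(f);
            return -1;
        }
        fclose(f);

        uLong got = crc32(0L, back, (uInt) len);
        free(back);
        return (got == want) ? 0 : -1;   /* mismatch: the chunk is corrupt */
    }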

Sounds like Google is continuing to act poorly when it comes to upstreaming their code.

I like to beat on them just like everyone else, but what is the expectation? That they upstream everything they write if it runs at an OS level?

Google should be working with the kernel team to get their code mainlined, rather than keep their patches out of tree for years on end.

Perhaps they tried. Do you have evidence otherwise?

Doesn't this affect all databases? Or is it a different issue?

https://www.sqlite.org/atomiccommit.html

SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.

Also this seems related:

https://danluu.com/file-consistency/

That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.
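
For what it's worth, a minimal sketch of that F_FULLFSYNC point (macOS-specific; the fall-back-to-fsync policy is my assumption, since F_FULLFSYNC can fail on filesystems that don't support it, and fsync alone is a weaker guarantee):

    #include <fcntl.h>
    #include <unistd.h>

    /* On macOS, fsync() may only push data to the drive's cache;
     * fcntl(F_FULLFSYNC) asks the drive itself to flush. */
    static int flush_to_disk(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) == 0)
            return 0;
        /* Not supported on this filesystem: fall back (weaker guarantee). */
    #endif
        return fsync(fd);
    }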

The linked OSDI '14 paper looks good:

We find that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which we term persistence properties. We develop a tool named BOB that empirically tests persistence properties, and use it to demonstrate that these properties vary widely among six popular Linux file systems.


Yeah, IMO it's a Linux fsync bug. fsync() should not succeed if writes failed. fsync() should not clean dirty pages if writes failed. Both of these behaviors directly contravene the goals of user applications invoking fsync() as well as any reasonable API contract for safely persisting data.

Arguably, POSIX needs a more explicit fsync interface. I.e., sync these ranges; all dirty data, or just dirty data since last checkpoint; how should write errors be handled; etc. That doesn't excuse that Linux's fsync is totally broken and designed to eat data in the face of hardware errors.

That Dan Luu blog post you linked is fantastic and one I really enjoyed.


I worked on PostgreSQL's investigation and response to this stuff and I agree with you FWIW. But apparently only FreeBSD (any filesystem) and ZFS (any OS) agree with us. Other systems I looked at throw away buffers and/or mark them clean, and this goes all the way back to ancient AT&T code. Though no one has commented on any closed kernel's behaviour.

I doubt anyone worried much when disks just died completely in the good old days. This topic is suddenly more interesting now as virtualisation and network storage create more kinds of transient failures, I guess.


I'm a FreeBSD dev, that probably colors my opinion :-).

Yet another reason to use ZFS I guess.

There have been bugs involved on both sides.

On the kernel side the error reporting was not working reliably in some cases (see [1] for details), so the application using fsync may not actually get the error at all. Hard to handle an error correctly when you don't even get notified about it.

[1] https://www.youtube.com/watch?v=74c19hwY2oE

On the PostgreSQL side, it was the incorrect assumption that fsync retries past writes. It's an understandable mistake, because without the retry it's damn difficult to write an application using fsync correctly. And of course we've found a bunch of other bugs in the ancient error-handling code (which just confirms the common wisdom that error-handling is the least tested part of any code base).


Right; I've been following the headlines on this saga for years :-).

> On the PostgreSQL side, it was the incorrect assumption that fsync retries past writes.

That's the part I'm claiming is a Linux bug. Marking failed dirty writes as clean is self-induced data loss.

> It's an understandable mistake, because without the retry it's damn difficult to write an application using fsync correctly.

This is part of why it's a bug. Making it even more difficult for user applications to correctly reason about data integrity is not a great design choice (IMO).

> And of course we've found a bunch of other bugs in the ancient error-handling code (which just confirms the common wisdom that error-handling is the least tested part of any code base).

+100


I'm not sure.

It's easy to blame other layers for not behaving the way you want/expect, but I admit there are valid reasons why it behaves the way it does and not the way you imagine. The PostgreSQL fsync thread started with exactly "kernel is broken" but that opinion changed over time, I think. I still think it's damn difficult (border-line impossible) to use fsync correctly in anything but trivial applications, but well ...

Amusingly, Linux has sync_file_range() which supposedly does one of the things you describe (syncing file range), but if you look at the man page it says "Warning: ... explanation why it's unsafe in many cases ...".
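
For the curious, the call in question looks roughly like this (Linux-only; the flags are from the man page). It initiates and waits for writeback of one byte range, but as the man page warns it gives no durability guarantee, since neither metadata nor the device write cache is flushed - which is exactly the caveat being discussed:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/types.h>

    /* Initiate and wait for writeback of one byte range. No durability
     * guarantee: metadata and the device write cache are untouched. */
    static int sync_range(int fd, off_t offset, off_t nbytes)
    {
        return sync_file_range(fd, offset, nbytes,
                               SYNC_FILE_RANGE_WAIT_BEFORE |
                               SYNC_FILE_RANGE_WRITE |
                               SYNC_FILE_RANGE_WAIT_AFTER);
    }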


> It's easy to blame other layers for not behaving the way you want/expect,

Sure; to be clear, I work on both sides of this layer boundary (kernel side as well as on userspace applications trying to ensure data is persisted) on a FreeBSD-based appliance at $DAYJOB, but mostly on the kernel side. I'm saying — from the perspective of someone who works on kernel and filesystem internals and page/buffer cache — cleaning dirty data without a successful write to media is intentional data loss. Not propagating IO errors to userspace makes it more difficult for userspace to reason about data loss.

> but I admit there are valid reasons why it behaves the way it does and not the way you imagine.

How do you imagine I imagine the Linux kernel's behavior here?

> The PostgreSQL fsync thread started with exactly "kernel is broken" but that opinion changed over time, I think. I still think it's damn difficult (border-line impossible) to use fsync correctly in anything but trivial applications, but well ...

The second sentence is a good argument for the Linux kernel's behavior being broken.


I think the only real defense of the linux et al behaviour here is that:

a) there's no standardized way to recover from write errors when not removing dirty data. I personally find that more than an acceptable tradeoff, but obviously not everybody considers it that way.

b) If IO errors, especially when triggered by writeback, cause kernel resources to be retained (both memory for dirty data, but even just a per-file data to stash a 'failed' bit), it's possible to get the kernel stuck in a way it needs memory without a good way to recover from that. I personally think that can be handled in smarter ways by escalating per-file flags to per-fs flags if there's too many failures, and remounting ro, but not everybody sees it that way...


That sounds reasonable to me.

> b) If IO errors, especially when triggered by writeback, cause kernel resources to be retained (both memory for dirty data, but even just a per-file data to stash a 'failed' bit), it's possible to get the kernel stuck in a way it needs memory without a good way to recover from that.

To put it more succinctly: userspace is not allowed to leak kernel resources. (The page cache, clean and dirty, is only allowed to persist outside the lifetime of user programs because the kernel is free to reclaim it at will — writing out dirty pages to clean them.)

> I personally think that can be handled in smarter ways by escalating per-file flags to per-fs flags if there's too many failures, and remounting ro, but not everybody sees it that way...

Yeah, I think that's the only real sane option. If writes start erroring and continue to error on a rewrite attempt, the disk is failing or gone. The kernel has different error numbers for these (EIO or ENXIO). If failing, the filesystem must either re-layout the file and mark the block as bad (if it supports such a concept), or fail the whole device with a per-fs flag. A failed filesystem should probably be RO and fail any write or fsync operation with EROFS or EIO.

If the device has been failed, it's ok to clean the lost write and discard it, releasing kernel resources, because we know all future writes and fsyncs to that file / block will also report failure.

This model isn't super difficult to understand or implement and it's easier for userspace applications to understand. The only "con" argument might be that users are prevented from writing to drives that have "only a few" bad blocks (if the FS isn't bad block aware). I don't find that argument persuasive — once a drive develops some bad sectors, they tend to develop more rapidly. And allowing further buffered writes can really exacerbate the amount of data lost, for example, when an SSD reaches its lifetime write capacity.


Isn't one of the difficulties the inability to decide when the I/O error is transient (e.g. running out of space with thin provisioning) vs. permanent (e.g. drive eaten by a velociraptor)?

Also, isn't it possible to use multipath to queue writes in case of error? I wonder if that's safe, though, because it will keep it in memory only and make it look OK to the caller.


Up until very recently (2-3 years ago) there was very little discussion of how to do consistent disk I/O properly, so it doesn't surprise me at all that many applications don't actually work.

Combine this with I/O error handling often being broken (in kernels, file systems, applications) and applications that are supposed to implement transaction-semantics can easily turn into "he's dead jim" at the first issue.


Tomas Vondra goes over this in the talk. There's a rationale for that behavior: pulling a USB stick out of the USB socket may have been what triggered the fsync() failure. In that case, there's no way the kernel will be able to retry reliably.

Silent error ignoring is never a great API. Especially for a data integrity operation.

fsync in case of the USB disappearing should simply return an error and drop the dirty pages.


> Doesn't this affect all databases? Or is it a different issue?

Possibly, https://wiki.postgresql.org/wiki/Fsync_Errors notes both MySQL and MongoDB had to be changed.

Note that the issue here is different from either of the bits you quote. The problem here is that if you fsync(2), it fails, and you fsync(2) again, on many systems the second call will always succeed, because the first one has invalidated/cleared all extant buffers and thus there's nothing for the second one to sync. Which is a success.

AKA, because of systems' shortcuts, an fsync success effectively means "all writes since the last fsync have succeeded", not "all writes since the last fsync success have succeeded". Writes between a success and a failure may be irrecoverably lost.
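
A hypothetical sequence showing what that means in practice (error value just for illustration):

    write(fd, buf, len);   /* buffered; returns success                  */
    fsync(fd);             /* writeback of buf failed -> returns -1, EIO */
    fsync(fd);             /* nothing is marked dirty any more -> 0,     */
                           /* even though buf never reached the disk     */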


> Doesn't this affect all databases?

Yes, most. And several made similar changes (crash-restart -> recovery) to handle it too. It's possible to avoid the issue by using direct IO, but often that's not the default mode of $database.


> Doesn't this affect all databases?

No, this is an artifact of storage engine design. Direct I/O is the norm for high-performance storage engines -- they don't use kernel cache or buffered I/O at all -- and many will bypass the file system given the opportunity (i.e. operate on raw block devices directly) which eliminates other classes of undesirable kernel behavior. Ironically, working with raw block devices requires much less code.

Fewer layers of abstraction between your database and the storage hardware make it easier to ensure correct behavior and high performance. Most open source storage engines leave those layers in because it reduces the level of expertise and sophistication required of the code designers -- it allows a broader pool of people to contribute -- though as this case shows it doesn't necessarily make it any easier to ensure correctness.


> Most open source storage engines leave those layers in because it reduces the level of expertise and sophistication required of the code designers -- it allows a broader pool of people to contribute -- though as this case shows it doesn't necessarily make it any easier to ensure correctness.

Another reason is that it's easier to deploy — you can just use some variable space on your filesystem rather than shaving off a new partition.


In practice, implementations can be deployed either way. The storage hardware is always dedicated to the database for large-scale or performance-sensitive databases anyway, making it less of an inconvenience. For some common environments, raw block devices substantially increase throughput versus the filesystem, so there are real advantages to doing so when it makes sense.

Yea, but the largest portion of databases these days is not deployed on dedicated hardware that's been carefully capacity planned, and has a dedicated DBA.

It's not like it shouldn't be possible to make a durable application with buffered IO though. The OS designers just haven't given it much focus.

Notably, Go had failed to use F_FULLFSYNC on MacOS until I reported it. Fix landing in 1.12, but won't be back-ported.

https://github.com/golang/go/issues/26650
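
For reference, a hedged C sketch of the macOS-specific dance (the Go fix does something equivalent internally; full_fsync is a made-up helper name):

    #include <fcntl.h>
    #include <unistd.h>

    /* On macOS, fsync() only pushes data to the drive; F_FULLFSYNC also
       asks the drive to flush its own cache. Fall back to plain fsync()
       where F_FULLFSYNC isn't available or fails. */
    static int full_fsync(int fd)
    {
    #ifdef F_FULLFSYNC
        if (fcntl(fd, F_FULLFSYNC) == 0)
            return 0;
    #endif
        return fsync(fd);
    }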


I think it is a different matter.

MacOS documented its abnormal fsync behavior; Golang just didn't follow what was clearly described in those documents. Linux didn't document what happens on fsync failure, so there is really nothing for applications to follow.

Also note that the strange MacOS fsync() behavior is easy to notice: by default, fsync latency on the most recent MBP is close to what's observed on Intel Optane, even though we all know the MBP comes with a much cheaper/slower SSD than Intel Optane. The same can't be said for the Linux fsync issue here.


According to https://wiki.postgresql.org/wiki/Fsync_Errors MacOS also has the Linux issue anyway.

> Doesn't this affect all databases?

Didn't affect LMDB. If an fsync fails the entire transaction is aborted/discarded. Retrying was always inherently OS-dependent and unreliable, better to just toss it all and start over. Any dev who actually RTFM'd and followed POSIX specs would have been fine.

LMDB's crash reliability is flawless.


> Any dev who actually RTFM'd and followed POSIX specs would have been fine.

So I'm trying to do that [1] and it seems to me the documentation directly implies that a second successful call to fsync() necessarily implies that all data was transferred, even if the previous call to fsync() had failed.

I say this because the sentence says "all data for the open file descriptor" is to be transferred, not merely "all data written since the previous fsync to this file descriptor". It follows that any data not transferred in the previous call due to an error ("outstanding I/O operations are not guaranteed to have been completed") must now be transferred if this call is successful.

What am I missing?

[1] https://pubs.opengroup.org/onlinepubs/009695399/functions/fs...


Technically, there is nothing besides "all data written since the previous fsync" since the previous fsync already wrote all the outstanding data at that point in time. (Even if it actually failed.) I.e., there is never any to-be-written data remaining after fsync returns. Everything that was queued to be flushed was flushed and dequeued. Whether any particular write failed or not doesn't change that fact.

> the previous fsync already wrote all the outstanding data at that point in time. (Even if it actually failed.)

I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.

Also, the queue business is an implementation detail. Our debate is over what the spec mandates, not how a particular implementation behaves.


> I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.

Consider a multithreaded application where a and b point to the same file. Note this isn't exactly what Postgres does:

    T1                   T2
    -------------------- --------------------
    write(a,x) start
    write(a,x) success

    fsync(a) start
                         read(b,x) start
                         read(b,x) success
    fsync(a) failed
                         write(b,y) start
                         write(b,y) success

                         fsync(b) start
                         fsync(b) success

y has a data dependency on x (for example: table metadata), and the fsync() for y succeeded, so is T2 given to expect that x was recorded on disk correctly?

  The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to
  the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined.
  The fsync() function shall not return until the system has completed that action or until an error is detected.
All of the data was transferred to the device. The device may have failed to persist some or all of what was transferred, but it all got transferred.

There's no mention of the OS retrying, or leaving the system in a state that a subsequent fsync can retry from where it left off. So you can't assume anything along those lines.


But the spec doesn't imply in any way that you'll need to write() again if fsync() fails. And dropping dirty flags seems way out of spec because now you can read data from cache that is not on disk and never will be even if future fsync() succeeds.

So I don't buy the spec argument.

You have me curious now... I don't know anything about LMDB but I wonder if its msync()-based design really is immune... Could there be a way for an IO error to leave you with a page bogusly marked clean, which differs from the data on disk?


The spec leaves the system condition undefined after an fsync failure. The safe thing to do is assume everything failed and nothing was written. That's what LMDB does. Expecting anything else would be relying on implementation-specific knowledge, which is always a bad idea.

> I don't know anything about LMDB but I wonder if its msync()-based design really is immune.

By default, LMDB doesn't use msync, so this isn't really a realistic scenario.

If there is an I/O error that the OS does not report, then sure, it's possible for LMDB to have a corrupted view of the world. But that would be an OS bug in the first place.

Since we're talking about buffered writes - clearly it's possible for a write() to return success before its data actually gets flushed to disk. And it's possible for a background flush to occur independently of the app calling fsync(). The docs specifically state, if an I/O error occurs during writeback, it will be reported on all file descriptors open on that file. So assuming the OS doesn't have a critical bug here, no, there's no way for an I/O error to leave LMDB with an invalid view of the world.


> The spec leaves the system condition undefined [...]. The safe thing to do is assume everything failed

This is key.

Often programmers do 'assumption based programming'.

"Surely the function will do X, it's the only reasonable thing to do, right?". As much it is human, this is bad practice and leads to unreliable systems.

If the spec doesn't say it, don't assume anything about it, and keep asking. To show that this approach is feasible for anyone, here is an example:

Recently I needed to write an fsync-safe application. The question of whether close()+re-open()+fsync() is safe came up. I found it had been asked on StackOverflow (https://stackoverflow.com/questions/37288453/calling-fsync2-...) but received no answers for a year. I put a 100-reputation bounty on it and quickly got a competent reply quoting the spec and summarising:

> It is not 100% clear whether the 'all currently queued I/O operations associated with the file indicated by [the] file descriptor' applies across processes. Conceptually, I think it should, but the wording isn't there in black and white.

With the spec being unclear, I took the question to the kernel developers (https://marc.info/?m=152535098505449), and was immediately pointed at the Postgres fsyncgate results.

So by spending a few hours on not believing what I wished was true, I managed to avoid writing an unsafe system.

Always ask those in the know (specs and people). Never assume.


> Any dev who actually RTFM'd and followed POSIX specs would have been fine.

Yeesh. The POSIX manual on fsync is intentionally vague to hide cross-platform differences. There are basically no guarantees once an error happens. I guess that's one interpretation of RTFMing, but... clearly it doesn't match user expectations.


Mr. Chu, I hope you never lose your tenacity with respect to writing blurbs on LMDB's performance and reliability. I have enjoyed the articles comparing LMDB with other databases' performance and hope you continue to point the spotlight on the superior design decisions of LMDB.

Thanks, glad you're getting something out of those writeups. Hopefully it helps other developers learn a better path.

What if another process calls fsync? It will get the error. Then when LMDB calls fsync no error will be reported. And thus the transaction will not be retried. Is this scenario dealt with?

Newer versions of Linux (but plenty of other OSs don't) guarantee that each write failure will be signalled once to each file descriptor that was open before the error occurred. So that ought to be handled, unless the FD is closed in between.

LMDB is single-writer. Only one process can acquire the writelock at a time, and only the writer can call fsync() so this scenario doesn't arise.

If you open the datafile with write access from another process without using the LMDB API, you're intentionally shooting yourself in the foot and all bets are off.


I'm pretty sure it does affect any database/application relying on buffered I/O. Even if you use the fsync() interface correctly, you're still affected by the bugs in error reporting.

Technically there is no bug in error reporting. fsync() reported an error as it should. The application continued processing, instead of stopping. fsync() didn't report the same error a second time, which leads to the app having problems.

The application should have stopped the first time fsync() reported an error. LMDB does this, instead of futile retry attempts that Postgres et al do. Fundamentally, a library should not automatically retry anything - it should surface all errors back to the caller and let the caller decide what to do next. That's just good design.


No. Kernels before 4.13 may or may not report the fsync error correctly, depending on various conditions.

There are more details in the talk [1] I posted earlier, and in the LWN articles related to this issue.

[1] https://www.youtube.com/watch?v=74c19hwY2oE


Ah I see. https://lwn.net/Articles/718734/

Someone issuing a sync() could cause an error to be cleared before the app's fsync() happens. That's a drag.


in the reported case, fsync does not persist all data to disk, but it reports success. How does LMDB deal with that situation?

In the reported case, fsync reported an error, then (after more data may or may not have been written) fsync was tried again and reported success, which masks the fact that data from the previous fsync didn't get fully written.

As I already wrote - in LMDB, after fsync fails, the transaction is aborted, so none of the partially written data matters.


Hold on, they are saying that if sync fails (for example if someone types "sync" at the console), then the database calls fsync() it will not fail even though the data is gone. I don't see how any database the uses the buffer cache could guard against this case.

The kernel should never do this. If sync fails, all future syncs should also fail. This could be relaxed in various ways: sync fails, but we record which files have missing data, so any fsync for just those files also fails.

(Otherwise I agree with LMDB- there should be no retry business on fsync fails).


You're right, that was another error case in the article that I missed the first time.

In LMDB's case you'd need to be pretty unlucky for that bug to bite; all the writes are done at once during txn_commit so you have to issue sync() during the couple milliseconds that might take, before the fsync() at the end of the writes. Also, it has to be a single-sector failure, otherwise a write after the sync will still fail and be caught.

E.g.

   LMDB       other
  write
  write
  write       sync
  write
  fsync
If the device is totally hosed, all of those writes' destination pages would error out. In that case, the failure can only be missed if you're unlucky enough for the sync to start immediately after the last write and before the fsync. The window is on the order of nanoseconds.

If only a single page is bad, and the majority of the writes are ok, then you still have to be unlucky enough for the sync to run after the bad write; if it runs before the bad write then fsync will still see the error.


We've been using LMDB at Cloudflare to store small-ish configuration data, it has been rock solid.

Thank you and the rest of the contributors for such a great library.


LMDB's claimed speed and reliability seem remarkable (from a quick glance). I would guess it is easier to achieve that for a KV store than for a much more complex relational database. Got me thinking though. Maybe Postgres could take advantage of LMDB? Maybe by using LMDB as its cache instead of the OS page cache, or maybe by writing the WAL to LMDB?

LMDB itself only uses the OS page cache. The way for LMDB to improve an RDBMS is for it to replace the existing row and index store, and eliminate any WAL. This is what SQLightning does with SQLite.

Have looked at replacing InnoDB in MySQL, but that code is a lot harder to read, so it's been slow going. Postgres doesn't have a modular storage interface, so it would be even uglier to overhaul.


Thanks, makes sense. I think Postgres is planning to have a pluggable storage interface in the next version (12); would that help? Also, nobody has mentioned the data checksums added in 9.3; do you know if they help avoid this kind of fsync-related corruption?

> Also, nobody has mentioned the data checksums added in 9.3; do you know if they help avoid this kind of fsync-related corruption?

Not really, I think. Page-level checksums don't protect against entire writes going missing, unfortunately.


LMDB is not a great fit for something like the WAL, where new data is written at the end, and old data discarded at the start. It leads to fragmentation (especially if WAL entries are larger than a single page).

Maybe with https://github.com/kellabyte/rewind, which should support a WAL for LMDB.

"have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk"

I remember highlighting this problem to the Firebird db developers maybe 13 years ago. AFAIR they were open to the problem I'd pointed out and went about fixing it on other platforms so that the db behaved everywhere as it did on OS X. I'm probably in the bottom 5% of IT professionals on this site. I'm rather amazed to find out that so late in the day Postgres has come round to fixing this.

I haven't used Firebird in years and can't find a link to the discussion (could have been via email).


Postgres has offered F_FULLFSYNC on OSX since 2005?

> Both Chinner and Ts'o, along with others, said that the proper solution is for PostgreSQL to move to direct I/O (DIO) instead.

Wait, is "direct I/O" the same as O_DIRECT?

The same O_DIRECT that Linus skewered in 2007?

> There really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary.

https://lkml.org/lkml/2007/1/10/233

> Side note: the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way.

https://lkml.org/lkml/2007/1/10/235

More background from 2002-2007: https://yarchive.net/comp/linux/o_direct.html


Turns out sometimes people other than Linus have more experience with IO than Linus.

I think there's pretty good reasons to go for DIO for a database. But only when there's a good sysadmin/DBA and when the system is dedicated to the database. There's considerable performance gains in going for DIO (at the cost of significant software complexity), but it's much more sensitive to bad tuning and isn't at all adaptive to overall system demands.


Yes, more than anything I'm amused by the fact that when you do the Linus-approved thing in 2007 it leaves you in this terrible situation in 2019, and when the other kernel experts rub their heads together their solution is to abandon the gentle advice from 10 years earlier.

Yeah, which is one of the reasons PostgreSQL went with buffered I/O, not to have to deal with this complexity. And it served us pretty well over time, I think.

I don't think that's really true. It worked well enough, true, but I think it allowed us to not fix deficiencies in a number of areas that we should just have fixed. IOW, I think we survived despite not offering DIO (because other things are good), rather than because of it.

I don't think we disagree, actually.

Yes - from a purely technical point of view, DIO is superior in various ways. It allows tuning to specific I/O patterns, etc.

But it's also quite laborious to get right - not only does it require a fair amount of new code, but AFAIK there is significant variability between platforms and storage devices. I'm not sure the project had enough developer bandwidth back then, or differently - it was more efficient to spend the developer time on other stuff, with better cost/benefit ratio.


I have to say he’s wrong when he says there’s no reason for EVER using O_DIRECT.

One HUGE reason is performance for large sequential writes when using fast media and slow processors. Specifically, when the write speed of the media is in the same ballpark as the effective memcpy() speed of the processor itself, which, believe it or not, is very possible today (but was probably more unlikely in 2007) when working with embedded systems and modern NVMe media.

Consider a system with an SoC like the NXP T2080 that has a high speed internal memory bus, DDR3, and PCIe 3.0 support. The processor cores themselves are relatively slow PowerPC cores, but the chip has a ton of memory bandwidth.

Now assume you had a contiguous 512 MiB chunk of data to write to an NVMe drive capable of a sustained write rate of 2000 MB/s.

The processor core itself can barely move 2000 MB/s of data, so it's clear why direct IO would perform better, since you're telling the drive to pull the data directly from the buffer instead of memcpy-ing it into an intermediate kernel buffer first. With direct IO, you can perform zero-copy writes to the media.

This is why I'm able to achieve higher sequential write performance on some modern NVMe drives than most benchmark sites report, all while using a 1.2 GHz PowerPC system.
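
A rough sketch of that zero-copy path (path and sizes are made up; O_DIRECT requires the buffer, file offset and length to be suitably aligned, typically to the logical block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 512UL << 20;             /* 512 MiB payload */
        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0)   /* aligned buffer */
            return 1;
        memset(buf, 0xab, len);                     /* pretend payload */

        int fd = open("/data/chunk.bin",
                      O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0)
            return 1;
        /* The drive DMAs straight out of buf; no copy into the page cache. */
        if (write(fd, buf, len) != (ssize_t) len)
            return 1;
        fsync(fd);      /* still needed for metadata and the device cache */
        close(fd);
        return 0;
    }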


> Wait, is "direct I/O" the same as O_DIRECT?

No, there are other ways of doing DIO besides that particular interface.


Care to elaborate? And won't those different interfaces use the same direct I/O implementation, much like e.g. raw devices? I take the above question as rhetorical and would think the discussion above, as well as Linus' rant, applies to those too.

Historic note (like from the 80's) - any time a machine was rebooted we'd type sync; sync; sync; reboot - the explanation was that the only guarantee was that the second sync wouldn't start until the first sync successfully completed, plus one for good luck...

https://utcc.utoronto.ca/~cks/space/blog/unix/TheLegendOfSyn...

  people were told 'do several sync commands, typing each by hand'

  But this mutated to just 'sync three times', so of course people started writing 'sync; sync; sync'

I still type sync (just once) before I reboot. Just a habit.

The last thing I do before I leave my work desktop for the day is

    ./eod.sh && sync && sync
20+ year old habits die hard.

The only machine I ever developed on that needed a sync before reboot was an ancient SPARC workstation.

I forgot to sync almost every time, but it would always boot after a fsck.


I always get some kind of error message from dmraid about having been unable to stop the RAID array on shutdown/reboot. I thus manually do a sync(1) in hopes that the data survives. Hasn't failed me thus far, at least.

I have a flash drive that I sometimes put a video on to watch it on a small TV in the basement and I've noticed that Linux doesn't copy the file right away. The 'cp' does finish quickly but the data is not on the flash drive yet. You either have to eject and wait or sync and wait for it to actually transfer.

Needless to say, this tripped me up few times and videos weren't fully transferred.


That behaviour is defined by a mount option.

What's the proper way to mount flash devices?

Try the sync option. From the man page:

> All I/O to the filesystem should be done synchronously. In the case of media with a limited number of write cycles (e.g. some flash drives), sync may cause life-cycle shortening.
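
For example (device and mount point are placeholders): mounting the stick with something like mount -o sync /dev/sdb1 /mnt/usb makes cp block until the data is actually on the device, at the cost of slower copies and more flash wear.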


Thanks so much!

I also noticed that on Gnome. I never investigated the implementation details of that progress bar, but my gut feeling is that the file is read from (or written to) the buffer cache quickly and the progress bar goes near to 100%, then stays there until the last writes succeed and actually reach the USB stick. Then the eject button sometimes needs extra time to finish the sync and tells me to wait a little. I always remove the stick when it tells me it's safe to do it.

Your intuition is correct. The writes are quickly buffered to RAM and then fsync or close writes them out to the slow media. The (naive) progress bar probably only tracks progress buffering the writes — it's the simplest way to track progress, if inaccurate.

Yes. You can also run "watch cat /proc/meminfo" in console and watch "Dirty" and "Writeback" fields to see dynamics (and how long you have to wait).

"sync;sync;halt" - followed by turning off the power so we could open up the cabinet to do maintenance...

I think it's worthwhile to note that this, even before both kernel and postgres changes, really only is an issue if there are very serious storage issues, commonly causing more corruption than just forgetting the data due to be fsynced. The kernel has its own timeout / retry logic and if those retries succeed, there's no issue. Additionally, most filesystems remount read-only if there's any accompanying journaling errors, and in a lot of cases PG IO will also have a metadata effect.

I/O errors are an ongoing issue for AWS M5 and C5 instance types with their cloudy NVMe devices; they have a tendency to randomly disappear or have extended timeouts.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1788035

unfortunately the upstream patches being backported don't provide any real write guarantee


I don't really understand why people went with this headline and not "PostgreSQL developers discover most programs have used fsync incorrectly for decades, including PostgreSQL".

Because most normal programs do open-write-sync-close, and that mostly works as expected.

Postgres does open-write-close and then later, in another unrelated process, open-fsync-close. They discovered this doesn't always work, because if somebody somewhere does an fsync on that file for any reason, their process might miss the error, as it doesn't stick.
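
A hedged C sketch of the two patterns being contrasted (simplified, error checks omitted; the second pair of helpers stands in for Postgres's separate checkpointer process):

    #include <fcntl.h>
    #include <unistd.h>

    /* Pattern 1: the "normal" program. Any writeback error for buf is
       reported by this fsync(), on this same file descriptor. */
    static void write_sync_close(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY);
        write(fd, buf, len);
        fsync(fd);
        close(fd);
    }

    /* Pattern 2: write now, fsync much later from a different fd (in
       Postgres, a different process). If writeback fails in between, or
       someone else's fsync()/sync() consumes the error first, the later
       fsync() can return success and the error is lost. */
    static void write_close(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY);
        write(fd, buf, len);
        close(fd);
    }

    static void open_fsync_close(const char *path)
    {
        int fd = open(path, O_WRONLY);
        fsync(fd);
        close(fd);
    }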


I agree, the initial report on pgsql-hackers had a subject that blamed PostgreSQL too strongly and I think that carried through to articles and blogs and Twitter etc.

You are right, this affects other things too. And indeed several projects took action to remove any notion of fsync retry, referencing the LWN article etc.

POSIX is probably underspecified. An Austin Group defect report would be interesting...


It's too long by 26 characters?

Am I misunderstanding this or does this mean Linux literally does not provide any way to ensure writes are flushed to a file in the presence of transient errors?

Does anyone know if Windows's FlushFileBuffers is susceptible to this as well? (P.S., interesting bit about FILE_FLAG_WRITE_THROUGH not working as you might expect: https://blogs.msdn.microsoft.com/oldnewthing/20170510-00/?p=...)


For the record and not mentioned in Tomas's talk: the PostgreSQL release due out next week will add a PANIC after any fsync failure (in other words, no longer retry). The same thing has been done by MySQL and MongoDB and probably everyone else doing (optional) buffered IO for database-ish stuff who followed this on LWN etc.

https://wiki.postgresql.org/wiki/Fsync_Errors

Only FreeBSD & Illumos do the sane thing.


The wiki uses the word "presumably". I take that as "the devs say it works but nobody tested it".

OpenBSD has brought forth a patch earlier this month to try and make fsync(2) less of a mess to use on OpenBSD[1], though it hasn't been committed yet.

[1] https://marc.info/?l=openbsd-tech&m=154897756917794&w=2


Note that that patch doesn't really fix the issue. You can write();close();open();fsync(); and you'll miss the issue if the OS failed during writeback before the open(). That's worse than on new-ish linux, where at least it'll only forget the error if there's a lot of memory pressure.

I can see a mention of SQLite and MySQL. Also worth mentioning since this can affect any system:

https://www.firebirdsql.org/pdfmanual/html/gfix-sync.html

https://www.firebirdsql.org/file/documentation/reference_man...

I wonder how Oracle handles this. Raw device/partition and its own FS business logic?


Direct IO is the default for raw devices, and while apparently not the default otherwise IIRC it's pretty widely used by Oracle shops.

While Oracle supports raw devices, and has its own reasonably good way of managing them, most installations I see use files for storage.

(Which leaves the question of how Oracle handles this unanswered, of course)


There is an old paper http://pages.cs.wisc.edu/~remzi/Classes/736/Papers/iron.pdf that analyzes how various filesystems handle errors, and not too long ago it was fairly bad. In my own experience, some older Windows versions would not even check whether a write command failed. I used to laugh when developers would state how robust their databases were when the underlying filesystem does not even check for many errors. Hopefully things are better now.

FWIW MongoDB fixed[1] this last year in their WiredTiger storage engine.

[1] https://jira.mongodb.org/browse/WT-4045


The kernel itself isn't really a transaction manager. If there is an I/O error on a file, then I'd only expect it to percolate up in that immediate "session". How long should that error flag be expected to stick around: until filesystem remount? Or even longer, with the talk of storing a flag in the superblock?

Specifically it seems like asking for trouble to open() a path a second time, and expect fsync() calls to apply to writes done on the other file descriptor object - there's no guarantee they're even the same file [0]! At the very least, pass the actual file descriptor to the other process. Linux's initial behavior was in violation of what I've said here, but the patch to propagate the error to every fd open at the time of the error should resolve this to satisfaction.

I would think about the only reasonable assumption you could make about concurrent cross-fd syncing is that after a successful fsync(fd), any successful read(fd) should return data that was on disk at the time of that fsync(fd) or later. In other words, dirty pages shouldn't be returned by read() and then subsequently fail being flushed out.

disclaimer: It's been a while since I've written syscall code and I've had the luxury of never having to really lean into the spec to obtain every last bit of performance. Given how many guarantees the worse-is-better philosophy forewent, I don't see much point in internalizing what few are there.

[0] Ya sure, you can assert that the user shouldn't be otherwise playing around with the files, I'm just pointing out that translating path->fd isn't a pure function.


Have there been any real-world consequences from this, and how can they be prevented?

Does MySQL have the same flaw?


The original report from Craig Ringer was based on a real system that experienced a failure like this on thin provisioning.

MySQL in buffered IO mode surely has the same problem, and has implemented the same solution: PANIC rather than retrying. Same for WiredTiger (MongoDB).

Other systems using direct (non buffered) IO are not affected.


This can happen when an I/O device has intermittent failures. Bad device or faulty cable.

Interestingly enough, such issues are becoming more common. It's not just about devices being less reliable, but e.g. thin provisioning being used more widely etc.

I read this on lwn.net a while ago but it seems there is no fix for it. How is MySQL doing? I believe Oracle etc. don't have this problem as they deal with the disk directly.

I think that all databases are affected by this to some degree when not using Direct-IO, and I think Oracle and MySQL can both run with or without direct-IO.

Does anyone know if FoundationDB is affected?

FoundationDB does not use buffered IO, so no.

The meta point here is that with OSS folks assume the code has been reviewed since it's open, but the reality is that unless someone actually does so, you really don't know. The popularity of a project doesn't mean there aren't fatal flaws.

No need to bring out that dead horse for a beating. No one in this thread has yet made the claim that OSS code is flawless; you're picking a fight with a straw man.

It's a genuine issue, and the claim is made time and again by people who think "open == security" whenever there's a discussion about something like Google or iMessage, when the armchair security experts come out of the woodwork to promote their favourite whatever-it-is.

Sure, it mightn't have been made in this thread yet, but that doesn't make it an irrelevant, invalid, or uninteresting observation. I think the spirit of discussion, so integral to what separates HN from other websites, means we should not poo-poo this line of inquiry just because you're bored of it.


I think the point is more often closer to "open is a prerequisite to security".

It's more like "open is a prerequisite for personal verification of security".

A system can be closed and secure, just you can't verify it.


Yes, but I tend to view security as a somewhat epistemological phenomenon. It's not enough for the security to exist "somewhere out there in the universe" in an absolute, objective sense. If you have no way of verifying it, it could simply be a lie, and is thus useless for threat modelling.

> Sure, it mightn't be made in this thread yet, but that doesn't make it an irrelevant, invalid, or uninteresting observation.

I really think it does. It's like "the sky isn't green!" or "the earth isn't flat!" or "vaccines don't cause autism!" Sure, these are all true things, but they weren't exactly topics of discussion on this thread before you brought them up.

By all means, discuss the article, and rebut comments you feel espouse an inaccurate worldview. (IMO) preemptive rebuttals like this are only useful or interesting when they're somewhat novel, or represent some special insight into a particular field that outsiders wouldn't have. This one has neither.

My particular take on why this dead horse is irrelevant (as well as tedious and boring):

Fsync isn't a security issue, it's a data loss issue. Arguably, the Postgres behavior is quite reasonable and the article's headline is just inaccurate. Linux has been reviewed, e.g., https://danluu.com/file-consistency/ from 2017, summarizing research from 2001-2014, all of which pointed towards deficiencies in its data preservation behavior. The Linux community know they lose data and propose that users should accept it.[0]

The Postgres <-> Linux fsync investigation has been ongoing for a long time, with lots of eyeballs on both sides of the kernel/userspace boundary. This isn't really a "bug escapes major application developers for 20 years!" so much as "Linux can't agree to provide an API to make file data consistent."

[0]: https://lwn.net/Articles/752105/

[1]: http://rhaas.blogspot.com/2014/03/linuxs-fsync-woes-are-gett...


> but these weren't exactly topics of discussion on this thread before you brought them up

Well, we're sorry we didn't recognise you as the discussion warden, but I think that's how a conversation works: people are free to bring up the points that they feel relevant, and people can either continue the train of thought or not. If it has no appeal to you, you're free to let it die a natural death rather than make pronouncements on what's relevant or not.


Humorously, someone in this thread has now actually made the claim that their open source database product has flawless crash reliability: https://news.ycombinator.com/item?id=19127011

Multiple independent research teams with their crash analysis/fuzzing tools confirmed the fact. Along with over 7 years of deployment in large enterprises: zero crash-induced corruption, zero startup/recovery time. Crash-proof by design.

Then make whatever appropriate response to that comment you feel is necessary; it still isn't a reasonable top-level comment.

But not that it's because it's free software; rather, that it's because they read the docs.

No one is beating a dead horse here; there is a valid point: "The popularity of a project doesn't mean there aren't fatal flaws."

The point is valid, but uninteresting. It's the default. Everyone who has skimmed HN for more than two weeks has seen this by example, if not by comment, time and time again.

Personally I assume if a project is at least somewhat popular it works most of the time and frequent/serious bugs are reported and researchable. For that to work I also report or second bugs I encounter. That doesn't mean there aren't any bugs in even the most popular OSS, especially in edge cases and rare scenarios.

With closed source, though, I often don't know the popularity or how many/what kind of bugs have been reported, just the reputation of the vendor.

I prefer popular software from a highly reputable vendor over somewhat popular OSS. But I also prefer the most popular and battle-tested OSS, like PostgreSQL and Linux, over any closed system, e.g. SQL Server and Windows.


Any project can contain fatal flaws irrespective of its review policies. Particularly when we are concerned with the subtle behaviour of the interaction between write and fsync. Even if you read the standard it's not clear exactly how the system as a whole should behave; there are a number of situations which aren't mentioned in the standard at all.

It would be quite possible for a review, and even extensive testing, to fail to pick up on some system-specific subtleties.


This is a contrarian view and I will sound phobic, but for this very reason (false sense of security plus the possibility of malicious check-ins) I now place less emphasis on open/closed source and more on reputation. I am happier to download exe's from good companies these days than a package from the Arch AUR.

It's a perfect scenario to hear some rants from Linus. Although the "code" has changed and he really has to calm down about this :)


