I patched my own systems (wrap the calls in a loop as per standard practice) and then proceeded to run literally hundreds of thousands of PostgreSQL instances on NFS for many more years with no problems.
The patch was eventually integrated it seems but I never found out why because I lost interest in trying to work with the community after that experience.
Looking at this thread - the patch is obviously right. It doesn't matter about NFS, Linux, or any host of random crap people brought up.
(I unfortunately suspect if you hadn't mentioned any of that, and just said "write doesn't guarantee full writes, this handles that" it may have gone better).
The documentation (literally on every system) is quite clear:
"The return value is the number of bytes actually written. This may be size, but can always be smaller. Your program should always call write in a loop, iterating until all the data is written."
Literally all documentation you can find about write on all systems that implement it clearly mentions it doesn't have to write all the bytes you ask it to.
Instead of saying "yeah, that's right", people go off in every random direction, with people arguing about your NFS mount options, what things can cause write to do that and whether they are possible or they should care about them.
 spoiler alert: it doesn't matter. The API allows for this, you need to handle it.
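For anyone who hasn't written one, the loop being argued about looks roughly like the following. This is a minimal sketch, not the patch from the thread: `write_all` is a hypothetical name, and treating a zero return as "out of space" is an assumption made here so the loop cannot spin forever.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL's code): keep calling write() until
 * all len bytes are accepted. Returns 0 on success, -1 with errno set.
 */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;          /* real error, errno was set by write() */
        }
        if (n == 0) {
            errno = ENOSPC;     /* assumption: no progress means no space */
            return -1;
        }
        /* Short write: skip what was accepted and try the remainder. */
        p += n;
        len -= (size_t) n;
    }
    return 0;
}
```

A caller would use it as `if (write_all(fd, page, 8192) != 0) { /* report errno */ }` instead of comparing the return value of a single `write()` against the requested size.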
> Instead of saying "yeah, that's right", people go off in every random direction, with people arguing about your NFS mount options, what things can cause write to do that and whether they are possible or they should care about them.
Given that changing the mount option solved the immediate problem for the OP, I fail to see how it's random.
>  spoiler alert: it doesn't matter. The API allows for this, you need to handle it.
Sure, it's not like the return value was ignored before though. I agree retrying is the better response, but erroring out is a form of handling (and will also lead to retries in several places, just with a loop that's not tightly around the write()).
It's random because pretty much anyone could volunteer that opinion. By diagnosing the problem, making a patch and reaching out the author has already made their intention of fixing the problem clear. Unless it has been established that the problem isn't valid a workaround isn't really relevant.
Effective communication and exchange of ideas needs to have a good ratio between work and value. I used to frequent meetups where people would present projects with thousands of hours of work and priorities behind them. There would almost always be someone with less experience stating their "ideas" on what should be done instead. Eventually those people would end up talking among themselves where their ideas could flow freely without any restriction of actual work being done.
Experienced people provide value. They try to understand the problem, add their own experience to it and validate the work that has already been done. They make the problem smaller and closer to a solution. They don't, or shouldn't, casually increase the scope for little reason.
I'm baffled by this. Even if the fix had been immediately committed, the workaround of using nointr still would have been valuable, because a fixed version of postgres wouldn't immediately have been released.
You seem to argue in a way that entirely counteracts your own later comments.
It is often the same with software. Good feedback on software isn't random ideas, suggestions or feature requests that add hundreds of hours of work on a whim. It is feedback that considers the work that has already been done. Anyone can come up with something else, especially in theory and with a blank slate. It doesn't really require anything other than an opinion. Hacker News certainly is proof of that.
In this case it is “a form” but that specific handling is provably wrong. That is important: everybody could have tried and proved again. Therefore just dismissing the correct handling and keeping the wrong one is, let me repeat, also provably wrong.
> 1. If writes need to be retried, why not reads? (No, that's not an invitation to expand the scope of the patch; it's a question about NFS implementation.)
> 4. As coded, the patch behaves incorrectly if you get a zero return on a retry. If we were going to do this, I think we'd need to absorb the errno-munging currently done by callers into the writeAll function.
not review comments?
> when the problem clearly lies in the code not following the specs.
There were plenty of questions below about what exactly the fix should look like.
How's that not review?
Callers would still need to be informed that a partial write occurred and would need to do something about it.
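The two review points quoted above (what happens on a zero return, and how callers learn about a partial write) can be sketched like this. Everything here is an assumption for illustration: the name `write_all_report`, the out-parameter, and the choice to map "returned zero without setting errno" to `ENOSPC`, which is at least consistent with the "Check free disk space" hint in the error log.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical variant: retries short writes, but always tells the caller
 * how many bytes actually reached the file descriptor.
 * Returns 0 if all len bytes were written, -1 otherwise (errno set).
 */
int write_all_report(int fd, const void *buf, size_t len, size_t *written)
{
    const char *p = buf;
    size_t done = 0;
    int rc = 0;

    while (done < len) {
        ssize_t n;

        errno = 0;
        n = write(fd, p + done, len - done);
        if (n > 0) {
            done += (size_t) n;     /* partial progress, keep going */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, nothing written */
        if (errno == 0)
            errno = ENOSPC;         /* zero return, errno untouched */
        rc = -1;
        break;
    }
    if (written != NULL)
        *written = done;            /* reported on success and on failure */
    return rc;
}
```

With the errno handling inside the helper, a caller only has to check the return value and can still log "wrote only X of Y bytes" from `*written`.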
Someone comes along with a patch or idea. Bunch of big Postgres people come knock it and it dies right there.
Happened to me when I suggested a more multi-tenant approach back around 2010 and today we have Citus. I was told that (paraphrased) no users were asking for that sort of thing.
I see it kind of happening with the Foreign Key Array patch that the author asked for help to rebase and no one bothered to reply.
Someone suggested replacing the Postmaster with a threaded approach so it could scale better (and showed benchmarks of their implementation handling more connections). Community response was there were already 3rd party connection pools that do the job. An outsider looking in considers this nuts - most people would not run additional components if they did not need to!
Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.
Could name a few other examples too I’ve noticed over the years and just sighed at. I don’t expect every idea to be accepted or even welcomed - but there is that sense of bias against change.
I even saw something similar when I went to the ER recently. Even doctors will pick up on one thing you say and draw conclusions from that and dismiss everything else.
This pattern seems really common, and is what scares me about the future in general. The 'experts' concentrate on the stuff they understand / are best at, to the detriment of where the actual focus needs to be. In a lot of cases, this is despite the insistence of the supposed non-expert who is the one suffering as a result.
Some of the worst cases are as a child, where you get in trouble twice. First for wasting adults' time because you didn't tell them properly, and then again for pointing out that you did.
Elderly relatives writhing in pain, only to have doctors say it's indigestion (it was a perforated ulcer and an uncle had previously died from a wrongly diagnosed ulcer perforating). My partner was misdiagnosed with flu when it was pneumonia which then developed into pleurisy (I'd never seen either of the latter, but was telling the doctor that's what the symptoms looked like - 15 years later he still suffers pain from the pleurisy). I had an arm paralyzed through severe pain and the consultant doctor planned an operation "to cut the nerve" - I said I thought it was a frozen shoulder and that such a procedure was unnecessary; 6 months later the paralysis began to subside and the consultant agreed it was a frozen shoulder. Another relative died of bowel cancer that was said to be back pain (she died in the hospital where she worked). I know of several people who were telling the doctor they had cancer, only to have the doctors dismiss it as trivial, with most of these people dying because of their untreated cancer. As a child I had joint pains for years that were diagnosed as "growing pains" but turned out to be a hip disease (younger cousins ended up with the same condition and because I'd already had it, they were more readily diagnosed by family members).
In both directions (treating trivial as serious and treating serious as trivial) I've seen so many mistakes. I'd be much happier to see a doctor google the symptoms rather than jump to a conclusion about what is wrong.
There's a famous anecdote where junior doctors are taught the importance of observation, by senior doctors tricking them into tasting urine. It doesn't seem to be a lesson they learn. Even when their own objective test results are contra-indicative of their pre-judgement I've seen doctors scratch their heads but stick with their incorrect pre-judgement.
When doctors I know have a family member go into hospital, you should see how attentive my doctor friends get concerning what is being said and done to their relatives. Some doctors will not even allow relatives to go into hospital for non-emergency treatment at certain times of year (because of timetabling there can be very inexperienced doctors on duty at certain times of the year).
As someone who suffered through that on _both_ shoulders I can sympathize. For me, the doctor missed it. The chiropractor I was sent to took one look and said it was 'frozen shoulder'. I had never even heard of such a thing before. It took nearly two years to get full movement on my right shoulder. Then the left froze :-(
I'm just curious. Are these experiences in the UK?
I have seen and heard of some similar things, but my experience is only with the US healthcare system.
If I’d been in the patch submitter’s shoes, I wouldn’t have thought twice about writing off that community.
If I had gotten your reply, instead: 100% fair play, thank you for your consideration, and send my love to the dev team.
Honestly, if someone's spending time working on a high value open source project (which PG absolutely is), I'd rather they spend less time (than I do) crafting their internet comments to sound nice and more time contributing to society. And I hope people who actually use the product feel the same way, understand why every single use case can't be carefully considered every time it comes up, and don't take it personally.
People like to think they can escape politics. They can’t. Any group of >1 humans will involve politics.
Learning how to be respectful and polite is like learning how to touch type: it’s a small, painful price to pay once, for a lifetime of copious reward.
Linus was trying to make things work, with profanity. Postgres couldn't be bothered.
Sure, performative profanity isn't everyone's cup of tea, but milquetoast passive-aggressive dismissals of people like OP who ARE TOTALLY RIGHT aren't actually nice.
It's ten years too soon to conclude that Torvalds backing away from abrasive behaviour didn't kill Linux.
That said, I feel that a strong and positive community around a project is always an asset. I’ve seen many more projects fail due to community interaction being bad than I have from it being good.
The patch author does the "initial" work. But then it's up to the team to learn the patch, understand it and keep maintaining it.
Every line of code is baggage.
If there is no demand for something at the time, it makes sense for maintainers to reject that. It's up to them to maintain that patch from now on.
No one is asking them to. An open source project is not a corporation, it has no shareholders who require growth at all costs. So someone doesn't contribute to the project, if there's enough other contributors to keep it healthy then who cares? No need to try and get every single potential contributor to contribute code to the project.
It's not hard.
> Another example: NAMEDATALEN restricts every identifier to 63 bytes - a limit that causes frequent issues (more so now that we have partitioning) and also a problem for multi-lingual table names. It’s been proposed to increase this limit a few times. Every time the answer has been: abbreviate your table names or recompile your own version of Postgres.
I think if it were easy, it'd immediately be fixed, but there's a fair bit of complexity in fixing it nicely. For partially historical and partially good reasons object names are allocated with NAMEDATALEN space, even for shorter names. So just increasing it would further waste memory... We're going to have to fix this properly one of these days.
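To put rough numbers on the "waste memory" point: with fixed-size name slots, every name costs the full slot regardless of its length. The sketch below assumes the default of 64 for `NAMEDATALEN` (63 bytes of identifier plus a terminating NUL); `FixedName` and `wasted_bytes` are made-up names for illustration, not PostgreSQL's definitions.

```c
#include <stddef.h>
#include <string.h>

#define NAMEDATALEN 64  /* assumed default: 63 usable bytes + NUL */

/* A fixed-width name slot: always NAMEDATALEN bytes, however short the name. */
typedef struct FixedName {
    char data[NAMEDATALEN];
} FixedName;

/* Bytes of a fixed slot left unused by a given NUL-terminated name. */
size_t wasted_bytes(const char *name, size_t slot_size)
{
    size_t used = strlen(name) + 1;     /* the name plus its NUL */

    return used >= slot_size ? 0 : slot_size - used;
}
```

A table called `users` leaves 58 of 64 bytes unused today; if the slot were raised to 256 bytes, the same name would leave 250 unused, and that cost is paid for every name held in a fixed slot.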
Agreeing that there is something worthy of fixing is a first step. It should have happened with this NFS patch and imo some other stuff. The considerations for how and when should be dealt with separately.
But there were like multiple people agreeing that it needs to be changed. Including the first two responses that the thread got. And there were legitimate questions around how interrupts need to be handled, about how errors ought to be signaled if writes partially succeed and everything. Do you really expect us to integrate patches without thinking about that? And then the author vanished...
Whoa! You made a huge leap there. At what point was it suggested that patches be recklessly applied?
That didn't happen. Your quote actually suggests a reasonable progression and at no point is there any suggestion, implied or otherwise, that changes be integrated without due consideration.
Not irrationally dismissing criticism != abandoning sound design and development.
Which is why people are not enthusiastic about changing it, when there are fairly simple workarounds (assuming keeping the names short is considered to be a workaround).
Very likely many years, or even never. People don't use large names because they like it, they always prefer small ones.
How much memory are we talking about?
not just bias against change. while there are some very talented and friendly people in the pg community, there are a few bad actors that are openly hostile and aggressive, that feel that they must be part of every discussion. it gets worse when it is a change that they made that is causing an issue, as ego takes a front seat.
unfortunately, these few bad actors make dealing with the pg mailing lists in general very unpleasant, and have made myself (a popular extension maintainer) and others try to keep interaction to an absolute minimum. that's not good for the community.
I'm sure there were cases of inappropriate / rude communication, but AFAICS those were rare one-off incidents. While you're apparently talking about long-term and consistent behavior. So I wonder who you're talking about? I can't think of anyone who'd consistently behave like that - particularly not among senior community members.
my comment, which is based on the multiple interactions I've had with the community, stands as is: some fantastic people, and a few aggressive bad actors that spoil things.
This is not unique to Postgres. I've seen this behavior on many development mailing lists (e.g. Mutt-dev).
The core team has also to prioritize, deal with already planned features, and shoot down tons of people with inane ideas as well (not just good ones).
(Putting tenant_id as a column was not an option because for each tenant, a third-party software was started that wanted to have its own DB.)
After reading this, I have been wondering if the other requests/ideas are not startup ideas
If you look at https://wiki.postgresql.org/wiki/PostgreSQL_derived_database..., it's a pretty long list - some of the products are successful, some are dead. And then there are products adding some extra sauce on top of PostgreSQL (e.g. timescaleDB).
There's literally one person saying that NFS is a crapshoot anyway, and several +1s for retries. And the former was accompanied with several questions about the concrete implementation.
Many others were as you note bikeshedding the commit.
Only one, if I recall, questioned why it was a controversial patch at all.
If you check the patch that was committed eventually, IIRC it’s identical to what I proposed.
As I said, I don’t look back particularly favourably on my interactions with the community.
It seems like a fairly good example of an engineering discussion. Even if the patch is correct and conforming and solves a problem, that doesn't mean there is no use for further discussion, and the surrounding discussion does seem to have merit.
* Are there closely-related problems still left unsolved (e.g. retrying reads)?
* Is something not configured according to best practices known by other community members?
* Expressing an opinion that the use case is dangerous, so that nobody (including other people reading the thread in the future) will take it as an endorsement for running postgres on NFS.
* Some legitimate-sounding questions about the specifics of the patch (around zero returns, what writeAll handles versus its callers, etc.).
* At least one person agreed with you that it's a reasonable thing to do.
I don't think HN is really the right place to assess the technical merits of a postgres patch, but the discussion itself seems well within reasonable bounds.
IMHO it's a bit strange to assume everyone will agree with your patch from the very beginning. People may lack proper understanding of the issue, or they may see it from a different angle, and so on. Convincing others that your solution is the right one is an important part of getting patch done. I don't see anything wrong with that.
Thinking about the social structure for a minute, I honestly think you might have done better to leave out some of the content of your patch submission, let people push back on the more obvious stuff, and have ready answers. There's a phenomenon where curators feel they must push back, and they feel weird when there's nothing to criticize - you can get around this by giving them something to criticize, with a ready-at-hand response!
Sorry this happened.
> The artist working on the queen animations for Battle Chess... did the animations for the queen the way that he felt would be best, with one addition: he gave the queen a pet duck. He animated this duck through all of the queen’s animations, had it flapping around the corners. He also took great care to make sure that it never overlapped the “actual” animation.
> As expected, he was asked to remove the duck, which he did, without altering the real Queen animation.
One of the interesting points I’ve reflected on over time has been how _my_ issue was solved with the patch, however for others it wasn’t.
When I think about this I of course recognise that Postgres owes me nothing, however we both had similarly aligned objectives (fix bugs, do good), but because of our inability to successfully communicate, we didn’t get to a productive outcome and so a fixable bug sat in the release for some time.
I do wonder how we go about improving that kind of circumstance.
In Morgan Llywelyn's Finn Mac Cool ( https://www.amazon.com/dp/0312877374/ ), Finn is a man with no family from an inferior tribe, and he's a troop leader in a band of scummy soldiers of no social standing whatever. On the other hand, his potential is obvious -- he will eventually work his way up to self-made king.
While stationed in the capital, he falls in love with a respectable, middle-class blacksmith's daughter. She won't give him the time of day because of the difference in social class, but over time, as his respectability steadily climbs, she warms up to him.
They have a failed sexual encounter, and everything falls apart. She feels too awkward to approach him anymore. He believes (incorrectly) that they've become married at a level inappropriate to her class, and eventually comes to her home to propose a much higher grade of marriage. But she doesn't know how to accept without -- in her own eyes -- suffering a loss of dignity. A painful scene follows in which it's obvious that he wants to marry her, she wants to marry him, her mother wants her to marry him, but somehow none of them can see how to actually get to that point.
People really don't go for innovation in social protocols, even when the protocols they know are failing them.
So you suggest that the patches should be obviously “worse” in order to have better chance to be accepted?
> 2. What is the rationale for supposing that a retry a nanosecond later will help? If it will help, why didn't the kernel just do that?
The "asshole reviewer" is definitely a thing.
I've only submitted one patch (\pset linestyle unicode for psql), and with a few rounds of review and revision it made it in. Overall I found this process nitpicky but productive. However, there is a point at which the harshly critical review can be detrimental, and from the other comments mentioned here, it sounds like this has been the case in the past.
With regard to the semantics of write(2) and fwrite(3) these are clearly documented in SUS and other standards, and the "valid disagreement" in this case may have been very counter-productive if it killed off proper review and acceptance of a genuine bug. There are, of course, problems with fsync(2) which have had widespread discussion for years.
Can you share an example of a review that you consider harsh? (feel free to share privately, I'm simply interested in what you consider harsh)
I admit some of the reviews may be a bit more direct, particularly between senior hackers who know each other, and from time to time there are arguments. It's not just rainbows and unicorns all the time, but in general I find the discussion extremely civilized (particularly for submissions from new contributors). But maybe that's just survivor bias, and the experience is much worse for others ...
> I've only submitted one patch (\pset linestyle unicode for psql), and with a few rounds of review and revision it made it in. Overall I found this process nitpicky but productive. However, there is a point at which the harshly critical review can be detrimental, and from the other comments mentioned here, it sounds like this has been the case in the past.
Oh, 2009 - good old days ;-) Thanks for the patch, BTW.
There's a fine line between nitpicking and attention to detail. We do want to accept patches, but OTOH we don't want to make the code worse (not just by introducing bugs, even code style matters too). At some point the committer may decide the patch is in "good enough" shape and polish it before the commit, but most of that should happen during the review.
What the hell is wrong with people?
With a lot of people you will never need this. But especially in larger projects where any number of people might latch on and review submissions, the chances of someone finding something to complain about go up rapidly.
And when one person has found something to complain about, all kinds of other social dysfunction quickly becomes an issue too.
In most corporate projects, odds are you're dealing with a much smaller pool of reviewers.
It's a pity a lot of developers out there do not have an ingrained awareness about this.
> 2011-07-31 22:13:35 EST postgres postgres [local] LOG: connection authorized: user=postgres database=postgres
> 2011-07-31 22:13:35 EST ERROR: could not write block 1 of relation global/2671: wrote only 4096 of 8192 bytes
> 2011-07-31 22:13:35 EST HINT: Check free disk space.
> 2011-07-31 22:13:35 EST CONTEXT: writing block 1 of relation global/2671
> 2011-07-31 22:13:35 EST [unknown] [unknown] LOG: connection received: host=[local]
The proposed patch retries rather than throwing an error.
If you can just give us the anecdote rather than making a vague, sweeping critique of MS OSS projects, you might avoid the downvotes.
I've given up on both :/
> Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.
A real-life example can be found at https://stackoverflow.com/questions/42434872/writing-program...
Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error occurs; in short, the most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else. So those pages cannot be kept if the user wants the system to remain usable after such an event.
And that would then fail because of the hardware layer's bugs with reporting a device disconnect correctly. I mean, if the user follows the rules and pulls the stick out of a host port or a powered hub, sure, it's likely going to work per spec. But if it's on a daisy-chained 2003-era USB2 hub connected to a cheap USB3 hub? Yeah, good luck.
That's user error, though. The kernel should react to removable media being pulled by sending a wall message to the appropriate session/console, stating something similar to "Please IMMEDIATELY place media [USB_LABEL] back into drive!!", with [Retry] and [Cancel] options. That way, the user knows what to expect -- OS's used to do this as a matter of course when removable media was in common use. In fact, you could even generalize this, by asking the user to introduce some specific media (identified by label) when some mountpoint is accessed, even if no operations were actually in progress.
So what are the odds that A) you get it back into a non-corrupt state B) the sectors affected by finishing the write will re-corrupt it C) you do this in one minute?
They're initiated by the host, not by a USB device.
> Or moving devices from one port to another. If the device comes back within a minute or two you probably shouldn't throw out those writes.
This would be a nice feature. Although these writes would need to be buffered. Probably also throttled. There'd also be some risk with devices that have identical serial numbers. Some manufacturers give all of their USB disks / memory sticks same serial number...
Or by a power flicker. Which can be caused by plugging in other devices too.
> Although these writes would need to be buffered. Probably also throttled.
You don't necessarily have to allow new writes, the more important part is preserving writes the application thinks already happened. But that could be useful too.
> Some manufacturers give all of their USB disks / memory sticks same serial number...
You have the partition serial number too, usually.
On most OSes the HW serial number of the disk is now usually supplemented in the disk management logic with the GPT “Disk GUID”, if available. Most modern disks (including removable ones like USB sticks) are GPT-formatted, since they rely on filesystems like ExFAT that assume GPT formatting. And those that aren’t are effectively already on a “legacy mode” code-path (because they’re using file systems like FAT, which also doesn’t support xattrs, or many types of filenames, or disk labels containing lower-case letters...) so users already expect an incomplete feature-set from them.
Plus: SD cards, the main MBR devices still in existence, don’t even get write-buffered by any sensible OS to begin with, precisely because you’re likely to unplug them at random. So, in practice, everything that needs write-buffering (and will ever be plugged into a computer running a modern OS) does indeed have a unique disk label at some level.
SQLite does a "flush" or "fsync" operation at key points. SQLite assumes that the flush or fsync will not return until all pending write operations for the file that is being flushed have completed. We are told that the flush and fsync primitives are broken on some versions of Windows and Linux. This is unfortunate. It opens SQLite up to the possibility of database corruption following a power loss in the middle of a commit. However, there is nothing that SQLite can do to test for or remedy the situation. SQLite assumes that the operating system that it is running on works as advertised. If that is not quite the case, well then hopefully you will not lose power too often.
Also this seems related:
That results in consistent behavior and guarantees that our operation actually modifies the file after it's completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn't really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.
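The OS X point in that quote translates into a small portability shim. This is a sketch under assumptions: `flush_to_disk` is a hypothetical name, and falling back to plain `fsync()` when `F_FULLFSYNC` fails is a policy choice made here, not something the quoted text prescribes.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: request the strongest flush the platform offers.
 * Returns 0 on success, -1 with errno set on failure.
 */
int flush_to_disk(int fd)
{
#ifdef F_FULLFSYNC
    /* Defined on macOS: asks the drive to flush its own cache as well. */
    if (fcntl(fd, F_FULLFSYNC, 0) == 0)
        return 0;
    /* Assumption: if the full flush is refused, fall back to fsync(). */
#endif
    return fsync(fd);
}
```

On platforms without `F_FULLFSYNC` this compiles down to a plain `fsync()` call, so whether the data is really on stable storage still depends on the OS and filesystem behaving as advertised.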
The linked OSDI '14 paper looks good:
We find that applications use complex update protocols to persist state, and that the correctness of these protocols is highly dependent on subtle behaviors of the underlying file system, which we term persistence properties. We develop a tool named BOB that empirically tests persistence properties, and use it to demonstrate that these properties vary widely among six popular Linux file systems.
Arguably, POSIX needs a more explicit fsync interface. I.e., sync these ranges; all dirty data, or just dirty data since last checkpoint; how should write errors be handled; etc. That doesn't excuse that Linux's fsync is totally broken and designed to eat data in the face of hardware errors.
That Dan Luu blog post you linked is fantastic and one I really enjoyed.
I doubt anyone worried much when disks just died completely in the good old days. This topic is suddenly more interesting now as virtualisation and network storage create more kinds of transient failures, I guess.
On the kernel side the error reporting was not working reliably in some cases (see  for details), so the application using fsync may not actually get the error at all. Hard to handle an error correctly when you don't even get notified about it.
On the PostgreSQL side, it was the incorrect assumption that fsync retries past writes. It's an understandable mistake, because without the retry it's damn difficult to write an application using fsync correctly. And of course we've found a bunch of other bugs in the ancient error-handling code (which just confirms the common wisdom that error-handling is the least tested part of any code base).
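In code, the corrected assumption amounts to "a failed fsync is final for the data written so far". A minimal sketch, with hypothetical names (`sync_result`, `sync_once`), of an application-side wrapper that encodes that policy:

```c
#include <errno.h>
#include <unistd.h>

typedef enum {
    SYNC_OK,            /* everything written so far was reported durable */
    SYNC_DATA_SUSPECT   /* do not call fsync() again and trust a success */
} sync_result;

/*
 * Hypothetical wrapper: call fsync() exactly once. If it fails, the kernel
 * may already have dropped or marked clean the pages that failed, so a later
 * successful fsync() says nothing about them. The caller has to rewrite the
 * data from its own copy (a log, for example) or stop.
 */
sync_result sync_once(int fd)
{
    if (fsync(fd) == 0)
        return SYNC_OK;
    /* errno is left as set by fsync() for the caller to report. */
    return SYNC_DATA_SUSPECT;
}
```

The point of returning a distinct "suspect" state rather than a plain error code is to make the retry-fsync-until-it-succeeds pattern awkward to write by accident.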
> On the PostgreSQL side, it was the incorrect assumption that fsync retries past writes.
That's the part I'm claiming is a Linux bug. Marking failed dirty writes as clean is self-induced data loss.
> It's an understandable mistake, because without the retry it's damn difficult to write an application using fsync correctly.
This is part of why it's a bug. Making it even more difficult for user applications to correctly reason about data integrity is not a great design choice (IMO).
> And of course we've found a bunch of other bugs in the ancient error-handling code (which just confirms the common wisdom that error-handling is the least tested part of any code base).
It's easy to blame other layers for not behaving the way you want/expect, but I admit there are valid reasons why it behaves the way it does and not the way you imagine. The PostgreSQL fsync thread started with exactly "kernel is broken" but that opinion changed over time, I think. I still think it's damn difficult (border-line impossible) to use fsync correctly in anything but trivial applications, but well ...
Amusingly, Linux has sync_file_range() which supposedly does one of the things you describe (syncing file range), but if you look at the man page it says "Warning: ... explanation why it's unsafe in many cases ...".
Sure; to be clear, I work on both sides of this layer boundary (kernel side as well as on userspace applications trying to ensure data is persisted) on a FreeBSD-based appliance at $DAYJOB, but mostly on the kernel side. I'm saying, from the perspective of someone who works on kernel and filesystem internals and the page/buffer cache, that cleaning dirty data without a successful write to media is intentional data loss. Not propagating IO errors to userspace makes it more difficult for userspace to reason about data loss.
> but I admit there are valid reasons why it behaves the way it does and not the way you imagine.
How do you imagine I imagine the Linux kernel's behavior here?
> The PostgreSQL fsync thread started with exactly "kernel is broken" but that opinion changed over time, I think. I still think it's damn difficult (border-line impossible) to use fsync correctly in anything but trivial applications, but well ...
The second sentence is a good argument for the Linux kernel's behavior being broken.
a) there's no standardized way to recover from write errors when not removing dirty data. I personally find that more than an acceptable tradeoff, but obviously not everybody considers it that way.
b) If IO errors, especially when triggered by writeback, cause kernel resources to be retained (both memory for dirty data, but even just a per-file data to stash a 'failed' bit), it's possible to get the kernel stuck in a way it needs memory without a good way to recover from that. I personally think that can be handled in smarter ways by escalating per-file flags to per-fs flags if there's too many failures, and remounting ro, but not everybody sees it that way...
> b) If IO errors, especially when triggered by writeback, cause kernel resources to be retained (both memory for dirty data, but even just a per-file data to stash a 'failed' bit), it's possible to get the kernel stuck in a way it needs memory without a good way to recover from that.
To put it more succinctly: userspace is not allowed to leak kernel resources. (The page cache, clean and dirty, is only allowed to persist outside the lifetime of user programs because the kernel is free to reclaim it at will — writing out dirty pages to clean them.)
> I personally think that can be handled in smarter ways by escalating per-file flags to per-fs flags if there's too many failures, and remounting ro, but not everybody sees it that way...
Yeah, I think that's the only real sane option. If writes start erroring and continue to error on a rewrite attempt, the disk is failing or gone. The kernel has different error numbers for these (EIO or ENXIO). If failing, the filesystem must either re-layout the file and mark the block as bad (if it supports such a concept), or fail the whole device with a per-fs flag. A failed filesystem should probably be RO and fail any write or fsync operation with EROFS or EIO.
If the device has been failed, it's ok to clean the lost write and discard it, releasing kernel resources, because we know all future writes and fsyncs to that file / block will also report failure.
This model isn't super difficult to understand or implement and it's easier for userspace applications to understand. The only "con" argument might be that users are prevented from writing to drives that have "only a few" bad blocks (if the FS isn't bad block aware). I don't find that argument persuasive — once a drive develops some bad sectors, they tend to develop more rapidly. And allowing further buffered writes can really exacerbate the amount of data lost, for example, when an SSD reaches its lifetime write capacity.
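To make the proposed model concrete, here is a toy sketch in Python. It is not kernel code and does not describe any real filesystem: the `ToyFilesystem` class and its `device_ok` switch are made up for illustration. It only shows the rule being argued for, namely that once writeback fails the filesystem is marked failed, the lost dirty data can be dropped, and every later write or fsync reports an error.

```python
import errno


class ToyFilesystem:
    """Hypothetical model of the 'sticky per-fs failure' proposal."""

    def __init__(self):
        self.failed = False   # per-filesystem failure flag
        self.dirty = {}       # file name -> list of buffered writes
        self.disk = {}        # file name -> list of writes on stable storage

    def write(self, name, data):
        if self.failed:
            # A failed filesystem behaves as read-only.
            raise OSError(errno.EROFS, "filesystem failed; now read-only")
        self.dirty.setdefault(name, []).append(data)

    def fsync(self, name, device_ok=True):
        if self.failed:
            # The error is sticky: it is reported again and again.
            raise OSError(errno.EIO, "filesystem failed")
        if not device_ok:
            self.failed = True
            # Dropping dirty data is acceptable here only because every
            # later write and fsync is guaranteed to fail as well.
            self.dirty.clear()
            raise OSError(errno.EIO, "writeback failed")
        self.disk.setdefault(name, []).extend(self.dirty.pop(name, []))
```

In this model an application can never observe a successful fsync after a lost write, which is the property the comment above is asking for.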
Also, isn't it possible to use multipath to queue writes in case of error? I wonder if that's safe, though, because it will keep it in memory only and make it look OK to the caller.
Combine this with I/O error handling often being broken (in kernels, file systems, applications) and applications that are supposed to implement transaction-semantics can easily turn into "he's dead jim" at the first issue.
fsync in case of the USB disappearing should simply return an error and drop the dirty pages.
Possibly, https://wiki.postgresql.org/wiki/Fsync_Errors notes both MySQL and MongoDB had to be changed.
Note that the issue here is different from either of the bits you quote. The problem here is that if you fsync(2), it fails, and you fsync(2) again, on many systems the second call will always succeed because the first one has invalidated/cleared all extant buffers, and thus there's nothing for the second one to sync. Which is a success.
AKA because of systems' shortcuts, an fsync success effectively means "all writes since the last fsync have succeeded", not "all writes since the last fsync success have succeeded". Writes between a success and a failure may be irrecoverably lost.
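A toy simulation may make the failure mode easier to see. This is a sketch in Python, not how any kernel is actually implemented; `ToyPageCache` and `fail_next_writeback` are invented names. It models only the behaviour described above: a failed fsync marks the dirty data clean, so the retry has nothing left to do and reports success.

```python
import errno


class ToyPageCache:
    """Hypothetical model: a failed writeback still marks pages clean."""

    def __init__(self):
        self.dirty = []                   # buffered writes not yet on disk
        self.disk = []                    # writes that reached stable storage
        self.fail_next_writeback = False  # test hook to inject one I/O error

    def write(self, data):
        self.dirty.append(data)

    def fsync(self):
        if self.fail_next_writeback:
            self.fail_next_writeback = False
            self.dirty.clear()  # data is dropped although it was never written
            raise OSError(errno.EIO, "writeback failed")
        self.disk.extend(self.dirty)
        self.dirty.clear()


def naive_retry_demo():
    cache = ToyPageCache()
    cache.write(b"A")
    cache.fail_next_writeback = True
    try:
        cache.fsync()
        first = "ok"
    except OSError:
        first = "EIO"
    cache.fsync()  # nothing is dirty any more, so this call succeeds
    second = "ok"
    return first, second, cache.disk
```

`naive_retry_demo()` returns `("EIO", "ok", [])`: the second fsync succeeds even though `b"A"` never reached the simulated disk.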
Yes, most. And several did similar changes (crash-restart -> recovery) to handle it too. It's possible to avoid the issue by using direct IO too, but often that's not the default mode of $database.
No, this is an artifact of storage engine design. Direct I/O is the norm for high-performance storage engines -- they don't use kernel cache or buffered I/O at all -- and many will bypass the file system given the opportunity (i.e. operate on raw block devices directly) which eliminates other classes of undesirable kernel behavior. Ironically, working with raw block devices requires much less code.
Fewer layers of abstraction between your database and the storage hardware make it easier to ensure correct behavior and high performance. Most open source storage engines leave those layers in because it reduces the level of expertise and sophistication required of the code designers -- it allows a broader pool of people to contribute -- though as this case shows it doesn't necessarily make it any easier to ensure correctness.
Another reason is that it's easier to deploy — you can just use some variable space on your filesystem rather than shaving off a new partition.
MacOS documented its abnormal fsync behavior; Golang just didn't follow what was clearly described in those documents. Linux didn't document what happens on fsync failure, so there is really nothing for applications to follow.
Also note that the strange MacOS fsync() behavior is obvious: by default, its fsync latency on the most recent MBP is close to that observed on Intel Optane, when we all know that the MBP comes with a much cheaper and slower SSD than Intel Optane. The same can't be said for the Linux fsync issue here.
Didn't affect LMDB. If an fsync fails the entire transaction is aborted/discarded. Retrying was always inherently OS-dependent and unreliable, better to just toss it all and start over. Any dev who actually RTFM'd and followed POSIX specs would have been fine.
LMDB's crash reliability is flawless.
So I'm trying to do that and it seems to me the documentation directly implies that a second successful call to fsync() necessarily implies that all data was transferred, even if the previous call to fsync() had failed.
I say this because the sentence says "all data for the open file descriptor" is to be transferred, not merely "all data written since the previous fsync to this file descriptor". It follows that any data not transferred in the previous call due to an error ("outstanding I/O operations are not guaranteed to have been completed") must now be transferred if this call is successful.
What am I missing?
I'm sorry, but this... just doesn't make sense. An event can't both happen and also fail to happen.
Also, the queue business is an implementation detail. Our debate is over what the spec mandates, not how a particular implementation behaves.
Consider a multithreaded application where a and b point to the same file. Note this isn't exactly what Postgres does:
"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."
There's no mention of the OS retrying, or leaving the system in a state that a subsequent fsync can retry from where it left off. So you can't assume anything along those lines.
So I don't buy the spec argument.
You have me curious now... I don't know anything about LMDB but I wonder if its msync()-based design really is immune... Could there be a way for an IO error to leave you with a page bogusly marked clean, which differs from the data on disk?
> I don't know anything about LMDB but I wonder if its msync()-based design really is immune.
By default, LMDB doesn't use msync, so this isn't really a realistic scenario.
If there is an I/O error that the OS does not report, then sure, it's possible for LMDB to have a corrupted view of the world. But that would be an OS bug in the first place.
Since we're talking about buffered writes - clearly it's possible for a write() to return success before its data actually gets flushed to disk. And it's possible for a background flush to occur independently of the app calling fsync(). The docs specifically state, if an I/O error occurs during writeback, it will be reported on all file descriptors open on that file. So assuming the OS doesn't have a critical bug here, no, there's no way for an I/O error to leave LMDB with an invalid view of the world.
This is key.
Often programmers do 'assumption based programming'.
"Surely the function will do X, it's the only reasonable thing to do, right?". As human as this is, it is bad practice and leads to unreliable systems.
If the spec doesn't say it, don't assume anything about it, and keep asking. To show that this approach is feasible for anyone, here is an example:
Recently I needed to write an fsync-safe application. The question of whether close()+re-open()+fsync() is safe came up. I found it had been asked on StackOverflow (https://stackoverflow.com/questions/37288453/calling-fsync2-...) but received no answers for a year. I put a 100-reputation bounty on it and quickly got a competent reply quoting the spec and summarising:
> It is not 100% clear whether the 'all currently queued I/O operations associated with the file indicated by [the] file descriptor' applies across processes. Conceptually, I think it should, but the wording isn't there in black and white.
With the spec being unclear, I took the question to the kernel developers (https://marc.info/?m=152535098505449), and was immediately pointed at the Postgres fsyncgate results.
So by spending a few hours on not believing what I wished was true, I managed to avoid writing an unsafe system.
Always ask those in the know (specs and people). Never assume.
Yeesh. The POSIX manual on fsync is intentionally vague to hide cross-platform differences. There are basically no guarantees once an error happens. I guess that's one interpretation of RTFMing, but... clearly it doesn't match user expectations.
If you open the datafile with write access from another process without using the LMDB API, you're intentionally shooting yourself in the foot and all bets are off.
The application should have stopped the first time fsync() reported an error. LMDB does this, instead of futile retry attempts that Postgres et al do. Fundamentally, a library should not automatically retry anything - it should surface all errors back to the caller and let the caller decide what to do next. That's just good design.
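As a rough illustration of "surface the error and let the caller decide", here is a minimal Python sketch. It is not LMDB's code; `commit`, `CommitFailed` and the injectable `fsync` parameter are hypothetical names chosen for the example. The point is only that fsync is called once and a failure is handed to the caller instead of being retried.

```python
import os


class CommitFailed(Exception):
    """Raised to the caller; this layer never retries on its own."""


def commit(fd, payload, fsync=os.fsync):
    # write() may be short, so loop until the whole payload is handed over.
    view = memoryview(payload)
    while view:
        written = os.write(fd, view)
        view = view[written:]
    try:
        fsync(fd)
    except OSError as exc:
        # Do not call fsync again here: a second call may report success
        # even though the data never reached the disk.
        raise CommitFailed("fsync failed; discard this transaction") from exc
```

A caller would catch `CommitFailed`, throw away whatever in-memory state belonged to the transaction, and decide for itself whether to start over, shut down, or run recovery.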
There are more details in the talk I posted earlier, and in the LWN articles related to this issue.
Someone issuing a sync() could cause an error to be cleared before the app's fsync() happens. That's a drag.
As I already wrote - in LMDB, after fsync fails, the transaction is aborted, so none of the partially written data matters.
The kernel should never do this. If sync fails, all future syncs should also fail. This could be relaxed in various ways: sync fails, but we record which files have missing data, so any fsync for just those files also fails.
(Otherwise I agree with LMDB- there should be no retry business on fsync fails).
In LMDB's case you'd need to be pretty unlucky for that bug to bite; all the writes are done at once during txn_commit so you have to issue sync() during the couple milliseconds that might take, before the fsync() at the end of the writes. Also, it has to be a single-sector failure, otherwise a write after the sync will still fail and be caught.
If only a single page is bad, and the majority of the writes are ok, then you still have to be unlucky enough for the sync to run after the bad write; if it runs before the bad write then fsync will still see the error.
Thank you and the rest of the contributors for such a great library.
Have looked at replacing InnoDB in MySQL, but that code is a lot harder to read, so it's been slow going. Postgres doesn't have a modular storage interface, so it would be even uglier to overhaul.
Not really, I think. Page-level checksums don't protect against entire writes going missing, unfortunately.
I remember highlighting this problem to the Firebird db developers maybe 13 years ago. AFAIR they were open to the problem I'd pointed out and went about fixing it on other platforms so that the db behaved everywhere as it did on OS X. I'm probably in the bottom 5% of IT professionals on this site. I'm rather amazed to find out that so late in the day Postgres has come round to fixing this.
I haven't used Firebird in years and can't find a link to the discussion (could have been via email).
Wait, is "direct I/O" the same as O_DIRECT?
The same O_DIRECT that Linus skewered in 2007?
> There really is no valid reason for EVER using O_DIRECT. You need a buffer whatever IO you do, and it might as well be the page cache. There are better ways to control the page cache than play games and think that a page cache isn't necessary.
> Side note: the only reason O_DIRECT exists is because database people are too used to it, because other OS's haven't had enough taste to tell them to do it right, so they've historically hacked their OS to get out of the way.
More background from 2002-2007: https://yarchive.net/comp/linux/o_direct.html
I think there's pretty good reasons to go for DIO for a database. But only when there's a good sysadmin/DBA and when the system is dedicated to the database. There's considerable performance gains in going for DIO (at the cost of significant software complexity), but it's much more sensitive to bad tuning and isn't at all adaptive to overall system demands.
Yes - from a purely technical point of view, DIO is superior in various ways. It allows tuning to specific I/O patterns, etc.
But it's also quite laborious to get right - not only does it require a fair amount of new code, but AFAIK there is significant variability between platforms and storage devices. I'm not sure the project had enough developer bandwidth back then, or, put differently, it was more efficient to spend the developer time on other stuff with a better cost/benefit ratio.
One HUGE reason is performance for large sequential writes when using fast media and slow processors. Specifically, when the write speed of the media is in the same ballpark as the effective memcpy() speed of the processor itself, which, believe it or not, is very possible today (but was probably more unlikely in 2007) when working with embedded systems and modern NVMe media.
Consider a system with an SoC like the NXP T2080 that has a high speed internal memory bus, DDR3, and PCIe 3.0 support. The processor cores themselves are relatively slow PowerPC cores, but the chip has a ton of memory bandwidth.
Now assume you had a contiguous 512 MiB chunk of data to write to an NVMe drive capable of a sustained write rate of 2000 MB/s.
The processor core itself can barely move 2000 MB/s of data, so it’s clear why direct IO would perform better since you’re telling the drive to pull the data directly from the buffer instead of memcpy-ing it into an intermediate kernel buffer first. With direct IO, you can perform zero-copy writes to the media.
This is why I’m able to achieve higher sequential write performance on some modern NVMe drives than most benchmark sites report, all while using a 1.2 GHz PowerPC system.
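A back-of-the-envelope calculation shows the scale of the effect. The Python sketch below uses the numbers from the comment (512 MiB, 2000 MB/s) and adds two assumptions of its own: the memcpy rate is taken as exactly 2000 MB/s, and in the buffered case the copy and the device write are assumed not to overlap at all, which is a pessimistic simplification.

```python
MIB = 1024 * 1024
MB = 1000 * 1000


def transfer_seconds(size_bytes, rate_bytes_per_s):
    return size_bytes / rate_bytes_per_s


size = 512 * MIB
drive_rate = 2000 * MB   # sustained write rate from the comment
memcpy_rate = 2000 * MB  # assumption: the core "can barely move 2000 MB/s"

# Direct IO: the drive pulls the data straight from the user buffer.
direct = transfer_seconds(size, drive_rate)

# Buffered IO (assumption: copy into the page cache, then write, no overlap).
buffered = transfer_seconds(size, memcpy_rate) + direct

print(f"direct:   {direct:.3f} s")    # about 0.268 s
print(f"buffered: {buffered:.3f} s")  # about 0.537 s
```

Under these assumptions the extra copy halves the effective throughput to roughly 1000 MB/s. Real systems overlap copying and writeback to some degree, so the actual gap depends on the hardware.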
No, there are other ways of doing DIO besides that particular interface.
people were told 'do several sync commands, typing each by hand'
But this mutated to just 'sync three times', so of course people started writing 'sync; sync; sync'
./eod.sh && sync && sync
I forgot to sync almost every time, but it would always boot after a fsck.
Needless to say, this tripped me up a few times and videos weren't fully transferred.
> All I/O to the filesystem should be done synchronously. In the case of media with a limited number of write cycles (e.g. some flash drives), sync may cause life-cycle shortening.
unfortunately the upstream patches being backported don't provide any real write guarantee
Postgres does open-write-close and then later, in another unrelated process, open-fsync-close. They discovered this doesn’t always work, because if somebody somewhere does a fsync on that file for any reason, their process might miss the error as it doesn’t stick.
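The access pattern can be sketched in a few lines of Python. This is only an illustration of the shape of the pattern, not Postgres code; `write_then_close` and `fsync_by_path` are hypothetical helpers, and in the real system the two steps run in different processes.

```python
import os


def write_then_close(path, data):
    # Step 1 (e.g. a backend): buffered write, then close. No fsync here.
    with open(path, "ab") as f:
        f.write(data)


def fsync_by_path(path):
    # Step 2 (e.g. a checkpointer, later): open a fresh descriptor and fsync.
    fd = os.open(path, os.O_RDWR)
    try:
        # On the kernels discussed in this thread, success here does not
        # prove that the earlier writes reached disk: a writeback error may
        # already have been reported to, and cleared by, someone else.
        os.fsync(fd)
    finally:
        os.close(fd)
```

On a healthy disk this works as expected. The problem described above only shows up when writeback fails between the two steps and the error is consumed before `fsync_by_path` runs.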
You are right, this affects other things too. And indeed several projects took action to remove any notion of fsync retry, referencing the LWN article etc.
POSIX is probably underspecified. An Austin Group defect report would be interesting...
Does anyone know if Windows's FlushFileBuffers is susceptible to this as well? (P.S., interesting bit about FILE_FLAG_WRITE_THROUGH not working as you might expect: https://blogs.msdn.microsoft.com/oldnewthing/20170510-00/?p=...)
Only FreeBSD & Illumos do the sane thing.
I wonder how Oracle handles this. Raw device/partition and its own FS business logic?
(Which leaves the question of how oracle handles this unanswered, of course)
Specifically it seems like asking for trouble to open() a path a second time, and expect fsync() calls to apply to writes done on the other file descriptor object - there's no guarantee they're even the same file! At the very least, pass the actual file descriptor to the other process. Linux's initial behavior was in violation of what I've said here, but the patch to propagate the error to every fd open at the time of the error should resolve this to satisfaction.
I would think about the only reasonable assumption you could make about concurrent cross-fd syncing is that after a successful fsync(fd), any successful read(fd) should return data that was on disk at the time of that fsync(fd) or later. In other words, dirty pages shouldn't be returned by read() and then subsequently fail being flushed out.
disclaimer: It's been a while since I've written syscall code and I've had the luxury of never really having to really lean into the spec to obtain every last bit of performance. Given how many guarantees the worse-is-better philosophy forewent, I don't see much point to internalize what few are there.
 Ya sure, you can assert that the user shouldn't be otherwise playing around with the files, I'm just pointing out that translating path->fd isn't a pure function.
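The "path->fd isn't a pure function" point is easy to demonstrate on a POSIX system. In the Python sketch below (the function name `demo` and the file names are arbitrary), the same path is opened twice, with the file replaced in between, and the two descriptors end up referring to different files.

```python
import os
import tempfile


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "data")
        with open(path, "wb") as f:
            f.write(b"old")
        fd_first = os.open(path, os.O_RDONLY)

        # Something else atomically replaces the file at the same path.
        tmp = path + ".new"
        with open(tmp, "wb") as f:
            f.write(b"new")
        os.replace(tmp, path)

        fd_second = os.open(path, os.O_RDONLY)
        try:
            # Compares device and inode numbers of the two open files.
            return os.path.sameopenfile(fd_first, fd_second)
        finally:
            os.close(fd_first)
            os.close(fd_second)
```

`demo()` returns `False`: an fsync on the second descriptor says nothing about writes made through the first one, because they are not the same file.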
Does MySQL have the same flaw?
MySQL in buffered IO mode surely has the same problem, and has implemented the same solution: PANIC rather than retrying. Same for WiredTiger (MongoDB).
Other systems using direct (non buffered) IO are not affected.
Sure, it mightn't be made in this thread yet, but that doesn't make it an irrelevant, invalid, or uninteresting observation. I think that the spirit of discussion, so integral to what separates HN from other websites, means we should not poo-poo this line of inquiry just because you're bored of it.
A system can be closed and secure; you just can't verify it.
I really think it does. It's like "the sky isn't green!" or "the earth isn't flat!" or "vaccines don't cause autism!" Sure, these are all true things, but they weren't exactly topics of discussion on this thread before you brought them up.
By all means, discuss the article, and rebut comments you feel espouse an inaccurate worldview. (IMO) preemptive rebuttals like this are only useful or interesting when they're somewhat novel, or represent some special insight into a particular field that outsiders wouldn't have. This one has neither.
My particular take on why this dead horse is irrelevant (as well as tedious and boring):
Fsync isn't a security issue, it's a data loss issue. Arguably, the Postgres behavior is quite reasonable and the article's headline is just inaccurate. Linux has been reviewed, e.g., https://danluu.com/file-consistency/ from 2017, summarizing research from 2001-2014, all of which pointed towards deficiencies in its data preservation behavior. The Linux community know they lose data and propose that users should accept it.
The Postgres <-> Linux fsync investigation has been ongoing for a long time, with lots of eyeballs on both sides of the kernel/userspace boundary. This isn't really a "bug escapes major application developers for 20 years!" so much as "Linux can't agree to provide an API to make file data consistent."
Well, we're sorry we didn't recognise you as the discussion warden, but I think that's how a conversation works: people are free to bring up the points that they feel relevant, and people can either continue the train of thought or not. If it has no appeal to you, you're free to let it die a natural death rather than make pronouncements on what's relevant or not.
With closed source though I often don't know the popularity and how many/what kind of bugs have been reported, but just the reputation of the vendor.
I prefer a popular software from a highly reputable vendor over a somewhat popular OSS. But I also prefer the most popular and battletested OSS, like postgresql and linux, over any closed system, e.g. SQL Server and Windows.
It would be quite possible for a review, and even extensive testing, to fail to pick up on some system-specific subtleties.