The Business Case for Formal Methods (www.hillelwayne.com)
115 points by whack 5 months ago | 65 comments





I've long been intrigued by pure FP, advanced type systems and proof assistants from an academic point of view, but I haven't yet been convinced that the effort involved in learning such systems/methods pays off in most cases (and that's before counting the costs you still incur once you already know the methods well). I will still try to learn more about this, but without any strong conviction that it will be immediately useful.

By contrast, I've only more recently started hearing more and more about model checkers such as TLA+. The case studies in this blog post give good motivation for why such systems might be useful (and they honestly present the drawbacks, something advocates of e.g. PFP rarely do). I would love to use something like that one day.

As an aside, I still think it's a disgrace how irrelevant empirical research is to most aspects of today's software development. Another blog post by the author cites a paper claiming that very simple testing, especially of error conditions, would prevent a huge chunk of production bugs, yet things like that are rarely talked about. The whole profession is unfortunately amateurish, always obsessed with fads and trends instead of taking a more analytic and empirical approach, which makes me sad.


The Redex modeling tool [1] is an interesting point in this space. Its specialty is making executable specifications of domain-specific languages. Redex models can be tested with the unit-testing framework of the Racket dialect of Scheme. There's plenty of excellent documentation on Redex, and it has some neat visualization tools. I've gotten plenty of value from using it for my research over the years, but I suspect it'd be useful outside of academic research too.

[1] https://redex.racket-lang.org/


Might want to show some examples in the "Why Not To Use Formal Methods" section too.

Somewhat old numbers from seL4: [1]

- 2,500 lines of C code

- 100,000 lines of proofs

- ~7.6 person-years of full-time work (with a lot of involvement from PhDs)

Interactive theorem proving definitely will not scale to significantly-sized C codebases anytime soon.

[1]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5597727/


This article is about design specification and model-checking of designs; seL4 used a completely different formal method.

Formal Methods (for software assurance) is the name of a research discipline studying the following methods, as well as programming language semantics:

* Formal specification of either code or design

* Model checking

* Sound static analysis

* Machine-checked/assisted deductive proof systems and dependent type systems

* Concolic testing and automatic test generation

* Program synthesis

Some of the verification methods above (like some model checkers and sound static analyses) can be used without any explicit specification whatsoever (e.g. they're used to verify the absence of undefined behavior in C programs). The name "Formal Methods" covers such a wide range of tools and techniques, with such widely varying goals, costs and appropriate domains, that it doesn't really mean much to talk of "formal methods" as if it were a specific thing. Rather, it is a general approach used in constructing widely differing tools and techniques. That's why it's called formal methods, not a formal method.


Agreed - that's why I said at the beginning that I'm only talking about design verification. Verifying code is a long way off from going mainstream. I talk about the limitations of code verification in this essay: https://www.hillelwayne.com/post/why-dont-people-use-formal-...

I assume you mean "full" code verification in some sense. Partial verification, especially of properties that are "local" and where verification can be done in a way that's "based" on the language syntax itself, is relatively common. This kind of verification is often called "type checking" and is often not even thought of as a means of code verification.

One can of course object that requiring the use of a language where verification is "baked in" like this might be undesirable; but either way, it seems clear even to formal-methods proponents that well-chosen language semantics are going to be very important if code verification is to be feasible. So it's six of one, half a dozen of the other.
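To make the point concrete, here's a trivial Python sketch (hypothetical names, checked with mypy rather than any fancier system) of the kind of "local", syntax-directed verification I mean:

    from typing import Optional

    def parse_port(s: str) -> Optional[int]:
        # Returns None rather than raising on bad input.
        return int(s) if s.isdigit() else None

    def connect(host: str, port: int) -> str:
        return f"{host}:{port}"

    def main(arg: str) -> str:
        port = parse_port(arg)
        # mypy rejects this call: `port` has type Optional[int] but
        # `connect` requires int. A small, local proof obligation,
        # discharged automatically - and nobody calls it "verification".
        return connect("example.com", port)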


Maybe C is just not the best tool for such code.

I would suspect that (subsets of) Ada, or Rust, or maybe OCaml would be more fitting while staying on the industrial side. (Likely Haskell or Idris would be even easier targets, but their industrial adoption is quite limited.)


The seL4 code didn't start out as C. First they wrote the whole thing in Isabelle/HOL (a theorem prover), then in Haskell, then translated _that_ to C. Verifying that code is correct is Just Plain Hard.

How does the industrial adoption of OCaml compare to Haskell? (The relative placement of Ada, Rust and Idris seems apparent.)

I know two large companies using OCaml to a serious extent: Jane Street (obviously) and Facebook (as Reason).

I can't remember any large company where Haskell would be front and center, but a number of smaller companies doing so exist. There was a post here on HN recently about running Haskell in production.


Why do you think so? I would expect the vast majority of the proof to be about the semantics of the OS, not about language-level details.

Formally proving complex properties is just an inherently difficult and time-consuming task. That is the main reason it is not often undertaken, even in mathematics if I understand correctly.


It probably depends on your trust assumptions and requirements. Ideally we'd write formally provable code for a formally proven compiler, running on a formally proven OS/BIOS/hardware stack.

But some use cases require safe client code that will nevertheless run on unsafe OS and hardware. Client-side and some server-side cryptocurrency apps are one example that comes to mind. Probably banking apps too. It's still useful to write them in strongly typed languages amenable to formal verification, even though they'll be running on unverified Windows, Linux, MacOS and smartphones.

Also, we really need OSes, firmware, etc. written in languages amenable to formal verification, but given the massive lock-in of the mainstream options, that's a pipe dream for now. Still a worthy long-term goal.


I'm not sure that there are languages more or less amenable to full formal verification. Formal verification of a system is such a complex task that the language-level details are unlikely to be a significant proportion of the effort.

You can also look at existing formally verified software and probably confirm my assumption - there is likely as much formally verified C code (a weakly typed, memory-unsafe language) as formally verified Haskell code.

And I do not agree at all that a formally verified banking app is a good idea. It would likely be so difficult to do that by the time it is finished it no longer runs on any current version of any OS.

Formal verification is just too hard to be worth it. There are better methods of ensuring software correctness that are far less costly. Maybe something will be discovered in the future that will make it easier, but for now formal verification of any piece of software more complex than a few hundred lines of code is a multi-year research-level project.


The (famously robust) SQLite has 92 million lines of tests, for 139 thousand lines of code. How many person-years did that take to create?
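(If those figures are right, that's roughly 92,000,000 / 139,000 ≈ 660 lines of test code per line of library code.)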

And then there are the hundreds of software libraries I've seen throughout my career that I never got to use because they were just too buggy.

From where I stand, C doesn't scale to significantly-sized codebases. Nobody I know uses it except when they need to glue to some legacy C code that's been tested in production for the past few decades.

It's easy to say something is a lot of work, but in this case every one of the alternatives looks worse. I don't have the time or resources to write 1000x my program in test code, to debug it in production for 10 years, or to throw it all away because I never achieve a minimum usable level of quality. Formal methods aren't looking so bad to me right now!


Yup. This shows why we need languages that are more amenable to formal methods. Proving stuff requires you to write short code that is obviously correct.

For most programs, an 'informal' proof would already be a huge improvement. The effort you have to put into informal proofs is much lower, but the main drawback is that they are not machine-checked, so you might make mistakes in the proof (on the other hand, you can make mistakes in the specification as well).

The holy grail would be formal methods that work mostly like informal proofs. With today's methods you would only prove the smallest, most critical parts of the code correct.


If you're going to write a formal spec in the first place, you can then "fuzz" the code against the spec, or parts of the spec against itself. I.e. try to find an automated refutation as opposed to a proof. Model checkers work according to this principle, and I assume that the combination of "informal" proof and automated model checking can be especially helpful for suitably-sized systems. Either one of these without the other seems a bit harder to take seriously.
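To illustrate the principle with a toy (hand-rolled Python, not TLA+/TLC, and the two-worker counter "system" is made up): enumerate the interleavings and hand back a counterexample trace instead of a proof.

    from collections import deque

    # Toy "spec": two workers each do read-then-write on a shared counter.
    # Invariant: once both workers are done, counter == 2. The point is the
    # search - explore interleavings and return a refutation if one exists.

    def step(state, w):
        counter, pcs, tmps = state
        pc, tmp = pcs[w], tmps[w]
        if pc == 0:                                   # read counter into tmp
            tmps = tuple(counter if i == w else t for i, t in enumerate(tmps))
            pcs = tuple(1 if i == w else p for i, p in enumerate(pcs))
            return (counter, pcs, tmps)
        if pc == 1:                                   # write back tmp + 1
            pcs = tuple(2 if i == w else p for i, p in enumerate(pcs))
            return (tmp + 1, pcs, tmps)
        return None                                   # worker already done

    def refute():
        init = (0, (0, 0), (0, 0))
        frontier, seen = deque([(init, [])]), {init}
        while frontier:
            state, trace = frontier.popleft()
            counter, pcs, _ = state
            if pcs == (2, 2) and counter != 2:
                return trace                          # a bad interleaving
            for w in (0, 1):
                nxt = step(state, w)
                if nxt is not None and nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, trace + [f"worker {w} ran pc={pcs[w]}"]))
        return None                                   # invariant holds everywhere

    print(refute())   # e.g. ['worker 0 ran pc=0', 'worker 1 ran pc=0', ...]

A real model checker like TLC does essentially this search, just over a much richer state space and with a proper specification language.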

I don't know why the notion that formal methods are some kind of unattainable utopian ideal continues to have currency. This line of thinking is exactly what Dijkstra prophesied a flawed CS education would lead us to: believing that testing software is a viable way to ensure its correctness.

The case studies are all race conditions in distributed systems. Is TLA+ useful for complex serial systems like compilers, etc.?

It's possible to use TLA+ for complex serial systems, and it can definitely help there. The thing is that TLA+ is designed to be good at modeling concurrent problems, which have a lot of specific challenges you don't see in serial systems. So if you know your system is serial and nondeterministic, then you're probably better off with something that focuses on that part of the design space.

To paraphrase Edmund Clarke's Turing lecture[1], the problems of verifying software include "floats, strings, user-defined types, procedures, concurrency, generics, storage, libraries…" If your tool doesn't have to worry about verifying concurrency, you've got more resources to devote to verifying the other things.

That's also a space I don't have as much familiarity with, so I don't know what the best tools are there. My vague understanding is that at that stage the problem "simplifies" enough that it's a lot easier to verify the code itself. Like how CompCert isn't just a model of a compiler, it's an actual, proven-correct compiler. So you can apparently work much closer to the code when verifying serial things.

At one point I got a bunch of people in FM to write provably correct versions of Leftpad.[2] Maybe one of those would work best for you? Note also that one of the examples proves Java code correct via inline JML annotations.

[1]: https://amturing.acm.org/vp/clarke_1167964.cfm

[2]: https://github.com/hwayne/lets-prove-leftpad


Do you really mean "nondeterministic"? AIUI, what people call 'TLA' was designed by adding a bunch of modalities (you can think of these as "monads" if you like[0]) for nondeterminism and "time" (state transitions) to standard propositional logic. The rest of it is really a matter of what's idiomatic, such as using the nondeterminism modality to account for spec refinement as well. So it would seem that TLA+ should be enough to deal with these cases.

[0] Or burritos. Mmm, burritos. /s


I think he meant to write deterministic. The exact nature of the TLA logic is not the relevant point here. TLA (and so TLA+) relies extensively on nondeterminism for things like specifying code at arbitrary levels of detail (e.g. you can say that after this step the list is sorted somehow, without specifying how), interaction with the environment (the user performs one of the possible operations, or some arbitrary machine fails), or concurrency in its programming sense (the OS will schedule one of these several operations next).
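As a loose analogy in Python (not TLA+, and checked here only by sampling rather than by a model checker): the "spec" below relates the before/after states without pinning down the algorithm, and several different implementations refine it.

    import random
    from collections import Counter

    # The "spec" only says: afterwards the list is sorted and is a
    # permutation of what was there before. It says nothing about *how*
    # the sorting happens - that's the nondeterminism.
    def sort_step_ok(before, after):
        return list(after) == sorted(after) and Counter(after) == Counter(before)

    def builtin_sort(xs):            # one refinement of the spec
        return sorted(xs)

    def insertion_sort(xs):          # a different algorithm, same relation
        out = []
        for x in xs:
            i = 0
            while i < len(out) and out[i] < x:
                i += 1
            out.insert(i, x)
        return out

    # Both implementations satisfy the same relation.
    for _ in range(1000):
        xs = [random.randint(0, 9) for _ in range(random.randint(0, 8))]
        assert sort_step_ok(xs, builtin_sort(xs))
        assert sort_step_ok(xs, insertion_sort(xs))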

I agree, mostly, but understanding "the exact nature" of a logic (or at least a properly-defined subset of it) is pretty important for using it effectively.

If you're interested, I've written a rather detailed explanation of the TLA+ logic. Part 1 is an introduction that explains the design from a UX perspective, part 2 is about the + side of the logic, and parts 3 and 4 are about TLA: https://pron.github.io/tlaplus

TLA+ is really good at finding problematic traces involving concurrent agents, so it naturally shines in distributed systems.

For something that relies on finding problematic inputs (like for compilers), SMT solvers like Z3 are a more natural fit.
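For instance, a made-up, toy-scale use of Z3's Python bindings to search for a problematic input rather than to check a design:

    from z3 import BitVec, Solver, sat   # pip install z3-solver

    # Ask the solver for positive 32-bit ints whose sum wraps around,
    # i.e. a concrete input that triggers signed overflow.
    x, y = BitVec('x', 32), BitVec('y', 32)
    s = Solver()
    s.add(x > 0, y > 0, x + y < 0)       # comparisons are signed by default
    if s.check() == sat:
        print(s.model())                 # a witness, e.g. two large values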


The use of formal methods at NASA proves that they are useful, and in my opinion the software industry should give them much more importance.

https://ntrs.nasa.gov/search.jsp?N=0&Ntk=All&Ntt=formal%20me...


It proves they are useful in the extreme circumstances that NASA projects represent.

It doesn't prove that they provide a good ROI for a business looking to ship CRUD apps yesterday.


Our mobile apps actually drive revenue and customer satisfaction, with budgets that are already hard to justify (high 7 to low/mid 8 figures), and they always live in a highly complex service and operational environment. If you added the cost of formal methods, nothing would ever be approved at all. The alternative - not doing formal methods and having stuff that mostly works, makes us money, and keeps customers reasonably happy - is good enough. (These aren't CRUD apps; I'm not even sure what that means in a real business. Is Facebook a CRUD app? Maybe a crap app, but still.)

I could see formal methods being justified for things that might kill people, but even there, spending all of your money to deliver perfect code vs. still being in business with something good enough is easy math for most CEOs.


Create, Read, Update, Delete - it often refers to apps that are largely concerned with entering and transforming data stored in a backend DB.

If we're going to talk about the formality of methods, let's remember that "collection of positive anecdotes with back-of-envelope guesstimates" is not what's meant by a "business case".

An ROI or IRR calculation might not have anything like the elegance of a proof, but it's a method and it's formalised.
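For instance, with entirely made-up numbers: say writing and model-checking a spec costs two engineer-weeks (call it $10k fully loaded), and you estimate it prevents one production incident per year with an expected cost of $60k.

    ROI = (60,000 - 10,000) / 10,000 = 500%

The guesswork is all in the inputs, but the calculation itself is the formalised part - and it's the kind of thing a real business case has to contain.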


The next big breakthrough will be a program that can convert TLA+ directly into production code. And create all of the necessary tests, with a formal proof that the code is correct.

I actually think the opposite is true. TLA+'s strength in a business context is precisely that it is not production code.

This considerably derisks TLA+ because you cannot end up depending on it; put another way, any deficiencies in TLA+ or its toolchain will not become production blockers for your team. You cannot lose more time than the time you spent writing your spec.

You can use TLA+ to specify as much or as little of your system as you want. You can upgrade your system independently of TLA+. You don't have to wait for TLA+ to support or integrate with your language.

I've said this elsewhere, but the closest competitor to TLA+ on most teams is decent high-level documentation, which is precisely what most teams lack.

Software tools, like many other things, suffer from an uncanny valley effect. If you're not tempted at all to depend on a tool to generate code for you, you know the limits of the tool and can be quite happy. If the tool integrates smoothly with the rest of your code, it's even better. However, if your tool integrates in a somewhat bumpy way with the rest of your code, then it becomes worse than either of the two previous alternatives.

The TLA+ community already has limited resources, which I think are better spent on making TLA+ an even better design language before trying to cross the much bigger and more arduous gap of making TLA+ executable (a goal which I suspect most of the community would actually oppose anyway).


It already exists (not with TLA+, but for other formal systems). This is bog-standard code synthesis. The problem is that it does not scale very well.

There is a better way:

https://news.ycombinator.com/item?id=22278363


Our (sound) formal verification tools and knowledge are nowhere near anything that could do this affordably for programs of common sizes, and it's unclear they ever will be (the gap between the programs we can verify soundly and the size of the programs we write has stayed roughly constant for decades). The hottest approach these days in formal methods that operate directly on the code of real-world-sized programs is reducing the soundness of verification with approaches like concolic testing.

Hence using humans.

Humans don't do this affordably, either. Programs proven end-to-end correct using human provers and semi-automated proof assistants have required extreme simplification of the program so that it could be verified at all (so no clever, efficient algorithms), plus teams of graduate-student proof-monkeys working for years, and even then the biggest programs we could prove that way were ~10 KLOC. The greatest achievement of this semi-automated approach required years of work by highly trained specialists to prove simplified programs that amount to 1/5th of jQuery.

We have some very interesting advances in formal verification, but that is probably not the way.


If you see my comment, it proposes using large Infosys-style teams. You don't need graduates per se, just people who have an intuition for logic and math, of which there are plenty. Plus, you can prove isolated pieces, e.g. TCP/IP stacks, instead of going end-to-end. It would not be as beneficial as full formal verification, but it is still a large step toward securing the most vulnerable targets such as IoT, industrial control systems, financial services, etc.

Your program may be able to formally prove that the code corresponds to the TLA+. It may be able to generate tests to show that the code does what the TLA+ says it should do. But a program can't prove that the TLA+ is correct[1]. The most it could prove is that it is not self-inconsistent.

So now you move from debugging the production code to debugging the TLA+. That's still an improvement.

[1] Unless you take the TLA+ code to be the definition of "correct". That's a possible position, but I don't adhere to it. It seems to me more reasonable to be able to say that the spec is wrong if it doesn't correspond to what is actually needed, and the TLA+ is wrong if it doesn't accurately encode the spec. (For example, the MCAS software did exactly what the spec said. But the spec was wrong.)


But how do you unambiguously specify "what is needed"? I would think you need some high-level formal language for doing that.

So I wonder if instead of proving that some hand-written code corresponds to a specification, wouldn't it be better to think of the specification as a high-level program and then somehow translate into some lower level conventional programming language?


> But how do you unambiguously specify "what is needed"?

You don't. You follow the scientific method of building a model specification for "what is needed", throwing it into the world, and tweaking and improving it where you find that some of its properties are not a good fit for the problem it's intended to solve - primarily by gathering feedback from the people that are using it and finding its flaws.


This is not always possible due to the halting problem [1] and Gödel's incompleteness theorems [2]. It is not always possible to formally determine/prove the behavior of a program.

[1]: https://en.wikipedia.org/wiki/Halting_problem

[2]: https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...


Doesn't really matter if only unrealistic edge cases are unprovable.

The problem is, you don't know that. The software industry is this complex because obscure yet realistic edge cases happen much more often than you'd expect.

So the ones that aren't edge cases can still benefit greatly, rather than not at all.

What good is applying formal verification to get a proof that part of your input space behaves properly while the edge cases remain unproven? That is already what test cases do.

Most software is boring and derivative.

Saw Hillel speak at YOW! Melbourne. Great speaker. Prop comedy and formal methods. Highly recommend.

That's not so much a business case as a contrived example. I absolutely would expect system testing and QA to find such a bug, or else the testing isn't up to scratch.

What you're proposing is that all engineers now effectively learn a new language and translate requirements into it. This expects requirements to be fully fleshed out themselves (a luxury few have these days) and relatively static (or everything needs to be reworked).

Good luck with that...


> I absolutely would expect system testing and QA to find such a bug

You can't test what you don't have. With design-based formal methods you can "fuzz" the high-level design and spot very real problems in it, before writing a single line of ordinary code.


That's not what the article is claiming though, and in fact when the code is written is irrelevant to the business case. We're talking about spotting production issues before they arise.

The article makes the very strong claim that QA aren't going to catch the bug that arises from this design omission and it's going to cost your company money as everything goes wrong in the field.

I don't believe it would get that far IRL as any dev with a brain is going to spot it, and even so QA should certainly have test cases for this sort of situation if they're doing their job right.

You can argue this may be a better approach that catches potential problems earlier; have at it. You can argue that there are classes of error it will catch that you normally wouldn't (some of the preamble and larger-scale stuff does seem to show this). But the illustrated example doesn't hold up to scrutiny and is overstated. As such, I think it undermines the point the author is trying to make, because it looks to me like something we'd catch anyway.


> But the illustrated example doesn't hold up to scrutiny and is overstated.

The illustrated example is intended to be comparatively trivial. It's just clarifying what this kind of verification actually involves in practice. It makes no sense to describe it as "overstated", since the article actually mentions several case studies as demonstrating the usefulness of this approach; it's just that these cannot be examined in detail in an introductory article.


It's attempting, AFAICT, to demonstrate TLA+ catching a class of very dangerous problem that traditional approaches won't/can't catch. It fails and in doing so it undermines the business case the wider article attempts to make.

I think perhaps the example has just been simplified too far.


I think the argument is that if you've been around distributed systems long enough, you will encounter race conditions. Sure, it's OK to say "Well, the testing infrastructure isn't up to snuff, so we just need to fix it," but at scale this is impractical.

Check out the fallacies of distributed computing[0]. If your testing system can simulate all of those edge cases, it probably looks a lot like TLA+.

[0] https://en.wikipedia.org/wiki/Fallacies_of_distributed_compu...


I agree - once you've been around parallel and distributed computing for a while you do notice this stuff, which is why I think the example given in the article isn't a particularly good one. That's a really noticeable design omission, and I would expect it to be caught pretty early in any dev/QA process, as the developer implements the functionality and thinks "Hey, what if .. don't we need to invalidate the other offers?"

I'm sure there are good cases for using TLA+, I'm sure there are situations where it's not only useful for catching errors before they even happen, but in which this more than offsets the upfront costs of the exercise.

I guess I just came away from the article not feeling that such had been demonstrated; in fact, I came away with the feeling that the example was contrived to fit the agenda and didn't actually show much.


I think it's a great example. It boils down to simple code whose behavior within a degraded network is not obvious. Your dev/QA could even have the thought and actually test it, but it works under ideal circumstances, giving them false confidence.

I'll go out on a limb and say it is nearly impossible to design and build a test environment that can simulate all network conditions, so that even in trivial cases where a dev might know for a fact that there is an issue, it'll be incredibly hard to reproduce it.
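Here's a rough Python sketch of what I mean (hypothetical names, loosely modeled on the article's offer scenario, not anyone's real service code): the same logic passes an "ideal network" test and breaks once the invalidation message can be lost.

    import random

    # Redeeming one offer should invalidate the others, but the invalidation
    # travels as a message over a network that may lose it.
    def run(drop_prob, seed):
        rng = random.Random(seed)
        offers = {"A": "valid", "B": "valid"}
        redeemed = []
        inbox = []                                    # in-flight invalidations

        for offer in ("A", "B"):                      # the user tries both offers
            for target in inbox[:]:                   # deliver (or lose) pending messages
                if rng.random() >= drop_prob:
                    offers[target] = "invalid"
                inbox.remove(target)
            if offers[offer] == "valid":
                redeemed.append(offer)
                inbox += [o for o in offers if o != offer]
        return redeemed

    # "Ideal network" test: passes, exactly one redemption.
    assert len(run(drop_prob=0.0, seed=1)) == 1

    # Same code on a lossy network: the invariant breaks on some seeds.
    bad = [s for s in range(100) if len(run(drop_prob=0.5, seed=s)) > 1]
    print(f"{len(bad)} of 100 runs redeemed both offers")

The ideal-network run is exactly the false confidence described above; a model checker enumerates the lossy interleavings exhaustively instead of hoping a test happens to hit one.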

Maybe put another way: formal methods give dev/QA cover to avoid shipping code with bugs that are known but hard to demonstrate - bugs they will ultimately be held responsible for.


Keep in mind, there are different approaches to designing distributed systems. You can do pretty much everything without sharing or concurrently accessing any resources, and without relying on the Paxos family of algorithms, etc. It does require you to make sure that your algorithms converge, can be retried, and so on, but that is local reasoning, easily testable, and it doesn't look anything like TLA+, nor can it benefit from it. If you need to model 35 high-level steps, your system is definitely not designed that way; it is too complex, and even philosophically it is not in the camp of making distributed algorithms so simple that they can be easily reasoned about. I guess what I'm saying is that if you need TLA+ you are very likely doing it wrong.

> I guess what I'm saying is that if you need TLA+ you are very likely doing it wrong.

Agree to disagree. If your system is distributed it requires you to share data and act concurrently. Otherwise it isn't a distributed system. No amount of retries will cover all of the screwed up things that can happen on an unreliable network.


> That's not what the article is claiming though

I'm not sure how I can respond to that, other than pointing you to the Intro section (it's only a single, short paragraph). It seems clear to me that this is what the article is supposed to be about.


I'm not sure how I can respond to that, given that the text explicitly says -

> The only symptom is credits mysteriously disappearing from people’s accounts. Your QA and monitoring won’t find this.

and

> It will cause your customers to lose money and stop trusting you.

Thread over I guess... :shrug:


The article is implicitly asking you to make a small imaginative leap - that the bug described is in fact subtle enough that it would never have been caught otherwise - in the service of providing a concise, digestible example. It might have been better for the author to spell this out, but I don't think it's really necessary.

I like how it emphasized the "debuggable" aspects of FMs.

Most people believe FMs are not resilient due to their coupling with the spec. What if the spec changes?

Saying that FMs are a debuggable form of specification, i.e. very close to debugging the design itself, turns that criticism on its head and shows that it is trivially the case that you can iterate on or prototype with FMs.


So what is a "FM"? Isn't it just any very high-level, preferably declarative programming language?

I asked a dev who worked on the AWS application of TLA+ for a comparison between TLA+ and Scala/Haskell; this is what they responded:

https://news.ycombinator.com/item?id=22100536

You're right in a limited sense, except that it's not just any language; only particular languages can count as a Formal Method in the sense of the OP (i.e. they can do model checking in some kind of temporal logic).


The problem is that explaining the upfront cost to 'the business people' is an issue. In many cases, dev budgets are yearly, and that is the horizon for most people (even some who should know better). So even if I say that your TCO is probably (the "probably" is a problem in itself, but at least statistically defensible) going to be lower with formal methods than without, that benefit is spread over many years. There are plenty of systems that require vast refactoring every year because they were badly done in the first place, but that's a concern for each year's budget/manager, not for the one sitting there in the current year.

That goes for verifying code as well; however, that is even harder to sell because it's so much more expensive, and even though I "feel" it is a good idea, I cannot really estimate whether the TCO would be lower than just patching with band-aids over the years.



