Tests Should be Specific [video] (www.youtube.com)
55 points by KentBeck 5 months ago | 37 comments





I would like to hear some words of wisdom, either from your own personal experience or as opined by the gurus, about whether or not to use random data in unit tests.

On the one hand, I understand the argument that random data makes your test irreproducible; so if something breaks the test, it may take a while to figure out exactly what went wrong and why.

On the other hand, I feel that hard-coding test data is too restrictive. For example, in the linked video, they have a 40-hour week with a rate of 8 dollars per hour, and they expect the result of the calculation to equal 320. An immediate question arises: what's so special about these numbers? Would the test pass if the input were different? What about a 38-hour week and 20 dollars per hour? And so on...

What's your take on this?


The answer depends a lot on your terminology.

I would say that unit tests should always be as repeatable as possible, which means no randomization. But I would ALSO say that you should prefer property tests over unit tests whenever it's feasible to do so, and fuzzing with random input is one of the defining features of property tests.

Others consider property tests to be a special class of unit test, which, given that pretty much all tests these days are written and run using a framework with "unit" in the name, is also reasonable terminology.

As for why to use property tests: In a nutshell, because achieving sufficient coverage with unit tests alone is infeasible. One of the unstated assumptions of a primarily unit test oriented strategy is that devs are so on their game that they can just reliably think up all the boundary cases, all on their own, in order to achieve good test coverage. I'd argue that assuming everyone is just that good isn't too far off from assuming everyone is so good that you really don't need tests at all.

To the irreproducibility argument: Any property test framework will hunt for a minimum failing input and then print it out in the test failure report, so that you can then use it to construct a unit test. From there, it's up to the developer to be disciplined, just like it always is. And some particularly clever ones, like Python's Hypothesis, will automatically (locally) remember failing cases and run them first.
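To make the "hunt for a minimum failing input" idea concrete, here is a rough hand-rolled sketch using only the standard library. It is not how Hypothesis or any other framework actually shrinks; `find_failure`, `shrink`, and the deliberately false property `below_100` are all made up for illustration.

```python
import random


def find_failure(prop, rng, tries=1000, lo=-10_000, hi=10_000):
    """Return some integer for which prop is false, or None if none was found."""
    for _ in range(tries):
        x = rng.randint(lo, hi)
        if not prop(x):
            return x
    return None


def shrink(prop, x):
    """Greedily move a failing integer toward zero while it keeps failing."""
    while x != 0:
        step = 1 if x > 0 else -1
        # Try a big jump first (halving), then a small one (one step closer to 0).
        candidates = [int(x / 2), x - step]
        smaller = next((c for c in candidates if not prop(c)), None)
        if smaller is None:
            return x  # no smaller failing value among the candidates
        x = smaller
    return x


def below_100(x):
    """Hypothetical, deliberately false property: every integer is below 100."""
    return x < 100


rng = random.Random(0)  # fixed seed so the search is repeatable
failing = find_failure(below_100, rng)
minimal = shrink(below_100, failing)
print(f"found {failing}, shrunk to {minimal}")
```

The shrunk value is the one worth copying into a plain unit test, since it is much easier to reason about than whatever the random search happened to hit first.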


> I'd argue that assuming everyone is just that good isn't too far off from assuming everyone is so good that you really don't need tests at all.

I'd argue unit tests aren't made to stop developers from pushing bad code, they're made to prevent future developers from breaking it even further.


I'd agree. But unit tests can still only guard the behaviors that someone thought to cover with a test. Property testing is still your friend here, either all by itself or as a way to find all the cases to memorialize with unit tests.

IME, random data is extremely useful. Consider fuzzing, property testing à la QuickCheck, automated fault injection, etc.

The key thing to make it work well is to keep track of your prng seeds to make things deterministic. If you find that seed `484382943` makes a test fail, append it to a list of seeds to always re-test.
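A minimal sketch of that seed bookkeeping, assuming a test whose randomness all flows from a single `random.Random(seed)`. The names `REGRESSION_SEEDS`, `check_sort_with_seed`, and `run_randomized_suite` are hypothetical, and the built-in `sorted` stands in for whatever code is really under test.

```python
import random
from collections import Counter

# Seeds that once exposed a failure; always re-run these first.
REGRESSION_SEEDS = [484382943]


def check_sort_with_seed(seed, size=50):
    """Run one randomized check; every random choice derives from `seed`."""
    rng = random.Random(seed)
    data = [rng.randint(-1000, 1000) for _ in range(size)]
    result = sorted(data)  # stand-in for the code under test
    # Properties: output is ordered and is a permutation of the input.
    assert all(a <= b for a, b in zip(result, result[1:])), f"seed={seed}"
    assert Counter(result) == Counter(data), f"seed={seed}"


def run_randomized_suite(fresh_runs=20):
    """Re-test known-bad seeds, then try some fresh ones."""
    seeds = list(REGRESSION_SEEDS)
    seeds += [random.SystemRandom().randrange(2**32) for _ in range(fresh_runs)]
    for seed in seeds:
        check_sort_with_seed(seed)  # the assertion message names the seed
    return seeds


seeds_used = run_randomized_suite()
```

Because the failing seed appears in the assertion message, a failure from a fresh seed can be appended to `REGRESSION_SEEDS` by hand.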


It only works until the pattern of randomness consumption changes even the tiniest bit. But it's a valuable tool nonetheless, even if only for quick re-runs during a debug/fix session.

For long term regression checks you might want to make sure that your test data object graph can be round-tripped through JSON (or any similar format, but JSON tooling is ubiquitous) to get your random discoveries into a less fragile form (or: fragile in a different, more fixable way).


My thought would be that, instead of re-using the seed in further tests, you should be figuring out what inputs that seed generated and create a test case for those inputs directly.

Are random test cases only for checking code throws (no) exceptions? If not, how do you create the expected results from the random data without reproducing the method you call to begin with?

The only idea I get is to fix the seeds/inputs, write the expected results into a file (making them persistent), and check them manually the first time. Subsequent test runs check whether the output changed.
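That "record once, compare afterwards" approach might look roughly like the sketch below. `weekly_pay`, `generate_cases`, and `check_against_golden` are hypothetical names, and the temporary directory is only there to keep the example self-contained; a real golden file would live in the repository.

```python
import json
import random
import tempfile
from pathlib import Path


def weekly_pay(hours, rate):
    """Hypothetical function under test."""
    return hours * rate


def generate_cases(seed, count=5):
    """Fixed seed, so the same inputs are generated on every run."""
    rng = random.Random(seed)
    return [(rng.randint(0, 60), rng.randint(8, 40)) for _ in range(count)]


def check_against_golden(golden_path, seed=1234):
    cases = generate_cases(seed)
    actual = [{"hours": h, "rate": r, "pay": weekly_pay(h, r)} for h, r in cases]
    if not golden_path.exists():
        # First run: record the outputs so a human can review them once.
        golden_path.write_text(json.dumps(actual, indent=2))
        return "recorded"
    expected = json.loads(golden_path.read_text())
    assert actual == expected, "output changed since the golden file was recorded"
    return "matched"


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "weekly_pay.golden.json"
    first = check_against_golden(path)   # writes the golden file
    second = check_against_golden(path)  # compares against it
```

Note that this only detects change; whether the recorded output was right in the first place still depends on the manual review.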


You can test properties of the result without an assertion of the exact result:

Take adding two integers: you can test that 3 + 18 = 21, or you can assert that, when both arguments are positive, the result is greater than each argument.


No offense, but that's a toy problem that's hard to generalize to actual issues we see in our code. This might apply to some financial code where some complex situation should always result in a greater or smaller result, but what about the things people typically work on? I'd wager a significant percentage of us are writing CRUD apps; what's the crudsmith's equivalent to this?

    expect(add(A, B)).toBeGreaterThan(A);
    expect(add(A, B)).toBeGreaterThan(B);

It works the same way with actual (non-contrived) problems. You'd identify a property you want to test, and assert that it holds for randomized inputs.

Say you have a CRUD api (TheTrackerOfWidgets), and you want to test a create Widget call. Widgets can have a name, size, and each is associated with a unique id.

Your property test might look like (in pseudo-rust):

    #[quickcheck]
    fn created_widgets_exist(name: String, size: u32) -> bool {
        let resp = create_widget(&name, size);
        if !resp.is_valid_json_or_whatever() {
            return false;
        }

        let id = resp.get_id();

        let resp = retrieve_widget(id);
        if !resp.is_valid_json_or_whatever() {
            return false;
        }

        resp.get_name() == name && resp.get_size() == size
    }

You would probably go further and not use default types but instead use bounded types based on any domain-specific verification logic you have. Note that nowhere does it actually assert that a response looks exactly like what you expect; instead, you assert that it is valid and/or satisfies some property (that created widgets exist and are accessible by further API calls).

One test I've done a few times is a "round-trip test" with randomized data, to ensure that things can go in the system and come back out again the same. This could even be done all the way from the browser with selenium or whatever. Even something this simple, you'd be surprised what this can turn up. (I actually write "round trip" tests a lot, though not always randomized.)
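As a rough illustration of a randomized round-trip test, here is a standard-library sketch in which JSON serialization stands in for "the system". `random_widget`, `save`, and `load` are hypothetical; in a real app `save` and `load` would go through the database or the API.

```python
import json
import random
import string

# Include awkward characters on purpose: control characters, quotes, non-ASCII.
ALPHABET = string.printable + "éß日本語"


def random_text(rng, max_len=20):
    return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, max_len)))


def random_widget(rng):
    """Build a random widget record."""
    return {
        "name": random_text(rng),
        "size": rng.randint(0, 2**32 - 1),
        "tags": [random_text(rng, 5) for _ in range(rng.randint(0, 3))],
    }


def save(widget):
    return json.dumps(widget)  # stand-in for "put it into the system"


def load(text):
    return json.loads(text)  # stand-in for "get it back out"


def check_round_trip(seed, runs=200):
    rng = random.Random(seed)
    for _ in range(runs):
        widget = random_widget(rng)
        assert load(save(widget)) == widget, f"seed={seed} widget={widget!r}"
    return runs


checked = check_round_trip(42)
```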

For a CRUD app, I've also created random account generators that can randomly create all different sorts of accounts, based on the database layout. You can then even do things as simple as "render the view screen on this user account", and, again, you might be surprised what pops out. Plus, as you continue building the system, you can both extend your "random account generator" to generate more and more random things, and use it on more and more of your code. This is when you can discover that your page crashes if they don't have exactly the same number of shipping addresses and phone numbers, or your forum crashes on users who didn't enter the optional email contact address, or whatever. Also, as you have the ability to easily generate "100 fairly random accounts", you'll find other interesting tests that you can write.
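A small sketch of such a random account generator, under the assumption of a made-up `Account` shape and a made-up `render_account` view. The point is only that optional fields and list lengths vary independently, so the view gets exercised on combinations nobody wrote by hand.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Account:
    username: str
    email: Optional[str]  # optional contact address
    shipping_addresses: list = field(default_factory=list)
    phone_numbers: list = field(default_factory=list)


def random_account(rng):
    """Generate an account with independently varying optional data."""
    return Account(
        username=f"user{rng.randint(0, 9999)}",
        email=rng.choice([None, "someone@example.com"]),
        shipping_addresses=[
            f"{rng.randint(1, 999)} Main St" for _ in range(rng.randint(0, 3))
        ],
        phone_numbers=[
            f"555-{rng.randint(0, 9999):04d}" for _ in range(rng.randint(0, 3))
        ],
    )


def render_account(account):
    """Hypothetical view; must cope with no email and uneven list lengths."""
    lines = [f"User: {account.username}", f"Email: {account.email or '(none)'}"]
    lines += [f"Ship to: {a}" for a in account.shipping_addresses]
    lines += [f"Phone: {p}" for p in account.phone_numbers]
    return "\n".join(lines)


rng = random.Random(2024)  # fixed seed so a crash can be reproduced
accounts = [random_account(rng) for _ in range(100)]
pages = [render_account(a) for a in accounts]  # the test is "this doesn't crash"
```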

Simple number comparisons are just an easy example.

If your CRUD app happens to extend just a wee bit beyond the boringest possible CRUD and you also have any sort of structured text you need to accept, you can also use a fuzz tool like http://lcamtuf.coredump.cx/afl/ You can find all sorts of interesting things in any sort of structured text input that way quite easily. (Theoretically, if you're motivated, you can even hook up a fuzzer to the random account creator idea above; that does take a bit of work, but it would do interesting things if you could get it going. I've only used something like afl a few times, but it's kinda fun, and amazing.)


I had this exact discussion at work.

Say you have test(X). X is a random number in range A to B.

Say someone else has test_range(A, B). Instead of running for one input, it runs the test for the full range of inputs.

Actually, both tests run on the same set of inputs. The difference is that, for the first test, you're not running all of the inputs at once. You're running some of the inputs with your commit, some with a colleague's later commit, some when a customer downloads the program and tries to run it... by making the input random, you're accepting the possibility of merging code that sometimes fails the tests.

And, actually, it will take you far more time running code to run the full set of inputs because there's nothing preventing you from running the same test twice.

So then I'd ask why you're not just running the full set before you commit. If it's too expensive to run, my opinion is one or more of:

A) You should be running fuzzing 24/7

B) You're using randomness as a means of avoiding deciding the inputs yourself because you don't know the problem space

C) There's not actually a need to test the full set, X being 222348 is isomorphic to X being 222349.

> On the one hand, I understand the argument that random data makes your test irreproducible; so if something breaks the test, it may take a while to figure out exactly what fails the test and why.

Usually I see random tests saving the seed if they fail.


> B) You're using randomness as a means of avoiding deciding the inputs yourself because you don't know the problem space

This is one you can definitely improve on. Learning how to partition your inputs into different groups/types that are likely to bring out different behavior/issues/bugs with the code is important.

If you have an [add] method over integers, testing for [+,+], [+,-], [-,+], [-,-], [0,+], [0,-], [+,0], [-,0], and [0,0] are reasonable. The more complex the method, the more partitions you can wind up with; which also teaches building smaller methods.
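Those nine sign partitions can be written as a small table-driven test. This is only a sketch: `add` is a trivial stand-in and the concrete numbers are arbitrary representatives of each partition.

```python
def add(a, b):
    """Stand-in for the method under test."""
    return a + b


# One representative per sign partition: (label, a, b, expected).
PARTITIONS = [
    ("+,+", 3, 18, 21),
    ("+,-", 3, -18, -15),
    ("-,+", -3, 18, 15),
    ("-,-", -3, -18, -21),
    ("0,+", 0, 18, 18),
    ("0,-", 0, -18, -18),
    ("+,0", 3, 0, 3),
    ("-,0", -3, 0, -3),
    ("0,0", 0, 0, 0),
]


def run_partition_cases():
    """Return the labels of the partitions whose case failed."""
    return [
        label for label, a, b, expected in PARTITIONS if add(a, b) != expected
    ]


failures = run_partition_cases()
```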


I agree. Random data makes the tests less specific, so I'd wager the authors would probably also argue against it.

Assuming you trust your unit tests, you can claim a passing test suite means: (1) given current understanding, the code is most likely correct and (2) based on the same assumption, other developers agree that the code is most likely correct, for the current version of the program

I personally believe randomness has a place (fuzzing), but should stay semantically distinct from unit testing for the above reasons.


Any form of nondeterminism in CI or unit tests should itself be treated as a bug. If something fails, it must be failing no matter whose machine it's failing on, and vice versa for passes. Additionally it should not depend on the order in which tests are run.

The longer it takes to run the full suite, the more important this is.

Yes, sometimes this is unavoidable and you have to decide that it's not worth dealing with, but then you have to corral those tests off into a "these sometimes don't pass lol" section. The regular CI must not waste people's time with unreproducible failures. And it shouldn't waste downstream users' time with unreproducible or by-coincidence passes, either.

If this is not the case, then you let entropy win. Tests that fail "sometimes" become tests that fail most of the time. People get used to re-running until they get the result they want - which means it's passing by luck.

If you need to test RNG or time-based behaviour, make that reproducible as well. Otherwise you end up with a "can't print on Tuesday" bug, do the tests on Wednesday, conclude it works, and ship it.
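One common way to make time-based behaviour reproducible is to pass the date in rather than reading the clock inside the function. A minimal sketch, with a made-up `job_header` function standing in for the printing code:

```python
import datetime


def job_header(today):
    """Hypothetical code under test: takes the date as a parameter
    instead of calling datetime.date.today() internally."""
    return f"Print job for {today.isoformat()} (weekday {today.weekday()})"


def check_every_weekday():
    """Exercise the code on every day of the week, whatever day the tests run."""
    start = datetime.date(2024, 1, 1)  # 2024-01-01 is a Monday
    weekdays = []
    for offset in range(7):
        day = start + datetime.timedelta(days=offset)
        header = job_header(day)
        assert day.isoformat() in header
        weekdays.append(day.weekday())
    return weekdays


weekdays_checked = check_every_weekday()
```

Production code would call `job_header(datetime.date.today())`; the test never depends on the real calendar, so Tuesday gets covered even when CI runs on Wednesday.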

There is definitely a debate to be had about coverage, which I think is what your question about ranges of inputs is really about. In some cases it may be possible to just test everything: https://randomascii.wordpress.com/2014/01/27/theres-only-fou...

.. but again, realistically, you have to look at your line and branch coverage and decide when it's good enough.

(One of my "debugging war stories" that I should get around to writing up was to do with touchscreens; we had all of "touchscreen stops working if you change PSU", "sometimes double-press", and "sometimes misses touches". For the latter two it was years before the business even definitively established it was a software fault and not just users damaging the touchscreen, and the eventual test process we settled on was "have a robot press the screen 10,000 times and count the results".)


> whether or not to use random data in unit tests.

Don't. That's not a unit test anymore; that's fuzzing. Which is a great thing to do! But it doesn't belong in unit tests.

Why? The problem isn't a computer problem, it's a psychology problem.

If your tests start intermittently failing, your developers - you - will just start hitting "re-run tests". You'll tell yourself you won't, of course, but then will come that day when you must fix something in production, so you just slap "run again" and promise you'll fix it later. And maybe you will. But over the course of time, people will just blame "flaky fuzzy tests" on failures, and stop following up, because hey - we've never seen this in production! We'll get around to fixing it, later. And then you start ignoring real test failures, or slapping "retry" half a dozen times before realizing your failures are real.

This will rot your engineering organization.


We’ve got an episode on this. Check out “Tests should be deterministic.” https://youtu.be/PwWyp-wpFiw

To put some more to it: I like my tests to not use random data, but I’ll use a random data generator when coming up with the test cases. That way I don’t get surprises in CI with a flaky test, but I can still generate interesting cases.


You might be interested in (something like) Hypothesis [0]. "It works by letting you write tests that assert that something should be true for every case, not just the ones you happen to think of."

E.g. you specify something should work for all `int`s for example.

[0] https://hypothesis.readthedocs.io/en/latest/


Random data doesn't need to make tests irreproducible. You emit the random seed at the start of the test. Then if you need to reproduce a failure you just pass the same seed as was emitted in the failing test case, while synced to the revision where the test failed. We do this all the time at work.

Your tests are there to prove that every situation you can think of still works correctly after a change. They should be on specific data, not random data, because you are creating specific tests of situations you are afraid could break. As my mentor told me years ago, "Keep writing tests until boredom overtakes fear". In other words, if you are afraid of random data, pick some specific random data and write the specific tests for that.

The problem with random tests is that they don't fail often. Instead of putting random data in a unit test, which can go years before it finds the bad input, you should turn your random test into something that either a fuzzer or a theorem prover (the tools solve slightly different problems, so both are good) can use, and let the tool run for the hours it takes to get results.


Random data is definitely useful, but IMO it shouldn't be used in unit tests. Unit tests are for development; for asserting overall correctness, you should use some combination of integration tests, property-based testing, or formal verification.

The first problem is it makes it easier to write false negative tests. If you're not testing that some result is equal to some hard coded data, then your test might just be asserting equality on two null values, for example.

The second problem is that it's often tempting to write tautological tests. That is, testing some function by comparing its result to the same function, but written in the test file.


Tautological tests are useful, but only if the implementation in the test is much easier to verify than the one under test. A simple implementation of bubble sort (once you solve the off-by-one errors), for example, can be used to test your complex sort algorithm that switches algorithms and runs some passes in parallel if the input is big enough. It is much harder to be 100% sure the complex sort doesn't have a bug, but it blows the simple bubble sort out of the water for any large N, and so is worth doing for somebody.
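A sketch of that oracle-style comparison on random inputs. Here `merge_sort` is just a stand-in for the "complex" implementation (nothing parallel or adaptive about it), and `bubble_sort` plays the slow but easy-to-verify reference.

```python
import random


def bubble_sort(items):
    """Slow reference implementation that is easy to check by eye."""
    out = list(items)
    n = len(out)
    for i in range(n):
        for j in range(n - 1 - i):
            if out[j] > out[j + 1]:
                out[j], out[j + 1] = out[j + 1], out[j]
    return out


def merge_sort(items):
    """Stand-in for the more complex implementation under test."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


def compare_against_oracle(seed, runs=200):
    """Both implementations must agree on every randomly generated list."""
    rng = random.Random(seed)
    for _ in range(runs):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 30))]
        assert merge_sort(data) == bubble_sort(data), f"seed={seed} data={data}"
    return runs


compared = compare_against_oracle(7)
```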

I agree in general with your statement. Very few people in the real world have a problem where there are two correct solutions, much less one where the more complex one is worth the cost to write/maintain.


That's fuzz testing, not unit testing. Using a fuzz tester to generate a corpus of test cases which are then fixed in unit tests is a good idea. But randomizing your "quick to run, must pass on every commit" unit tests is problematic for reasons other people have noted in this thread (and very slow).

Fuzzing is a superpower.

You need general correctness assertions, which are often just “the program doesn’t crash” but fuzzing catches an outrageous number of errors in real programs. IMO, every single product that takes input from sources you don’t control should be fuzzed.


If anybody is interested, this appears to be part of a series: https://www.youtube.com/watch?v=5LOdKDqdWYU&list=PLlmVY7qtgT...

Specific tests don't exactly mean "simple" tests. It is hard to balance between the two, but in my experience when people try to write specific tests they just start writing extremely simple tests.

They seem to start talking about stacks and leaves of stacks, as if one should only test at leaves. Surely this is woefully inaccurate?

Isn't the point of tests to test behaviour? Sure be specific about behaviour. And sure, unit tests should not overlap too much, but this is a separate matter to multiple tests failing, and now I don't know where my bug is.

?


The purpose of a test is to state that "This should NEVER change". Thus you need to figure out what will NEVER change before you can write any tests. You can use architecture to guide you: if you have a horizontal layer or vertical slice, you know that your layer/slice's interface to the other layers/slices will be hard to change, and so you can always safely test there. The problem is your layer/slice is complex (if it isn't then your architecture is too inflexible!) and so you need to pick internal points to test. These internal points are then asserted not to change, but you could legitimately refactor them later if only you didn't have those pesky tests that would fail on conditions that might still be important.

Where to inject tests is an extremely hard problem.


I think they were contrasting the first test (a more specific test of a leaf function, i.e. the innermost parts of the app) with the second test (a less specific test because it fails when either the outermost or innermost part is broken) and saying that to diagnose the second kind of test failure, you can make note of the fact that the more specific test is failing as well and start your debugging in that innermost part of the app (the leaf of the callstack).

This is also important for system, integration, e2e or any other complex test.

The problem is that it is very hard to filter out failures due to environment or other unrelated issues, and even harder to pinpoint the problem.


High-level testing suffers from this symptom, but catches overarching system problems. It usually means that you are missing a test at unit (spec), and possibly at contract/collaboration level.

I rather prefer higher-level tests (integration tests?) that say: hey, you have an error. Or the user will experience an error. Or the expected high-level outcome is wrong.

Because with unit testing, I can miss so many conditions. If I have a high-level error, sure, I will have to figure it out, but that is part of the development process, rather than something breaking live and then having to figure it out anyway.

Of course, unit tests have their place where I want to test that this input produces a particular output (for example some parser, sanitizer class etc.) and I want to receive a signal when my future development breaks some stuff.

Tldr: I don't think that having a very specific "what went wrong" is that important. I'm grateful when tests fail, because that saves me from mistakes going into production. It's like a safety net.


I too prefer higher level tests, but for a different reason: the deep unit tests tend to make it harder to change the code in non-functional ways. If I decide my program structure is wrong most of my unit tests will not apply to the new structure, but the high level tests need to pass.

You're right - high level integration tests (end to end) are important, and most would argue that they are more important. If you had to choose one type of testing only, that would be it.

Unit tests are also important, as you said.


Not sure I get the point of this


