If you're designing a new system using hash trees, you better use Sakura trees or have a good explanation for why not.
Edit: I also wrote some demo code for this attack when arguing with one of the IPFS guys about why they really shouldn't use Bittorrent BEP 30-style Merkle Trees. In order to get security with Bittorrent-style trees, the [length, root] pair really needs to be your cryptographic identifier, and you need to make sure they're always properly handled as a pair. There are just too many caveats to usage and we have provably secure alternatives.
As a former LimeWire developer, it makes me sad that Gnutella's TigerTrees avoided this vulnerability long before bittorrent's BEP 30 was published, and yet Bittorrent got it wrong. It was a well-known vulnerability, and was covered as part of my interview at LimeWire.
Max file size is 2^64 bits, so the hash tree is defined as a binary tree covering the entire 2^64 range, leaf layer made of 1KB pieces.
A hash of an empty range is defined to be 0, at any level.
That way, each hash is reliably fixed to its interval ("bin").
There are some features derived from that. For example, you can prove file size by showing a logarithmic number of hashes. A file is identified by its root hash (covering the entire 2^64 range), no additional metadata needed. And so on.
These days, the 7574 scheme is used in the DAT protocol.
With a better construction (Sakura, ...), that total and block size information are unnecessary. I wonder if any almost-compliant RFC 7574 or BEP 30 tools accidentally ignore those other sizes.
I've posted several places about being able to take the hash of the rightmost chunk and log(N) other hashes as a compact proof of file length for a hash tree root. (Including, I think, the posts I had arguing with the IPFS folks about using BEP 30 Merlke Trees.) That's handy, but there's nothing novel about RFC 7574 there.
However, 2^64 bits is a bit of an arbitrary limit. What's the justification there? SHA-512 and SHA-384 support up to 2^128-1 bits, and SHA-3 doesn't have such a limit.
More importantly, I don't recognize the names of the RFC authors from any cryptographic analysis. There's a big difference between a scheme that looks pretty good to a ton of random people who have looked at it and a scheme designed and published by world-famous cryptographers. Has your custom tree construct undergone formal review by qualified cryptographers? It looks pretty good to me, but if I had to bet by career on the security of a tree hash scheme, I'd prefer to go with the people who brought us Keccak/SHA-3 (and one of the people who brought us AES, despite its known weaknesses).
2. 2^64 is quite a lot, for a single file. The variant I described deals with a single static file. Also, it is a part of a network protocol, so there are some requirements. Like, using standard integer arithmetics for packet-level processing.
3. Feel free to show it to any people you like.
If you're stuck with an architecture that depends on certificate revocation lists, I think Micali's representation of the CRL is both efficient and secure. On the other hand, blacklists (such as certificate revocation lists) are generally easier to exploit than whitelists (such as the published lists used in certificate transparency).
I think we all would have been better off with an infrastructure that mandated certificate transparency, where a Merkle Tree of all valid certificates was published periodically by each Certificate Authority. In other words, keep a cryptographically verifiable whitelist instead of a cryptographically verifiable blacklist. For scalability, you'd want to have the published list be a Merkle Tree root of valid certificates, plus a signed tree hash of a Bloom filter of all of the certificates revoked in (say) the past month. From a signed Merkle Tree root, each site can create a compact proof of validity for its certificate of log(N) size. Using the signed Bloom filter, the site can keep reusing that proof for (say) a month. You'd want the Bloom filter to use a keyed hash function so that a collision in one publishing wouldn't be likely to keep colliding for the whole month.
You need a concept of a timestamp of a message on a stream relative to some other stream of data.
This is a distributed computing problem.
To solve this for Intercoin, we invented the Permissionless Timestamping Network
Look for descriptions of PTN. That's what you REALLY need for revocations.
The problem of revocations is then really reduced to a search and incentive problem.
1) in whitelist
2) maybe in blacklist
3) check cert not in actual blacklist
And a certificate is valid if 1 (and not) 2, or if 1 and 2 and 3?
[ed: i see you mentioned a keyed hash "to avoid colliding for a whole month". But the risk of a few random days of service (certificate) unavailability seems like an unacceptable design tradeoff? (as there will be other, non-logical reasons for downtime too) ]
The high-level view is that you have a whitelist, but to improve scalability of constructing proof of membership, we allow for longer lifetimes and Bloom filter shortening lifetimes. In case of collision in the Bloom filter, you need to construct a newer proof than would be ordinarily required.
With certificate revocation lists, the client doesn't hammer the CRL server with one request for every new connection it tries. 24 hours is a common lifetime for a CRL, during which the client wouldn't typically re-check for a new CRL.
So, presumably, every half month (similar to renewing DHCP leases when they're half expired) a server would get a freshly timestamped signature of the root of the CA's tree of valid certificates and get the list of hashes up the path from the server's certificate up to the root of the tree. The root signature, the list of hashes, and the number of leaves to the left of the server's certificate leaf in the tree constitute a compact proof of membership in the set of valid certificates.
In order to have equivalent worst-case revocation time to a CRL with 24 hour validity, the server would also need to present either:
1. A proof of membership in the valid set of certificates where the CA's root signature is at most 24 hours old OR
2. A proof of membership in the valid set of certificates within the past month and a signed Bloom filter less than 24 hours old where the server's certificate isn't in the Bloom filter
Of course, there's a denial-of-service opportunity here, just like there is with a CRL system. With a CRL system, there's fail-safe (don't allow any connections if unable to get a new CRL) or fail-dengerous (if the CRL server is down, just keep using the most recently seen CRL). One could have a similar fail-dangerous whitelist system where if the server can't get a Bloom filter signed within then past 24 hours, the client tries to get a signed Bloom filter within 24 hours via an independent channel, and if this fails, accept the Server's bloom filter if it's less than (say) 7 days old. I'm not necessarily advocating fail-dangerous, just stating that the denial-of-service mitigation is similar to that of CRL systems.
Note that if you don't actually need transparency, you can use a cryptographic accumulator in place of a Merkle Tree root. The advantage of a Merkle Tree root is that if leaves are ordered in terms of DNS name (.com.google.maps, .org.eff, etc.) then the CA can give you a compact run of all of the certificates ordered for your domain, as well as the certificate right before and the certificate right after your domain, plus a short list of hashes, and you can prove that you have a complete list of the valid certificates issued for your domain.
All of the cryptographic accumulators I'm aware of don't carry any proof of order, so there's no compact proof that you've been given all of the certificates in a certain name range.
One broadcast of one Merkle Tree root signature and one signed Bloom filter every 6 hours for a few hundred CAs is really easy to scale up. Presumably there would be something similar to NTP or a P2P broadcast network along with the major CDNs all providing caches of this information. All of these elements are the same for all participants in the system: the amount of data distributed is constant per CA and independent of the number of issued certificates. Unfortunately, for a constant false positive rate, the size of the Bloom filters needs to scale with the number of revoked certificates.
Constructing the compact proofs of membership requires interaction with some server that has access to the list of all of the leaves of the tree, which is heavier weight, and why we give these proofs a lifetime of a month.
Do you mean "vector"?
I don't have much formal mathematical education, but I believe the normal term in Set Theory for the mathematical construct is an ordered set. I know next to nothing of Abstract Algebra, so I won't speculate on defining your normal linear algebra vector operators for vectors of Certificates.
C++ and Java call the most obvious data structure for it a [Vv]ector. Python calls the most obvious data structure a List. Some data structures textbooks would call the most obvious data structure a dynamically sized array (or a fixed sized array, for that matter).
The paper isn't terribly clear on how to implement it, though I was only able to skim over it due to being otherwise occupied atm.
I I understand the paper correctly, one should append 0x00 for root nodes too. It's a little odd that roots & leaves are treated the same, but the math works out correctly.
(also, I think the paper appends a 1 bit for roots & leaves and a 0 bit for interior nodes — but it's all much of a muchness)
It's pretty long.
For those not familiar with the specific implementation issue, this post should be a fun read.
I don't know what's more concerning, that pymerkletools advertises itself for use with Bitcoin or that none of the contributors read the wiki page :|
The fix came through a higher order property (duplicate TXs not allowed anyways).
Edit: Okay, the other valid argument for not using prefixes to tag leaves and interior nodes is that you're using the provably secure Sakura construction, which using suffixes rather than prefixes. There are a few advantages to using suffixes, such as being able to have a single node containing both data and child nodes without having to know the length of the data portion before starting to hash the node. There's also better performance due to memory alignment when hashing memory-mapped files (if using block sizes that pack well into memory pages and cache lines) if you use suffixes. But, suffixes vs. prefixes is a tiny nit to pick.
h_leaf(leaf) which takes a leaf.
h_branch(branch_left, branch_right) which takes the two branches.
The important point being that one should not be able to find a leaf such that h_leaf(leaf) = h_branch(branch_left, branch_right) for any branch_left, branch_right.
Prefixes versus suffixes are just implementation details of the hash functions (i.e. h_leaf(x) = "\00" + sha256(x), h_branch(x,y) = "\01" + sha256(x + y) would also work).
Imagine you had some combinations of functions f, g, and h such that f(g(h(x))) = y. Obviously you could calculate that as h(x) -> g(h(x)) -> f(g(h(x)) = y, but then of course knowing h(x), or g(h(x)) would enable you to find y as well. So of course, due to the recursive nature of it, picking any set of inputs that were the outputs of a previous call would give you the same output.
That argument doesn't exactly fit multiple inputs, but the idea is the same.
I guess the point is, this property (value-neutral) of that design makes it unsuitable for real-world purposes. Since the implementations in question are meant for real world use, that makes them flawed, at least. :)
I don’t see how this is a vulnerability in the merkle tree algorithm. It just seems like yet another case of “python libraries contain bugs that are common in idiomatic python”.
I guess since so many people have independently implemented the same mistake, the writeup is useful.
If you're providing a merkle branch to prove that a document you've hashed is in the set, then I don't see how this helps you at all.
hashing root with tree depth as last step.
Edit: Also I think element count can be also alternative for depth
I get what you meant, but it still doesn't prevent the second preimage attack, unless you're forcing all subtrees to be the same depth. Think about the algorithm for testing set membership. How would they use the tree depth to distinguish between hash(hash(leafA) + hash(leafB)) and hash(leafC) where leafC = hash(leafA) + hash(leafB) but leafC is NOT a member of the set? Keep in mind that leafA and leafB could be leaves of the longest subtree, but other valid leaves could be on valid subtrees that are one node shorter.
> Edit: Also I think element count can be also alternative for depth
Element count doesn't solve the second preimage attack problem either.
/ \ /
AB CD EE
/ \ / \ /
A B C D E
/ \ / \
AB CD EE EE
/ \ / \ / \ / \
A B C D E E E E
Incidentally, a side effect of this is that you can't have duplicate transactions, because you wouldn't be able to tell between a E and F such that E=F, and E simply being duplicated to pad out to a power of 8. This works for Bitcoin because duplicate transactions aren't desirable, but it might not work for other applications.
Sorry, I couldn't help myself