Bloom filter privacy and thoughts on a newer protocol

This is a mirror of an article posted by Mike Hearn on the bitcoinj mailing list, posted here for archival purposes on April 9, 2016.


Jonas Nick has written up a blog post about wire privacy in bitcoinj, observing what was also recently observed in a research paper last year – the FP rates used in bitcoinj’s Bloom filtering implementation are very low:

I thought about reply on the blog or reddit, but figured here is a better place and then I can link to this email later. Really we should have written this up a long time ago because academics keep doing the same research and making the same “discoveries” over and over without bothering to talk to us, or even checking the bug database where these matters are discussed already. That’s unfortunate. I keep hoping someone will actually fix things, but nobody ever does.

Why is block chain private information retrieval hard?

The reason bitcoinj doesn’t use the obfuscation capabilities of the Bloom filtering protocol is that lying consistently is hard. I mentioned this to Jonas a few days ago at the Bitcoin meetup I attended. Let’s elaborate on what this means.

The Bloom filtering protocol let’s bitcoinj lie about what it’s interested in from a remote node. But anyone who ever watched a cop show knows that lying is one thing, but lying without getting caught is something else entirely. Usually in these shows, the detective cleverly puzzles out whodunnit from inconsistencies and mistakes in the suspect’s story.

Common problems that let the detective catch the bad guy include:  constantly changing their story, telling different lies to different observers, telling lies that contain elements of the truth and so on.

Subgraph traversal

As we all know, Bloom filters allow us to match more transactions and script elements than we actually need, and in a way that the server doesn’t know which we genuinely care about and which we don’t.

The problem starts when we realise that what we actually care about is not transactions but rather transaction chains. If we observe a payment to our address A whilst scanning the chain, we must also observe any transaction that spends that payment. Otherwise if someone cloned their wallet to another machine and we were in a watching wallet configuration, we would have no idea the money was actually spent. And it’s recursive – having observed the transaction that spends the payment, if it had a change address we must also observe any transaction that spends that transaction, and so on, following the “peel chain” of transactions all the way to the bottom.

Actually what we need to download are subgraphs of the global payment graph.

For this reason the Bloom filtering protocol is more complex than you might think. A remote node that is sending us a filtered block chain actually updates the filter that we uploaded, as it goes. Whenever a transaction matches the Bloom filter, the hash of it is added back into that filter so any spends of it are guaranteed to also match. Because adding an element to the filter increases the false positive rate, we say this process makes the filter dirty. Eventually the filter is so dirty that it matches the entire block chain and we’re doing as much download work as Bitcoin Core would, but without the extra security. BitcoinJ detects when the filter is starting to get dirty by observing how many false positives the remote node is sending back and uploading a new, fresh filter once it gets too high.

“Too high” in this case is currently defined as a percentage FP rate, but should really be defined in terms of bytes/second, as that’s the actual limiting factor.

Tracking and accounting bandwidth usage

The problem we face then is how do we clean the filter? The approach bitcoinj uses at the moment is just to generate a totally new, randomised Bloom filter from scratch. This is very simple because we just repeat the process we used the first time. And it’s guaranteed to reset the FP rate back to what we expect it to be.

But this is like the bad guy in our cop show who constantly changes his story whilst preserving elements of the truth. You can just collect the different filters and then do a bitwise AND of them together to progressively remove the random noise.

What we actually need to do is alter our story in the smallest way possible to get us back under our bandwidth budget. This is hard, and is the reason no Bitcoin wallet today offers wire privacy of addresses.

More precisely, we need to:

  1. Keep the false positives around in some form, grouped together by what subgraph/chain they are a part of.
  2. Calculate how much of our bandwidth is being wasted by which false positive chains.
  3. Calculate the smallest set of chains that sum up to the excessive bandwidth waste or more.
  4. Rebuild the filter such that it includes the other false positives, but not the ones we have identified as pushing us over our limit

But wait! It gets harder! To minimise delays caused by network latency, P2P clients (Bitcoin Core included) request blocks in batches. And because scanning the chain creates false positives, if we calculate a filter that nearly uses up all our bandwidth, as the next batch scan progresses we’ll rapidly end up over our limit again! So actually we have to aim to undershoot, and the amount of undershoot depends on how quickly we’ll end up blowing our budget again. This in turn is a function of how much Bitcoin is being used globally.

Additionally we have another problem – if we have a high FP rate then the risk of hitting an absolutely massive transaction chain (e.g. SatoshiDice) becomes realistic. If our wallet hits such a chain then we might end up being instantly swamped with FPs …… but we can only change our Bloom filter between block batches. So we either have to wait for the end of the current batch, or detect this situation and disconnect from the remote peer. A protocol upgrade could potentially fix this, but those are not something I care about doing anymore post getutxo-fiasco.

But wait! It gets harder! I hand-waved before when I said “keep the false positives around”. What does that mean? Holding them in RAM? Well, on Android we have no swap file and using too much RAM will kill the process. So we don’t want to waste more RAM than we have to. A more efficient encoding might mean holding the groups as Bloom filters themselves, but this adds code complexity.

Surviving restarts

But wait! It gets harder!

All the above state I just mentioned, however it is represented, is stored in RAM. That means when the app shuts down it’s all lost. When the app starts up again, we’re back to having newly randomised stories that can be intersected to find the truth.

It’s tempting to think we can just store the random seed we used to generate the filters. But then we’d have filters that didn’t incorporate any of the actual false positive chains we were previously chasing. This is a discrepancy a good adversary could pick up on.

So really, to do a great job, you need to serialise all that stuff to disk and load it back again. But we already have problems with wallets getting too large to load into RAM on constrained devices. Making wallets even larger doesn’t seem like a good idea.

Sharding wastes IOPs

One obvious way to increase privacy is to download the chain from multiple peers in parallel, using different non-overlapping Bloom filters for each one. Then the results can be recombined client side before being forwarded to the wallet. This was suggested by Jonas in his blog post.

The problems with this are threefold:

  1. It adds significant complexity to the code, because multiple chain syncs must run in parallel and be synchronised in lock step.
  2. The entire sync process ends up locked to the speed of the slowest node. BitcoinJ needs to do a better job of finding fast peers as the network seems to have got slower lately. This wouldn’t help.
  3. The slow part of serving a filtering request is the disk seeks needed to load the blocks. IOPS (inputs/outputs per second) are a precious and limited network wide resource. We might already be running out of them …. it’s hard to say because Core has little monitoring or performance metrics.

So simply querying multiple peers doesn’t seem like a good improvement.

Does it really matter?

The issues with Bloom filtering have never really been fixed, partly because it’s unclear that they actually matter to people. There is no equivalent of to map IPs to addresses. Someone might think it’s a good idea to build one, but it’d have no legitimate uses as far as I can tell, and would very likely break wiretapping laws. The legal risk seems likely to put people off.

Additionally, it’s unclear what someone would do with a big collection of filters. The NSA or GCHQ can certainly find a use for them, but other than those guys, it’s not obvious that these logs can be monetised or abused any easier than web server logs can. There are contrived scenarios involving wifi hotspots and muggings, but that’s the best I can come up with.

Now we’ve fixed address reuse, when people complain to us about problems with the library they usually care about performance. Not wire privacy.

What’s the best fix?

There are three ways to make things better:

  1. Implement all the above strategies to make bitcoinj a really strong implementation of PIR. This would be cutting edge research – I do not believe there are any internet protocols that even try to do good PIR today, let alone succeed. This is hard.
  2. Encrypt the P2P protocol. This would annoy NSA/GCHQ and other agencies like them, because then they couldn’t build mappings of name->address using DPI. Of course they can still run nodes and collect a random subset of the traffic, but it at least forces them to donate resources to us. This is easier, but I don’t plan to do it myself any time soon.
  3. Dump Bloom filtering for something else entirely.

Replacing Bloom filtering

What might we replace Bloom filtering with and how would it help?

It’s important to understand why we (Matt and myself) designed the protocol this way. There were two major reasons:

  1. It could be implemented on top of our existing code in Core and bitcoinj in a backwards compatible manner, with very low additional complexity. Still, making Bloom filtering work took a fair bit of debugging and tuning anyway.
  2. Wallets used (and still do use) the block chain as a record of payment metadata, specifically, to build a payments list showing the rough time when money was sent or received. It is expected that syncing the chain fills your transaction list with historical data.

Another advantage that we didn’t know about at the time is that this scanning process lets us push forward the HD key lookahead zone. Starting from an HD seed we can figure out how many keys we’ve used by rolling forward through the timeline.

At the time, blocks were about 10 kilobytes in size. So the overhead of loading them and filtering them was pretty low. Also the network was larger than today yet very quiet, so we had tons of server capacity. Gambling that someone¬† (maybe us) would use the Bloom protocol to build a really great PIR implementation seemed worth it, and anyway, more ambitious designs weren’t really possible. Matt implemented filtering during lectures and I had a day job.

Since then things have changed a bit. The network got a lot busier, with fewer nodes. Blocks are now routinely 700-1000kb in size, so that’s 100x growth. PIR turned out to be really hard, and nobody showed up to do the work. Other things took priority for me, like HD wallets (which back then did not exist).

Thanks to BIP 70 (which didn’t exist back then), we can now imagine a different design:

  • Wallets stop relying on the block chain to fill up the tx list with metadata. Instead they have really robust cloud backups, with full metadata preserved in this way.
  • They use BIP70 and some store-and-forward network like Subspace to communicate with each other.
  • Synchronising with the block chain now fulfils a different role than before: it’s just a way to get confidence in unspent outputs. We no longer need transaction data from fully spent transactions.
  • Full nodes build a secondary sorted UTXO index of output->{height, merkle branch}
  • We can use lists of script prefixes to select ranges of the UTXO set and download their heights and merkle branches
  • Wallet then checks the merkle branches against its list of headers to obtain confidence.

This design is in some ways more complicated than Bloom filtering – for example it requires that wallets operate entirely in terms of UTXOs and not transactions. The notion of a watching wallet is a bit different in such a system, because it’d be able to tell you the balance of the watched wallet and the relevant outputs, but not why the balance is that way. The moment money is spent, such a wallet would forget the payment ever existed at all unless it kept some external log.

And to keep the user interface how people have come to expect it, BIP70 needs to work for all payments including between people, instead of today where it’s only used for person->merchant payments.

Also, HD key lookahead is an open question. Restoring a wallet from an HD seed in such a world wouldn’t work well because you’d have no idea how many keys to pre-calculate. You can’t just generate some lookahead, then scan the UTXO set because if the seed is old those keys might have been used and spent a long time ago.

The problem of adapting FP rates to bandwidth usage still remains of course. If a script suddenly becomes hot we might end up downloading a lot of Merkle branches and would need to correctly detect which prefix was hot and tighten it.

The main advantage of such a scheme is that it scales a lot better, and we don’t have the problem of chasing chains of transactions. Only UTXOs matter and they are flat. So we can just invent some random prefixes, and as long as we scan those prefixes consistently we’re pretty much good. Also it means pruning full nodes can always service SPV clients even if they threw away all their blocks.

The downside is nodes have to maintain two UTXO databases. But it’s probably still cheaper than loading entire blocks to do filtering on them. Also, SPV clients would need to keep all headers for blocks that might potentially have unspent money in them (or at least their hashes). Currently we just keep a ring buffer of a few thousand headers and throw away old ones. How this interacts with HD key lookahead is interesting to think about.

Note that this is not “prefix filtering” as described by Peter Todd elsewhere, which refers to basically the same design as Bloom filtering but with a different filter encoding. Abandoning chain scanning entirely is a rather bigger design change. I don’t currently have a name for this. Perhaps UTXO filtering.

As you probably realise, switching away from Bloom filtering is far from a no brainer and opens up lots of unanswered research questions. Finding ways to improve Bloom filtering implementations will probably be the lowest hanging fruit for some years yet.

Leave a Reply

Your email address will not be published. Required fields are marked *