gdt's thoughts on tahoe-lafs and similar filesystems

This page contains gdt's thoughts on tahoe-lafs and to some extent other members of the class of filesystems that could be called "cypherpunk distributed". (Filesystems like coda and those like gluster are not considered.)

1 Introduction

tahoe-lafs is a distributed filesystem that combines capabilities, encryption, and erasure coding. This page considers tahoe from the viewpoint of one wishing to use it for distributed off-site backup. Most of the page assumes some familiarity with tahoe. We consider suggested improvements specific to friendnets, and also generally useful improvements.

Many feature requests in this page refer to tickets at tahoe trac, and eventually all should. I expect to update this page, and hope to remove feature and bugfix requests as new versions of tahoe are released.

Many items point to discussions that I have participated in or sometimes merely observed; in that sense this page is (intellectually, not in copyright) a derivative work of the tahoe-lafs mailinglist and trac stream.

2 Distributed backup

I approach offsite backup as a combination of resilient offsite storage (tahoe-lafs) and a scheme to put bits into that storage. There are many backup programs, typically offering some combination of incremental saving and deduplication. This section intends to list plausible schemes for use with tahoe, and analyze them.

2.1 Concept of Operations

The scenario of interest is the loss of all computers, external disk drives, and CDs at one particular site. We assume that post loss, a person will buy new computers and disks, reconnect to the grid, and retrieve data.

2.2 Capabilities backup

2.2.1 Introducer capability backup

To connect to the grid, one needs to have the introducer capability. This can be printed and retyped, like capabilities, but is of less concern because all the other nodes will have the introducer capability.

2.2.2 Root Capability backup

One's root capability is the key to all data one has stored in a grid, and this value must be preserved with extremely high probability. However, the root capability must also be kept confidential in most cases. The use of super-encryption of backup data with a memorizable passphrase can reduce the need to keep the rootcap confidential, but we assume there will always be some stored data without super-encryption.

A benefit of tahoe is that the work of storing the rootcap safely need only be done occasionally. Then, fresh data may be stored in the grid, and retrieved with the previously placed rootcap.

2.2.3 Encryption key backup

If backup objects are super-encrypted, that key must also be stored safely offline. One approach is to store the key, encrypted with a passphrase, under one's rootcap. Another is to ride on digital storage of the rootcap (e.g., CDs in multiple safe deposit boxes).

2.3 Backup approach

2.3.2 "tahoe backup"

This results in a deep tree, with cumulative risk of loss because all directories on the path must be readable. It also seems likely to result in much greater latencies to store/retrieve data, but I have not verified that.

"tahoe backup" blurs the line between tahoe-specific concepts and a general backup program, which would be useful with other filesystems.

2.3.3 dump/tar

One could store traditional full/incremental dumps, or some other containers. This would result in a smaller tree of data, with fewer directories on the path from the root. One can think of using tahoe as virtual offsite tapes.

2.3.4 rdiff-backup on tahoe-lafs mounted as a filesystem

I haven't tried this, but I expect at least some trouble.

2.3.5 bup (via mounting)

This might be useful; bup has the concept of writing packfiles which don't get changed. Probably smaller packfiles are in order (to make repair less frightening and get better space distribution). Of course this inherits bup's not-yet-implemented age-out/garbage collection.

2.3.7 integration with amanda or backuppc

One could add support to existing backup programs to treat a tahoe grid as a virtual tape drive. This should be conceptually straightforward, but would result in a large amount of data traffic. The challenge is to design a scheme which results in good data traffic properties (amount, number of files) relative to the actual backup work necessary.

2.4 super-encryption

Given how capabilities are used in web browsers, one might not trust tahoe's encryption. So rather than a gzipped dump file, one might instead dump|gzip|gpg.

3 Obstacles to friendnet operation

The term friendnet in tahoe refers to a group of people who operate tahoe storage servers for mutual backup. In 2010, many people can afford as much disk space as they need, but managing offsite, replicated storage is hard. If each person provides a storage node with 2 TB and connects to the non-public introducer, then everyone can place a substantial fraction of 2 TB worth of shares. With 3/7/10 coding, that's perhaps 500 GB of data.
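
A rough sketch of the arithmetic behind the 500 GB figure. The function name and the 85% fill fraction are my own assumptions for illustration, and share overhead (hashes, leases) is ignored.

```python
def usable_data_gb(share_space_gb, k, n, fill_fraction=1.0):
    """Plaintext that fits in share_space_gb of shares under k-of-n coding.

    fill_fraction is a hypothetical headroom knob: the fraction of one's
    fair share of the grid that one actually fills.
    """
    expansion = n / k  # 10/3, about 3.3x, for the default encoding
    return share_space_gb * fill_fraction / expansion

# 2 TB of shares per person, 3-of-10 coding.
print(round(usable_data_gb(2000, k=3, n=10)))                      # 600
print(round(usable_data_gb(2000, k=3, n=10, fill_fraction=0.85)))  # 510
```

So the ceiling is about 600 GB, and "perhaps 500 GB" leaves some headroom.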

This section considers issues that arise when operating a friendnet. It is intended both to provoke advice in cases where I haven't understood well enough, and to be constructive criticism for tahoe.

3.1 Fairness in storage usage

Tahoe has a "least authority" design where shares can be placed with an introducer capability and retrieved with a storage index; no information about the data owner is recorded. Some people have far more data than others, and it would be nice to know how much data each person is using so that, e.g., those with shares in excess of 75% of provided space can be asked to buy a new disk.

3.2 Limits on bandwidth usage

Storage nodes are likely to be attached to residential Internet service, or to be colo/VPS machines. Some services may have explicit caps, there may be implicit caps, or there may be per-GB charges. Regardless of the reasons, node operators may wish to limit the network usage of tahoe. It would be nice to have two token buckets, one for incoming and one for outgoing data (perhaps acks and do-you-have-block can be free), and to limit share transfers to what these buckets allow.
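
A minimal sketch of the token bucket idea, not anything tahoe implements. The class name, the rates and the injectable clock are assumptions made for illustration.

```python
import time

class TokenBucket:
    """Allow bursts of up to `capacity` bytes, refilled at `rate` bytes/second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full
        self.last = clock()

    def try_consume(self, nbytes):
        """Return True and deduct tokens if nbytes may be transferred now."""
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

# One bucket per direction; the numbers are arbitrary examples.
incoming = TokenBucket(rate=100_000, capacity=1_000_000)
outgoing = TokenBucket(rate=50_000, capacity=500_000)
```

A storage node would call try_consume() before sending or accepting a share block and defer the transfer when it returns False. Small control messages could simply bypass the bucket.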

Similar to network bandwidth usage, it should be possible to configure limits on disk bandwidth, although this is likely to be necessary only with fast networks and slow disks.

(Idea is due to Richard H.)

4 Generally useful features

This section addresses improvements to tahoe that would be generally useful, even if all servers are under common administrative control. It is not meant to be an exhaustive list, but merely some things of interest to me.

4.1 Resolve capability leakage with WUI (Web User Interface)

Tahoe's web interface (not the WAPI) is more featureful than the command line, and I argue that this is a bug. Users, even those who are uncomfortable passing capabilities to web browsers, are encouraged to do so. Capabilities are not treated as passwords, and by default firefox will remember them. The command line interface should be sufficiently powerful and usable that there is no reason to use the web interface.

4.2 Display free space on node status page

On the pubgrid, one often sees "Connected to 13 of 21 known storage servers:". However, one often gets exceptions during repair that shares could only be placed on 6 nodes. Therefore, it seems fairly clear that about half of the pubgrid nodes are not accepting shares, and that presence of 13 servers is deceptive. Having storage nodes tell clients the used and free space would enable displaying this information, as well as free-space-based share placement. (closed dup

4.3 Scoreboard server status

Clients should keep track of whether servers have performed according to expectations, and not wait for those servers to respond in future queries (perhaps while continuing to test the servers).
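
One way to picture such a scoreboard, as a sketch only. The class, its method names and the threshold of three consecutive failures are all hypothetical.

```python
class Scoreboard:
    """Track consecutive failures per server (hypothetical sketch)."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}  # server id -> consecutive failure count

    def record(self, server_id, ok):
        """Record the outcome of one request to server_id."""
        if ok:
            self.failures[server_id] = 0
        else:
            self.failures[server_id] = self.failures.get(server_id, 0) + 1

    def should_wait_for(self, server_id):
        """False once a server has failed max_failures times in a row."""
        return self.failures.get(server_id, 0) < self.max_failures
```

A client could still send queries to a server it no longer waits for, so that a single success resets the count and the server is trusted again.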

I can provoke bad behavior at will due to what I believe is a misbehaving firewall. I have two storage nodes on a network, and if both are running, connections to each of them break quickly, as if the TCP connections are lost from a "keep state" table. Then, all operations on the pubgrid take around 4 minutes.

4.4 Validate storage node addresses

Storage servers can be set up with non-routable addresses. Of course, within a particular group of nodes, non-routable has varying meanings – an in-house grid in an RFC 1918 network may be entirely reachable. But on the pubgrid, there are nodes with non-routable addresses. It would be nice if the introducer tried to connect back to the advertised address/port and, if unsuccessful, either failed or flagged the registration.
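
A cheap pre-check that could run before any connect-back attempt, sketched with Python's standard ipaddress module. The function name is mine, and this only classifies IP literals; it does not test reachability.

```python
import ipaddress

def is_globally_routable(addr):
    """True if addr (an IPv4 or IPv6 literal) is a globally routable address."""
    return ipaddress.ip_address(addr).is_global

# RFC 1918 and loopback addresses are rejected.
for addr in ["10.1.2.3", "192.168.0.5", "127.0.0.1", "8.8.8.8"]:
    print(addr, is_globally_routable(addr))
```

On a public grid an introducer could flag any registration for which this returns False; on an in-house grid the check would have to be disabled.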

4.5 SOCKS support

For servers behind firewalls, it would be nice for the server to be able to use SOCKS to obtain a globally listening socket.

4.6 Non-dispersed directories

Tahoe by default uses 3/10 encoding: any 3 of the 10 shares suffice to reconstruct a file, at a storage expansion of about 3.3x. This leads to greater reliability than a simple 3x replication. This decision is based on files being large, so that disk storage for them matters. Directories are often small, but directories are critical to retrieving other data. If a directory is less than a disk fragment, splitting it might not save any space. So a 1/10 encoding might make sense.
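
To put rough numbers on the reliability comparison, here is a sketch that assumes each share sits on a different server and that servers fail independently. The 0.9 availability is an arbitrary illustrative value.

```python
from math import comb

def survival_probability(k, n, p):
    """Probability that at least k of n shares are available, when each
    share's server is independently available with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.9  # assumed per-server availability
print("3-of-10 coding:     ", survival_probability(3, 10, p))
print("3x replication:     ", survival_probability(1, 3, p))
print("1-of-10 (directory):", survival_probability(1, 10, p))
```

Under these assumptions 3-of-10 fails far less often than 3x replication, and 1-of-10 less often still, at the price of a 10x expansion that hardly matters for small directories.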

(I saw this idea expressed on the mailinglist.)

4.7 Free-space-based share placement

It would be nice to be able to place shares on nodes with the greatest fraction of free storage. Placement by speed leads to fast nodes becoming full.
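
A minimal sketch of what such a selection could look like, assuming servers report free and total space (which they would first have to do). The data layout and function name are hypothetical.

```python
def pick_servers(servers, n_shares):
    """Return up to n_shares server names, emptiest (by fraction) first.

    servers maps a server name to a (free_bytes, total_bytes) tuple.
    """
    ranked = sorted(servers,
                    key=lambda name: servers[name][0] / servers[name][1],
                    reverse=True)
    return ranked[:n_shares]

grid = {
    "fast-but-full": (50, 2000),
    "half-empty": (1000, 2000),
    "new-disk": (1900, 2000),
}
print(pick_servers(grid, 2))  # ['new-disk', 'half-empty']
```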

4.8 Administrative and Physical Node Groups

A key goal of tahoe is to have data survive the loss of storage nodes. What if 6 of the 12 nodes in a grid were run by one person, and one day that person turns them all off? What if 6 are in one building, and there is a fire, flood, or lightning strike? tor has a similar problem, and has a MyFamily option to let clients avoid relying on multiple relays under common (declared) administrative control.

tahoe could adopt an administrative group, a normally-random value that can be copied to other nodes under common control, and storage nodes could report this to clients. tahoe could further have a physical location value, although one could argue that common physical location at least partially implies common administrative control.
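
A sketch of how a client might use a reported administrative group during placement. The cap of shares per group, the function name and the data layout are my assumptions, not anything tahoe does.

```python
from collections import Counter

def place_shares(servers, n_shares, max_per_group):
    """Choose up to n_shares servers, at most max_per_group from any group.

    servers is a list of (server_id, admin_group) pairs in preference order.
    """
    used = Counter()
    chosen = []
    for server_id, group in servers:
        if len(chosen) == n_shares:
            break
        if used[group] < max_per_group:
            used[group] += 1
            chosen.append(server_id)
    return chosen

# Six nodes run by one person (group "A") plus six independent nodes.
servers = [("a%d" % i, "A") for i in range(6)]
servers += [("s%d" % i, "g%d" % i) for i in range(6)]
print(place_shares(servers, n_shares=10, max_per_group=3))
```

Here only 9 of the 10 shares get placed, but losing all of group "A" still leaves 6 shares, comfortably above the 3 needed with 3/10 encoding.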

This is particularly important because a likely growth plan in a friendnet is to have someone using a lot of data add another 2 TB drive. The easiest way to do that is to run another storage node on the second drive.

Physical failure groups are actually hierarchical, e.g.

  • two nodes on one disk
  • two disks on one computer
  • two computers in one building
  • two computers in different buildings in a campus
  • two computers in one town
  • two computers in one metropolitan area
  • two computers in one continent

It seems easy to design too much complexity for this case.

Ravi Pinjala pointed out on the list that the ceph project has addressed this.

4.8.1 Policy Node Groups

One might have policies that control acceptable locations for data placement. Here, the concern is policy/regulatory rather than reliability/survivability. Examples are ITAR and the European Data Protection Directive. Tahoe should be able to express that shares of some files may only be placed on some servers.

4.9 IPv6

tahoe should be able to operate in a mixed v4/v6 environment. Basically, this means foolscap support, which means twisted support.

4.10 Remote share hashes

Currently one can choose between asking if a share is present and retrieving it to validate checksums. It would be nice to be able to have the server compute a hash and return that; getting the expected hash would result in significant confidence that the share is OK (absent a malicious server). It's not clear if this is feasible given the share computation rules, and if extra data would need to be kept.
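
A sketch of the server-side half, assuming a plain SHA-256 over the whole share file. This is not tahoe's actual share hashing, and the client would need to have kept the expected value from upload time, which is exactly the "extra data" question above.

```python
import hashlib
import io

def hash_share(fileobj, chunk_size=1 << 20):
    """Hash a share in chunks so a large share need not fit in memory."""
    h = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

# Client side: compare the server's answer with the value kept at upload time.
expected = hashlib.sha256(b"share bytes").hexdigest()
reported = hash_share(io.BytesIO(b"share bytes"))
print(reported == expected)  # True
```

A server that remembers the hash but has lost the share would still pass this check, which is the "absent a malicious server" caveat.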

4.11 Share migration for node decommissioning

If someone is going to shut down a node, either temporarily or permanently, it would be nice to be able to have that node hand off its shares to other nodes. This would be sort of like repair, but done by the storage node. For each share, a DYHB can be issued, and the current share placed on a node with available storage that doesn't have a share of that storage index. Perhaps grid-wide policies about administrative and physical node groups would be followed.
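
The hand-off decision might look roughly like the following. The "holds" sets stand in for DYHB answers, and every name and structure here is hypothetical.

```python
def plan_handoff(shares, peers):
    """Decide where a departing node should send each of its shares.

    shares: dict storage_index -> share size in bytes.
    peers:  dict peer_name -> {"free": bytes, "holds": set of storage indexes}.
            Updated in place as shares are assigned.
    Returns dict storage_index -> peer_name, or None if no peer qualifies.
    """
    plan = {}
    for si, size in shares.items():
        # Eligible peers have room and no share of this storage index yet.
        candidates = [name for name, info in peers.items()
                      if si not in info["holds"] and info["free"] >= size]
        if not candidates:
            plan[si] = None
            continue
        target = max(candidates, key=lambda name: peers[name]["free"])
        peers[target]["free"] -= size
        peers[target]["holds"].add(si)
        plan[si] = target
    return plan
```

Node group policies, if they existed, would be one more filter on the candidate list.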

(Idea is due to Richard H.)

4.12 Share migration for load sharing

Storage nodes could move shares to other nodes with more free space, but doing this without respecting the user's node group policies seems problematic.

4.13 Tor support

tahoe nodes should have easy support for using tor. There are likely some complicated interactions with normal tor exit policies. tahoe nodes should be able to run as hidden services. There is already some support for I2P, which may be better suited to tahoe than tor.

4.14 peer-to-peer vs. client-server

Tahoe is in some sense a p2p system, and in some sense a system of a group of clients and a group of servers. Unlike bittorrent, tahoe works fine with e.g. 30 pure clients (nodes not offering storage) and 10 servers (nodes offering storage with no affiliated users). This situation can occur if 10 people each run a storage node and then also access the resulting grid from several computers each.

A related issue is that if someone brings up a server with only private addresses, all nodes will attempt to connect to those addresses, resulting in network noise and mess in netstat output.

5 Bugfixes

This section describes things that are wrong with the latest release of tahoe. It exists merely to highlight tickets of interest to me.

5.1 mutable file repair increments sequence number

Currently, repairing a mutable file causes a new sequence number to be created. For example, given 3/10 encoding, finding 9 shares of seq17 will cause removal of those shares and placement of 10 shares of seq18. If servers come and go, but are mostly present (say each server is present 95% of the time), then repair will be constantly done. If the missing shares could simply be placed, then the system would come to a stable point (assuming the file is not written) quickly, resulting in future check operations not calling for repair.
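
To see why repair would be constant in that example, here is a quick calculation. It assumes 10 shares on 10 different servers that are each independently present 95% of the time.

```python
def p_all_present(n_servers, availability):
    """Probability that every one of n_servers is up at check time."""
    return availability ** n_servers

p = p_all_present(10, 0.95)
print(f"{p:.3f}")      # 0.599
print(f"{1 - p:.3f}")  # 0.401
```

So roughly 40% of checks would see fewer than 10 shares and call for a repair, each one writing a fresh sequence number.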

5.2 a single old share causes unhealthy status

During repair, the presence of a single old share causes the file to be perceived as unhealthy. For example (3/10), finding 10 shares of seq17 and 1 share of seq16 causes all to be removed and 10 shares of seq18 placed. Combined with the previous issue, excessive and unnecessary repairs happen in a grid with mostly-present nodes.

5.3 deep-check fails on one non-recoverable file

5.5 expiration should be on by default

When one creates a node with "tahoe create-node", expiration of shares is off by default. However, storage nodes without expiration are not sustainable; they will simply accumulate shares until full. A reasonable default is:

  expire.enabled = True
  expire.mode = age

6 Build system issues

Build system issues are unglamorous, but a prerequisite for software being in wide use (beyond the set of people who think it's fun to hack on it) is that there are no significant build system issues and that the program is packaged in various packaging systems so that it can be installed by end users with zero effort.

6.1 separation of prereqs, build, install

Currently downloading tahoe and doing "./ build" will fetch and build dependencies. This violates the POLA for packagers. I can understand that automatic dependency fetching can be useful, but it should be easily avoidable.

6.1.1 TODO try zooko's patch

6.2 boutique vc

tahoe is stored in Darcs, which unnecessarily raises the effort required to contribute. As of 2012-02 tahoe is in the process of moving to git.

6.3 py-zfec downloads setuptools-darcs

As part of building py-zfec, it downloads setuptools-darcs. pkgsrc has a patch to work around this.

7 Unsorted ticket links

7.1 garbage collection

7.1.1 TODO file subtickets

8 Existing Tahoe Grids

8.1 pubgrid

8.2 VolunteerGrid

8.3 VolunteerGrid2

8.4 i2p

There is apparently a grid on i2p.

9 Other Distributed/Cluster File Systems

9.1 Octavio

Chris Palmer is designing Octavio, a system that is like Tahoe but which aims to be far simpler.

9.1.1 erasure coding

Octavio has no erasure coding; just replication. This gives up some performance, but seems like a reasonable tradeoff for many.

9.1.2 CLI/WUI

Compared to Tahoe, Octavio removes the CLI and WUI, and uses only FUSE. I think this is a huge simplification with little downside.

9.1.3 introducer

Octavio has no introducers; presumably each client has to be configured with all servers; I'm not sure this is a win.

9.1.4 repair and leases

Octavio doesn't explain how it deals with repair, leases and garbage collection, or why it doesn't need them.

9.2 ceph

ceph appears to be a cluster file system, and appears limited to Linux.

9.3 cleversafe

cleversafe doesn't say much about how their system works, and doesn't make code available. The goal seems to be increased reliability in a situation where all storage nodes are well-managed and reliable to start with. The focus seems to be on IDA for a virtual disk drive, rather than files, so cleversafe might be called "distributed storage", rather than a "distributed filesystem". There are pointers to SNIA standards.

9.4 hekafs

hekafs is a filesystem built on gluster but providing ciphertext storage.

9.5 Cypherpunk-deficient due to server plaintext

The filesystems in this section store plaintext on the server, and thus are unacceptable with respect to the "cypherpunk distributed" norms.

9.5.1 gluster

gluster not only stores plaintext on the server, but normally uses unencrypted transport.

9.5.2 coda

Coda's write-disconnected operation is appealing.

Coda is also struggling with portability, relying on special kernel modules instead of FUSE. Currently Linux is the only truly viable platform, with NetBSD close behind (with issues with large directories).

10 Peer-to-peer distribution systems

11 Distributed Social networks

(These aren't cypherpunk filesystems, but they are of similar interest.)

11.1 Diaspora*

Date: 2014-01-22T21:48-0500

Author: Greg Troxel
