Stubbytree directory structure

See how Extracted Features (EF) files are stored in a stubbytree structure.

Stubbytree is a directory specification created by HTRC for storing Extracted Features 2.0 files. Previous versions were stored in the pairtree directory specification, which was created by Google, documented by the California Digital Library, and used by HathiTrust and HTRC.

Stubbytree was created to balance the advantages of the pairtree structure for storing massive numbers of files against the long processing times needed to traverse a pairtree directory.

Stubbytree places files in a directory structure based on the file name. The highest-level directory is the HathiTrust source library ID code (i.e., the volume ID prefix), and a single sub-directory below it is named by taking every third character of the cleaned volume ID, starting with the first. For example, the Extracted Features file for the volume with HathiTrust ID nyp.33433070251792 would be located at:

nyp/33759/nyp.33433070251792.json.bz2

A stubbytree directory path is created in the following way:

Start with a root directory and HathiTrust volume ID (htid), such as nyp.33433070251792
Split the HathiTrust volume ID at the first period into a library ID (libid) and volume ID (volid), resulting in nyp and 33433070251792
Clean the volume ID, replacing colons, forward slashes, and periods with +, =, and , respectively
Create a stub name by taking every third character of the cleaned volume ID, starting with the first (33433070251792 to 33759)

The new location is libid/stub/cleaned_htid, where cleaned_htid is the libid, a period, and the cleaned volid (for example, nyp.33433070251792).
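The steps above can be traced on an ID whose volume part actually needs cleaning. The sketch below is illustrative only: the ID uc2.ark:/13960/t4mk66f1d is a made-up example chosen because it contains a colon and forward slashes, and the .json.bz2 extension is assumed from the example path shown earlier.

```python
# Illustrative ID only; chosen because it contains ':' and '/'.
htid = "uc2.ark:/13960/t4mk66f1d"

# Split at the first period only, so later periods stay in the volume ID.
libid, volid = htid.split(".", 1)

# Clean the volume ID: ':' -> '+', '/' -> '=', '.' -> ','
clean_volid = volid.replace(":", "+").replace("/", "=").replace(".", ",")

# Stub name: every third character, starting with the first.
stub = clean_volid[::3]

# Relative location; the extension is an assumption taken from the example path.
relative_path = "/".join([libid, stub, f"{libid}.{clean_volid}.json.bz2"])
print(relative_path)  # uc2/a+30461/uc2.ark+=13960=t4mk66f1d.json.bz2
```

Note that the stub is taken from the cleaned volume ID, so substituted characters such as + can appear in the directory name.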

Here is the general code used to create the stubbytree directories.

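import os

# Replace characters that are awkward in file names: ':' -> '+', '/' -> '=', '.' -> ','
def id_encode(volid):
    return volid.replace(":", "+").replace("/", "=").replace(".", ",")

# htid is the HathiTrust volume ID; stubbytree_root is the root directory
libid, volid = htid.split('.', 1)
clean_volid = id_encode(volid)
fname = os.path.join(stubbytree_root, libid, clean_volid[::3], libid + '.' + clean_volid)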
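Going the other way, a HathiTrust ID can be recovered from the file name alone, since the stub directory carries no extra information. The following is a minimal sketch, not part of the HTRC code: id_decode and htid_from_stubbytree_path are hypothetical helper names, the .json.bz2 extension is assumed from the example path, and the decoding assumes the original volume ID contained no +, =, or , characters.

```python
import os

def id_decode(clean_volid):
    # Assumed inverse of the cleaning step: '+' -> ':', '=' -> '/', ',' -> '.'
    # Only safe if the original volume ID had no '+', '=' or ',' of its own.
    return clean_volid.replace("+", ":").replace("=", "/").replace(",", ".")

def htid_from_stubbytree_path(path, extension=".json.bz2"):
    # Hypothetical helper: keep only the file name and drop the extension.
    fname = os.path.basename(path)
    if fname.endswith(extension):
        fname = fname[: -len(extension)]
    # The library ID is everything before the first period.
    libid, clean_volid = fname.split(".", 1)
    return libid + "." + id_decode(clean_volid)

print(htid_from_stubbytree_path("nyp/33759/nyp.33433070251792.json.bz2"))
# nyp.33433070251792
```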