About BioFile Handling

BioFile objects

Files in this project are managed using BioFile handling.

BioFile Basics

What are BioFile objects used for?

BioFile objects are used to track the many pieces of metadata associated with a given file. For example, a file might be associated with a specific species and a specific tissue type. A file might have been downloaded from a particular URL and might live at a specific address on Amazon S3. All of these attributes are tracked by a BioFile.

What can I do with BioFile objects?

BioFile objects hold custom attributes depending on the specific type of file. For example, a GenomeFastaFile object might have an associated version. These attributes can themselves be a BioFile object; for example, a GenomeGtfFile might point to the correct GenomeFastaFile it is associated with.

Some BioFile objects also have class-specific functions (referred to as "methods" in Python). For example, the GenomeGffFile class has a method .to_gtf() which converts the GFF file to GTF format, preserving a link to the related GenomeFastaFile.
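The idea of a method that produces a new BioFile while carrying a link forward can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the constructor signature, the keyword-based metadata, and the filenames are all assumptions, and the real GFF-to-GTF conversion is left out so that only the bookkeeping is shown.

```python
class BioFile:
    """Hypothetical base class: a filename plus arbitrary metadata attributes."""

    def __init__(self, filename, **metadata):
        self.filename = filename
        # Store each metadata item as an attribute so it supports dot access.
        for key, value in metadata.items():
            setattr(self, key, value)


class GenomeFastaFile(BioFile):
    pass


class GenomeGtfFile(BioFile):
    pass


class GenomeGffFile(BioFile):
    def to_gtf(self):
        """Return a GenomeGtfFile linked to the same GenomeFastaFile.

        The actual format conversion is omitted in this sketch.
        """
        gtf_name = self.filename.rsplit(".", 1)[0] + ".gtf"
        return GenomeGtfFile(gtf_name, GenomeFastaFile=self.GenomeFastaFile)


# Filenames and the version string are made up for illustration.
fasta = GenomeFastaFile("genome.fa", version="v1")
gff = GenomeGffFile("genes.gff", GenomeFastaFile=fasta)
gtf = gff.to_gtf()

print(gtf.filename)                  # genes.gtf
print(gtf.GenomeFastaFile is fasta)  # True
```

Passing the same GenomeFastaFile object to the new file, rather than copying its filename, means any later change to the FASTA metadata is visible from both annotation files.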

How can I access BioFile attributes?

BioFile attributes can be accessed using dot notation (the period operator), as shown in the image below.

BioFile Attributes

BioFile objects that store links to other BioFile objects can allow for hierarchical access. For example, you could quickly get the filename of the GenomeFastaFile associated with a GenomeGtfFile object by using GenomeGtfFile.GenomeFastaFile.filename.
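A short sketch of direct and hierarchical attribute access is given here. It uses SimpleNamespace objects as stand-ins for real BioFile instances, which is an assumption made to keep the example self-contained; the filenames, species, and version values are invented for illustration.

```python
from types import SimpleNamespace

# Stand-ins for a GenomeFastaFile and a GenomeGtfFile (hypothetical values).
fasta = SimpleNamespace(
    filename="Mmus_genome.fa",
    species="Mus musculus",
    version="v1",
)
gtf = SimpleNamespace(
    filename="Mmus_genes.gtf",
    species="Mus musculus",
    GenomeFastaFile=fasta,  # link to the related file object
)

# Direct access to an attribute of the GTF object.
print(gtf.filename)

# Hierarchical access: follow the link, then read the linked object's attribute.
print(gtf.GenomeFastaFile.filename)
print(gtf.GenomeFastaFile.version)
```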

BioFileDockets

BioFileDockets are another class of objects used in this analysis. BioFileDocket objects store all BioFile objects associated with a specific dataset.

Docket Objects

What are BioFileDocket objects used for?

BioFileDocket objects keep track of all BioFile objects associated with a specific dataset. For example, a mouse adult brain RNA-Seq dataset might have a BioFileDocket that stores the GenomeFastaFile, GenomeGtfFile, IdmmFile, GxcFile, and other objects specific to that dataset.

BioFileDocket objects are particularly useful for collecting the most important BioFile objects generated by each Python script or Jupyter notebook and passing these variables between scripts.

To do this, we use the dill package to create a "pickle" of the BioFileDocket variable. The "pickle" is a file ending in .pkl, which can be loaded into a different script to restore exactly the information stored in the original BioFileDocket.
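The pickle round trip might look roughly like the sketch below. It relies on dill.dump and dill.load, which mirror the standard-library pickle functions; the fallback to pickle is an addition so the example still runs where dill is not installed. The docket here is a SimpleNamespace stand-in with a made-up filename, not the project's BioFileDocket class.

```python
import os
import tempfile
from types import SimpleNamespace

try:
    import dill as serializer  # the package named in the text
except ImportError:
    import pickle as serializer  # standard-library fallback for this sketch

# Stand-in docket holding one file object (hypothetical names and values).
docket = SimpleNamespace(
    species="Mmus",
    gxc=SimpleNamespace(filename="Mmus_gxc.tsv"),
)

# "Script 1": write the docket to a .pkl file.
pkl_path = os.path.join(tempfile.mkdtemp(), "Mmus_docket.pkl")
with open(pkl_path, "wb") as handle:
    serializer.dump(docket, handle)

# "Script 2": load the docket back with the same attributes.
with open(pkl_path, "rb") as handle:
    restored = serializer.load(handle)

print(restored.species)
print(restored.gxc.filename)
```

As with any pickle-based format, a .pkl file should only be loaded if it comes from a trusted source, because unpickling can execute arbitrary code.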

Pickling BioFileDockets is particularly helpful for the analysis pipelines in this project, in which we have to perform analyses on diverse datatypes and species and pass files between individual Python scripts. Some of the analyses we do only work with data in one species, while others require data from multiple species. Pickled Docket objects allow for standardized storage of information between different scripts, as well as a means of tracking the history of changes to many different files associated with each analysis.

How can I access BioFileDocket attributes?

BioFileDocket objects store attributes in the same way as BioFile objects. For example, for a mouse adult brain RNA-Seq dataset, you might get the filename of the genes-by-cells matrix using Mmus_BioFileDocket.gxc.filename.
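One way a docket could expose its files as attributes is sketched below. The add_file method, the constructor argument, and the filename are hypothetical and are not taken from the project's code; only the access pattern Mmus_BioFileDocket.gxc.filename comes from the text above.

```python
class GxcFile:
    """Stand-in for a genes-by-cells matrix file object."""

    def __init__(self, filename):
        self.filename = filename


class BioFileDocket:
    """Stand-in docket that stores each file object as a named attribute."""

    def __init__(self, species_prefix):
        self.species_prefix = species_prefix

    def add_file(self, name, biofile):
        # Hypothetical helper: attach the file under the given attribute name.
        setattr(self, name, biofile)
        return self


Mmus_BioFileDocket = BioFileDocket("Mmus")
Mmus_BioFileDocket.add_file("gxc", GxcFile("Mmus_gxc.tsv"))

# Same dot access pattern as for BioFile objects.
print(Mmus_BioFileDocket.gxc.filename)

# List everything the docket currently tracks.
print(sorted(vars(Mmus_BioFileDocket)))
```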