About BioFile Handling
BioFile objects
Files in this project are managed using BioFile handling.
What are BioFile objects used for?
BioFile objects are used to track the myriad metadata associated with a given type of file. For example, a file might be associated with a specific species and a specific tissue type. A file might have been downloaded from a particular URL and might live at a specific address on Amazon S3. All of these attributes are tracked by a BioFile.
What can I do with BioFile objects?
BioFile objects hold custom attributes depending on the specific type of file.
For example, a GenomeFastaFile object might have an associated version.
These attributes can themselves be BioFile objects; for example, a GenomeGtfFile might point to the GenomeFastaFile it is associated with.
Some BioFile objects also have class-specific functions (referred to as "methods" in Python).
For example, the GenomeGffFile class has a method .to_gtf() that converts the GFF file to GTF format, preserving a link to the related GenomeFastaFile.
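A class-specific method that returns a new object while keeping the link to the related genome might look like the sketch below. It is an illustration under stated assumptions, not the project's code: the reference_genome attribute name is hypothetical, and the real format conversion is deliberately left out (only the filename and the link are handled), since the text does not say how .to_gtf() performs it.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class GenomeFastaFile:
    filename: str


@dataclass
class GenomeGtfFile:
    filename: str
    reference_genome: GenomeFastaFile  # hypothetical name for the link


@dataclass
class GenomeGffFile:
    filename: str
    reference_genome: GenomeFastaFile  # hypothetical name for the link

    def to_gtf(self) -> GenomeGtfFile:
        """Return a GenomeGtfFile that points at the same genome as this GFF file.

        The actual GFF-to-GTF conversion is omitted; a real method would
        write the converted file before returning the new object.
        """
        gtf_name = str(Path(self.filename).with_suffix(".gtf"))
        return GenomeGtfFile(filename=gtf_name, reference_genome=self.reference_genome)


# Placeholder filenames, for illustration only.
fasta = GenomeFastaFile(filename="genome.fa")
gff = GenomeGffFile(filename="annotation.gff", reference_genome=fasta)
gtf = gff.to_gtf()

print(gtf.filename)
print(gtf.reference_genome is fasta)
```

Because the new object carries the same genome reference, later steps can reach the genome from either the GFF or the GTF object.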
How can I access BioFile attributes?
BioFile attributes can be accessed using the period operator, as shown in the image below.
BioFile objects that store links to other BioFile objects can allow for hierarchical access.
For example, you could quickly get the filename of the GenomeFastaFile associated with a GenomeGtfFile object by using GenomeGtfFile.GenomeFastaFile.filename.
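The attribute and hierarchical access patterns described above can be sketched with plain Python dataclasses. This is a minimal illustration, not the project's actual BioFile classes: the field names species, url, s3uri, version, and reference_genome, along with every value shown, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BioFile:
    """Illustrative stand-in for a BioFile: a filename plus metadata."""
    filename: str
    species: str
    url: Optional[str] = None    # where the file was downloaded from, if anywhere
    s3uri: Optional[str] = None  # where the file lives on Amazon S3, if anywhere


@dataclass
class GenomeFastaFile(BioFile):
    version: Optional[str] = None  # type-specific attribute


@dataclass
class GenomeGtfFile(BioFile):
    # An attribute can itself be a BioFile: the genome this annotation describes.
    reference_genome: Optional[GenomeFastaFile] = None


# Placeholder values, for illustration only.
fasta = GenomeFastaFile(filename="genome.fa", species="Mus musculus", version="example-v1")
gtf = GenomeGtfFile(filename="annotation.gtf", species="Mus musculus", reference_genome=fasta)

print(gtf.species)                    # direct attribute access with the period operator
print(gtf.reference_genome.filename)  # hierarchical access through the linked object
```

Each period steps one level deeper, so the chained expression reads as "the filename of the genome linked to this annotation".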
BioFileDockets
BioFileDockets are another class of objects used in this analysis.
What are BioFileDocket objects used for?
BioFileDocket objects keep track of all BioFile objects associated with a specific dataset.
For example, a mouse adult brain RNA-Seq dataset might have a BioFileDocket that stores the GenomeFastaFile, GenomeGtfFile, IdmmFile, GxcFile, and other objects specific to that dataset.
BioFileDocket objects are particularly useful for collecting the most important BioFile objects generated by each Python script or Jupyter notebook and passing these variables between scripts.
To do this, we use the dill package to create a "pickle" of the BioFileDocket variable.
The "pickle" is a file ending in .pkl, which can be loaded into a different script to recover exactly the information stored in the original BioFileDocket.
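The round trip through a .pkl file can be sketched as follows. The two classes, the field names, the filenames, and the species value are hypothetical stand-ins rather than the project's own definitions. The fallback to the standard-library pickle module is also an assumption, added only so the sketch runs where dill is not installed; both modules provide the dump and load calls used here.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path

try:
    import dill as pickler  # the package named in the text
except ImportError:
    import pickle as pickler  # stdlib fallback exposing the same dump/load calls


@dataclass
class GxcFile:
    """Hypothetical stand-in for one BioFile type."""
    filename: str


@dataclass
class BioFileDocket:
    """Hypothetical stand-in for a docket holding one dataset's files."""
    species: str
    gxc: GxcFile


docket = BioFileDocket(species="Mus musculus", gxc=GxcFile(filename="example_gxc.tsv"))

with tempfile.TemporaryDirectory() as tmpdir:
    pkl_path = Path(tmpdir) / "example_docket.pkl"

    # "Script one": write the docket to a .pkl file.
    with open(pkl_path, "wb") as handle:
        pickler.dump(docket, handle)

    # "Script two": load it back and keep working with the same information.
    with open(pkl_path, "rb") as handle:
        restored = pickler.load(handle)

print(restored.species == docket.species)
print(restored.gxc.filename == docket.gxc.filename)
```

Here both steps run in one process for brevity. When the load happens in a separate script with the standard pickle module, the class definitions must be importable there as well.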
Pickling BioFileDockets is particularly helpful for the analysis pipelines in this project, in which we have to perform analyses on diverse data types and species and pass files between individual Python scripts. Some of the analyses we do only work with data from one species, while others require data from multiple species. Pickled BioFileDocket objects allow for standardized storage of information between different scripts, as well as a means of tracking the history of changes to the many different files associated with each analysis.
How can I access BioFileDocket attributes?
BioFileDocket objects store attributes in the same way as BioFile objects. For example, for a mouse adult brain RNA-Seq dataset, you might get the filename of the genes-by-cells matrix using Mmus_BioFileDocket.gxc.filename.