BioFile Documentation
This page describes BioFile objects and their usage.
BioFile objects and functions for managing collections of biological data.
BioFile
BioFile objects collect metadata about biological filetypes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampledict | SampleDict | a SampleDict object from the BioFileDocket. | required |
filename | str | the name of the file. | '' |
url | str | when downloading a file on object creation, pass a string url along with a protocol. | None |
protocol | str | passed along with a url for automatic download on object creation. | None |
s3uri | str | the s3uri of the file, if downloading from s3 upon object creation. | None |
unzip | bool | whether or not to unzip a file on download. Defaults to True. | True |
add_s3uri(s3uri)
Adds an s3uri to the BioFile object if it doesn't already exist.
exists()
property
bool: checks whether the file currently exists.
filetype()
property
str: infers the filetype from the string after the final '.' in filename.
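The rule described for filetype can be illustrated with a minimal sketch. The helper name infer_filetype is hypothetical and is not part of BioFile; returning an empty string for a name without a '.' is an assumption, not documented behavior.

```python
def infer_filetype(filename: str) -> str:
    """Return the text after the final '.' in filename (hypothetical helper)."""
    # rpartition splits on the last '.'; sep is '' when no '.' is present.
    _, sep, suffix = filename.rpartition('.')
    # Assumption: a name with no '.' has no inferable filetype.
    return suffix if sep else ''


print(infer_filetype('genome.fa.gz'))    # gz
print(infer_filetype('annotation.gtf'))  # gtf
```

Note that under this rule a compressed file such as genome.fa.gz reports gz rather than fa, so the filetype of a downloaded file may differ before and after unzipping.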
get_from_s3(overwrite=False)
Downloads the BioFile from AWS S3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | decide whether to overwrite existing files. Defaults to False. | False |
get_from_url(url, protocol, filename='', unzip=True)
Downloads the BioFile from a URL using a chosen protocol, unzipping optionally.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | url of the file. | required |
protocol | str | protocol to be used for download. | required |
filename | str | name of file to be saved. If empty, will generate name from URL. | '' |
unzip | bool | decide whether to unzip if it is a zipped file. Defaults to True. | True |
path()
property
str: path to the file, including the filename.
push_to_s3(overwrite=False)
Uploads the BioFile to AWS S3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | decide whether to overwrite existing files. Defaults to False. | False |
sampledict()
property
SampleDict: a SampleDict object for the file.
species_prefix()
property
str: runs prefixify(species).
unzip()
Unzips files ending in .gz, .gzip, or .zip.
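A minimal sketch of how decompression by extension could work, using only the Python standard library. The function name unzip_file and its return value are assumptions made for illustration; the actual BioFile.unzip() may behave differently, for example in where it writes output or whether it removes the archive.

```python
import gzip
import os
import shutil
import zipfile


def unzip_file(path: str) -> str:
    """Decompress a .gz, .gzip, or .zip file beside the original (hypothetical helper)."""
    base, ext = os.path.splitext(path)
    if ext in ('.gz', '.gzip'):
        # A gzip file holds one stream, so the output is the name minus the suffix.
        with gzip.open(path, 'rb') as src, open(base, 'wb') as dst:
            shutil.copyfileobj(src, dst)
        return base
    if ext == '.zip':
        # A zip archive can hold several members; extract them beside the archive.
        out_dir = os.path.dirname(path) or '.'
        with zipfile.ZipFile(path) as archive:
            archive.extractall(out_dir)
        return out_dir
    # Assumption: any other extension is left untouched.
    return path
```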
BioFileDocket
BioFileDocket objects collect BioFiles and relate them to one another.
Important files, called keyfiles, get their own uniquely named attribute and can be accessed using the dot operator.
For example, BFD.genome_fasta would return a GenomeFastaFile object, and BFD.genome_fasta.path would return the path to that object's file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
species | str | species name in 'Genus_species' format. | required |
conditions | str | unique conditions identifier for dataset. | required |
Attributes:
Name | Type | Description |
---|---|---|
directory | str | output directory for files in the dataset. |
metadata | metadata_object | a collector for miscellaneous metadata. |
files | dict | a dictionary of assorted files associated with the data. |
keyfiles | attr | unique key-identified attributes representing key files. You can add these using BioFileDocket.add_keyfile() or .add_keyfiles(). |
add_file(BioFile)
Places a BioFile object into the files attribute.
The objects can be accessed using their filename.
Note
The actual files of BioFile objects in this list are not automatically uploaded.
add_files(list)
Places a list of files into the files attribute.
add_keyfile(key, BioFile, overwrite=False)
Adds a BioFile object using a unique key identifier.
Checks to make sure that key does not already exist. Also makes sure key is only alphanumeric or underscores.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | a unique key | required |
BioFile | BioFile | a BioFile object | required |
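The two checks described for add_keyfile (the key is unused, and the key contains only alphanumeric characters or underscores) can be sketched as follows. DocketSketch is a hypothetical stand-in and not the real BioFileDocket; the exception types and the exact effect of overwrite are assumptions.

```python
import re


class DocketSketch:
    """Hypothetical stand-in for a docket that stores keyfiles as attributes."""

    def __init__(self):
        self.keyfiles = set()  # names of the keys registered so far

    def add_keyfile(self, key, biofile, overwrite=False):
        # Only letters, digits, and underscores are accepted in a key.
        if not re.fullmatch(r'[A-Za-z0-9_]+', key):
            raise ValueError(f'invalid key: {key!r}')
        # Assumption: an existing key is refused unless overwrite is True.
        if key in self.keyfiles and not overwrite:
            raise KeyError(f'key already exists: {key!r}')
        self.keyfiles.add(key)
        setattr(self, key, biofile)  # enables dot access, e.g. docket.genome_fasta


docket = DocketSketch()
docket.add_keyfile('genome_fasta', 'placeholder for a GenomeFastaFile')
print(docket.genome_fasta)
```

Restricting keys to this character set is what makes dot access possible, since attribute names cannot contain characters such as '-' or '.'.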
add_keyfiles(dictionary, overwrite=False)
Adds a dictionary of BioFile objects using key:BioFile object pairs.
dill_filename()
property
str: the unique filename of the BioFileDocket .pkl file.
dill_filepath()
property
str: the path of the BioFileDocket .pkl file.
get_from_s3(overwrite=False)
Downloads the .pkl file for the BioFileDocket from AWS S3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | decide whether to overwrite existing files. Defaults to False. | False |
local_to_s3(overwrite=False)
Uploads all keyfiles that are BioFiles to S3 using their s3uri attributes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | decide whether to overwrite existing files. Defaults to False. | False |
pickle()
Creates a .pkl file for the BioFileDocket object.
The filename and path are automatically generated.
push_to_s3(overwrite=False)
Uploads the .pkl file for the BioFileDocket to AWS S3.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | decide whether to overwrite existing files. Defaults to False. | False |
remove_file(filename)
Removes a file based on filename from the files attribute.
remove_keyfile(key, warn=True)
Deletes a keyfile from the BioFileDocket if warn == False.
s3_to_local(overwrite=False)
Downloads all keyfiles that are BioFiles from S3 using their s3uri attributes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | decide whether to overwrite existing files. Defaults to False. | False |
s3uri()
property
str: the S3 URI of the BioFileDocket.
sampledict()
property
SampleDict: the full SampleDict of the BioFileDocket.
set_taxid(taxid)
Adds a taxid attribute to the BioFileDocket.
Tolerates either int or str input.
species_prefix()
property
str: runs prefixify(species).
unpickle()
Unpickles a .pkl file for the BioFileDocket object.
The filename and path are automatically generated.
Returns:
Name | Type | Description |
---|---|---|
self | BioFileDocket | returns the unpickled BioFileDocket object. |
Raises:
Type | Description |
---|---|
FileNotFoundError | if the .pkl file doesn't exist yet. |
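The load-or-fail behavior described for unpickle can be sketched with the standard library. This example uses pickle as a stand-in (the dill_filename property suggests the library itself may use dill, which is an inference), and the function name unpickle_docket and its explicit path argument are hypothetical; the real method generates the path automatically.

```python
import os
import pickle


def unpickle_docket(pkl_path: str):
    """Load a pickled object, failing loudly if the .pkl file is absent."""
    if not os.path.exists(pkl_path):
        # Mirrors the documented FileNotFoundError for a missing .pkl file.
        raise FileNotFoundError(f'no .pkl file at {pkl_path}; pickle the docket first')
    with open(pkl_path, 'rb') as handle:
        return pickle.load(handle)
```

A caller that expects the file may be missing can catch FileNotFoundError and fall back to building a fresh docket.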
CellAnnotFile
Bases: BioFile
A cell annotation matrix file with two columns: 'cell_barcode' and 'celltype'.
Each cell should have a cell barcode exactly matching its name in the Gxc or Exc embedding.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sources | list of BioFile objects | a list of the source BioFile objects. | required |
CellRangerBarcodesFile
CellRangerFeaturesFile
CellRangerFileGroup
A special object that collects the multiple files of a CellRanger output: the barcodes, features, and matrix files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sampledict | SampleDict | a SampleDict object from the BioFileDocket. | required |
barcodes_address | str | the address of the barcodes file as either url or s3uri. | required |
features_address | str | the address of the features file as either url or s3uri. | required |
matrix_address | str | the address of the matrix file as either url or s3uri. | required |
how | str | 'url' or 's3uri' to choose a method of download. | None |
protocol | str | when using url, the protocol to use. Defaults to 'curl'. | 'curl' |
to_gxc(reference_genome, reference_annot, filename='', overwrite=False)
Converts the group of cell ranger files to a GxcFile and returns the file object.
CellRangerMatrixFile
ExcFile
FoldSeekOutputFile
GeneListFile
Bases: BioFile
A BioFile object for a gene list file, which is a .txt file with a gene on each line.
Defaults to creating a file based on a list of genes passed on object creation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sources | list of BioFile objects | a list of the source BioFile objects. | required |
identifier | str | a description of the identifier type for this gene list. | required |
make | bool | whether or not to make the file upon object creation. Defaults to True. | True |
get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type)
Queries the Uniprot API to get ID mapping based on from_type and to_type, returning a UniprotIDMapperFile object.
If the number of genes in the file exceeds 100k, generates temporary files to query the API multiple times.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ID_MAPPER_LOC | str | path to the ID mapping script, usually at ID_MAPPER_LOC. | required |
from_type | str | the source datatype (e.g. 'Ensembl') | required |
to_type | str | the destination datatype (e.g. 'UniprotKB') | required |
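The splitting step described for get_uniprot_ids, where a gene list above 100k entries is queried in several passes, can be sketched as a simple batching generator. The name batch_genes is hypothetical, and this sketch only shows the slicing; how the real method writes temporary files and calls the mapping script is not shown here.

```python
def batch_genes(genes, batch_size=100_000):
    """Yield successive slices of at most batch_size genes (hypothetical helper)."""
    for start in range(0, len(genes), batch_size):
        yield genes[start:start + batch_size]


# A list of 250,000 identifiers is split into three batches.
genes = [f'gene{i}' for i in range(250_000)]
print([len(batch) for batch in batch_genes(genes)])  # [100000, 100000, 50000]
```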
get_uniprot_ids_by_genename(ID_MAPPER_LOC, from_type, to_type, taxid)
Don't use this, it's not built out yet.
make_file(genes)
Makes a gene list file at a given location when passed a list object.
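The gene list format described above, a .txt file with one gene per line, can be written and read back with a few lines of standard-library Python. The helper names write_gene_list and read_gene_list are hypothetical and are shown only to illustrate the file layout.

```python
def write_gene_list(path, genes):
    """Write one gene identifier per line (hypothetical helper)."""
    with open(path, 'w') as handle:
        for gene in genes:
            handle.write(f'{gene}\n')


def read_gene_list(path):
    """Read the file back into a list, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]
```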
GenomeFastaFile
Bases: BioFile
A BioFile object for genome fasta files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
version | str | the specific version number of the genome. | required |
get_transdecoder_cdna_gtf(GenomeGtfFile, TRANSDECODER_LOC, **kwargs)
Generates a TransdecoderCdnaFile given a GenomeGtfFile.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
GenomeGtfFile | GenomeGtfFile | a GenomeGtfFile object that is compatible with the GenomeFastaFile. | required |
TRANSDECODER_LOC | str | the path to the parent directory of Transdecoder. | required |
Returns:
Name | Type | Description |
---|---|---|
cdna_output | TransdecoderCdnaFile | a TransdecoderCdnaFile object. |
rename_RefSeq_chromosomes(replace=False)
Renames the chromosomes of a RefSeq genome to numerical chromosomes.
GenomeGffFile
Bases: BioFile
A BioFile object for a genome reference in GFF format.
If a file ends in .gff3, renames it to end in .gff instead.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_genome | GenomeFastaFile | the BioFile object of the associated genome fasta file. | required |
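The .gff3 to .gff renaming rule described for GenomeGffFile can be sketched as a pure string operation. The function name normalize_gff_name is hypothetical; the real class presumably also renames the file on disk, which this sketch does not do.

```python
def normalize_gff_name(filename: str) -> str:
    """Return filename with a trailing .gff3 replaced by .gff (hypothetical helper)."""
    suffix = '.gff3'
    if filename.endswith(suffix):
        return filename[:-len(suffix)] + '.gff'
    # Names that do not end in .gff3 are returned unchanged.
    return filename


print(normalize_gff_name('Drer_genomic.gff3'))  # Drer_genomic.gff
```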
GenomeGtfFile
Bases: BioFile
A BioFile object for a genome reference in GTF format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_genome | GenomeFastaFile | the BioFile object of the associated genome fasta file. | required |
GxcFile
Bases: BioFile
A BioFile object for a genes x cells matrix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_genome | GenomeFastaFile | the BioFile object of the associated genome fasta file. | required |
reference_annot | GenomeGffFile or GenomeGtfFile | the BioFile object of the associated genome annotation file. | required |
IdmmFile
Bases: BioFile
An ID-mapping matrix file. Each row represents a feature and each column represents a name of that feature in a namespace.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sources | list of BioFile objects | a list of the source BioFile objects. | required |
kind | str | a one-word description of the kind of idmm. | required |
JointExcFile
Bases: MultiSpeciesFile
A MultiSpeciesFile for collecting an ExcFile with multiple species.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sources | list of GxcFile | the original | required |
embedding | str | a description of the new embedding. | required |
LoomFile
Bases: BioFile
A BioFile object for a Loom file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_genome | GenomeFastaFile | the BioFile object of the associated genome fasta file. | required |
reference_annot | GenomeGtfFile or GenomeGffFile | the BioFile object of the associated genome GTF/GFF file. | required |
get_cellannot(overwrite=False)
Generates a cellannot file from the LOOM file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
overwrite | bool | whether or not to overwrite the file if it already exists. | False |
Returns:
Type | Description |
---|---|
CellAnnotFile | a |
get_idmm(id_type, overwrite=False)
Generates an idmm file from the LOOM file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id_type | str | the name of the ID type extracted from the file, e.g. | required |
overwrite | bool | whether or not to overwrite the file if it already exists. | False |
Returns:
Type | Description |
---|---|
IdmmFile | an |
to_gxc(filename='', overwrite=False)
Generates a gxc file from the LOOM file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
filename | str | automatically generated by appending | '' |
overwrite | bool | whether or not to overwrite the file if it already exists. | False |
Returns:
Type | Description |
---|---|
GxcFile | a |
MultiSpeciesBioFileDocket
Bases: BioFileDocket
MultiSpeciesBioFileDocket objects collect BioFileDockets and relate them to one another.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
species_dict | dict | key is species name in 'Genus_species' format; value is conditions. These must be exact matches, or it will not work. | required |
global_conditions | str | a summary identifier for this collection of species' datasets. | required |
analysis_type | str | one-word description of the analysis type. | required |
Attributes:
Name | Type | Description |
---|---|---|
directory | str | output directory for files in the dataset. |
dill_filename()
property
str: the unique filename of the BioFileDocket .pkl file.
get_BioFileDockets()
Gets species BioFileDocket .pkl files from S3 for each species in the species_dict.
local_to_s3(overwrite=False)
Iteratively calls local_to_s3 on all BioFileDockets in the group.
s3_to_local(overwrite=False)
Iteratively calls s3_to_local on all BioFileDockets in the group.
sampledict()
property
SampleDict: a MultiSpeciesBioFileDocket-formatted sampledict.
species_concat()
property
str: concatenation of species prefixes in alphabetical order.
MultiSpeciesFile
Bases: BioFile
A special class of BioFile objects for files with multiple associated species.
This subclass uses species_concat as its species, and its conditions is a '_' concatenation of its global_conditions and analysis_type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
species_dict | dict | species dictionary from the MultiSpeciesBioFileDocket. | required |
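The naming scheme described for MultiSpeciesFile can be sketched as two small functions. Both function names are hypothetical, the example prefixes are made up for illustration (the output of prefixify is not documented here), and the empty separator between prefixes is an assumption; only the alphabetical ordering and the '_' join of global_conditions and analysis_type come from the description.

```python
def species_concat(prefixes, sep=''):
    """Join species prefixes in alphabetical order (sep is an assumption)."""
    return sep.join(sorted(prefixes))


def joint_conditions(global_conditions, analysis_type):
    """Build the conditions string as a '_' concatenation of the two fields."""
    return '_'.join([global_conditions, analysis_type])


# Hypothetical prefixes and identifiers, used only to show the shape of the result.
print(species_concat(['Mmus', 'Drer', 'Xlae']))
print(joint_conditions('adultbrain', 'OrthoFinder'))
```

Sorting before joining means the same set of species always yields the same identifier, regardless of the order in which they were supplied.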
sampledict()
property
SampleDict: a MultiSpeciesBioFileDocket-formatted sampledict.
species_concat()
property
str: concatenation of species prefixes in alphabetical order.
OrthoFinderOutputFile
SampleDict
A dictionary containing dataset-specific fields: species, conditions, and directory.
This class is used to uniquely associate BioFile objects with specific datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
species | str | species name in 'Genus_species' format. | required |
conditions | str | unique conditions identifier for dataset. | required |
directory | str | output directory for files in the dataset. | required |
TransdecoderCdnaFile
Bases: BioFile
A BioFile object for Transdecoder cDNA files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_genome | GenomeFastaFile | the BioFile object of the associated genome fasta file. | required |
reference_annot | GenomeGffFile or GenomeGtfFile | the BioFile object of the associated genome annotation file. | required |
to_pep_files(TDLONGORF_LOC, TDPREDICT_LOC)
Generates a peptide file from a Transdecoder cDNA file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
TDLONGORF_LOC | str | path to the TransDecoder.LongOrfs binary. | required |
TDPREDICT_LOC | str | path to the TransDecoder.Predict binary. | required |
Returns:
Name | Type | Description |
---|---|---|
output_dict | dict of TransdecoderOutFile | a dictionary of Transdecoder output files. |
TransdecoderOutFile
Bases: BioFile
A BioFile object for Transdecoder output files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
reference_genome | GenomeFastaFile | the BioFile object of the associated genome fasta file. | required |
reference_annot | GenomeGffFile or GenomeGtfFile | the BioFile object of the associated genome annotation file. | required |
reference_cDNA | TransdecoderCdnaFile | the BioFile object of the associated Transdecoder cDNA file. | required |
UniProtTaxidListFile
Bases: BioFile
A BioFile object for pulling files from Uniprot based on input taxid.
Defaults to creating a file based on a taxid passed on object creation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
taxid | str or int | the taxid of the species of interest. | required |
make | bool | whether or not to make the file upon object creation. Defaults to True. | True |
get_proteins()
Queries the Uniprot API to extract all proteins for the taxid associated with this object.
UniprotIDMapperFile
Bases: IdmmFile
A BioFile object for the output file generated by calling the Uniprot ID mapping API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
from_type | str | the source datatype (e.g. 'Ensembl') | required |
to_type | str | the destination datatype (e.g. 'UniprotKB') | required |
metadata_object
Bases: object
Simple dummy object to enable dot access to keys.
add(key, value, replace=True)
Adds a metadata feature using a unique key identifier.
- Checks to make sure that key does not already exist.
- Also makes sure key is only alphanumeric or underscores.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
key | str | a unique key | required |
value | obj | any field you want to record | required |
gxc_to_exc(sample_MSD, embedding_df, exc_file)
Converts a GxcFile into an ExcFile, returning the new file object.
make_output_directory(species, conditions, stringonly=False)
Creates an output directory based on species and condition parameters.
Does not create a directory if it already exists.
The output directory is placed in the GLOBAL_OUTPUT_DIRECTORY, as defined by env/install_locs.py.
The default GLOBAL_OUTPUT_DIRECTORY is output/.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
species | str | species name in | required |
conditions | str | unique conditions identifier for dataset. | required |
stringonly | bool | whether to only output the string without creating a directory | False |
Returns:
Name | Type | Description |
---|---|---|
output_directory | str | path of the output directory, in format |
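The behavior described for make_output_directory (build a path from species and conditions, create it only if it is missing, and skip creation when stringonly is True) can be sketched as below. The directory naming pattern, which joins species and conditions with '_', is an assumption because the documented format is not shown; the root parameter and the function name are also hypothetical, added so the sketch can be pointed at any location.

```python
import os

GLOBAL_OUTPUT_DIRECTORY = 'output/'  # the default named in the description


def make_output_directory_sketch(species, conditions, stringonly=False,
                                 root=GLOBAL_OUTPUT_DIRECTORY):
    """Return (and optionally create) an output directory path."""
    # Assumption: the directory name joins species and conditions with '_'.
    output_directory = os.path.join(root, f'{species}_{conditions}', '')
    if not stringonly:
        # exist_ok=True makes the call a no-op when the directory already exists.
        os.makedirs(output_directory, exist_ok=True)
    return output_directory
```

Passing stringonly=True is useful when a caller only needs the path, for example to check whether results already exist before doing any work.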
s3_transfer(to_loc, from_loc)
Transfers files between AWS S3 and the local filesystem.
One of the two parameters must be an AWS S3 URI in string format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
to_loc | str | path of the destination file. | required |
from_loc | str | path of the origin file. | required |
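The requirement that one of the two locations be an S3 URI can be checked before any transfer is attempted. The sketch below only classifies the direction and does not contact AWS; the function name transfer_direction, the returned labels, and the handling of the case where both locations are S3 URIs are assumptions made for illustration.

```python
def transfer_direction(to_loc: str, from_loc: str) -> str:
    """Classify a transfer as 'upload', 'download', or 's3-to-s3' (hypothetical helper)."""
    to_s3 = to_loc.startswith('s3://')
    from_s3 = from_loc.startswith('s3://')
    if not (to_s3 or from_s3):
        raise ValueError('one of to_loc or from_loc must be an S3 URI')
    if to_s3 and from_s3:
        # Assumption: the description does not say how this case is handled.
        return 's3-to-s3'
    return 'upload' if to_s3 else 'download'


print(transfer_direction('s3://bucket/genome.fa', 'output/genome.fa'))  # upload
```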