BioFile Documentation

This page documents BioFile objects and the functions used to manage collections of biological data.

`BioFile`

BioFile objects collect metadata about biological filetypes.

Parameters:

Name	Type	Description	Default
`sampledict`	`SampleDict`	a SampleDict object from the BioFileDocket.	required
`filename`	`str`	the name of the file.	`''`
`url`	`str`	when downloading a file on object creation, pass a string url along with a protocol.	`None`
`protocol`	`str`	passed along with a url for automatic download on object creation.	`None`
`s3uri`	`str`	the s3uri of the file, if downloading from s3 upon object creation.	`None`
`unzip`	`bool`	whether or not to unzip a file on download. Defaults to True.	`True`

`add_s3uri(s3uri)`

Adds an s3uri to the BioFile object if it doesn't already exist.

`exists()` `property`

bool: checks whether the file currently exists.

`filetype()` `property`

str: infers filetype using the string after the final . in filename.
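The rule above can be illustrated with a small standalone function. This is only a sketch of the described behavior, not the library's code; the function name `infer_filetype` is hypothetical.

```python
def infer_filetype(filename: str) -> str:
    """Return the text after the final '.' in filename, or '' if there is no '.'."""
    # rpartition splits on the last '.', so only the final suffix is kept.
    _, dot, extension = filename.rpartition(".")
    return extension if dot else ""


print(infer_filetype("annotation.gtf"))
print(infer_filetype("genome.fa.gz"))
```

Under this rule a compressed file such as `genome.fa.gz` reports its compression suffix (`gz`) rather than the underlying format.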

`get_from_s3(overwrite=False)`

Downloads the BioFile from AWS S3.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	decide whether to overwrite existing files. Defaults to False.	`False`

`get_from_url(url, protocol, filename='', unzip=True)`

Downloads the BioFile from a URL using a chosen protocol, unzipping optionally.

Parameters:

Name	Type	Description	Default
`url`	`str`	url of the file.	required
`protocol`	`str`	protocol to be used for download.	required
`filename`	`str`	name of file to be saved. If empty, will generate name from URL.	`''`
`unzip`	`bool`	decide whether to unzip if it is a zipped file. Defaults to True.	`True`

`path()` `property`

str: path to the file, including the filename.

`push_to_s3(overwrite=False)`

Uploads the BioFile to AWS S3.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	decide whether to overwrite existing files. Defaults to False.	`False`

`sampledict()` `property`

SampleDict: a SampleDict object for the file.

`species_prefix()` `property`

str: runs prefixify(species).

`unzip()`

Unzips files ending in .gz, .gzip, or .zip.
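As a rough illustration of handling the three suffixes, the sketch below uses only the Python standard library. The function name `unzip_file`, its list return value, and the choice to leave the original archive in place are assumptions made for this example; the actual method may behave differently.

```python
import gzip
import os
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_file(path):
    """Decompress a .gz, .gzip, or .zip file next to the original; return extracted paths."""
    source = Path(path)
    suffix = source.suffix.lower()
    if suffix in (".gz", ".gzip"):
        target = source.with_suffix("")  # genes.txt.gz -> genes.txt
        with gzip.open(source, "rb") as compressed, open(target, "wb") as plain:
            shutil.copyfileobj(compressed, plain)
        return [str(target)]
    if suffix == ".zip":
        with zipfile.ZipFile(source) as archive:
            archive.extractall(source.parent)
            return [str(source.parent / name) for name in archive.namelist()]
    return [str(source)]  # not a recognised archive; nothing to do


# Demonstration on a throwaway gzip file.
with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "genes.txt.gz")
    with gzip.open(archive_path, "wb") as handle:
        handle.write(b"geneA\ngeneB\n")
    extracted = unzip_file(archive_path)
    extracted_name = os.path.basename(extracted[0])
    with open(extracted[0], "rb") as handle:
        extracted_bytes = handle.read()
```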

`BioFileDocket`

BioFileDocket objects collect BioFiles and relate them to one another.

Important files called keyfiles get their own uniquely-named attribute.
These files can be accessed using a dot operator. For example, BFD.genome_fasta would return a GenomeFastaFile object. BFD.genome_fasta.path would return the path to that GenomeFastaFile object.

Parameters:

Name	Type	Description	Default
`species`	`str`	species name in 'Genus_species' format.	required
`conditions`	`str`	unique conditions identifier for dataset.	required

Attributes:

Name	Type	Description
`directory`	`str`	output directory for files in the dataset.
`metadata`	`metadata_object`	a collector for miscellaneous metadata.
`files`	`dict`	a dictionary of assorted files associated with the data.
`keyfiles`	`attr`	unique key-identified attributes representing key files. You can add these using BioFileDocket.add_keyfile() or .add_keyfiles().

`add_file(BioFile)`

Places a BioFile object into the files attribute.

The objects can be accessed using their filename.

Note

The actual files of BioFile objects in this list are not automatically uploaded.

`add_files(list)`

Places a list of files into the files attribute.

`add_keyfile(key, BioFile, overwrite=False)`

Adds a BioFile object using a unique key identifier.

Checks to make sure that key does not already exist. Also makes sure key is only alphanumeric or underscores.

Parameters:

Name	Type	Description	Default
`key`	`str`	a unique key	required
`BioFile`	`BioFile`	a BioFile object	required
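The two checks described above (key uniqueness and allowed characters) could look roughly like the following. `DocketSketch` is a hypothetical stand-in, not the real BioFileDocket, and raising exceptions on failure is an assumption; the library may warn or handle these cases differently.

```python
import re


class DocketSketch:
    """Hypothetical stand-in for a docket that stores keyfiles as attributes."""

    def add_keyfile(self, key, biofile, overwrite=False):
        # Keys become attribute names, so allow only letters, digits, and underscores.
        if not re.fullmatch(r"[A-Za-z0-9_]+", key):
            raise ValueError(f"key {key!r} must be alphanumeric or underscores")
        # Refuse to replace an existing attribute unless overwrite is requested.
        if hasattr(self, key) and not overwrite:
            raise KeyError(f"key {key!r} already exists; pass overwrite=True to replace it")
        setattr(self, key, biofile)


docket = DocketSketch()
docket.add_keyfile("genome_fasta", "placeholder BioFile")
print(docket.genome_fasta)  # dot access, as with BFD.genome_fasta
```

Storing keyfiles as attributes is what makes the dot access described for BioFileDocket possible, which is also why keys are restricted to characters that are valid in attribute names.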

`add_keyfiles(dictionary, overwrite=False)`

Adds a dictionary of BioFile objects using key:BioFile object pairs.

`dill_filename()` `property`

str: the unique filename of the BioFileDocket .pkl file.

`dill_filepath()` `property`

str: the path of the BioFileDocket .pkl file.

`get_from_s3(overwrite=False)`

Downloads the .pkl file for the BioFileDocket from AWS S3.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	decide whether to overwrite existing files. Defaults to False.	`False`

`local_to_s3(overwrite=False)`

Uploads all keyfiles that are BioFiles to S3 using their s3uri attributes.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	decide whether to overwrite existing files. Defaults to False.	`False`

`pickle()`

Creates a .pkl file for the BioFileDocket object.

The filename and path are automatically generated.

`push_to_s3(overwrite=False)`

Uploads the .pkl file for the BioFileDocket to AWS S3.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	decide whether to overwrite existing files. Defaults to False.	`False`

`remove_file(filename)`

Removes a file based on filename from the files attribute.

`remove_keyfile(key, warn=True)`

Deletes a keyfile from the BioFileDocket if warn == False.

`s3_to_local(overwrite=False)`

Downloads all keyfiles that are BioFiles from S3 using their s3uri attributes.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	decide whether to overwrite existing files. Defaults to False.	`False`

`s3uri()` `property`

str: the S3 URI of the BioFileDocket.

`sampledict()` `property`

SampleDict: the full SampleDict of the BioFileDocket.

`set_taxid(taxid)`

Adds a taxid attribute to the BioFileDocket.

Tolerates either int or str input.

`species_prefix()` `property`

str: runs prefixify(species).

`unpickle()`

Unpickles a .pkl file for the BioFileDocket object.

The filename and path are automatically generated.

Returns:

Name	Type	Description
`self`	`BioFileDocket`	returns the unpickled BioFileDocket object.

Raises:

Type	Description
`FileNotFoundError`	if the .pkl file doesn't exist yet.
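The pickle and unpickle round trip can be sketched as below. Everything here is illustrative: `PicklableDocket` is a hypothetical class, the filename scheme is invented for the example, and the standard-library `pickle` module is used on the object's attribute dictionary for portability (the `dill_*` property names suggest the library may use `dill` instead).

```python
import os
import pickle
import tempfile


class PicklableDocket:
    """Hypothetical stand-in; the real class generates its filename and path itself."""

    def __init__(self, species, conditions, directory):
        self.species = species
        self.conditions = conditions
        self.directory = directory

    @property
    def dill_filepath(self):
        # Assumed naming scheme, for illustration only.
        return os.path.join(self.directory, f"{self.species}_{self.conditions}.pkl")

    def pickle(self):
        # Save the object's attributes to the automatically derived path.
        with open(self.dill_filepath, "wb") as handle:
            pickle.dump(vars(self), handle)

    def unpickle(self):
        # Mirror the documented behavior: fail if the .pkl file is missing.
        if not os.path.exists(self.dill_filepath):
            raise FileNotFoundError(self.dill_filepath)
        with open(self.dill_filepath, "rb") as handle:
            self.__dict__.update(pickle.load(handle))
        return self


with tempfile.TemporaryDirectory() as workdir:
    saved = PicklableDocket("Homo_sapiens", "adultbrain", workdir)
    saved.taxid = "9606"
    saved.pickle()
    # A fresh object with the same species and conditions finds the same file.
    fresh = PicklableDocket("Homo_sapiens", "adultbrain", workdir)
    restored_taxid = fresh.unpickle().taxid
```

Because the path is derived from the species and conditions, a newly constructed object with matching values can recover state saved in an earlier session.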

`CellAnnotFile`

Bases: BioFile

A cell annotation matrix file with two columns: 'cell_barcode' and 'celltype'.

Each cell should have a cell barcode exactly matching its name in the Gxc or Exc embedding.

Parameters:

Name	Type	Description	Default
`sources`	`list of BioFile objects`	a list of the source BioFile objects.	required
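A file with the two required columns could be produced as in the sketch below. The tab delimiter, the helper name `write_cellannot`, and the example barcodes and cell types are assumptions for illustration; only the column names come from the description above.

```python
import csv
import io


def write_cellannot(handle, barcode_to_celltype):
    """Write a two-column cell annotation table with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["cell_barcode", "celltype"])
    for barcode, celltype in barcode_to_celltype.items():
        # Barcodes must match the cell names used in the Gxc or Exc matrix exactly.
        writer.writerow([barcode, celltype])


buffer = io.StringIO()
write_cellannot(buffer, {"AAACCTGAGAAACCAT-1": "neuron", "AAACCTGAGAAACCGC-1": "astrocyte"})
cellannot_text = buffer.getvalue()
print(cellannot_text)
```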

`CellRangerBarcodesFile`

Bases: BioFile

A BioFile object for the barcodes file of a CellRanger file.

`CellRangerFeaturesFile`

Bases: BioFile

A BioFile object for the features file of a CellRanger file.

`CellRangerFileGroup`

A special object that collects the multiple files of a CellRanger file.

Parameters:

Name	Type	Description	Default
`sampledict`	`SampleDict`	a SampleDict object from the BioFileDocket.	required
`barcodes_address`	`str`	the address of the barcodes file as either url or s3uri.	required
`features_address`	`str`	the address of the features file as either url or s3uri.	required
`matrix_address`	`str`	the address of the matrix file as either url or s3uri.	required
`how`	`str`	'url' or 's3uri' to choose a method of download.	`None`
`protocol`	`str`	when using url, the protocol to use. Defaults to 'curl'.	`'curl'`

`to_gxc(reference_genome, reference_annot, filename='', overwrite=False)`

Converts the group of cell ranger files to a GxcFile and returns the file object.

`CellRangerMatrixFile`

Bases: BioFile

A BioFile object for the matrix file of a CellRanger file.

`ExcFile`

Bases: BioFile

A BioFile object for an embedding x cells matrix.

Parameters:

Name	Type	Description	Default
`gxcfile`	`GxcFile`	the original `GxcFile` object used to generate the new embedding.	required
`embedding`	`str`	a description of the new embedding.	required

`FoldSeekOutputFile`

Bases: MultiSpeciesFile

A MultiSpeciesFile for FoldSeek output.

`GeneListFile`

Bases: BioFile

A BioFile object for a gene list file, which is a .txt file with a gene on each line.

Defaults to creating a file based on a list of genes passed on object creation.

Parameters:

Name	Type	Description	Default
`sources`	`list of BioFile objects`	a list of the source BioFile objects.	required
`identifier`	`str`	a description of the identifier type for this gene list.	required
`make`	`bool`	whether or not to make the file upon object creation. Defaults to True.	`True`

`get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type)`

Queries the Uniprot API to get ID mapping based on from_type and to_type, returning a UniprotIDMapperFile object.

If the number of genes in the file exceeds 100k, generates temporary files to query the API multiple times.

Parameters:

Name	Type	Description	Default
`ID_MAPPER_LOC`	`str`	path to the ID mapping script, usually at ID_MAPPER_LOC.	required
`from_type`	`str`	the source datatype (e.g. 'Ensembl')	required
`to_type`	`str`	the destination datatype (e.g. 'UniprotKB')	required
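The batching behavior for large gene lists can be pictured with a simple chunking helper. This sketch only shows how a list might be split into groups of at most 100,000 entries; `batch_genes` is a hypothetical name, and the temporary files and API calls made by the real method are not reproduced here.

```python
def batch_genes(genes, batch_size=100_000):
    """Yield consecutive slices of genes, each holding at most batch_size entries."""
    for start in range(0, len(genes), batch_size):
        yield genes[start:start + batch_size]


genes = [f"gene{i}" for i in range(250_000)]
batch_sizes = [len(batch) for batch in batch_genes(genes)]
print(batch_sizes)  # [100000, 100000, 50000]
```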

`get_uniprot_ids_by_genename(ID_MAPPER_LOC, from_type, to_type, taxid)`

Do not use this method; it is not fully built out yet.

`make_file(genes)`

Makes a gene list file at a given location when passed a list object.
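The gene list format is simple enough to show directly: one identifier per line in a plain-text file. The helper name `write_gene_list` and the placeholder gene names are assumptions for this sketch.

```python
import os
import tempfile


def write_gene_list(path, genes):
    """Write one gene identifier per line to a plain-text file."""
    with open(path, "w") as handle:
        for gene in genes:
            handle.write(f"{gene}\n")


with tempfile.TemporaryDirectory() as workdir:
    gene_list_path = os.path.join(workdir, "genes.txt")
    write_gene_list(gene_list_path, ["geneA", "geneB", "geneC"])
    with open(gene_list_path) as handle:
        gene_lines = handle.read().splitlines()
```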

`GenomeFastaFile`

Bases: BioFile

A BioFile object for genome fasta files.

Parameters:

Name	Type	Description	Default
`version`	`str`	the specific version number of the genome.	required

`get_transdecoder_cdna_gtf(GenomeGtfFile, TRANSDECODER_LOC, **kwargs)`

Generates a TransdecoderCdnaFile given a GenomeGtfFile.

Parameters:

Name	Type	Description	Default
`GenomeGtfFile`	`GenomeGtfFile`	a GenomeGtfFile object that is compatible with the GenomeFastaFile.	required
`TRANSDECODER_LOC`	`str`	the path to the parent directory of Transdecoder.	required

Returns:

Name	Type	Description
`cdna_output`	`TransdecoderCdnaFile`	a TransdecoderCdnaFile object.

`rename_RefSeq_chromosomes(replace=False)`

Renames the chromosomes of a RefSeq genome to numerical chromosomes.

`GenomeGffFile`

Bases: BioFile

A BioFile object for a genome reference in GFF format.

If a file ends in .gff3, renames it to end in .gff instead.

Parameters:

Name	Type	Description	Default
`reference_genome`	`GenomeFastaFile`	the BioFile object of the associated genome fasta file.	required
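The `.gff3` to `.gff` renaming rule amounts to a suffix swap. The sketch below operates on the filename string only; `normalize_gff_name` is a hypothetical helper, and whether the real class also renames the file on disk is not shown here.

```python
def normalize_gff_name(filename: str) -> str:
    """Replace a trailing '.gff3' with '.gff'; leave other names untouched."""
    suffix = ".gff3"
    if filename.endswith(suffix):
        return filename[: -len(suffix)] + ".gff"
    return filename


print(normalize_gff_name("annotation.gff3"))
print(normalize_gff_name("annotation.gff"))
```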

`GenomeGtfFile`

Bases: BioFile

A BioFile object for a genome reference in GTF format.

Parameters:

Name	Type	Description	Default
`reference_genome`	`GenomeFastaFile`	the BioFile object of the associated genome fasta file.	required

`GxcFile`

Bases: BioFile

A BioFile object for a genes x cells matrix.

Parameters:

Name	Type	Description	Default
`reference_genome`	`GenomeFastaFile`	the BioFile object of the associated genome fasta file.	required
`reference_annot`	`GenomeGffFile \| GenomeGtfFile`	the BioFile object of the associated genome annotation file.	required

`IdmmFile`

Bases: BioFile

An ID-mapping matrix file. Each row represents a feature and each column represents a name of that feature in a namespace.

Parameters:

Name	Type	Description	Default
`sources`	`list of BioFile objects`	a list of the source BioFile objects.	required
`kind`	`str`	a one-word description of the kind of idmm.	required

`JointExcFile`

Bases: MultiSpeciesFile

A MultiSpeciesFile for collecting an ExcFile with multiple species.

Parameters:

Name	Type	Description	Default
`sources`	`list of GxcFile`	the original `GxcFile` objects used to generate the new embedding.	required
`embedding`	`str`	a description of the new embedding.	required

`LoomFile`

Bases: BioFile

A BioFile object for a Loom file.

Parameters:

Name	Type	Description	Default
`reference_genome`	`GenomeFastaFile`	the BioFile object of the associated genome fasta file.	required
`reference_annot`	`GenomeGtfFile \| GenomeGffFile`	the BioFile object of the associated genome GTF/GFF file.	required

`get_cellannot(overwrite=False)`

Generates a cellannot file from the LOOM file.

Parameters:

Name	Type	Description	Default
`overwrite`	`bool`	whether or not to overwrite the file if it already exists.	`False`

Returns:

Type	Description
`CellAnnotFile`	a `CellAnnotFile` object for the created file.

`get_idmm(id_type, overwrite=False)`

Generates an idmm file from the LOOM file.

Parameters:

Name	Type	Description	Default
`id_type`	`str`	the name of the ID type extracted from the file, e.g. `'ensembl_id'`.	required
`overwrite`	`bool`	whether or not to overwrite the file if it already exists.	`False`

Returns:

Type	Description
`IdmmFile`	an `IdmmFile` object for the created file.

`to_gxc(filename='', overwrite=False)`

Generates a gxc file from the LOOM file.

Parameters:

Name	Type	Description	Default
`filename`	`str`	automatically generated by appending `'.asgxc.tsv'` if not provided.	`''`
`overwrite`	`bool`	whether or not to overwrite the file if it already exists.	`False`

Returns:

Type	Description
`GxcFile`	a `GxcFile` object for the created file.

`MultiSpeciesBioFileDocket`

Bases: BioFileDocket

MultiSpeciesBioFileDocket objects collect BioFileDockets and relate them to one another.

Parameters:

Name	Type	Description	Default
`species_dict`	`dict`	key is species name in 'Genus_species' format; value is conditions. These must be exact matches, or it will not work.	required
`global_conditions`	`str`	a summary identifier for this collection of species' datasets.	required
`analysis_type`	`str`	one-word description of the analysis type.	required

Attributes:

Name	Type	Description
`directory`	`str`	output directory for files in the dataset.

`dill_filename()` `property`

str: the unique filename of the BioFileDocket .pkl file.

`get_BioFileDockets()`

Gets species BioFileDocket .pkl files from S3 for each species in the species_dict.

`local_to_s3(overwrite=False)`

Iteratively calls local_to_s3 on all BioFileDockets in the group.

`s3_to_local(overwrite=False)`

Iteratively calls s3_to_local on all BioFileDockets in the group.

`sampledict()` `property`

SampleDict: a MultiSpeciesBioFileDocket-formatted sampledict.

`species_concat()` `property`

str: concatenation of species prefixes in alphabetical order.

`MultiSpeciesFile`

Bases: BioFile

A special class of BioFile objects for files with multiple associated species.

This subclass uses species_concat as its species and its conditions is a '_' concatenation of its global_conditions and analysis_type.

Parameters:

Name	Type	Description	Default
`species_dict`	`dict`	species dictionary from the MultiSpeciesBioFileDocket.	required
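The naming scheme described above can be sketched as follows. Several details are assumptions: `prefixify` is taken to turn `Genus_species` into a four-letter `Gspe` prefix (inferred from the `Gspe_conditions` directory format documented for `make_output_directory`), and the prefixes are joined with `'_'`, which this page does not specify.

```python
def prefixify(species: str) -> str:
    """Assumed rule: first letter of the genus plus first three letters of the species."""
    genus, epithet = species.split("_", 1)
    return genus[0].upper() + epithet[:3].lower()


def species_concat(species_dict: dict) -> str:
    # Sort the prefixes so the result does not depend on dictionary order.
    return "_".join(sorted(prefixify(name) for name in species_dict))


def multispecies_conditions(global_conditions: str, analysis_type: str) -> str:
    # Documented as a '_' concatenation of the two values.
    return f"{global_conditions}_{analysis_type}"


species_dict = {"Mus_musculus": "adultbrain", "Homo_sapiens": "adultbrain"}
print(species_concat(species_dict))
print(multispecies_conditions("brain", "orthofinder"))
```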

`sampledict()` `property`

SampleDict: a MultiSpeciesBioFileDocket-formatted sampledict.

`species_concat()` `property`

str: concatenation of species prefixes in alphabetical order.

`OrthoFinderOutputFile`

Bases: MultiSpeciesFile

A MultiSpeciesFile for OrthoFinder output.

`SampleDict`

A dictionary containing dataset-specific fields: species, conditions, and directory.

This class is used to uniquely associate BioFile objects with specific datasets.

Parameters:

Name	Type	Description	Default
`species`	`str`	species name in 'Genus_species' format.	required
`conditions`	`str`	unique conditions identifier for dataset.	required
`directory`	`str`	output directory for files in the dataset.	required

`TransdecoderCdnaFile`

Bases: BioFile

A BioFile object for Transdecoder cDNA files.

Parameters:

Name	Type	Description	Default
`reference_genome`	`GenomeFastaFile`	the BioFile object of the associated genome fasta file.	required
`reference_annot`	`GenomeGffFile \| GenomeGtfFile`	the BioFile object of the associated genome annotation file.	required

`to_pep_files(TDLONGORF_LOC, TDPREDICT_LOC)`

Generates a peptide file from a Transdecoder cDNA file.

Parameters:

Name	Type	Description	Default
`TDLONGORF_LOC`	`str`	path to the TransDecoder.LongOrfs binary.	required
`TDPREDICT_LOC`	`str`	path to the TransDecoder.Predict binary.	required

Returns:

Name	Type	Description
`output_dict`	`dict of TransdecoderOutFile`	a dictionary of Transdecoder output files.

`TransdecoderOutFile`

Bases: BioFile

A BioFile object for Transdecoder output files.

Parameters:

Name	Type	Description	Default
`reference_genome`	`GenomeFastaFile`	the BioFile object of the associated genome fasta file.	required
`reference_annot`	`GenomeGffFile \| GenomeGtfFile`	the BioFile object of the associated genome annotation file.	required
`reference_cDNA`	`TransdecoderCdnaFile`	the BioFile object of the associated Transdecoder cDNA file.	required

`UniProtTaxidListFile`

Bases: BioFile

A BioFile object for pulling files from Uniprot based on input taxid.

Defaults to creating a file based on a taxid passed on object creation.

Parameters:

Name	Type	Description	Default
`taxid`	`str \| int`	the taxid of the species of interest.	required
`make`	`bool`	whether or not to make the file upon object creation. Defaults to True.	`True`

`get_proteins()`

Queries the Uniprot API to extract all proteins for the taxid associated with this object.

`UniprotIDMapperFile`

Bases: IdmmFile

A BioFile object for the output file generated by calling the Uniprot ID mapping API.

Parameters:

Name	Type	Description	Default
`from_type`	`str`	the source datatype (e.g. 'Ensembl')	required
`to_type`	`str`	the destination datatype (e.g. 'UniprotKB')	required

`metadata_object`

Bases: object

Simple dummy object to enable dot access to keys.

`add(key, value, replace=True)`

Adds a metadata feature using a unique key identifier.

Checks to make sure that key does not already exist.

Also makes sure key is only alphanumeric or underscores.

Parameters:

Name	Type	Description	Default
`key`	`str`	a unique key	required
`value`	`obj`	any field you want to record	required

`gxc_to_exc(sample_MSD, embedding_df, exc_file)`

Converts a GxcFile into an ExcFile, returning the new file object.

`make_output_directory(species, conditions, stringonly=False)`

Creates an output directory based on species and condition parameters.

Does not create a directory if it already exists.
Output directory is placed in the GLOBAL_OUTPUT_DIRECTORY, as defined by env/install_locs.py. Default GLOBAL_OUTPUT_DIRECTORY is output/.

Parameters:

Name	Type	Description	Default
`species`	`str`	species name in `Genus_species` format.	required
`conditions`	`str`	unique conditions identifier for dataset.	required
`stringonly`	`bool`	whether to only output the string without creating a directory	`False`

Returns:

Name	Type	Description
`output_directory`	`str`	path of the output directory, in format `Gspe_conditions`.
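A minimal sketch of this behavior is shown below. It is not the library's implementation: the function name `output_directory_for` and the `root` parameter (standing in for GLOBAL_OUTPUT_DIRECTORY) are hypothetical, and the four-letter prefix rule is inferred from the `Gspe_conditions` format.

```python
import os
import tempfile


def output_directory_for(species, conditions, root="output", stringonly=False):
    """Build the 'Gspe_conditions' path under root, creating it unless stringonly."""
    genus, epithet = species.split("_", 1)
    prefix = genus[0].upper() + epithet[:3].lower()  # assumed prefix rule
    path = os.path.join(root, f"{prefix}_{conditions}")
    if not stringonly:
        # exist_ok=True leaves an existing directory untouched.
        os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as root:
    only_string = output_directory_for("Homo_sapiens", "adultbrain", root, stringonly=True)
    string_created_dir = os.path.isdir(only_string)
    created = output_directory_for("Homo_sapiens", "adultbrain", root)
    created_exists = os.path.isdir(created)
    created_name = os.path.basename(created)
```

With `stringonly=True` the path is computed without touching the filesystem, which is useful when only the name is needed.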

`s3_transfer(to_loc, from_loc)`

Transfers files to and from AWS S3 and local.

One of the two parameters must be an AWS S3 URI in string format.

Parameters:

Name	Type	Description	Default
`to_loc`	`str`	path of the destination file.	required
`from_loc`	`str`	path of the origin file.	required
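The requirement that one argument be an S3 URI can be checked before any transfer is attempted. The sketch below only classifies the direction of a transfer and performs no network calls; `transfer_direction` is a hypothetical helper, and treating the case where both locations are S3 URIs as a separate category is an assumption.

```python
def transfer_direction(to_loc: str, from_loc: str) -> str:
    """Classify a transfer as 'upload', 'download', or 's3-to-s3'."""
    to_is_s3 = to_loc.startswith("s3://")
    from_is_s3 = from_loc.startswith("s3://")
    if not (to_is_s3 or from_is_s3):
        raise ValueError("one of to_loc or from_loc must be an S3 URI")
    if to_is_s3 and from_is_s3:
        return "s3-to-s3"
    return "upload" if to_is_s3 else "download"


print(transfer_direction("s3://example-bucket/genome.fa", "output/genome.fa"))
print(transfer_direction("output/genome.fa", "s3://example-bucket/genome.fa"))
```

Note the argument order: the destination comes first and the origin second, matching the `s3_transfer(to_loc, from_loc)` signature.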
