BioFile Documentation

This page describes BioFile objects and their usage.

It covers BioFile objects and the functions for managing collections of biological data.

BioFile

BioFile objects collect metadata about biological filetypes.

Parameters:

Name Type Description Default
sampledict SampleDict

a SampleDict object from the BioFileDocket.

required
filename str

the name of the file.

''
url str

when downloading a file on object creation, pass a string url along with a protocol.

None
protocol str

passed along with a url for automatic download on object creation.

None
s3uri str

the s3uri of the file, if downloading from s3 upon object creation.

None
unzip bool

whether or not to unzip a file on download. Defaults to True.

True

add_s3uri(s3uri)

Adds an s3uri to the BioFile object if it doesn't already exist.

exists() property

bool: checks whether the file currently exists.

filetype() property

str: infers filetype using the string after the final . in filename.
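
The rule above can be sketched in a few lines. This is an illustrative stand-in rather than the library's code: infer_filetype is a hypothetical helper name, and returning an empty string for names without a dot is an assumption.

```python
def infer_filetype(filename: str) -> str:
    """Return the text after the final '.' in filename (hypothetical helper)."""
    # Assumption: a name with no '.' has no inferable filetype.
    if "." not in filename:
        return ""
    return filename.rsplit(".", 1)[1]


print(infer_filetype("genome.fasta"))     # fasta
print(infer_filetype("genome.fasta.gz"))  # gz
```

Under this rule a compressed file such as genome.fasta.gz reports gz rather than fasta.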

get_from_s3(overwrite=False)

Downloads the BioFile from AWS S3.

Parameters:

Name Type Description Default
overwrite bool

decide whether to overwrite existing files. Defaults to False.

False

get_from_url(url, protocol, filename='', unzip=True)

Downloads the BioFile from a URL using a chosen protocol, unzipping optionally.

Parameters:

Name Type Description Default
url str

url of the file.

required
protocol str

protocol to be used for download.

required
filename str

name of file to be saved. If empty, will generate name from URL.

''
unzip bool

decide whether to unzip if it is a zipped file. Defaults to True.

True
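
When filename is left empty, the name is generated from the URL. One plausible way to do that is to take the last path component, as sketched below. The helper name filename_from_url and the example URL are hypothetical, and the library's actual naming rule may differ.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Take the last path component of a URL as the local filename."""
    # Hypothetical helper: query strings and fragments are ignored by urlparse().path.
    return PurePosixPath(urlparse(url).path).name


print(filename_from_url("https://example.org/data/genome.fasta.gz"))  # genome.fasta.gz
```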

path() property

str: path to the file, including the filename.

push_to_s3(overwrite=False)

Uploads the BioFile to AWS S3.

Parameters:

Name Type Description Default
overwrite bool

decide whether to overwrite existing files. Defaults to False.

False

sampledict() property

SampleDict: a SampleDict object for the file.

species_prefix() property

str: runs prefixify(species).

unzip()

Unzips files ending in .gz, .gzip, or .zip.
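
A minimal sketch of suffix-based decompression using only the standard library. The function name unzip_file is hypothetical, and the handling of .zip archives (extracting only the first member) is an assumption made to keep the example short.

```python
import gzip
import shutil
import tempfile
import zipfile
from pathlib import Path


def unzip_file(path: Path) -> Path:
    """Decompress a .gz/.gzip/.zip file next to the original; return the new path."""
    if path.suffix in (".gz", ".gzip"):
        target = path.with_suffix("")  # drop the compression suffix
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return target
    if path.suffix == ".zip":
        # Assumption: the archive holds a single member of interest.
        with zipfile.ZipFile(path) as archive:
            member = archive.namelist()[0]
            archive.extract(member, path.parent)
        return path.parent / member
    return path  # not a recognized archive: leave untouched


# Demonstration on a throwaway gzip file.
workdir = Path(tempfile.mkdtemp())
zipped = workdir / "genes.txt.gz"
with gzip.open(zipped, "wt") as handle:
    handle.write("geneA\ngeneB\n")
unzipped = unzip_file(zipped)
print(unzipped.name, unzipped.read_text().splitlines())
```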

BioFileDocket

BioFileDocket objects collect BioFiles and relate them to one another.

Important files called keyfiles get their own uniquely-named attribute.
These files can be accessed using a dot operator. For example, BFD.genome_fasta would return a GenomeFastaFile object. BFD.genome_fasta.path would return the path to that GenomeFastaFile object.

Parameters:

Name Type Description Default
species str

species name in 'Genus_species' format.

required
conditions str

unique conditions identifier for dataset.

required

Attributes:

Name Type Description
directory str

output directory for files in the dataset.

metadata metadata_object

a collector for miscellaneous metadata.

files dict

a dictionary of assorted files associated with the data.

keyfiles attr

unique key-identified attributes representing key files. You can add these using BioFileDocket.add_keyfile() or .add_keyfiles().

add_file(BioFile)

Places a BioFile object into the files attribute.

The objects can be accessed using their filename.

Note

The actual files of BioFile objects in this list are not automatically uploaded.

add_files(list)

Places a list of files into the files attribute.

add_keyfile(key, BioFile, overwrite=False)

Adds a BioFile object using a unique key identifier.

Checks to make sure that key does not already exist. Also makes sure key is only alphanumeric or underscores.

Parameters:

Name Type Description Default
key str

a unique key

required
BioFile BioFile

a BioFile object

required
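
The key checks and the dot access described above can be illustrated with a toy class. MiniDocket is hypothetical, not the real BioFileDocket, and the choice to raise exceptions (rather than warn) and the meaning of overwrite are assumptions.

```python
import re


class MiniDocket:
    """Toy stand-in for a docket that stores key files as attributes."""

    def add_keyfile(self, key: str, biofile: object, overwrite: bool = False) -> None:
        # Keys must be alphanumeric or underscores, as described above.
        if not re.fullmatch(r"[A-Za-z0-9_]+", key):
            raise ValueError(f"invalid key: {key!r}")
        # Assumption: an existing key is only replaced when overwrite=True.
        if hasattr(self, key) and not overwrite:
            raise KeyError(f"key already exists: {key!r}")
        setattr(self, key, biofile)


docket = MiniDocket()
docket.add_keyfile("genome_fasta", "Genus_species.fasta")
print(docket.genome_fasta)  # dot access to the stored object
```

Storing keyfiles with setattr is what makes access such as BFD.genome_fasta possible, which is also why keys are restricted to characters that are valid in attribute names.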

add_keyfiles(dictionary, overwrite=False)

Adds a dictionary of BioFile objects using key:BioFile object pairs.

dill_filename() property

str: the unique filename of the BioFileDocket .pkl file.

dill_filepath() property

str: the path of the BioFileDocket .pkl file.

get_from_s3(overwrite=False)

Downloads the .pkl file for the BioFileDocket from AWS S3.

Parameters:

Name Type Description Default
overwrite bool

decide whether to overwrite existing files. Defaults to False.

False

local_to_s3(overwrite=False)

Uploads all keyfiles that are BioFiles to S3 using their s3uri attributes.

Parameters:

Name Type Description Default
overwrite bool

decide whether to overwrite existing files. Defaults to False.

False

pickle()

Creates a .pkl file for the BioFileDocket object.

The filename and path are automatically generated.

push_to_s3(overwrite=False)

Uploads the .pkl file for the BioFileDocket to AWS S3.

Parameters:

Name Type Description Default
overwrite bool

decide whether to overwrite existing files. Defaults to False.

False

remove_file(filename)

Removes a file based on filename from the files attribute.

remove_keyfile(key, warn=True)

Deletes a keyfile from the BioFileDocket, but only when warn is False.

s3_to_local(overwrite=False)

Downloads all keyfiles that are BioFiles from S3 using their s3uri attributes.

Parameters:

Name Type Description Default
overwrite bool

decide whether to overwrite existing files. Defaults to False.

False

s3uri() property

str: the S3 URI of the BioFileDocket.

sampledict() property

SampleDict: the full SampleDict of the BioFileDocket.

set_taxid(taxid)

Adds a taxid attribute to the BioFileDocket.

Tolerates either int or str input.

species_prefix() property

str: runs prefixify(species).

unpickle()

Unpickles a .pkl file for the BioFileDocket object.

The filename and path are automatically generated.

Returns:

Name Type Description
self BioFileDocket

returns the unpickled BioFileDocket object.

Raises:

Type Description
FileNotFoundError

if the .pkl file doesn't exist yet.
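
The save and restore cycle of pickle() and unpickle() can be sketched as follows. The property names dill_filename and dill_filepath suggest the dill package; this sketch uses the standard-library pickle module instead so that it runs without extra dependencies. The helper names and the filename pattern built from species and conditions are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def pkl_path(directory: Path, species: str, conditions: str) -> Path:
    # Assumption: the filename is derived from species and conditions.
    return directory / f"{species}_{conditions}.pkl"


def save_state(directory: Path, state: dict) -> Path:
    """Write the state dictionary to its automatically generated .pkl path."""
    path = pkl_path(directory, state["species"], state["conditions"])
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    return path


def load_state(directory: Path, species: str, conditions: str) -> dict:
    """Read the state back, failing loudly if nothing was saved yet."""
    path = pkl_path(directory, species, conditions)
    if not path.exists():
        raise FileNotFoundError(f"no saved docket at {path}")
    with open(path, "rb") as handle:
        return pickle.load(handle)


workdir = Path(tempfile.mkdtemp())
state = {"species": "Genus_species", "conditions": "demo", "files": ["genome.fasta"]}
save_state(workdir, state)
print(load_state(workdir, "Genus_species", "demo") == state)  # True
```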

CellAnnotFile

Bases: BioFile

A cell annotation matrix file with two columns: 'cell_barcode' and 'celltype'.

Each cell should have a cell barcode exactly matching its name in the Gxc or Exc embedding.

Parameters:

Name Type Description Default
sources list of BioFile objects

a list of the source BioFile objects.

required
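
A small sketch of the two-column layout and of the barcode check implied above. The barcodes, cell type labels, and the tab delimiter are all hypothetical choices for illustration.

```python
import csv
import io

# Hypothetical barcodes and labels, for illustration only.
annotations = [("AAACCTG-1", "neuron"), ("AAACGGG-1", "glia")]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["cell_barcode", "celltype"])
writer.writerows(annotations)
print(buffer.getvalue())

# Check that every annotated barcode exactly matches a cell name in the matrix.
matrix_cells = {"AAACCTG-1", "AAACGGG-1", "AAAGATG-1"}
missing = [barcode for barcode, _ in annotations if barcode not in matrix_cells]
print(missing)  # []
```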

CellRangerBarcodesFile

Bases: BioFile

A BioFile object for the barcodes file of a CellRanger file.

CellRangerFeaturesFile

Bases: BioFile

A BioFile object for the features file of a CellRanger file.

CellRangerFileGroup

A special object that collects the multiple files of a CellRanger file.

Parameters:

Name Type Description Default
sampledict SampleDict

a SampleDict object from the BioFileDocket.

required
barcodes_address str

the address of the barcodes file as either url or s3uri.

required
features_address str

the address of the features file as either url or s3uri.

required
matrix_address str

the address of the matrix file as either url or s3uri.

required
how str

'url' or 's3uri' to choose a method of download.

None
protocol str

when using url, the protocol to use. Defaults to 'curl'.

'curl'

to_gxc(reference_genome, reference_annot, filename='', overwrite=False)

Converts the group of CellRanger files to a GxcFile and returns the file object.

CellRangerMatrixFile

Bases: BioFile

A BioFile object for the matrix file of a CellRanger file.

ExcFile

Bases: BioFile

A BioFile object for an embedding x cells matrix.

Parameters:

Name Type Description Default
gxcfile GxcFile

the original GxcFile object used to generate the new embedding.

required
embedding str

a description of the new embedding.

required

FoldSeekOutputFile

Bases: MultiSpeciesFile

A MultiSpeciesFile for FoldSeek output.

GeneListFile

Bases: BioFile

A BioFile object for a gene list file, which is a .txt file with one gene on each line.

Defaults to creating a file based on a list of genes passed on object creation.

Parameters:

Name Type Description Default
sources list of BioFile objects

a list of the source BioFile objects.

required
identifier str

a description of the identifier type for this gene list.

required
make bool

whether or not to make the file upon object creation. Defaults to True.

True

get_uniprot_ids(ID_MAPPER_LOC, from_type, to_type)

Queries the Uniprot API to get ID mapping based on from_type and to_type, returning a UniprotIDMapperFile object.

If the number of genes in the file exceeds 100k, generates temporary files to query the API multiple times.

Parameters:

Name Type Description Default
ID_MAPPER_LOC str

path to the ID mapping script, usually at ID_MAPPER_LOC.

required
from_type str

the source datatype (e.g. 'Ensembl')

required
to_type str

the destination datatype (e.g. 'UniprotKB')

required
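
The batching behavior for large gene lists can be sketched as a simple chunking generator. Only the 100k threshold comes from the description above; the helper name batches and the gene names are hypothetical, and the real method writes temporary files rather than yielding lists.

```python
from typing import Iterator, List

BATCH_SIZE = 100_000  # threshold quoted in the description above


def batches(genes: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` genes."""
    for start in range(0, len(genes), size):
        yield genes[start:start + size]


genes = [f"gene{i}" for i in range(250_000)]
print([len(batch) for batch in batches(genes)])  # [100000, 100000, 50000]
```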

get_uniprot_ids_by_genename(ID_MAPPER_LOC, from_type, to_type, taxid)

Don't use this, it's not built out yet.

make_file(genes)

Makes a gene list file at a given location when passed a list object.
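
A gene list file of this shape can be produced with a few lines of standard-library code. write_gene_list is a hypothetical helper, and the gene names are placeholders.

```python
import tempfile
from pathlib import Path


def write_gene_list(genes: list, path: Path) -> Path:
    """Write one gene identifier per line to a .txt file."""
    path.write_text("".join(f"{gene}\n" for gene in genes))
    return path


out = write_gene_list(["geneA", "geneB", "geneC"], Path(tempfile.mkdtemp()) / "genes.txt")
print(out.read_text().splitlines())  # ['geneA', 'geneB', 'geneC']
```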

GenomeFastaFile

Bases: BioFile

A BioFile object for genome fasta files.

Parameters:

Name Type Description Default
version str

the specific version number of the genome.

required

get_transdecoder_cdna_gtf(GenomeGtfFile, TRANSDECODER_LOC, **kwargs)

Generates a TransdecoderCdnaFile given a GenomeGtfFile.

Parameters:

Name Type Description Default
GenomeGtfFile GenomeGtfFile

a GenomeGtfFile object that is compatible with the GenomeFastaFile.

required
TRANSDECODER_LOC str

the path to the parent directory of Transdecoder.

required

Returns:

Name Type Description
cdna_output TransdecoderCdnaFile

a TransdecoderCdnaFile object.

rename_RefSeq_chromosomes(replace=False)

Renames the chromosomes of a RefSeq genome to numerical chromosomes.

GenomeGffFile

Bases: BioFile

A BioFile object for a genome reference in GFF format.

If a file ends in .gff3, renames it to end in .gff instead.

Parameters:

Name Type Description Default
reference_genome GenomeFastaFile

the BioFile object of the associated genome fasta file.

required
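
The .gff3 to .gff renaming can be sketched with pathlib. normalize_gff_name is a hypothetical helper; when and how the library performs the rename is not specified here.

```python
import tempfile
from pathlib import Path


def normalize_gff_name(path: Path) -> Path:
    """Rename a file ending in .gff3 so that it ends in .gff instead."""
    if path.suffix == ".gff3":
        target = path.with_suffix(".gff")
        path.rename(target)
        return target
    return path  # other suffixes are left alone


original = Path(tempfile.mkdtemp()) / "annotation.gff3"
original.write_text("##gff-version 3\n")
renamed = normalize_gff_name(original)
print(renamed.name, original.exists())  # annotation.gff False
```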

GenomeGtfFile

Bases: BioFile

A BioFile object for a genome reference in GTF format.

Parameters:

Name Type Description Default
reference_genome GenomeFastaFile

the BioFile object of the associated genome fasta file.

required

GxcFile

Bases: BioFile

A BioFile object for a genes x cells matrix.

Parameters:

Name Type Description Default
reference_genome GenomeFastaFile

the BioFile object of the associated genome fasta file.

required
reference_annot GenomeGffFile | GenomeGtfFile

the BioFile object of the associated genome annotation file.

required

IdmmFile

Bases: BioFile

An ID-mapping matrix file. Each row represents a feature and each column represents a name of that feature in a namespace.

Parameters:

Name Type Description Default
sources list of BioFile objects

a list of the source BioFile objects.

required
kind str

a one-word description of the kind of idmm.

required

JointExcFile

Bases: MultiSpeciesFile

A MultiSpeciesFile for collecting an ExcFile with multiple species.

Parameters:

Name Type Description Default
sources list of GxcFile

the original GxcFile objects used to generate the new embedding.

required
embedding str

a description of the new embedding.

required

LoomFile

Bases: BioFile

A BioFile object for a Loom file.

Parameters:

Name Type Description Default
reference_genome GenomeFastaFile

the BioFile object of the associated genome fasta file.

required
reference_annot GenomeGtfFile | GenomeGffFile

the BioFile object of the associated genome GTF/GFF file.

required

get_cellannot(overwrite=False)

Generates a cellannot file from the Loom file.

Parameters:

Name Type Description Default
overwrite bool

whether or not to overwrite the file if it already exists.

False

Returns:

Type Description
CellAnnotFile

a CellAnnotFile object for the created file.

get_idmm(id_type, overwrite=False)

Generates an idmm file from the Loom file.

Parameters:

Name Type Description Default
id_type str

the name of the ID type extracted from the file, e.g. 'ensembl_id'.

required
overwrite bool

whether or not to overwrite the file if it already exists.

False

Returns:

Type Description
IdmmFile

an IdmmFile object for the created file.

to_gxc(filename='', overwrite=False)

Generates a gxc file from the Loom file.

Parameters:

Name Type Description Default
filename str

automatically generated by appending '.asgxc.tsv' if not provided.

''
overwrite bool

whether or not to overwrite the file if it already exists.

False

Returns:

Type Description
GxcFile

a GxcFile object for the created file.

MultiSpeciesBioFileDocket

Bases: BioFileDocket

MultiSpeciesBioFileDocket objects collect BioFileDockets and relate them to one another.

Parameters:

Name Type Description Default
species_dict dict

key is species name in 'Genus_species' format; value is conditions. These must exactly match the species and conditions of the corresponding BioFileDockets, or they will not be found.

required
global_conditions str

a summary identifier for this collection of species' datasets.

required
analysis_type str

one-word description of the analysis type.

required

Attributes:

Name Type Description
directory str

output directory for files in the dataset.

dill_filename() property

str: the unique filename of the MultiSpeciesBioFileDocket .pkl file.

get_BioFileDockets()

Gets species BioFileDocket .pkl files from S3 for each species in the species_dict.

local_to_s3(overwrite=False)

Iteratively calls local_to_s3 on all BioFileDockets in the group.

s3_to_local(overwrite=False)

Iteratively calls s3_to_local on all BioFileDockets in the group.

sampledict() property

SampleDict: a MultiSpeciesBioFileDocket-formatted sampledict.

species_concat() property

str: concatenation of species prefixes in alphabetical order.
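
A sketch of how such a concatenation could work. The prefix rule ('Genus_species' becomes 'Gspe') is inferred from the Gspe_conditions directory format mentioned for make_output_directory; the function bodies, the underscore separator, and the example conditions are assumptions.

```python
def prefixify(species: str) -> str:
    """Hypothetical prefix rule: 'Genus_species' -> 'Gspe'."""
    genus, epithet = species.split("_", 1)
    return genus[0].upper() + epithet[:3].lower()


def species_concat(species_dict: dict) -> str:
    # Sort the prefixes alphabetically, then join them.
    # Assumption: prefixes are joined with underscores.
    return "_".join(sorted(prefixify(name) for name in species_dict))


print(species_concat({"Mus_musculus": "adultbrain", "Danio_rerio": "embryo"}))  # Drer_Mmus
```

Sorting before joining means the result does not depend on the order in which species were added to the dictionary.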

MultiSpeciesFile

Bases: BioFile

A special class of BioFile objects for files with multiple associated species.

This subclass uses species_concat as its species and its conditions is a '_' concatenation of its global_conditions and analysis_type.

Parameters:

Name Type Description Default
species_dict dict

species dictionary from the MultiSpeciesBioFileDocket.

required

sampledict() property

SampleDict: a MultiSpeciesBioFileDocket-formatted sampledict.

species_concat() property

str: concatenation of species prefixes in alphabetical order.

OrthoFinderOutputFile

Bases: MultiSpeciesFile

A MultiSpeciesFile for OrthoFinder output.

SampleDict

A dictionary containing dataset-specific fields: species, conditions, and directory.

This class is used to uniquely associate BioFile objects with specific datasets.

Parameters:

Name Type Description Default
species str

species name in 'Genus_species' format.

required
conditions str

unique conditions identifier for dataset.

required
directory str

output directory for files in the dataset.

required
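
The idea of tying files to a dataset through a shared record can be illustrated with plain dictionaries. make_sampledict and the toy file records are hypothetical; the real SampleDict class may carry additional behavior.

```python
def make_sampledict(species: str, conditions: str, directory: str) -> dict:
    """Bundle the three dataset-specific fields (hypothetical helper)."""
    return {"species": species, "conditions": conditions, "directory": directory}


sample = make_sampledict("Genus_species", "demo", "output/Gspe_demo/")

# Two toy file records that point at the same dataset description.
fasta = {"filename": "genome.fasta", "sampledict": sample}
gtf = {"filename": "genome.gtf", "sampledict": sample}
print(fasta["sampledict"] == gtf["sampledict"])  # True
```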

TransdecoderCdnaFile

Bases: BioFile

A BioFile object for Transdecoder cDNA files.

Parameters:

Name Type Description Default
reference_genome GenomeFastaFile

the BioFile object of the associated genome fasta file.

required
reference_annot GenomeGffFile | GenomeGtfFile

the BioFile object of the associated genome annotation file.

required

to_pep_files(TDLONGORF_LOC, TDPREDICT_LOC)

Generates a peptide file from a Transdecoder cDNA file.

Parameters:

Name Type Description Default
TDLONGORF_LOC str

path to the TransDecoder.LongOrfs binary.

required
TDPREDICT_LOC str

path to the TransDecoder.Predict binary.

required

Returns:

Name Type Description
output_dict dict of TransdecoderOutFile

a dictionary of Transdecoder output files.

TransdecoderOutFile

Bases: BioFile

A BioFile object for Transdecoder output files.

Parameters:

Name Type Description Default
reference_genome GenomeFastaFile

the BioFile object of the associated genome fasta file.

required
reference_annot GenomeGffFile | GenomeGtfFile

the BioFile object of the associated genome annotation file.

required
reference_cDNA TransdecoderCdnaFile

the BioFile object of the associated Transdecoder cDNA file.

required

UniProtTaxidListFile

Bases: BioFile

A BioFile object for pulling files from Uniprot based on input taxid.

Defaults to creating a file based on a taxid passed on object creation.

Parameters:

Name Type Description Default
taxid str | int

the taxid of the species of interest.

required
make bool

whether or not to make the file upon object creation. Defaults to True.

True

get_proteins()

Queries the Uniprot API to extract all proteins for the taxid associated with this object.

UniprotIDMapperFile

Bases: IdmmFile

A BioFile object for the output file generated by calling the Uniprot ID mapping API.

Parameters:

Name Type Description Default
from_type str

the source datatype (e.g. 'Ensembl')

required
to_type str

the destination datatype (e.g. 'UniprotKB')

required

metadata_object

Bases: object

Simple dummy object to enable dot access to keys.

add(key, value, replace=True)

Adds a metadata feature using a unique key identifier.

  • Checks to make sure that key does not already exist.
  • Also makes sure key is only alphanumeric or underscores.

Parameters:

Name Type Description Default
key str

a unique key

required
value obj

any field you want to record

required

gxc_to_exc(sample_MSD, embedding_df, exc_file)

Converts a GxcFile into an ExcFile, returning the new file object.

make_output_directory(species, conditions, stringonly=False)

Creates an output directory based on species and condition parameters.

Does not create a directory if it already exists.
Output directory is placed in the GLOBAL_OUTPUT_DIRECTORY, as defined by env/install_locs.py. Default GLOBAL_OUTPUT_DIRECTORY is output/.

Parameters:

Name Type Description Default
species str

species name in Genus_species format.

required
conditions str

unique conditions identifier for dataset.

required
stringonly bool

whether to only output the string without creating a directory

False

Returns:

Name Type Description
output_directory str

path of the output directory, in format Gspe_conditions.
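
The behavior described above can be sketched with os.makedirs. The function below is a stand-in with a hypothetical name, a temporary folder replaces GLOBAL_OUTPUT_DIRECTORY so the example is self-contained, and the 'Gspe' prefix rule (genus initial plus the first three letters of the species name) is an assumption.

```python
import os
import tempfile

# Stand-in for GLOBAL_OUTPUT_DIRECTORY; a temporary folder keeps the demo self-contained.
GLOBAL_OUTPUT_DIRECTORY = tempfile.mkdtemp()


def sketch_output_directory(species: str, conditions: str, stringonly: bool = False) -> str:
    genus, epithet = species.split("_", 1)
    # Assumption: 'Gspe' is the genus initial plus three letters of the species name.
    prefix = genus[0].upper() + epithet[:3].lower()
    output_directory = os.path.join(GLOBAL_OUTPUT_DIRECTORY, f"{prefix}_{conditions}")
    if not stringonly:
        os.makedirs(output_directory, exist_ok=True)  # no-op if it already exists
    return output_directory


path_only = sketch_output_directory("Genus_species", "demo", stringonly=True)
print(os.path.isdir(path_only))  # False: nothing was created
created = sketch_output_directory("Genus_species", "demo")
print(os.path.isdir(created))    # True
```

Calling the function a second time with the same arguments returns the same path without raising, which mirrors the "does not create a directory if it already exists" behavior.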

s3_transfer(to_loc, from_loc)

Transfers files between AWS S3 and the local filesystem.

One of the two parameters must be an AWS S3 URI in string format.

Parameters:

Name Type Description Default
to_loc str

path of the destination file.

required
from_loc str

path of the origin file.

required
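
The S3 URI requirement can be checked before any transfer is attempted. The sketch below only builds an `aws s3 cp` command and does not run it; plan_s3_transfer and the bucket name are hypothetical, the library may use a different transfer mechanism, and requiring exactly one S3 URI is an assumption. Note that the destination comes first in the argument order, as in s3_transfer(to_loc, from_loc).

```python
def plan_s3_transfer(to_loc: str, from_loc: str) -> list:
    """Validate the two locations and build an `aws s3 cp` command (not executed)."""
    to_is_s3 = to_loc.startswith("s3://")
    from_is_s3 = from_loc.startswith("s3://")
    # Assumption: exactly one side is an S3 URI.
    if to_is_s3 == from_is_s3:
        raise ValueError("exactly one of to_loc and from_loc must be an S3 URI")
    # `aws s3 cp` takes the source first, then the destination.
    return ["aws", "s3", "cp", from_loc, to_loc]


print(plan_s3_transfer("s3://example-bucket/genome.fasta", "output/genome.fasta"))
```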