BioFile handling tutorial¶
Working with single-cell datasets across multiple species can be complicated!
The BioFile handling functions in this repo are meant to help streamline the process of working with single-cell data across multiple species.
This notebook serves as a basic tutorial for the BioFile class and related functions.
0. Import¶
To get started:
- First, import some necessary dependencies.
- Use sys.path.append to place the necessary functions on your Python path.
- Then, import the functions from biofile_handling.py, string_functions.py, and install_locs.py.
# import standard python packages
import pandas as pd
import subprocess, os, sys, dill

# add the utils and env directories to the path
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *
1. Create a BioFileDocket¶
Before interacting with files, it's important to create a BioFileDocket.
This class acts as a container which tracks all files in your dataset.
Creating a new BioFileDocket only requires two parameters:
- species: the name of your species in the format Genus_species.
- conditions: a unique identifier for your dataset as an alphanumeric string (no spaces or underscores). This could include details like tissue type and sample number (e.g. brain1). This string should be unique and not repeated for a different dataset.
Upon creation, a BioFileDocket will create a directory on your machine if it does not already exist.
Files created for this dataset will be saved into that directory.
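The directory name shown in the output below combines a species prefix with the conditions string. As a rough sketch of how that prefix may be derived (an assumption based on 'Gspe_tutorial' appearing in the output; the real prefixify lives in string_functions.py):

```python
# Hypothetical sketch: the real prefixify() is in string_functions.py.
# Assumption: the prefix joins the first letter of the genus with the
# first three letters of the species epithet ('Genus_species' -> 'Gspe').
def prefixify(species: str) -> str:
    genus, epithet = species.split('_')
    return genus[0] + epithet[:3]

# The docket directory name appends the conditions string to the prefix.
directory_name = prefixify('Genus_species') + '_' + 'tutorial'
print(directory_name)  # Gspe_tutorial
```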
# Specify the name of the species in 'Genus_species' format
# This should contain an underscore
species = 'Genus_species'
# Specify any particular identifying conditions, eg tissue type:
# Must be alphanumeric; can't contain special characters
conditions = 'tutorial'
sample_BFD = BioFileDocket(species, conditions)
/home/ec2-user/glial-origins/output/Gspe_tutorial/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Gspe_tutorial/
Tip¶
A BioFileDocket has some useful attributes which you can access via the dot operator.
For example, to get the directory where files in the BioFileDocket are stored, see below:
sample_BFD.directory
'/home/ec2-user/glial-origins/output/Gspe_tutorial/'
2. Create a BioFile from scratch¶
Once you have a BioFileDocket, you can start creating BioFile objects.
These objects can be used to keep track of files on your local system and link them to each other.
A simple way to use the BioFile object system is to reference a file that already exists on your system.
# This line will create a file for us to reference.
subprocess.run(['touch', sample_BFD.directory + 'samplefile.txt'])
# Creating a BioFile object minimally requires a SampleDict.
# This object carries information about species, conditions, and directory.
# If you create a file without downloading from a URL or from S3, you must also specify a filename.
# Conventionally, you should use sample_BFD.sampledict.
# This passes the needed SampleDict object from the BioFileDocket.
# The BioFile class will also accept a SampleDict object you generate from scratch.
biofile_object = BioFile(
sampledict = sample_BFD.sampledict,
filename = 'samplefile.txt'
)
# You can double-check to make sure your BioFile object points to the right place.
print('Does this file exist?', biofile_object.exists)
Does this file exist? True
Tip¶
BioFile objects have numerous built-in functionalities.
You can learn more about these using the built-in help() function.
help(BioFile)
Help on class BioFile in module biofile_handling: class BioFile(builtins.object) | BioFile(sampledict: biofile_handling.SampleDict, filename='', url=None, protocol=None, s3uri=None, unzip=True) | | BioFile objects collect metadata about biological filetypes. | | Args: | sampledict (:obj:`SampleDict`): a SampleDict object from the BioFileDocket. | filename (str, optional): the name of the file. | url (str, optional): when downloading a file on object creation, pass a string url along with a protocol. | protocol (str, optional): passed along with a url for automatic download on object creation. | s3uri (str, optional): the s3uri of the file, if downloading from s3 upon object creation. | unzip (bool, optional): whether or not to unzip a file on download. Defaults to True. | | Methods defined here: | | __init__(self, sampledict: biofile_handling.SampleDict, filename='', url=None, protocol=None, s3uri=None, unzip=True) | Initialize self. See help(type(self)) for accurate signature. | | add_s3uri(self, s3uri: str) | Adds an s3uri to the BioFile object if it doesn't already exist. | | get_from_s3(self, overwrite=False) | Downloads the BioFile from AWS S3. | | Args: | overwrite (bool): decide whether to overwrite existing files. Defaults to False. | | get_from_url(self, url: str, protocol: str, filename='', unzip=True) | Downloads the BioFile from a URL using a chosen protocol, unzipping optionally. | | Args: | url (str): url of the file. | protocol (str): protocol to be used for download. | filename (str, optional): name of file to be saved. If empty, will generate name from URL. | unzip (bool): decide whether to unzip if it is a zipped file. Defaults to True. | | push_to_s3(self, overwrite=False) | Uploads the BioFile to AWS S3. | | Args: | overwrite (bool): decide whether to overwrite existing files. Defaults to False. | | unzip(self) | Unzips files ending in .gz, .gzip, or .zip. 
| | ---------------------------------------------------------------------- | Readonly properties defined here: | | exists | bool: checks whether the file currently exists. | | filetype | str: infers filetype using the string after the final `.` in filename. | | path | str: path to the file, including the filename. | | sampledict | :obj:`SampleDict`: a SampleDict object for the file. | | species_prefix | str: runs prefixify(species). | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
3. Create a BioFile by downloading from a URL or S3 URI¶
Often when working with publicly available data, files are downloaded from a URL.
You can certainly download a file manually using your preferred method and then capture it in a BioFile object, as above.
However, the BioFile object also has methods for downloading files from a URL.
Using these methods, you can create a BioFile object and immediately download that file to your system.
Currently, the get_from_url method allows you to download using the curl or wget protocols.
The get_from_url method is called automatically when you pass url and protocol arguments at the creation of a BioFile object.
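Under the hood, get_from_url shells out to the chosen download tool. A minimal sketch of that idea (an assumption; the actual implementation is in biofile_handling.py, and the function name here is hypothetical):

```python
import subprocess

# Hypothetical sketch of a URL download via curl or wget; the real logic
# lives in BioFile.get_from_url in biofile_handling.py.
def fetch(url: str, dest: str, protocol: str = 'curl') -> None:
    if protocol == 'curl':
        # -L follows redirects; -o writes to the destination path
        subprocess.run(['curl', '-L', '-o', dest, url], check=True)
    elif protocol == 'wget':
        subprocess.run(['wget', '-O', dest, url], check=True)
    else:
        raise ValueError(f'unsupported protocol: {protocol}')
```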
# To download using curl or wget, you can specify a filename, url, and protocol.
# Below, we download a file using curl.
biofile_object_curl = BioFile(
sampledict = sample_BFD.sampledict,
filename = 'testfile1.txt',
url = 'https://raw.githubusercontent.com/Arcadia-Science/glial-origins/das/biofile-revision-dev/utils/tut/testfile1.txt?token=GHSAT0AAAAAAB3ZLBCF4IP73QREWKMWAPXUY4P2IVQ',
protocol = 'curl'
)
# If this succeeded, it should print "Hello"
with open(biofile_object_curl.path, 'r') as f:
print(f.read())
# Here, we download a file using wget
biofile_object_wget = BioFile(
sampledict = sample_BFD.sampledict,
filename = 'testfile2.txt',
url = 'https://raw.githubusercontent.com/Arcadia-Science/glial-origins/das/biofile-revision-dev/utils/tut/testfile2.txt?token=GHSAT0AAAAAAB3ZLBCF4MWQWGUBQCGFENP4Y4P2JFA',
protocol = 'wget'
)
# If this succeeded, it should print "world"
with open(biofile_object_wget.path, 'r') as f:
print(f.read())
file testfile2.txt already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/testfile2.txt
downloaded file /home/ec2-user/glial-origins/output/Gspe_tutorial/testfile3.txt
--2022-12-08 23:14:59-- https://raw.githubusercontent.com/Arcadia-Science/glial-origins/das/biofile-revision-dev/utils/tut/testfile2.txt?token=GHSAT0AAAAAAB3ZLBCF4MWQWGUBQCGFENP4Y4P2JFA Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected. HTTP request sent, awaiting response... 404 Not Found 2022-12-08 23:14:59 ERROR 404: Not Found.
Downloading from AWS S3¶
BioFile objects also support downloading files using an AWS S3 URI.
# Here, we create a BioFile object by downloading from AWS S3 using a URI.
# If the URL or URI ends in a filename, you don't have to specify the filename variable;
# the filename will automatically be set to whatever string follows the final '/' in the URL string.
# We can omit the 'filename' field because the s3uri can be neatly parsed into a filename.
biofile_object_s3 = BioFile(
sampledict = sample_BFD.sampledict,
s3uri = 's3://arcadia-reference-datasets/tutorials/testfile3.txt'
)
inferring file name as testfile3.txt
download: s3://arcadia-reference-datasets/tutorials/testfile3.txt to ../../output/Gspe_tutorial/testfile3.txt
4. Place BioFile objects into BioFileDocket¶
To keep track of your files, you need to place your BioFile objects into the BioFileDocket you created.
You can add these files individually or as a list.
To see the dictionary of BioFile objects keyed by filename, use the .files attribute.
# Add a single BioFile object to tracked files.
sample_BFD.add_file(biofile_object)
# Add a list of BioFile objects to tracked files.
sample_BFD.add_files([biofile_object_curl, biofile_object_wget])
# List the names of files and the associated BioFile objects.
display(sample_BFD.files)
{'samplefile.txt': <biofile_handling.BioFile at 0x7fcf597f5c60>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d3b50>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d3be0>}
Keyfiles¶
Generic files added to the .files attribute are tracked but not automatically uploaded.
You can use this to store files whose provenance is important but which are easily generated.
For files that you expect to use repeatedly and which you want to be automatically uploaded, you should use the add_keyfile method.
- For a given keyfile, you can specify a key, e.g. aws_testfile.
- Keyfiles can be accessed directly using the dot operator, e.g. sample_BFD.aws_testfile.
- This method is particularly useful if you are working across species. For example, you could access all of the .genome_fasta files from multiple species programmatically.
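The cross-species pattern above can be sketched as follows. This is an illustration only: SimpleNamespace stands in for real BioFileDocket objects, and the genome_fasta key and paths are hypothetical names, not from this tutorial.

```python
from types import SimpleNamespace

# Stand-ins for BioFileDockets of several species, each with a hypothetical
# 'genome_fasta' keyfile added via add_keyfile.
dockets = {
    'Genus_species': SimpleNamespace(genome_fasta=SimpleNamespace(path='/data/Gspe/genome.fa')),
    'Foo_bar': SimpleNamespace(genome_fasta=SimpleNamespace(path='/data/Fbar/genome.fa')),
}

# Because keyfiles are attributes, the same key can be accessed uniformly
# across species with the dot operator.
genome_paths = {sp: d.genome_fasta.path for sp, d in dockets.items()}
print(genome_paths)
```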
# Add a keyfile using the 'aws_testfile' key
sample_BFD.add_keyfile('aws_testfile', biofile_object_s3)
# Display the attributes of the BioFileDocket
# Note that 'aws_testfile' has its own key-value pair.
display(vars(sample_BFD))
# Display the attributes of the aws_testfile using dot operations.
display(vars(sample_BFD.aws_testfile))
# Get the path to the aws_testfile using dot operations.
display(sample_BFD.aws_testfile.path)
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcf597f5c60>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d3b50>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d3be0>}, 'metadata': <biofile_handling.metadata_object at 0x7fcf59d27fd0>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcea8758f70>}
{'filename': 'testfile3.txt', 'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 's3uri': 's3://arcadia-reference-datasets/tutorials/testfile3.txt', 'metadata': <biofile_handling.metadata_object at 0x7fcea875b4f0>}
'/home/ec2-user/glial-origins/output/Gspe_tutorial/testfile3.txt'
5. Pickling the BioFileDocket¶
When programming interactively, it's possible to lose track of your variables when you shut down a session.
To preserve the BioFileDocket and its associated files, you can use the .pickle() method.
This creates a binary file that stores the BioFileDocket and all of its included BioFile objects.
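The serialize/restore round trip that .pickle() and .unpickle() wrap can be illustrated with the standard library (the repo imports dill, which follows the same dumps/loads interface):

```python
import pickle  # dill, used by the repo, has the same interface

# An object is serialized to bytes and later restored intact, which is
# what lets a docket survive the end of an interactive session.
record = {'species': 'Genus_species', 'conditions': 'tutorial'}
blob = pickle.dumps(record)
restored = pickle.loads(blob)
print(restored == record)  # True
```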
# Pickling the BioFileDocket is very simple!
sample_BFD.pickle()
Unpickling the BioFileDocket¶
From a .pkl file, you can retrieve your previous variables.
A BioFileDocket object places its .pkl file in a set location with a set name.
Using .unpickle() allows you to load the object again.
# Creates a new BioFileDocket by unpickling your previous .pkl file.
new_sample_BFD = sample_BFD.unpickle()
display(vars(new_sample_BFD))
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcea875ad40>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea875a290>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcf597f7970>}, 'metadata': <biofile_handling.metadata_object at 0x7fcf597f5db0>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcf597f7760>}
Pushing the .pkl file to AWS S3¶
You can also save the .pkl file for a BioFileDocket to a set location on AWS S3.
# This saves the .pkl file to S3
sample_BFD.push_to_s3()
upload: ../../output/Gspe_tutorial/Gspe_tutorial_BioFileDocket.pkl to s3://arcadia-reference-datasets/glial-origins-pkl/Gspe_tutorial_BioFileDocket.pkl
Getting a .pkl file from AWS S3¶
Conversely, you can pull a .pkl file from S3 and extract its contents.
new_sample_BFD2 = sample_BFD.get_from_s3().unpickle()
display(vars(new_sample_BFD2))
file Gspe_tutorial_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/Gspe_tutorial_BioFileDocket.pkl
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcea86d2d40>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d3fa0>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d3d60>}, 'metadata': <biofile_handling.metadata_object at 0x7fcea86d0220>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcea86d06a0>}
Tip: .pkl file uniqueness¶
The .get_from_s3() and .unpickle() methods look for a .pkl file based on the species and conditions of a BioFileDocket.
This means that you don't have to have any information in your local BioFileDocket to start with: as long as it exists on S3, you can retrieve the file.
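The "set name" can be sketched from the outputs above ('Gspe_tutorial_BioFileDocket.pkl'). This is an inferred reconstruction, not the repo's actual code, which lives in biofile_handling.py:

```python
# Hypothetical sketch of the fixed .pkl naming scheme, inferred from the
# outputs above; assumes the prefix is genus initial + first three letters
# of the species epithet.
def docket_pkl_name(species: str, conditions: str) -> str:
    genus, epithet = species.split('_')
    prefix = genus[0] + epithet[:3]
    return f'{prefix}_{conditions}_BioFileDocket.pkl'

print(docket_pkl_name('Genus_species', 'tutorial'))  # Gspe_tutorial_BioFileDocket.pkl
```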
# Creating an empty BioFileDocket using just the species and conditions,
# then pulling the .pkl file based on those identifiers to fill the BioFileDocket
species = 'Genus_species'
conditions = 'tutorial'
new_sample_BFD3 = BioFileDocket(species, conditions).get_from_s3().unpickle()
display(vars(new_sample_BFD3))
/home/ec2-user/glial-origins/output/Gspe_tutorial/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Gspe_tutorial/
file Gspe_tutorial_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/Gspe_tutorial_BioFileDocket.pkl
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcea8540130>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d1e70>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d31c0>}, 'metadata': <biofile_handling.metadata_object at 0x7fcea86d3a60>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcea86d10f0>}
6. Systematically transferring files in a BioFileDocket to and from AWS S3¶
The BioFileDocket can programmatically upload all associated keyfiles to S3.
new_sample_BFD3.local_to_s3()
testfile3.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Conversely, it can also download all files from S3.
new_sample_BFD3.s3_to_local()
file testfile3.txt already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/testfile3.txt
XX. Create a MultiSpeciesBioFileDocket¶
# Specify a dictionary mapping each species name (in 'Genus_species' format)
# to its conditions string
species_dict = {
    'Genus_species': 'tutorial',
    'Foo_bar': 'tutorial',
    'Hello_world': 'tutorial'
}
# Specify any particular identifying conditions, eg tissue type:
# Must be alphanumeric; can't contain special characters
global_conditions = 'tutorial'
analysis_type = 'Testing'
sample_MSD = MultiSpeciesBioFileDocket(species_dict, global_conditions, analysis_type)
/home/ec2-user/glial-origins/output/FbarGspeHwor_tutorial_Testing/ already exists
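The combined directory name in the output above ('FbarGspeHwor_tutorial_Testing') suggests how the multi-species docket names its folder. A hypothetical reconstruction (the real logic lives in biofile_handling.py):

```python
# Hypothetical sketch: sorted species prefixes are concatenated, then the
# global conditions and analysis type are appended. Assumes each prefix is
# genus initial + first three letters of the species epithet.
def multispecies_dirname(species_dict: dict, global_conditions: str, analysis_type: str) -> str:
    prefixes = sorted(genus[0] + epithet[:3]
                      for genus, epithet in (sp.split('_') for sp in species_dict))
    return ''.join(prefixes) + '_' + global_conditions + '_' + analysis_type

name = multispecies_dirname(
    {'Genus_species': 'tutorial', 'Foo_bar': 'tutorial', 'Hello_world': 'tutorial'},
    'tutorial', 'Testing')
print(name)  # FbarGspeHwor_tutorial_Testing
```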
sample_MSD.pickle()
sample_MSD.add_keyfile('blah', biofile_object_curl)