BioFile handling tutorial¶
Working with single-cell datasets across multiple species can be complicated!
The BioFile handling functions in this repo are meant to help streamline the process of working with single-cell data across multiple species.
This notebook serves as a basic tutorial for the BioFile class and related functions.
0. Import¶
To get started:
- First, import some necessary dependencies.
- Use sys.path.append to place the necessary functions on your Python path.
- Then, import the functions from biofile_handling.py, string_functions.py, and install_locs.py.
# import standard python packages
import pandas as pd
import subprocess, os, sys, dill

# add the utils and env directories to the path
sys.path.append('../../utils/')
sys.path.append('../../env/')

# import functions from utils directory files
from string_functions import *
from biofile_handling import *

# import paths to software installs from env
from install_locs import *
1. Create a BioFileDocket¶
Before interacting with files, it's important to create a BioFileDocket.
This class acts as a container which tracks all files in your dataset.
Creating a new BioFileDocket only requires two parameters:
- species: the name of your species in the format Genus_species.
- conditions: a unique identifier for your dataset as an alphanumeric string (no spaces or underscores). This could include details like tissue type and sample number (e.g. brain1). This string should be unique and not repeated for a different dataset.
Upon creation, a BioFileDocket will create a directory on your machine if it does not already exist.
Files created for this dataset will be saved into that directory.
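The directory name shown in the output below combines a species prefix with the conditions string. As a rough sketch of how that prefix may be derived (an assumption based on 'Gspe_tutorial' appearing in the output; the real prefixify lives in string_functions.py):

```python
# Hypothetical sketch: the real prefixify() is in string_functions.py.
# Assumption: the prefix joins the first letter of the genus with the
# first three letters of the species epithet ('Genus_species' -> 'Gspe').
def prefixify(species: str) -> str:
    genus, epithet = species.split('_')
    return genus[0] + epithet[:3]

# The docket directory name appends the conditions string to the prefix.
directory_name = prefixify('Genus_species') + '_' + 'tutorial'
print(directory_name)  # Gspe_tutorial
```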
# Specify the name of the species in 'Genus_species' format
# This should contain an underscore
species = 'Genus_species'
# Specify any particular identifying conditions, eg tissue type:
# Must be alphanumeric; can't contain special characters
conditions = 'tutorial'
sample_BFD = BioFileDocket(species, conditions)
/home/ec2-user/glial-origins/output/Gspe_tutorial/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Gspe_tutorial/
Tip¶
A BioFileDocket has some useful attributes which you can access via the dot operator.
For example, to get the directory where files in the BioFileDocket are stored, see below:
sample_BFD.directory
'/home/ec2-user/glial-origins/output/Gspe_tutorial/'
2. Create a BioFile from scratch¶
Once you have a BioFileDocket, you can start creating BioFile objects.
These objects can be used to keep track of files on your local system and link them to each other.
A simple way to use the BioFile object system is to reference a file that already exists on your system.
# This line will create a file for us to reference.
subprocess.run(['touch', sample_BFD.directory + 'samplefile.txt'])
# Creating a BioFile object minimally requires a SampleDict.
# This object carries information about species, conditions, and directory.
# If you create a file without downloading from a URL or from S3, you must also specify a filename.
# Conventionally, you should use sample_BFD.sampledict.
# This passes the needed SampleDict object from the BioFileDocket.
# The BioFile class will also accept a SampleDict object you generate from scratch.
biofile_object = BioFile(
sampledict = sample_BFD.sampledict,
filename = 'samplefile.txt'
)
# You can double-check to make sure your BioFile object points to the right place.
print('Does this file exist?', biofile_object.exists)
Does this file exist? True
Tip¶
BioFile objects have numerous built-in functionalities.
You can learn more about these using the built-in help() function.
help(BioFile)
Help on class BioFile in module biofile_handling: class BioFile(builtins.object) | BioFile(sampledict: biofile_handling.SampleDict, filename='', url=None, protocol=None, s3uri=None, unzip=True) | | BioFile objects collect metadata about biological filetypes. | | Args: | sampledict (:obj:`SampleDict`): a SampleDict object from the BioFileDocket. | filename (str, optional): the name of the file. | url (str, optional): when downloading a file on object creation, pass a string url along with a protocol. | protocol (str, optional): passed along with a url for automatic download on object creation. | s3uri (str, optional): the s3uri of the file, if downloading from s3 upon object creation. | unzip (bool, optional): whether or not to unzip a file on download. Defaults to True. | | Methods defined here: | | __init__(self, sampledict: biofile_handling.SampleDict, filename='', url=None, protocol=None, s3uri=None, unzip=True) | Initialize self. See help(type(self)) for accurate signature. | | add_s3uri(self, s3uri: str) | Adds an s3uri to the BioFile object if it doesn't already exist. | | get_from_s3(self, overwrite=False) | Downloads the BioFile from AWS S3. | | Args: | overwrite (bool): decide whether to overwrite existing files. Defaults to False. | | get_from_url(self, url: str, protocol: str, filename='', unzip=True) | Downloads the BioFile from a URL using a chosen protocol, unzipping optionally. | | Args: | url (str): url of the file. | protocol (str): protocol to be used for download. | filename (str, optional): name of file to be saved. If empty, will generate name from URL. | unzip (bool): decide whether to unzip if it is a zipped file. Defaults to True. | | push_to_s3(self, overwrite=False) | Uploads the BioFile to AWS S3. | | Args: | overwrite (bool): decide whether to overwrite existing files. Defaults to False. | | unzip(self) | Unzips files ending in .gz, .gzip, or .zip. 
| | ---------------------------------------------------------------------- | Readonly properties defined here: | | exists | bool: checks whether the file currently exists. | | filetype | str: infers filetype using the string after the final `.` in filename. | | path | str: path to the file, including the filename. | | sampledict | :obj:`SampleDict`: a SampleDict object for the file. | | species_prefix | str: runs prefixify(species). | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined)
3. Create a BioFile by downloading from a URL or S3 URI¶
Often when working with publicly available data, files are downloaded from a URL.
You can certainly download a file manually using your preferred method and then capture it in a BioFile object, as above.
However, the BioFile object also has methods for downloading files from a URL.
Using these methods, you can create a BioFile object and immediately download that file to your system.
Currently, the get_from_url method allows you to download using the curl or wget protocols.
The get_from_url method is called automatically when you pass url and protocol arguments at the creation of a BioFile object.
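Under the hood, get_from_url shells out to the chosen download tool. A minimal sketch of that idea (an assumption; the actual implementation is in biofile_handling.py, and the function name here is hypothetical):

```python
import subprocess

# Hypothetical sketch of a URL download via curl or wget; the real logic
# lives in BioFile.get_from_url in biofile_handling.py.
def fetch(url: str, dest: str, protocol: str = 'curl') -> None:
    if protocol == 'curl':
        # -L follows redirects; -o writes to the destination path
        subprocess.run(['curl', '-L', '-o', dest, url], check=True)
    elif protocol == 'wget':
        subprocess.run(['wget', '-O', dest, url], check=True)
    else:
        raise ValueError(f'unsupported protocol: {protocol}')
```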
# To download using curl or wget, you can specify a filename, url, and protocol.
# Below, we download a file using curl.
biofile_object_curl = BioFile(
sampledict = sample_BFD.sampledict,
filename = 'testfile1.txt',
url = 'https://raw.githubusercontent.com/Arcadia-Science/glial-origins/das/biofile-revision-dev/utils/tut/testfile1.txt?token=GHSAT0AAAAAAB3ZLBCF4IP73QREWKMWAPXUY4P2IVQ',
protocol = 'curl'
)
# If this succeeded, it should print "Hello"
with open(biofile_object_curl.path, 'r') as f:
print(f.read())
# Here, we download a file using wget
biofile_object_wget = BioFile(
sampledict = sample_BFD.sampledict,
filename = 'testfile2.txt',
url = 'https://raw.githubusercontent.com/Arcadia-Science/glial-origins/das/biofile-revision-dev/utils/tut/testfile2.txt?token=GHSAT0AAAAAAB3ZLBCF4MWQWGUBQCGFENP4Y4P2JFA',
protocol = 'wget'
)
# If this succeeded, it should print "world"
with open(biofile_object_wget.path, 'r') as f:
print(f.read())
file testfile2.txt already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/testfile2.txt
downloaded file /home/ec2-user/glial-origins/output/Gspe_tutorial/testfile3.txt
--2022-12-08 23:14:59-- https://raw.githubusercontent.com/Arcadia-Science/glial-origins/das/biofile-revision-dev/utils/tut/testfile2.txt?token=GHSAT0AAAAAAB3ZLBCF4MWQWGUBQCGFENP4Y4P2JFA Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected. HTTP request sent, awaiting response... 404 Not Found 2022-12-08 23:14:59 ERROR 404: Not Found.
Downloading from AWS S3¶
BioFile objects also support downloading files using an AWS S3 URI.
# Here, we create a BioFile object by downloading from AWS S3 using a URI.
# If the URL or URI ends in a filename, you don't have to specify the filename variable;
# the filename will automatically be set to whatever string follows the final '/' in the URL string.
# We can omit the 'filename' field because the s3uri can be neatly parsed into a filename.
biofile_object_s3 = BioFile(
sampledict = sample_BFD.sampledict,
s3uri = 's3://arcadia-reference-datasets/tutorials/testfile3.txt'
)
inferring file name as testfile3.txt
download: s3://arcadia-reference-datasets/tutorials/testfile3.txt to ../../output/Gspe_tutorial/testfile3.txt
4. Place BioFile objects into BioFileDocket¶
To keep track of your files, you need to place your BioFile objects into the BioFileDocket you created.
You can add these files individually or as a list.
To see the dictionary of BioFile objects keyed by filename, use the .files attribute.
# Add a single BioFile object to tracked files.
sample_BFD.add_file(biofile_object)
# Add a list of BioFile objects to tracked files.
sample_BFD.add_files([biofile_object_curl, biofile_object_wget])
# List the names of files and the associated BioFile objects.
display(sample_BFD.files)
{'samplefile.txt': <biofile_handling.BioFile at 0x7fcf597f5c60>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d3b50>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d3be0>}
Keyfiles¶
Generic files added to the .files attribute are tracked but not automatically uploaded.
You can use this to store files whose provenance is important but which are easily generated.
For files that you expect to use repeatedly and which you want to be automatically uploaded, you should use the add_keyfile method.
- For a given keyfile, you can specify a key, e.g. aws_testfile.
- Keyfiles can be accessed directly using the dot operator, e.g. sample_BFD.aws_testfile.
- This method is particularly useful if you are working across species. For example, you could access all of the .genome_fasta files from multiple species programmatically.
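The cross-species pattern above can be sketched as follows. This is an illustration only: SimpleNamespace stands in for real BioFileDocket objects, and the genome_fasta key and paths are hypothetical names, not from this tutorial.

```python
from types import SimpleNamespace

# Stand-ins for BioFileDockets of several species, each with a hypothetical
# 'genome_fasta' keyfile added via add_keyfile.
dockets = {
    'Genus_species': SimpleNamespace(genome_fasta=SimpleNamespace(path='/data/Gspe/genome.fa')),
    'Foo_bar': SimpleNamespace(genome_fasta=SimpleNamespace(path='/data/Fbar/genome.fa')),
}

# Because keyfiles are attributes, the same key can be accessed uniformly
# across species with the dot operator.
genome_paths = {sp: d.genome_fasta.path for sp, d in dockets.items()}
print(genome_paths)
```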
# Add a keyfile using the 'aws_testfile' key
sample_BFD.add_keyfile('aws_testfile', biofile_object_s3)
# Display the attributes of the BioFileDocket
# Note that 'aws_testfile' has its own key-value pair.
display(vars(sample_BFD))
# Display the attributes of the aws_testfile using dot operations.
display(vars(sample_BFD.aws_testfile))
# Get the path to the aws_testfile using dot operations.
display(sample_BFD.aws_testfile.path)
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcf597f5c60>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d3b50>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d3be0>}, 'metadata': <biofile_handling.metadata_object at 0x7fcf59d27fd0>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcea8758f70>}
{'filename': 'testfile3.txt', 'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 's3uri': 's3://arcadia-reference-datasets/tutorials/testfile3.txt', 'metadata': <biofile_handling.metadata_object at 0x7fcea875b4f0>}
'/home/ec2-user/glial-origins/output/Gspe_tutorial/testfile3.txt'
5. Pickling the BioFileDocket¶
When programming interactively, it's possible to lose track of your variables when you shut down a session.
To preserve the BioFileDocket and its associated files, you can use the .pickle() method.
This creates a binary file that stores the BioFileDocket and all of its included BioFile objects.
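The serialize/restore round trip that .pickle() and .unpickle() wrap can be illustrated with the standard library (the repo imports dill, which follows the same dumps/loads interface):

```python
import pickle  # dill, used by the repo, has the same interface

# An object is serialized to bytes and later restored intact, which is
# what lets a docket survive the end of an interactive session.
record = {'species': 'Genus_species', 'conditions': 'tutorial'}
blob = pickle.dumps(record)
restored = pickle.loads(blob)
print(restored == record)  # True
```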
# Pickling the BioFileDocket is very simple!
sample_BFD.pickle()
Unpickling the BioFileDocket¶
From a .pkl file, you can retrieve your previous variables.
A BioFileDocket object places its .pkl file in a set location with a set name.
Using .unpickle() allows you to load the object again.
# Creates a new BioFileDocket by unpickling your previous .pkl file.
new_sample_BFD = sample_BFD.unpickle()
display(vars(new_sample_BFD))
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcea875ad40>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea875a290>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcf597f7970>}, 'metadata': <biofile_handling.metadata_object at 0x7fcf597f5db0>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcf597f7760>}
Pushing the .pkl file to AWS S3¶
You can also save the .pkl file for a BioFileDocket to a set location on AWS S3.
# This saves the .pkl file to S3
sample_BFD.push_to_s3()
upload: ../../output/Gspe_tutorial/Gspe_tutorial_BioFileDocket.pkl to s3://arcadia-reference-datasets/glial-origins-pkl/Gspe_tutorial_BioFileDocket.pkl
Getting a .pkl file from AWS S3¶
Conversely, you can pull a .pkl file from S3 and extract its contents.
new_sample_BFD2 = sample_BFD.get_from_s3().unpickle()
display(vars(new_sample_BFD2))
file Gspe_tutorial_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/Gspe_tutorial_BioFileDocket.pkl
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcea86d2d40>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d3fa0>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d3d60>}, 'metadata': <biofile_handling.metadata_object at 0x7fcea86d0220>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcea86d06a0>}
Tip: .pkl file uniqueness¶
The .get_from_s3() and .unpickle() methods look for a .pkl file based on the species and conditions of a BioFileDocket.
This means that you don't have to have any information in your local BioFileDocket to start with: as long as it exists on S3, you can retrieve the file.
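The "set name" can be sketched from the outputs above ('Gspe_tutorial_BioFileDocket.pkl'). This is an inferred reconstruction, not the repo's actual code, which lives in biofile_handling.py:

```python
# Hypothetical sketch of the fixed .pkl naming scheme, inferred from the
# outputs above; assumes the prefix is genus initial + first three letters
# of the species epithet.
def docket_pkl_name(species: str, conditions: str) -> str:
    genus, epithet = species.split('_')
    prefix = genus[0] + epithet[:3]
    return f'{prefix}_{conditions}_BioFileDocket.pkl'

print(docket_pkl_name('Genus_species', 'tutorial'))  # Gspe_tutorial_BioFileDocket.pkl
```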
# Creating an empty BioFileDocket using just the species and conditions,
# then pulling the .pkl file based on those identifiers to fill the BioFileDocket
species = 'Genus_species'
conditions = 'tutorial'
new_sample_BFD3 = BioFileDocket(species, conditions).get_from_s3().unpickle()
display(vars(new_sample_BFD3))
/home/ec2-user/glial-origins/output/Gspe_tutorial/ already exists
Files will be saved into /home/ec2-user/glial-origins/output/Gspe_tutorial/
file Gspe_tutorial_BioFileDocket.pkl already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/Gspe_tutorial_BioFileDocket.pkl
{'species': 'Genus_species', 'conditions': 'tutorial', 'directory': '/home/ec2-user/glial-origins/output/Gspe_tutorial/', 'files': {'samplefile.txt': <biofile_handling.BioFile at 0x7fcea8540130>, 'testfile2.txt': <biofile_handling.BioFile at 0x7fcea86d1e70>, 'testfile3.txt': <biofile_handling.BioFile at 0x7fcea86d31c0>}, 'metadata': <biofile_handling.metadata_object at 0x7fcea86d3a60>, 'aws_testfile': <biofile_handling.BioFile at 0x7fcea86d10f0>}
6. Systematically transferring files in a BioFileDocket to and from AWS S3¶
The BioFileDocket can programmatically upload all associated keyfiles to S3.
new_sample_BFD3.local_to_s3()
testfile3.txt already exists in S3 bucket, skipping upload. set overwrite = True to overwrite the existing file.
Conversely, it can also download all files from S3.
new_sample_BFD3.s3_to_local()
file testfile3.txt already exists at /home/ec2-user/glial-origins/output/Gspe_tutorial/testfile3.txt
XX. Create a MultiSpeciesBioFileDocket¶
# Specify a dictionary mapping each species name (in 'Genus_species' format)
# to its conditions string
species_dict = {
    'Genus_species': 'tutorial',
    'Foo_bar': 'tutorial',
    'Hello_world': 'tutorial'
}
# Specify any particular identifying conditions, eg tissue type:
# Must be alphanumeric; can't contain special characters
global_conditions = 'tutorial'
analysis_type = 'Testing'
sample_MSD = MultiSpeciesBioFileDocket(species_dict, global_conditions, analysis_type)
/home/ec2-user/glial-origins/output/FbarGspeHwor_tutorial_Testing/ already exists
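The combined directory name in the output above ('FbarGspeHwor_tutorial_Testing') suggests how the multi-species docket names its folder. A hypothetical reconstruction (the real logic lives in biofile_handling.py):

```python
# Hypothetical sketch: sorted species prefixes are concatenated, then the
# global conditions and analysis type are appended. Assumes each prefix is
# genus initial + first three letters of the species epithet.
def multispecies_dirname(species_dict: dict, global_conditions: str, analysis_type: str) -> str:
    prefixes = sorted(genus[0] + epithet[:3]
                      for genus, epithet in (sp.split('_') for sp in species_dict))
    return ''.join(prefixes) + '_' + global_conditions + '_' + analysis_type

name = multispecies_dirname(
    {'Genus_species': 'tutorial', 'Foo_bar': 'tutorial', 'Hello_world': 'tutorial'},
    'tutorial', 'Testing')
print(name)  # FbarGspeHwor_tutorial_Testing
```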
sample_MSD.pickle()
sample_MSD.add_keyfile('blah', biofile_object_curl)