Quickstart
Quick importation
pyGeno’s database is populated by importing datawraps. pyGeno comes with a few datawraps; to list them, use:
import pyGeno.bootstrap as B
B.printDatawraps()
Available datawraps for bootstraping
SNPs
~~~~|
|~~~:> Human_agnostic.dummySRY.tar.gz
|~~~:> Human.dummySRY_casava.tar.gz
|~~~:> dbSNP142_human_GRCh37_common_all.tar.gz
|~~~:> dbSNP142_human_common_all.tar.gz
Genomes
~~~~~~~|
|~~~:> Human.GRCh37.75.tar.gz
|~~~:> Human.GRCh37.75_Y-Only.tar.gz
|~~~:> Human.GRCh38.78.tar.gz
|~~~:> Human.GRCh38.98.tar.gz
|~~~:> Mouse.GRCm38.78.tar.gz
|~~~:> Mouse.GRCm38.98.tar.gz
To get a list of remote datawraps that pyGeno can download for you, do:
B.printRemoteDatawraps()
Importing whole genomes is a demanding process that takes more than an hour and requires (according to tests) at least 3GB of memory. Depending on your configuration, more might be required.
That said, importing a datawrap is a one-time operation: once the import is complete, the datawrap can be discarded without consequences.
The bootstrap module also has some handy functions for importing built-in packages.
Some of them are meant just for playing around with pyGeno (fast to import, with small memory requirements):
import pyGeno.bootstrap as B
#Imports only the Y chromosome from the human reference genome GRCh37.75
#Very fast, requires even less memory. No download required.
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")
#A dummy datawrap for human SNPs and Indels in pyGeno's AgnosticSNP format.
# This one has one SNP at the beginning of the gene SRY
B.importSNPs("Human.dummySRY_casava.tar.gz")
And for more serious work, the whole reference genome.
#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
B.importGenome("Human.GRCh38.78.tar.gz")
That’s it. You can now print the sequences of all the proteins that a gene can produce:
from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Protein import Protein
#the name of the genome is defined inside the package's manifest.ini file
ref = Genome(name = 'GRCh37.75')
#get returns a list of elements
gene = ref.get(Gene, name = 'SRY')[0]
for prot in gene.get(Protein) :
    print(prot.sequence)
You can think of pyGeno's architecture as a graph where everything is connected to everything. For instance, you can do things such as:
gene = aProt.gene
trans = aProt.transcript
prot = anExon.protein
genome = anExon.genome
Queries
- Note: the way queries are handled is changing. Since pyGeno v1.4, the default method is to use generators.
pyGeno allows for several kinds of queries. Here are some snippets:
#in this case both queries will yield the same result
myGene.get(Protein, id = "ENSID...")
myGenome.get(Protein, id = "ENSID...")
#even complex stuff
exons = myChromosome.get(Exon, {'start >=' : x1, 'stop <' : x2})
hlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})
sry = myGenome.get(Transcript, { "gene.name" : 'SRY' })
To list the fields available for queries, each class has a “help()” function:
Gene.help()
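To make the meaning of the filter dictionaries above concrete, here is a minimal sketch that runs without pyGeno. It is not how pyGeno evaluates queries: `matches` and `toy_get` are hypothetical helpers, and the records are plain dictionaries. It only shows how a key such as 'start >=' can be read as a field name followed by a comparison operator, with all conditions combined by AND.

```python
import operator

# Toy stand-in for the filter dictionaries; NOT pyGeno's implementation.
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
}

def matches(record, query):
    """Return True if the record (a dict) satisfies every 'field op' condition."""
    for key, expected in query.items():
        # 'start >=' splits into field 'start' and operator '>='.
        # A bare field name such as 'name' means equality.
        field, _, op = key.partition(" ")
        compare = OPERATORS[op] if op else operator.eq
        if not compare(record[field], expected):
            return False
    return True

def toy_get(records, query):
    """Hypothetical helper: keep only the records matching the query."""
    return [r for r in records if matches(r, query)]

exons = [
    {"name": "e1", "start": 100, "stop": 200},
    {"name": "e2", "start": 250, "stop": 400},
    {"name": "e3", "start": 500, "stop": 650},
]
selected = toy_get(exons, {"start >=": 200, "stop <": 600})
print([e["name"] for e in selected])  # ['e2']
```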
Faster queries
- Note: the way queries are handled is changing. Since pyGeno v1.4, the default method is to use generators.
To speed up loops, use get(gen=True):
for prot in gene.get(Protein, gen=True) :
    print(prot.sequence)
For more speed, create indexes on the fields you query most often:
Gene.ensureGlobalIndex('name')
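The benefit of a generator-returning lookup can be illustrated without pyGeno. In the sketch below, `load` and `toy_get` (with its `gen` flag) are hypothetical stand-ins, not pyGeno code. They only show the general difference: a list forces every record to be loaded up front, while a generator loads records as the loop consumes them.

```python
# Toy illustration of list-returning versus generator-returning lookups.
# load and toy_get are hypothetical stand-ins, not pyGeno code.
loaded = []

def load(record_id):
    """Pretend to fetch one record from a database, logging the access."""
    loaded.append(record_id)
    return {"id": record_id}

def toy_get(ids, gen=False):
    lazy = (load(i) for i in ids)
    return lazy if gen else list(lazy)

# Eager: every record is loaded before the caller sees the first one.
eager = toy_get(range(1000))
eager_loads = len(loaded)

# Lazy: records are loaded one at a time, only as the loop asks for them.
loaded.clear()
for record in toy_get(range(1000), gen=True):
    if record["id"] == 2:
        break
lazy_loads = len(loaded)

print(eager_loads, lazy_loads)  # 1000 3
```

The gain is largest when a loop stops early or when the full result set would not fit comfortably in memory.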
Getting sequences
Anything that has a sequence can be indexed using the usual Python list syntax:
protein[34] # the amino acid at index 34
protein[34:40] # amino acids in [34, 40[
transcript[23] # the nucleotide at index 23 of the transcript
transcript[23:30] # nucleotides in [23, 30[
transcript.cDNA[23:30] #the same but for the protein coding DNA (without the UTRs)
Transcripts, Proteins and Exons also have a .sequence attribute, which holds the sequence rendered as a string. It is well suited to printing, but a polymorphic sequence may contain ‘/’ characters, which you must account for when indexing the string. If you instead index the object directly (as in the snippet above), pyGeno uses a binary representation of the sequence, so the indexing is independent of any polymorphisms present.
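The difference between indexing the rendered string and indexing by position can be seen in a small toy model. This is only an illustration: the list of per-position alleles below is an assumption made for the example and is not pyGeno's binary representation.

```python
# Toy model of a polymorphic sequence; this is NOT pyGeno's binary representation.
# Each entry holds the alleles observed at one position.
positions = ["A", "T", "AG", "C", "C"]

# Render the sequence with the alternatives at a position joined by '/'.
rendered = "".join("/".join(alleles) for alleles in positions)
print(rendered)  # ATA/GCC

# Indexing the rendered string is shifted by the extra characters.
print(rendered[3])  # /

# Indexing by position is unaffected by the polymorphism at position 2.
print("/".join(positions[3]))  # C
```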
Personalized Genomes
Personalized Genomes are a powerful feature that allows you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together:
from pyGeno.Genome import Genome
#the name of the snp set is defined inside the datawrap's manifest.ini file
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
#you can also define a filter (ex: a quality filter) for the SNPs
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
#and even mix several snp sets
dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())
pyGeno allows you to customize the polymorphisms that end up in the final sequences. It supports SNPs, insertions and deletions:
from pyGeno.SNPFiltering import SNPFilter
from pyGeno.SNPFiltering import SequenceSNP
class QMax_gt_filter(SNPFilter) :
    def __init__(self, threshold) :
        self.threshold = threshold
    def filter(self, chromosome, dummySRY = None) :
        if dummySRY.Qmax_gt > self.threshold :
            #other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
            return SequenceSNP(dummySRY.alt)
        return None #None means keep the reference allele
persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))
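The decision made by the filter above can be tried out without pyGeno or an imported datawrap. The sketch below is a simplified stand-in: `ToySNP`, `choose_allele` and `personalize` are hypothetical names, only single-base substitutions are handled, and the way pyGeno actually applies filters to sequences is not shown here.

```python
from collections import namedtuple

# Hypothetical stand-in for a SNP record, so the logic runs without a database.
ToySNP = namedtuple("ToySNP", ["position", "alt", "Qmax_gt"])

def choose_allele(ref_base, snp, threshold):
    """Mimic the filter above: use the alternative allele only above the threshold."""
    if snp is not None and snp.Qmax_gt > threshold:
        return snp.alt
    return ref_base  # corresponds to the filter returning None

def personalize(reference, snps, threshold):
    """Apply the toy filter at every position of a reference string."""
    by_position = {snp.position: snp for snp in snps}
    return "".join(
        choose_allele(base, by_position.get(i), threshold)
        for i, base in enumerate(reference)
    )

reference = "ATGCAA"
snps = [
    ToySNP(position=1, alt="C", Qmax_gt=30),  # passes a threshold of 10
    ToySNP(position=4, alt="G", Qmax_gt=5),   # rejected by a threshold of 10
]
print(personalize(reference, snps, threshold=10))  # ACGCAA
```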