Importation
pyGeno’s database is populated by importing tar.gz compressed archives called datawraps. Importation is a one-time step: once a datawrap has been imported, it can be discarded with no consequences for the database.
Here’s how you import a reference genome datawrap:
from pyGeno.importation.Genomes import *
importGenome("my_genome_datawrap.tar.gz")
And a SNP set datawrap:
from pyGeno.importation.SNPs import *
importSNPs("my_snp_datawrap.tar.gz")
pyGeno comes with a few datawraps that you can quickly import using the Bootstraping module.
You can find a list of datawraps to import here: Datawraps
You can also create your own by putting a set of URLs into a manifest.ini file and compressing it into a tar.gz archive (as explained below or on the Wiki).
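As an illustration of the step just described, here is a minimal sketch that writes a manifest.ini and packs it at the root of a tar.gz archive using only the Python standard library. The helper name build_datawrap, the maintainer details and the example.org URLs are placeholders invented for this example; the section and key names follow the genome manifest shown further down this page.

```python
import configparser
import io
import tarfile
import tempfile
from pathlib import Path


def build_datawrap(archive_path, manifest_sections):
    """Write a manifest.ini at the root of a new tar.gz archive (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case, e.g. the chromosome name "Y"
    for section, options in manifest_sections.items():
        parser[section] = options

    # Serialize the manifest in memory, then add it to the archive.
    buffer = io.StringIO()
    parser.write(buffer)
    data = buffer.getvalue().encode("utf-8")

    info = tarfile.TarInfo(name="manifest.ini")
    info.size = len(data)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(data))
    return archive_path


# Placeholder values: replace the URLs with real download locations.
sections = {
    "package_infos": {
        "description": "Example package that only references remote files",
        "maintainer": "Jane Doe",
        "maintainer_contact": "jane.doe [at] example.org",
        "version": "1",
    },
    "genome": {
        "species": "Mus_musculus",
        "name": "GRCm38_example",
        "source": "http://useast.ensembl.org/info/data/ftp/index.html",
    },
    "chromosome_files": {"Y": "ftp://example.org/chromosome.Y.fa.gz"},
    "gene_set": {"gtf": "ftp://example.org/genes_Y-only.gtf.gz"},
}

workdir = Path(tempfile.mkdtemp())
datawrap = build_datawrap(workdir / "my_genome_datawrap.tar.gz", sections)

# List the archive contents to confirm the manifest sits at the root.
with tarfile.open(datawrap, "r:gz") as archive:
    members = archive.getnames()
print(members)
```

The resulting archive could then be passed to importGenome(), provided the URLs point to real files.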
Genomes
- importation.Genomes.backUpDB()
Backs up the current database version. Automatically called by importGenome(). Returns the filename of the backup.
- importation.Genomes.importGenome(packageFile, batchSize=50, verbose=0)
Import a pyGeno genome package. A genome package is a folder or a tar.gz archive that contains at its root:
gzipped FASTA files for all chromosomes, or URLs from which they must be downloaded
a gzipped GTF gene_set file from Ensembl, or a URL from which it must be downloaded
a manifest.ini file such as:
[package_infos]
description = Test package. This package installs only chromosome Y of mus musculus
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = GRCm38.73

[genome]
species = Mus_musculus
name = GRCm38_test
source = http://useast.ensembl.org/info/data/ftp/index.html

[chromosome_files]
Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz / or an url such as ftp://... or http://

[gene_set]
gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz / or an url such as ftp://... or http://
All files except the manifest can be downloaded from: http://useast.ensembl.org/info/data/ftp/index.html
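To make the "local file or URL" choice concrete, the following sketch reads a genome manifest with the standard configparser module and reports, for each entry, whether it names a file inside the package or a location to download from. This is not pyGeno's own parsing code; the function names and the example.org URL are assumptions made for illustration.

```python
import configparser
from urllib.parse import urlparse

# Trimmed manifest: the chromosome is a local file, the gene set is a
# placeholder URL (example.org is not a real data source).
MANIFEST_TEXT = """
[package_infos]
description = Test package
maintainer = Tariq Daouda
version = GRCm38.73

[genome]
species = Mus_musculus
name = GRCm38_test

[chromosome_files]
Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz

[gene_set]
gtf = ftp://example.org/Mus_musculus.GRCm38.73_Y-only.gtf.gz
"""


def is_url(value):
    """Return True when the value starts with an ftp or http(s) scheme."""
    return urlparse(value).scheme in ("ftp", "http", "https")


def classify_files(manifest_text):
    """Map (section, key) to 'download' or 'local' for every data file entry."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep chromosome names such as "Y" in upper case
    parser.read_string(manifest_text)

    plan = {}
    for section in ("chromosome_files", "gene_set"):
        for key, value in parser[section].items():
            plan[(section, key)] = "download" if is_url(value) else "local"
    return plan


plan = classify_files(MANIFEST_TEXT)
for entry, action in plan.items():
    print(entry, action)
```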
A rollback is performed if an exception is caught during importation.
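The backup-then-rollback behaviour can be pictured with the generic pattern below. It is only a sketch of the idea, assuming the database is a single file that can be copied; it does not reproduce how backUpDB() or importGenome() are implemented, and import_with_rollback and failing_import are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def import_with_rollback(db_path, do_import):
    """Copy the database aside, run the import, restore the copy on failure."""
    backup_path = db_path.with_name(db_path.name + ".bak")
    shutil.copy2(db_path, backup_path)
    try:
        do_import(db_path)
    except Exception:
        # Roll back: overwrite the partially modified file with the backup,
        # then let the caller see the original error.
        shutil.copy2(backup_path, db_path)
        raise
    return backup_path


def failing_import(db_path):
    """Stand-in for an import that breaks midway (hypothetical)."""
    db_path.write_text("half-imported data")
    raise ValueError("corrupt chromosome file")


db = Path(tempfile.mkdtemp()) / "example.db"
db.write_text("original data")

try:
    import_with_rollback(db, failing_import)
except ValueError:
    pass

restored = db.read_text()
print(restored)
```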
batchSize sets the number of genes to parse before performing a database save. Machines with little RAM do better with small values, while those with more memory may run faster with larger ones.
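The trade-off behind batchSize can be seen in a small, generic batching loop: items accumulate in memory until the batch is full, then they are saved together. This sketch is independent of pyGeno; import_in_batches and the save callback are hypothetical stand-ins for the parsing and database-save steps.

```python
def import_in_batches(genes, batch_size, save):
    """Collect genes and call save() once per full batch, plus once for any remainder."""
    pending = []
    for gene in genes:
        pending.append(gene)
        if len(pending) >= batch_size:
            save(pending)
            pending = []  # start a fresh batch; memory use stays bounded
    if pending:
        save(pending)


# Record the size of each batch instead of writing to a real database.
saved_batches = []
import_in_batches(range(120), 50, lambda batch: saved_batches.append(len(batch)))
print(saved_batches)
```

With 120 items and a batch size of 50, save is called three times. A larger batch size means fewer saves but more items held in memory at once.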
verbose must be an int in the range [0, 4] and selects the level of verbosity.
Polymorphisms (SNPs and Indels)
- importation.SNPs.importSNPs(packageFile)
This wrapper function detects the SNP type from the package manifest and then launches the corresponding import function. Here’s an example of a SNP manifest file for Casava SNPs:
[package_infos]
description = Casava SNPs for testing purposes
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = 1

[set_infos]
species = human
name = dummySRY
type = Agnostic
source = my place at the IRIC

[snps]
filename = snps.txt # as with genomes you can either include the file at the root of the package or specify an URL from where it must be downloaded
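The type-based dispatch described above can be sketched as a lookup from the manifest's type value to a handler function. This is an illustration rather than pyGeno's actual code: import_agnostic, IMPORTERS and dispatch_snp_import are hypothetical names, and only the "Agnostic" type from the manifest above is registered.

```python
import configparser

SNP_MANIFEST = """
[package_infos]
description = Casava SNPs for testing purposes
version = 1

[set_infos]
species = human
name = dummySRY
type = Agnostic

[snps]
filename = snps.txt # trailing comment, stripped by the parser settings below
"""


def import_agnostic(set_name, filename):
    """Hypothetical handler; a real one would parse the file into the database."""
    return ("Agnostic", set_name, filename)


# Registry of known SNP types (only the type shown in the manifest above).
IMPORTERS = {"Agnostic": import_agnostic}


def dispatch_snp_import(manifest_text):
    """Read the manifest and call the handler registered for its SNP type."""
    # inline_comment_prefixes is needed so "# ..." after a value is ignored.
    parser = configparser.ConfigParser(inline_comment_prefixes=("#",))
    parser.read_string(manifest_text)

    snp_type = parser["set_infos"]["type"]
    if snp_type not in IMPORTERS:
        raise ValueError("Unknown SNP type: %s" % snp_type)
    return IMPORTERS[snp_type](parser["set_infos"]["name"], parser["snps"]["filename"])


result = dispatch_snp_import(SNP_MANIFEST)
print(result)
```

Keeping the handlers in a dictionary means supporting another SNP type only requires registering one more function.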