Importation

pyGeno’s database is populated by importing compressed tar.gz archives called datawraps. Importation is a one-time step: once a datawrap has been imported, it can be discarded with no consequences for the database.

Here’s how you import a reference genome datawrap:

from pyGeno.importation.Genomes import *
importGenome("my_genome_datawrap.tar.gz")

And a SNP set datawrap:

from pyGeno.importation.SNPs import *
importSNPs("my_snp_datawrap.tar.gz")

pyGeno comes with a few datawraps that you can quickly import using the Bootstraping module.

You can find a list of datawraps to import here: Datawraps

You can also easily create your own by listing URLs in a manifest.ini file and compressing it into a tar.gz archive (as explained below or on the Wiki).
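As a rough illustration of that last point, the sketch below builds a URL-only datawrap using nothing but the Python standard library. The helper name build_datawrap, the maintainer details, and the example.org URLs are hypothetical placeholders, not part of pyGeno; only the section and option names are taken from the genome manifest shown further down.

```python
import configparser
import io
import tarfile


def build_datawrap(archive_path, manifest_sections):
    """Write a manifest.ini built from manifest_sections into a tar.gz archive."""
    # interpolation=None so that '%' characters in URLs are stored verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names such as "Y" case-sensitive
    parser.read_dict(manifest_sections)

    buffer = io.StringIO()
    parser.write(buffer)
    payload = buffer.getvalue().encode("utf-8")

    # The manifest must sit at the root of the archive.
    member = tarfile.TarInfo(name="manifest.ini")
    member.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(member, io.BytesIO(payload))
    return archive_path


# Hypothetical manifest: the example.org URLs are placeholders, not real files.
EXAMPLE_MANIFEST = {
    "package_infos": {
        "description": "Chromosome Y only",
        "maintainer": "Jane Doe",
        "maintainer_contact": "jane [at] example.org",
        "version": "GRCm38.73",
    },
    "genome": {
        "species": "Mus_musculus",
        "name": "GRCm38_test",
        "source": "http://useast.ensembl.org/info/data/ftp/index.html",
    },
    "chromosome_files": {"Y": "ftp://example.org/chromosome.Y.fa.gz"},
    "gene_set": {"gtf": "ftp://example.org/genes_Y-only.gtf.gz"},
}
```

Calling build_datawrap("my_genome_datawrap.tar.gz", EXAMPLE_MANIFEST) would produce an archive whose only member is manifest.ini. With real URLs in place of the placeholders, that archive is the kind of file importGenome() expects.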

Genomes

importation.Genomes.backUpDB()

Backs up the current database version. Called automatically by importGenome(). Returns the filename of the backup.

importation.Genomes.deleteGenome(species, name)

Removes a genome from the database.

importation.Genomes.importGenome(packageFile, batchSize=50, verbose=0)

Imports a pyGeno genome package. A genome package is a folder or a tar.gz archive that contains at its root:

  • gzipped FASTA files for all chromosomes, or URLs from which they must be downloaded

  • a gzipped GTF gene_set file from Ensembl, or a URL from which it must be downloaded

  • a manifest.ini file such as:

    [package_infos]
    description = Test package. This package installs only chromosome Y of mus musculus
    maintainer = Tariq Daouda
    maintainer_contact = tariq.daouda [at] umontreal
    version = GRCm38.73
    
    [genome]
    species = Mus_musculus
    name = GRCm38_test
    source = http://useast.ensembl.org/info/data/ftp/index.html
    
    [chromosome_files]
    Y = Mus_musculus.GRCm38.73.dna.chromosome.Y.fa.gz / or an url such as ftp://... or http://
    
    [gene_set]
    gtf = Mus_musculus.GRCm38.73_Y-only.gtf.gz / or an url such as ftp://... or http://
    

All files except the manifest can be downloaded from: http://useast.ensembl.org/info/data/ftp/index.html
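Before running an import, it can help to sanity-check a manifest. The following is a minimal sketch using the standard configparser module; it is not pyGeno's own validation code. The function name check_genome_manifest is hypothetical, and treating all four sections of the example manifest as mandatory is an assumption.

```python
import configparser

# Section names taken from the example manifest above; treating all of them
# as mandatory is an assumption of this sketch.
EXPECTED_SECTIONS = ("package_infos", "genome", "chromosome_files", "gene_set")
REMOTE_PREFIXES = ("ftp://", "http://", "https://")


def check_genome_manifest(manifest_text):
    """Return (missing_sections, remote_entries) for a genome manifest."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep chromosome names such as "Y" as written
    parser.read_string(manifest_text)

    missing = [name for name in EXPECTED_SECTIONS if not parser.has_section(name)]

    # Collect the entries that point to a URL rather than a bundled file.
    remote = {}
    for section in ("chromosome_files", "gene_set"):
        if parser.has_section(section):
            for key, value in parser.items(section):
                if value.startswith(REMOTE_PREFIXES):
                    remote[key] = value
    return missing, remote
```

An empty list of missing sections suggests the manifest has the same overall shape as the example. The second return value lists the files that would have to be downloaded rather than read from the package.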

A rollback is performed if an exception is caught during importation.

batchSize sets the number of genes to parse before performing a database save. Machines with little RAM do better with small values, while those with more memory may run faster with larger ones.

verbose must be an integer in the range [0, 4] and selects the level of verbosity.
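To make the batchSize trade-off concrete, here is a generic sketch of batched saving. It is not pyGeno's implementation: import_in_batches and the save callback are hypothetical names, and the records stand in for parsed genes.

```python
def import_in_batches(records, batch_size, save):
    """Call save(batch) once per batch_size records; return the number of saves."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    pending = []  # records parsed but not yet saved; this is what occupies memory
    saves = 0
    for record in records:
        pending.append(record)
        if len(pending) >= batch_size:
            save(pending)
            saves += 1
            pending = []
    if pending:  # flush the last, possibly smaller, batch
        save(pending)
        saves += 1
    return saves
```

With 120 records and a batch size of 50, this performs three saves (50, 50, and 20 records). A smaller batch size keeps fewer records in memory at once but saves more often.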

Polymorphisms (SNPs and Indels)

importation.SNPs.deleteSNPs(setName)

Deletes a set of polymorphisms.

importation.SNPs.importSNPs(packageFile)

This is the top-level wrapper: it detects the SNP type from the package manifest and then launches the corresponding import function. Here’s an example of a SNP manifest file for Casava SNPs:

[package_infos]
description = Casava SNPs for testing purposes
maintainer = Tariq Daouda
maintainer_contact = tariq.daouda [at] umontreal
version = 1

[set_infos]
species = human
name = dummySRY
type = Agnostic
source = my place at the IRIC

[snps]
filename = snps.txt # as with genomes, you can either include the file at the root of the package or specify a URL from which it must be downloaded
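The type lookup described above can be sketched with the standard configparser module. This is an illustration rather than pyGeno's actual code, and read_snp_set_infos is a hypothetical name. Note that Python 3's configparser only strips a trailing "# ..." comment when inline_comment_prefixes is set. Whether pyGeno itself strips such comments is not stated here, so keeping comments on their own line is the safer choice in a real manifest.

```python
import configparser


def read_snp_set_infos(manifest_text):
    """Return (snp_type, filename) read from a SNP manifest."""
    parser = configparser.ConfigParser(
        interpolation=None,
        inline_comment_prefixes=("#",),  # drop trailing "# ..." comments
    )
    parser.read_string(manifest_text)
    snp_type = parser["set_infos"]["type"]
    filename = parser["snps"]["filename"]
    return snp_type, filename
```

For the manifest above, this would return the type Agnostic and the filename snps.txt. A dispatcher could then pick the matching import routine based on the type.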