Parsers

PyGeno comes with a set of parsers that you can use independently.

CSV

To read and write CSV files.

class tools.parsers.CSVTools.CSVEntry(csvFile, lineNumber=None)[source]

A single entry in a CSV file

commit()[source]

commits the line so it is added to a file stream

class tools.parsers.CSVTools.CSVFile(legend=[], separator=',', lineSeparator='\n')[source]

Represents a whole CSV file:

#reading
f = CSVFile()
f.parse('hop.csv')
for line in f :
        print(line['ref'])

#writing, legend can be either a list or a dict {field : column number}
f = CSVFile(legend = ['name', 'email'])
l = f.newLine()
l['name'] = 'toto'
l['email'] = "hop@gmail.com"

for field, value in l :
        print(field, value)

f.save('myCSV.csv')             
addField(field)[source]

adds a field to the legend

closeStreamToFile()[source]

Appends the remaining committed lines and closes the stream. If no stream is active, raises a ValueError

commitLine(line)[source]

Commits a line making it ready to be streamed to a file and saves the current buffer if needed. If no stream is active, raises a ValueError

insertLine(i)[source]

Inserts an empty line at position i and returns it

newLine()[source]

Appends an empty line at the end of the CSV and returns it

parse(filePath, skipLines=0, separator=',', stringSeparator='"', lineSeparator='\n')[source]

Loads a CSV file

save(filePath)[source]

save the CSV to a file

streamToFile(filename, keepInMemory=False, writeRate=1)[source]

Starts a stream to a file. Every line must be committed (l.commit()) to be appended to the file.

If keepInMemory is set to True, the parser also keeps a version of the whole CSV in memory. writeRate is the number of lines that must be committed before an automatic save is triggered.
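The buffering behaviour described above can be pictured with a small pure-Python sketch (independent of pyGeno; the function name, line format and write_rate value are illustrative, not pyGeno's code):

```python
import os
import tempfile

def stream_lines(filename, lines, write_rate=2):
    """Append lines to 'filename', flushing the buffer every 'write_rate' commits."""
    buffer = []
    with open(filename, "a") as f:
        for line in lines:
            buffer.append(line)                    # a committed line
            if len(buffer) >= write_rate:
                f.write("\n".join(buffer) + "\n")  # automatic save
                buffer = []
        if buffer:                                 # like closeStreamToFile: append the rest
            f.write("\n".join(buffer) + "\n")

path = os.path.join(tempfile.mkdtemp(), "stream.csv")
stream_lines(path, ["a,1", "b,2", "c,3"], write_rate=2)
print(open(path).read())  # a,1\nb,2\nc,3\n
```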

toStr()[source]

returns a string version of the CSV

exception tools.parsers.CSVTools.EmptyLine(lineNumber)[source]

Raised when an empty or comment line is found (dealt with internally)

tools.parsers.CSVTools.catCSVs(folder, ouputFileName, removeDups=False)[source]

Concatenates all CSVs in ‘folder’ and writes the result to ‘ouputFileName’. May not work on non-Unix systems

tools.parsers.CSVTools.joinCSVs(csvFilePaths, column, ouputFileName, separator=',')[source]

csvFilePaths should be an iterable. Joins all CSVs according to the values in the column ‘column’ and writes the results to a new file ‘ouputFileName’

tools.parsers.CSVTools.removeDuplicates(inFileName, outFileName)[source]

removes duplicated lines from the ‘inFileName’ CSV file; the results are written to ‘outFileName’
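What removeDuplicates does can be sketched in plain Python (a simplified stand-in, not pyGeno's implementation):

```python
import os
import tempfile

def remove_duplicates(in_file, out_file):
    """Copy in_file to out_file, keeping only the first occurrence of each line."""
    seen = set()
    with open(in_file) as src, open(out_file, "w") as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)

d = tempfile.mkdtemp()
src, dst = os.path.join(d, "in.csv"), os.path.join(d, "out.csv")
with open(src, "w") as f:
    f.write("ref,alt\nA,T\nA,T\nG,C\n")
remove_duplicates(src, dst)
print(open(dst).read())  # ref,alt\nA,T\nG,C\n
```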

FASTA

To read and write FASTA files.

class tools.parsers.FastaTools.FastaFile(fil=None)[source]

Represents a whole Fasta file:

#reading
f = FastaFile()
f.parseFile('hop.fasta')
for line in f :
        print(line)

#writing
f = FastaFile()
f.add(">prot1", "MLPADEV")
f.save('myFasta.fasta')
add(header, data)[source]

appends a new entry to the file

get(i)[source]

returns the ith entry

parseFile(fil)[source]

Opens a file and parses it

parseStr(st)[source]

Parses a string

reset()[source]

Erases everything

save(filePath)[source]

saves the file into filePath

toStr()[source]

returns a string version of self
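The FASTA format itself is simple enough to sketch: a header line starting with '>' followed by one or more sequence lines. A minimal illustration of the parsing logic (not FastaFile's actual code):

```python
def parse_fasta(text):
    """Parse a FASTA string into a list of (header, sequence) pairs."""
    entries, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                entries.append((header, "".join(chunks)))
            header, chunks = line, []
        elif line:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        entries.append((header, "".join(chunks)))
    return entries

print(parse_fasta(">prot1\nMLPA\nDEV\n>prot2\nACDE"))
# [('>prot1', 'MLPADEV'), ('>prot2', 'ACDE')]
```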

FASTQ

To read and write FASTQ files.

class tools.parsers.FastqTools.FastqEntry(ident='', seq='', plus='', qual='')[source]

A single entry in a Fastq file

class tools.parsers.FastqTools.FastqFile(fil=None)[source]

Represents a whole Fastq file:

#reading
f = FastqFile()
f.parse('hop.fastq')
for line in f :
        print(line['sequence'])

#writing
f = FastqFile()
f.newEntry(ident='@read1', seq='ATGC', plus='+', qual='IIII')
add(fastqEntry)[source]

appends an entry to self

get(li)[source]

returns the ith entry

newEntry(ident='', seq='', plus='', qual='')[source]

Appends an empty entry at the end of the Fastq file and returns it

parseFile(fil)[source]

Parses a file on disk

parseStr(st)[source]

Parses a string

reset()[source]

Frees the file
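A FASTQ entry is four lines: identifier, sequence, a '+' separator and per-base qualities, matching the ident/seq/plus/qual fields above. A minimal parse, independent of pyGeno:

```python
def parse_fastq(text):
    """Parse a FASTQ string into dicts with ident/seq/plus/qual keys."""
    lines = [l for l in text.splitlines() if l]
    return [
        {"ident": lines[i], "seq": lines[i + 1],
         "plus": lines[i + 2], "qual": lines[i + 3]}
        for i in range(0, len(lines), 4)  # records are fixed 4-line blocks
    ]

records = parse_fastq("@read1\nATGC\n+\nIIII\n@read2\nGGTA\n+\nFFFF\n")
print(records[0]["seq"])  # ATGC
```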

GTF

To read GTF files.

class tools.parsers.GTFTools.GTFFile(filename, gziped=False)[source]

This is a simple GTF2.2 (Revised Ensembl GTF) parser, see http://mblab.wustl.edu/GTF22.html for more information

get(line, elmt)[source]

returns the value of the field ‘elmt’ of line ‘line’

get_transcripts(transcript_ids=None)[source]

returns genes with their transcripts and associated exons and CDSs from the GTF. If transcript_ids is given, only those transcripts are returned

gtf2bed(bed_filename, feature='transcripts')[source]

Transforms the GTF to BED6/BED12 and saves the output to a file

gtf2bed_cds(bed_filename, join_overlaps=True)[source]

Retrieves CDS information from gtf in bed6 format

gtf2bed_exons(bed_filename, join_overlaps=True)[source]

Retrieves exon information from gtf in bed6 format

gtf2bed_transcripts(bed_filename)[source]

Retrieves transcript information from gtf in bed12 format
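A GTF2.2 line is nine tab-separated fields, the last being a semicolon-separated attribute list such as gene_id "x"; transcript_id "y";. A minimal sketch of attribute parsing (illustrative, not GTFFile's code; the identifiers in the sample line are made up):

```python
def parse_gtf_attributes(field):
    """Turn 'gene_id "g1"; transcript_id "t1";' into a dict."""
    attrs = {}
    for chunk in field.strip().split(";"):
        chunk = chunk.strip()
        if chunk:
            key, _, value = chunk.partition(" ")
            attrs[key] = value.strip('"')  # values are double-quoted in GTF2.2
    return attrs

line = '1\thavana\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG0001"; transcript_id "ENST0001";'
fields = line.split("\t")
print(parse_gtf_attributes(fields[8]))
# {'gene_id': 'ENSG0001', 'transcript_id': 'ENST0001'}
```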

VCF

To read VCF files.

class tools.parsers.VCFTools.VCFEntry(vcfFile, line, lineNumber)[source]

A single entry in a VCF file

class tools.parsers.VCFTools.VCFFile(filename=None, gziped=False, stream=False)[source]

This is a small parser for VCF files, it should work with any VCF file but has only been tested on dbSNP138 files. Represents a whole VCF file:

#reading
f = VCFFile()
f.parse('hop.vcf')
for line in f :
        print(line['pos'])
close()[source]

closes the file

parse(filename, gziped=False, stream=False)[source]

opens a file
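A VCF data line is tab-separated with fixed leading columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), which is what lookups like line['pos'] in the example above resolve against. A minimal parse of one data line, independent of pyGeno:

```python
VCF_COLUMNS = ["chrom", "pos", "id", "ref", "alt", "qual", "filter", "info"]

def parse_vcf_line(line):
    """Map the first eight tab-separated fields to the standard VCF columns."""
    fields = line.rstrip("\n").split("\t")
    entry = dict(zip(VCF_COLUMNS, fields))
    entry["pos"] = int(entry["pos"])  # POS is a 1-based integer position
    return entry

entry = parse_vcf_line("1\t10177\trs367896724\tA\tAC\t.\t.\tRS=367896724")
print(entry["pos"])  # 10177
```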

Casava

To read casava files.

class tools.parsers.CasavaTools.SNPsTxtEntry(lineNumber, snpsTxtFile)[source]

A single entry in Casava's snps.txt file

class tools.parsers.CasavaTools.SNPsTxtFile(fil, gziped=False)[source]

Represents a whole Casava snps.txt file:

f = SNPsTxtFile('snps.txt')
for line in f :
        print(line['ref'])
reset()[source]

Frees the file