GTF
CLASS
BASE
CDS, start_codon, stop_codon, UTR5, UTR3, and other annotation types use this class.
Attributes | data type | info |
---|---|---|
chr | str | chromosome |
start | int | location start |
end | int | location end |
strand | str | strand ‘+’ or ‘-’ |
more info: The attribute values of BASE are read-only. That is, once they have been read from the gtf file, they cannot be modified.
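The read-only behaviour described above can be illustrated with a frozen dataclass. This is only a sketch: `ReadOnlyRecord` is a hypothetical stand-in with the same four attributes, not AnnoKit's actual BASE class, and AnnoKit may enforce immutability differently.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class ReadOnlyRecord:
    """Hypothetical stand-in for a BASE-like record (not AnnoKit's class)."""
    chr: str
    start: int
    end: int
    strand: str


# Values are set once, as if they had just been read from a gtf line.
cds = ReadOnlyRecord(chr="chr1", start=1000, end=1200, strand="+")

try:
    cds.start = 5  # assigning to a field of a frozen dataclass raises
    modified = True
except FrozenInstanceError:
    modified = False
```

Reading `cds.start` still works as usual; only assignment is rejected.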
EXON
Record exon related information.
Attributes | data type | info |
---|---|---|
chr | str | exon chromosome |
start | int | exon location start |
end | int | exon location end |
strand | str | exon strand ‘+’ or ‘-’ |
id | str | exon id |
more info: The attribute values of EXON are read-only. That is, once they have been read from the gtf file, they cannot be modified.
TRANSCRIPT
Record transcript related information.
Attributes | data type | info |
---|---|---|
id | str | transcript id |
name | str | transcript name |
chr | str | transcript location chromosome |
start | int | transcript location start |
end | int | transcript location end |
strand | str | transcript strand ‘+’ or ‘-’ |
CDS | List[BASE] | transcript CDSs |
start_codon | List[BASE] | transcript start_codons |
stop_codon | List[BASE] | transcript stop_codons |
UTR5 | List[BASE] | transcript UTR5s |
UTR3 | List[BASE] | transcript UTR3s |
exons | Dict[EXON.id]=EXON | transcript exons |
other | List[BASE] | transcript other annotation types |
GENE
Record gene related information.
Attributes | data type | info |
---|---|---|
id | str | gene id |
name | str | gene name |
chr | str | gene location chromosome |
start | int | gene location start |
end | int | gene location end |
strand | str | gene strand ‘+’ or ‘-’ |
trans | Dict[TRANSCRIPT.id]=TRANSCRIPT | gene transcripts |
trans_map | Dict[TRANSCRIPT.name]=TRANSCRIPT.id | map from transcript name to transcript id |
GTF
Parse GTF files
Attributes | data type | info |
---|---|---|
name | str | gtf name |
version | str | gtf version |
URL | str | download URL of gtf file |
genes | Dict[GENE.id]=GENE | gtf genes |
genes_map | Dict[GENE.name]=GENE.id | map from gene name to gene id |
genes_interval | Dict[chromosome]=IntervalTree | gene locations indexed in one IntervalTree per chromosome |
err | List | lines in the gtf file that could not be recognized |
anno_map | Dict[attributes]=description | map from commonly used gene structure attributes to the descriptions used for them in gtf files |
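The tables above describe a nested layout: genes keyed by id, a name-to-id map next to them, and the same pattern repeated for transcripts inside each gene. The sketch below mimics that layout with plain dicts to show how a lookup by name would go through the map first. All ids and names are invented, and `gene_by_name` is a hypothetical helper, not part of AnnoKit.

```python
# Plain dicts standing in for the documented containers; every id and name is made up.
genes = {
    "G1": {
        "name": "geneA",
        "trans": {"T1": {"name": "geneA-201"}},   # keyed by transcript id
        "trans_map": {"geneA-201": "T1"},          # transcript name -> transcript id
    },
}
genes_map = {"geneA": "G1"}                        # gene name -> gene id


def gene_by_name(name):
    """Resolve a gene name to its record via the name -> id map (hypothetical helper)."""
    gene_id = genes_map.get(name)
    return genes.get(gene_id) if gene_id is not None else None


gene = gene_by_name("geneA")
transcript_id = gene["trans_map"]["geneA-201"]
transcript = gene["trans"][transcript_id]
```

Keeping the primary containers keyed by id and the name maps separate means a name lookup costs two dict accesses, while ids stay the single source of truth.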
annotation type map
The annotation type map relates each attribute name used by this module to the description of the corresponding annotation type in the annotation file.
raw map table
Attributes | description |
---|---|
gene | gene |
trans | transcript |
exon | exon |
CDS | CDS |
start_codon | start_codon |
stop_codon | stop_codon |
UTR5 | five_prime_utr |
UTR3 | three_prime_utr |
other | other |
more info: other is an additional reserved attribute for reading annotation types that exist in the gtf file but not in the table.
replacing descriptions in the map table
from annokit import GTF
more info: anno_map can change the gtf descriptions of several attributes at once. Each attribute and its description are separated by a comma, and the pairs are separated by semicolons, like ‘{attributes1},{description1};{attributes2},{description2};...;{attributesN},{descriptionN}’.
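To make the string format concrete, here is a minimal sketch of how such a string could be applied to the raw map table. `apply_anno_map` is a hypothetical helper written for this illustration; whether AnnoKit rejects unknown attributes or malformed pairs in the same way is an assumption.

```python
# Default mapping copied from the raw map table above.
DEFAULT_ANNO_MAP = {
    "gene": "gene", "trans": "transcript", "exon": "exon", "CDS": "CDS",
    "start_codon": "start_codon", "stop_codon": "stop_codon",
    "UTR5": "five_prime_utr", "UTR3": "three_prime_utr", "other": "other",
}


def apply_anno_map(spec, base=DEFAULT_ANNO_MAP):
    """Return a copy of `base` updated from 'attr,desc;attr,desc' (hypothetical helper)."""
    updated = dict(base)
    for pair in spec.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing semicolon
        attribute, comma, description = pair.partition(",")
        attribute, description = attribute.strip(), description.strip()
        if not comma or not description:
            raise ValueError(f"expected 'attribute,description', got {pair!r}")
        if attribute not in updated:
            # Assumption: only attributes from the table may be remapped.
            raise KeyError(f"unknown attribute {attribute!r}")
        updated[attribute] = description
    return updated


anno_map = apply_anno_map("UTR5,UTR5;other,other_anno")
```

With this input, `UTR5` and `other` get new descriptions and every other attribute keeps its default.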
API
from annokit import GTF
more info: The three parameters name, version and URL are all optional. Their main purpose is to record information about the gtf file.
interval search
intervaltree: a mutable, self-balancing interval tree for Python 2 and 3. Queries may be by point, by range overlap, or by range envelopment.
from annokit import GTF
more info: The location parameter consists of the chromosome, start position, and end position, separated by colons, like ‘{chr}:{start}:{end}’.
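The sketch below parses a location string of this form and finds the genes that overlap it. It deliberately uses a plain linear scan instead of intervaltree so that it runs with the standard library alone. `parse_location` and `overlapping` are hypothetical helpers, the gene coordinates are invented, and treating both ends as inclusive is an assumption based on the GTF convention.

```python
def parse_location(location):
    """Split '{chr}:{start}:{end}' into (chr, start, end) with integer positions."""
    chrom, start, end = location.rsplit(":", 2)
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"start is greater than end in {location!r}")
    return chrom, start, end


def overlapping(genes, location):
    """Return ids of genes overlapping the location (linear scan, inclusive ends)."""
    chrom, start, end = parse_location(location)
    return [
        gene_id
        for gene_id, (g_chr, g_start, g_end) in genes.items()
        if g_chr == chrom and g_start <= end and start <= g_end
    ]


# Invented coordinates: gene id -> (chromosome, start, end)
genes = {
    "G1": ("chr1", 900, 1100),
    "G2": ("chr1", 6000, 7000),
    "G3": ("chr2", 1000, 2000),
    "G4": ("chr1", 5000, 5500),
}
hits = overlapping(genes, "chr1:1000:5000")  # G1 and G4 overlap the query
```

An interval tree answers the same question without scanning every gene, which matters for genome-sized annotation files. Note that intervaltree treats intervals as half-open, so inclusive coordinates need their end shifted by one when stored there.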
gene inquiry
Use a gene ID or gene name to look up the related information of the gene in the gtf file. The result is returned as a pandas DataFrame.
from annokit import GTF
ilevel | gene | trans | exon |
---|---|---|---|
geneid | ✓ | ✓ | ✓ |
genename | ✓ | ✓ | ✓ |
chr | ✓ | ✓ | ✓ |
start | ✓ | ✓ | ✓ |
end | ✓ | ✓ | ✓ |
strand | ✓ | ✓ | ✓ |
transid | ✕ | ✓ | ✓ |
transname | ✕ | ✓ | ✓ |
transstart | ✕ | ✓ | ✓ |
transend | ✕ | ✓ | ✓ |
exonid | ✕ | ✕ | ✓ |
exonstart | ✕ | ✕ | ✓ |
exonend | ✕ | ✕ | ✓ |
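The table shows that each level adds its own columns to those of the level above it. That cumulative relationship can be written down as a small sketch. The column order and the helper `columns_for` are assumptions for illustration and may not match the DataFrame that AnnoKit returns.

```python
GENE_COLUMNS = ["geneid", "genename", "chr", "start", "end", "strand"]
TRANS_COLUMNS = ["transid", "transname", "transstart", "transend"]
EXON_COLUMNS = ["exonid", "exonstart", "exonend"]

# Each level includes every column of the coarser levels.
COLUMNS_BY_LEVEL = {
    "gene": GENE_COLUMNS,
    "trans": GENE_COLUMNS + TRANS_COLUMNS,
    "exon": GENE_COLUMNS + TRANS_COLUMNS + EXON_COLUMNS,
}


def columns_for(level):
    """Return the columns available at the given ilevel (hypothetical helper)."""
    try:
        return list(COLUMNS_BY_LEVEL[level])
    except KeyError:
        raise ValueError(f"ilevel must be one of {sorted(COLUMNS_BY_LEVEL)}") from None
```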
If the corresponding information is missing, the replacement rules are as follows:
str: Missing string values are replaced by -, such as id, name, chr, strand and other string attributes.
int: Missing int values are replaced by 0, such as start, end and other int attributes.
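The two replacement rules can be sketched as a small function over a dict record. The field lists and the helper `fill_missing` are assumptions made for this illustration, not AnnoKit internals.

```python
STR_FIELDS = ("id", "name", "chr", "strand")   # replaced by "-" when missing
INT_FIELDS = ("start", "end")                  # replaced by 0 when missing


def fill_missing(record):
    """Return a copy of `record` with missing values replaced (hypothetical helper)."""
    filled = dict(record)
    for field in STR_FIELDS:
        if filled.get(field) is None:
            filled[field] = "-"
    for field in INT_FIELDS:
        if filled.get(field) is None:
            filled[field] = 0
    return filled


row = fill_missing({"id": "G1", "start": 100})
```

Here `row` keeps `id` and `start` as given, while `name`, `chr` and `strand` become `-` and `end` becomes `0`.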
name2id or id2name
Convert between gene names and gene ids.
# n2i
genes_id:
geneid1,geneid2,...,geneidN
genes_name:
genename1,genename2,...,genenameN
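A conversion over comma-separated strings like those above could look as follows. This is a sketch over a plain dict shaped like genes_map. The helper `convert`, the invented ids, and the choice of `-` for unknown entries are assumptions, and the reverse map is only valid when gene names are unique.

```python
def convert(keys, mapping, missing="-"):
    """Map a comma-separated string through `mapping` (hypothetical helper)."""
    parts = [key.strip() for key in keys.split(",")]
    return ",".join(mapping.get(key, missing) for key in parts if key)


genes_map = {"geneA": "G1", "geneB": "G2"}                     # name -> id
id_to_name = {gene_id: name for name, gene_id in genes_map.items()}

genes_id = convert("geneA,geneB", genes_map)       # name2id
genes_name = convert("G2,G1", id_to_name)          # id2name
```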
CLI
interval search
AnnoGtf -t searchs -g test.gtf -l chr1:1000:5000 -o test -od ./ -am UTR5,UTR5\;other,other_anno
name2id or id2name
genes
more info: The genes file contains gene names or gene ids, one per line.
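Reading a file in this one-entry-per-line format takes only a few lines. The sketch below uses an in-memory file so that it is self-contained. `read_gene_list` is a hypothetical helper, and skipping blank lines is an assumption about how such a file should be treated.

```python
import io


def read_gene_list(handle):
    """Return the non-empty, stripped lines of a genes file (hypothetical helper)."""
    return [line.strip() for line in handle if line.strip()]


# In-memory stand-in for a genes file with one name or id per line.
example = io.StringIO("geneA\nG2\n\ngeneC\n")
entries = read_gene_list(example)
```

For a file on disk, the same function can be used with `open(path)` in a `with` block.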
gene inquiry
genesid
jupyter sample
more usage: https://github.com/iOLIGO/AnnoKit/blob/main/tests/GTF.ipynb.