pluaron
2024-05-28

generator

sequence generator

usage

SeqBox generator
usage: SeqBox generator [-h] [-out OUT] [-od OUT_DIR]
                        [-gt {random,norepeat,distance}] [-gn GNUMBER]
                        [-gts GTIMES] [-gc GCHAR] [-gd GDISTANCE] [-gl GLEN]

optional arguments:
-h, --help            show this help message and exit
-out OUT, --out_file OUT
                        sequence out file with TSV format
-od OUT_DIR, --out_dir OUT_DIR
                        out direction
-gt {random,norepeat,distance}, --gtype {random,norepeat,distance}
-gn GNUMBER, --gnumber GNUMBER
                        sequence number
-gts GTIMES, --gtimes GTIMES
                        Number of runs
-gc GCHAR, --gchar GCHAR
                        generator charset, like: ATCG
-gd GDISTANCE, --gdistance GDISTANCE
                        distance
-gl GLEN, --glen GLEN
                        generator sequence length

random

Randomly generated sequence, possible repetitions.

API

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
from seqbox import SEQ
seq = SEQ(name="test")
seq.name
# 'test'
Grandom = seq.Gseq_random(seq_len=35)
n = 0
while n <= 10:
seq = next(Grandom)
print(seq)
n = n + 1
#2024-05-21 15:16:40.176 | WARNING | seqbox.seq:Gseq_random:150 - not found params char_set, use 'ATCG' with default
#CATCGTGCATAACTCGGTAGACAGGCAAGGCAACG
#CACTCGGTCTTGCCATGTGTCCTGTAGGTGACTAA
#AGGAACCCCCGTAGTAGCCCGACAGCAAGGTTCAT
#AGTGTGGATAAAAGAGCACCCACAGAGGACCAAAA
#CAAAACGTATGGCACGCTAACCATTTTCTAAACCG
#GATTTAAGCACAAGACCCTTGTTGGAAAGCAATTC
#AGGTCAACGGGCATTTAGTTCAGTTCCAGTGACCC
#ATGACAGGGAGTGTCATCTGAAAGTGCCAACGGGG
#TAGACTCACTATTATGCCTTCCATTCTCGGAGGTG
#CCTCACCGTTTCGCGGAACGTTTCAACCAAACTAA
#TGAGGGTCTACGATAGCTGTCAAGGAAGACCACTT

seq = SEQ(name="test")
Grandom_RNA = seq.Gseq_random(seq_len=35, char_set="AUCG")
n = 0
while n <= 10:
seq = next(Grandom_RNA)
print(seq)
n = n + 1
#UGAGGACGAGGGCGGUGAUAGGCAUUUUAAUUUGG
#UUGUUUAUCUAGUCCUCCAGCCUCGGAACCGGAAC
#GCUACAUUGUCUGCGGGUUUGACAACGACACGGAU
#CACUGAAGGCUGUUCCCUACUGACUUCGAGGCGCA
#CGAGACUCUUGAAUGUAUAAAAGGUAAGAUAUAUC
#GCCUUUUUUCGGGAGUUACUUCGAGUUGUUCAAUC
#AAGUGUUCCGACCUGAUCCUGCCUCCUGAUCACGU
#GGGGUGGAUUCUGGUAUAGGUGAGGCUAAUCAGUU
#CCUAGCACGACCAAGUUCAAUAGAACGAUCAGAGG
#CGAAUGUACGAGAUCGUCAGGCGGGCCGAUGUUGG
#AUCUGACCUAGGCCUUGGGAGGGUACCCACAAAAG

CLI

The main parameters include the length(-gl or --glen) and number(-gn or -gnumber) of generated sequences.

1
2
3

SeqBox generator -gt random -gn 100 -gl 35 -out test_random

output: test_random.tsv
#generator_random:seq_len:35;char_set:ATCG
GGTCCGACCTCATCTGGATGCTCCAATGTGGGCTG
AGGCATATGGATCGCCGACACCCGTGCTACAGTTA
TCAAGCGCGAACCGGGTACCTGCCGAAACCGTATA
AACAGTGTTGCGCAGTGCCTGCACTTAAACAAATC
GATATAGGGTCTCGTTAGTACGACGATTTCGCGAG
CCCACAGGTCGCAGACTCCGCTGTTGCTTGAAGGC
CGTTAAAGCTCAATCATCAACCCGATACGTTGTCT
GAGAGCCTAGAACAAGGTACACCGAAGACGAGACG
GCGCGGCTGTCCTTAGATATAGGTAGCAATACTGA
...
CGTACTGATCAAATAACCCCGCAGACGGGTAATGC

norepeat

randomly generate non-repeating sequences.

API

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42

from seqbox import SEQ
seq = SEQ(name="test")
seq.name
#'test'
Gnorepeat = seq.Gseq_norepeat(seq_len=35)
n = 0
while n <= 10:
seq = next(Gnorepeat)
print(seq)
n = n + 1
#2024-05-21 16:08:06.289 | WARNING | seqbox.seq:Gseq_norepeat:177 - not found params char_set, use 'ATCG' with default
#GGTTGTAACAGCGATAACACGGTTTAACGGACCAG
#GTTAACTCTAACATATACCTCCTTTCTTCGCCACA
#GCGTCGGATATAGAAGCTTGCGCCCTAATGTTATA
#AATGCCTTGCTCGATTCAATTAGCTCCCAAACTTT
#TTGAGAGAGGTCCTGTCGTTCTTATCCTTTCCTAA
#CGTAACATATAGGATACTTACACGCAGGAGAGTAT
#CCCACCCTACGCATCCGCCGTCGGTCTGTCGCTAT
#AACGCGATGTCTAGAAGACCATGTTTCACCTAACT
#CGGCGGTGGTTATGGGATATGGGGCTCTGAGTGAG
#GTGCAGCCACGTGAATTTACGTCCTTCGCCGATCC
#CGGCAAATACTCTTGGATAGAGATTAGCAGCGAGC

seq = SEQ(name="test")
Gnorepeat_RNA = seq.Gseq_random(seq_len=35, char_set="AUCG")
n = 0
while n <= 10:
seq = next(Gnorepeat_RNA)
print(seq)
n = n + 1
#CGUCUUAAUUAAACCUCGAAUUGGGAGUUGGCUUG
#CCCAAUCUUCCACUUCUAUAGCGGUAACAUUAAAU
#UCACAUGCAACUCCACGGAACCAAUUAAACCAGGA
#GUGGAAAACCACUUCCAGACUGUAAAGCCAAUGGU
#CUGGACCUACGAUCGCAAAGAUAUAGCUCCUGUAG
#GAGUUUGGCUAGCAGUAUGAGGAGAUUGCAAAUCU
#UCGCUCACGGCGAAACAAUUCUGUAUUUGAGAAGG
#UUAUUCACGAUAUCGGCGGACAGCGAAGGAGUUAU
#AGGACCCUACAUGACGCACAGAGACCAGGUAUCGG
#AGUGUGGGCGUCUCCCGAAACGAAUCGCAUUGUUC
#UUCGUACCGAAGGUUUAGUGGAAUAGGUGGAGAGU

CLI

The main parameters include the length(-gl or --glen) and number(-gn or -gnumber) of generated sequences.

1
2
3

SeqBox generator -gt norepeat -gn 100 -gl 35 -out test_norepeat

output: test_norepeat.tsv
#generator_norepeat:seq_len:35;char_set:ATCG
ACTAGATTTTGATTTGGTCCGGAGTTAGAGATCGT
GGGATCGAAAGGGGTCGCCTCTCTTGAGAGCATTG
GCTATTTATTCAAATAGACTATATACAACAGTACA
GGACCTGTAGCGGCGTAGAATGTGCTGTGATACGA
CCTTGGACAGTGGGGTATAACCTATGGTGTGAGTA
TCACCTTTATTCAGGCGTATCTACGGTACTATCAA
GTAGGGTTTCTACCGTTTGAGCATGTAGATGCCAT
GCTTAAGTGATGTAAGGTGGCTTACCATCATCGAA
...
CATACCACGTAACAACCCGTAGGTTCGCGTTAGGT

distance

randomly generate sequences with edit distances between sequences greater than a fixed value.

API

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65

from seqbox import SEQ
seq = SEQ(name="test")
seq.name
#'test'
Gdistance = seq.Gseq_distance(seq_len=35, distance=5)
n = 0
while n <= 10:
seq = next(Gdistance)
print(seq)
n = n + 1
#2024-05-21 16:13:45.385 | WARNING | seqbox.seq:Gseq_random:150 - not found params char_set, use 'ATCG' with default
#CCGAGATAAGATACCGCCGCGGGCAACAGCAACCT
#TGAACGTACTTTTCGTCCTGGCGTCATGGGCTAGC
#ACTTAATCTTTGTTTCGACTGTGAGTCACCTATGT
#ACTGTTTCTCCGAGCACTGTCCAATCGTCTTCCCT
#CCTTACGTCCCGTTGTGCGTGTCCATTCGGCAACG
#TAGATTAACTAGCTTGCCGATCGATCAGTTAATTC
#GGAGACCTCAGCAAAGTTCCGCCTCTAAACAGGTA
#CCTCTTTAACATATTGCCTAATCACCGTGGGATAT
#TCGTTAGGAATGTGGACATTTTGAATGGCCAACCT
#TGAAGTAATATTAATCGAAGTTTATTGAGCTAACC
#AGTTGACATTAGCACGTGTGGGATGCGGTCCGAAT

seq = SEQ(name="test")
Gdistance_RNA = seq.Gseq_distance(seq_len=35, char_set="AUCG", distance=5)
n = 0
while n <= 10:
seq = next(Gdistance_RNA)
print(seq)
n = n + 1
#UGUGGCUCCGCACACUGGGCGCUGCCGUGAUGAGA
#ACUAGCAACGACGUAAUAGGUGCUAAUUUACAGUU
#CCUCAACGUAGUAGGGAUUCCAGCUCUACGCAUGA
#CCGCAGACCCAUAGUUAUAUAGCUUCCACUUUUUA
#CAGAGGUUCUUGCGUCUAAUCUCAUUGCAAGUAAU
#GCCCAUGGGGUACUCUUGCCCGAUGUUUGACUGUG
#UCUCCUUCGGGUAGGUGCCACACCGAGCCUUGAGU
#AACCAGAGAGUGACCAUUUGACAACGCACUACCUU
#UCACGUCAAGUUAAGCGAGCAAGAGGGCGCGUCGU
#UAACCUAUGUUCUACUCUUCACUUCUAAGUUAAGG
#CUUCGACCUCUACAUGAAGGCAAGCUUUGAAGUAU


seq = SEQ(name="test")
# When the distance is relatively large, it takes too long to obtain a sufficient number of entries.
# At this time, set the times parameter to control the number of attempts and avoid too long loops.
Gdistance = seq.Gseq_distance(seq_len=35, distance=35, times=10)
n = 0
while n < 10:
seq = next(Gdistance)
print(seq)
n = n + 1
#2024-05-21 16:31:31.828 | WARNING | seqbox.seq:Gseq_random:150 - not found params char_set, use 'ATCG' with default
#GGAATATAGCGTTATTCAGACCTAGTAGACAATTC
#---------------------------------------------------------------------------
#StopIteration Traceback (most recent call last)
#Cell In[14], line 3
# 1 n = 0
# 2 while n < 10:
#----> 3 seq = next(Gdistance)
# 4 print(seq)
# 5 n = n + 1
#
#StopIteration:

CLI

The main parameters include the length(-gl or --glen), number(-gn or -gnumber), distance(-gd or --gdistance) and try_times(-gts or gtimes) of generated sequences.

1
2
3

SeqBox generator -gt distance -gn 100 -gl 35 -gd 5 -out test_distance

output: test_norepeat.tsv
#generator_distance:seq_len:35;char_set:ATCG;distance:5
ACCATTAGCACCAACAGGCAAGCTCCTGCACGGTA
GTGCAGGCCCAACTTTCCCCACCTATAGGCTACGG
GACCGGGCGGGACTTTCGCCCAATCATCACATACC
AACCGGTAGTCGATGAGCGCTCATTAACACGAAGC
GTTCTGGTCATTTATCCTCCCTCAGGTACGGATTT
TTGCCGCTCAATTGAAAGGTACTGCCAGGAGTGTC
AGGCCAGAACGGATATACTAGTTGCTCCAACCTGA
ATTGACAGCAGGCGCAAGACATGCCCTAAGCCCTA
GTAACTATCCCGAGTCGACGCAGATTGTGCTTCGG