HOMEWORK#6
Sequence analysis
David Liu, a student in the Department of Life Sciences, got a gene for
a mouse transcription factor. The sequence of this gene is listed here:
1 catcagacct ccgtacttgg ctttgcagtg cccgccactg tctcctgcgc tcccgcgccg
61 cgttccgccc aggccttgcc cagctggaat gcagagatcg ccgcccggct acggcgcaca
121 ggacgacccg ccctcccgcc gcgactgtgc atgggcccct ggaatcgggg ccgctgctga
181 ggcgcgcggc ctccctgtca ccaacgtctc gcccacctcg cccgcctccc cgtccagcct
241 tccgcggagc ccaccgcgca gccccgaatc agggcgctat ggctttggcc gcggagagcg
301 ccaaactgcc gacgagttgc gcattcggcg gcccatgaac gccttcatgg tgtgggcgaa
361 ggacgagcgc aagcgactgg cgcaacaaaa tccggatctg cacaacgcag tactgagcaa
421 gatgctgggc aaagcgtgga aggagctgaa cacggcggag aagcggccct tcgtggaaga
481 ggccgaacgg ttgcgtgtgc agcacttgcg cgaccatccc aactacaagt accggcctcg
541 tcgtaaaaaa caggagcgca aggtccggag gctggagccg ggtctcttgc tcccgggcct
601 cgtgcagccg tctgcgccgc ccgaggcctt cgctgcagcg tcagggtcag ctcgctcctt
661 ccgcgagcta cccactctgg gtgcggagtt cgatggcttg gggctaccca cgcccgagcg
721 ctcgcctctg gacggcctgg agcctggcga ggcctccttc ttcccaccgc ctttggcgcc
781 cgaggactgc gctctgcggg ctttccgggc accctatgcc cctgagctgg cacgggaccc
841 gagcttctgc tacggggcgc ccctgggtga agcgctcagg acagcgccgc ctgccgcgcc
901 actcgcaggt ctctactatg gcaccctggg hacbccgggc ccgnttccma atcctctgtc
961 accaccacct gagtccccgt ctcttgaggg cacagagcaa ctggagccta ccgccgacct
1021 ttgggccgat gtggacctca ccgaatttga ccagtatctc aattgcagcc ggactcgacc
1081 ggatgccact acactcccct accacgtggn actggccaaa ctaggtccgc gcgccatgtc
1141 ctgtccagaa gagagcagcc tcatttctgc gctgtctgat gctagcagcg cggtctatta
1201 cagtgcttgc atctcaggct agacactgtc cttgccctcc accgcttctg catgtggcca
1261 agtggcagag ttgcctgctc ccttcctttc gcatatgtat gttagggtat gcaacagcct
1321 ttagagctgg tggcctaaag atgccatttc tgtcgcctcc tcatttacac acctccttct
1381 gggggktncc tgtgctttgg gccttcccta ggatcgtcag gccctggacg tgcaagctac
1441 ctctgccagg attggtggtg aagaagctaa ggcttttctg ccatttatgt tctagaatga
1501 ggctgttctg tttactttgc cgggatatac atatatcata tataatacaa tatatttaat
1561 ttttaattaa acttttttct ttaag
Unfortunately, he doesn't know how to use the sequence analysis tools
available on the internet because he has not taken Bioinformatics.
Could you help him do the following analyses?
(1) Find its corresponding polypeptide sequence
(DNA -> Protein translation).
SEQUENCE 377 AA; 41063 MW; CRC32.
MQRSPPGYGA QDDPPSRRDC AWAPGIGAAA EARGLPVTNV SPTSPASPSS LPRSPPRSPE
SGRYGFGRGE RQTADELRIR RPMNAFMVWA KDERKRLAQQ NPDLHNAVLS KMLGKAWKEL
NTAEKRPFVE EAERLRVQHL RDHPNYKYRP RRKKQERKVR RLEPGLLLPG LVQPSAPPEA
FAAASGSARS FRELPTLGAE FDGLGLPTPE RSPLDGLEPG EASFFPPPLA PEDCALRAFR
APYAPELARD PSFCYGAPLG EALRTAPPAA PLAGLYYGTL XXPGPXXNPL SPPPESPSLE
GTEQLEPTAD LWADVDLTEF DQYLNCSRTR PDATTLPYHV XLAKLGPRAM SCPEESSLIS
ALSDASSAVY YSACISG
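The translation step can be sketched in a few lines of Python. This is only an illustration, not the tool that produced the result above. It assumes the standard genetic code, searches the three forward frames only, and reports any codon containing an ambiguity code (n, h, b, m, k, ...) as X, which matches the X residues in the listing. The function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def clean(raw):
    """Drop position numbers, spaces and newlines from a numbered listing."""
    return "".join(ch for ch in raw.lower() if ch.isalpha())


def translate(dna):
    """Translate from the first base; codons not in the table become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def longest_orf(dna):
    """Longest Met-to-stop peptide over the three forward reading frames."""
    best = ""
    for frame in range(3):
        protein = translate(dna[frame:])
        # The last piece after split() is not closed by a stop codon; skip it.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


if __name__ == "__main__":
    toy = clean("1 ccatgcagag atcgnttccm tagaa")
    print(longest_orf(toy))
```

To use it on the gene above, paste the numbered listing into clean() and pass the result to longest_orf().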
(2) Identify this protein. Is it a new protein? If not, what's the name
of this protein?
It is not a new protein. The best database match is mouse transcription
factor SOX-18 (Swiss-Prot P43680):
>sp|P43680|SX18_MOUSE TRANSCRIPTION FACTOR SOX-18 Length = 377
Score = 1972 (895.0 bits), Expect = 8.1e-261, P = 8.1e-261
Identities = 374/377 (99%), Positives = 372/377 (98%)
Query: 1 MQRSPPGYGAQDDPPSRRDCAWAPGIGAAAEARGLPVTNVSPTSPASPSSLPRSPPRSPE 60
MQRSPPGYGAQDDPPSRRDCAWAPGIGAAAEARGLPVTNVSPTSPASPSSLPRSPPRSPE
Sbjct: 1 MQRSPPGYGAQDDPPSRRDCAWAPGIGAAAEARGLPVTNVSPTSPASPSSLPRSPPRSPE 60
Query: 61 SGRYGFGRGERQTADELRIRRPMNAFMVWAKDERKRLAQQNPDLHNAVLSKMLGKAWKEL 120
SGRYGFGRGERQTADELRIRRPMNAFMVWAKDERKRLAQQNPDLHNAVLSKMLGKAWKEL
Sbjct: 61 SGRYGFGRGERQTADELRIRRPMNAFMVWAKDERKRLAQQNPDLHNAVLSKMLGKAWKEL 120
Query: 121 NTAEKRPFVEEAERLRVQHLRDHPNYKYRPRRKKQERKVRRLEPGLLLPGLVQPSAPPEA 180
NTAEKRPFVEEAERLRVQHLRDHPNYKYRPRRKKQERKVRRLEPGLLLPGLVQPSAPPEA
Sbjct: 121 NTAEKRPFVEEAERLRVQHLRDHPNYKYRPRRKKQERKVRRLEPGLLLPGLVQPSAPPEA 180
Query: 181 FAAASGSARSFRELPTLGAEFDGLGLPTPERSPLDGLEPGEASFFPPPLAPEDCALRAFR 240
FAAASGSARSFRELPTLGAEFDGLGLPTPERSPLDGLEPGEASFFPPPLAPEDCALRAFR
Sbjct: 181 FAAASGSARSFRELPTLGAEFDGLGLPTPERSPLDGLEPGEASFFPPPLAPEDCALRAFR 240
Query: 241 APYAPELARDPSFCYGAPLGEALRTAPPAAPLAGLYYGTLXXPGPXXNPLSPPPESPSLE 300
APYAPELARDPSFCYGAPLGEALRTAPPAAPLAGLYYGTL  PGPX NPLSPPPESPSLE
Sbjct: 241 APYAPELARDPSFCYGAPLGEALRTAPPAAPLAGLYYGTLGTPGPXPNPLSPPPESPSLE 300
Query: 301 GTEQLEPTADLWADVDLTEFDQYLNCSRTRPDATTLPYHVXLAKLGPRAMSCPEESSLIS 360
GTEQLEPTADLWADVDLTEFDQYLNCSRTRPDATTLPYHVXLAKLGPRAMSCPEESSLIS
Sbjct: 301 GTEQLEPTADLWADVDLTEFDQYLNCSRTRPDATTLPYHVXLAKLGPRAMSCPEESSLIS 360
Query: 361 ALSDASSAVYYSACISG 377
ALSDASSAVYYSACISG
Sbjct: 361 ALSDASSAVYYSACISG 377
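The identity count in the alignment header can be checked by comparing the aligned rows position by position. The sketch below does this for the only stretch where Query and Sbjct differ (residues 278-290). It assumes an ungapped alignment and counts X paired with X as identical, as the middle line of the alignment does. The function name is hypothetical.

```python
def count_identities(query, subject):
    """Count positions where two equal-length aligned rows carry the same letter."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    return sum(q == s for q, s in zip(query, subject))


if __name__ == "__main__":
    query = "GTLXXPGPXXNPL"    # Query residues 278-290
    subject = "GTLGTPGPXPNPL"  # Sbjct residues 278-290
    same = count_identities(query, subject)
    print(f"{same}/{len(query)} identical, {len(query) - same} mismatches")
```

The three mismatches in this stretch are consistent with the reported 374/377 identities, since all other rows of the alignment are identical.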
(3) Report the total number of negatively charged residues and positively
charged residues.
Total number of negatively charged residues (Asp + Glu): 45
Total number of positively charged residues (Arg + Lys): 45
(49 if the 4 His residues are also counted)
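Counts like these can be reproduced by tallying residues directly. A minimal sketch, assuming the one-letter protein sequence is available as a plain string (spaces are ignored); the function name is made up for this example.

```python
from collections import Counter


def charged_residue_counts(protein):
    """Return (Asp + Glu, Arg + Lys) for a one-letter protein sequence."""
    counts = Counter(protein.upper().replace(" ", ""))
    negative = counts["D"] + counts["E"]
    positive = counts["R"] + counts["K"]
    return negative, positive


if __name__ == "__main__":
    # Residues 91-100 of the translated protein, used as a small example.
    neg, pos = charged_residue_counts("KDERKRLAQQ")
    print("Asp + Glu:", neg)
    print("Arg + Lys:", pos)
```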
(4) Draw the hydrophobicity map for this protein using the Eisenberg
hydrophobicity scale with a window size of 7. The relative weight of
the window edges compared to the window center should be set to 40%.
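The profile behind such a map can be sketched as a weighted sliding window. The code below is an illustration under stated assumptions, not the exact procedure of any particular web tool: it uses the Eisenberg consensus values as commonly tabulated, a linear ramp of weights from 40% at the window edges to 100% at the center, a weighted average within each window, and it leaves windows containing X unscored. All function names are hypothetical.

```python
# Eisenberg consensus hydrophobicity values (as commonly tabulated).
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def window_weights(size=7, edge=0.4):
    """Linear ramp: `edge` at both ends of the window, 1.0 at the center."""
    half = size // 2
    return [edge + (1.0 - edge) * (half - abs(i - half)) / half
            for i in range(size)]


def hydropathy_profile(seq, size=7, edge=0.4):
    """Return (position, score) pairs; position is the 1-based window center.

    Assumes an odd window size. Windows containing a residue without a
    scale value (for example X) get a score of None.
    """
    weights = window_weights(size, edge)
    total = sum(weights)
    half = size // 2
    profile = []
    for center in range(half, len(seq) - half):
        window = seq[center - half:center + half + 1]
        if any(res not in EISENBERG for res in window):
            profile.append((center + 1, None))
            continue
        score = sum(w * EISENBERG[r] for w, r in zip(weights, window)) / total
        profile.append((center + 1, score))
    return profile


if __name__ == "__main__":
    for pos, score in hydropathy_profile("MQRSPPGYGAQDDPPSRRDC"):
        print(pos, score)
```

Plotting position against score with any charting tool then gives the hydrophobicity map.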
(5) Please help him use the Prosite scanning tool to find possible
functions or patterns of this protein.
[1] PDOC00001 PS00001 ASN_GLYCOSYLATION
N-glycosylation site
325-328 NCSR
[2] PDOC00005 PS00005 PKC_PHOSPHO_SITE
Protein kinase C phosphorylation site
Number of matches: 4
1 16-18 SRR
2 61-63 SGR
3 187-189 SAR
4 190-192 SFR
[3] PDOC00006 PS00006 CK2_PHOSPHO_SITE
Casein kinase II phosphorylation site
Number of matches: 7
1 16-19 SRRD
2 73-76 TADE
3 190-193 SFRE
4 212-215 SPLD
5 318-321 TEFD
6 329-332 TRPD
7 351-354 SCPE
[4] PDOC00007 PS00007 TYR_PHOSPHO_SITE
Tyrosine kinase phosphorylation site
57-64 RSPESGRY
[5] PDOC00008 PS00008 MYRISTYL
N-myristoylation site
Number of matches: 6
1 25-30 GIGAAA
2 34-39 GLPVTN
3 186-191 GSARSF
4 216-221 GLEPGE
5 256-261 GAPLGE
6 274-279 GLYYGT
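Hits like those listed above come from matching short PROSITE patterns against the sequence. As a rough sketch, a pattern in PROSITE syntax can be converted to a regular expression and scanned with overlapping matches. The pattern strings below are given as they are commonly documented for these entries and should be checked against the current PROSITE release. The converter handles only the simple elements used here (single residues, x, [..], {..}, and repeat counts), and the function names are hypothetical.

```python
import re

# Pattern strings as commonly documented (an assumption; verify in PROSITE).
PATTERNS = {
    "PS00001 ASN_GLYCOSYLATION": "N-{P}-[ST]-{P}",
    "PS00005 PKC_PHOSPHO_SITE": "[ST]-x-[RK]",
    "PS00006 CK2_PHOSPHO_SITE": "[ST]-x(2)-[DE]",
    "PS00008 MYRISTYL": "G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}",
}

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                 # any residue
            core = "."
        elif core.startswith("{"):      # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        if low is not None:             # repeat count, e.g. x(2) or x(2,4)
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def scan(seq, pattern):
    """Return (start, end, match) with 1-based positions, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.start() + len(m.group(1)), m.group(1))
            for m in regex.finditer(seq)]


if __name__ == "__main__":
    fragment = "GSARSFRE"  # residues 186-193 of the protein
    for name, pattern in PATTERNS.items():
        print(name, scan(fragment, pattern))
```

Positions are relative to the string that is scanned, so scanning the full protein gives coordinates comparable to the listing.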
(6) Color the protein by the hydrophobicity of the amino acids.
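One simple way to color a sequence is to assign each residue to a class by its scale value and wrap it in a colored HTML span. The sketch below is only an illustration: the Eisenberg values are as commonly tabulated, while the thresholds (-0.5 and 0.5), the color names, and the function names are arbitrary choices for this example.

```python
# Eisenberg consensus hydrophobicity values (as commonly tabulated).
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def colour_for(residue, low=-0.5, high=0.5):
    """Pick a color name; the thresholds are arbitrary illustration values."""
    value = EISENBERG.get(residue)
    if value is None:       # X or any other letter without a scale value
        return "gray"
    if value >= high:       # hydrophobic
        return "red"
    if value <= low:        # hydrophilic
        return "blue"
    return "black"          # intermediate


def colour_html(seq):
    """Wrap every residue in a colored span inside a <pre> block."""
    spans = (f'<span style="color:{colour_for(r)}">{r}</span>' for r in seq)
    return "<pre>" + "".join(spans) + "</pre>"


if __name__ == "__main__":
    print(colour_html("MQRSPPGYGA"))
```

Saving the output to an .html file and opening it in a browser shows the colored sequence.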