HOMEWORK#6
      
     Sequence analysis
                                    
     David Liu, a student in the Department of Life Sciences, obtained a gene for
     a mouse transcription factor. The sequence of this gene is listed here:
   1 catcagacct ccgtacttgg ctttgcagtg cccgccactg tctcctgcgc tcccgcgccg
  61 cgttccgccc aggccttgcc cagctggaat gcagagatcg ccgcccggct acggcgcaca
 121 ggacgacccg ccctcccgcc gcgactgtgc atgggcccct ggaatcgggg ccgctgctga
 181 ggcgcgcggc ctccctgtca ccaacgtctc gcccacctcg cccgcctccc cgtccagcct
 241 tccgcggagc ccaccgcgca gccccgaatc agggcgctat ggctttggcc gcggagagcg
 301 ccaaactgcc gacgagttgc gcattcggcg gcccatgaac gccttcatgg tgtgggcgaa
 361 ggacgagcgc aagcgactgg cgcaacaaaa tccggatctg cacaacgcag tactgagcaa
 421 gatgctgggc aaagcgtgga aggagctgaa cacggcggag aagcggccct tcgtggaaga
 481 ggccgaacgg ttgcgtgtgc agcacttgcg cgaccatccc aactacaagt accggcctcg
 541 tcgtaaaaaa caggagcgca aggtccggag gctggagccg ggtctcttgc tcccgggcct
 601 cgtgcagccg tctgcgccgc ccgaggcctt cgctgcagcg tcagggtcag ctcgctcctt
 661 ccgcgagcta cccactctgg gtgcggagtt cgatggcttg gggctaccca cgcccgagcg
 721 ctcgcctctg gacggcctgg agcctggcga ggcctccttc ttcccaccgc ctttggcgcc
 781 cgaggactgc gctctgcggg ctttccgggc accctatgcc cctgagctgg cacgggaccc
 841 gagcttctgc tacggggcgc ccctgggtga agcgctcagg acagcgccgc ctgccgcgcc
 901 actcgcaggt ctctactatg gcaccctggg hacbccgggc ccgnttccma atcctctgtc
 961 accaccacct gagtccccgt ctcttgaggg cacagagcaa ctggagccta ccgccgacct
1021 ttgggccgat gtggacctca ccgaatttga ccagtatctc aattgcagcc ggactcgacc
1081 ggatgccact acactcccct accacgtggn actggccaaa ctaggtccgc gcgccatgtc
1141 ctgtccagaa gagagcagcc tcatttctgc gctgtctgat gctagcagcg cggtctatta
1201 cagtgcttgc atctcaggct agacactgtc cttgccctcc accgcttctg catgtggcca
1261 agtggcagag ttgcctgctc ccttcctttc gcatatgtat gttagggtat gcaacagcct
1321 ttagagctgg tggcctaaag atgccatttc tgtcgcctcc tcatttacac acctccttct
1381 gggggktncc tgtgctttgg gccttcccta ggatcgtcag gccctggacg tgcaagctac
1441 ctctgccagg attggtggtg aagaagctaa ggcttttctg ccatttatgt tctagaatga
1501 ggctgttctg tttactttgc cgggatatac atatatcata tataatacaa tatatttaat
1561 ttttaattaa acttttttct ttaag

Unfortunately, he does not know how to use the sequence analysis tools
available on the Internet, because he has not taken Bioinformatics.
Could you help him with the following analyses?

(1) Find the corresponding polypeptide sequence (DNA -> protein translation).

SEQUENCE 377 AA; 41063 MW; CRC32.
MQRSPPGYGA QDDPPSRRDC AWAPGIGAAA EARGLPVTNV SPTSPASPSS LPRSPPRSPE
SGRYGFGRGE RQTADELRIR RPMNAFMVWA KDERKRLAQQ NPDLHNAVLS KMLGKAWKEL
NTAEKRPFVE EAERLRVQHL RDHPNYKYRP RRKKQERKVR RLEPGLLLPG LVQPSAPPEA
FAAASGSARS FRELPTLGAE FDGLGLPTPE RSPLDGLEPG EASFFPPPLA PEDCALRAFR
APYAPELARD PSFCYGAPLG EALRTAPPAA PLAGLYYGTL XXPGPXXNPL SPPPESPSLE
GTEQLEPTAD LWADVDLTEF DQYLNCSRTR PDATTLPYHV XLAKLGPRAM SCPEESSLIS
ALSDASSAVY YSACISG

(2) Identify this protein. Is it a new protein? If not, what is the name of this protein?
>sp|P43680|SX18_MOUSE TRANSCRIPTION FACTOR SOX-18
Length = 377

Score = 1972 (895.0 bits), Expect = 8.1e-261, P = 8.1e-261
Identities = 374/377 (99%), Positives = 372/377 (98%)

Query:   1 MQRSPPGYGAQDDPPSRRDCAWAPGIGAAAEARGLPVTNVSPTSPASPSSLPRSPPRSPE 60
           MQRSPPGYGAQDDPPSRRDCAWAPGIGAAAEARGLPVTNVSPTSPASPSSLPRSPPRSPE
Sbjct:   1 MQRSPPGYGAQDDPPSRRDCAWAPGIGAAAEARGLPVTNVSPTSPASPSSLPRSPPRSPE 60

Query:  61 SGRYGFGRGERQTADELRIRRPMNAFMVWAKDERKRLAQQNPDLHNAVLSKMLGKAWKEL 120
           SGRYGFGRGERQTADELRIRRPMNAFMVWAKDERKRLAQQNPDLHNAVLSKMLGKAWKEL
Sbjct:  61 SGRYGFGRGERQTADELRIRRPMNAFMVWAKDERKRLAQQNPDLHNAVLSKMLGKAWKEL 120

Query: 121 NTAEKRPFVEEAERLRVQHLRDHPNYKYRPRRKKQERKVRRLEPGLLLPGLVQPSAPPEA 180
           NTAEKRPFVEEAERLRVQHLRDHPNYKYRPRRKKQERKVRRLEPGLLLPGLVQPSAPPEA
Sbjct: 121 NTAEKRPFVEEAERLRVQHLRDHPNYKYRPRRKKQERKVRRLEPGLLLPGLVQPSAPPEA 180

Query: 181 FAAASGSARSFRELPTLGAEFDGLGLPTPERSPLDGLEPGEASFFPPPLAPEDCALRAFR 240
           FAAASGSARSFRELPTLGAEFDGLGLPTPERSPLDGLEPGEASFFPPPLAPEDCALRAFR
Sbjct: 181 FAAASGSARSFRELPTLGAEFDGLGLPTPERSPLDGLEPGEASFFPPPLAPEDCALRAFR 240

Query: 241 APYAPELARDPSFCYGAPLGEALRTAPPAAPLAGLYYGTLXXPGPXXNPLSPPPESPSLE 300
           APYAPELARDPSFCYGAPLGEALRTAPPAAPLAGLYYGTL  PGPX NPLSPPPESPSLE
Sbjct: 241 APYAPELARDPSFCYGAPLGEALRTAPPAAPLAGLYYGTLGTPGPXPNPLSPPPESPSLE 300

Query: 301 GTEQLEPTADLWADVDLTEFDQYLNCSRTRPDATTLPYHVXLAKLGPRAMSCPEESSLIS 360
           GTEQLEPTADLWADVDLTEFDQYLNCSRTRPDATTLPYHVXLAKLGPRAMSCPEESSLIS
Sbjct: 301 GTEQLEPTADLWADVDLTEFDQYLNCSRTRPDATTLPYHVXLAKLGPRAMSCPEESSLIS 360

Query: 361 ALSDASSAVYYSACISG 377
           ALSDASSAVYYSACISG
Sbjct: 361 ALSDASSAVYYSACISG 377
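
The translation asked for in step (1) can also be checked offline. The following is a minimal sketch, not the tool used for the answer above: it assumes the standard genetic code, a single forward reading frame chosen by the caller, and that any codon containing an ambiguity code (n, h, b, m, k, ...) is reported as X. The function names `clean_sequence` and `translate` are hypothetical helpers introduced only for this illustration.

```python
import re

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def clean_sequence(text: str) -> str:
    """Drop position numbers and spaces from a GenBank-style listing."""
    return re.sub(r"[^A-Za-z]", "", text).upper()


def translate(dna: str, start: int = 0, to_stop: bool = True) -> str:
    """Translate one forward frame beginning at 0-based index `start`.

    Assumption: codons that are not in the table (for example, codons
    containing ambiguity codes) are written as 'X'.
    """
    seq = clean_sequence(dna)
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


# Bases 81-120 of the listing above; the first ATG in this fragment is
# used as the start codon for the illustration.
fragment = "cagctggaat gcagagatcg ccgcccggct acggcgcaca"
start = clean_sequence(fragment).find("ATG")
print(translate(fragment, start=start))
```

For the full gene, the whole listing can be passed to `translate` with the 0-based index of the intended start codon; choosing that start codon (or scanning all six frames) is left to the caller in this sketch.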

(3) Report the total number of negatively charged residues and positively charged residues.

Total number of negatively charged residues (Asp + Glu): 45
Total number of positively charged residues (Arg + Lys): 49
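
The totals for step (3) can be checked with a short residue count. This is a minimal sketch that follows the definitions given above (Asp + Glu as negative, Arg + Lys as positive); `charged_residue_counts` is a hypothetical helper name, and unknown residues (X) are simply ignored.

```python
from collections import Counter


def charged_residue_counts(protein: str) -> tuple[int, int]:
    """Return (negative, positive) counts: (Asp + Glu, Arg + Lys)."""
    counts = Counter(protein.replace(" ", "").upper())
    negative = counts["D"] + counts["E"]
    positive = counts["R"] + counts["K"]
    return negative, positive


# First 20 residues of the translated protein, used as a small example.
negative, positive = charged_residue_counts("MQRSPPGYGA QDDPPSRRDC")
print(negative, positive)
```

Passing the full 377-residue sequence from step (1) gives the totals for the whole protein.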

(4) Draw the hydrophobicity map for this protein using the Eisenberg hydrophobicity scale with a window size of 7. The relative weight of the window edges compared to the window center should be set to 40%.
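
The sliding-window calculation behind step (4) can be sketched as follows. Several points here are assumptions rather than a description of any particular web tool: the weights fall linearly from 1.0 at the window center to 0.4 at the edges, each window score is the weighted sum divided by the sum of the weights (a tool may normalize differently), unknown residues (X) are scored as 0.0, and the scale values are the Eisenberg consensus values as commonly tabulated. The names `window_weights` and `hydrophobicity_profile` are hypothetical.

```python
# Eisenberg consensus hydrophobicity scale (one-letter codes).
EISENBERG = {
    "A": 0.62, "R": -2.53, "N": -0.78, "D": -0.90, "C": 0.29,
    "Q": -0.85, "E": -0.74, "G": 0.48, "H": -0.40, "I": 1.38,
    "L": 1.06, "K": -1.50, "M": 0.64, "F": 1.19, "P": 0.12,
    "S": -0.18, "T": -0.05, "W": 0.81, "Y": 0.26, "V": 1.08,
}


def window_weights(window: int = 7, edge: float = 0.4) -> list[float]:
    """Linear weights: 1.0 at the center, `edge` at both window ends."""
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    half = window // 2
    return [1.0 - (1.0 - edge) * abs(i - half) / half for i in range(window)]


def hydrophobicity_profile(protein, window=7, edge=0.4, unknown=0.0):
    """Return (center_position, score) pairs; positions are 1-based."""
    weights = window_weights(window, edge)
    total = sum(weights)
    values = [EISENBERG.get(aa, unknown)
              for aa in protein.replace(" ", "").upper()]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        # Assumption: normalize by the weight sum (weighted average).
        score = sum(w * v for w, v in zip(weights, chunk)) / total
        profile.append((start + half + 1, score))
    return profile


for position, score in hydrophobicity_profile("MQRSPPGYGA QDDPPSRRDC"):
    print(position, round(score, 3))
```

The resulting (position, score) pairs can be plotted with any plotting tool to obtain the map; the first and last three residues have no score because the window does not fit around them.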

(5) Please help him use the PROSITE scanning tool to find possible functions or patterns in this protein.

[1] PDOC00001 PS00001 ASN_GLYCOSYLATION
    N-glycosylation site
        325-328  NCSR

[2] PDOC00005 PS00005 PKC_PHOSPHO_SITE
    Protein kinase C phosphorylation site
    Number of matches: 4
      1   16-18   SRR
      2   61-63   SGR
      3  187-189  SAR
      4  190-192  SFR

[3] PDOC00006 PS00006 CK2_PHOSPHO_SITE
    Casein kinase II phosphorylation site
    Number of matches: 7
      1   16-19   SRRD
      2   73-76   TADE
      3  190-193  SFRE
      4  212-215  SPLD
      5  318-321  TEFD
      6  329-332  TRPD
      7  351-354  SCPE

[4] PDOC00007 PS00007 TYR_PHOSPHO_SITE
    Tyrosine kinase phosphorylation site
         57-64   RSPESGRY

[5] PDOC00008 PS00008 MYRISTYL
    N-myristoylation site
    Number of matches: 6
      1   25-30   GIGAAA
      2   34-39   GLPVTN
      3  186-191  GSARSF
      4  216-221  GLEPGE
      5  256-261  GAPLGE
      6  274-279  GLYYGT

(6) Color the protein by the hydrophobicity of the amino acids.
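
The kind of pattern matching reported in step (5) can be illustrated with regular expressions. This is a minimal sketch and not the PROSITE scanner itself: the three PROSITE patterns N-{P}-[ST]-{P} (PS00001), [ST]-x-[RK] (PS00005) and [ST]-x(2)-[DE] (PS00006) are translated to regex by hand, the `PATTERNS` table and the `scan` function are hypothetical names, and a lookahead is used so that overlapping hits are reported.

```python
import re

# Hand-translated regex equivalents of three PROSITE patterns.
PATTERNS = {
    "ASN_GLYCOSYLATION": r"N[^P][ST][^P]",  # N-{P}-[ST]-{P}
    "PKC_PHOSPHO_SITE": r"[ST].[RK]",       # [ST]-x-[RK]
    "CK2_PHOSPHO_SITE": r"[ST]..[DE]",      # [ST]-x(2)-[DE]
}


def scan(protein: str, regex: str, offset: int = 0):
    """Return (start, end, motif) hits with 1-based inclusive positions.

    `offset` is added to every position, which is useful when only a
    fragment of the full protein is scanned.
    """
    seq = protein.replace(" ", "").upper()
    hits = []
    # The lookahead group lets overlapping matches be found.
    for match in re.finditer(f"(?=({regex}))", seq):
        motif = match.group(1)
        start = match.start() + 1 + offset
        hits.append((start, start + len(motif) - 1, motif))
    return hits


# Residues 1-20 and 321-330 of the translated protein as small examples.
print(scan("MQRSPPGYGA QDDPPSRRDC", PATTERNS["PKC_PHOSPHO_SITE"]))
print(scan("DQYLNCSRTR", PATTERNS["ASN_GLYCOSYLATION"], offset=320))
```

Short patterns such as these match frequently by chance, so hits from a scan of this kind indicate possible sites only.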
