1. The following sequence is a segment from human genome. (20%)

(a) Which chromosome is this segment in?

Chromosome 18.
Perform BLAST search of this sequence in organism “Homo sapiens”, and after the result comes out, you will see one of the entries with E value 0 is chromosome 18. So we know that this segment is in chromosome 18.

(b) What is the starting number of this segment in that chromosome? (ex. 5896432)

8935065
The first entry that BLAST returns has a LocusLink hyperlink, follow that link and click on the LocusID entry, then find its genome annotation in the following page, click the link to GenBank and in the accession line you can see the start and stop position of this segment in chromosome 18.

(c) Which gene does contain this segment?

SERPINB5:
serine (or cysteine) proteinase inhibitor, clade B (ovalbumin), member 5
In the LocusID entry mentioned above, you could see its name.

(d) What is the cytogenetic location of this gene? (ex. 9q34.1)

18q21.3
In the page where the LocusID entry resides, you could see its cytogenetic location in your right hand side.

2. Use ORF finder to find the coding region of the DNA segment shown in question 1. (15%)

(a) Which frame is the correct one? What is the length of DNA and protein of this frame?

Frame “+3” is the correct one.
The length of DNA of this frame is 1128(78..1205)
The length of protein is 375aa.

(b) Give its protein sequence in fasta format.

>lcl|Sequence 1 ORF:78..1205 Frame +3
MDALQLANSAFAVDLFKQLCEKEPLGNVLFSPICLSTSLSLAQVGAKGDTANEIGQVLHFENVKDIPFGF
QTVTSDVNKLSSFYSLKLIKRLYVDKSLNLSTEFISSTKRPYAKELETVDFKDKLEETKGQINNSIKDLT
DGHFENILADNSVNDQTKILVVNAAYFVGKWMKKFPESETKECPFRLNKTDTKPVQMMNMEATFCMGNID
SINCKIIELPFQNKHLSMFILLPKDVEDESTGLEKIEKQLNSESLSQWTNPSTMANAKVKLSIPKFKVEK
MIDPKACLENLGLKHIFSEDTSDFSGMSETKGVALSNVIHKVCLEITEDGGDSIEVPGARILQHKDELNA
DHPFIYIIRHNKTRNIIFFGKFCSP*

(c) Identify the protein. Give its name and accession number.

Protein name:
“serine (or cysteine) proteinase inhibitor, clade B (ovalbumin),member 5; protease inhibitor 5 (maspin) [Homo sapiens]”
Accession number: NP_002630
3. Find its counterpart in mouse (M.musculus). (20%)

(a) Give its name and accession number (for protein).

Name:
serine (or cysteine) proteinase inhibitor, clade B, member 5; serine protease inhibitor 7; serine (or cysteine) proteinase inhibitor, clade B (ovalbumin), member 5 [Mus musculus]
Accession number: NP_033283(NCBI)

or

Name: Maspin precursor (Protease inhibitor 5).
Accession number: P70124(Swissprot)

(b) Give the MGI number and chromosome number of its gene.

MGI: 109579
Chromosome 1.

(c) Use blast2seq to compare these two protein sequences. Give % of identity and positive (Do not use filter for LCR). Show Blast2 results.

Identity: 89%(334/375)
Positives: 97%(364/375)

4. Search the conserved domains. (15%)

(a) Identify this protein in Swissprot database. Give its name and accession number.

Name: Adenovirus 5 E1A-binding protein or BS69 protein
Accession number: Q15326

(b) Search the conserved domains that this protein may have. Give the description and pfam or smart number of the conserved domains you found.

smart number: SM00297
description: (no description, here is the interpro abstract)
Bromodomains are found in a variety of mammalian, invertebrate and yeast DNA-binding proteins. Bromodomains can interact with acetylated lysine. In some proteins, the classical bromodomain has diverged to such an extent that parts of the region are either missing or contain an insertion (e.g., mammalian protein HRX, C.elegans hypothetical protein ZK783.4, yeast protein YTA7). The bromodomain may occur as a single copy, or in duplicate.
The precise function of the domain is unclear, but it may be involved in protein-protein interactions and may play a role in assembly or activity of multi-component complexes involved in transcriptional activation.

smart number: SM00293
description: conservation of Pro-Trp-Trp-Pro residues

5. Analysis the above protein in ExPASy. (25%)

(a) Calculate its molecular weight and pI value.

pI=8.55
Mw=
66203.27

(b) Predict the hydrophobic profile of one of the domain (from 412 to 532). Use Eisenberg et al. scale, window size 9, Relative weight 90% and normalize the scale from 0 to 1. Paste the predict result on the answer sheet.



(c) Perform the trypsin cleavage for this protein. How many peptide fragments you got? Display peptides with a mass bigger than 1500 Dalton.



(d) Perform the Topology prediction. Predict this protein in SOSUI system. Give the results.

Average of hydrophobicity: -0.958541
This amino acid sequence is of a SOLUBLE PROTEIN.
Hydropathy profile:


(e) Using SOPM or SOPMA to predict its secondary structure.




SOPM results:
MSRVHGMHPKETTRQLSLAVKDGLIVETLTVGCKGSKAGIEQEGYWLPGDEIDWETENHDWYCFECHLPG
heeettccccccchhhhhhhctteeeeeeeeccccccccccttteecccccceecccccceeeeetcctt
EVLICDLCFRVYHSKCLSDEFRLRDSSSPWQCPVCRSIKKKNTNKQEMGTYLRFIVSRMKERAIDLNKKG
ceeeehhheeeecttccchheeecccccccccceeeeecttcccchhhhhhhhhhhhhhhhhhhhhcttc
KDNKHPMYRRLVHSAVDVPTIQEKVNEGKYRSYEEFKADAQLLLHNTVIFYGADSEQADIARMLYKDTCH
cccccchhhhhhhttccccchhhhhcttccchhhhhhhhhhheetteeeeeccccchhhhhhhhhhtchh
ELDELQLCKNCFYLSNARPDNWFCYPCIPNHELVWAKMKGFGFWPAKVMQKEDNQVDVRFFGHHHQRAWI
hhhhhhhctteeeeecccttceeecccccthhhhhhhhttttccchheehcccccceeeeeeccccceec
PSENIQDITVNIHRLHVKRSMGWKKACDELELHQRFLREGRFWKSKNEDRGEEEAESSISSTSNEQLKVT
cccccceeeeeeeeeeehhhhtcchhhhhhhhhhhhhhttceecccccccchhhhhhhhhhcchhheeee
QEPRAKKGRRNQSVEPKKEEPEPETEAVSSSQEIPTMPQPIEKVSVSTQTKKLSASSPRMLHRSTQTTND
ccccccttcccccccccccccccchhhhcccccccccccccceeeechhhhhhccccchhhhccccccct
GVCQSMCHDKYTKIFNDFKDRMKSDHKRETERVVREALEKLRSEMEEEKRQAVNKAVANMQGEMDRKCKQ
tchhhhhhhhhhhhhhhhhhhhhhccchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
VKEKCKEEFVEEIKKLATQHKQLISQTKKKQWCYNCEEEAMYHCCWNTSYCSIKCQQEHWHAEHKRTCRR
hhhhhhhhhhhhhhhhhhhhhhhhhhhccthceechhhhhheeeeccccceeeecchhhhhhhhhhhhhh
KR
hc
SOPM :
   Alpha helix     (Hh) :   246 is  43.77%
   310  helix       (Gg) :     0 is   0.00%
   Pi helix        (Ii) :     0 is   0.00%
   Beta bridge     (Bb) :     0 is   0.00%
   Extended strand (Ee) :    92 is  16.37%
   Beta turn       (Tt) :    40 is   7.12%
   Bend region     (Ss) :     0 is   0.00%
   Random coil     (Cc) :   184 is  32.74%
   Ambigous states (?)  :     0 is   0.00%
   Other states         :     0 is   0.00%


SOPMA results:
MSRVHGMHPKETTRQLSLAVKDGLIVETLTVGCKGSKAGIEQEGYWLPGDEIDWETENHDWYCFECHLPG
heeettccccccchhhhhhhctteeeeeeeeccccccccccttteecccccceecccccceeeeetcctt
EVLICDLCFRVYHSKCLSDEFRLRDSSSPWQCPVCRSIKKKNTNKQEMGTYLRFIVSRMKERAIDLNKKG
ceeeehhheeeecttccchheeecccccccccceeeeecttcccchhhhhhhhhhhhhhhhhhhhhcttc
KDNKHPMYRRLVHSAVDVPTIQEKVNEGKYRSYEEFKADAQLLLHNTVIFYGADSEQADIARMLYKDTCH
cccccchhhhhhhttccccchhhhhcttccchhhhhhhhhhheetteeeeeccccchhhhhhhhhhtchh
ELDELQLCKNCFYLSNARPDNWFCYPCIPNHELVWAKMKGFGFWPAKVMQKEDNQVDVRFFGHHHQRAWI
hhhhhhhctteeeeecccttceeecccccthhhhhhhhttttccchheehcccccceeeeeeccccceec
PSENIQDITVNIHRLHVKRSMGWKKACDELELHQRFLREGRFWKSKNEDRGEEEAESSISSTSNEQLKVT
cccccceeeeeeeeeeehhhhtcchhhhhhhhhhhhhhttceecccccccchhhhhhhhhhcchhheeee
QEPRAKKGRRNQSVEPKKEEPEPETEAVSSSQEIPTMPQPIEKVSVSTQTKKLSASSPRMLHRSTQTTND
ccccccttcccccccccccccccchhhhcccccccccccccceeeechhhhhhccccchhhhccccccct
GVCQSMCHDKYTKIFNDFKDRMKSDHKRETERVVREALEKLRSEMEEEKRQAVNKAVANMQGEMDRKCKQ
tchhhhhhhhhhhhhhhhhhhhhhccchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
VKEKCKEEFVEEIKKLATQHKQLISQTKKKQWCYNCEEEAMYHCCWNTSYCSIKCQQEHWHAEHKRTCRR
hhhhhhhhhhhhhhhhhhhhhhhhhhhccthceechhhhhheeeeccccceeeecchhhhhhhhhhhhhh
KR
hc
SOPMA :
   Alpha helix     (Hh) :   246 is  43.77%
   310  helix       (Gg) :     0 is   0.00%
   Pi helix        (Ii) :     0 is   0.00%
   Beta bridge     (Bb) :     0 is   0.00%
   Extended strand (Ee) :    92 is  16.37%
   Beta turn       (Tt) :    40 is   7.12%
   Bend region     (Ss) :     0 is   0.00%
   Random coil     (Cc) :   184 is  32.74%
   Ambigous states (?)  :     0 is   0.00%
   Other states         :     0 is   0.00%

※ Bonus
Please explain these following terms.
(1) Filter (Low-complexity) (hint: when using blast)

進行BLAST序列比對時將序列中組成成分是low complexity的區域mask掉的遮罩

(2) Matrix

進行protein-protein BLAST序列比對時,用來計算alignment分數的給分矩陣

(3) Homologous gene

不同物種間序列相似的同源基因

(4) Contig

shotgun method中,數組由gel electrophoresis得到並且互相之間有overlapping的序列為一個contig,每一組由gel electrophoresis得到的序列至少為一contig