
Protein Structure And Function Essay About Myself

Proteins are the most versatile macromolecules in living systems and serve crucial functions in essentially all biological processes. They function as catalysts, they transport and store other molecules such as oxygen, they provide mechanical support and immune protection, they generate movement, they transmit nerve impulses, and they control growth and differentiation. Indeed, much of this text will focus on understanding what proteins do and how they perform these functions.

Several key properties enable proteins to participate in such a wide range of functions.


Proteins are linear polymers built of monomer units called amino acids. The construction of a vast array of macromolecules from a limited number of monomer building blocks is a recurring theme in biochemistry. Does protein function depend on the linear sequence of amino acids? The function of a protein is directly dependent on its three-dimensional structure (Figure 3.1). Remarkably, proteins spontaneously fold up into three-dimensional structures that are determined by the sequence of amino acids in the protein polymer. Thus, proteins are the embodiment of the transition from the one-dimensional world of sequences to the three-dimensional world of molecules capable of diverse activities.


Proteins contain a wide range of functional groups. These functional groups include alcohols, thiols, thioethers, carboxylic acids, carboxamides, and a variety of basic groups. When combined in various sequences, this array of functional groups accounts for the broad spectrum of protein function. For instance, the chemical reactivity associated with these groups is essential to the function of enzymes, the proteins that catalyze specific chemical reactions in biological systems (see Chapters 8–10).


Proteins can interact with one another and with other biological macromolecules to form complex assemblies. The proteins within these assemblies can act synergistically to generate capabilities not afforded by the individual component proteins (Figure 3.2). These assemblies include macromolecular machines that carry out the accurate replication of DNA, the transmission of signals within cells, and many other essential processes.


Some proteins are quite rigid, whereas others display limited flexibility. Rigid units can function as structural elements in the cytoskeleton (the internal scaffolding within cells) or in connective tissue. Parts of proteins with limited flexibility may act as hinges, springs, and levers that are crucial to protein function, to the assembly of proteins with one another and with other molecules into complex units, and to the transmission of information within and between cells (Figure 3.3).


Crystals of human insulin. Insulin is a protein hormone, crucial for maintaining blood sugar at appropriate levels. (Below) Chains of amino acids in a specific sequence (the primary structure) define a protein like insulin. These chains fold into well-defined ...

Figure 3.1. Structure Dictates Function. A protein component of the DNA replication machinery surrounds a section of DNA double helix. The structure of the protein allows large segments of DNA to be copied without the replication machinery dissociating from the ...

Figure 3.2. A Complex Protein Assembly. An electron micrograph of insect flight tissue in cross section shows a hexagonal array of two kinds of protein filaments. [Courtesy of Dr. Michael Reedy.]

Figure 3.3. Flexibility and Function. Upon binding iron, the protein lactoferrin undergoes conformational changes that allow other molecules to distinguish between the iron-free and the iron-bound forms.


3.1 Proteins Are Built from a Repertoire of 20 Amino Acids

3.2 Primary Structure: Amino Acids Are Linked by Peptide Bonds to Form Polypeptide Chains

3.3 Secondary Structure: Polypeptide Chains Can Fold Into Regular Structures Such as the Alpha Helix, the Beta Sheet, and Turns and Loops

3.4 Tertiary Structure: Water-Soluble Proteins Fold Into Compact Structures with Nonpolar Cores

3.5 Quaternary Structure: Polypeptide Chains Can Assemble Into Multisubunit Structures

3.6 The Amino Acid Sequence of a Protein Determines Its Three-Dimensional Structure


Appendix: Acid-Base Concepts


Selected Readings

In a molecular biology or genomics laboratory a DNA fragment, supposedly corresponding to a gene, was identified. Its first 30 base pairs have been sequenced and determined to be 5'- TGACACAACACAAGGACGCACATGACAGGA -3'. Answer the questions below through Bioinformatics experiments.

5.1.1 Gene Characterization

1 - Can the putative gene be completely and unambiguously characterized from the partial DNA sequence given above? How?

Yes. The gene can be completely and unambiguously characterized, but the solution requires a more careful analysis than it did a couple of years ago. The reason is simple: the databases then held far fewer sequences than they do now, so a short query matches many more entries today.


Since we have a nucleotide sequence as our initial data set, we must go to the NCBI BLAST page, select the nucleotide blast program and search the nucleotide database using a nucleotide query. The page will look like this:

This BLASTN interface is easy to understand. It is divided into four major sections.

  1. The first section allows the user to paste the query sequence for analysis. Our query sequence containing 30 nucleotides, highlighted in the red rectangle, was pasted into the search box.

  2. The second section permits the choice of a database to be searched and of optional sequence range coordinates for the search. We will search the non-redundant (nr) nucleotide (nt) database in this exercise. Note that this is no longer the default option. That is why it is highlighted in yellow.

  3. The third section offers optimization alternatives for the search. We will use standard parameters and settings for our search. We will search for highly similar sequences using megablast.

  4. The fourth section shows a summary of the search features that were defined in the previous sections. We will now mark the option to show the results in a new window. This option is very useful, since it allows us to re-run BLAST searches with different sequences and/or parameters while keeping the former results.

Now we are ready to perform our search: just click the BLAST button. After a few seconds the results page will appear in a new window. Assuming that you are familiar with BLAST output, only the list of hits and the alignments will be illustrated here.

Output list of hits:

You may notice that only the first 11 hits present very low E-values (below 10^-5), which are most likely significant. If we wish to be more stringent, demanding 100% coverage in the sequence alignment, only the first eight hits are relevant for our search and the characterization of our sequence.
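The two filters described above (an E-value threshold and a full-coverage requirement) can be sketched in a few lines of Python. This is only an illustration: the `Hit` record, the function names, and the three example hits are hypothetical placeholders, not values taken from the actual BLAST output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Minimal stand-in for one BLAST hit (hypothetical structure)."""
    accession: str
    evalue: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def query_coverage(hit, query_length):
    """Fraction of the query covered by the aligned region."""
    return (hit.query_end - hit.query_start + 1) / query_length


def select_hits(hits, query_length, max_evalue=1e-5, min_coverage=1.0):
    """Keep hits below the E-value threshold that cover enough of the query."""
    return [
        h for h in hits
        if h.evalue < max_evalue and query_coverage(h, query_length) >= min_coverage
    ]


# Made-up example values, for illustration only.
example_hits = [
    Hit("HIT_A", 2e-6, 1, 30),   # significant, full coverage
    Hit("HIT_B", 3e-6, 3, 30),   # significant, partial coverage
    Hit("HIT_C", 0.5, 1, 30),    # full coverage, not significant
]
selected = select_hits(example_hits, query_length=30)
print([h.accession for h in selected])
```

Relaxing `min_coverage` corresponds to the less stringent reading in the text, where every hit with a low E-value is kept.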

The first six hits are related to complete genome sequences which makes it more difficult for us to identify our target sequence. We hope to find the putative gene directly or DNA sequences containing the putative gene. Hence, the only direct and easy options left are hits seven and eight.

This straightforward analysis has shown that the partial DNA sequence we are seeking to identify is most likely part of a gene belonging to Mycobacterium tuberculosis, to Mycobacterium bovis, or to both.

How to solve this possible ambiguity? We must now look carefully at the sequence alignments that follow the list of hits for these two hits we have selected. The results are:

The alignment statistics are identical for both hits.

For the hit whose GenBank accession number is U02492.1, our query sequence begins exactly at the first position of the hit sequence.

For the hit whose GenBank accession number is U41388.1 the 5' end of our query sequence matches residue 904 on the hit sequence.

If you click on the accession number links, you will be redirected to the corresponding annotation of these sequences. As suggested by the annotation, the second one is a much longer nucleotide sequence (1789 bp) containing not only one, but two putative genes, one of them starting at base 904 which is similar to that found in the first hit. Also, the second hit corresponds to genes that are putative, that is, with no experimental confirmation.

From these analyses we can now proceed to the next question.

2 – Does the partial DNA sequence above correspond to a gene? If so, what is its GenBank accession number?

Yes. According to the analyses in 1, the 30-nucleotide partial DNA sequence most likely corresponds to a confirmed gene. Its accession number is best represented by U02492.1.

3 – What is the gene name?

The gene name can be obtained from the annotation in the hit list or by searching the NCBI nucleotide database with the GenBank accession number U02492.1. Another way is to right-click the link associated with the entry U02492.1 and open it in a new window. The top of the resulting page will be as follows:

Looking carefully at this GenBank flat file, we can obtain all the information about the nucleotide entry U02492.1. The gene name appears in several places in the record, most explicitly in the FEATURES section: it is inhA.

4 – What is the gene size in base pairs?

In the FEATURES section, in the ninth line, the gene link shows that the inhA gene starts at base 22 and ends at base 831, thus totaling 810 base pairs (both end positions are counted).
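The length calculation is easy to get wrong by one, because GenBank coordinates are 1-based and include both ends. A minimal sketch, using a hypothetical helper name:

```python
def feature_length(start, end):
    """Length in bases of a feature given 1-based, inclusive coordinates."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# inhA in entry U02492.1 spans bases 22..831 according to the annotation.
inha_length = feature_length(22, 831)
print(inha_length)  # 810
```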

5 – Which organism(s) does the gene belong to?

The analyses showed that the gene can be found in both bacteria M. tuberculosis and M. bovis.

5.1.2 Gene Product or Protein Characterization

6 – How many amino acids does the gene product contain?

Again, we can find the answer either by looking carefully at the U02492.1 entry or by a simple calculation. If the gene contains 810 bp and the last three bases compose the stop codon, the protein coding region is 807 bp long. Dividing this number by three (the number of bases in a codon) gives 269 amino acids. However, very often the annotation is not available, and the only information we have is the nucleotide sequence. How, then, can we find the protein sequence and, consequently, its number of amino acids?
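The same arithmetic can be written as a small function. The name `protein_length` is hypothetical, and the sketch assumes that the gene length includes exactly one terminal stop codon and no introns, as is the case for this bacterial gene.

```python
def protein_length(gene_length_bp, stop_codon_bp=3):
    """Number of amino acids encoded by a gene whose length includes the stop codon."""
    coding_bp = gene_length_bp - stop_codon_bp
    if coding_bp <= 0 or coding_bp % 3 != 0:
        raise ValueError("coding region must be a positive multiple of 3")
    return coding_bp // 3


print(protein_length(810))  # (810 - 3) / 3 = 269
```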

We need to translate our inhA gene sequence into its corresponding protein. There are several different ways to perform this task. We will use one of them, a tool called Translate (http://us.expasy.org/tools/dna.html), which can be accessed through the ExPASy web site. This tool translates a DNA (or RNA) sequence in all six possible reading frames (three in each direction), from which the open reading frames (ORFs) can be identified. The Translate interface looks like this:
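To make the idea of six reading frames concrete, here is a minimal pure-Python sketch of a six-frame translation. It is not the code behind the ExPASy tool; the function names are hypothetical, and it assumes the standard genetic code and a sequence containing only A, C, G and T. It is applied to the 30-nucleotide query from the beginning of this exercise.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate both strands in the three possible frames each."""
    frames = {}
    for label, strand in (("5'-3'", seq), ("3'-5'", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{label} Frame {offset + 1}"] = translate(strand[offset:])
    return frames


QUERY = "TGACACAACACAAGGACGCACATGACAGGA"
for name, protein in six_frames(QUERY).items():
    print(name, protein)
```

In the first forward frame the translation contains an M (methionine) at the eighth codon, which corresponds to the ATG at bases 22 to 24 of the query.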

Our nucleotide sequence, corresponding to the GenBank entry U02492.1, was copied and pasted for analysis (red rectangle). After clicking on TRANSLATE SEQUENCE we obtain:

Our task now is to find out which of the six frames corresponds to the inhA gene product. In other words, we are asking which is the correct ORF. Often the correct ORF is the longest among the six possible ones. If we follow this assumption (which is not always correct!), the longest ORF found by the Translate tool, ORF 1 (5'-3' Frame 1), is the right answer. Clicking the link 5'-3' Frame 1 will show us the conceptual translation in one-letter amino acid code, as follows:

Once identified, the ORF can be translated into its corresponding protein or amino acid sequence. In most prokaryote species, an ORF starts with an ATG coding for methionine (Met or M) and ends with a stop codon (TAA, TAG or TGA). Now, clicking on the first Met residue will give us our protein sequence in a format similar to the GenBank flatfile, including the number of amino acids. In this case, 269 (see figure below).
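The start and stop rule just described can be sketched as a simple scan: locate the first ATG, then read codon by codon until a stop codon appears in the same frame. The function name is hypothetical, and taking the first ATG is a simplification that real gene finders do not rely on alone.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(seq):
    """Return (start, end) of the ORF beginning at the first ATG.

    Coordinates are 1-based and inclusive; end includes the stop codon.
    end is None when no in-frame stop codon is found before the sequence ends.
    Returns None when the sequence contains no ATG at all.
    """
    seq = seq.upper()
    start = seq.find("ATG")
    if start == -1:
        return None
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return (start + 1, pos + 3)
    return (start + 1, None)


QUERY = "TGACACAACACAAGGACGCACATGACAGGA"
print(first_orf(QUERY))            # ATG at base 22, no stop within the 30 bases
print(first_orf("CCATGAAATAGGG"))  # toy sequence: ATG AAA TAG
```

For the 30-nucleotide query the scan finds the ATG at base 22, which agrees with the annotated start of inhA in entry U02492.1; the stop codon lies beyond the partial sequence.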

If we continue and click on the FASTA format link at the bottom of the page, we obtain our protein sequence in the FASTA format, ready to be used by other programs.
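For readers who want to produce the same format themselves, the FASTA layout is just a header line starting with ">" followed by the sequence wrapped over several lines. The sketch below uses a hypothetical function name, a placeholder header, a placeholder sequence, and an assumed line width of 60 characters.

```python
def to_fasta(header, sequence, width=60):
    """Format a sequence as a FASTA record with lines of at most `width` characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Placeholder sequence for illustration, not the real InhA sequence.
record = to_fasta("example_protein", "M" + "A" * 129)
print(record, end="")
```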

7 – What is the gene product name and possible function?

So far we have worked with a DNA sequence. Now we will query the NCBI protein databases with the protein sequence obtained above and discover its possible function. For this purpose, we will go back to the basic BLAST page at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/) but, this time, we will select the link "protein blast", which will lead us to the BLASTp page. The BLASTp page at NCBI will look like this:

Our query sequence is already inserted in the search box. The standard BLASTp parameters for protein-protein comparison are being used, such as the substitution matrix BLOSUM62 and the nr database. For further details, click the Algorithm parameters link at the bottom of the page.

The output list of results is shown for the top-21 hits:

The bit score and the E values indicate that we found our protein. Moreover, the red squares on the right with the letter "S" inside indicate that our proteins most likely have a 3D structure (S) which will help to answer the next questions.

To increase our confidence that the hits correspond to homologs of our query protein, we must check if the alignment covers the whole query protein sequence. The figure below shows the first alignment in the list of results.

Inspecting the annotation, we can see that our query protein is called Enoyl (Acyl Carrier Protein or ACP) Reductase of Mycobacterium tuberculosis strain H37Rv, or InhA. To obtain a more complete answer we can access the protein annotation through its GenBank accession number NP_21600, just as we did with the gene accession number. The FEATURES section of the resulting record is:

Since M. tuberculosis is a bacterium, this enzyme most likely functions in the cytoplasm and is involved in the biosynthesis of mycolic acid, an essential component of its cell wall. The blue links of the qualifiers can provide the curious reader with even more information about this enzyme.

8 – Is this gene present in humans? What consequences could this fact have for a possible application of gene product for a structure-based drug discovery?

To answer this question, you can run a BLASTp similarity search of this protein sequence against a database of human RefSeq proteins. It is also necessary to read about fatty acid biosynthesis in humans and bacteria. A quick answer is: no, it is not present in humans. Hence, it is theoretically an ideal drug target against tuberculosis. In fact, the reader will find out that this enzyme has been proven to be a bona fide target of the drug isoniazid (INH).

9 – Does the gene product have a 3D structure? If so, what is its RCSB/PDB identification number? Describe the gene product architecture.

As we can see from the BLASTp output above, there are now several 3D structures for the InhA enzyme itself, the enzyme-NADH complex, ternary complexes involving drug candidates, and the complex with the NADH-INH adduct that inhibits the enzyme. Historically, the first InhA structure to be determined was that with PDB ID 1ENY (the 8th entry in BLASTp's hit list), so we will investigate this particular one. To see the 3D structure, go to the PDB search page (http://www.rcsb.org/pdb/home/home.do), enter the PDB code of the protein (1ENY) in the search bar at the top of the page and click the button "Site Search". The initial page of the result will look as follows:

We can obtain all the available information about the 3D structure of this enzyme by browsing through the links, or by downloading the PDB file to a local directory on our computer and working with our preferred molecular modeling and visualization package. For instance, we can see in the classification section that this enzyme has a 3-layer (αβα) sandwich architecture according to the CATH classification. We will visualize this next.

Some of the visualization software can be accessed directly from the page illustrated above, at the bottom of the section "Images and Visualization". An alternative is the site First Glance in Jmol (http://molvis.sdsc.edu/fgij/index.htm). Go to the site, enter 1ENY in the blank box and click the Submit button:

As the name suggests, we can do many simple visualization manipulations that will help us to understand the molecule structure, interactions and function. Note: the use of molecular visualization software, no matter how simple it is, demands at least a general knowledge of the object to be visualized. Hence, to be able to run this software, the user must understand the basic principles of protein structure and function.

After submitting the 1ENY PDB code and waiting a little for the Java virtual machine to load, we obtain:

Here the InhA structure appears static, but the site will initially display the image rolling in order to present an even better idea of its three-dimensionality. Pressing the left mouse button and moving it left and right and up and down allows the user to keep full control of the rotation of the molecule. Also, pressing the left mouse button at the same time as the control or alt keys, and moving the mouse, will change the zoom.
