Introduction to Genes and DNA
DNA code is a sequence of chemicals that forms the information
controlling how humans are made and how they work. It is a digital code, but it
is quaternary rather than binary, with 4 distinct items. The information is
encoded in an ordered sequence of 4 different symbols called "bases",
typically denoted A, C, G, and T.
·
A: adenine
·
C: cytosine
·
G: guanine
·
T: thymine
These 4 substances are the fundamental "bits" of
information in the genetic code. They are usually counted in "base pairs"
because there are actually 2 substances per "bit", as discussed later.
Everything else is built on top of this basis of 4 DNA digits.
The entirety of human DNA code, called the "human
genome", is about 3 billion bases in total. Every human being has 2 copies
of this code, one copy from each parent, so a human's cell DNA contains a total
of around 6 billion bases. In computer terms, this is around 6 Gigabytes if each
symbol is stored as a one-byte character, or more like 1.5 Gigabytes if
compacted, since there are only 2 binary bits of information per A/C/G/T base
pair. DNA molecules are linear in a twisted double-helix, with a start and an
end, and do not contain any cycles.
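The storage arithmetic above can be sketched in a few lines of Python. This is only an illustration: the assignment of bit patterns to bases and the function names are assumptions made for the example, not a standard file format. At 2 bits per base, 6 billion bases come to 12 billion bits, or about 1.5 billion bytes.

```python
# Hypothetical 2-bit encoding; which bit pattern goes with which base is arbitrary.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack(sequence: str) -> bytes:
    """Pack a DNA string into bytes, 4 bases per byte (last byte zero-padded)."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def packed_size_bytes(num_bases: int) -> int:
    """Bytes needed at 2 bits per base, rounded up."""
    return (num_bases * 2 + 7) // 8


if __name__ == "__main__":
    print(pack("ACGT").hex())                # one byte holds four bases
    print(unpack(pack("ACGTAC"), 6))         # round trip
    print(packed_size_bytes(6_000_000_000))  # both copies of the genome
```

The original length has to be stored alongside the packed bytes, because the padding in the last byte is indistinguishable from a run of A bases.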
Chromosomes: These
6 billion odd base pairs are split amongst 46 chromosomes. Each person gets 2
sets of chromosomes, 23 from each parent, to total 46 chromosomes per human
cell. A chromosome is the largest form of a DNA molecule, holding a long
sequence of DNA codes. Chromosomes differ in length, from tens of millions to
hundreds of millions of base pairs each. Chromosomes are independent molecules
of DNA, with the typical double-helix, a start and end, but no cycles.
Chromosomes are physically large enough to be seen on high power microscopes.
Genes: Each
chromosome has subsequences of DNA bases that encode particular features, and
these are called "genes". Thus genes are not independent molecules,
but are abstract sequences within chromosomes. Genes vary widely in
length. Genes are too small to be physically seen on a microscope, but are
analyzed using indirect chemical, molecular, and computational methods. The
total number of distinct genes in the human genome was estimated at around
30,000 in the early results of the Human Genome Project, and later estimates
are closer to 20,000 protein-coding genes.
So the hierarchy of terminology for genetic components is
something like:
·
Base pair: the smallest element, a
single base-4 DNA digit: A, C, G, or T.
·
Gene: a medium-size sequence, typically
thousands to hundreds of thousands of DNA base pairs, like a sub-module
·
Chromosome: a large sequence of
tens to hundreds of millions of DNA base pairs, like a computer program file
·
Human genome: the entirety of human DNA
program code: 2 sets of 23 distinct chromosomes, adding to around 6 billion
DNA base pairs
Every individual has a unique genetic program, though all
human DNA shares much common code too. A lot of genes and other DNA
subsequences are modified or move around within the DNA of a species, such as
when they are inherited from parents at conception. DNA does not usually change
within a particular individual's body, though this can occur rarely from cell
mutations (e.g. some cancer cells) and also genetic damage such as from
radiation or toxic chemical exposure.
Chromosomes
Each person has 46 chromosomes, arranged in 23 pairs, with one chromosome of
each pair from each parent. So there are really 23 distinct chromosomes, and
each body cell effectively has 2 different copies of the DNA code, one from
each parent.
Each chromosome is distinct and whole. They are ordered, and
have clear start and end sequences. In a sense, they are like a file of
computer code.
The 23 distinct chromosomes are known and named and have a
common structure for each human. The first 22 chromosomes are simply
numbered, chromosome 1 through chromosome 22. Each of these
22 chromosomes is called an "autosome".
The 23rd chromosome is the sex chromosome, which is called
either "X" or "Y". Every person has a pair of sex
chromosomes, one from each parent. However, unlike the other 22 pairs of
chromosomes, a human does not necessarily have 2 similar chromosomes. A male
has a pair of different chromosomes, an X and a Y, which is
usually written as XY. A female has two X chromosomes, written XX.
The key issue about chromosomes is to understand their role
in reproduction. Firstly, let's make some observations about reproduction:
·
Children are similar to both
parents, with similar traits, but are not identical to either parent.
·
Siblings look different, despite
sharing the same parents.
·
Male and female children occur in
about a 50-50 split.
To understand these features, we have to understand how
chromosomes are distributed during reproduction. Every person has 46
chromosomes, 23 from the father, 23 from the mother. But the father has 46
chromosomes and so does the mother. Each sperm cell in the father gets 23
chromosomes, and similarly an egg cell gets 23 chromosomes from the mother's
set of 46. For each autosome 1..22, the gamete (sperm or egg) gets one of the
chromosomes, randomly, without regard to which grandparent the chromosome
originally came from. For the sex chromosomes, the egg cell gets one of the
mother's X chromosomes, and the sperm gets either the X or Y from the father's
chromosomes. Hence, the number of possible chromosome combinations in a father's
sperm cell is 2^23, and similarly the number of egg chromosome combinations is
also 2^23. So even with the same parents, and even with only entire chromosomes
inherited, the number of genetically different children that can be created is
about 2^46.
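The counting argument above can be checked with a short Python sketch. It models only the selection of whole chromosomes described so far. The "P"/"M" labels for which copy was chosen, and the function names, are assumptions made for illustration.

```python
import random

NUM_PAIRS = 23  # 22 autosome pairs plus the sex chromosome pair


def gamete_combinations() -> int:
    """Whole-chromosome combinations possible in one sperm or egg: 2^23."""
    return 2 ** NUM_PAIRS


def child_combinations() -> int:
    """Combinations for one child of the same two parents: 2^23 * 2^23 = 2^46."""
    return gamete_combinations() ** 2


def make_gamete(sex_pair: str, rng: random.Random) -> list:
    """Pick one chromosome from each of the parent's 23 pairs at random.

    For each autosome, "P" or "M" records whether the parent's paternal or
    maternal copy was chosen. sex_pair is "XX" for a mother, "XY" for a father.
    """
    picks = [rng.choice("PM") for _ in range(NUM_PAIRS - 1)]
    picks.append(rng.choice(sex_pair))
    return picks


if __name__ == "__main__":
    rng = random.Random(42)   # fixed seed so a run is repeatable
    print(gamete_combinations())   # 8388608
    print(child_combinations())    # 70368744177664
    print("".join(make_gamete("XY", rng)))
```

Because a sperm cell receives the father's X or Y with equal chance in this model, roughly half of the simulated children are XX and half XY, which matches the 50-50 observation earlier.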
However, chromosomes are also changed during reproduction, and these changes
are a natural part of the process. Small or large chunks of chromosome material
are swapped between the two chromosomes of a pair during reproductive cell
creation. This is called crossover. Thus, the total number of possibilities is
far larger than the number from simply selecting whole chromosomes.
Non-Gene DNA Sequences
Genes are the best understood subsequences of DNA code. Most
genes clearly encode the data sequence representing a particular protein.
However, all of the genes together are only a small part of DNA code. The
protein-coding sequences of the genes might only make up a few percent of
human DNA.
So what is the other DNA code for? These DNA sequences are
among the least understood parts of genetics. The main theory is that many of
these DNA sequences are control mechanisms that determine when particular genes
are activated. If the genes are the data sequences for proteins, the remainder
may be the real code. This code presumably controls when the genes are activated,
so that human growth follows its normal timetables. It probably also controls
how much a gene is activated, controlling how much of each protein is produced
by a gene.
DNA and RNA
There are actually 2 main types of nucleic substances within
cell nuclei that process information. DNA is the basic form within chromosomes,
that is hard-coded into every cell. RNA is a more temporary form that is used
to process subsequences of DNA messages. RNA is an intermediate form used to
execute the portions of DNA that a cell is using. For example, in the synthesis
of proteins, DNA is copied to RNA, which is then used to create proteins:
DNA->RNA->Proteins.
The structures of DNA and RNA are very similar. They are both
ordered sequences of 4 types of substances: ACGT for DNA, and ACGU for RNA.
Thus RNA uses the same three ACG substances, but uses U (uracil) instead of T
(thymine). The molecules uracil and thymine are only slightly different
chemically. In DNA, there is pairing between AT and CG, and in RNA, the
pairings are AU and CG, but since RNA is not double-stranded, this pairing is
much rarer. Hence, RNA has the 4 substances:
·
A: adenine
·
C: cytosine
·
G: guanine
·
U: uracil
Typically, RNA is created from DNA, and this is done by
faithfully copying the sequence of bases, with the only change converting
T to U. Hence, an RNA copy of a DNA sequence encodes the identical information,
though it uses a slightly different set of 4 substances.
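As a minimal sketch, the DNA-to-RNA copy described above is a one-character substitution on the sequence. The function names are illustrative. The sketch treats its input as the strand whose letters the RNA should carry. In the cell the RNA is actually assembled by pairing against the opposite strand, which yields the same letters.

```python
def transcribe(dna: str) -> str:
    """Return the RNA version of a DNA sequence: same order, T replaced by U."""
    return dna.upper().replace("T", "U")


def back_transcribe(rna: str) -> str:
    """Return the DNA version of an RNA sequence: same order, U replaced by T."""
    return rna.upper().replace("U", "T")


if __name__ == "__main__":
    dna = "ATGGCCTAA"
    rna = transcribe(dna)
    print(rna)                            # AUGGCCUAA
    print(back_transcribe(rna) == dna)    # True: no information is lost
```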
There are also several differences between DNA and RNA. The
underlying sugar molecule that holds the 4 bases is different: deoxyribose in
DNA, ribose in RNA. DNA is two strands wrapped in a double-helix, but RNA is
usually a single strand.
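Because A pairs with T and C pairs with G, the second strand of a DNA double-helix is fully determined by the first. A small Python sketch of that relationship follows. The function names are illustrative. The reversal step reflects the fact that the two strands run in opposite directions, a detail this article does not otherwise cover.

```python
# Translation table for the DNA pairing rules: A<->T and C<->G.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base that pairs with each position of the strand."""
    return strand.upper().translate(DNA_COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand as it would be read from its own start."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))           # TACG
    print(reverse_complement("ATGC"))   # GCAT
```

Applying `reverse_complement` twice returns the original strand, which is one way to see that the second strand carries no extra information.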
Genes: Protein Data Sequences in the DNA Code
Some parts of DNA sequences are known to be purely data.
These are the "genes". The best understood aspect of DNA coding is
the encoding of amino acid information in genes that is used by the body to
synthesize proteins. These are data blocks that represent protein structures.
All proteins are substances made up of only 20 basic
building blocks called amino acids. Proteins are ordered sequences of these 20
amino acids. In another terminology, a short chain of amino acids is a "peptide"
and a protein is a long chain called a "polypeptide".
So how does DNA encode the structure of a protein? It uses
triplets of bases, called codons. There are 4x4x4=64 possible combinations in a
triplet, and only 20 amino acids. Some extra codes are used as start and stop
signal markers at each end of the data sequence. Other triplets are mapped so
that more than one triplet can represent a particular amino acid. However, each
triplet maps to only one amino acid. The list below is written in RNA letters
(U in place of T) and gives the triplets for each of the 20
amino acids:
·
1. Phenylalanine (Phe): UUU, UUC
·
2. Leucine (Leu): UUA, UUG, CUU, CUC,
CUA, CUG
·
3. Isoleucine (Ile): AUU, AUC, AUA
·
4. Methionine (Met): AUG
·
5. Valine (Val): GUU, GUC, GUA, GUG
·
6. Serine (Ser): UCU, UCC, UCA, UCG,
AGU, AGC
·
7. Proline (Pro): CCU, CCC, CCA, CCG
·
8. Threonine (Thr): ACU, ACC, ACA,
ACG
·
9. Alanine (Ala): GCU, GCC, GCA, GCG
·
10. Tyrosine (Tyr): UAU, UAC
·
11. Histidine (His): CAU, CAC
·
12. Glutamine (Gln): CAA, CAG
·
13. Asparagine (Asn): AAU, AAC
·
14. Lysine (Lys): AAA, AAG
·
15. Aspartic acid (Asp): GAU, GAC
·
16. Glutamic acid (Glu): GAA, GAG
·
17. Cysteine (Cys): UGU, UGC
·
18. Tryptophan (Trp): UGG
·
19. Arginine (Arg): CGU, CGC, CGA,
CGG, AGA, AGG
·
20. Glycine (Gly): GGU, GGC, GGA,
GGG
In addition, the following triplet codes are special:
·
STOP: UAA, UAG, UGA
·
START: AUG (same code as the
Methionine amino acid)
Clearly, there are not unique 1-1 mappings of triplets to
amino acids. However, although there is redundancy, it is not ambiguous. Any
triplet can represent only 1 amino acid.
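The mapping above behaves like a lookup table, and a short Python sketch makes the "redundant but not ambiguous" property concrete. The table below is the standard genetic code in RNA letters. The dictionary layout and the `translate` function are illustrative assumptions, and the function is a simplification that reads from the first AUG it finds to the first stop triplet.

```python
# Standard genetic code, grouped by amino acid (RNA letters).
CODONS_BY_AMINO_ACID = {
    "Phe": "UUU UUC", "Leu": "UUA UUG CUU CUC CUA CUG",
    "Ile": "AUU AUC AUA", "Met": "AUG", "Val": "GUU GUC GUA GUG",
    "Ser": "UCU UCC UCA UCG AGU AGC", "Pro": "CCU CCC CCA CCG",
    "Thr": "ACU ACC ACA ACG", "Ala": "GCU GCC GCA GCG",
    "Tyr": "UAU UAC", "His": "CAU CAC", "Gln": "CAA CAG",
    "Asn": "AAU AAC", "Lys": "AAA AAG", "Asp": "GAU GAC",
    "Glu": "GAA GAG", "Cys": "UGU UGC", "Trp": "UGG",
    "Arg": "CGU CGC CGA CGG AGA AGG", "Gly": "GGU GGC GGA GGG",
    "STOP": "UAA UAG UGA",
}

# Invert to triplet -> meaning, checking that no triplet is listed twice.
CODON_TABLE = {}
for name, codons in CODONS_BY_AMINO_ACID.items():
    for codon in codons.split():
        assert codon not in CODON_TABLE, f"ambiguous triplet {codon}"
        CODON_TABLE[codon] = name


def translate(mrna: str) -> list:
    """Translate from the first AUG up to (not including) the first stop triplet."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start marker, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


if __name__ == "__main__":
    print(len(CODON_TABLE))               # 64 triplets in total
    print(translate("GGAUGUUUGGCUAAGG"))  # ['Met', 'Phe', 'Gly']
```

The lookup only goes one way: a triplet identifies exactly one amino acid, but an amino acid such as leucine cannot be turned back into a unique triplet.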
Why this redundancy? Perhaps there is some meaning to it?
Perhaps simply a primitive form of error prevention? Perhaps it is simply an
accident of nature that occurs because 3 digits were needed, since 2 DNA digits
could only encode 4x4=16 codes, which is not enough to represent the 20 amino
acids and start/stop codes.
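The counting argument for why 3 digits are needed can be written as a tiny calculation. The function name is illustrative. It finds the shortest code length whose number of combinations covers the required number of meanings, here 20 amino acids plus at least one stop signal.

```python
def min_code_length(meanings_needed: int, alphabet_size: int = 4) -> int:
    """Smallest n such that alphabet_size ** n >= meanings_needed."""
    length = 1
    while alphabet_size ** length < meanings_needed:
        length += 1
    return length


if __name__ == "__main__":
    print(4 ** 2)                   # 16 two-base codes: too few
    print(4 ** 3)                   # 64 three-base codes: more than enough
    print(min_code_length(20 + 1))  # 3
```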
This DNA encoding appears to be almost the same for all
life on the planet. A few species of single-celled protists have slightly
different codes.
The DNA data sequences are of varying length depending on
the size of the protein. Proteins can range from tiny proteins with about 50
amino acids to huge proteins with 5,000 amino acids.
The DNA start and stop sequences are not the same as the RNA
start and stop triplets. DNA has a promoter sequence to show where copying to
RNA should start, and a terminator sequence to show where it should stop. The
RNA then uses only a single triplet each as the start and stop markers. The DNA
promoter and terminator sequences are more complex.
Introns: Surprisingly,
not all of the DNA code within a gene is used. Certain sequences called "introns"
are simply ignored. These are like comments in protein coding sequences. They
are transcribed to mRNA along with everything else, but then they are excised
from the mRNA to produce the final mRNA. The resulting mRNA has the same order
and codes as the original mRNA, but with the intron sequences removed.
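Removing introns is much like stripping comments from source code, and can be sketched as cutting known spans out of a string. In this sketch the intron positions are supplied by hand as (start, end) index pairs, which is an assumption made for illustration. It does not attempt to model how a cell locates intron boundaries.

```python
def splice(pre_mrna: str, introns: list) -> str:
    """Remove intron spans, given as (start, end) pairs with end exclusive.

    The remaining pieces are joined in their original order.
    """
    kept = []
    position = 0
    for start, end in sorted(introns):
        kept.append(pre_mrna[position:start])  # keep text before this intron
        position = end                         # skip over the intron itself
    kept.append(pre_mrna[position:])           # keep the tail after the last intron
    return "".join(kept)


if __name__ == "__main__":
    # Positions 3..8 hold a made-up intron "GUAAGU" between two coding pieces.
    print(splice("AUGGUAAGUCCCUAA", [(3, 9)]))   # AUGCCCUAA
```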
RNA Data Sequences in DNA
Proteins are not the only substances that are synthesized
directly from data within the DNA. Some forms of RNA are specialized, and also
have their formula encoded directly in digital DNA formulae.
Not all types of RNA are temporary intermediate forms with
their form depending on whatever DNA they are copying. There are certain forms
of RNA that have a particular form that is the same across all individuals.
Some of these special-purpose RNA forms are:
·
tRNA: transfer RNA
·
rRNA: ribosomal RNA
There is at least one form of tRNA for each of the 20 amino acids, and each
form transfers one particular amino acid. tRNA molecules contain about 75-80
bases. A tRNA recognizes one or more of the 64 triplets, and matches it to one
of the 20 amino acids. Since there are fewer tRNA types than coding triplets,
some tRNA molecules have to recognize more than one triplet as a match.
The DNA code contains multiple repetitions of the codes for
tRNA and rRNA. About 280 copies are spread over 5 chromosomes. Presumably, this
allows each cell to make multiple copies of tRNA and rRNA molecules at once
from its single copy of the DNA.
Executing the DNA Program: Parallel Execution
Every cell has a full copy of the entire DNA, complete with
around 6 billion DNA base pairs jammed into the cell's nucleus. Whenever cells
divide to replicate, they duplicate the entire DNA code so that each cell
retains a full DNA copy.
The main cells that do not have the entire DNA code are
reproductive sperm or egg cells, which have only 23 chromosomes each, and thus
only about a half copy of DNA.
Summary
·
DNA is digital, but is quaternary,
not binary.
·
DNA is a base-4 code using the
digits A, C, G and T.
·
Proteins are a base-20 code using
the 20 amino acids.
·
DNA represents a protein as an
ordered sequence of base-4 triplets, mapping 64 possible values to 20 amino
acids.
·
Comments: Some DNA sequences are
ignored: introns