Genomics – A programmer’s guide.

Andy Thomason:

The human genome consists of two copies of about 3 billion base pairs of DNA, written in the alphabet A, C, G and T. Four letters need two bits per base, so the total is

3,000,000,000 * 2 * 2 / 8 = 1,500,000,000 bytes, or about 1.5 GB of data.

https://en.wikipedia.org/wiki/Human_genome
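To make the two-bits-per-base arithmetic concrete, here is a minimal sketch of packing a DNA string four bases to a byte. The mapping A=0, C=1, G=2, T=3 and the function names are arbitrary choices for illustration, not a standard file format, and the sketch assumes the input contains only those four letters.

```python
# Two-bit codes for the four-letter DNA alphabet (an arbitrary, illustrative mapping).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte, lowest bits first."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))


def genome_bytes(base_pairs: int, copies: int = 2, bits_per_base: int = 2) -> int:
    """Size in bytes of `copies` copies of a genome stored at `bits_per_base`."""
    return base_pairs * copies * bits_per_base // 8


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
print(genome_bytes(3_000_000_000))
```

The length has to be stored separately, because the last byte may be only partly filled.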

In reality, the two copies are very similar, and indeed the DNA of all humans is nearly identical, from Wall Street trader to Aboriginal Australian.

There are a number of “reference genomes”, such as the Ensembl FASTA files available from:

ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/
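A FASTA file is plain text: each record starts with a header line beginning with ">", followed by one or more lines of sequence. As a rough sketch of how such a file could be read, the function below (the name read_fasta and the example records are made up for illustration) yields one (header, sequence) pair per record.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# A tiny made-up example, not real genome data.
example = """>seq1 a made-up record
ACGTAC
GGT
>seq2
TTNA
""".splitlines()

records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

For a downloaded, gzip-compressed file, the same function can be fed from gzip.open(path, "rt"), where path is wherever the file was saved. Real sequences also contain letters other than A, C, G and T, most commonly N for an unknown base, so a reader should not assume a strict four-letter alphabet.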

Reference genomes help us to build a map of where we might find particular features in human DNA, but they do not represent any real individual.

For example, we can use a reference genome to name a “location” for a protein-coding gene such as BRCA2, which is involved in DNA repair and implicated in breast cancer: