Methods for haplotype analysis

Steve Simon

2006-05-31

I am not an expert on haplotype analysis, but as I understand it, a haplotype is a combination of alleles at several SNPs (single nucleotide polymorphisms) that sit together on the same strand of DNA. Such a combination can show a stronger association with disease than any single SNP might.

Haplotype analysis is difficult because you often have only partial information about the genome: you know the two alleles at each SNP, but not which allele sits on which strand. Here are the genotypes at the first fifteen SNPs on chromosome 22 for a subject in the HapMap project.

rs3016036   AA
rs2334386   GG
rs2844882   AA
rs11089130  GG
rs738829    GG
rs7510853   CC
rs10154488  CC
rs915674    AG
rs915675    AC
rs915677    GG
rs9604648   GG
rs7286962   CC
rs9604721   CC
rs12159982  CC
rs4389403   AG

Three of these SNPs (rs915674, rs915675, and rs4389403) are heterozygous, so there are 2 x 2 x 2 = 8 possible ways that these SNPs could arrange themselves on the two strands of DNA:

Haplotype 1: AGAGGCCAAGGCCCA and AGAGGCCGCGGCCCG
Haplotype 2: AGAGGCCAAGGCCCG and AGAGGCCGCGGCCCA
Haplotype 3: AGAGGCCACGGCCCA and AGAGGCCGAGGCCCG
Haplotype 4: AGAGGCCACGGCCCG and AGAGGCCGAGGCCCA
Haplotype 5: AGAGGCCGAGGCCCA and AGAGGCCACGGCCCG
Haplotype 6: AGAGGCCGAGGCCCG and AGAGGCCACGGCCCA
Haplotype 7: AGAGGCCGCGGCCCA and AGAGGCCAAGGCCCG
Haplotype 8: AGAGGCCGCGGCCCG and AGAGGCCAAGGCCCA

Actually, if you look closely at this, there are only four unique haplotype pairs, because swapping the two strands gives the same pair (1/8, 2/7, 3/6, and 4/5 are effectively the same).
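The counting above can be checked with a short script. This is only a sketch in Python: the function name haplotype_pairs is my own invention, and the genotypes are the fifteen from the listing above, in the same order.

```python
from itertools import product

# Genotypes for the fifteen SNPs listed above, in the same order.
genotypes = ["AA", "GG", "AA", "GG", "GG", "CC", "CC", "AG",
             "AC", "GG", "GG", "CC", "CC", "CC", "AG"]

def haplotype_pairs(genotypes):
    """Return every ordered (strand 1, strand 2) pair consistent with the genotypes.

    haplotype_pairs is a hypothetical helper name used for illustration.
    """
    # A heterozygous SNP can be split across the strands in two ways;
    # a homozygous SNP can be split in only one way.
    options = [sorted({(g[0], g[1]), (g[1], g[0])}) for g in genotypes]
    pairs = []
    for choice in product(*options):
        strand1 = "".join(a for a, _ in choice)
        strand2 = "".join(b for _, b in choice)
        pairs.append((strand1, strand2))
    return pairs

ordered = haplotype_pairs(genotypes)
# Ignoring which strand is listed first collapses mirror-image pairs.
unordered = {frozenset(pair) for pair in ordered}

print(len(ordered), len(unordered))  # 8 4
```

With k heterozygous SNPs the ordered count is 2^k, and the unordered count is half of that whenever k is at least 1, which is why the number of candidate pairs grows so quickly as more SNPs are added.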

In most realistic situations, you do not know which particular haplotype pair a patient has. You could sequence the DNA strand to figure out which of these haplotype combinations is actually present, but sequencing is very expensive. Instead, you might be able to infer the likelihood of these haplotypes by looking at multiple patients and making assumptions consistent with Hardy-Weinberg equilibrium.

These inferences are effectively the same as many missing data problems, and they use the EM algorithm, an approach that is commonly relied on for this sort of problem. There is a library of programs for R called haplo.stats.
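To make the idea concrete, here is a minimal sketch of an EM iteration for haplotype frequencies, written in Python. It is not the haplo.stats implementation. The function names are hypothetical, the two-SNP genotypes are made up for illustration, and the only modeling assumption is Hardy-Weinberg equilibrium: a pair of different haplotypes with frequencies p and q has probability 2pq, and a pair of identical haplotypes has probability p squared.

```python
from itertools import product
from collections import defaultdict

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype such as ("AG", "CT")."""
    options = [sorted({(g[0], g[1]), (g[1], g[0])}) for g in genotype]
    pairs = set()
    for choice in product(*options):
        h1 = "".join(a for a, _ in choice)
        h2 = "".join(b for _, b in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    pairs_per_subject = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pairs_per_subject
                         for pair in pairs for h in pair})
    # Start from equal frequencies for every candidate haplotype.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in pairs_per_subject:
            # E-step: probability of each candidate pair under the current
            # frequencies (2pq for two different haplotypes, p^2 for identical).
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of strands.
        n_strands = 2 * len(genotypes)
        freq = {h: counts[h] / n_strands for h in haplotypes}
    return freq

# Made-up data: ten subjects typed at two SNPs. Only the double
# heterozygotes ("AG", "CT") have an ambiguous haplotype pair.
toy = [("AA", "CC")] * 3 + [("GG", "TT")] * 3 + [("AG", "CT")] * 4
estimates = em_haplotype_frequencies(toy)
for haplotype, p in estimates.items():
    print(haplotype, round(p, 4))
```

In this toy data the unambiguous subjects carry only AC and GT, so the estimates for the ambiguous subjects are pulled toward the AC/GT pair, and the frequencies of AT and GC shrink toward zero as the iterations proceed.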

I’ve run some experiments on applying information theory to the HapMap project, and I might investigate whether this provides an alternative way of identifying haplotypes.

I attended a talk last week by Pengyuan Liu, and she described how to assess haplotype information, with special attention to the case where you have data on siblings. Some of the references she mentioned are worth reviewing.