Complex diseases involve multiple genes with low penetrance that may be distributed across the genome. Most current haplotype association methods are restricted to short haplotypes composed of alleles from contiguous SNPs, which limits their capacity to detect epistasis arising from multiple gene-gene interactions. Recent evidence has shown that the 3-D structure of chromosomes can allow SNPs to interact with genetic elements that are physically separated by great distances. We hypothesize that interactions between different chromosomes may play a role in disease susceptibility. We have therefore developed a cooperative coevolutionary algorithm (CCA) to detect gene-gene interactions from case-control haplotype data. The algorithm can tolerate up to 15% missing/ambiguous positions in haplotype data, which arise when haplotypes are phased from genotypes, and it can compute epistatic associations among genes spanning multiple chromosomes.
HapEvolution is a Java 6 application that implements the proposed CCA.
- Haplotype Files:
This program takes two input files, one for cases and one for controls, which contain the same set of SNP genotypes. Below are sample case and control phased haplotype files for 3 SNPs.
The format is similar to Haploview’s Hap format. The first row gives the chromosomal position of each SNP, and the positions must be in ascending order. The first column is the family ID, the second column is the individual ID, and the remaining columns are the alleles at each SNP. Each pair of rows defines the two parental haplotypes of one individual. Bases are coded as numbers: 1 = A, 2 = C, 3 = G, 4 = T. Missing/ambiguous values produced by the genotyping software are coded as 0.
- Map File:
The first column of the map file is the chromosome number, the second column is the rs# of each SNP, and the third column is the chromosomal position of the SNP.
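The two input formats described above can be read along the following lines. This is a minimal sketch and not HapEvolution's own parser: the class `InputFormatSketch` and its methods are hypothetical, whitespace-separated columns are assumed because the delimiter is not stated, and the printed `N` for code 0 is a choice made here. The rows in `main` are made-up values used only to exercise the code.

```java
/**
 * Illustrative sketch only; this class is hypothetical and is not part of HapEvolution.
 * Assumes whitespace-separated columns, which the format description does not state.
 */
class InputFormatSketch {

    // Index = numeric allele code: 0 = missing/ambiguous, 1 = A, 2 = C, 3 = G, 4 = T.
    // Printing 'N' for code 0 is a choice made for this sketch.
    private static final char[] BASES = {'N', 'A', 'C', 'G', 'T'};

    /** Decodes one haplotype row: family ID, individual ID, then one allele code per SNP. */
    static int[] parseAlleles(String row) {
        String[] fields = row.trim().split("\\s+");
        if (fields.length < 3) {
            throw new IllegalArgumentException("Need two ID columns and at least one allele: " + row);
        }
        int[] alleles = new int[fields.length - 2];
        for (int i = 2; i < fields.length; i++) {
            int code = Integer.parseInt(fields[i]);
            if (code < 0 || code > 4) {
                throw new IllegalArgumentException("Allele code outside 0-4: " + fields[i]);
            }
            alleles[i - 2] = code;
        }
        return alleles;
    }

    /** Translates validated allele codes into bases, e.g. {1, 3, 0, 4} becomes "AGNT". */
    static String toBases(int[] alleles) {
        StringBuilder sb = new StringBuilder();
        for (int code : alleles) {
            sb.append(BASES[code]);
        }
        return sb.toString();
    }

    /** Fraction of positions coded 0 (missing/ambiguous) in one haplotype. */
    static double missingFraction(int[] alleles) {
        int missing = 0;
        for (int code : alleles) {
            if (code == 0) {
                missing++;
            }
        }
        return (double) missing / alleles.length;
    }

    /** Splits one map-file line into chromosome, rs# and chromosomal position. */
    static String[] parseMapLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 columns: " + line);
        }
        Long.parseLong(fields[2]); // throws NumberFormatException for a non-numeric position
        return fields;
    }

    public static void main(String[] args) {
        int[] alleles = parseAlleles("FAM1 IND1 1 3 0 4"); // made-up haplotype row
        System.out.println(toBases(alleles) + " missing=" + missingFraction(alleles));
        String[] entry = parseMapLine("1 rs0000001 12345"); // made-up map line
        System.out.println("chr " + entry[0] + ", " + entry[1] + ", position " + entry[2]);
    }
}
```

A check like `missingFraction` could be compared against the 15% tolerance mentioned earlier before a dataset is submitted, although how HapEvolution itself applies that limit is not specified here.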
It is important to verify that the CCA actually evolves within the search space; hence we have included a visualization module for observing the increase in fitness.
- Fitness File: Each run produces two output files. The first, named *_fitness.txt, consists of n columns: the first two are the average fitness and the maximum fitness of each generation, respectively, and the remaining columns are the average fitness of each sub-population.
- Result File: The second file is the result file, which lists 10 parameters for each haplotype:
- Numerical Haplotype - the haplotype in numerical format.
- Minor/Major Allele - the major/minor allele specification for the haplotype.
- List of SNPs - the list of SNPs in this haplotype.
- Case Frequency - the frequency of the haplotype in the case samples.
- Control Frequency - the frequency of the haplotype in the control samples.
- P-value - the p-value computed by a permutation test.
- Fitness - the case-control frequency difference for this haplotype.
- Case Count - the number of occurrences of this haplotype in the case samples.
- Control Count - the number of occurrences of this haplotype in the control samples.
- HRR - the haplotype risk ratio.
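The count, frequency and fitness fields in the list above are related in a simple way, which the following sketch illustrates. It is not HapEvolution's code: the class `HaplotypeStatsSketch` is hypothetical, the denominator is assumed to be the total number of haplotypes in each group (two per individual), and taking case minus control for the sign of the difference is also an assumption. The HRR and the permutation p-value are left out because their formulas are not given here.

```java
/**
 * Illustrative sketch only; hypothetical class, not HapEvolution's implementation.
 */
class HaplotypeStatsSketch {

    /**
     * Sample frequency of a haplotype: its count divided by the number of haplotypes
     * in the group (assumed here to be two per individual).
     */
    static double frequency(int count, int totalHaplotypes) {
        if (totalHaplotypes <= 0 || count < 0 || count > totalHaplotypes) {
            throw new IllegalArgumentException("Invalid count " + count + " of " + totalHaplotypes);
        }
        return (double) count / totalHaplotypes;
    }

    /** Case-control frequency difference; subtracting control from case is an assumption. */
    static double fitness(int caseCount, int caseTotal, int controlCount, int controlTotal) {
        return frequency(caseCount, caseTotal) - frequency(controlCount, controlTotal);
    }

    public static void main(String[] args) {
        // Made-up counts: 100 cases and 100 controls, i.e. 200 haplotypes per group.
        int caseCount = 30;
        int controlCount = 10;
        System.out.println("case frequency    = " + frequency(caseCount, 200));
        System.out.println("control frequency = " + frequency(controlCount, 200));
        System.out.println("fitness           = " + fitness(caseCount, 200, controlCount, 200));
    }
}
```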
Below is a demonstration of running the HapEvolution package using a demo dataset.
Fig 1: The interface of the HapEvolution package.
Fig 2: The real-time progression of the fitness value for each CCA run. This visualization confirms that each CCA run is evolving.
Fig 3: Result summary of multiple runs of the CCA. The x/y plot shows the interaction ratio; below the plot is the chromosome contig with the SNP locations. The haplotype table lists the extended haplotypes with their case and control frequencies and the haplotype risk ratio (HRR). In the haplotype alleles, '1' represents the major allele and '2' represents the rare allele of a SNP.