
How to Interpret a Genome-wide Association Study


Thomas A. Pearson; Teri A. Manolio
Online article and related content current as of April 15, 2010. JAMA. 2008;299(11):1335-1344 (doi:10.1001/jama.299.11.1335) http://jama.ama-assn.org/cgi/content/full/299/11/1335

Correction

Correction is appended to this PDF and also available at
http://jama.ama-assn.org/cgi/content/full/jama;299/18/2150-a


Downloaded from www.jama.com by guest on April 15, 2010

SPECIAL COMMUNICATION

Thomas A. Pearson, MD, MPH, PhD
Teri A. Manolio, MD, PhD

Author Affiliations: Office of Population Genomics, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland (Drs Pearson and Manolio); and Clinical and Translational Science Institute, University of Rochester Medical Center, Rochester, New York (Dr Pearson).
Corresponding Author: Teri A. Manolio, MD, PhD, Office of Population Genomics, National Human Genome Research Institute, 31 Center Dr, Room 4B-09, MSC 2154, Bethesda, MD 20892-2154 (manolio@nih.gov).

Genome-wide association (GWA) studies use high-throughput genotyping technologies to assay hundreds of thousands of single-nucleotide polymorphisms (SNPs) and relate them to clinical conditions and measurable traits. Since 2005, nearly 100 loci for as many as 40 common diseases and traits have been identified and replicated in GWA studies, many in genes not previously suspected of having a role in the disease under study, and some in genomic regions containing no known genes. GWA studies are an important advance in discovering genetic variants influencing disease but also have important limitations, including their potential for false-positive and false-negative results and for biases related to selection of study participants and genotyping errors. Although these studies are clearly many steps removed from actual clinical use, and specific applications of GWA findings in prevention and treatment are actively being pursued, at present these studies mainly represent a valuable discovery tool for examining genomic function and clarifying pathophysiologic mechanisms. This article describes the design, interpretation, application, and limitations of GWA studies for clinicians and scientists for whom this evolving science may have great relevance.
JAMA. 2008;299(11):1335-1344 www.jama.com

©2008 American Medical Association. All rights reserved.
(Reprinted) JAMA, March 19, 2008, Vol 299, No. 11

In the past 2 years, there has been a dramatic increase in genomic discoveries involving complex, non-Mendelian diseases, with nearly 100 loci for as many as 40 common diseases robustly identified and replicated in genome-wide association (GWA) studies (T.A.M.; unpublished data, 2008). These studies use high-throughput genotyping technologies to assay hundreds of thousands of the most common form of genetic variant, the single-nucleotide polymorphism (SNP), and relate these variants to diseases or health-related traits.1 Nearly 12 million unique human SNPs have been assigned a reference SNP (rs) number in the National Center for Biotechnology Information’s dbSNP database2 and characterized as to specific alleles (alternate forms of the SNP), summary allele frequencies, and other genomic information.3

The GWA approach is revolutionary because it permits interrogation of the entire human genome at levels of resolution previously unattainable, in thousands of unrelated individuals, unconstrained by prior hypotheses regarding genetic associations with disease.4 However, the GWA approach can also be problematic because the massive number of statistical tests performed presents an unprecedented potential for false-positive results, leading to new stringency in acceptable levels of statistical significance and requirements for replication of findings.5

The genome-wide, nonhypothesis-driven nature of GWA studies represents an important step beyond candidate gene studies, in which the high cost of genotyping had limited the number of variants assayed to several hundred at most. This required careful selection of variants to be studied, often based on imperfect understanding of the biologic pathways relating genes to disease.6 Many such associations failed to be replicated in subsequent studies,7,8 leading to calls for all genetic association reports to include documented replication of findings as a prerequisite for publication.9,10

For non-Mendelian conditions, GWA studies also represent a valuable advance over family-based linkage studies, in which multiply affected families are arduously assembled and inheritance patterns are related to several hundred markers throughout the genome. Family-based linkage studies, although successful in identifying genes of large effect in Mendelian diseases such as cystic fibrosis and neurofibromatosis, have had more limited success in common diseases like atherosclerosis and asthma.11 Major limitations of linkage studies are relatively low power for complex disorders influenced by multiple genes, and the large size of the chromosomal regions shared among family members (often comprising hundreds of genes), in whom it can be difficult to narrow the linkage signal sufficiently to identify a causative gene.

GWA studies build on the valuable lessons learned from candidate gene and family linkage studies, as well as the expanding knowledge of the relationships among SNP variants generated by the International HapMap Project,12,13 to capture the great majority of common genetic differences among individuals and relate them to health and disease. These studies not only represent a powerful new tool for identification of genes influencing common diseases, but also use new terminologies (BOX 1), apply new models, and present new challenges in interpretation.

GWA studies rely on the “common disease, common variant” hypothesis, which suggests that genetic influences on many common diseases will be at least partly attributable to a limited number of allelic variants present in more than 1% to 5% of the population.14 Many important disease-causing variants may be rarer than this and are unlikely to be detected with this approach.

Although GWA discovery studies provide important clues to genomic function and pathophysiologic mechanisms, they are as yet many steps removed from actual clinical application. Nonetheless, they have gained considerable media attention and have the potential for generating queries from patients about whether to get tested for the “new gene for disease X” based on the latest report. In this article, we describe the design, interpretation, application, and limitations of GWA studies for clinicians and scientists for whom this evolving science may have great relevance.

Overview of GWA Studies

A GWA study is defined by the National Institutes of Health as a study of common genetic variation across the entire human genome designed to identify genetic associations with observable traits.15 Although family linkage studies and studies comprising tens of thousands of gene-based SNPs also assay genetic variation across the genome,16 the National Institutes of Health definition requires sufficient density and selection of genetic markers to capture a large proportion of the common variants in the study population, measured in enough individuals to provide sufficient power to detect variants of modest effect. The present discussion focuses on studies attempting to assay at least 100 000 SNPs selected to serve as proxies for the largest possible number of SNPs.12

The typical GWA study has 4 parts: (1) selection of a large number of individuals with the disease or trait of interest and a suitable comparison group; (2) DNA isolation, genotyping, and data review to ensure high genotyping quality; (3) statistical tests for associations between the SNPs passing quality thresholds and the disease/trait; and (4) replication of identified associations in an independent population sample or examination of functional implications experimentally.

Most of the roughly 100 GWA studies published by the end of 2007 were designed to identify SNPs associated with common diseases. However, the technique can also be used to identify genetic variants related to quantitative traits such as height17 or electrocardiographic QT interval,18 and to rank the relative importance of previously identified susceptibility genes, such as APOE*e4 in Alzheimer disease19 and CARD15 and IL23R in Crohn disease.20 GWA studies can also demonstrate gene-gene interactions, or modification of the association of one genetic variant by another, as with GAB2 and APOE in Alzheimer disease,21 and can detect high-risk haplotypes or combinations of multiple SNPs within a single gene, as in exfoliation glaucoma22 and atrial fibrillation.23 These studies have also been used to identify SNPs associated with gene expression, either as confirmation of a phenotypic association, such as asthma and ORMDL3 expression,24 or more globally.25 Thus, GWA studies have broader applications than those solely involving discovery of individual SNPs associated with discrete disease end points.

Study Designs Used in GWA

By far the most frequently used GWA study design to date has been the case-control design, in which allele frequencies in patients with the disease of interest are compared to those in a disease-free comparison group. These studies are often easier and less expensive to conduct than studies using other designs, especially if sufficient numbers of case and control participants can be assembled rapidly. This design also carries the most assumptions, which if not met, can lead to substantial biases and spurious associations (TABLE 1). The most important of these biases involve the selected, often unrepresentative nature of the study case participants, who are typically sampled from clinical sources and thus may not include fatal, mild, or silent cases not coming to clinical attention; and the lack of comparability of case and control participants, who may differ in important ways that could be related both to genetic risk factors and to disease outcomes.26 If well-established principles of epidemiologic design are followed, case-control studies can produce valid results that, especially for rare diseases, may not be obtainable in any other way. However, genetic association studies using case-control methodologies have not always adhered to these principles. The often sharply abbreviated descriptions of case and control participants and lack of comparison of key characteristics in GWA reports27 can make evaluation of potential biases and replication of findings quite difficult.28

The trio design includes the affected case participant and both of his or her parents.29 Phenotypic assessment (classification of affected status) is performed only in the offspring and only affected offspring are included, but genotyping is performed in all 3 trio members.
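The case-control comparison of allele frequencies described above can be illustrated with a small calculation. The sketch below is only one common way to make that comparison (an allelic chi-square test on a 2x2 table of allele counts); the article does not prescribe this test at this point, and the function names and genotype counts are hypothetical, invented for illustration.

```python
import math

def allele_counts(genotype_counts):
    """Convert (n_AA, n_Aa, n_aa) genotype counts into (a, A) allele counts."""
    n_AA, n_Aa, n_aa = genotype_counts
    return 2 * n_aa + n_Aa, 2 * n_AA + n_Aa

def allelic_association(case_genotypes, control_genotypes):
    """Return (odds_ratio, chi_square, p_value) for allele 'a', cases vs controls."""
    a, b = allele_counts(case_genotypes)      # cases: a alleles, A alleles
    c, d = allele_counts(control_genotypes)   # controls: a alleles, A alleles
    n = a + b + c + d
    # Pearson chi-square for a 2x2 table (1 degree of freedom, no continuity correction).
    chi_square = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, chi_square, p_value

# Hypothetical genotype counts (AA, Aa, aa) for 1000 cases and 1000 controls.
odds_ratio, chi_square, p_value = allelic_association((400, 450, 150), (500, 400, 100))
print(f"OR={odds_ratio:.2f} chi2={chi_square:.1f} p={p_value:.1e}")
```

A P value from a single SNP computed this way still has to be read against the very large number of tests in a genome-wide scan, which is why the article stresses stringent significance levels and replication.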
The frequency with which an allele is transmitted to an affected offspring from heterozygous parents is then estimated.29 Under the null hypothesis of no association with disease, the transmission frequency for each allele of a given SNP will be 50%, but alleles associated with the disease will be transmitted in excess to the affected case individual. Because the trio design studies allele transmission from parents to offspring, it is not susceptible to population stratification, or genetic differences between case and control participants unrelated to disease but due to sampling them from populations of different ancestry.30 A significant challenge of the trio design in GWA studies is its sensitivity to even small degrees of genotyping error,4,31 which can distort transmission proportions between parents and offspring, especially for uncommon alleles. Therefore, standards for genotyping quality in trio studies may need to be more stringent than for other designs.

Box 1. Terms Frequently Used in Genome-wide Association Studies
Alleles: Alternate forms of a gene or chromosomal locus that differ in DNA sequence
Candidate gene: A gene believed to influence expression of complex phenotypes due to known biological and/or physiological properties of its products, or to its location near a region of association or linkage
Copy number variants: Stretches of genomic sequence of roughly 1 kb to 3 Mb in size that are deleted or are duplicated in varying numbers
False discovery rate59,60: Proportion of significant associations that are actually false positives
False-positive report probability61: Probability that the null hypothesis is true, given a statistically significant finding
Functional studies: Investigations of the role or mechanism of a genetic variant in causation of a disease or trait
Gene-environment interactions: Modification of gene-disease associations in the presence of environmental factors
Genome-wide association study: Any study of genetic variation across the entire human genome designed to identify genetic association with observable traits or the presence or absence of a disease, usually referring to studies with genetic marker density of 100 000 or more to represent a large proportion of variation in the human genome
Genotyping call rate: Proportion of samples or SNPs for which a specific allele can be reliably identified by a genotyping method
Haplotype: A group of specific alleles at neighboring genes or markers that tend to be inherited together
HapMap12,13: Genome-wide database of patterns of common human genetic sequence variation among multiple ancestral population samples
Hardy-Weinberg equilibrium: Population distribution of 2 alleles (with frequencies p and q) such that the distribution is stable from generation to generation and genotypes occur at frequencies of p², 2pq, and q² for the major allele homozygote, heterozygote, and minor allele homozygote, respectively
Linkage disequilibrium: Association between 2 alleles located near each other on a chromosome, such that they are inherited together more frequently than expected by chance
Mendelian disease: Condition caused almost entirely by a single major gene, such as cystic fibrosis or Huntington’s disease, in which disease is manifested in only 1 (recessive) or 2 (dominant) of the 3 possible genotype groups
Minor allele: The allele of a biallelic polymorphism that is less frequent in the study population
Minor allele frequency: Proportion of the less common of 2 alleles in a population (with 2 alleles carried by each person at each autosomal locus) ranging from less than 1% to less than 50%
Modest effect: Association between a gene variant and disease or trait that is statistically significant but carries a small odds ratio (usually <1.5)
Non-Mendelian disease (also “common” or “complex” disease): Condition influenced by multiple genes and environmental factors and not showing Mendelian inheritance patterns
Nonsynonymous SNP: A polymorphism that results in a change in the amino acid sequence of a protein (and therefore may affect the function of the protein)
Platform: Arrays or chips on which high-throughput genotyping is performed
Polymorphic: A gene or site with multiple allelic forms. The term polymorphism usually implies a minor allele frequency of at least 1%
Population attributable risk: Proportion of a disease or trait in the population that is due to a specific cause, such as a genetic variant
Population stratification (also “population structure”): A form of confounding in genetic association studies caused by genetic differences between cases and controls unrelated to disease but due to sampling them from populations of different ancestries
Power: A statistical term for the probability of identifying a difference between 2 groups in a study when a difference truly exists
Single-nucleotide polymorphism: Most common form of genetic variation in the genome, in which a single-base substitution has created 2 forms of a DNA sequence that differ by a single nucleotide
Tag SNP: A readily measured SNP that is in strong linkage disequilibrium with multiple other SNPs so that it can serve as a proxy for these SNPs on large-scale genotyping platforms
Trio: Genetic study design including an affected offspring and both parents
Abbreviation: SNP, single-nucleotide polymorphism.

Cohort studies involve collecting extensive baseline information in a large number of individuals who are then observed to assess the incidence of disease in subgroups defined by genetic variants. Although cohort studies are typically more expensive and take longer to conduct than case-control studies, they often include study participants who are more representative than clinical series of the population from which they are drawn, and they typically include a vast array of health-related characteristics and exposures for which genetic associations can be sought.17,18 For these reasons, genome-wide genotyping has recently been added to cohort studies such as the Framingham Heart Study32 and the Women’s Health Study.33

Table 1. Study Designs Used in Genome-wide Association Studies

Case-Control
  Assumptions
    Case and control participants are drawn from the same population
    Case participants are representative of all cases of the disease, or limitations on diagnostic specificity and representativeness are clearly specified
    Genomic and epidemiologic data are collected similarly in cases and controls
    Differences in allele frequencies relate to the outcome of interest rather than differences in background population between cases and controls
  Advantages
    Short time frame
    Large numbers of case and control participants can be assembled
    Optimal epidemiologic design for studying rare diseases
  Disadvantages
    Prone to a number of biases including population stratification
    Cases are usually prevalent cases, may exclude fatal or short episodes, or mild or silent cases
    Overestimate relative risk for common diseases

Cohort
  Assumptions
    Participants under study are more representative of the population from which they are drawn
    Diseases and traits are ascertained similarly in individuals with and without the gene variant
  Advantages
    Cases are incident (developing during observation) and free of survival bias
    Direct measure of risk
    Fewer biases than case-control studies
    Continuum of health-related measures available in population samples not selected for presence of disease
  Disadvantages
    Large sample size needed for genotyping if incidence is low
    Expensive and lengthy follow-up
    Existing consent may be insufficient for GWA genotyping or data sharing
    Requires variation in trait being studied
    Poorly suited for studying rare diseases

Trio
  Assumptions
    Disease-related alleles are transmitted in excess of 50% to affected offspring from heterozygous parents
  Advantages
    Controls for population structure; immune to population stratification
    Allows checks for Mendelian inheritance patterns in genotyping quality control
    Logistically simpler for studies of children’s conditions
    Does not require phenotyping of parents
  Disadvantages
    May be difficult to assemble both parents and offspring, especially in disorders with older ages of onset
    Highly sensitive to genotyping error

Many GWA studies use multistage designs to reduce the number of false-positive results while minimizing the number of costly genome-wide scans performed and retaining statistical power.4 Genome-wide scans are typically performed on an initial group of case and control participants and then a smaller number of associated SNPs is replicated in a second or third group of case and control participants (TABLE 2). Some studies begin with small numbers of participants in the initial scan but carry forward large numbers of SNPs to minimize false-negative results.34 Other studies begin with more participants but carry forward a smaller proportion of associated SNPs.35 Optimal proportions of study participants and SNPs in each phase have yet to be determined,36 but carrying forward a small proportion (<5%) of stage 1 SNPs will often mean limiting the associations ultimately identified to those having a relatively large effect.37

Selection of Study Participants

Many genetic studies, whether GWA or otherwise, focus on case participants more likely to have a genetic basis for their disease, such as early-onset cases or those with multiple affected relatives. Misclassification of case participants can markedly reduce study power and bias study results toward no association, particularly when large numbers of unaffected individuals are misclassified as affected. For diseases that are difficult to diagnose reliably, ensuring that cases are truly affected (as by invasive testing or imaging) is probably more important than ensuring generalizability, although the limitations on diagnostic reliability and generalizability should be clearly described so that clinicians can judge the relevance to their patients.

The control participants should be drawn from the same population as the case participants and should be at risk to develop the disease and be detected in the study. Inclusion of women as controls in genetic association studies of diseases limited to men, for example, is problematic in that this approach adds individuals to the control group who had no chance of developing the disease (but might have done so had they also inherited a Y chromosome), thus mixing the controls with possible latent cases. This artificially reduces the differences in allele frequencies between cases and controls and limits the ability of the study to detect a true difference (ie, reduces study power). If the disease is common, such as coronary heart disease or hypertension in the United States, efforts should be made to ensure that the controls are truly disease free. Some studies address this by using super-controls, or persons at high risk but without even early evidence of disease, such as persons with diabetes of long duration but without microalbuminuria in a study of diabetic nephropathy.38

The success of recent GWA studies using control groups of questionable representativeness due to volunteer bias, such as the blood donor cohort in the Wellcome Trust Case-Control Consortium,39 suggests that initial identification of SNPs associated with disease may be robust to these biases, especially given subsequent evidence of replication of these associations in studies using more traditional control groups.40-42 Of more concern may be the risk of false-negative findings, as many biases tend to reduce the magnitude of observed associations toward the null. Use of convenience controls such as blood donors, however, may also be problematic in examining potential modification of genetic associations by environmental exposures and sociocultural factors, and in the identification of less strongly associated SNPs.

A key component of articles reporting observational study results in the epidemiology literature is an initial table comparing relevant characteristics of those with and without disease, allowing assessment of comparability and generalizability of the 2 groups. Such comparisons are infrequent in GWA studies,28 but they are important because common diseases are typically influenced by multiple environmental (as well as genetic) factors.
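The earlier point in this section, that misclassifying unaffected individuals as cases (or mixing latent cases into the controls) biases results toward no association, can be made concrete with simple arithmetic. The sketch below assumes, purely for illustration, that misclassified individuals carry the allele at the control-group frequency; the function name and the frequencies are hypothetical and do not come from the article.

```python
def observed_case_frequency(true_case_freq, control_freq, misclassified_fraction):
    """Allele frequency seen in a 'case' group diluted by unaffected individuals.

    Assumption: misclassified individuals carry the allele at the control frequency.
    """
    return ((1 - misclassified_fraction) * true_case_freq
            + misclassified_fraction * control_freq)

# Hypothetical allele frequencies, invented for illustration.
true_case_freq, control_freq = 0.30, 0.20
for fraction in (0.0, 0.1, 0.25, 0.5):
    observed = observed_case_frequency(true_case_freq, control_freq, fraction)
    difference = observed - control_freq
    print(f"misclassified={fraction:.2f} observed difference={difference:.3f}")
```

Under this assumption the observed difference in allele frequency shrinks in proportion to the fraction of correctly classified cases, which is one way to see why diagnostic accuracy affects study power.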
Important differences should be adjusted for in the analysis if possible, to avoid the risk of identifying genetic associations not with the disease of interest but with a confounding factor, such as smoking43 or obesity.44

Confounding due to population stratification (also called population structure) has been cited as a major threat to the validity of genetic association studies, but its true importance is a matter of debate.45,46 When variations occur in allele frequency between population subgroups, such as those defined by ethnicity or geographic origin, that in turn differ in their risk for disease, GWA studies may then falsely identify the subgroup-associated genes as related to disease.30 Population structure should be assessed and reported in GWA studies, typically by examining the distribution of test statistics generated from the thousands of association tests performed (eg, the χ² test) and assessing their deviation from the null distribution (that expected under the null hypothesis of no SNP associated with the trait) in a quantile-quantile, or “Q-Q,” plot (FIGURE 1). In these plots, observed association statistics or calculated P values for each SNP are ranked in order from smallest to largest and plotted against the values expected had they been sampled from a distribution of known form (such as the χ² distribution).39 Deviations from the diagonal identity line suggest that either the assumed distribution is incorrect or that the sample contains values arising

Table 2. Examples of Multistage Designs in Genome-wide Association Studies (a)

         3-Stage Study (b)                    4-Stage Study (c)
Stage    Case/Control      SNPs Analyzed     Case/Control      SNPs Analyzed
         Participants                        Participants
1        400/400           500 000           2000/2000         100 000
2        4000/4000         25 000            2000/2000         1000
3        20 000/20 000     25                2000/2000         20
4                                            2000/2000         5

Abbreviation: SNP, single-nucleotide polymorphism.
(a) Based on hypothetical data.
(b) Five SNPs associated with disease.
(c) Two SNPs associated with disease.
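The hypothetical designs in Table 2 can be worked through numerically to see what proportion of SNPs each stage carries forward and how many genotypes are assayed in total. The sketch below uses only the figures from Table 2; counting individual genotypes (participants times SNPs) as a rough proxy for genotyping effort is an assumption made here for illustration, and the function names are hypothetical.

```python
# (case participants, control participants, SNPs analyzed) per stage, from Table 2.
THREE_STAGE = [(400, 400, 500_000), (4000, 4000, 25_000), (20_000, 20_000, 25)]
FOUR_STAGE = [(2000, 2000, 100_000), (2000, 2000, 1000), (2000, 2000, 20), (2000, 2000, 5)]

def carried_forward(stages):
    """Proportion of SNPs from each stage that is analyzed again in the next stage."""
    snps = [stage[2] for stage in stages]
    return [later / earlier for earlier, later in zip(snps, snps[1:])]

def genotypes_assayed(stages):
    """Total genotypes (participants x SNPs) summed over all stages."""
    return sum((cases + controls) * snps for cases, controls, snps in stages)

for name, stages in (("3-stage", THREE_STAGE), ("4-stage", FOUR_STAGE)):
    proportions = ", ".join(f"{p:.1%}" for p in carried_forward(stages))
    print(f"{name}: carried forward {proportions}; genotypes {genotypes_assayed(stages):,}")
```

In the 3-stage example, 5% of the stage 1 SNPs reach stage 2, so only a small share of the genotyping falls on the much larger later-stage samples.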

Figure 1. Hypothetical Quantile-Quantile Plots in Genome-wide Association Studies
A, Before and after exclusion of most strongly associated locus
B, Before and after adjustment for population stratification
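The ranking step behind a Q-Q plot, as described in the text, can be sketched without any plotting library. The example below pairs ranked observed P values with the values expected under a uniform null distribution; the plotting position (i - 0.5)/n and the -log10 scale are common conventions assumed here, not details taken from the article, and the function name is hypothetical.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10(P) pairs, least significant first.

    Assumes every P value is greater than 0 and at most 1.
    """
    n = len(p_values)
    observed = sorted(p_values, reverse=True)             # largest P value first
    expected = [(i - 0.5) / n for i in range(n, 0, -1)]   # uniform quantiles, largest first
    return [(-math.log10(e), -math.log10(o)) for e, o in zip(expected, observed)]

# Hypothetical P values, invented for illustration: one SNP departs from the null.
for expected, observed in qq_points([0.7, 0.5, 0.3, 0.1, 1e-8]):
    print(f"expected={expected:.2f} observed={observed:.2f}")
```

Points near the diagonal (observed close to expected) are consistent with the null distribution, while points rising above it correspond to the deviations the text describes.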