Abeloff's Clinical Oncology, 4th Edition

Part I – Science of Clinical Oncology

Section B – Genesis of Cancer

Chapter 13 – Genetic Factors: Finding Cancer Susceptibility Genes

Elaine A. Ostrander, Danielle M. Karyadi




The identification of cancer susceptibility genes by either linkage studies within families or association studies in populations is a useful way to understand defining events in tumor development and to identify cellular pathways that are likely to be important in cancer.



Cancer susceptibility genes can be either strongly penetrant, in which case individuals born with a mutant allele have a high probability of developing cancer, or weakly penetrant, for which the probability of developing cancer is lower.



Ideal families for linkage studies are large, include many affected individuals who can be readily examined and interviewed, and include individuals with similar clinical features of disease from multiple generations. This allows large data sets associated with genetically heterogeneous forms of cancer to be stratified into homogeneous subsets, thus increasing the power to detect genes.



Linkage between polymorphic markers and a disease state is assessed by using a number of statistical tools, including the parametric LOD score and nonparametric NPL score.



Association studies that include populations of affected cases and controls are useful for identifying and testing hypotheses about candidate genes and alleles that might be associated with disease. A precisely defined set of cases and matched controls is important.



Association studies and linkage-based studies both require collection of accurate clinical and family history data by clinicians, and both offer hope for future genetic testing.


Cancer susceptibility genes are those that, when mutated, increase an individual's risk of having cancer. If an individual is born with one mutant copy of a cancer susceptibility gene, subsequent mutations in the wild-type allele within the relevant tissues can result in a lack of functional gene product, leading to tumor formation.[1] Genetic mapping of cancer susceptibility genes allows identification of both genes and pathways that play a role in cancer susceptibility. Although the direct public health impact associated with cloning any specific cancer gene may be minimal, the contributions to understanding of tumor development and metastasis that such advances make are potentially enormous.

Population-based studies reveal excess familial cancer aggregation for most organ sites.[2] However, cancer susceptibility genes have been mapped for only a few cancer sites to date, and the underlying mutations have been identified for even fewer. Several studies suggest that the overall percentage of cancers in the general population that are caused by highly penetrant inherited mutations is low, likely less than a few percent, even when all organ sites are considered.[3] For breast cancer and prostate cancer, the numbers are probably among the best supported; 5% to 10% of cases of each are thought to be due to mutations in inherited susceptibility loci. [4] [5] [6] [7] The remaining cancer cases, making up the majority, are considered sporadic in nature. They are probably caused by a mixture of specific environmental and weakly penetrant genetic factors, with genetic background remaining poorly understood.

Cancer susceptibility alleles associated with a given gene may be either strongly penetrant, leading to a high probability that individuals born with a mutant allele will have the disease in question, or weakly penetrant, with carriers having a proportionately lower probability of having the disease. Allele penetrance associated with susceptibility alleles is often age-dependent, with the probability of having the disease increasing with each decade of life. Genetic mapping of cancer susceptibility genes is extremely difficult, in part because genetic background and environmental exposures are likely to affect penetrance. In addition, both highly and weakly penetrant alleles can be associated with the same gene. Further, the same allele can be associated with widely varying age-dependent penetrance within a single family as different family members have independent genetic backgrounds as well as unique life experiences and environmental exposures. Finally, stochastic effects likely play a role as well.

Highly penetrant disease alleles are best identified by family-based linkage analysis studies. The segregation of a defined chromosomal segment with affected individuals in multiple families suggests the presence of a cancer susceptibility gene within the genomic region tested. Statistical analysis that is performed after genotyping of appropriate numbers of families with markers spanning the whole genome allows researchers to calculate the probability that any given chromosomal region carries a susceptibility gene. Weakly penetrant alleles are more easily identified by association tests after analysis of DNA from two distinct populations, for example, patients with cancer together with an appropriately matched set of control subjects. Weak alleles are hypothesized to be more common in the general human population and are therefore likely to account for a higher percentage of cancer in the population overall. In this chapter, we investigate the ways in which both highly and weakly penetrant disease alleles are identified and studied.


Hereditary Cancer Families

Strong and Amos[8] have defined a general paradigm for population studies that can be applied to identifying genes that are important in predisposition to cancer. The hypothesis that a particular cancer has an identifiable genetic component usually arises from family history analysis of sequential cancer cases, general clinical observations, and, finally, epidemiologic studies. Epidemiologic studies assess whether there is significant evidence for an increased cancer risk at a particular organ site that can be associated with a family history of the disease. If so, a segregation analysis may be undertaken to identify features of the putative susceptibility loci. Typical analyses address mode of inheritance (dominant, recessive, or X-linked) and estimate frequency and penetrance of the disease allele(s) in the general population, age-dependent penetrance, and potential number of loci contributing to the disease. In the event that genetic linkage studies are eventually undertaken, data from the segregation analysis are key in developing statistical models for analyzing subsequent linkage data.

Familial aggregation is a general term that describes the occurrence of multiple cases of cancer within a family ( Fig. 13-1 ). Such clustering may be due to shared environment, shared alleles of particular genes, or simply chance if the tumor is very common in the population. The successful mapping of cancer susceptibility genes for breast, colon, and prostate cancer has led to the development of a strictly defined term, hereditary cancer, which describes families with three or more first-degree relatives with a given cancer, three successive generations with cancer, or at least two siblings with the same cancer detected at a relatively young age.[4] First-degree relatives are defined as parents, offspring, or siblings.
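The three criteria for hereditary cancer quoted above can be expressed as a simple rule. The sketch below is illustrative only: the class name, field names, and the age cutoff used for "relatively young" are assumptions, because the definition in the text does not specify a threshold.

```python
from dataclasses import dataclass, field


@dataclass
class FamilySummary:
    """Hypothetical summary of one family's history for a single cancer type."""
    affected_first_degree: int        # affected members who are first-degree relatives
    successive_generations: int       # consecutive generations with the cancer
    sibling_diagnosis_ages: list = field(default_factory=list)  # ages of affected siblings


def is_hereditary(family, young_age=55):
    """Return True if any one of the three criteria from the text is met.

    young_age is an assumed cutoff for "relatively young"; the text gives none.
    """
    young_siblings = [a for a in family.sibling_diagnosis_ages if a <= young_age]
    return (
        family.affected_first_degree >= 3
        or family.successive_generations >= 3
        or len(young_siblings) >= 2
    )


# Two siblings diagnosed before the assumed cutoff satisfy the third criterion.
print(is_hereditary(FamilySummary(2, 2, [48, 52])))
# One early and one late diagnosis do not.
print(is_hereditary(FamilySummary(2, 2, [48, 71])))
```

A family that fails all three tests may still show familial aggregation; the rule only separates the strictly defined hereditary category from the broader one.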


Figure 13-1  Theoretical pedigree of a family segregating an autosomal dominant disorder. Individuals are numbered 1 to 21. Males are indicated by squares, and females are indicated by circles. Symbols for affected individuals are filled (numbers 1, 5, 7, etc.). A diagonal line through the symbol indicates that individual is deceased (1, 2, 5, and 9). A horizontal line between symbols indicates a mating (1+2, 3+4, etc.). Perpendicular lines drawn from mating lines indicate children (e.g., 13, 14, and 15 are all daughters of 3 and 4). Siblings are designated as shown for individuals 3, 5, 7, and 9; and individuals 5 and 7 are twins.



Many epidemiologic studies indicate that a family history of a specific cancer within first-degree relatives is associated with a doubling or more of risk among relatives.[9] In the case of prostate cancer, for instance, studies of selected hospital-based patient populations, [10] [11] population-based case-control studies, [12] [13] [14] and cohort studies [2] [15] all demonstrate that a family history of disease increases an individual's risk. If the affected family members are first-degree relatives (e.g., brothers or fathers and sons), the risk is increased 1.7-fold to 3.7-fold. Younger ages at diagnosis and multiple affected relatives with the disease tend to be associated with even higher relative risk (RR). For example, men with three or more first-degree relatives with prostate cancer have almost an 11-fold increased risk of the disease compared with men who have no family history of the disease.[10] For this reason, families that are ascertained for linkage analysis studies tend to be large, have multiple affected individuals, and feature people who were diagnosed with the disease at a comparatively young age.

Linkage Mapping and Finding Cancer Susceptibility Genes

Several requirements must be met to successfully identify cancer susceptibility genes. First, a large number of so-called high-risk or hereditary families must be ascertained by using appropriate guidelines for working with human subjects. Clinical features and family history data must be recorded, and DNA samples must be obtained. Once purified, DNA samples from appropriate family members need to be screened by using a set of highly polymorphic markers that span the genome at a high density. Historically, genome scans have used microsatellite-based markers distributed approximately every 10 million base pairs. Recent studies suggest that a denser scan with highly polymorphic markers every 5 million base pairs, or even biallelic markers such as single nucleotide polymorphisms (SNPs) placed every 2 to 3 million base pairs, may be preferable. Finally, the data must be interpreted or analyzed in the context of the disease in question. Creating stratified data sets, which allow analysis of families with a common disease or family history features, is important and may increase the chance of finding a susceptibility-associated locus. These issues are discussed in the following sections.

Family Collection

Most cancers are heterogeneous diseases that likely involve multiple susceptibility genes. In a statistically ideal situation, a given set of affected individuals within a family would all have cancer for the same reason; that is, each member would have inherited a mutated copy of the same gene. But in truth, for common cancers such as those of the breast, prostate, and colon, any given family may have individuals whose disease is due to mutations in multiple different genes, some highly penetrant and some weakly penetrant, as well as family members whose disease is sporadic.[16] Often, disease presentation is similar in genetic and sporadic cases, and examination of clinical or pathologic features is uninformative for determining whether a specific patient represents a genetic or sporadic case.

Figure 13-2 demonstrates two types of seemingly useful families for linkage-mapping studies. Both include a significant number of affected members. The first family, in particular, has a large number of affected individuals (see Fig. 13-2A ). However, some individuals were affected very early in life, whereas others were diagnosed at later ages. It is likely that some individuals have the disease because they inherited mutated copies of a particular gene, whereas others have the disease for sporadic reasons unrelated to the disease allele segregating in the family. Ideally, age at onset provides some guidance as to which individuals are more likely to have hereditary versus sporadic forms of the disease; but this is not absolute, and in the case of a disease with age-dependent penetrance, some people will be affected late in life even though they carry a mutant allele, and others will be affected early in life for sporadic reasons. The family shown in Figure 13-2B also appears to be informative for and conducive to linkage mapping studies. There are several affected individuals in the family, and all were affected at a relatively early age. However, the presence of disease segregating on both sides of the family should be noted. The affected individuals in the youngest generation could have cancer because they inherited mutant alleles from one or both sides of their family, and one or multiple genes could be involved. Therefore, the family is of limited utility for mapping studies.


Figure 13-2  Two theoretical breast cancer families. Age at diagnosis is indicated below the symbol; males are indicated by squares, and females by circles. A, The family has many members affected with breast cancer, but some were given diagnoses relatively early in life (<50 years), whereas others were much older at diagnosis (>70 years). The utility of this family for genetic-mapping studies is thus limited, because it likely contains individuals with both sporadic and hereditary breast cancer. B, All individuals were affected at an early age, but breast cancer, caused by mutations in either the same or different genes, is present on both sides of the family. Because there is no way to distinguish the number of mutant genes a priori, the utility of this family for a genome-wide scan is also somewhat limited.



Obtaining good clinical information for all individuals in a family mapping study gives geneticists the power to stratify the data into more homogeneous subsets. This increases statistical power for finding the genes associated with any one particular aspect of a phenotype. If a subset of individuals in the family in Figure 13-2B all had tumors of similar stage and grade, the data from this homogeneous subset of individuals could be considered in isolation from the rest of the affected cases, reducing heterogeneity and increasing power. In addition to clinical features of disease, family history, age at onset, and presence or absence of other cancers are all ways to stratify data into homogeneous subsets and improve the likelihood of finding causative genes. In the case of prostate cancer, several recent studies have focused on families with an excess of men presenting with a high Gleason tumor grade at diagnosis. Recall that the Gleason score reflects the pathologic architecture of the tumor at diagnosis. Several regions are scored and assigned a grade of 1 to 5, representing a well to poorly differentiated pattern, respectively. The two predominant grades are added to give a summary score of 2 to 10, most tumors being in the range of Gleason 5 to 7. Recent studies of prostate families reveal a set of specific loci for families in which two or more men from whom DNA could be genotyped presented with high Gleason scores or other measures of aggressive disease.[17] Indeed, some studies suggest a difference in men who present with a Gleason 3 + 4 versus 4 + 3, the latter representing a higher-risk class of study subjects with regard to higher frequency of biochemical failure, systemic recurrence, and cancer-specific death. [18] [19]
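The Gleason arithmetic described above is simple enough to sketch in a few lines. The function names are hypothetical. The point of the example is that a 3 + 4 tumor and a 4 + 3 tumor share a summary score of 7, so a data set stratified on the sum alone would lose the distinction the cited studies draw between them.

```python
def gleason_score(primary, secondary):
    """Sum the two predominant pattern grades (each 1 to 5) into a 2-to-10 score."""
    for grade in (primary, secondary):
        if grade not in range(1, 6):
            raise ValueError("pattern grades must be integers from 1 to 5")
    return primary + secondary


def describe(primary, secondary):
    """Report the summary score while preserving the order of the two patterns."""
    return f"Gleason {primary} + {secondary} = {gleason_score(primary, secondary)}"


print(describe(3, 4))
print(describe(4, 3))
```

Recording the primary and secondary patterns separately, rather than only their sum, keeps both stratification options available.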

Identification of cancer families and collection of critical medical information, including family history, medical record data, and DNA samples, are generally regulated by institutional review boards. Families must be identified in a way that is neither intrusive nor coercive. For these reasons, genetic epidemiologists are increasingly turning to advertisements in widely read periodicals, such as supplements to popular newspapers,[20] to recruit eligible families. A particularly innovative approach used by investigators trying to find hereditary prostate cancer families was to establish a toll-free telephone number, which was then advertised on a popular syndicated television talk show.[21] Viewers whose family history matched that described were encouraged to call for a preliminary phone screening and to obtain more information about the study.

For linkage-based approaches to correctly identify associated loci, rigorous quantitative data regarding strength of phenotype must be available for multiple generations of the family. Medical record data must be carefully and systematically extracted into well-protected databases. Family history data must also be obtained redundantly from multiple members of the family, and care must be taken to resolve discrepancies. Consent to contact other family members regarding the study is needed, as is permission to obtain medical records. Individual privacy must also be protected and personal identifiers such as names and addresses must remain confidential.

Locus Heterogeneity

If a particular trait is controlled by a large number of genes, each of which contributes only minimally to the final complex phenotype, it will be difficult to dissect the contributions of any one gene by studying a small number of families. If the phenotype is controlled largely by a small number of genes, however, the underlying genetics will be much easier to resolve. The breast cancer susceptibility genes BRCA1 and BRCA2 were likely among the first to be mapped for several reasons related to this point. [22] [23] First, only two genes appear to control the majority of the hereditary breast cancer in the general population.[24] Had the number of highly penetrant genes in the population been larger, the task would have been proportionately greater. Second, in both cases, large and well-characterized families had been meticulously ascertained. This ensured that there was sufficient statistical power to undertake the genome scan. Third, the power of any given data set can be increased dramatically by identifying families in which several members share minor disease features, thus making it likely that their disease is due to mutations in the same gene. The presence of ovarian cancer in some families and not others and the presence of breast cancer in some male carriers allowed for the creation of data sets that were enriched for the BRCA1 and BRCA2 genes, respectively. [22] [23] Finally, it is always useful to remove from a data set families whose disease is known to be caused by any given gene. The identification of the BRCA1 gene and subsequent removal of BRCA1-linked families from remaining data sets provided further useful enrichment for BRCA2-linked families. [23] [25]

Initially, in the case of breast cancer, investigators did not know the number of genes likely to be involved in genetic susceptibility. Detailed segregation analysis had suggested that the gene or genes responsible for breast cancer were likely to be highly penetrant and autosomal and to produce patterns of age-dependent penetrance. A segregation analysis typically involves interviewing a large number of sequential case patients who share common features of the disease. Once a segregation analysis is complete, the resulting data can be factored into the subsequent genome-wide scan. This allows data from some individuals to be weighted more heavily. Segregation analyses have now been done for nearly all types of cancer, [4] [7] [26] [27] [28] [29] providing investigators with an array of clues with which to begin their search for genes of interest.

After families are designated for a genome-wide scan, a power analysis is performed to determine whether there is sufficient statistical power in the specific group of families collected to find linkage to the trait in question, given a certain set of assumptions. The assumptions include how many markers are being tested, how informative each marker is likely to be, and the composition of the families in question. The scenario in which a given set of families offers sufficient power to find a gene only if one of the markers is extremely close to the locus occurs frequently and is particularly associated with diseases such as cancer, for which locus heterogeneity is common. The power to find genes decreases dramatically as the number of genes that contribute to a phenotype increases.[30]

Another way to reduce locus heterogeneity is by studying families from isolated or inbred populations. Fewer disease alleles are predicted to segregate with a particular phenotype in a population derived from a limited number of founders. Studies of colon cancer in Finland and studies of breast cancer in Iceland and in Ashkenazi Jewish populations illustrate these points very well. In Finland, two variants in the DNA mismatch repair gene, MLH1, termed mutations 1 and 2, account for 51% of all Finnish families with verified or putative cases of hereditary nonpolyposis colorectal cancer.[31] Nineteen mutation 1 and six mutation 2 families were further investigated by haplotype analysis using 15 microsatellite markers surrounding the MLH1 locus. The presence of two distinct large conserved disease haplotypes, one in mutation 1 and the other in mutation 2 families, indicated that these families are likely to descend from two common ancestors born in the sixteenth century and eighteenth century, respectively.[31]

For the breast cancer susceptibility genes, BRCA1 and BRCA2, several founder mutations have been identified in different populations.[32] For instance, a single BRCA2 mutation, 999del5, was found in 16 of 21 Icelandic breast cancer families.[33] All 16 of these families share a haplotype or pattern of alleles within the BRCA2 gene, suggesting a common ancestral origin. Studies of breast cancer in Jewish families have also demonstrated this point, contributing enormously to our knowledge of founding mutations for both BRCA1 and BRCA2. [34] [35] The three common founder mutations in this population, BRCA1-185delAG, 5382insC, and BRCA2-6174delT, have a combined prevalence of 2% to 2.5%. [32] [34] [36] [37] With these observations in mind, investigators have frequently sought families for genetic-mapping studies from regions of the world where marriage between related individuals is not taboo and where geographic barriers have restricted gene flow. This is especially important for genetically heterogeneous diseases, such as prostate cancer. Indeed, support for the existence of a locus on the X chromosome, HPCX, was found in a subset of Finnish families, suggesting that the HPCX locus might contain a founder mutation in Finland.[38]

Principles of Genetic Linkage Analysis

The principles of meiotic recombination are key to understanding linkage analysis. In meiosis, the cell division leading to gamete formation, homologous chromosomes are paired. Each chromosome consists of two identical strands (chromatids), each chromosome pairing being composed of four strands. Homologous chromosomes separate from each other during the process of meiosis except at one or two zones of contact in a process that leads to genetic recombination ( Fig. 13-3 ). Mendel's second law, of independent assortment, states that alleles of genes at unlinked loci segregate or assort independently of one another. Deviations from independent assortment occur when genes are located close to one another, in which case alleles assort together more than 50% of the time. In this scenario, the associated loci are said to be linked. However, if two loci are located on different chromosomes or far apart on the same chromosome, their alleles will assort randomly, a given set of alleles being transmitted to the same gamete 50% of the time. Such loci are said to be unlinked.


Figure 13-3  Genetic recombination is the process of exchanging genetic information between two chromatids during meiosis. The recombination events for a single chromosome within a family are illustrated. The father's homologous chromosomes are light and dark purple, and the mother's are light and dark green. Recombination events occurring during meiosis create unique parental chromosomes.



For any given chromosomal segment, the probability of a genetic recombination event occurring between a pair of markers or a marker and a gene is proportional to the distance between them. This probability is expressed as a recombination frequency (θ), where

θ = Number of recombinant offspring/Number of total offspring

Recombination frequency ranges from 0 for genes that are so closely linked that crossover events essentially never occur to 0.5 for genes that assort randomly. Within small intervals, when the probability of multiple crossovers is negligible, the relationship between the recombination fraction (θ) and the distance between two genes (x) is simply x = θ.[39] After a minor mathematical adjustment for the possibility of double recombinants is made, recombination fractions are expressed in units called centimorgans (cM), named after the geneticist Thomas Hunt Morgan.[40] One percent recombination (θ = 0.01) is equal to 1 cM, which in the human genome corresponds to about one million base pairs. The entire human genome is estimated to be about 3300 cM.
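The relationship between recombination fraction and map distance can be illustrated with a short calculation. The text does not say which adjustment for double recombinants is intended, so the sketch below assumes Haldane's map function, one widely used choice that presumes crossovers occur independently. The function names and the offspring counts are hypothetical.

```python
import math


def recombination_fraction(recombinants, total):
    """theta = number of recombinant offspring / number of total offspring."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("need total > 0 and 0 <= recombinants <= total")
    return recombinants / total


def haldane_cm(theta):
    """Map distance in cM under Haldane's function (assumed; no interference)."""
    if not 0 <= theta < 0.5:
        raise ValueError("theta must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * theta)


def approx_base_pairs(cm, bp_per_cm=1_000_000):
    """Rough physical distance using the average of about 1 million bp per cM."""
    return cm * bp_per_cm


# Hypothetical counts: 2 recombinants among 20 scored offspring.
theta = recombination_fraction(2, 20)
distance = haldane_cm(theta)
print(f"theta = {theta:.2f}, about {distance:.1f} cM, "
      f"roughly {approx_base_pairs(distance):,.0f} bp")
```

For very small values of θ the adjusted distance is almost identical to 100 × θ, which is why x = θ is an adequate approximation over short intervals; the two diverge as θ approaches 0.5.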

Genetic linkage mapping queries whether any given portion of the genome is consistently inherited with the disease status. The ordered set of alleles associated with a particular part of the genome received by an offspring from one parent is called a haplotype ( Fig. 13-4 ). Recombinant haplotypes are generated when a crossover occurs between two linked markers (see Fig. 13-3 ). In Figure 13-4 , the father is a heterozygote for two loci (AB for locus 1 and XY for locus 2). The mother is homozygous at both of the same loci but with different alleles (CC and ZZ), and as a result, all offspring will inherit the C,Z haplotype. If these two markers are unlinked, as they would be if they were on different chromosomes or far apart on the same chromosome, four types of gametes would be expected from the father (A,X; B,Y; A,Y; and B,X) in approximately equal proportions (see Fig. 13-4A ). However, if the markers are linked (see Fig. 13-4B ), the father would be expected to produce an excess of the two “parental” haplotypes (A,X and B,Y) over a smaller number of the “nonparental” or “recombinant” haplotypes (A,Y and B,X).


Figure 13-4  Linked and unlinked markers segregating in two families. Below the symbols, the genotypes for both markers are listed. Offspring have either recombinant (R) or nonrecombinant (NR) haplotypes. The father is heterozygous for marker 1, AB, and marker 2, XY; and the mother is homozygous for both markers, CC and ZZ. A, If the markers were unlinked, there would be equal numbers of R and NR haplotypes from the father (AX, BY, AY, and BX). B, There is an excess of NR haplotypes (AX and BY), and only one R haplotype appears. Therefore, these loci are linked.
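An excess of nonrecombinant haplotypes can be quantified with the parametric LOD score mentioned at the start of the chapter. The sketch below uses the standard two-point formula for phase-known meioses, LOD(θ) = log10[θ^R (1 − θ)^NR / 0.5^(R + NR)]. The counts of one recombinant and nine nonrecombinant offspring are hypothetical and are not taken from Figure 13-4.

```python
import math


def lod_score(recombinants, nonrecombinants, theta):
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0 <= theta <= 0.5:
        raise ValueError("theta must be in [0, 0.5]")
    if theta == 0 and recombinants > 0:
        return float("-inf")  # any recombinant excludes complete linkage
    n = recombinants + nonrecombinants
    log_linked = 0.0
    if recombinants:
        log_linked += recombinants * math.log10(theta)
    if nonrecombinants:
        log_linked += nonrecombinants * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)  # free recombination, theta = 0.5
    return log_linked - log_unlinked


# Hypothetical family: 1 recombinant and 9 nonrecombinant offspring.
for theta in (0.05, 0.10, 0.20, 0.50):
    print(f"theta = {theta:.2f}  LOD = {lod_score(1, 9, theta):.2f}")
```

The score peaks near θ = R / (R + NR) and is zero at θ = 0.5, where the linked and unlinked hypotheses coincide. Real pedigrees rarely have known phase and complete penetrance, so programs written for linkage analysis handle the general case.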




Marker Informativeness

Key to the success of any genome scan is the development of a well-defined set of markers that completely span the genome at defined intervals. The number of markers used determines the resolution of the resulting scan. A 10-cM genome scan, for instance, will allow localization of a disease locus only to within 5 million base pairs, whereas a 1-cM density scan, composed of approximately 3000 informative markers, will localize a gene to within half a million base pairs.
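The figures in the preceding paragraph follow from two numbers given earlier: a genetic length of about 3300 cM and an average of about one million base pairs per cM. The sketch below reproduces the arithmetic, assuming evenly spaced markers; the function names are hypothetical.

```python
import math

GENOME_CM = 3300        # genetic length of the human genome quoted in the text
BP_PER_CM = 1_000_000   # approximate average quoted in the text


def markers_needed(spacing_cm, genome_cm=GENOME_CM):
    """Evenly spaced markers required to cover the genome at a given density."""
    return math.ceil(genome_cm / spacing_cm)


def max_distance_to_marker_bp(spacing_cm, bp_per_cm=BP_PER_CM):
    """Farthest a locus can lie from its nearest marker: half the spacing."""
    return spacing_cm / 2 * bp_per_cm


for spacing in (10, 5, 1):
    print(f"{spacing:>2}-cM scan: {markers_needed(spacing)} markers, "
          f"locus within {max_distance_to_marker_bp(spacing):,.0f} bp of a marker")
```

Under these assumptions a 1-cM scan needs about 3300 markers, in line with the approximately 3000 informative markers cited above.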

A genetic marker, by definition, has two or more alleles. If the frequency of the most common allele is less than 95%, the marker is said to be polymorphic. One measure of polymorphism is called polymorphism information content (PIC).[41] PIC defines the probability that the genotype of a specific offspring will be sufficiently informative to determine which of two parental alleles has been inherited. Markers are assigned PIC values between 0 (minimally informative) and 1.0 (perfectly informative). A second measure of polymorphism is called heterozygosity. Heterozygosity (H) is calculated as H = 1 − Σ(Pi)², where Pi is the frequency of the ith allele of a given marker in the population under consideration.[42] There are currently several thousand well-characterized genetic markers with assigned PIC and/or heterozygosity values whose chromosomal locations in the human genome are well known.
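Both measures can be computed directly from allele frequencies. Heterozygosity follows the formula given above. For PIC, the sketch assumes the commonly used formula of Botstein and colleagues, PIC = 1 − Σ(Pi)² − ΣΣ 2(Pi)²(Pj)² summed over all pairs i < j, which the text does not spell out. The allele frequencies are hypothetical.

```python
def _check(freqs):
    """Allele frequencies must be non-negative and sum to 1."""
    if any(p < 0 for p in freqs) or abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must be non-negative and sum to 1")


def heterozygosity(freqs):
    """H = 1 - sum of squared allele frequencies."""
    _check(freqs)
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content (Botstein formula, assumed here)."""
    h = heterozygosity(freqs)
    n = len(freqs)
    pair_term = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )
    return h - pair_term


two_alleles = [0.5, 0.5]     # hypothetical marker with two equally common alleles
ten_alleles = [0.1] * 10     # hypothetical marker with ten equally common alleles
for name, freqs in (("2 alleles", two_alleles), ("10 alleles", ten_alleles)):
    print(f"{name}: H = {heterozygosity(freqs):.3f}, PIC = {pic(freqs):.3f}")
```

PIC is never larger than heterozygosity, and both rise as the number of reasonably common alleles increases, which is why markers with many alleles are the more informative in a linkage scan.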

Several different types of genetic markers are currently in use for genetic linkage mapping, each with different strengths. Microsatellites are small stretches of repetitive DNA composed of repeated motifs of mono-, di-, tri-, or tetranucleotides, such as (CA)n or (GAG)n, located randomly in the genome. [43] [44] [45] They occur frequently in mammalian populations, with a dinucleotide (CA)n repeat found, on average, every 30 to 60 kilobases. [45] [46] Microsatellites occurring in human DNA are extremely polymorphic, a given marker occasionally having in excess of 20 alleles. Microsatellite alleles are sufficiently stable in the population and therefore can be reliably used to track inheritance of chromosomal segments through several generations in a family. Yet with an estimated mutation rate of 5 × 10⁻⁴ to 10⁻⁵ per allele per meiosis, new alleles appear frequently in the population, contributing to their overall utility as genetic markers.[47]

Individual microsatellite markers are distinguished from one another after amplification of the locus by polymerase chain reaction and separation of the resulting alleles by electrophoresis [48] [49] ( Fig. 13-5). Among the most popular platforms is the Applied Biosystems (ABI) Capillary system using the GeneScan software.


Figure 13-5  Schematic agarose gel electrophoresis of a microsatellite marker analyzed on a single family. Males are indicated by squares, and females by circles. Four alleles are segregating; the father has alleles 1 and 4, and the mother carries alleles 2 and 3. Each child has inherited one allele from each parent, together with the surrounding genomic information.



One disadvantage of (CA)n repeat–based microsatellite markers is that the resulting variant alleles are so similar in size that they can sometimes be hard to separate on a gel. For this reason, most genome scans today are done by using commercially prepared sets of markers based largely on trinucleotide and tetranucleotide repeats.[50] Although less frequent than (CA)n repeats in the genome, they are easier to automate, and the resulting data are assessed with generally lower error rates. Sets of markers that are known to be very polymorphic and to have 5-cM or 10-cM spacing, and are thus optimized for genome-wide scans, are commercially available for the human, mouse, rat, and dog genomes. Commercially prepared marker sets are optimally designed so that the markers can be multiplexed, making it possible to analyze several markers simultaneously in a single gel lane.

SNPs (pronounced “snips”) occur when a single base in the genome is altered. It is estimated that polymorphic SNPs with a minor allele frequency greater than 5% occur once every 450 base pairs in the human genome and thus offer an unending resource for tracking variation.[51] However, since SNPs are biallelic, the overall informativeness of a single marker is less than that of the average microsatellite. As a result, thousands of SNPs are required to perform a complete genome-wide linkage scan.

The recent development of high-throughput SNP genotyping platforms for linkage studies by Affymetrix (10K GeneChip microarray, Fig. 13-6 ) and Illumina (Linkage IV panel, Fig. 13-7 ) has not only made SNP-based linkage scans realistic but also significantly reduced the time required to interrogate a complete genome. Dense SNP genotyping also increases the information content in families in which the parental genotypes are not available, as is typically the case for late-onset diseases like most cancers. For example, researchers at the Mayo Clinic analyzed a linkage study of prostate cancer families with both a 10-cM spaced microsatellite and the Affymetrix 10K SNP genome-wide scans.[52] They concluded that for families that lack parental genotypes, the SNP scan had significantly higher information content than did the microsatellite scan (61% versus 41%). This concept is an important consideration in experimental design. Measures of linkage can be limited by both lack of critical family members and marker informativeness. The increase in information content associated with a dense SNP genome-wide scan will improve measures of linkage at loci that are truly linked.


Figure 13-6  Affymetrix's high-throughput SNP genotyping platform. A, Affymetrix GeneChip can assay over 250,000 SNPs per sample on a single GeneChip. Oligonucleotide probes are spotted to a glass microchip, where each SNP is interrogated with approximately 40 different probes. B, A pseudocolored scan of a single Human Mapping 100K Hind GeneChip. C, The GTYPE software displays the genotype calls of several samples for a single SNP (C versus T). The orange triangles are homozygous CC, the green squares are heterozygous CT, and the blue diamonds are homozygous TT.




Figure 13-7  Illumina's high-throughput SNP genotyping platform. A, Illumina's Infinium BeadChip can assay over 650,000 SNPs per sample on a single BeadChip. B, Illumina's BeadArray Technology is based on 3-micron silica beads. Each bead is covered with over 100,000 copies of a specific oligonucleotide that act as the capture sequence in the Illumina assay. C, BeadScan Software scans the results of a single-color BeadChip. Each green dot represents one 3-micron bead and the specific oligonucleotide attached to it. D, BeadStudio Genotyping Software displays the cluster plot for the results of a single SNP (A versus G). The genotypes of the 79 red samples are homozygous AA, the 119 purple samples are heterozygous AG, and the 71 blue samples are homozygous GG.



However, genome-wide SNP linkage scans do have some limitations. Each individual SNP is not very informative; therefore, tracking the inheritance of chromosomal segments through a family or the detection of Mendelian errors requires the assembly of data into haplotypes (Fig. 13-8). A haplotype represents a set of alleles associated with an ordered set of markers inherited together on a chromosome that form either the maternal or paternal ancestry. Computer programs such as Genehunter and Merlin are available to assist in assembling haplotypes. [53] [54] The process is most efficiently expedited, however, by the collection and genotyping of critical family members, such as grandparents, whether they carry the disease in question or not. It is also important to note that linkage disequilibrium between SNPs can artificially inflate measures of linkage. This is more problematic with SNPs than with microsatellites, as the SNP markers are more closely spaced and are thus more likely to be coinherited as a block. So for any data set, the linkage disequilibrium between SNPs needs to be carefully assessed, and SNPs that are in high linkage disequilibrium with other SNPs need to be removed from the data set.
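The removal of SNPs in high linkage disequilibrium can be sketched as a simple greedy filter. This is an illustration only: it approximates linkage disequilibrium by the squared correlation (r²) between genotypes coded as 0, 1, or 2 copies of an allele, assumes the genotypes come from unrelated individuals, and uses a hypothetical cutoff of 0.8. The SNP names and genotype calls are made up.

```python
from statistics import mean

def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    m1, m2 = mean(g1), mean(g2)
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    var1 = sum((a - m1) ** 2 for a in g1)
    var2 = sum((b - m2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        return 0.0  # a monomorphic marker carries no LD information
    return cov * cov / (var1 * var2)

def prune_by_ld(genotypes, max_r2=0.8):
    """Greedy pruning: keep a SNP only if its r^2 with every kept SNP is below max_r2."""
    kept = []
    for name, calls in genotypes.items():
        if all(r_squared(calls, genotypes[k]) < max_r2 for k in kept):
            kept.append(name)
    return kept

# Hypothetical genotype calls for eight unrelated individuals
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp2": [0, 1, 2, 1, 0, 2, 1, 0],   # identical to snp1, so in complete LD
    "snp3": [2, 0, 1, 1, 2, 0, 0, 1],
}
kept = prune_by_ld(genotypes)
print(kept)
```

The greedy order favors SNPs that appear earlier in the map; dedicated linkage software may use different estimators and selection rules.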


Figure 13-8  SNP haplotypes in a prostate cancer family. Males are indicated by squares, and females are indicated by circles. Each individual's haplotype for a chromosomal region of interest is drawn below the symbols. The father's two haplotypes are blue or green, and the mother's are red or yellow. The three affected brothers and father all share the haplotype shaded blue, whereas the unaffected brother has inherited the haplotype shaded green. There are recombination events in brothers 2 and 4, restricting the region of interest to that defined by SNPs 2, 3, and 4 (indicated by the brackets).



Measures of Linkage

Calculating LOD Scores

On completion of a genome-wide scan, extensive checking is done by using programs such as PEDCHECK, PREST, RELPAIR, and MERLIN to detect potential genotyping errors by checking, for instance, Mendelian inheritance. [54] [55] [56] [57] Data are then analyzed to determine which, if any, markers are closely linked to a putative disease locus. The likelihood of linkage (θ < 0.5) versus the likelihood of free recombination, that is, no linkage (θ = 0.5), is calculated on the basis of the number of observed recombinant and nonrecombinant offspring produced by a given mating. Conventionally, the base-10 logarithm of this likelihood ratio, the LOD score, is used as the measure of support for linkage versus nonlinkage. For example, if n observations consist of k recombinants and n - k nonrecombinants, the corresponding LOD score is given by

$$\mathrm{LOD}(\theta) = \log_{10}\frac{\theta^{k}(1-\theta)^{n-k}}{(0.5)^{n}}$$

It is often stated that linkage is “found”; that is, a marker under consideration is said to be linked to a putative disease locus when a recombination fraction of θ < 0.5 is supported by a LOD score of at least 3.0.[58] However, in searching for genes, it is important to distinguish between pointwise significance levels and genome-wide significance levels.[59] The pointwise or nominal significance level is the probability that one would encounter such an extreme deviation at a specific locus by chance. The genome-wide significance level is the probability that one would encounter such a deviation somewhere in the whole genome scan (Box 13-1). The former is an evaluation of a single test of the null hypothesis of no linkage (testing for linkage of a favorite candidate gene to a disease locus); the latter involves screening over a large number of tests (i.e., running a large number of markers spanning the genome) to find the most significant result.
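The two-point LOD score for phase-known meioses can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (every meiosis is fully informative and phase is known); the function names and the counts of 1 recombinant in 10 meioses are hypothetical. Real pedigrees require dedicated linkage software.

```python
import math

def lod_score(k, n, theta):
    """Two-point LOD for k recombinants among n phase-known meioses at recombination fraction theta."""
    if not 0 <= k <= n:
        raise ValueError("k must lie between 0 and n")
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    # log10 likelihood under linkage at theta
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1 - theta)
    # log10 likelihood under free recombination (theta = 0.5)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked

def max_lod(k, n, steps=500):
    """Scan theta over a grid in (0, 0.5] and return (best_theta, best_lod)."""
    grid = [0.5 * i / steps for i in range(1, steps + 1)]
    best = max(grid, key=lambda t: lod_score(k, n, t))
    return best, lod_score(k, n, best)

theta_hat, z_max = max_lod(k=1, n=10)
print(f"theta = {theta_hat:.3f}, maximum LOD = {z_max:.2f}")
```

In this hypothetical family the maximum falls at the observed recombinant fraction k/n, and the resulting score is well short of 3.0, which shows why a single small family rarely establishes linkage on its own.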

Box 13-1 


It is noteworthy that there is frequent confusion in the literature about LOD scores versus probability measures. Again, according to the example provided by Lander and Kruglyak, a LOD score of 3.0 means that the observed data are 1000 times more likely to arise under a specific hypothesis of linkage than under the null hypothesis of independent assortment. A P value of 10⁻³ means that the probability of encountering as large a LOD score as is observed is 10⁻³ under the null hypothesis.

In assessing the significance of a putative linkage result, Lander and Kruglyak[59] have assigned the following descriptors. Suggestive linkage is that which would be expected to occur one time at random in a genome scan. Significant linkage would be expected to occur 0.05 times in a genome-wide scan. Highly significant linkage would be expected to occur 0.001 times in a genome scan. It is generally the norm to report all regions with a nominal P value of 0.05 in a complete genome scan. These may indicate places in the genome where additional families, markers, or both are needed. A LOD threshold of 3.3, corresponding to P = 5 × 10⁻⁵, is the value that is now accepted as indicating a genome-wide significance level of 5%. Thus, in the context of a genome-wide scan, a marker is said to be linked to a disease locus if a LOD score of 3.3 is achieved[59] (Box 13-2).
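The relationship between a LOD score and its pointwise P value can be sketched as follows. This is an illustration that assumes the usual large-sample approximation, in which 2 ln(10) × LOD follows a chi-square distribution with one degree of freedom and the test is one-sided; the function name is hypothetical.

```python
import math

def lod_to_pointwise_p(lod):
    """One-sided asymptotic P value for a LOD score (chi-square, 1 df)."""
    chi2 = 2.0 * math.log(10.0) * lod   # likelihood-ratio statistic
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)); halve it for a one-sided test.
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))

for lod in (1.0, 2.0, 3.0, 3.3):
    print(f"LOD {lod:.1f} -> pointwise P ~ {lod_to_pointwise_p(lod):.1e}")
```

Under this approximation a LOD of 3.3 gives a pointwise P value close to 5 × 10⁻⁵, and a LOD of 3.0 gives a value near 10⁻⁴ rather than 10⁻³, which underlines that a likelihood ratio of 1000:1 and a P value of 10⁻³ are different quantities.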

Box 13-2 

Confirmation Linkage Studies

Clinicians often choose to participate in studies that are aimed at confirmation of previously published linkage reports. For confirmation of published findings of linkage in an independent data set, a nominal P value of 0.01 is required. Because linkage from any previously published reports may hinge on precise features of the clinical diagnosis or stratification of the data set based on features of family history, it is vital that participating physicians record their clinical observations as accurately as possible.

Limitations and Sources of Error

Linkage analysis is an inherently error-prone approach. It is fairly easy to arrive at an incorrect conclusion because of the large number of assumptions that must be made in the calculation. For instance, calculation of LOD scores is dependent on accuracy of the linkage model.[60] The model is typically based on assumptions regarding mode of inheritance or frequency of mutant alleles in the population, which may be derived from segregation analysis conducted in one geographic region at one hospital, thus introducing potential bias. In addition, studies have shown that LOD score analysis of a small number of families is very sensitive to changes in a few key data points. Small errors in genotyping or misclassification of affected status have the potential to artificially inflate or deflate a particular LOD score. Consider, for example, age at onset. The linkage calculation will weigh the value of information provided by any given family member compared with that of every other person in the study. So if early-onset disease is a defining part of the phenotype, the genotyping data from an individual woman who was diagnosed with the disease at the age of 30 years will be weighted more in the linkage statistic than will data from a person who was diagnosed at 80 years. Therefore, it is vitally important that the clinician participating in the study obtain the most current and accurate information available from any patient who is likely to be a study participant, including the age at which various family members had cancer.

One additional problem with linkage calculations is the loss of power associated with missing data. In principle, an individual who is a heterozygote at two loci A and B (AaBb) could have received his or her A allele in coupling with either the B or the b allele from one parent. To distinguish recombinants from nonrecombinants, the parental and nonparental haplotypes must be known. If the A and B alleles were inherited together on the same chromosome, they are said to be in phase. Unfortunately, when mapping diseases such as cancer, in which late age of onset is common, two-generation families are typically all that are available for sampling, thus limiting the ability to determine phase. However, collection and analysis of data from spouses or offspring of deceased affected individuals may allow researchers to reassemble the genotype of the deceased individual. Similarly, collection of DNA samples from unaffected siblings can be very useful for establishing parental phase.

Nonparametric Analysis

Because calculation of LOD scores is dependent on models of linkage that are notoriously difficult to derive,[61] researchers are turning increasingly to nonparametric linkage (NPL), or nonmodel-based approaches for mapping genes.[53] Such approaches use only the data from affected individuals; therefore, no assumptions are made as to whether an unaffected person is more or less likely to have cancer and, if so, at what age. In an NPL analysis, haplotypes are built across the relevant regions of the genome by using computer programs such as GENEHUNTER or MERLIN. [53] [54] The data are compared, and P values are calculated to assess the degree of significance observed between inheritance of a disease state and a specific haplotype.

Nonparametric approaches have the disadvantage of being less powerful than LOD score–based approaches because data from unaffected individuals, which could have contributed to the LOD score, do not contribute to the NPL score. For diseases that are genetically heterogeneous, such as common cancers, this is more than made up for by the lack of reliance on incomplete or inaccurate linkage models.

One final consideration is that of age at diagnosis versus age at onset, which, depending on the disease and available diagnostics, can differ by several years. Diagnosis of prostate cancer by prostate-specific antigen (PSA) testing provides an interesting example. The widespread use of screening for prostate cancer by serum PSA measurements has dramatically changed the patterns of disease diagnosis in the United States.[62] Reported rates increased rapidly between 1986 and 1993, in part because of the detection of latent prostate tumors in the general population as a result of PSA screening.[63] It is generally believed that PSA can detect tumors from 2 to 5 years earlier than methods such as digital rectal examination. [64] [65] [66] Therefore, data from a man who was given a diagnosis of prostate cancer at age 65 in 1995 should contribute differently to a genome scan than data from a man given the diagnosis at the same age in 1975. The man who was given the diagnosis in 1975, assuming that he participated in screening, would probably have been given the diagnosis in his early sixties if he were alive today. Developing statistical methods to account for these differences can be extremely difficult.

It is also important to note the difference between untested and unaffected individuals. Patients are more likely to know that they are truly unaffected with a disease such as prostate cancer (as defined by PSA status) than they are likely to know they are truly unaffected with other cancers, such as pancreatic or ovarian cancer, for which vigilant screening is not the norm. For this reason, in some genome-wide scans for cancer susceptibility genes, clinical status of putatively unaffected individuals may be coded as “unknown” rather than “unaffected.” It is important that the interviewing physician make the distinction when recording any patient's family and medical history.

Positional Cloning Resources

Meiotic linkage studies may define a region of interest as small as a few thousand bases or as big as several million bases. The latter may span more than a hundred genes [67] [68] and must be further reduced before mutation scanning can realistically begin. Several strategies exist for narrowing the search. Among the most common is the search for genomic rearrangements in tumors, which may indicate chromosomal regions where cancer susceptibility genes are likely to be located. Expression arrays identify genes that are differentially expressed in tumors compared to matched normal tissue. Additionally, intriguing candidate genes within the region of interest may be selected for priority sequencing based on known biologic function or a previous association with cancer.

Comparative Genome Hybridization

Comparative genome hybridization (CGH) can be used to acquire information about gains and losses of chromosomal regions in tumors. [69] [70] CGH allows investigators to perform genome-wide analysis of DNA sequence copy number in a single tissue. Traditionally, differentially labeled genomic DNA from a “test” and a “reference” cell population are cohybridized to normal metaphase chromosome spreads. Regions of gain or loss of chromosomal segments, such as deletions, duplications, or amplifications, are seen as changes in the ratio of the intensities of the two fluorochromes. The procedure works because the ratio of fluorescence intensities along the length of the chromosome is proportional to the ratio of the copy numbers of the corresponding DNA sequences in the test and reference genomes at each point in the chromosome. More recent innovations with this technique allow investigators to circumvent the low resolution associated with metaphase spreads and very precisely determine DNA copy number by combining traditional CGH with arrays of bacterial artificial chromosomes (BACs) or oligonucleotides. [71] [72] Array CGH studies have contributed to the identification of the genetic alterations involved in cancer progression and metastasis and highlight the importance of studying tumors at various stages of progression.

Expression Arrays

Another method for refining linkage data before proceeding with candidate gene analysis is the use of expression arrays. Expression arrays analyze differences in gene expression on a large scale by assaying thousands of genes in one experiment. For one type, DNA microarrays, DNA sequences from the coding regions of known or putative genes are assayed with probes made from messenger RNA, which determines an expression profile of genes for a certain cell type or under specific experimental conditions. In terms of cancer genomics, normal cells may express different portions of the genome or different genes at different levels when compared with their neoplastic counterparts, which may indicate biologic networks or pathways involved in disease pathogenesis.[73] Microarray experiments have led to molecular classification of many cancer types according to differences in gene expression, including breast cancer,[74] lymphomas,[75] and soft-tissue tumors.[76] Integration of DNA microarray data with genetic mapping results will help to prioritize candidate genes in regions of known linkage. As single gene traits are defined and researchers’ interest turns increasingly to multigene traits, a combination of traditional and twenty-first century approaches will most likely define the genes of interest.[77]

Tissue Banks

One problem that is not infrequently encountered with CGH and expression array studies is lack of reproducibility across studies. This is due to both the limited number of tumors typically available for studies and the heterogeneity in the tumors themselves. One way to circumvent both problems is to develop tumor banks in which investigators can deposit well-characterized tissues for a variety of research purposes.[78] One potential complication is the rigor with which such banks must be maintained. It is important that complete pathologic records accompany each tissue and that the tissue deposited be as free as possible of adjacent noncancerous tissue. Toward this end, researchers have turned increasingly to the use of laser capture microdissection as a way to isolate virtually pure populations of tumor for CGH and expression array studies.[79]


Association studies are distinct from linkage analysis in that specific alterations in the DNA are assessed in both affected and unaffected individuals to determine whether the variant is found more often in individuals with the disease and whether this difference in the frequency of the variant allele is statistically significant. Association studies are widely used to assess the significance of variants in candidate genes and have recently begun to be used to identify susceptibility genes in genome-wide association studies.
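Testing whether a variant allele is more frequent in cases than in controls can be sketched with a chi-square test on a 2 × 2 table of allele counts. This is one of several possible tests, shown here as an illustration; the function name and the counts are hypothetical.

```python
import math

def allelic_chi_square(case_variant, case_other, control_variant, control_other):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts; returns (chi2, p)."""
    a, b, c, d = case_variant, case_other, control_variant, control_other
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs at least one observation")
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square with 1 df
    return chi2, p_value

# Hypothetical counts: 500 cases and 500 controls, two alleles per person
chi2, p = allelic_chi_square(case_variant=300, case_other=700,
                             control_variant=240, control_other=760)
print(f"chi-square = {chi2:.2f}, P = {p:.4f}")
```

Here the variant allele frequency is 30% in cases and 24% in controls; whether such a difference is meaningful also depends on study design, confounding, and the number of variants tested.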

Assessment of Candidate Genes

Once a candidate gene is proposed, association studies are important to determine whether sequence level changes in the gene are associated with the disease of interest. Candidate genes are identified in a variety of ways. For instance, the biologic function of a known gene might suggest a role in cancer susceptibility. Alternatively, the sequence of the gene might suggest that it is a member of a protein family that is known to play a role in cancer biology. Genes that are important in DNA repair, apoptosis, and cell cycle regulation are all likely candidates. Finally, the gene in question might be located at a locus identified by linkage analysis of high-risk families. Thus, it may be one of a large number of genes under consideration.

Genome-wide Association Studies

Since association studies have greater power to identify common genetic variability compared to linkage studies, the ability to conduct these experiments has been highly anticipated. Genome-wide SNP association studies have become more practical in the past few years owing to the information gained from the HapMap project (www.hapmap.org) and the development of technologies to genotype hundreds of thousands of SNPs in a single experiment. The International HapMap project aimed to identify genetic similarities and differences in human populations. The initial data were generated from four different populations with African, Asian, and European ancestry. These data allow for the identification of minimal sets of informative SNPs to tag variation throughout the genome for a given population. Also important for the advancement of genome-wide association studies was the development of high-throughput genotyping assays, which are cost-effective, accurate, and reproducible. Currently, the two leading high-throughput genotyping platforms are from Affymetrix and Illumina. The Affymetrix GeneChip microarrays (see Fig. 13-6 ) are available to genotype 10K, 100K, or 500K genome-wide SNPs in a single experiment. From Illumina, the Infinium HumanHap300 and HumanHap550 (see Fig. 13-7 ) are specifically designed for whole-genome genotyping of tag SNPs derived from the HapMap project. The study design and caveats for genome-wide association studies are similar to the candidate gene approach and will be discussed briefly in the following sections. However, adjusting for multiple testing is a particular concern for genome-wide association studies as the number of SNPs tested in a given experiment is large.
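The multiple-testing concern can be illustrated with a Bonferroni correction, one common and conservative adjustment (the text does not prescribe a specific method). The function names, SNP names, and P values below are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-SNP significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests

def significant_snps(p_values, alpha=0.05):
    """Return the names of SNPs whose P value passes the Bonferroni threshold."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [name for name, p in p_values.items() if p < cutoff]

# Threshold for a 500K-SNP array at a 5% family-wise error rate
threshold_500k = bonferroni_threshold(0.05, 500_000)
print(f"Per-SNP threshold for 500K SNPs: {threshold_500k:.1e}")

# Hypothetical results for four SNPs (names and P values are made up)
hits = significant_snps({"rsA": 0.04, "rsB": 0.001, "rsC": 0.2, "rsD": 0.012})
print(hits)
```

Because neighboring SNPs are often correlated through linkage disequilibrium, the tests are not independent, and a Bonferroni threshold tends to be stricter than necessary.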

Study Design

Two primary types of study design are typically used: cohort and case-control studies. In a cohort study, subjects are selected, individuals with the disease of interest (i.e., prevalent cases) are excluded, one or more exposures of interest are measured, and then the cohort is monitored over time to determine who develops cancer or the outcome(s) of interest and the degree to which the exposure is associated with disease incidence. In genetic epidemiology, the primary “exposure” is the genetic variant under consideration. This type of cohort study is prospective in nature, the health outcomes (presence or absence of cancer over time) occurring after the enrollment of study subjects. Exposure is measured at baseline, when the cohort is initially established, and may be updated over the period of follow-up for those exposures that may change over time (obviously, the germline variant or variants a given individual carries do not change over time). The advantages of cohort studies include minimized information and selection biases and the ability to directly calculate disease incidence in exposed and unexposed groups and thus the relative risk (RR) and absolute risk (attributable risk, AR). RR is the risk of developing a disease given a particular exposure and is calculated as the incidence of cancer among a set of individuals with a particular genotype, divided by the incidence of cancer among a set of individuals who do not carry that genotype. Disadvantages include the fact that prospective cohort studies are expensive and time-consuming and large numbers of study subjects are typically required to obtain sufficient numbers of outcomes (e.g., cancer cases) to have adequate power to determine associations. Loss of subjects to long-term follow-up over time is also an issue because it may affect the ability to draw conclusions.

Cohort studies may also be retrospective in nature, when the exposure and subsequent development of the disease occur before the study begins. For the purpose of finding associations between disease status and genotype, retrospective cohorts depend on existing medical records to identify all cancer diagnoses in the population of interest during a specified time period. As such, if any of the medical information documenting either the exposure(s) or the disease is not completely accurate, then bias or error would be introduced into the study.

Case-control studies differ from cohort studies in that the selection of subjects is based on their disease status. Case-control studies have the potential to examine multiple risk factors simultaneously. Two very popular case-control designs are population-based and hospital-based. For each, it is important to select case patients and control subjects who are similar with respect to recognized confounding factors such as age, sex, race, and ethnic background. [80] [81] [82] [83] [84] [85] Another critical factor for study design is that the controls must be selected from the same underlying population from which the cases were ascertained. The purpose of the control group is to provide valid data on exposure prevalence (e.g., distribution of genotype) in the same population from which cases were accrued.

Population-based case-control studies draw on a well-defined source population such as a particular geographic region defined by state, county, or city for ascertainment of both case patients and unaffected control subjects. Popular mechanisms include use of cancer registries, such as Surveillance, Epidemiology, and End Results (SEER), or health care provider databases. Control subjects should be selected from the same source population or geographic region by a method designed to randomly sample individuals, such as random-digit telephone dialing.[86] Selection bias, in which selection of case patients, control subjects, or both is influenced by prior exposures, is a particular concern in case-control studies. [83] [84] [85] [87] [88] Proper design of the selection process can help to reduce this problem. Multiple studies have shown, for instance, that nonparticipants in such studies are more likely to smoke than are individuals who agree to participate.[88] Thus, there is concern that participants might be more health conscious than nonparticipants.

In comparison, hospital-based case-control studies enlist a sequential series of patients who are admitted to the hospital or clinic during a specific period. Case patients are enrolled because they have the cancer of interest, whereas control subjects are determined to be cancer-free, although they may be patients at the same clinic or hospital for unrelated reasons. Significantly more potential for bias exists in hospital-based case-control studies. Depending on the clinic or hospital from which patients are drawn, disease presentation, severity, and treatment outcome may be nonrandom among study subjects. Often, cases are drawn from so-called high-risk clinics. In such situations, both case patients and control subjects might be more likely to carry a specific genotype. They might have been referred to a high-risk clinic because of their family history status. A particular concern is that hospital- or clinic-based controls might not accurately reflect the frequency of the exposure in the underlying population from which cases were ascertained. Thus, it might be difficult to generalize results from a hospital-based case-control study to the general population.

Confounders and Sources of Bias

Confounders are factors that are associated with both the exposure and the disease and are outside of the causal pathway of the exposure. Age is a frequent confounder, as it is often associated with the frequency and level of exposures and many diseases increase in incidence with advancing age. In the study of genetic risk factors and disease, an important confounder is ethnic background. For example, in a study of the association between human leukocyte antigen (HLA) genotypes and cervical cancer, ethnic background must be addressed as a potential confounder because HLA genotypes can vary by ethnic background and lifestyle factors that contribute to disease can be associated with ethnic differences.[89] When the association between HLA genotype and cancer was calculated, ethnic background was adjusted to minimize the potential for bias caused by this confounder.[89] In addition to statistical adjustment for ethnic background, exposure-disease associations can also be examined and reported separately according to ethnic background.

Another source of bias that is specific to case-control studies and retrospective cohorts is recall bias.[90] Study subjects might find it difficult to correctly remember information related to subjective factors that are potentially important in disease susceptibility such as those related to environmental or lifestyle factors, such as diet, smoking, exercise, and stress levels.[91] Recall bias can lead to incorrect or incomplete measurement of exposures and potential confounders. This introduces error into the calculation of the association between the exposure and the disease because of the inability to fully adjust for the effects of confounders.

SNP Genotyping and Association Studies

In an ideal situation, collection of blood samples from all eligible study subjects, both cases and controls, is undertaken as part of any genetic epidemiology study. The resulting DNA samples can then be used to test for association of candidate genes and features of disease. Several types of low- to mid-throughput assays are frequently used to genotype SNPs, including direct sequencing, Sequenom, TaqMan and SNPlex (both by Applied Biosystems), as well as custom Illumina GoldenGate bundles and many others. Most available assays start with small quantities of DNA and use polymerase chain reactions to amplify specific regions of DNA before querying the different SNP alleles and calling the genotypes.

SNPs that introduce nonsense changes, causing premature termination of the protein, and SNPs that alter a key amino acid sequence in the encoded protein are the focus of much of the current literature with regard to nearly all cancers.[92] Other SNPs of interest change single amino acids in key protein motifs or occur in splice domains. Much less well understood are SNPs that are found to be in association with the disease state but located in noncoding regions of the gene. [93] [94] [95] Association studies are observational in nature and define genotypes or exposures that are simply “associated” with a particular disease, but they might not actually cause the disease of interest. In such situations, the SNP is thought to be in linkage disequilibrium with an as yet unidentified disease-causing mutation.

SNPs within the vitamin D receptor gene provide a set of interesting examples. Single base changes in intron 8 and within exon 9 affect recognition sites for the restriction enzymes BsmI and TaqI. Neither variant apparently affects the resulting protein sequence, but both have been associated with prostate cancer risk. [93] [95] Additionally, variation in the 3′ untranslated region of the poly A tail is similarly associated with prostate cancer risk,[94] but the alteration appears to have no obvious effect on mRNA stability.[96] All three polymorphisms have been reported to be in at least partial association with one another in a subset of studies and, in all likelihood, serve as markers for an as yet undiscovered disease-causing variant. [96] [97] In addition to these variants, some investigators report a variant in exon 2, which creates a FokI restriction enzyme site and results in a new start codon for the protein, generating a transcript with three additional amino acids.[98] It is unknown whether this change itself is disease-associated or whether it, too, is simply a marker for an unknown variant. Functional studies are needed to test the role of all of these variants on protein function.

Relative Risks

Results of an association study are evaluated by calculation of a RR or an odds ratio (OR), which is an estimate of the RR derived from a case-control study. Values greater than 1.0 indicate that the exposure (in this case, a particular genotype) is associated with an increased risk of the cancer under consideration. In comparison, values less than 1.0 indicate a decreased risk for the disease associated with that genotype. In cohort studies, RR can be calculated directly as the likelihood of developing cancer among a set of individuals with a particular genotype, divided by the likelihood of developing cancer among a set of individuals who do not carry that genotype. Thus, the RR is the risk or probability of developing cancer in the defined source population, given a particular genotype.

A different statistical method is used for assessing risk in case-control studies, because subjects are selected according to disease status and not exposure, as they would be in a cohort study. The OR is an estimation of the RR and is calculated as the odds of exposure for cases divided by the odds of exposure for controls. For both point estimates of the RR and OR, 95% confidence intervals are calculated to determine statistical significance of the risk estimate. For example, at the α = 0.05 level, statistically significant risk estimates are those in which the 95% confidence interval excludes 1.0, that is, the null hypothesis of no association. For a more stringent test, 99% confidence intervals may be calculated.
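The RR and OR calculations, together with a 95% confidence interval for the OR, can be sketched as follows. The interval uses the standard log-scale (Woolf) approximation, which is one common choice and an assumption here; the function names and all counts are hypothetical.

```python
import math

def relative_risk(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Cohort design: incidence in carriers divided by incidence in noncarriers."""
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)

def odds_ratio_with_ci(case_exp, case_unexp, ctrl_exp, ctrl_unexp, z=1.96):
    """Case-control design: OR with a confidence interval computed on the log scale."""
    odds_ratio = (case_exp / case_unexp) / (ctrl_exp / ctrl_unexp)
    se_log_or = math.sqrt(1 / case_exp + 1 / case_unexp + 1 / ctrl_exp + 1 / ctrl_unexp)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical cohort: 30 of 1000 carriers and 15 of 1000 noncarriers develop cancer
rr = relative_risk(30, 1000, 15, 1000)

# Hypothetical case-control study: 60 of 200 cases and 40 of 200 controls carry the genotype
or_, lo, hi = odds_ratio_with_ci(60, 140, 40, 160)
significant = lo > 1.0 or hi < 1.0   # the interval excludes the null value of 1.0

print(f"RR = {rr:.2f}")
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f}), significant: {significant}")
```

Passing `z=2.576` instead of the default gives the more stringent 99% interval mentioned above.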

Logistic regression is often used to calculate ORs when multiple potential confounding factors are expected to affect the risk of the disease. Such calculations are made to account for the contribution of other disease-associated factors that are also correlated with the exposure of interest. If high-quality information about other characteristics of subjects is collected, data sets can be stratified by age, family history, or clinical features of the disease before calculation of the association between genotype and disease, as a way of identifying important modifiers of risk in relation to genotype.
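The effect of stratifying on a confounder can be illustrated with a Mantel-Haenszel pooled OR. This is a simpler stratified estimator swapped in for illustration, not the logistic regression described above; the function names and the counts in the two hypothetical strata (for example, two ancestry groups or age bands) are made up.

```python
def mantel_haenszel_or(strata):
    """Pooled OR across strata; each stratum is (case_exp, case_unexp, ctrl_exp, ctrl_unexp)."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these counts")
    return numerator / denominator

def crude_or(strata):
    """OR from the single 2x2 table obtained by ignoring the strata."""
    a, b, c, d = (sum(column) for column in zip(*strata))
    return (a * d) / (b * c)

# Hypothetical strata: the genotype is rarer and cases are fewer in the first group
strata = [
    (10, 90, 30, 270),    # stratum 1: OR within stratum is 1.0
    (150, 150, 50, 50),   # stratum 2: OR within stratum is 1.0
]
crude = crude_or(strata)
adjusted = mantel_haenszel_or(strata)
print(f"Crude OR = {crude:.2f}, stratum-adjusted OR = {adjusted:.2f}")
```

In this constructed example the genotype is unrelated to disease within each stratum, yet the pooled table suggests an association, which is the pattern that adjustment for a confounder is meant to remove.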


Patients frequently approach clinicians with questions about genetic-testing opportunities for specific cancers. If appropriate tests are available, identifying a person with increased risk for a particular cancer is useful for at least three reasons. First, it can suggest a particular clinical course that will reduce the chance of having cancer, such as treatment with tamoxifen or prophylactic surgery for women who are at risk for hereditary breast cancer. [99] [100] [101] Second, it can induce patients at risk to undergo more vigilant screening, such as frequent colonoscopy examinations for patients who are at risk for colon cancer. Finally, an individual's quality of life can sometimes be improved by more precise knowledge of the risk for disease or recurrence. In addition, such information is frequently sought by unaffected individuals who perceive themselves to be at increased risk.

The National Society of Genetic Counselors defines genetic counseling as “the process of helping people understand and adapt to the medical, psychological, and familial implications of the genetic contributions to disease.”[102] Genetic testing should therefore be offered only in consultation with certified genetic counselors, who serve to (1) help patients comprehend the medical facts and risks associated with their disease, (2) help patients understand their alternatives for dealing with both risk of disease and recurrence, (3) help patients choose a clinical course that best meets their needs, and (4) provide support and guidance for patients experiencing difficulty in dealing with unexpected results. Patients often approach genetic testing with strong preconceived notions about the likelihood that they either do or do not have an inherited mutation. Thus, “unexpected” is likely to apply to both carriers and noncarriers.

In advising patients whether it is appropriate to consider genetic testing, it is important to remember that many currently available tests have limitations. Clinical validity is the term used to describe the predictive value of a test for clinical outcomes.[103] It is affected by both the sensitivity and the specificity of the test, as well as by a host of factors that are beyond laboratory control, such as penetrance of the mutant allele. Penetrance may itself be a function of genetic background, environmental exposures, or both. Most mutations in cancer susceptibility genes are not fully penetrant. There will therefore be a small group of individuals who carry protein-truncating mutations in a gene associated with a particular cancer but who will not develop that cancer, even if they live into their eighties. Helping patients to understand these concepts can be difficult.
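
How sensitivity, specificity, and penetrance combine can be shown with a small calculation. The sketch below is a deliberately simplified model, not a validated risk tool: it assumes that true carriers develop cancer with probability equal to the penetrance and that people with false-positive results face only the baseline population risk. All input values are hypothetical.

```python
def positive_predictive_value(sensitivity, specificity, carrier_frequency):
    """Probability that a person with a positive test truly carries the mutation (Bayes' rule)."""
    true_positives = sensitivity * carrier_frequency
    false_positives = (1 - specificity) * (1 - carrier_frequency)
    return true_positives / (true_positives + false_positives)


def risk_given_positive_test(ppv, penetrance, baseline_risk):
    """Probability of developing cancer after a positive test under the simplified model."""
    return ppv * penetrance + (1 - ppv) * baseline_risk


# Hypothetical values: 95% sensitivity, 99.9% specificity, 1% carrier frequency,
# 60% penetrance in carriers, and 10% baseline risk in noncarriers.
ppv = positive_predictive_value(0.95, 0.999, 0.01)
risk = risk_given_positive_test(ppv, penetrance=0.60, baseline_risk=0.10)
print(f"P(carrier | positive test) = {ppv:.3f}")
print(f"P(cancer | positive test)  = {risk:.3f}")
```

Even with a highly accurate laboratory test, incomplete penetrance keeps the predicted risk of disease well below certainty, which is the point patients often find hardest to accept.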

One additional concern is what to tell patients who do not have obvious protein-truncating mutations but who do carry missense changes in the coding region of a cancer susceptibility gene. Again, particularly interesting examples are provided by the BRCA1 gene. More than 300 independent missense changes have been reported for BRCA1 to date.[104] Disease association status is known for only a fraction of these, such as those occurring in the RING finger domain[105] and the C-terminal region of the protein. [106] [107] In the case of RING finger mutations, the association with disease is supported by the existence of dozens of families in which RING finger mutations closely segregate with disease state. [108] [109] Other single amino acid changes are known to be inconsequential polymorphisms that clearly do not affect protein function. For instance, some 40% of the population is heterozygous for the substitution of leucine for proline at position 871, as reported in the canonical sequence. Both residues are hydrophobic, and the position is not well conserved evolutionarily, so this is likely to be an inconsequential polymorphism in the gene. Of particular concern, rather, is the large number of missense changes reported in patients with breast cancer for which disease association status is unknown. Phylogenetic analysis provides some insight as to which are likely to be important,[110] and functional assays are useful for testing mutations in some regions of the gene. [107] [111] However, at this time, little guidance is available for most patients carrying such changes.


The sequence of the human genome has been referred to as an “instruction book for human biology.”[112] Locked within the sequence of each individual's DNA is the genetic code necessary to develop a complete and healthy individual, but encoded as well is the sequence-level variation that will determine each person's susceptibility to a host of diseases. Understanding this variation is central to the field of genomic medicine. A more complete understanding of the molecular pathways involved in cancer susceptibility will suggest avenues for the development of methods of both diagnosis and treatment. Identification of specific genes offers the promise of genetic testing to individuals who are at risk, as well as the hope for targeted therapeutics. Finally, understanding the specific variation offers the promise of twenty-first century “personalized medicine” in which lifestyle, diet, and preventive therapies come together to offer patients a full spectrum of choices for maintaining their personal health.

It is clear that the Human Genome Project has had and will continue to have an effect on human health and biology.[112] What remains to be seen is the rate at which the successes of the Human Genome Project will move from bench to bedside. In a sense, that rate will be determined by practicing physicians. Knowledge of the underlying principles of genetic analysis is fundamental for today's practicing clinician. The ability to accurately record family history and medical record data affects the integrity of all subsequent studies for which those data are used. An understanding by physicians of the findings generated through both association studies and family-based linkage studies is key to both moving research forward and prioritizing new hypotheses for researchers to consider. Finally, as twentieth century–born patients struggle to make personal health care choices in the twenty-first century, communicating what genomic medicine has to offer is a complicated task at which every physician must now excel.


  1. Knudson AG: Chasing the cancer demon. Annu Rev Genet 2000; 34:1-19.
  2. Goldgar DE, Easton DF, Cannon-Albright LA, Skolnick MH: Systematic population-based assessment of cancer risk in first-degree relatives of cancer probands. J Natl Cancer Inst 1994; 86:1600-1608.
  3. Easton D, Peto J: The contribution of inherited predisposition to cancer incidence. Cancer Surv 1990; 9:395-416.
  4. Carter BS, Beaty TH, Steinberg GD, et al: Mendelian inheritance of familial prostate cancer. Proc Natl Acad Sci USA 1992; 89:3367-3371.
  5. Schaid DJ, McDonnell SK, Blute ML, Thibodeau SN: Evidence for autosomal dominant inheritance of prostate cancer. Am J Hum Genet 1998; 62:1425-1438.
  6. Grönberg H, Damber L, Damber J-E, Iselius L: Segregation analysis of prostate cancer in Sweden: support for dominant inheritance. Am J Epidemiol 1997; 146:552-557.
  7. Claus EB, Risch N, Thompson WD: Genetic analysis of breast cancer in the cancer and steroid hormone study. Am J Hum Genet 1991; 48:232-242.
  8. Strong LC, Amos C: Inherited susceptibility. In: Schottenfeld D, Fraumeni JF, ed. Cancer Epidemiology and Prevention, 2nd ed. New York: Oxford University Press; 1996:559-583.
  9. Ross R, Schottenfeld D: Prostate cancer. In: Schottenfeld D, Fraumeni JF, ed. Cancer Epidemiology and Prevention, 2nd ed. New York: Oxford University Press; 1996:1180-1226.
  10. Steinberg GD, Carter BS, Beaty TH, et al: Family history and the risk of prostate cancer. Prostate 1990; 17:337-347.
  11. Spitz MR, Currier RD, Fueger JJ, et al: Familial patterns of prostate cancer: a case control analysis. J Urol 1991; 146:1305-1307.
  12. Whittemore A, Wu A, Kolonel L, et al: Family history and prostate cancer risk in black, white, and Asian men in the United States and Canada. Am J Epidemiol 1995; 141:732-740.
  13. Hayes RB, Liff JM, Pottern LM, et al: Prostate cancer risk in U.S. blacks and whites with a family history of cancer. Int J Cancer 1995; 60:361-364.
  14. Ghadirian P, Howe GR, Hislop TG, Maisonneuve P: Family history of prostate cancer: a multi-center case-control study in Canada. Int J Cancer 1997; 70:679-681.
  15. Cerhan JR, Parker AS, Putnam SD, et al: Family history and prostate cancer risk in a population-based cohort of Iowa men. Cancer Epidemiol Biomarkers Prev 1999; 8:53-60.
  16. Ostrander EA, Stanford JL: Genetics of prostate cancer: too many loci, too few genes. Am J Hum Genet 2000; 67:1367-1375.
  17. Ostrander EA, Kwon EM, Stanford JL: Genetic susceptibility to aggressive prostate cancer. Cancer Epidemiol Biomarkers Prev 2006; 15:1761-1764.
  18. Tollefson MK, Leibovich BC, Slezak JM, et al: Long-term prognostic significance of primary Gleason pattern in patients with Gleason score 7 prostate cancer: impact on prostate cancer specific survival. J Urol 2006; 175:547-551.
  19. Chan TY, Partin AW, Walsh PC, Epstein JI: Prognostic significance of Gleason score 3+4 versus Gleason score 4+3 tumor at radical prostatectomy. Urology 2000; 56:823-827.
  20. Smith JR, Freije D, Carpten JD, et al: Major susceptibility locus for prostate cancer on chromosome 1 suggested by a genome-wide search. Science 1996; 274:1371-1374.
  21. Gibbs M, Stanford JL, McIndoe RA, et al: Evidence for a rare prostate cancer-susceptibility locus at chromosome 1p36. Am J Hum Genet 1999; 64:776-787.
  22. Hall JM, Lee MK, Newman B, et al: Linkage of early onset familial breast cancer to chromosome 17q21. Science 1990; 250:1684-1689.
  23. Wooster R, Neuhausen SL, Mangion J, et al: Localization of a breast cancer susceptibility gene, BRCA2, to chromosome 13q12-13. Science 1994; 265:2088-2090.
  24. Peto J, Collins N, Barfoot R, et al: Prevalence of BRCA1 and BRCA2 gene mutations in patients with early-onset breast cancer. J Natl Cancer Inst 1999; 91:943-949.
  25. Miki Y, Swensen J, Shattuck-Eidens D, et al: A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science 1994; 266:66-71.
  26. Presciuttini S, Strigini P: Genetic epidemiology of colorectal cancer. Tumori 1996; 82:107-113.
  27. Sellers TA, Chen PL, Potter JD, et al: Segregation analysis of smoking-associated malignancies: evidence for Mendelian inheritance. Am J Med Genet 1994; 52:308-314.
  28. Malmer B, Iselius L, Holmberg E, et al: Genetic epidemiology of glioma. Br J Cancer 2001; 84:429-434.
  29. Banke MG, Mulvihill JJ, Aston CE: Inheritance of pancreatic cancer in pancreatic cancer-prone families. Med Clin North Am 2000; 84:677-690, x–xi.
  30. Jarvik GP, Stanford JL, Goode EL, et al: Confirmation of prostate cancer susceptibility genes using high risk families. Monogr Natl Cancer Inst 1999; 26:81-88.
  31. Moisio AL, Sistonen P, Weissenbach J, et al: Age and origin of two common MLH1 mutations predisposing to hereditary colon cancer. Am J Hum Genet 1996; 59:1243-1251.
  32. Neuhausen SL: Ethnic differences in cancer risk resulting from genetic variation. Cancer 1999; 86(suppl 8):1755-1762.
  33. Thorlacius S, Olafsdottir G, Tryggvadottir L, et al: A single BRCA2 mutation in male and female breast cancer families from Iceland with varied cancer phenotypes. Nat Genet 1996; 13:117-119.
  34. Struewing JP, Abeliovich D, Peretz T, et al: The carrier frequency of the BRCA1 185delAG mutation is approximately 1 percent in Ashkenazi Jewish individuals. Nat Genet 1995; 11:198-200.
  35. Neuhausen S, Gilewski T, Norton L, et al: Recurrent BRCA2 6174delT mutations in Ashkenazi Jewish women affected by breast cancer. Nat Genet 1996; 13:126-128.
  36. Oddoux C, Struewing JP, Clayton CM, et al: The carrier frequency of the BRCA2 6174delT mutation among Ashkenazi Jewish individuals is approximately 1%. Nat Genet 1996; 14:188-190.
  37. Roa BB, Boyd AA, Volcik K, Richards CS: Ashkenazi Jewish population frequencies for common mutations in BRCA1 and BRCA2. Nat Genet 1996; 14:185-187.
  38. Schleutker J, Matikainen M, Smith J, et al: A genetic epidemiological study of hereditary prostate cancer (HPC) in Finland: frequent HPCX linkage in families with late-onset disease. Clin Cancer Res 2000; 6:4810-4815.
  39. Morgan TH: The Theory of Genes, New Haven, CT, Yale University Press, 1928.
  40. Morgan TH: Random segregation versus coupling in Mendelian inheritance. Science 1911; 34:384.
  41. Botstein D, White RL, Skolnick M, Davis RW: Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 1980; 32:314-331.
  42. Ott J: Genetic loci and genetic polymorphisms. In: Analysis of Human Genetic Linkage, 3rd ed. Baltimore: Johns Hopkins University Press; 1999:24-36.
  43. Hamada H, Kakunaga T: Potential Z-DNA forming sequences are highly dispersed in the human genome. Nature 1982; 298:396-398.
  44. Miesfeld R, Krystal M, Arnheim N: A member of a new repeated sequence family which is conserved throughout eucaryotic evolution is found between the human delta and beta globin genes. Nucleic Acids Res 1981; 9:5931-5947.
  45. Stallings RL, Ford AF, Nelson D, et al: Evolution and distribution of (GT)n repetitive sequences in mammalian genomes. Genomics 1991; 10:807-815.
  46. Ostrander EA, Sprague Jr GF, Rine J: Identification and characterization of dinucleotide repeat (CA)n markers for genetic mapping in dog. Genomics 1993; 16:207-213.
  47. Kwiatkowski DJ, Henske EP, Weimer K, et al: Construction of a GT polymorphism map of human 9q. Genomics 1992; 12:229-240.
  48. Weber JL, May PE: Abundant class of human DNA polymorphisms which can be typed using the polymerase chain reaction. Am J Hum Genet 1989; 44:388-396.
  49. Litt M, Luty JA: A hypervariable microsatellite revealed by in vitro amplification of a dinucleotide repeat within the cardiac muscle actin gene. Am J Hum Genet 1989; 44:397-401.
  50. Broman KW, Murray JC, Sheffield VC, et al: Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet 1998; 63:861-869.
  51. Kruglyak L, Nickerson DA: Variation is the spice of life. Nat Genet 2001; 27:234-236.
  52. Schaid DJ, Guenther JC, Christensen GB, et al: Comparison of microsatellites versus single-nucleotide polymorphisms in a genome linkage scan for prostate cancer-susceptibility loci. Am J Hum Genet 2004; 75:948-965.
  53. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES: Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet 1996; 58:1347-1363.
  54. Abecasis G, Cherny S, Cookson W, Cardon L: Merlin: rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet 2002; 30:97-101.
  55. O'Connell JR, Weeks DE: PedCheck: A program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 1998; 63:259-266.
  56. Sun L, Wilder K, McPeek MS: Enhanced pedigree error detection. Hum Hered 2002; 54:99-110.
  57. Epstein MP, Duren WL, Boehnke M: Improved inference of relationship for pairs of individuals. Am J Hum Genet 2000; 67:1219-1231.
  58. Morton N: Sequential tests for the detection of linkage. Am J Hum Genet 1955; 7:277-318.
  59. Lander E, Kruglyak L: Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat Genet 1995; 11:241-247.
  60. Risch N, Giuffra L: Model misspecification and multipoint linkage analysis. Hum Hered 1992; 42:77-92.
  61. Clerget-Darpoux F, Bonaiti-Pellie C, Hochez J: Effects of misspecifying genetic parameters in lod score analysis. Biometrics 1986; 42:393-399.
  62. Potosky AL, Miller BA, Albertsen PC, Kramer BS: The role of increasing detection in the rising incidence of prostate cancer. JAMA 1995; 273:548-552.
  63. Stanford J, Stephenson R, Coyle L, et al: Prostate cancer trends 1973–1995. SEER Program, National Cancer Institute, Vol. 99–4543. Bethesda, MD: National Institutes of Health; 1999:7-16.
  64. Gann PH, Hennekens CH, Stampfer MJ: A prospective evaluation of plasma prostate-specific antigen for detection of prostatic cancer. JAMA 1995; 273:289-294.
  65. Pearson J, Luderer A, Metter E, et al: Longitudinal analysis of serial measurement of free and total PSA among men with and without prostatic cancer. Urology 1996; 48(suppl 6A):4-9.
  66. Whittemore AS, Lele C, Friedman GD, et al: Prostate-specific antigen as predictor of prostate cancer in black men and white men. J Natl Cancer Inst 1995; 87:354-360.
  67. Venter JC, Adams MD, Myers EW, et al: The sequence of the human genome. Science 2001; 291:1304-1351.
  68. Lander ES, Linton LM, Birren B, et al: Initial sequencing and analysis of the human genome. Nature 2001; 409:860-921.
  69. Kallioniemi A, Kallioniemi OP, Sudar D, et al: Comparative genomic hybridization for molecular cytogenetic analysis of solid tumours. Science 1992; 258:818-821.
  70. Kallioniemi OP, Kallioniemi A, Piper J, et al: Optimizing comparative genomic hybridization for analysis of DNA sequence copy number changes in solid tumors. Genes Chromosomes Cancer 1994; 10:231-243.
  71. Pinkel D, Segraves R, Sudar D, et al: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998; 20:207-211.
  72. Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006; 14:139-148.
  73. Nelson PS, Stanford JL, Ostrander EA: Prostate cancer research in the post-genome era. Epidemiol Rev 2001; 23:187-190.
  74. Perou CM, Jeffrey SS, van de Rijn M, et al: Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc Natl Acad Sci USA 1999; 96:9212-9217.
  75. Alizadeh AA, Ross DT, Perou CM, van de Rijn M: Towards a novel classification of human malignancies based on gene expression patterns. J Pathol 2001; 195:41-52.
  76. Nielsen TO, West RB, Linn SC, et al: Molecular characterisation of soft tissue tumours: a gene expression study. Lancet 2002; 359:1301-1307.
  77. Collins FS: Positional cloning moves from perditional to traditional. Nat Genet 1995; 9:347-350.
  78. Grizzle WE, Aamodt R, Clausen K, et al: Providing human tissues for research: how to establish a program. Arch Pathol Lab Med 1998; 122:1065-1076.
  79. Craven RA, Banks RE: Laser capture microdissection and proteomics: possibilities and limitation. Proteomics 2001; 1:1200-1204.
  80. Greenland S: Response and follow-up bias in cohort studies. Am J Epidemiol 1977; 106:184-187.
  81. Criqui MH: Response bias and risk ratios in epidemiologic studies. Am J Epidemiol 1979; 109:394-399.
  82. Criqui MH, Austin M, Barrett-Connor E: The effect of nonresponse on risk ratios in a cardiovascular disease study. J Chronic Dis 1979; 32:633-638.
  83. Heilbrun LK, Nomura A, Stemmermann GN: The effects of nonresponse in a prospective study of cancer. Am J Epidemiol 1982; 116:353-363.
  84. Bergstrand R, Vedin A, Wilhelmsson C, Wilhelmsen L: Bias due to non-participation and heterogenous sub-groups in population surveys. J Chronic Dis 1983; 36:725-728.
  85. Benfante R, Reed D, MacLean C, Kagan A: Response bias in the Honolulu Heart Program. Am J Epidemiol 1989; 130:1088-1100.
  86. Waksberg J: Sampling methods for random digit dialing. J Am Stat Assoc 1978; 73:40-46.
  87. Wilhelmsen L, Ljungberg S, Wedel H, Werko L: A comparison between participants and nonparticipants in a primary preventive trial. J Chronic Dis 1976; 29:331-339.
  88. Carter WB, Elward K, Malmgren J, et al: Participation of older adults in health programs and research: a critical review of the literature. Gerontologist 1991; 31:584-592.
  89. Maciag PC, Schlecht NF, Souza PS, et al: Major histocompatibility complex class II polymorphisms and risk of cervical cancer and human papillomavirus infection in Brazilian women. Cancer Epidemiol Biomarkers Prev 2000; 9:1183-1191.
  90. Barry D: Differential recall bias and spurious associations in case/control studies. Stat Med 1996; 15:2603-2616.
  91. Smith-Warner SA, Spiegelman D, Yaun SS, et al: Intake of fruits and vegetables and risk of breast cancer: a pooled analysis of cohort studies. JAMA 2001; 285:769-776.
  92. Webb T: SNPs: Can genetic variants control cancer susceptibility? J Natl Cancer Inst 2002; 94:476-478.
  93. Ingles SA, Coetzee GA, Ross RK, et al: Association of prostate cancer with vitamin D receptor haplotypes in African-Americans. Cancer Res 1998; 58:1620-1623.
  94. Ingles SA, Ross RK, Yu MC, et al: Association of prostate cancer risk with genetic polymorphisms in vitamin D receptor and androgen receptor. J Natl Cancer Inst 1997; 89:166-170.
  95. Taylor JA, Hirvonen A, Watson M, et al: Association of prostate cancer with vitamin D receptor gene polymorphism. Cancer Res 1996; 56:4108-4110.
  96. Durrin LK, Haile RW, Ingles SA, Coetzee GA: Vitamin D receptor 3′-untranslated region polymorphisms: lack of effect on mRNA stability. Biochim Biophys Acta 1999; 1453:311-320.
  97. Ingles SA, Haile RW, Henderson BE, et al: Strength of linkage disequilibrium between two vitamin D receptor markers in five ethnic groups: implications for association studies. Cancer Epidemiol Biomarkers Prev 1997; 6:93-98.
  98. Gross C, Eccleshall TR, Malloy PJ, et al: The presence of a polymorphism at the translation initiation site of the vitamin D receptor gene is associated with low bone mineral density in postmenopausal Mexican-American women. J Bone Miner Res 1996; 11:1850-1855.
  99. King MC, Wieand S, Hale K, et al: Tamoxifen and breast cancer incidence among women with inherited mutations in BRCA1 and BRCA2: National Surgical Adjuvant Breast and Bowel Project (NSABP-P1) Breast Cancer Prevention Trial. JAMA 2001; 286:2251-2256.
  100. Haffty BG, Harrold E, Khan AJ, et al: Outcome of conservatively managed early-onset breast cancer by BRCA1/2 status. Lancet 2002; 359:1471-1477.
  101. van Roosmalen MS, Verhoef LC, Stalmeier PF, et al: Decision analysis of prophylactic surgery or screening for BRCA1 mutation carriers: a more prominent role for oophorectomy. J Clin Oncol 2002; 20:2092-2100.
  102. Resta R, Biesecker B, Bennett R, et al: A New Definition of Genetic Counseling: The National Society of Genetic Counselors' Task Force Report. J Genet Couns 2006; 15:77-83.
  103. Grann VR, Jacobson JS: Population screening for cancer-related germline gene mutations. Lancet Oncol 2002; 3:341-348.
  104. Szabo C, Masiello A, Ryan JF, Brody LC: The breast cancer information core: database design, structure, and scope. Hum Mutat 2000; 16:123-131.
  105. Brzovic PS, Meza JE, King MC, Klevit RE: BRCA1 RING domain cancer-predisposing mutations. Structural consequences and effects on protein-protein interactions. J Biol Chem 2001; 276:41399-41406.
  106. Monteiro AN, August A, Hanafusa H: Evidence for a transcriptional activation function of BRCA1 C-terminal region. Proc Natl Acad Sci USA 1996; 93:13595-13599.
  107. Vallon-Christersson J, Cayanan C, Haraldsson K, et al: Functional analysis of BRCA1 C-terminal missense mutations identified in breast and ovarian cancer families. Hum Mol Genet 2001; 10:353-360.
  108. Serova O, Montagna M, Torchard D, et al: A high incidence of BRCA1 mutations in 20 breast-ovarian cancer families. Am J Hum Genet 1996; 58:42-51.
  109. Szabo CI, King MC: Inherited breast and ovarian cancer. Hum Mol Genet 1995; 4:1811-1817.
  110. Fleming MA, Potter JD, Ramirez CJ, et al: Understanding missense mutations in the BRCA1 gene: an evolutionary approach. Proc Natl Acad Sci USA 2003; 100:1151-1156.
  111. Hayes F, Cayanan C, Barilla D, Monteiro AN: Functional assay for BRCA1: mutagenesis of the COOH-terminal region reveals critical residues for transcription activation. Cancer Res 2000; 60:2411-2418.
  112. Collins FS, McKusick VA: Implications of the Human Genome Project for medical science. JAMA 2001; 285:540-544.