1. Stanford University School of Medicine; 2. Broad Institute of MIT and Harvard
For a purebred dog and, on occasion, even a first-generation hybrid, breed-inference services often just confirm what the dog owner already knows. Sometimes a pedigree is available, tracing back through multiple generations of purebred ancestors and providing essentially complete information about the pet’s ancestry. In other cases, an owner’s extensive experience leads to correct intuition that ears that floppy together with a nose that keen must indicate complete or near-complete beagle ancestry. By contrast, when applied to investigate the ancestry of a mutt, DNA-based inference often yields surprising conclusions.
DNA-based ancestry inference can have great practical value. It can be used to settle family debates about a beloved mutt’s ancestry and can offer at least the prospect of protecting a pet’s health. For example, the finding that a mutt has ancestry from a breed known to have a high risk of cancer could prompt more frequent screening for tumors.
As with any emerging method, though, ancestry inference is not without its challenges and uncertainties. Here, we provide background potentially useful for those considering an ancestry service or struggling to interpret surprising results. We begin by discussing the biological processes that give rise to the fascinating, complex genomes of mutts, then provide an overview of how existing approaches seek to disentangle this genomic complexity to provide insights into breed ancestry. We conclude by discussing some of the challenges that can compromise ancestry inferences, and comment on what sorts of information will be required to resolve these challenges over the next several years.
What is a purebred dog? What is a mutt?
To arrive at a precise definition of a mutt, it’s useful to consider how dogs first arose. Available data hint that initially incidental interactions with humans could explain their ancient origins (Larson and Fuller, 2014). Assume, for the moment, that some ancient wolves were wary of humans and others were comparatively comfortable. Under this interpretation, the burgeoning availability of human food scraps as human populations expanded could have provided a new main food source for the more gregarious wolves. Eventually, these animals formed a distinct population of animals that preferred to live in close proximity to humans, and tended to mate among themselves rather than with their more wild relatives.
If, as is required under any evolutionary scenario, there was a genetic basis for the trait distinguishing these two incipient populations—here, that would be a set of mutations that modulate an individual wolf’s comfort around humans—then it is possible that the availability of food near human populations explains the origin of dogs from their wild wolf ancestors. It’s important to note that, under this scenario, dogs were not domesticated by humans per se. Instead, humans simply created an environment that permitted self-domestication by a subset of wolves that, by chance, were genetically predisposed to be at least a little tolerant of humans.
Figure 1. The origins of modern breeds. Though the exact timing remains controversial, it is generally thought that dogs emerged as a population distinct from ancestral wolves in Eurasia between 10,000 and 40,000 years ago (Larson and Fuller, 2014). Under this scenario, while most wolves continued to be wary of humans and subject to natural selection in the wild, a few were able to tolerate life near humans, and perhaps to take advantage of their food waste. This subset of wolves eventually gave rise to a genetically distinct population of animals able to live near humans. Specific dog breeds arose far more recently, with most breeds formed less than 150 years ago (Larson et al., 2012). During this process, dogs were bred to different lines by selection for specific traits such as fighting, herding, hunting, or just being a good companion.
It is generally thought that modern dog breeds emerged around 150 years ago, during the Victorian era, long after the establishment of dogs from their wolf ancestors. This inference comes from the observation that humans created mating pairs of dogs that shared traits considered useful for specific tasks, giving rise to distinct groups of dogs increasingly enriched for genetic mutations encoding specific features (Larson et al., 2012). As in any evolutionary process, the relevant mutations initially arose randomly and were later favored by selective breeding. Various groups, including the American Kennel Club and the Kennel Club of India, eventually defined distinct breeds, ultimately giving rise to the definition of a purebred dog as one whose entire ancestry is represented by individuals listed in the stud book (“Inherited Defects in Pedigree Dogs. Part 2: Disorders That Are Not Related to Breed Standards,” 2010). In the context of the selective breeding processes that first established, and now maintain, distinct breeds, a mutt can be defined as a dog whose ancestry traces back to more than one genetically distinct lineage.
The goal of ancestry inference, then, is to use genetic information from a mutt to infer which dog breeds were present among its ancestors, and to infer their relative genetic contributions.
Collecting DNA from a dog
The first step in ancestry inference is to collect and extract DNA for genetic assessment. Fortunately, saliva is a great source of DNA, and most owners find it’s quite easy to collect. With just a few moments in a dog’s mouth, one of the sampling swabs provided by a commercial genotyping service typically becomes coated with an abundance of cells. These cells are mostly of two types: white blood cells, which are suspended in saliva and aid in immune responses, and epithelial cells, which line the mouth and are typically replaced around every 24 hours. Once cells are collected, the swab (Fig 2A) is mailed to an ancestry-inference company. There, cell membranes are broken (Fig 2B), freeing the cellular nucleus (Fig 2C), which contains the DNA; the DNA is then freed from the nucleus (Fig 2D). Proteins and other biomolecules can then be washed away, yielding a high-quality DNA sample.
Figure 2. Isolation of DNA for ancestry inference. During a few moments in a dog’s mouth, a saliva swab (A) picks up many epithelial and immune cells (B). Next, nuclei (C) can be isolated from cells, and then lysed to free DNA (D), which can then be cleaned and used for genotyping or sequencing.
Though this process of isolating DNA from saliva swabs is impressively robust and can yield high-quality DNA from big dogs, small dogs, adults, and puppies alike, it is not without its mysteries. This summer, for example, we and some colleagues collected weekly saliva samples from six pups. To our surprise, the ratio of white blood and epithelial cells per saliva sample varied a great deal among individuals, and from week to week. Even more surprising was that one pup consistently had vastly more cells per sample than did her sisters and brothers. We hope eventually to learn what might explain this dramatic variation. For the moment though, it’s reassuring that even the lowest-yield samples typically contain enough DNA for ancestry inference.
How to make a mutt: Chromosome inheritance and exchange
In humans and dogs alike, the mother and the father make almost equal contributions to the genomes of their offspring. The dog genome is divided into 38 pairs of autosomes (humans have 22 pairs), and one pair of sex chromosomes (humans, too, have one pair of sex chromosomes). Each of the 38 pairs of canine autosomes consists of one chromosome delivered by the egg from Mom, and one delivered by the sperm from Dad. The genome of the mitochondrion, a small fragment of DNA that contains many genes involved in metabolism, is always contributed by Mom.
To model the genetic origin of a mutt, let us first consider the mating of two purebred dogs: a male Labrador retriever and a female poodle. The sperm from the male and egg from the female, each carrying one copy of each chromosome, combine to form a Labrador-poodle mix that carries one copy of each chromosome contributed by each parent.
Figure 3. An egg from a female poodle is fertilized by a sperm from a male Labrador retriever, forming a Labrador-poodle mix. For each chromosome pair, the offspring has one copy contributed by the mother (purple), and one copy contributed by the father (pink). N.B. For clarity, just one of the 39 pairs of canine chromosomes is shown in this and subsequent figures.
In a parallel mating, a female beagle and a male pug mate, giving rise to a male pug-beagle mix.
Figure 4. An egg from a female beagle is fertilized by a sperm from a male pug, forming a pug-beagle mix. For each chromosome pair, the offspring has one copy contributed by the mother (black), and one copy contributed by the father (blue).
To understand how a mutt comes to contain genetic contributions from several different breeds, we must continue forward to the next generation. As before, both Mom (here, a labradoodle) and Dad (here, a puggle) contribute one of the chromosomes in each pair. However, rather than passing on the very same chromosomes that they themselves have inherited, the parents contribute recombined chromosomes, which are a combination of fragments from their own parents (Figure 5). In this example, the resulting puppy would be called a mutt, and has DNA derived from ancestors of four different breeds.
Figure 5. An egg from a female Labrador-poodle is fertilized by a sperm from a male pug-beagle mix, forming a mutt. For each chromosome pair, the offspring inherits one chromosome copy from the mother (purple and pink), and one chromosome copy from the father (black and blue). In this mating, the two parents are themselves mixed breed. Therefore, when the Labrador-poodle mix produces eggs and pug-beagle mix produces sperm, the resulting chromosomes contain DNA from more than one breed.
Recombination involves a fair trade of genetic material between the two chromosomes that make up each pair. Each recombination event produces a new version of the original chromosome, for which the overall amount of genetic material is the same as it was before, but is partitioned differently between the two chromosomes. Note that recombination is inherent to the production of sperm and eggs—even in a purebred beagle or poodle, the chromosomes within each pair exchange chunks. The consequences are most apparent, though, when the recombining chromosomes have different histories.
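The fair trade described above can be sketched in a few lines of Python. This is only an illustrative toy, not a model of real meiosis: the chromosome is reduced to ten positions, each labeled with the breed it came from, and the function name `recombine` and the hand-picked breakpoint are assumptions made for the example.

```python
from typing import List, Tuple

Chromosome = List[str]  # one breed label per position along the chromosome


def recombine(chrom_a: Chromosome, chrom_b: Chromosome,
              breakpoints: List[int]) -> Tuple[Chromosome, Chromosome]:
    """Swap material between two paired chromosomes at each breakpoint.

    From every breakpoint onward the two chromosomes trade places, so the
    total amount of material is conserved and only its partition changes.
    """
    if len(chrom_a) != len(chrom_b):
        raise ValueError("paired chromosomes must have equal length")
    cuts = set(breakpoints)
    out_a, out_b = [], []
    swapped = False
    for pos, (a, b) in enumerate(zip(chrom_a, chrom_b)):
        if pos in cuts:
            swapped = not swapped  # a crossover flips which copy we read from
        if swapped:
            a, b = b, a
        out_a.append(a)
        out_b.append(b)
    return out_a, out_b


# The two copies of one chromosome carried by a Labrador-poodle mix.
from_poodle_mom = ["poodle"] * 10
from_labrador_dad = ["labrador"] * 10

# One hypothetical crossover at position 4.
egg_1, egg_2 = recombine(from_poodle_mom, from_labrador_dad, breakpoints=[4])
print(egg_1)
print(egg_2)
```

Each recombined chromosome now carries DNA from both breeds, while the pair taken together still holds exactly ten poodle positions and ten Labrador positions.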
Inferring a mutt’s ancestry by comparison to reference genomes from purebred dogs
Local ancestry inference works by determining which breed most likely contributed each chunk of a mutt’s genome. Once an inference is made for each chromosome chunk, those inferences can be summed to estimate the overall fraction of a mutt’s genome contributed by each inferred breed.
To infer the most likely contributor of a given chromosome chunk, of course, we need some way to distinguish among the genetic contributions of various breeds. Fortunately, while most of the genome is very similar across all dogs, each breed contains specific genetic changes—called mutations—that are either unique to it, or at least much more common in it than in any other breed. Some of these mutations are directly relevant to the features of a particular breed. Others just happen to be unique to or more common in one breed than in others but have no known relevance to the breed’s specific physical features. Mutations of both types are useful for ancestry inference. In Figures 3-6, mutations specific to individual breeds, useful for inferring breed ancestry, are represented by chromosome color.
Figure 6. Inferring breed ancestry by comparing a mutt genome to a set of reference genomes. To infer breed ancestry for a mutt, a set of breed reference genomes (A) is collected, and then compared to the mutt genome of interest (B) to enable ancestry inference for each chromosome chunk and an estimate of overall ancestry contributions. The mutt above is inferred to have roughly equal contributions from pug, Labrador retriever, poodle, and beagle ancestors, as expected given that it had one grandparent from each of these four different breeds.
The steps in inferring ancestry for a mutt are then to:
- Collect a set of genetic data from purebred dogs (Figure 6A)
- Collect genetic data from the mutt of interest (Figure 6B)
- Compare the mutt genome to the reference genomes, make best guesses about breed origin for each chromosome chunk, and sum across those chromosome chunks to estimate overall breed ancestry (Figure 6C)
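The three steps above can be sketched as a toy program. It is a minimal illustration under strong assumptions, not the method used by any ancestry service: the reference panel, the four-letter allele strings, the chunk lengths, and the simple "most matching letters wins" rule are all invented for the example, and real tools use statistical models over many thousands of positions.

```python
from collections import defaultdict
from typing import Dict, List

# Step 1 (hypothetical data): for each breed, the alleles typically seen at
# the genotyped positions within each of four chromosome chunks.
REFERENCE: Dict[str, List[str]] = {
    "pug":      ["AAGT", "CCTA", "GGAC", "TTCA"],
    "labrador": ["AGGT", "CTTA", "GAAC", "TACA"],
    "poodle":   ["TAGC", "CCAA", "GGTC", "TTGA"],
    "beagle":   ["AACT", "GCTA", "GGAT", "CTCA"],
}


def matches(a: str, b: str) -> int:
    """Count positions at which two allele strings agree."""
    return sum(x == y for x, y in zip(a, b))


def infer_chunk(alleles: str, chunk_index: int,
                reference: Dict[str, List[str]]) -> str:
    """Best guess for one chunk: the breed whose reference matches best."""
    return max(reference,
               key=lambda breed: matches(alleles, reference[breed][chunk_index]))


def ancestry_fractions(mutt_chunks: List[str], chunk_lengths: List[float],
                       reference: Dict[str, List[str]]) -> Dict[str, float]:
    """Sum chunk lengths per inferred breed and divide by the total length."""
    totals: Dict[str, float] = defaultdict(float)
    for i, (alleles, length) in enumerate(zip(mutt_chunks, chunk_lengths)):
        totals[infer_chunk(alleles, i, reference)] += length
    genome_length = sum(chunk_lengths)
    return {breed: total / genome_length for breed, total in totals.items()}


# Step 2 (hypothetical data): the mutt's alleles in the same four chunks.
mutt = ["AAGT", "CTTA", "GGTC", "CTCA"]
lengths = [25.0, 25.0, 25.0, 25.0]  # arbitrary units, equal-sized chunks

# Step 3: compare, guess per chunk, and sum.
fractions = ancestry_fractions(mutt, lengths, REFERENCE)
print(fractions)
```

With these made-up inputs each chunk matches a different breed perfectly, so the mutt comes out as one quarter each of pug, Labrador retriever, poodle, and beagle, mirroring the four-grandparent example in Figure 6.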
Data from even just a small fraction of a mutt’s genome can be useful for ancestry inference
The genome of a dog contains about 2.5 billion nucleotides: the As, Ts, Cs, and Gs that make up DNA. This is not drastically different from the roughly 3 billion nucleotides that make up the human genome. In an ideal world, of course, it would be financially feasible to collect whole-genome sequence data from every dog. Over the past two decades, we have inched closer to this goal. The Human Genome Project, which published its first draft sequence of our approximately 3 billion nucleotides in 2001, ultimately cost about $2.7 billion. A massive decline in sequencing costs has enabled large-scale endeavors like the 1000 Genomes Project, which has catalogued whole-genome sequences from humans all over the world.
Despite these price declines, it still costs about $1,400 to sequence the whole genome of a dog here at the Broad Institute’s Genomics Platform. This price is surely a vast improvement upon earlier prices, but remains substantial. Fortunately, genotyping provides a cheaper—and still largely informative—alternative. In contrast to whole-genome sequencing, genotyping assays a subset of nucleotides within the genome. In the case of the dog genome, for example, the most popular chip assays about 170,000 mutations.
It is, at first, hard to imagine how data from only about 0.0068% of a mutt’s genome (170,000 of 2.5 billion nucleotides) could provide an adequate proxy for the genome overall. Part of the answer lies in the details of the recombination process mentioned above. At each generation, the chromosome chunks derived from a given ancestor become smaller and smaller. Despite these overall declines in length, chromosome chunks remain, for many generations, long enough to span many genotyped positions. Therefore, with some important caveats (and an acknowledgement that some errors will inevitably occur), it is typically reasonable to use the identity of one nucleotide in the genome of a mutt to make a guess about the identity of a neighboring nucleotide (Figure 7). This approach, termed imputation, has vastly improved opportunities for comparatively low-cost inference of ancestry components in mixed-breed dogs.
Figure 7. Imputation uses genotype information from some nucleotides to make informed guesses about others. For a chromosome formed by recombination of poodle (purple) and Labrador retriever (pink) DNA, identifying breed ancestry of positions 1 and 2, which were contributed by Labrador retriever, enables a correct guess about the breed origin of the surrounding region. By contrast, position 3 is near a breakpoint between chromosome chunks; data from that site leads to a correct guess about the origin of positions to the left but not the right of the sampled position.
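The situation in Figure 7 can be imitated with a very simple stand-in for imputation: give every unsampled position the breed label of the nearest genotyped position. This nearest-neighbor rule is an assumption chosen for brevity; real imputation methods are statistical and draw on reference panels. The 20-position chromosome and the three sampled sites are likewise invented for the example.

```python
from typing import Dict, List


def impute_by_nearest(genotyped: Dict[int, str], n_positions: int) -> List[str]:
    """Label every position with the breed seen at the nearest genotyped site."""
    sites = sorted(genotyped)
    if not sites:
        raise ValueError("need at least one genotyped position")
    return [genotyped[min(sites, key=lambda s: abs(s - pos))]
            for pos in range(n_positions)]


# True ancestry along a toy chromosome: the chunk boundary falls between
# index 11 and index 12.
truth = ["labrador"] * 12 + ["poodle"] * 8

# Three genotyped sites; the third sits right next to the breakpoint.
genotyped = {2: truth[2], 6: truth[6], 11: truth[11]}

guess = impute_by_nearest(genotyped, len(truth))
errors = [pos for pos in range(len(truth)) if guess[pos] != truth[pos]]
print("wrongly imputed positions:", errors)
```

Everything to the left of the third site is guessed correctly, while every position to its right is mislabeled as Labrador retriever, which is the kind of error the caveats above refer to.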
How does a genotyping chip work?
Canine genotyping chips designed by companies like Affymetrix and Illumina are optimized for identifying mutations relevant to disease. The result is that only the subset of mutations most likely to be clinically informative is interrogated for each dog, keeping costs down.
DNA is a very sticky double-stranded molecule in which each strand wants to bind to the other, complementary sequence. In the DNA of all life on earth, A (adenine) pairs with T (thymine), and C (cytosine) pairs with G (guanine). Therefore the DNA sequence “ATCG” would stick to the complementary sequence “TAGC.” However, even a difference of one letter (e.g., “TGGC”) can prevent the two pieces of DNA from binding to one another. Genotyping chips take advantage of this principle of selective binding to determine which mutations are present in a given dog. DNA probes are designed to bind sections of a dog’s DNA containing either the mutated or the non-mutated form of the sequence. These short sequences are attached to the top of a small glass slide commonly referred to as a “chip” or an “array” (Figure 8).
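The pairing rule can be written out directly. The sketch below follows the simplified, letter-by-letter picture used in this article and deliberately ignores the fact that the two strands of real DNA run in opposite directions; the function names `complement` and `binds` are assumptions made for illustration.

```python
# Pairing rules: A with T, and C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Return the sequence that pairs with `seq`, position by position."""
    return "".join(COMPLEMENT[base] for base in seq)


def binds(probe: str, target: str) -> bool:
    """A probe binds only a target that pairs with it at every position."""
    return len(probe) == len(target) and complement(probe) == target


print(complement("ATCG"))
print(binds("ATCG", "TAGC"))  # perfect pairing
print(binds("ATCG", "TGGC"))  # one mismatched letter
```

A single mismatched letter is enough for `binds` to return `False`, which is the all-or-nothing behavior that the article's simplified picture of probe binding relies on.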
Figure 8. Genotyping to determine which mutations each dog has. The DNA probes (short sequences complementary to mutations of interest) are present in different locations on the genotyping array. Here we illustrate detection of one of the thousands of mutations assayed by the chip. After dog DNA is added and allowed to bind to the DNA on the chip, DNA that hasn’t bound is washed off. Next, fluorescent molecules are added that bind to the remaining dog DNA. In this way, the mutations present in a dog can be identified by visualizing which regions of the genotyping array are glowing.
To ensure binding to these short genotyping probes, DNA isolated from a mutt’s saliva is first broken into tiny pieces. Next, a chemical is attached to the dog’s DNA that is great at sticking to fluorescent molecules that will be critical to interpreting the results. The mutt’s DNA is washed over the chip and each strand binds its complementary probe sequence. Thus, pieces of the mutt’s DNA find the matching probe on the genotyping chip. Two features ensure specific binding and, therefore, reliable data. First, a genotyping probe cannot bind mutt DNA derived from a different part of the genome. Second, a probe for the mutated form of a sequence cannot bind unless the dog happens to have that specific mutation (i.e., the “A” sequence illustrated above). Unbound DNA is washed from the slide, and lastly, fluorescent molecules are attached to the remaining DNA that successfully bound the probes. Because each probe was created in a specific location on the array, we can interpret which mutations a dog has by observing which tiny spots on the array are glowing.
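Reading the glowing spots can be pictured as a small decision rule. Because a dog carries two copies of each autosome, the spot for the mutated form, the spot for the non-mutated form, or both may light up at a given site. The sketch below is a hypothetical simplification: the 0-to-1 brightness scale, the fixed `threshold` of 0.5, and the function name `call_genotype` are all assumptions, and the source does not describe how production pipelines turn brightness into calls.

```python
def call_genotype(non_mutated_signal: float, mutated_signal: float,
                  threshold: float = 0.5) -> str:
    """Turn the brightness of two probe spots into a genotype call.

    Signals are assumed to be scaled from 0 (dark) to 1 (bright); the
    threshold is an arbitrary illustrative cutoff.
    """
    non_mutated_on = non_mutated_signal >= threshold
    mutated_on = mutated_signal >= threshold
    if non_mutated_on and mutated_on:
        return "one mutated copy, one non-mutated copy"
    if mutated_on:
        return "two mutated copies"
    if non_mutated_on:
        return "two non-mutated copies"
    return "no call"  # neither spot glowed brightly enough to interpret


print(call_genotype(0.9, 0.1))
print(call_genotype(0.8, 0.7))
print(call_genotype(0.05, 0.95))
print(call_genotype(0.1, 0.2))
```

The "no call" branch matters in practice: when neither spot is clearly glowing, it is safer to record nothing for that site than to guess.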
Factors that can undermine ancestry inference
Despite recent advances, several lingering challenges can undermine efforts at accurate inference of breed ancestry in mixed-breed dogs.
Figure 9. An ancestor can be inferred only if the relevant genome is present in the reference set. For breeds that are well represented among reference genomes, and well sampled by a genotyping array (for example, poodle, pug, and Labrador retriever in the scenario above) ancestry-inference efforts will typically be successful in identifying both the presence and the approximate percentage of DNA contributed by recent ancestry from that breed. However, for breeds not well represented among reference genomes (for example, beagle in the scenario above), chromosome chunks are often misattributed to a better-represented breed (for example, basset hound in the scenario above), leading to incorrect assessment of a mutt’s ancestry.
While some problems can result in merely underestimating the percentage of a mutt’s ancestry that derives from a specific breed, other problems can prevent the correct breed from being identified at all. The most substantial of these problems is the absence of the true ancestral breed from the reference dataset (Figure 9). Because breed ancestry is inferred by comparing chunks of mutt DNA to purebred dogs of known breeds, if a breed is absent from the reference dataset, that breed simply cannot be detected, even if it contributed a very large fraction of a mutt’s DNA. This issue will ultimately be solved only through inclusion of reference genomes from all recognized breeds; in the meantime, if you are interested in knowing whether your dog has ancestry from a specific rare breed, it is important to make sure your breed ancestry company of choice is able to check for that breed. For those who decide to proceed with ancestry inference even though the breed of interest is known to be absent from the reference set, it is important to keep in mind that the absence of that breed from the list of inferred ancestors provides no information as to whether the mutt truly lacks that particular ancestry.
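The misattribution shown in Figure 9 falls out of even the simplest matching rule. In the toy sketch below, a chunk that truly came from a beagle is compared against a reference set with and without beagle in it. The allele strings, the breed similarities, and the `best_match` rule are invented for illustration and are not real breed data.

```python
from typing import Dict


def best_match(chunk: str, reference: Dict[str, str]) -> str:
    """Return the reference breed whose alleles agree with the chunk most often."""
    return max(reference,
               key=lambda breed: sum(a == b for a, b in zip(chunk, reference[breed])))


# Hypothetical reference alleles for one chromosome chunk.
full_reference = {
    "beagle":       "AACTGGTA",
    "basset hound": "AACTGGCA",  # differs from beagle at a single position
    "pug":          "TTGACCAT",
}
reference_without_beagle = {breed: alleles
                            for breed, alleles in full_reference.items()
                            if breed != "beagle"}

chunk_from_beagle_ancestor = "AACTGGTA"

print(best_match(chunk_from_beagle_ancestor, full_reference))
print(best_match(chunk_from_beagle_ancestor, reference_without_beagle))
```

The matching rule always returns some breed from the set it is given, so when the true breed is missing the chunk is quietly handed to the closest available alternative rather than flagged as unknown.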
The mutations selected for genotyping also determine which breed ancestries can be accurately identified in a mixed-breed dog. Genotyping arrays tend to include more mutations present in common breeds. This means that chunks of chromosomes from poodles and German shepherds may be especially easy to identify because many of the mutations common in these breeds are assayed on genotyping arrays. While many mutations could help identify chunks of DNA from rare breeds such as New Guinea singing dogs or Skye terriers, some of these mutations may not be represented on widely used genotyping arrays, which could make these breeds harder to identify. This problem will eventually be solved by creating breed reference datasets with sequence data, which would allow for the interpretation of many more mutations and would not be biased toward detection of ancestry from specific breeds.
A mutt’s relationship to its purebred ancestors also affects the reliability of breed determination. In particular, it is easier to identify the breed ancestry of DNA from a purebred ancestor who is a close relative (such as a parent) because mutations from recent ancestors will reside in longer chunks of DNA with more informative mutations. For example, while the first mutation observed on a mutt’s chromosome may be common in both Labradors and Golden Retrievers, perhaps the first, second, and third mutations observed are only seen together in Golden Retrievers. DNA contributed by ancestors from many generations back will exist as only short chromosome chunks, with fewer mutations to help identify their contribution to the mutt’s ancestry, making inference more difficult. This issue can be mitigated by using data from sequencing instead of genotyping, allowing for all mutations to be analyzed. However, DNA inherited from many generations back can be in chromosome chunks so short that they contain no mutations characteristic of a specific breed, such that the breed’s contributions to a mutt’s ancestry cannot be detected even with whole-genome data (Li et al., 2014).
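One way to see why distant ancestors are hard to detect is to work out how much DNA a single ancestor is expected to contribute. Because each parent passes on half of its genome, the expected share from one ancestor halves with every generation. The sketch below computes only this expectation; the actual share varies from dog to dog because recombination is random, and the function name `expected_fraction` is an assumption made for the example.

```python
def expected_fraction(generations_back: int) -> float:
    """Expected share of the genome from one ancestor this many generations back."""
    if generations_back < 1:
        raise ValueError("generations_back must be at least 1 (a parent)")
    return 0.5 ** generations_back


labels = ["parent", "grandparent", "great-grandparent",
          "great-great-grandparent", "great-great-great-grandparent"]
for generations_back, label in enumerate(labels, start=1):
    share = expected_fraction(generations_back)
    print(f"{label}: {share:.2%} expected from this one ancestor")
```

A single purebred grandparent is expected to account for a quarter of the genome, in line with the four-grandparent example earlier, whereas an ancestor five generations back is expected to contribute only about 3%, spread across chunks that have been broken up by several rounds of recombination.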
What’s next? Should I genotype my dog?
As with any new technology, breed inference is an exciting opportunity that introduces some unresolved challenges. Many dog owners intrigued to learn more about the origins of their pet will surely appreciate having a window into which breeds contributed to the unique genetics of their mutt. You might even earn the right to speculate that your dog’s excellent stamina at high altitude comes from her Tibetan Mastiff grandparent (Li et al., 2014)! Still, we urge owners to be cautious and remember that a variety of problems can compromise the reliability of inferences, all the while remaining optimistic that inference will improve as reference data accrue.
Larson, Greger, Elinor K. Karlsson, Angela Perri, Matthew T. Webster, Simon Y. W. Ho, Joris Peters, Peter W. Stahl, et al. 2012. “Rethinking Dog Domestication by Integrating Genetics, Archeology, and Biogeography.” Proceedings of the National Academy of Sciences of the United States of America 109 (23):8878–83.
Li, Yan, Dong-Dong Wu, Adam R. Boyko, Guo-Dong Wang, Shi-Fang Wu, David M. Irwin, and Ya-Ping Zhang. 2014. “Population Variation Revealed High-Altitude Adaptation of Tibetan Mastiffs.” Molecular Biology and Evolution 31 (5):1200–1205.