Research Project #434608: Database Tools for Managing and Analyzing Big Data Sets to Enhance Small Grains Breeding

Location: Plant, Soil and Nutrition Research

2023 Annual Report

Objectives
Objective 1: Develop methods and analyses on the Triticeae Toolbox (T3) database that use data stored there to assign likelihood to genome segments of carrying trait-associated variants.
Sub-objective 1.A. Improve T3 upload, download, and quality control tools.
Sub-objective 1.B. Implement the Genomics and Open-source Breeding Informatics Initiative (GOBII) genotype data storage on T3.
Sub-objective 1.C. Automate imputation to high-density genotyping platforms.
Sub-objective 1.D. Automate genome-wide association study implementation.

Objective 2: Improve linkages between diversity data stored in T3 and knowledge gleaned from the literature based on biological experimentation.
Sub-objective 2.A. Develop new linkages with KNetMiner.
Sub-objective 2.B. Implement analyses to estimate between-trait genetic correlations using the whole database as the reference population.

Objective 3: Enhance T3 facilities to analyze and manage multi-omic data and data from multi-state cooperative nurseries.
Sub-objective 3.A. Functions for search and analysis of transcriptomic and metabolomic data.
Sub-objective 3.B. Clustering and prediction using multi-omic data.

Approach
ARS will develop text file input methods and will implement Mendelian error checking when both parents and an offspring have marker data. Upon upload of a high-dimensional phenotype dataset, a relationship matrix will be constructed from it and compared to the marker-based and pedigree-based relationship matrices. It will be important to scale each phenotype according to the information it carries about the genotype, namely, its heritability. The method will be developed and tested on both transcriptomic and metabolomic datasets. The Genomics and Open-source Breeding Informatics Initiative (GOBII, www.gobiiproject.org) genotype data management system will be incorporated using the Breeding application program interface (BrAPI, www.brapi.org). For imputation, Beagle4 has been tested. We will collaborate with another ARS lab in bringing the Practical Haplotype Graph (PHG) to wheat. When new lines are uploaded to T3 with genotype data of adequate density, they will be imputed. Genome-wide association study (GWAS) analyses using imputed scores will take the reliability of those scores into account. For traits assayed in multiple trials, results will be combined by meta-analysis. Genes will be sorted by cumulative evidence of association, automated links will be made to external databases, and we will populate a JBrowse track with GWAS hits. T3 users will want to access the KNetMiner network after an association analysis in T3: once a user has identified a variant associated with a trait, KNetMiner will provide access to information from the literature about it. KNetMiner has developed a beta application program interface (API) that takes a gene and a trait and displays the knowledge network connected to them; this API will be used. Traits will be linked by co-located associations. In a focal dataset, users will query all associations that pass a user-defined threshold.
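As a rough illustration of the Mendelian error check mentioned in this section, the Python sketch below flags markers at which an offspring genotype could not have arisen from the two parental genotypes. It assumes biallelic markers coded as alternate-allele dosage (0, 1, 2) and missing calls coded as None. The function names are hypothetical and are not part of T3.

```python
from itertools import product

# Assumption: each genotype is an alternate-allele dosage (0, 1 or 2).
# A parent with dosage 1 can transmit either allele; 0 and 2 transmit only one.
GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}


def possible_offspring(parent1, parent2):
    """Return the set of offspring dosages compatible with two parental dosages."""
    return {a + b for a, b in product(GAMETES[parent1], GAMETES[parent2])}


def mendelian_errors(parent1, parent2, offspring):
    """Return indices of markers where the offspring call is impossible.

    Markers with a missing call (None) in any of the three individuals
    are skipped, because no check is possible there.
    """
    errors = []
    for i, (p1, p2, o) in enumerate(zip(parent1, parent2, offspring)):
        if p1 is None or p2 is None or o is None:
            continue
        if o not in possible_offspring(p1, p2):
            errors.append(i)
    return errors


if __name__ == "__main__":
    mother = [0, 2, 1, 0, None]
    father = [0, 2, 1, 2, 1]
    child = [1, 2, 0, 1, 0]
    print(mendelian_errors(mother, father, child))
```

A real pipeline would likely also report the error rate per trio, since a high rate usually points to a mislabeled parent rather than to genotyping errors at individual markers.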
Physical distance between associations in prior and focal datasets will be ranked and presented to enable users to determine which traits they want to link to. Traits will also be linked by the overall genetic correlation between them by correlating genomic predictions to traits measured in the focal dataset. Expression data for tens of thousands of genes will be stored in “materialized view” tables. JBrowse tracks will be created allowing gene expression of sets of individuals to be displayed. Clicking on a transcript will open a window with a link back to T3 enabling the selection of the transcript as a phenotype. The transcriptome sequences will be added as a T3 BLAST database. The challenge of metabolomics is that most metabolites detected in mass spectrometry (MS) experiments are of unknown chemical composition. Metabolomic databases other than T3 allow metabolite identities to be explored. Metabolomic data will be stored in formats compatible with those databases to enable sharing. Users will be able to link back to T3 from them. As with gene expression, metabolites will be searchable based on the genetic correlation of their levels with other phenotypes.
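The ranking of co-located associations described in this section could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the Association record, its field names, and the use of -log10(p) as the user-defined threshold are all hypothetical, and the threshold is applied to the focal dataset only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """Hypothetical record for one GWAS hit."""
    trait: str
    chrom: str
    pos_bp: int
    neg_log10_p: float


def rank_colocated(focal, prior, min_neg_log10_p):
    """Pair focal hits that pass the threshold with prior hits on the same
    chromosome, ordered from the closest pair to the most distant one."""
    pairs = []
    for f in focal:
        if f.neg_log10_p < min_neg_log10_p:
            continue  # focal association does not pass the user threshold
        for p in prior:
            if p.chrom == f.chrom:
                pairs.append((abs(f.pos_bp - p.pos_bp), f, p))
    pairs.sort(key=lambda item: item[0])
    return pairs


if __name__ == "__main__":
    focal = [
        Association("grain yield", "1A", 1_000, 6.0),
        Association("grain yield", "2B", 500, 2.0),
    ]
    prior = [
        Association("plant height", "1A", 1_500, 5.0),
        Association("heading date", "1A", 9_000, 4.0),
        Association("protein", "2B", 510, 7.0),
    ]
    for distance, f, p in rank_colocated(focal, prior, min_neg_log10_p=3.0):
        print(f"{f.trait} ~ {p.trait} on {p.chrom}: {distance} bp apart")
```

Ranking by raw base-pair distance is the simplest choice. A production version might instead rank by linkage disequilibrium between the two markers, which the text does not specify.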

Progress Report
This Progress Report is the last for the current Project Plan. This Project Plan focused primarily on developing advanced genomics tools in conjunction with the breeding data being deposited to T3. In some ways the coming Project Plan focuses more on “back to basics” efforts to make the tool more immediately useful to breeders, so that they can easily deposit their data there and interest in the tool grows as a function of the wealth of its data resource. The milestone of bringing T3 fully into BrAPI compliance for upload features exemplifies that effort. We have been helped in that milestone by the Breedbase team (www.breedbase.org; Breedbase is the codebase upon which T3 rests) as well as the Breeding Insight team (www.breedinginsight.org), who have helped the BrAPI implementation become more robust. Thanks to these efforts, there are now clear guidelines (https://wheat.triticeaetoolbox.org/guides/brapi) for developers to create apps that use T3 data, accessed directly from within the app, for specialized functions.

An example of such an app shows how this access will become an increasingly powerful feature. T3 participates in the Wheat Coordinated Agricultural Project (WheatCAP). An important component of the project is that each breeding program is flying trials with a drone taking images of plots. These images are centrally analyzed by a group called UASHub, hosted at Texas A&M, to extract vegetation indices and plant heights. All of this data is being deposited to T3. The breeders upload their trials and field layouts to T3 and send the images to Texas. Using BrAPI, the UASHub can download the plot layout from T3 and push the results of their analyses back to T3 via computer-to-computer communication without any human handoff. The breeder can then work with their data on T3, for example easily aggregating their data with data from other cooperators. Our progress on imputation also exemplifies T3's growing interconnections with the small grains breeding communities.
We have received an updated imputation panel from USDA scientist Katie Jordan that was built using 472 exome-capture sequenced wheat lines. Before using this panel, we need to verify the accuracy with which it imputes markers. Accuracy verification has proved more challenging than anticipated. We face three challenges. First, the reference sequence for wheat has changed over time. The validation data that we have were called against a different reference sequence than the data that went into the PHG. Second, allele calls based on sequence data are inherently less robust than calls from chips because they depend in part on the bioinformatics pipeline used to call them, especially given the allopolyploid nature of wheat. Again, our validation data were called using a different pipeline than the PHG data. We are only the users of these data, not their creators, so we depend on others for standardization. These pipeline differences mean that some loci, though they share identical metadata across protocols, nevertheless produce different allele calls. These discrepancies cause problems when one dataset is supposed to validate the other. Finally, imputation accuracy is not universal across wheat: it is population dependent. We therefore need to tailor our analyses to different wheat market classes, adding an analysis step that we should have anticipated but didn’t. Thus, the imputation process is not complete. But we believe that we have a good handle on the challenges, and the way forward is becoming clearer. The reward when we get there will be substantial: an ability to harmonize genotyping calls across the many different genotyping platforms that have been used in wheat over the years, enabling more powerful data aggregation across large numbers of trials.

An important innovation we have made in deploying T3 is the development of a Docker image containing all of the T3 scripts. Dockerization allows us to create and maintain new instances of T3.
For example, we have created a separate instance of T3 that is restricted to WheatCAP Principal Investigators, reducing breeders’ hesitation to deposit new data to T3. We have also created two instances that are being used by breeders for their ongoing programs: the University of Illinois program and the USDA-ARS program in Manhattan, Kansas. Collaborating with these programs frequently identifies issues with T3, which leads to improvements. We have improved pedigree functionality, seedlot handling, field layout, and data search functions through interactions with breeders. We have heard from several breeders that being able to host program-specific private instances of T3 for their own programs would be a game-changing feature. We are now working with SCINet leaders to determine if USDA cyber infrastructure can help with this request. In all, progress on T3 data analysis features has been slow but steady, and we are also moving toward important changes in the basic features that will enable us to provide new support to public sector breeders in the United States.
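The computer-to-computer exchange with UASHub described in this report might be sketched as follows in Python. This is a hedged illustration, not the actual UASHub client. The BrAPI v2 endpoint names (observationunits, observations) and field names follow the public BrAPI specification, but the base path on the T3 server, the identifiers, and the helper function names are assumptions. The code only builds the requests and does not contact any server.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: the BrAPI v2 base path on the T3 wheat server.
BASE_URL = "https://wheat.triticeaetoolbox.org/brapi/v2"


def plot_layout_url(study_db_id, page_size=1000):
    """URL for fetching the observation units (plots) of one trial."""
    query = urlencode({"studyDbId": study_db_id, "pageSize": page_size})
    return f"{BASE_URL}/observationunits?{query}"


def observations_payload(plot_values, variable_db_id, study_db_id, timestamp):
    """Build the list of new observations, one per plot.

    plot_values maps an observationUnitDbId to a measured value,
    for example a vegetation index extracted from drone images.
    """
    return [
        {
            "observationUnitDbId": unit_id,
            "observationVariableDbId": variable_db_id,
            "studyDbId": study_db_id,
            "value": str(value),  # BrAPI transmits observation values as strings
            "observationTimeStamp": timestamp,
        }
        for unit_id, value in plot_values.items()
    ]


def observations_request(payload, token):
    """Prepare (but do not send) the POST request that pushes results back."""
    return urllib.request.Request(
        f"{BASE_URL}/observations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # All identifiers below are placeholders.
    print(plot_layout_url("123"))
    payload = observations_payload(
        {"plot-1": 0.71}, "ndvi-variable", "123", "2023-06-01T12:00:00Z"
    )
    request = observations_request(payload, token="PLACEHOLDER")
    print(request.get_method(), request.full_url)
```

Sending the prepared request with urllib.request.urlopen would complete the round trip. Authentication details depend on the server and should be taken from the BrAPI guide linked in this report.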

Accomplishments
1. Global cassava (Manihot esculenta) research collaborations. Cassava research is helping breeding programs become more efficient. USDA-ARS scientists in Ithaca, New York, continue to collaborate globally, improving experimental methodology in the area of genotype-by-environment interaction (GEI) for cassava breeding and providing guides to causal loci in the cassava genome. This year saw a major focus on understanding GEI patterns affecting cassava breeding in Nigeria, leading to actionable clustering of locations that could be used to breed for specific adaptation. This work showed the value of ongoing curation and aggregation of breeding data, which makes it possible to analyze multi-year data and identify consistent patterns across the geography of an area served by a breeding program. Two other important contributions from our group were in understanding genetic recombination in cassava as it is affected by introgressions coming from the congeneric M. glaziovii, and in continuing analyses to detect loci affecting cassava response to cassava brown streak virus. Our group played a minor role in collaborations with Brazil and Thailand on cassava yield and quality. Particularly with respect to the work on GEI, stakeholders involved in breeding benefit from a clear example of assessing the patterns of GEI in a large multi-year, multi-location dataset.

2. Genomic prediction in the biphasic lifecycle sugar kelp (Saccharina latissima). No kelp species has been domesticated in the sense that artificial selection has made it more adapted to cultivation than to living “in the wild”. USDA-ARS scientists in Ithaca, New York, collaborate on research to domesticate sugar kelp, particularly to develop efficient DNA-marker-based methods of selection. They published the first report of genomic prediction in sugar kelp. This publication represents a breakthrough because sugar kelp has a biphasic lifecycle in which the farmed kelp is diploid but the germplasm used in breeding is haploid. Thus, observations on the diploid kelp were used to train a model to predict performance of progeny of haploid kelp. Theory was developed to accomplish this prediction across life stages. This research advances agriculture into a new area of off-shore farming, and ARS is fittingly at the forefront of the effort.

3. Increases in recombination do not increase rates of gain when using genomic prediction. New molecular methods promise to enable increased recombination in plants, causing the release of variation from new combinations of alleles. Applying such methods can increase gain from phenotypic selection. USDA-ARS scientists in Ithaca, New York, performed simulations based on the wheat genome to determine if increased recombination also increases gain from genomic prediction. Recombination may increase variation but also disrupt the marker-QTL associations that genomic predictions rely on. In general, it was found that increasing recombination did not increase gain from genomic selection. This result is an important cautionary tale about combinations of different technologies that do not synergize. In general, simulations are used to promote new high-tech methods. The value of this research to ARS and its stakeholders is that it indicates that refraining from applying the latest technologies to increase recombination may be the best approach, thus potentially reducing wasteful efforts.

4. Attention to metabolite structure and biochemical pathway can increase genomic prediction accuracy. Continuing work on oat metabolomics, USDA-ARS scientists in Ithaca, New York, developed generalizable approaches to using the wealth of data deriving from such experiments. They showed that combining genome-wide association analyses with information coming from known biochemical pathways of the metabolites can increase prediction accuracy for some categories of metabolites. The research represents an advancement in methods combining a high-throughput phenotyping method (metabolomics) with knowledge in biochemistry acquired over decades of global effort. This research provides a tool for breeders across multiple crops who work on nutritional quality affected by secondary metabolites.
