Big Data Infrastructure for Crop Genomics

Project information
Abstract
We propose the strategic development and deployment of a bioinformatics platform to enable genomics research in crop science. The platform will directly address the needs of the scientific community by integrating and facilitating the use of available genomics and bioinformatics resources, and will be developed in collaboration with a broad user base including plant biologists, geneticists and crop breeders. The main output of this initiative will be an infrastructure to accommodate data from the large-scale genomic resequencing projects that are already underway within the plant research community, for the model species Arabidopsis thaliana and for crop species such as rice (whole genome), brassica (transcriptome), wheat and barley (exomes and genotyping-by-sequencing). This is an area of active research in crop genomics as a direct consequence of the availability of novel inexpensive sequence-based genotyping technologies. We propose to develop a suite of tools (extending from existing software where possible) and an application programming interface (API) to interact with genomic representations of population-derived sequences. Tools will range from simple querying mechanisms to the implementation of more advanced expression and association analyses. We will also develop infrastructure to enable the archival and querying of plant phenotypic data, using existing ontological terms and building on the software developed by the International Mouse Phenotyping Consortium. The platform will be accessible via servers located at TGAC and EBI, and will also be available as a virtual machine for local installation.
Summary
Recent advances in sequencing technologies and computational tools have made it possible to sequence the genomes of some of the world's most important crop species, such as rice, barley, rapeseed, maize, soya and wheat.
These crops constitute a substantial part of the daily food intake for most of the world's population, and any improvement in breeding for more efficient and nutritious varieties will have a direct impact on ensuring global food security. Whilst the genome sequences of these crops are a hugely useful resource for understanding the differences between species, it is sequencing different individuals from the same or closely related species that allows us to identify useful genetic variants that can be selected for during plant breeding. These approaches require a combination of sequence and phenotypic data, plus analysis tools. We propose to develop a crop bioinformatics platform that enables users to access this genetic and phenotypic variation and to perform analyses exploring gene expression and associations between genetic variation and traits. The platform will be developed using open source principles and publicly available data. Population-wide genetic variants will be represented on a genomic data structure; an archiving system for storing plant phenotype data will be developed; tools to allow the querying of these datasets and analyses to link genotype to phenotype will be implemented; and the platform will be accessible via TGAC and EBI servers but also packaged into a virtual machine for easy installation on users' local hardware. This novel platform for crop bioinformatics will promote opportunities for collaborative work with R&D groups in industry, research and academia. The availability of data generated by publicly funded resources, and the concomitant development of new, production-quality tools, will lower the barriers to information-enabled crop science, stimulating new opportunities for research and application.
The platform will also open up new opportunities for the UK bioinformatics community, traditionally focused on biomedical applications, by developing alternative career paths around biotechnology and agri-food.
Impact Summary
Recent advances in data-generating technologies have opened a gap between the ability to generate data and the capacity to store and analyse them effectively. The objectives set for the infrastructure we propose to develop will directly target this issue by contributing solutions in areas of research relevant to the BBSRC in food security, bioenergy and biology underpinning health.
Academic, Economic and Commercial Impacts
The development of the platform will generate new opportunities for collaborative work with industry R&D groups working in crop breeding and with academic institutions. TGAC and EBI are members of large international consortia such as the wheat (IWGS) and barley (IBSC) genome sequencing projects. The transformative effect of the availability of large diversity datasets is one of the main drivers supporting next-generation crop breeding programmes. One example is the effect that genomics-assisted methods will have on breeding for disease resistance traits. The availability of data generated by the public sector and the translation of the research tools into production pipelines will have a direct impact on the generation of new service-based businesses. The most important traits, such as yield and drought tolerance, involve multiple genes, generally identified through Quantitative Trait Loci (QTLs), and complex interactions with the environment. High-density molecular markers are one of the most important tools for informing the characterisation of complex agricultural traits and the design of sophisticated breeding strategies (e.g. genomic selection). This initiative is focused on the development of a data infrastructure to support these kinds of datasets.
Societal Impacts
The development and availability of the infrastructure for crop bioinformatics will directly impact the local community through the generation of new jobs and funding opportunities. TGAC's presence in the Norwich Research Park and EBI's in the Cambridge area have strengthened the position of the region as a technology hub hosting specific expertise in informatics applied to life sciences and biotechnology. This will create new opportunities around the development of services in genomics and bioinformatics, which will translate into job opportunities. We also expect this development to bring renewed interest in the application of genomics and bioinformatics to areas of agriculture and biotechnology research.
Policy: BBSRC, research councils and UK
A direct consequence of the implementation of this initiative will be to position TGAC and EBI as international leaders in informatics for crops research. Around this, we expect the emergence of a high-class scientific base in computational research, placing the UK in a unique position in a future where technology, data and multidisciplinary work will be the common denominators. This is aligned with the general principles set by the UK Agri-Tech Strategy, which emphasises the importance of using scientific knowledge to drive agricultural innovation.
Project dates: 
December 2014 to June 2016
Contact
Contact project
Contact person: 
Dr Sarah Ayling
Contact organisation: 
The Genome Analysis Centre
Funding
Funding agency: 
Biotechnology and Biological Sciences Research Council
Grant: 
k€2522