October 27, 2020
The Urgency to Build a FAIR Multi-Omics Data and Process Environment
John F. Conway* and Yolanda Sanchez§
*Chief Visioneer Officer, 20/15 Visioneers
§PhD, Senior Translational Visioneer, 20/15 Visioneers
A Partnered 20/15 Visioneers-Genestack Multi-Omics Scientific Data Management Industry White Paper
Contents
1. Introduction and Problem Statement
2. Observed and Experienced Challenges
3. Breaking the Poor Data and Process Culture Habits
4. Genestack is an Evolving Leader as a Multi-omics FAIR Data Environment Provider
4.1 Genestack’s Success Story: Revamping the Multi-omics Data Landscape of a Top-5 Pharma and Growing Industry Collaboration
5. Revamping Your Multi-Omics Data Landscape
6. Conclusion
About the authors
References
Introduction and Problem Statement
“You will need to adopt FAIR (Findable, Accessible, Interoperable, Reusable) compliant and scalable data solutions or you will drown in your own data…”
The current complexity of multi-omics in pharmaceutical, biotech, agriculture, and consumer packaged goods companies is extensive, and many Partner-Vendors and services organizations are ill-equipped to handle it. If you do not feel a higher-than-normal sense of urgency to build out your next-generation multi-omics data environment, you may want to start asking your organization some questions. This data environment is a foundational need, and fixing it after it has been built improperly has been likened to changing an airliner’s wings while in flight.
Don’t squander the opportunity to optimally leverage your information. Multi-omics data, information, and derived knowledge will start to bridge the large gaps in our understanding of biology, but only if they are properly curated, managed, and fully exploited. This data environment is at least an order of magnitude more complex than your average lab or scientific data environment. The level of rigor needed for analysis will require the detailed data and process contextualization that many organizations are still struggling to produce. If you fail to do this properly, the return on your large investment may not be fully realized. In an era where our knowledge of target and pathway interdependencies is still limited, we think that multi-omics approaches will:
1. Enable early therapy ideation and clinical study design
2. Improve the chances of clinical success for new medicines, and better outcomes overall for all verticals
3. Be an integral part of the promise of Translational Science and Precision Medicine
These are obvious objectives that everyone will get behind.
Right now, new technologies can measure the molecular, cellular, and/or patient phenotypes derived from the integration of data concerning gene function, chromatin structure, and epigenetic regulation. Omics approaches cover a myriad of transcripts, proteins, metabolites, microbes, and more. These technologies are currently being applied to both new drug discovery programs and the cataloging of exponentially increasing patient information, in an attempt to improve (1) the efficiency of the drug discovery process, and (2) the understanding of human disease. Proper management of multi-omics information generated by drug discovery efforts and patient understanding is critical to the translation of preclinical to clinical outcomes and to fully realize the advantage of precisely tailoring new therapies to the needs of individual patients. Although not specifically covered in this paper, we recognize that a similar situation and proposition apply to other industries where the generation of new ‘omics data is driving innovation. For example, from a plant biology perspective, the missing detail in annotations across the biology space will benefit from a FAIR data environment and result in more streamlined research.
The problem, however, is that if your organization repeats the mistakes of the past and mismanages this wealth of scientific data, it will once again fall short of the immense opportunity to bridge the gap between failure and success.
“To this end, we propose that the multi-omics scientific data and process environment must be FAIR and must be valued and guarded as one of an organization’s top assets.”
We want to be clear and highlight that this will take culture and change management to be successful, an area we touched on in our previous whitepaper1, where we discussed the whole multi-omics ecosystem and described the challenge of “stitching” together the analyses from the many different (existing and upcoming) “omes”. Multi-omics is driving precision and personalized medicine by providing better disease understanding through molecular action and mechanism, diversity or variation, and soon, more and better omics/disease knowledge bases. It is a very dynamic space that is changing rapidly, with new “omics” disciplines, science, and analytical methods being introduced all the time. As you can see from this high-level description, this new discipline comes with an overwhelming amount of data and detail that must be precisely managed to drive innovative understanding and change (Figure 1).
Figure 1. A single experiment can nowadays generate a wealth of biological information that needs to be handled with new methodologies and technologies. A combination of these experiments (if the data is FAIR) yields an even greater amount of information that can be integrated with other studies, whether preclinical or clinical, to multiply the knowledge generated. Genestack Omics Data Manager (ODM) provides the critical data-metadata integration capabilities that enable rapid cross-study multi-omics data exploration.
For all these reasons it is critical that your scientists become increasingly “data-savvy” and that they have access to intuitive, integrated, and user-friendly data management systems that enable FAIR activities in the ‘omics space. Genestack Omics Data Manager (ODM) is one of a few highly functional multi-omics data environments available on the market today and will dovetail into your current multi-omics ecosystem, whether legacy or new.
Observed and Experienced Challenges
“Let us tell you what we have learned over time...”
We are not underestimating the complexity of multi-omics; indeed, we recognize that there are many challenges to ‘omics data integration. We have tried here to capture all the challenges based on our own experiences and lessons learned, and also through a global survey conducted earlier in 2020. There are always major unexpected challenges when implementing new science and the accompanying technology, especially when the environment is evolving very fast and the amount of information is practically exploding. For example, if you have a previously existing environment that needs upgrading you will encounter issues with intercalating your new approaches from an integration of data, services, and applications perspective, and most importantly, a change management perspective.
Any solution that dovetails into an existing or even a new ecosystem or environment will have to satisfy the 6 basic workflows needed in a multi-omics ecosystem: Request, Sample, Test, Experiment, Analyze, and Report. Tools already exist to manage the pipelines and/or the workflows. A critical part of making sure the data is FAIR is that the processes themselves must be FAIR. As mentioned above, this “new” environment needs to be FAIR compliant and provide the “R” (reusable) data as Model Quality Data (MQD) to drive the secondary use in innovative hypothesis-testing models.
If you have Scientific Data Management challenges now, you have to dig deep, get to the root cause, and find out why. We are willing to bet it has something to do with one or all of the following factors: Poor Data and Process Culture, lack of/insufficient Scientific Data and Process Strategy, and/or inadequate technology.
There actually is a proper cadence to this evolution, as shown below in Figure 2, but the cart is often put before the horse and you have to figure out how to achieve your goals out of order if you don’t start with a foundational approach like a robust Scientific Data and Process Strategy.
Figure 2. Ideal sequence of events to culminate in a FAIR Data and Process Environment in a perfect world. Unfortunately, we do not live in a perfect world, and therefore pragmatic, fit-for-your-organization approaches become paramount.
It is worthwhile to emphasize again that your data is one of the most important assets for your organization. You make significant investments in time and money and your scientists work very hard to generate and/or gather the significant and relevant data needed to assemble the institutional information and knowledge required for your business to thrive. Figure 3 illustrates the key points of omics data integration and the infrastructure and mindset required to achieve optimal use of your data.
Figure 3. There is no need to learn lessons the hard way if you invest in the right tools and training for your organization.
Based on the February/March 2020 survey that we conducted with multi-omics personas (scientists, researchers, informaticians, and IT), primarily from the biopharmaceutical industry, we found that researchers are struggling with their Omics data6,7. Quality, Storage, Data and Process FAIR Compliance, Integration, and Analysis topped the list of challenges.
“Beyond infrastructure (i.e., addressing basic storage needs), two critical challenges coming out of our survey were Data Integration and FAIR Compliance, highlighting the need for systems that can address agile data management and effective data mining.”
Our experience has also taught us, and it pains us to say this, that many of the “data lake” implementations are just not cutting it. They are simply not FAIR compliant and are causing unneeded grief for those who need the data the most as part of their daily jobs. Figure 4 nicely makes this point if you are a fisherman (or a data scientist).
Figure 4. Missing or only finding part of your data is less than ideal.
FAIR (Findable, Accessible, Interoperable, Reusable) is a litmus test to help you assess the health of your data and process environment. If data and processes are not “FAIR”, you will struggle with replicating your organization’s work, performing data exploration/mining, and integrating your processes and data. Critically, you won’t be able to get secondary use out of your data, which is not an acceptable outcome in 2020 and beyond. We estimate that 70-80% of biopharmaceutical environments are not FAIR compliant, which is right in line with the Survey results.
“Indeed, about 75% of respondents do not think their current data environment is close to FAIR, and many of them confess to struggling significantly even finding the data they need.”
Some of the other, perhaps more general, challenges identified by our survey were access, support, and breadth of coverage. There are several reasons for this, which include, and are probably not limited to, new areas of science and technology capabilities, heavy influence from academia and open-source, poor study design, improper change management, and evolution of processes and understandings. It’s not uncommon to encounter this situation in a highly evolving field.
As the value of multi-omics has grown over the years, the need for proper informatics, sample handling, and process optimization has also grown. As multi-omics has also permeated the clinical and translational space, regulatory and patient privacy needs are now also part of the workflows and carry an additional burden of expertise and know-how. Software companies that cannot rise to the occasion when it comes to certifications, compliance, and ultimately the security and integration of data and systems, will simply not survive.
There are always tradeoffs involved in a “build vs buy” tactical solution. Our experience shows that hybrid environments are still the best value, where one can assess and lock into a “Commercial Off The Shelf” (COTS) solution that doesn’t require extensive and expensive configuration. The last thing you want to do is to extensively customize a COTS solution. Extensive customization can result in cumbersome and/or slow processes and sometimes interference with other basic organizational systems, which can be very frustrating for the end-user. Instead, we think that there are very reasonable tradeoffs to be made between personalization and performance.
Therefore, a “Best of Breed” environment can allow you to leverage the mature, enterprise COTS offerings and manage the less mature and more dynamic processes with configuration or bespoke approaches. We envision the next-generation multi-omics solution environment as roughly 80% COTS and 20% bespoke/open source/academic. As usual, the integration of the multi-omics environment will be challenging, but the rewards are considerable if the process is well managed. If an organization properly overcomes these challenges, the new biological knowledge generated from omics data, whether target/pathway discovery, biomarker identification, or patient phenotype understanding, will represent a significant competitive advantage, as summarized in Figure 5.
Figure 5. The rewards in new biological knowledge once your organization overcomes experimental and data integration issues can result in a significant competitive edge.
Breaking the Poor Data and Process Culture Habits
“Change is hard, but help is on the way...”
As mentioned in the introduction, the current “data and process as an asset” culture is not sufficient in many R&D organizations. Don’t be fooled: if this culture isn’t rock solid and complied with by all, your outcomes will suffer.
Excessive data wrangling (as high as 80%) and unFAIR data and processes result in struggles with your data and analysis. The outcome will be only marginal advances instead of the true understanding that can be derived from ‘omics-based unbiased approaches to scientific questions. Repetition of experiments because of failure to locate data and/or experimental reproducibility issues will hugely impact the productivity (and often morale) of your organization.
“A recent project estimated savings of 9-12 million dollars for a BioPharma client based on a 40% reduction in data wrangling and the subsequent efficiency gain, therefore enabling more and better science.”
Figure 6 illustrates the potential impact and value-added of successful ‘omics data integration.
Figure 6. Successful integration of ‘omics data can impact many aspects of the workings of an organization, including access to data (democratization), data re-use (higher return on investment), and better biological target and biomarker selection (more efficient drug discovery). Indeed, the impacts on overall productivity and the bottom line should not be underestimated.
In summary, if your organization doesn’t have a scientific data and process strategy yet, you must craft one as soon as possible and govern it properly. A true data transformation means everyone must change and commit to the new data management culture.
Genestack is an Evolving Leader as a Multi-omics FAIR Data Environment Provider
“The Intent to solve a specific problem sometimes comes with attention to detail”
The first question you may ask is why Genestack deserves this description. The answer is that Genestack (founded in 2012) is primarily in the multi-omics data management space and has successfully tackled the complex ‘omics integration problem. They have the technical and scientific expertise and know-how needed to build a multi-omics data environment that adheres to the FAIR standards. The result is Omics Data Manager (ODM), their flagship product launched in 2019.
Genestack’s Success Story: Revamping the Multi-omics Data Landscape of a Top-5 Pharma and Growing Industry Collaboration.
Multi-omics data now play critical roles in early R&D and clinical trials. For example, we know that researchers are increasingly relying on both transcriptomics and proteomics data to inform target selectivity, and cross-referencing that information with genomics data to stratify tumor indication.
However, we also realize that making multi-omics data FAIR is difficult and expensive. In this rapidly growing field, the three critical challenges are data integration, metadata standards, and data portals, as described below:
Data integration: study/sample/patient metadata and different omics data types are scattered
Metadata standards: lack of minimum metadata model with ontologies and controlled vocabularies
Data portals: no universal data access framework to rapidly build tailored data portals for specific use cases
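To make the “metadata standards” challenge concrete, here is a minimal sketch of what checking a sample record against a minimum metadata model with controlled vocabularies could look like. The field names, the vocabulary terms, and the function `validate_sample_metadata` are all hypothetical and chosen for illustration only; they do not represent Genestack ODM’s data model or API, and a real implementation would draw its terms from established ontologies rather than hard-coded sets.

```python
# Hypothetical minimum metadata model: the fields every sample record must carry.
REQUIRED_FIELDS = ("sample_id", "organism", "tissue", "omics_type")

# Hypothetical controlled vocabularies for the fields that have one.
# In practice these terms would come from curated ontologies.
CONTROLLED_VOCABULARY = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "omics_type": {"genomics", "transcriptomics", "proteomics", "metabolomics"},
}


def validate_sample_metadata(record: dict) -> list:
    """Return a list of problems found in one record; an empty list means it passes."""
    problems = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        # A field counts as missing if it is absent or blank.
        if value is None or str(value).strip() == "":
            problems.append(f"missing required field: {name}")
            continue
        allowed = CONTROLLED_VOCABULARY.get(name)
        # Free-text fields (no vocabulary defined) are accepted as-is.
        if allowed is not None and value not in allowed:
            problems.append(f"{name}={value!r} is not in the controlled vocabulary")
    return problems
```

Running a check like this at the point of data capture, rather than at analysis time, is one way to keep free-text variants such as “human” from fragmenting later searches.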
Genestack was approached by a top-5 pharma about these challenges in 2016. This pharma company had given up building an in-house (bespoke) solution because it was too expensive for them to keep up with growing data types, volumes, sources, and use cases. Also, there was a big gap in this industry between Data Lakes and Data Portals (Figure 7).
Figure 7. When generic data approaches result in poor or unFAIR data outcomes.
Data Lakes tend to have poor data standards, data relationships, and metadata management. Without a universal data access layer, data are scattered across data portals, and developing such portals becomes expensive and slow.
That’s the motivation behind Genestack’s Omics Data Manager (ODM) product suite. ODM standardizes, integrates, and indexes multi-omics data and metadata into analysis-ready data that are accessible via Genestack’s data catalog, for integration with external data catalogs/visualization tools, or for tailored data portals via API (Figure 8).
Figure 8. The ODM solution architecture.
The top-5 pharma’s R&D department has now successfully turned its omics data from raw to reusable and analysis-ready. Data managers can now capture and curate data much more robustly. Data scientists can develop data portals much more rapidly. And ultimately, researchers can re-use data and generate the much-needed insights to drive drug development. In summary, their data has become FAIR.
Building on this revamped multi-omics data landscape, Genestack recently extended ODM to connect with the pharma company’s clinical data systems. This strategically enables omics and clinical data to be explored together to drive biomarker discovery and patient selection. This has become a critical enabler of translational science for this pharma company.
An important lesson learned is that enterprise omics data management is a difficult and expensive problem. But, it’s a common one. So now Genestack is leading a collaboration with other pharma, agriscience, FMCG, and research organizations to build a sustainable industry-standard solution, through knowledge and cost-sharing.
This collaboration enables Genestack to strategically avoid common pitfalls and accelerate development to keep up with the latest industry needs, with single-cell and clinical-omics data integration currently being the priority.
Revamping Your Multi-Omics Data Landscape
“It’s a Journey Whether it is New or your Second Rodeo”
Genestack’s ODM directly solves the challenges and observations we outlined previously. With the proper Scientific Data and Process Strategy you now can:
Confidently select a platform that you can easily integrate into your existing or new ecosystem environment.
Sample Management, e.g. Titian and Xavo.
LIMS, e.g. Sapio Sciences, Clarity, LabVantage etc.
ELN, e.g. Benchling, Dotmatics, IDBS, Jupyter Project, Perkin Elmer, Sapio Sciences, Scilligence etc.
Analysis, e.g. Bespoke, Illumina, Spotfire, Qiagen, Thermo Fisher, EPAM’s Cloud Pipeline etc.
Incorporate data standards, Ontologies, Taxonomies, and data dictionaries.
Curate and manage the data, e.g. duplication-checking, integrity, contextualization etc.
Ensure FAIR enablement of your science, meaning that your search is powerful, the data can be found, you have access with the proper level of security, you can integrate and interoperate with legacy data and external data sets, and importantly FAIR data is accessible and ready for reuse in your secondary models.
Scale the system in performance and use because it was built for the cloud and high-performance computing (HPC).
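As an illustration of the curation steps listed above (duplication-checking and contextualization), the following minimal sketch flags duplicate sample identifiers and attaches sample metadata to omics measurements. The record layout, the `sample_id` key, and both function names are assumptions made for this example; they are not taken from ODM or any other product named here.

```python
from collections import Counter


def find_duplicate_ids(records, key="sample_id"):
    """Return the sorted identifiers that appear more than once in the records."""
    counts = Counter(record[key] for record in records)
    return sorted(identifier for identifier, n in counts.items() if n > 1)


def contextualize(measurements, metadata, key="sample_id"):
    """Attach sample metadata to each measurement.

    Returns (linked, orphans): measurements merged with their metadata, and
    measurements whose sample has no metadata record (not findable in context).
    """
    # If an identifier is duplicated in the metadata, the last record wins here;
    # a real curation workflow would resolve duplicates before linking.
    metadata_by_id = {record[key]: record for record in metadata}
    linked, orphans = [], []
    for row in measurements:
        context = metadata_by_id.get(row[key])
        if context is None:
            orphans.append(row)
        else:
            linked.append({**context, **row})
    return linked, orphans
```

The orphan list is the useful by-product: measurements that cannot be tied back to a described sample are exactly the data that later cannot be found or reused.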
Conclusion
“The time to raise the ante on your multi-omics data management environment is now”
In this white paper we have focused not just on the capture of data (in fact, generating data might be the easy part) but on the whole ecosystem of how you will curate, contextualize, access, integrate, and, perhaps most importantly, reuse your data.
If you are at all questioning the current approaches in your R&D organization based on what you have read here, we would recommend you immediately pause your program and make serious adjustments. Going on this journey with a ¼ tank of gas will most likely lead you to a destination that will result in considerable frustration for all involved.
ODM is being adopted by both large pharmaceutical and biotech companies that have lived the “Poor Data” Pandemic and know from experience that they will not be competitive without a data ecosystem that can also be used for (very) near-future AI/ML methods and initiatives. For the other industries (Agriculture and Consumer Packaged Goods), we recommend taking a hard look at ODM as a highly capable multi-omics data ecosystem that will future-proof your efforts for years to come!
A few final words
If you have gotten to this part, hopefully you have read the whole thing and what’s described here makes sense to you: multi-omics continues to provide more and more value to end-to-end drug discovery and translational science workflows and outcomes. Multi-omics is complex and will need to adhere to certain requirements, as outlined above, to realize its full potential. No one tool does it all, so instead we envision a combination of “best of breed” platforms that, when brought together and intercalated with one another, drive a very promising set of outcomes.
Having spent many years in this space, we wanted to put this information and our thoughts out there to save people and organizations time and money by learning from our research and efforts. We hope we can help you get ideas or even a head start on planning and strategizing your Next-Generation Multi-omics Solution Environment. The time to act on your multi-omics data management is now!
About the authors
John F. Conway
John has spent 30 years in R&D on all sides of the fence: industry, software, services, and consulting. His industry roles were Global Head of R&D&C IT at Medimmune/AstraZeneca and then Global Head of Data Sciences and AI IT at AstraZeneca, Global Director of Discovery Informatics and Chair of the Structural Biology IT domain at GlaxoSmithKline, and positions at Merck and Company, where he worked in the Molecular Modeling group, Cheminformatics group, Biological Data group, and Analytical Vaccines Department. John also spent many years in scientific software at Accelrys (now Biovia, a Dassault Systemes company) as a Senior Director of Solutions and Services and Global Head of Presales. Also, John was Vice President of Solutions and Services at Schrodinger. Lastly, he was Head of R&D Strategy and Services at LabAnswer, which was acquired by Accenture, where he became Global Head of R&D Thought Leadership and Innovation.
Yolanda Sanchez
Yolanda has 25+ years of combined academic and industry experience in translational research and drug discovery, including 14 years at GSK, where she was Vice-President and Discovery Performance Unit head, responsible for a portfolio of mechanisms to address disease progression in chronic respiratory diseases. Her expertise spans target and pathway identification and validation, lead optimization, candidate selection, translational studies, biomarkers, and early clinical studies. Her combined cell biology, pharmacology, and translational expertise is relevant to diseases and indications in which maladaptation of cellular stress mechanisms drives disease progression. Yolanda is passionate about implementing new technologies to drive efficiency in drug discovery.
References
1. Making multi-omics data accessible to researchers. Conesa A, Beck S. Scientific Data 6, 251 (2019). https://doi.org/10.1038/s41597-019-0258-4
2. Integrated omics: tools, advances, and future approaches. Misra BB, Langefeld C, Olivier M, Cox LA. Journal of Molecular Endocrinology 62, R21-R45 (2019).
3. Transcriptomics. https://edu.t-bio.info/blog/2019/11/28/omicslogic-transcriptomics-2020-introduction-to-rna-seq-analysis/
4. Integrative analysis to select cancer candidate biomarkers to targeted validation. Kawahara R, et al. Oncotarget 6 (2015). https://doi.org/10.18632/oncotarget.6018
5. Metabolomics and Integrative Omics for the Development of Thai Traditional Medicine. Khoomrung S. Frontiers in Pharmacology 8 (2017). https://doi.org/10.3389/fphar.2017.00474
6. Multi-Omics Survey. John F. Conway, 20/15 Visioneers (2020). https://www.surveymonkey.com/r/MW5GZXK
7. The next generation multi-omics informatics and system. John F. Conway, 20/15 Visioneers (2020). https://20visioneers15.com/f/the-next-generation-of-multi-omics-informatics-white-paper