Updated: May 3
June 22, 2020
White Paper by John F. Conway, Chief Visioneer Officer, 20/15 Visioneers
1. Introduction and Problem Statement
2. Observed Challenges
3. Solution Architecture
4. Next Generation Multi-omics Platform Players
4.1 BlueBee
4.2 DNAnexus
4.3 EPAM
4.4 Qiagen
5. Data Standards
6. Request Management
7. Sample Management
8. LIMS Environment
9. Data Environments
9.1 Genestack
9.2 DataBricks
10. Electronic Laboratory Notebook (ELN)
11. Automation
12. Professional Services
13. Conclusion
14. References
John has spent 30 years in R&D on all sides of the fence: industry, software and services, and consulting. His industry roles were Global Head of R&D&C IT at MedImmune/AstraZeneca and then Global Head of Data Sciences and AI IT at AstraZeneca; Global Director of Discovery Informatics and Chair of the Structural Biology IT domain at GlaxoSmithKline; and Merck and Company, where he worked in the Molecular Modeling group, Cheminformatics group, Biological Data group, and Analytical Vaccines Department. John also spent many years in scientific software at Accelrys (now Biovia, a Dassault Systemes company) as Senior Director of Solutions and Services and Global Head of Presales, and he was Vice President of Solutions and Services at Schrodinger. Lastly, he was Head of R&D Strategy and Services at LabAnswer, which was acquired by Accenture, where he became Global Head of R&D Thought Leadership and Innovation.
1. Introduction and Problem Statement
The world of biology is vast and diverse. Piecing this world together so that new vaccines (prevention), medicines (treatment), and/or therapies (cures) improve quality of life is one of the most challenging feats the human species has attempted, met with some successes and lots of failures.
Given this complexity and the missing pieces of understanding outlined above, multi-omics approaches try to "stitch" together understanding of complex biological environments by amalgamating the analysis of the individual "omes": epigenome, genome, metabolome, microbiome, proteome, transcriptome, and other unmentioned or soon-to-be-defined "omes" (Figure 1). This leads to better understanding of relationships, biomarkers, and pathways. In doing so, multi-omics integrates diverse omics data to find a coherently matching geno-pheno-envirotype relationship or association. It has become core to understanding and defining early science, translational science, and clinical science. It is driving precision and personalized medicine by providing better disease understanding through molecular action and mechanism, diversity or variation, and, soon, more and better omics/disease knowledge bases. It is a very dynamic space that is changing rapidly, including new "omics" disciplines and science. For this reason, flexibility and the ability to integrate and eventually change are important factors for any solution approach and implementation.

The bottom line is that most existing competitive environments are disparate and multi-channeled from a solution architecture perspective as well as from a scientific data management perspective. Most biopharmaceutical companies need a reboot or next-generation scientific data/process analysis and management approach to multi-omics, and this white paper outlines how to achieve that. This in no way diminishes all the challenging, creative, and innovative work others have done and continue to do.
2. Observed Challenges
Based on the February/March 2020 survey that we conducted with multi-omics personas (scientists, researchers, informaticians, and IT), primarily from the biopharmaceutical industry, we found that researchers are struggling with their omics data. Quality, storage, data and process FAIR compliance, integration, and analysis topped the list (Figure 2).
FAIR (Findable, Accessible, Interoperable, Reusable) is a litmus test to help you assess the health of your data and process environment. If data and processes are not "FAIR", you will struggle to replicate your organization's work, perform data exploration/mining, and integrate your processes and data; lastly, you won't be able to get secondary use out of your data, which is not an acceptable outcome in 2020 and beyond. We estimate that 70-80% of biopharmaceutical environments are not FAIR compliant, which was right in line with the survey results (Figure 3).
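As a loose illustration of the litmus-test idea (not a formal FAIR assessment tool), one can check dataset records for the metadata that each FAIR letter implies; the field names below are assumptions for the sketch:

```python
# Sketch of a FAIR "litmus test" over dataset records (illustrative only).
# The required fields below are assumptions, not a formal FAIR standard.

REQUIRED = {
    "id",           # Findable: globally unique, persistent identifier
    "access_url",   # Accessible: retrievable via a standard protocol
    "format",       # Interoperable: open, documented data format
    "license",      # Reusable: clear usage license
    "provenance",   # Reusable: where the data came from
}

def fair_gaps(record: dict) -> set:
    """Return the FAIR-relevant fields missing (or empty) in a record."""
    return REQUIRED - {k for k, v in record.items() if v}

record = {"id": "omics-0001", "format": "FASTQ", "license": None}
print(sorted(fair_gaps(record)))  # ['access_url', 'license', 'provenance']
```

A real assessment would go well beyond field presence (persistent resolvers, ontology-backed values, machine-readable licenses), but even this crude check exposes how much of a data estate fails the test.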
From a current multi-omics software perspective, cost, access, support, breadth of coverage (capability stack), and quality topped the list of challenges (Figure 4).
There are several reasons for this, which include, but are probably not limited to: new areas of science and technology capabilities, heavy influence from academia and open source, improper change management, and the evolution of processes and understanding. It is not uncommon to see this in a rapidly evolving field.
As the value of multi-omics has grown over the years, the need for proper informatics, sample handling, and process optimization has also grown. As multi-omics has permeated the clinical and translational space, regulatory and patient-privacy needs are now also part of the workflows and carry an additional burden of expertise and know-how. Software companies that can't rise to the occasion when it comes to certifications, compliance, and ultimately the security and integration of data and systems will simply not survive (Figure 5).

There are always tradeoffs in build vs. buy. Our experience shows that hybrid environments are still the best value, where one can assess and lock into a COTS solution that doesn't require extensive and expensive configuration. The last thing you want to do is customize a COTS solution. A "best in breed" environment can therefore allow you to leverage the mature, enterprise COTS offerings and manage the less mature and dynamic processes with configuration or bespoke approaches. We see the next-generation multi-omics solution environment as roughly 80% COTS and 20% bespoke/open source/academic. As usual, integrating the multi-omics environment will be challenging.
3. Solution Architecture
As mentioned, the three distinct areas to which multi-omics is applied are research, translational sciences, and clinical. You can expect some science and technology overlap within organizations. The three areas can have differing needs, but they can be managed with platform approaches and data integration, using either a data lake (a single environment for all disparate data) or a knowledge graph (a semantically related set of data and information, based on ontologies, that can amalgamate data and information and consolidate them into a summary, including image files and graphs). While researching the current state of multi-omics environments at biotechs and biopharmas, we made an observation from interviewing different experts and scientific data stewards that applies not only to the multi-omics space but to all of R&D: the need for a scientific data strategy that drives efficiency and knowledge advances through secondary use of data and AI/ML (Figure 6). This brings guaranteed focus and commitment. It solidifies the strategy and outlines what commitments and sacrifices, if any, will need to happen. Another critical need is replication and reproducibility, which is causing serious problems in the R&D space. The survey showed that only ~25% of respondents think they have a FAIR multi-omics data environment. Given the size, complexity, and requirements for advanced analysis, multi-omics data needs to be carefully managed with proper informatics oversight.
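To make the knowledge-graph idea concrete, here is a minimal sketch of facts stored as subject-predicate-object triples and queried by pattern. The entities and predicates are invented for illustration and do not reflect any vendor's implementation:

```python
# Minimal knowledge-graph sketch: facts as (subject, predicate, object)
# triples, queryable by pattern. Entity and predicate names are illustrative.

triples = [
    ("BRCA1", "is_a", "gene"),
    ("BRCA1", "associated_with", "breast_cancer"),
    ("sample_42", "measured_omics", "transcriptome"),
    ("sample_42", "expresses", "BRCA1"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Which facts do we have about BRCA1?
print(query(s="BRCA1"))
```

Production knowledge graphs use triple stores and ontology-constrained predicates (e.g. RDF/SPARQL), but the amalgamation principle (heterogeneous facts joined through shared entities) is the same.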
It's apparent that the second wave, or generation, of multi-omics software and solutions is arriving, and companies like Microsoft, Amazon, and Google have invested heavily to handle some of the research and advanced analysis, hosting, and data storage; but they will most likely not compete with, rather partner with, the leading omics software providers. Our research has shown that there is a plethora of tools, algorithms, platforms, and supporting software, but the truth is many of these have run their course, have become clunky, suffer from integration issues, and, simply put, are not ideal for the next-generation "enterprise" environment. The newer architectures are modular, built for the cloud, flexible, scalable, and have well-documented APIs for integration. Running antiquated or academic, and sometimes open-source, code in a fast-paced and dynamic environment like a biopharmaceutical company is not always advisable. Support and maintenance of these tools take time, energy, and money. COTS software vendors will manage most, if not all, of this for you. Again, we are being very careful not to diminish all the great and meaningful work others have done to get us here!
There are key processes that you must fully understand in a highly functioning multi-omics environment. These are overall request management (everything starts with a request! Capture it!), sample management (the backbone of modern drug discovery efforts; samples must get places), experiment/test, analysis (where it all happens), and reporting (share the results with context and explanation) (Figure 7).
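The request-to-report lifecycle above can be sketched as an ordered workflow. The stage order follows the text; the tracking class itself is an assumption, not a reference to any product:

```python
# Sketch of the request-to-report lifecycle described above. The stage
# order comes from the text; the Request class itself is an assumption.

STAGES = ["request", "sample_management", "experiment", "analysis", "report"]

class Request:
    def __init__(self, request_id: str):
        self.request_id = request_id
        self.stage = 0                 # everything starts with a request

    def advance(self) -> str:
        """Move to the next stage; refuse once the report has been shared."""
        if self.stage >= len(STAGES) - 1:
            raise ValueError("workflow already complete")
        self.stage += 1
        return STAGES[self.stage]

req = Request("REQ-001")
print([req.advance() for _ in range(4)])
# ['sample_management', 'experiment', 'analysis', 'report']
```

The point of modeling it explicitly is that every downstream artifact (sample, result, report) can carry the originating request ID, which is exactly the traceability most bespoke environments lose.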
There are key platforms that will make this best-in-breed approach become a reality. Remember, the key to a highly functioning area is laying a foundation of adopted standards and rigorous scientific informatics; having an ELN that helps you keep track of and maintain your electronic scientific methods and approaches; and having a sample and biospecimen management environment that provides your organization the ability to manage, handle, and distribute samples and biospecimens. On top of all that sits your multi-omics platform, which integrates with these other offerings and, in essence, is the brain and heart behind your experimental design, testing, analysis, and reporting.
Who are the scientific software players in the world today that are cloud-first and compliance-first, and ready to drive this forward?
4. Next Generation Multi-omics Platform Players
4.1 BlueBee (just acquired by Illumina)
BlueBee has built a phenomenal modular and cloud-first offering that manages a full stack of workflows and omics interactions. The keywords here are modular and cloud-first (Figure 8). Modular is critical for many clients, as a big-bang approach to change isn't always possible or even feasible. Cloud-first, because you are getting software that was written to exploit the power of the cloud and High-Performance Computing (HPC). In other words, you are not getting a clunky, wrapped approach to on-premise software that underperforms. This has been the bane of many scientists and researchers trying to use scientific software applications running on converted on-premise software or even remote desktop services (RDS) applications. Given modern architecture and APIs, integration becomes not only easier to achieve but also easier to manage and change when needed.
· From raw data to report for any data type and workflow
· Pipeline editing, testing, validation, and deployment
· Automated, traceable, and compliant with ISO/IEC 27001, HIPAA, GDPR, Chinese Cyber Security Law, and more
· ISO 13485 quality management option for Dx applications

BlueVantage
· Custom UI and branded data solutions
· Tailored data visualizations and reports to deliver results optimally

BlueBench
· Interactive and visual environment for data analysis and visualization
· Easy translation of machine learning and AI models to production workflows
· Highly secure, integrated environments to prevent data security breaches

BlueBase
· Limitless scale, private data warehouse
· Flexible metadata models and query options
· Controlled sharing of data and insight across collaboration models
· Integrated with the data analysis workflow to create a continuous learning cycle
4.2 DNAnexus
DNAnexus has been a major player in the genomics space, and some other omics, since 2009. Their latest product, Apollo, was built to address some of the next-generation multi-omics needs. It is a cloud platform, heavy on compliance and catering to next-generation multi-omics needs, helping bridge the gap between translational science and clinical research.
4.3 EPAM
EPAM Cloud Pipeline (open source) is a web-based cloud environment that provides the ability to build and run the customized scripts and workflows that support the omics analysis, modeling and simulation, and machine learning activities required to accelerate drug discovery research (Figure 9).
Cloud Pipeline supports the current three main Cloud platforms: Amazon, Azure, and Google.
Cloud Pipeline solution wraps Cloud compute and storage resources into a single service, providing an easy and scalable approach to accomplish a wide range of scientific tasks.
Genomics data processing: create data processing pipelines and run them in the Cloud in an automated way. Each pipeline represents a workflow script with versioned source code, documentation, and configuration. Users can create such scripts in the Cloud Pipeline environment or upload them from the local machine.
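The pipeline concept above (a workflow script with versioned source code, documentation, and configuration) can be modeled roughly as follows. This is a sketch of the idea, not EPAM Cloud Pipeline's actual API; all names are invented:

```python
# Illustrative model of a pipeline as a versioned workflow script with
# documentation and configuration. Not EPAM Cloud Pipeline's actual API.
from dataclasses import dataclass, field

@dataclass
class PipelineVersion:
    version: str
    script: str     # workflow script source
    docs: str       # documentation for this version
    config: dict    # runtime configuration (instance type, storage, ...)

@dataclass
class Pipeline:
    name: str
    versions: list = field(default_factory=list)

    def release(self, version, script, docs, config):
        """Record a new immutable version of the workflow."""
        self.versions.append(PipelineVersion(version, script, docs, config))

    def latest(self) -> PipelineVersion:
        return self.versions[-1]

rnaseq = Pipeline("rnaseq-quant")
rnaseq.release("v1.0", "salmon quant ...", "Initial release", {"cpus": 16})
rnaseq.release("v1.1", "salmon quant --validateMappings ...",
               "Enable selective alignment", {"cpus": 32})
print(rnaseq.latest().version)  # v1.1
```

Keeping script, docs, and configuration versioned together is what makes an automated run reproducible: any past result can be traced to the exact pipeline version that produced it.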
4.4 Qiagen
QIAGEN is a product company with accompanying informatics offerings. DiseaseLand is their premier multi-omics product.
Used in nearly all stages of drug development (from discovery to preclinical and clinical research), QIAGEN DiseaseLand encompasses multi-omics data while providing a standardized pipeline for the incorporation of data, with controlled vocabularies for associated information.

· Utilizes public single-cell RNA-seq data
· Broad support for genomic data types and thousands of curated and quality-controlled public disease-focused genome datasets
· Powerful visualization and analytics to compare and correlate within and across datasets
· Internal "Land" next-generation database technology provides very fast access to many genomic datasets

5. Data Standards
We want to emphasize this critical piece of the solution. Your scientific data strategy should be guiding your approach to adopting and using scientific data standards. This is critical, especially with multi-omics analysis, to drive proper data integration, data science, and successful outcomes from the large investment in omics analysis. Not only will you need the individual data standards in the individual omics disciplines/data types, but you will need to add second-order descriptors so that you can properly join these data together and perform the advanced analysis needed. It is the FAIR approach that will make assembling all these high-value data easier. Work closely, in a pre-competitive way, with the industry. Work with the scientific software vendors mentioned in this white paper, but also engage companies like SciBite who are skilled in making this a reality. SciBite, a text analytics and semantic enrichment company specializing in life sciences, enables the production of semantic metadata by aligning text to ontologies. Experimental metadata from multi-omics projects can cover a wide range of entity types or "things" and as such requires a wide-ranging set of ontologies to accurately represent it, including age, sex, ethnicity, drug treatments, cell line, gene, species, tissues, etc.
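The core of aligning free-text experimental metadata to ontology terms can be sketched as simple dictionary-based named-entity recognition. The mini-dictionary below is invented for illustration and is in no way SciBite's implementation:

```python
# Toy sketch of aligning free-text experimental metadata to ontology
# terms, in the spirit of NER-based semantic mark-up. The dictionary
# here is a tiny invented sample, not a real ontology service.
import re

ONTOLOGY = {
    "liver":        "UBERON:0002107",  # tissue
    "homo sapiens": "NCBITaxon:9606",  # species
    "female":       "PATO:0000383",    # sex
}

def annotate(text: str) -> list:
    """Return (surface_form, ontology_id) pairs found in the text."""
    hits = []
    for term, term_id in ONTOLOGY.items():
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            hits.append((term, term_id))
    return hits

meta = "RNA-seq of liver biopsies from female Homo sapiens donors"
print(annotate(meta))
```

Real engines add disambiguation, synonym expansion, and curated vocabularies covering millions of terms, but the output is the same in kind: free text turned into machine-readable, ontology-backed metadata.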
The alignment of experimental metadata to these ontologies can be done either retrospectively or prospectively. Retrospective mark-up of legacy data, such as experiment write-ups, can be done using their NER engine, TERMite, which can rapidly produce semantic experimental metadata. Prospective alignment of data to the same ontologies can be done using SciBite Forms, which provides type-ahead functionality upon data entry; this functionality could be integrated into sample registry tools to allow for the collection of adequate sample metadata upon entry, simplifying the process of request management. Where there is a need to create an ontology for a domain not currently catered for in the public setting, or to capture internal proprietary standards, SciBite's democratized ontology management platform, CENtree, can support this.

6. Request Management
Request management solutions have traditionally been bespoke and not all-encompassing. Dotmatics has request management built into their ELN, and Sapio Sciences into their LIMS offering. Unfortunately, the lack of a proper enterprise request management environment in R&D is, in our opinion, a serious oversight for many, if not most, large R&D organizations. Everything starts with a request, from the CEO to the bench scientist.

7. Sample Management
Two major sample management companies in the world can handle the biospecimens and samples associated with multi-omics testing and research: Titian and Xavo. This is a heavy-duty (enterprise) requirement and won't be managed well manually or with a subpar solution that can't easily adapt to your logistics and needs. Always plan for the most complex situation, since chances are your environment will evolve to that through collaboration and externalization. Don't underestimate the importance of sample management; it becomes the backbone of larger organizations and is core to most work getting accomplished and executed promptly.

8. LIMS Environment
Many sample testing environments leverage a LIMS to manage the testing and results logistics. The Wikipedia definition of a LIMS is: "A laboratory information management system (LIMS), sometimes referred to as a laboratory information system (LIS) (clinical or forensic setting) or laboratory management system (LMS), is a software-based solution with features that support a modern testing laboratory's operations. Key features include—but are not limited to—workflow and data tracking support, flexible architecture, and data exchange interfaces, which fully 'support its use in regulated environments'. The features and uses of a LIMS have evolved over the years from simple sample tracking to an enterprise resource planning tool that manages multiple aspects of laboratory informatics." A LIMS can be highly configurable, but that does not mean it is easily customizable, which is something you want to avoid at all costs with COTS (commercial off-the-shelf) solutions. A LIMS placed in an R&D environment should also be vetted carefully based on the scientific domain and the current and near-future workflows. From a next-generation multi-omics perspective, there are only a few LIMS providers that we would recommend; Sapio Sciences is one of those.
9. Data Environments
9.1 Genestack
Genestack, founded in 2012, is primarily in the multi-omics data management space with its product, Omics Data Manager (ODM). ODM is their flagship product, launched in 2019, and it aims to solve critical issues in multi-omics data management in the life sciences:
· Data stuck in silos cannot be searched efficiently in one go on a single platform.
· The provenance and relationships between data entities (e.g. a study, its associated samples, and the data generated from the samples) are not explicitly defined in a machine-readable format for cross-study and/or cross-omics queries.
· Studies are only searchable by metadata (e.g. sample characteristics) but not by the omics data because only metadata are indexed by the search engine.
· Heterogeneous and missing metadata annotations for all data entities due to a lack of metadata standards hinder data discoverability, reuse, and integration.
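The second bullet (explicit, machine-readable relationships between a study, its samples, and the data generated from them) can be sketched as a small provenance graph. This is an illustration of the concept, not Genestack's actual data model; the entity kinds and link names are assumptions:

```python
# Minimal machine-readable provenance links between data entities, per
# the study -> sample -> data relationships described above. Not
# Genestack's actual data model; entity kinds and links are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    kind: str    # "study", "sample", or "data"
    ident: str

links = [
    (Entity("study", "STUDY-1"), "has_sample", Entity("sample", "S-001")),
    (Entity("sample", "S-001"), "generated", Entity("data", "rnaseq_S-001")),
    (Entity("sample", "S-001"), "generated", Entity("data", "proteome_S-001")),
]

def downstream(entity, links):
    """All entities reachable from `entity` by following provenance links."""
    found, frontier = set(), [entity]
    while frontier:
        cur = frontier.pop()
        for src, _, dst in links:
            if src == cur and dst not in found:
                found.add(dst)
                frontier.append(dst)
    return found

print(len(downstream(Entity("study", "STUDY-1"), links)))  # 3
```

Once the relationships are explicit like this, cross-omics queries ("all data generated from samples in this study") reduce to graph traversal rather than manual file bookkeeping.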
To break down the barriers between the silos, data are imported into the ODM catalog "in situ", i.e. without moving the massive volumes of physical data. This is critical when handling data from large to massive studies, for example single-cell RNA-seq studies with millions of cells each. Concurrent with import is automatic data and metadata indexing, using a combination of open-source and proprietary indexing technologies. To keep up with the latest science, Genestack focuses on bespoke, scalable data indexing technology. Meanwhile, the relationships between data entities are automatically captured at the point of data import. The importance of this provenance tracking is prominent in their latest endeavor of integrating R&D and clinical data, so users can reliably match up clinical studies that also have complementary R&D data. Their RESTful APIs provide users with plenty of options to search and filter data and get cross-study, cross-omics data of high liquidity in seconds, enabling efficient downstream integration with visualization applications, analysis pipelines, or machine learning algorithms.

Search and indexing aside, to ensure there is sufficient and standardized contextual information for all studies, ODM brings together three key components of bio-curation to automate metadata harmonization as much as possible: customizable templates that define the metadata model and the use of ontologies, automatic rule-based metadata validation/curation, and an intuitive interface for manual, case-by-case curation. Together, the three capabilities of ODM (data catalog, curation tools, and analytical search APIs) enable fast, federated, and integrative cross-study/cross-omics search for interoperable, interpretable, and reusable results. Finally, ODM is a system designed for more than data FAIRification.
Genestack expects ODM to be at the center of an organization’s data landscape, so they chose to build ODM with a modern tech stack and a flexible, modular architecture to fit and integrate with existing IT systems, as well as for agile response to business needs. It can be deployed on the cloud or on-premise depending on security and scalability requirements.
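To show what a cross-study, cross-omics metadata query against a RESTful search API looks like in practice, here is a sketch that only constructs the request URL. The endpoint path and parameter names are hypothetical, not Genestack ODM's documented API:

```python
# Sketch of building a cross-study metadata search request against a
# RESTful API. The base URL, path, and parameter names are hypothetical,
# not Genestack ODM's documented API; no network call is made here.
from urllib.parse import urlencode

BASE = "https://odm.example.org/api/v1"

def search_url(filters: dict, page_limit: int = 100) -> str:
    """Build a search URL from metadata filters (e.g. tissue, omics type)."""
    params = dict(filters, limit=page_limit)
    return f"{BASE}/studies/search?{urlencode(sorted(params.items()))}"

url = search_url({"tissue": "liver", "omics": "transcriptomics"})
print(url)
```

The value of an API-first design is exactly this: any downstream tool (a visualization app, an analysis pipeline, an ML workflow) can express the same ontology-backed filters programmatically instead of going through a UI.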
9.2 DataBricks
DataBricks has built a highly functional and integratable genomics data platform on Spark that will integrate with both BlueBee and DNAnexus.
10. Electronic Laboratory Notebook (ELN)
An ELN today is so much more than a paper-on-glass rendition of an old company-provided lab notebook. It is a highly configurable tool that, if used properly, enables scientists not only to document what they are doing but also to relate, track, and report. The lines can blur with some LIMS functionality, so be careful that you are matching the needs to the right tool. Many ELNs are successfully used in more dynamic research environments, centered around the experiment-centric part of the overall workflow, or for laboratory workflow documentation in the regulated areas of development. Dotmatics, IDBS, and Sapio Sciences seem to be dominating the medium-to-large biopharma space from a multi-omics perspective.
11. Automation
Automation of testing, depending on the scale and scope of your organization, is something you want to seriously consider. We highly recommend a deeper dive into automating much of the multi-omics testing process, if possible. HighRes BioSolutions is at the top of the list of vendors on the market capable of next-generation multi-omics robotics and automation.
12. Professional Services
Beyond the professional services team of your chosen software company, the need for a true R&D-oriented professional services organization is warranted for the next-generation multi-omics environment. Today, besides smaller SME-oriented consulting companies, there are not many choices for a "general contractor" who can speak all the domain languages and understand all the processes. We have strong ties with R&D domain-oriented consultants and integrators and can make bigger things happen.
13. Conclusion
If you have gotten to this part, hopefully you read the whole thing and what's described here makes sense: multi-omics is continuing to provide more and more value to the overall drug discovery and translational science workflows and outcomes. It is complex and will adhere to certain requirements outlined above. No one tool does it all; instead, best-of-breed platforms, when brought together and intercalated with one another, drive a very promising set of outcomes. Having spent many years in this space, we wanted to put this information out there to save people and organizations time, so that they can learn from our research and efforts and get an idea, or even a head start, on planning and strategizing your next-generation multi-omics solution environment.
14. References

Conesa, A., & Beck, S. (2019). Making multi-omics data accessible to researchers. Scientific Data, 6, 251. https://doi.org/10.1038/s41597-019-0258-4

Misra, B. B., Langefeld, C., Olivier, M., & Cox, L. A. Integrated omics: tools, advances and future approaches. Journal of Molecular Endocrinology.

Chromosome-to-DNA image: https://www.sciencelearn.org.nz/resources/206-dna-chromosomes-and-gene-expression

EPAM website.

Transcriptomics image: https://edu.t-bio.info/blog/2019/11/28/omicslogic-transcriptomics-2020-introduction-to-rna-seq-analysis/

Kawahara, R., Meirelles, G., Heberle, H., Domingues, R., Granato, D., Yokoo, S., Canevarolo, R., Vischi Winck, F., Ribeiro, A. C., Brandão, T., Filgueiras, P., Cruz, K., Barbuto, J. A., Poppi, R., Minghim, R., Telles, G., Fonseca, F., Fox, J., Santos-Silva, A., & Leme, A. (2015). Integrative analysis to select cancer candidate biomarkers to targeted validation. Oncotarget, 6. https://doi.org/10.18632/oncotarget.6018

Khoomrung, S., Wanichthanarak, K., Nookaew, I., Thamsermsang, O., Seubnooch, P., Laohapand, T., & Akarasereenont, P. (2017). Metabolomics and Integrative Omics for the Development of Thai Traditional Medicine. Frontiers in Pharmacology, 8. 10.3389