October 7, 2020
HTE: High Throughput Experimentation
What You Need to Know to Drive Better Outcomes from Lessons Learned to Proper Scientific Informatics
By John F. Conway1, Ralph Rivero, PhD1, and Laurent Baumes, PhD2
1 20/15 Visioneers LLC
2 Exxon Mobil Corp.
What is High Throughput Experimentation (HTE)? How is it used? And how can improved data handling and instrument integration significantly reduce discovery cycle times and costs, and improve outcomes?
High Throughput Experimentation is a tactic that involves some level of automation and comes in several flavors, all of which entail scientific experimentation in which you conceive a design of experiment(s) and execute the experiment(s) in parallel, or in rapid serial fashion, while altering specific experimental variables or parameters: for example, temperature, catalyst, pressure, solvent, or reactants. Much of this requires robotics, rigs, semi-automated kit, multichannel pipettors, solid dispensers, liquid handlers, and the like. As with other, more targeted experimentation, there must be adherence to the Scientific Method: Hypothesis, Methods, Results, and Conclusion. Ideally, well-conceived experiments result in every well, or reaction vessel, generating a wealth of information that is captured, quickly interpretable, and ultimately creates the foundation for better decisions and follow-up experimental designs. The importance of an appropriate IT and informatics infrastructure to fully capture all data in a FAIR-compliant fashion cannot be overstated. While capturing raw data, results, and conclusions has historically been the focus of information systems, we believe the electronic scientific method can be significantly enhanced by capturing ideation, as well as other design and experimental learnings. This improvement in knowledge management will help you optimize your experimentation and reduce unnecessary failures that fall within the realms of current scientific understanding and documented findings. An additional benefit of improved knowledge management is the organization of intellectual property. In many cases we recommend enhancing the "DMTA" (design-make-test-analyze) knowledge cycle to include ideation: "IDMTA".
While ideation may be characterized by some as “soft data”, it should be considered for some experiments as contextual and therefore foundational to any knowledge management system intending to preserve and make available the rationale that inspired the experimental work. Scientific intuition and creative ideas make the world go around!
At the most basic level, the tools necessary for a sustainable HTE program are HT equipment for fast, parallel synthesis or testing of materials; computational methods to select experiments, e.g. design libraries; and a FAIR environment with a well-designed data repository and query system to retrieve and reuse the data in future ideation and enhanced designs. Biology, chemistry, and materials science, to name a few, are among the scientific domains that have benefited from the volumes of data generated by High Throughput Experimentation. There are multiple advantages to implementing HTE; these include, but are not limited to, automation driving reproducibility in science, innovation, and of course major efficiency gains. HTE is ideal for the driven scientist who does not settle for less and wants to accomplish more with less, and in less time. For R&D organizations to realize the full benefits of HTE, careful investment in strategy, hardware, and software is required. Too often, however, software platforms, perhaps viewed as less sexy than the hardware, are underfunded or neglected, resulting in lost value and opportunity. As you read on, the recurring questions to consider include: "With the massive investment in HTE, did we get the IT right? Were there gold nuggets left uncaptured that would have provided long-term value and opportunity for greater institutional learning and reduced rework?" The latter question is extremely important, as world-leading corporations are often recognized by their commitment to creating a culture of continuous improvement and learning. Neither is truly possible if improper strategy and IT systems are put in place.
Biological Sciences – The More the Better? Wrong!
For close to three decades, High Throughput Biology (a.k.a. High Throughput Screening and, more recently, High Content Screening, which deals with biological imaging) has matured to where researchers routinely and rapidly "screen" thousands to millions of molecules in a biochemical or cellular context, using a variety of assay or imaging techniques, to determine endpoints like biological activity, genetic markers, apoptosis, toxicity, binding, and other biochemical, cellular, and tissue readouts. It has become a staple of the drug discovery process and has provided many lessons learned and valuable critical decision-making knowledge for an industry whose foundation relies on data. While information/LIMS systems have provided tremendous value by facilitating the generation of files that instruct liquid handlers and robots to run high throughput screens, the real value is derived from the efficient capture of decision-making data that allows scientists to turn that data into actionable knowledge and insights. A big lesson learned was that screening everything and anything was probably not a good strategy. A better strategy was to carefully manage the master DOE and infuse scientific and mathematical thinking into the approach. General approaches sometimes yield general, or less than general, results. In addition, the time and effort spent on the analysis and overall solution architecture of these massive campaigns may not have justified their costs. While these technologies, particularly the hardware from automation vendors, have been transformational in the biological sciences, it is only natural to anticipate that recent advancements in AI and ML, along with continued instrument advances, will lead to improved data-driven decisions. R&D organizations that embrace those advancements, and prepare for them now, will most certainly emerge as industry leaders.
Chemistry
The adoption of high throughput technologies, though well established in the life sciences' biology space, has been somewhat slower to take hold in some synthetic chemistry labs. The powerful ability to explore numerous hypotheses by executing multiple experiments in parallel, or in rapid serial fashion, promised to revolutionize the discovery sciences. So, what happened in discovery chemistry? The advent of high throughput synthesis (a.k.a. combinatorial and parallel synthesis in discovery) in the early 1990s, often executed manually with multichannel pipettors and custom reaction plates, created the early market for automation and, more importantly, for the requisite informatics to track samples and to capture reaction data (chemical reactivity, observations, results, etc.) in a FAIR-compliant fashion. Early on in the discovery space, specifically discovery synthesis, the "ideation" and "design" components of the ideation-design-make-test-analyze (IDMTA) knowledge cycle were somewhat flawed by the belief that all makeable chemical diversity was equally valuable. It did not take long to realize that just because you could prepare a molecule does not mean you should. So now, in 2020, we apply the learning that not all chemical diversity is in fact biologically relevant, or developable, and computational scientists in Pharma have leveraged that valuable knowledge into improved predictive models that routinely inform the ideation and design components of the knowledge cycle. A classic example of turning lemons into lemonade. Undeterred by the slower uptake in the discovery chemistry space, research automation industry leaders have continued to make tremendous technical advances in synthesis equipment and automation platforms.
Most of the limitations of early automated synthesizers (often just modified liquid handlers) have been cleverly addressed by these innovative companies, providing chemists with modular workstations with few synthetic restrictions and the ability to customize workstations as needed for even the most bespoke reaction sequences. The integration of these synthetic and post-synthetic modular workstations with a company's existing analytical and IT systems, critical to getting the maximum return on investment, is curiously often left to the individual organization, despite the vendors' ability to provide that service. It is unclear why this full integration is not prioritized higher; it is a decision that can plague organizations for years to come.
High throughput experimentation has been employed in the analytical chemistry space for decades, as the platforms used closely mimic those used to run high throughput screening. Various vendors provide instruments capable of carrying out analyses in a rapid serial manner, providing fast turnaround of critical decision-making data. As was the case in the discovery chemistry space, automation exploiting plate-based analyses was initially well integrated with existing IT systems. Integration of these analytical systems and their output with new instruments or new IT systems, however, is often done via intermediary databases, sometimes slowing down processes and critical decision making.
Materials Science – Size Does Matter
Characterization chemistry, a term sometimes used in the materials space, has also been around for a couple of decades and has a large overlap with the previously mentioned hardware and instrument manufacturers. One difference here is that anything from microvessels up to multi-liter vessels can be part of the "HTE". Materials science, and in particular catalysis, is characterized by a scarcity of data compared to other domains. This can be viewed as a hardware limitation, with related difficulties in setting up new experiments; a more realistic reason is the inverse correlation between parallelization or miniaturization and scale-up. Early HTE reaction screening was highly parallelized and miniaturized but quickly fell out of favor due to the limited relevance of the data for driving discovery and optimization at larger scale. At that time, the community was using HTE in combination with the combinatorial approach borrowed from Pharma. However, the combinatorial method very quickly becomes combinatorially intractable for materials science. The remaining hardware businesses now focus on larger-scale equipment with relatively modest reactor parallelization (4 to 16) but with conditions that allow an easier scale-up exercise.
HTE and combinatorial approaches usually assume a large amount of data. In materials science, and especially in catalysis, the amount of experimentation is relatively low due to the scale-up constraints mentioned above. In such domains, it is prudent to balance the selection of experiments against the value of the generated data. A computational technique called Active Learning integrates data collection, design of experiments, and data mining to make better use of the data. The learner is not treated as a classical, passive recipient of the data to be processed; the researcher controls data acquisition and must attend to the iterative selection of samples to extract the greatest benefit from future data treatments. This is crucial when each data point is costly and domain knowledge is imperfect.
The sampling strategy in HT materials embodies an iterative assessment of where it would be most valuable to collect data. Evolutionary algorithms, homogeneous covering, and traditional DoE have all been used. However, those techniques still use a rather large number of samples and are not optimized with respect to the downstream techniques used to learn on the consolidated dataset. The idea of the latter approach is to optimize libraries based on the learning efficiency of a given technique mapping the materials space to the response space.
Moving the Bottleneck – Data Analysis
HTE has helped us generate new materials faster (HT synthesis), test faster (parallel reactors), and characterize materials faster. In such a scenario, the analysis of the data may become the bottleneck. As mentioned above, the catalysis space does not generate a huge amount of data and experiments; however, even these amounts are too large for a human to ingest and use to make optimal decisions. In some cases, characterization or reaction data requires adapted algorithms to decipher the information it contains. Scientists still need to understand what is going on between the solid and the reactants, and for that they need to know, at the atomistic level, how the materials and reactants are interacting at the surface. (Excitingly, some success stories have been reported where, using advanced algorithms and high-performance computing, the solid structures from HTE have been exploited.)
Finally, conventional catalyst development relies on fundamental knowledge and know-how. The main drawback is that it is very time consuming, and intuition, or the initial choice, becomes critical. To overcome this, attempts to shorten the process using HTE have been reported for 30 years. HTE is more pragmatically oriented and involves screening collections of samples. It must be stressed that the relevant parameters are usually unknown, and some of them cannot be directly and individually controlled.
Descriptors and Virtual Screening
The concept of virtual screening using molecular descriptors has co-evolved with HTE, and high-throughput experimentation has become an accepted strategy. While the design of libraries has improved in the drug discovery arena, it is still challenging, especially if vast numbers of catalysts are to be explored. QSAR (quantitative structure-activity relationship) is a powerful method used in drug discovery, for which molecules need to be represented by so-called descriptors. However, transferring descriptor concepts to solids is a challenge: solids cannot easily be represented, since no structural formula can be given. Little success has been demonstrated in that area, whereas the identification of efficient descriptors would open the path to virtual screening for solids. Note that there have been few demonstrations of this concept, and most of them relate to a special class of materials, the zeolites, which are crystalline and can therefore be more easily described at the structural/atomistic level.
Making Materials Data FAIR
Up to now, the focus has been on running experiments for a given "program", and the general outcome is that it is very hard to reuse the data outside the context of each study. Libraries are too often developed in a silo and without the use of a controlled vocabulary, or better yet an ontology, and retrieving the data across multiple programs, materials, or reactions remains challenging. Catalyst development is a long process that involves synthesis, formulation, characterization of the materials pre- and post-reaction (or even in situ for operando systems), and reaction testing at different scales, all of which generate data. The challenge of consolidating and connecting all of that data has still not been met.
Your Vendor-Partner Solution Architecture – Be Holistic
There are at least a dozen vendor-partners that have contributed significantly to this space over the years, and we have worked with many of them. If you are pursuing High Throughput Chemistry, you may be entering an emerging market; for an all-inclusive environment, consider Sapio Sciences' (www.sapiosciences.com) Exemplar Scientific Platform with its Visual HTE Chemistry modules, complete with analysis and knowledge extraction. The combination of the ELN and LIMS environments provides the request, sample, experiment, test, analysis, and reporting workflows that allow chemistry HTE to run very efficiently. From a materials HTE perspective, there are a handful of companies who have been heavily involved, and many of the companies that have contributed to the biological space have also competed in the materials science verticals. The alternative is a solution architecture that takes a "best in breed" or "good enough" approach to each component, such as DOE, ELN, sample management, analysis/visualization, and reporting, with integration to sew it all together.
As scientists of all flavors, we have a societal contract to improve the human condition and leave this Earth a better place after we are gone. The promise of high throughput experimentation to revolutionize the life and material sciences industries has not yet been fully realized. If, in drug discovery, success is defined as identifying a drug directly from a high throughput screen, it has failed; but if success is defined as finding hit/lead molecules and generating volumes of data, it has been quite successful. The problem is that this has not necessarily translated to more drug approvals or shorter cycle times, the ultimate measure of success in Pharma. It has certainly led to lots of potentially useful data to inform future ideation, and that is the key. HT optimization is routinely used in the life and material sciences to improve routes (yield, cost reduction, sustainability, etc.), so there success is somewhat easier to measure and claim. We can all agree that too often the reality of technological advances does not live up to the hype, but in the case of HT experimentation, much of the value, because it is contained within the data, has remained untapped. As our population grows and environmental impact becomes a serious concern, the elephant in the room can no longer be ignored: we must address our growing carbon footprint and propensity to pollute. HTE methods, in the hands of committed scientists, will most certainly pave the way to greener, more environmentally friendly chemical processes. Their ability to help identify greener methods for long-accepted chemical transformations cannot be overstated. Green Chemistry teams, formed in most R&D organizations with the sole purpose of creating more eco-friendly processes, have already generated a plethora of information over the last decade, and that knowledge, combined with advancing technologies, provides hope for a greener future.
While there are many lessons learned from three decades of HTE use in the life and material sciences arena, managing the data, processes, standards, contextualization, metadata, integration, and DOE remains challenging. The powerful impact of giving scientists the ability to test multiple hypotheses in parallel, or in rapid serial fashion, has produced an exponential increase in data generation; however, many industry leaders acknowledge that there is still significant room for improvement in key industry metrics, such as discovery cycle times and costs. While it seems paradoxical that an increase in data generation would not lead to a dramatic decrease in discovery cycle times and costs for industries that rely so heavily on data, the reasons may be quite simple. Our ability to generate data, through the continued evolution and improvement of automation platforms, has somehow outpaced our ability to optimally leverage that data for improved decision-making. Consequently, we have missed some valuable opportunities to develop reliable predictive models that would reduce costly experimentation, despite the major advances in AI and ML (machine learning). But how can that be? There should be no better marriage than volumes of data with AI and ML. Yet as in any successful marriage, the marriage of "big data" with AI and ML is all about communication, compatibility, and seamless integration. R&D organizations need to treat data as a perishable, high-value asset. Model-quality data is the goal! If data is not properly contextualized at the time of capture, properly curated, and properly plumbed or routed, the opportunity to generate maximum value from it can be lost forever. Only when data is treated this way can the promise of AI and ML, and the implementation of an "in silico-first" or "model-first" strategy, be truly realized, allowing organizations to embrace a culture of continuous improvement and learning.
Those who can successfully consummate that marriage of “big data” with AI and ML by solving the communication, compatibility, and seamless integration problems will easily differentiate themselves from their competitors. Establishing a FAIR data environment will dramatically reduce the 40-80% of the scientific effort spent on data wrangling and will allow scientists to focus on the ideation and design leading to better experiments and better outcomes. Get ready for the “Era of Ideation”!