The Science Behind Trash Data

Jul 22, 2020

Updated: Jul 19, 2022

July 23, 2020

The Suffering Science, and Missed or Delayed Discoveries Behind and Underneath Trash Data

Author: John F. Conway, Chief Visioneer Officer, 20/15 Visioneers

Re-posted from my guest blog post on www.TetraScience.com

Trash Data, or Dark Data, is collected data that is never touched again and never gets secondary use. We are not talking about tracking data or ancillary system data, etc. We are unfortunately talking about R&D data that has come from experimentation and testing, both physical and virtual. This means individuals and organizations are not taking advantage of key decision-making data and information and are missing very valuable insights.

In many R&D organizations, Trash Data has accumulated and can account for upwards of 80% of the organization's known data. (1), (2)

The interesting part of this conundrum is that it's both structured and unstructured data! If data isn't properly contextualized, it’s probably not going to be reused, and hence it becomes Trash Data. So, don’t be fooled into just pushing mis- or uncontextualized data to the cloud, because you will now have Trash Data in the cloud! Unfortunately, in the year 2020, this is a major problem for R&D organizations. It is telling you that either you have no Scientific Data and Process Strategy, or, even worse, you are not following the one you have written and committed to following! It's also telling you that some of your AI/ML strategies are going to be delayed until the organization solves its major data and process problems. Model Quality Data (MQD), and lots of it, is needed for AI/ML approaches in R&D. And remember, your processes produce the data, so they go hand-in-hand. Data generation processes need to be formulated and established during the initial step of any data project. To mitigate the risk of being inundated with Trash Data, a foundation with strict requirements for data production and integration needs to be established. “Data plumbing” is a critical step that will ensure your properly contextualized data is routed and deposited in the right location.

Another major issue is what to do with legacy data! Based on what was just discussed, my experience has shown that ~80% of legacy data is not worth migrating. The best strategy is probably to leave it where it is and connect to it with your new data integration platform. You can then determine the value and worth of the legacy data by performing a data assessment through the new platform.

So, how has this happened? How did we arrive at this place where 80% of our hard-won R&D data ends up as trash? The truth is that it comes down to human behavior and discipline. Writing, agreeing on, and committing to a Scientific Data and Process Strategy is step number one. However, to take a written agreement and turn it into standard processes that are embedded within the organization, you need a company culture that includes true leadership and a focus on REPRODUCIBLE science. Tactically, it starts with data capture and storage principles. You need a Data Plumber!

R&D organizations are like snowflakes - no two are identical - but there is much overlap in process and types of data generated! Variation is the real problem. It shows up in instrument types and various lab equipment with or without accompanying software (e.g. CDS - chromatography data systems), and in ancillary scientific software like entity registration, ELN (electronic laboratory notebook), LIMS (laboratory information management system), SDMS (scientific data management system), exploratory analysis, decision support, Microsoft Office tools, and the list goes on. Hopefully, you have consistent business rules for managing and curating your data, but the chances are that this varies as well. What you, unfortunately, end up with is a decidedly unFAIR (see FAIR data principles) data and process environment.

Why did you, a mature R&D organization, end up in this position? (Startup Biotechs – BEWARE! Learn from the mistakes of those who have gone before and don’t let this happen to you!) (3). It may be hard to deconvolute, but I think, and this can be up for debate or confirmation, the mindset of “data and processes as an asset” got lost somewhere along the way. Perhaps management and others became impatient and didn’t see a return on their investment. Poorly implemented scientific software tools that were designed to help with these problems compounded the situation. In some cases, environments were severely underinvested in. Taking shortcuts in the overall data strategy, without establishing other foundational steps or processes, is like building a house without a proper foundation. At first, this leads to sagging windows and doors, and a poor living experience with the house. Eventually, the foundation-less house collapses or needs to be demolished. In other cases, churn and turnover in different IT/Informatics and business functions created a “kick the problem down the road” situation. Many times, the “soft” metrics weren't gathered: how often experiments were repeated, and how much time was wasted trying to find things and make heads or tails of poorly documented data or experiments. At the end of the day, humans worked harder instead of smarter to make up for the deficiencies.

Imagine you can start to solve this problem both strategically and tactically. Tactically, it starts with business process mapping and completely understanding your processes. The understanding of the detail is an insurance policy for much better outcomes in your journey. As discussed, strategically you need a sound written Scientific Data and Process (SD&P) Strategy that the organization can follow. Tactically, you need to capture your structured and unstructured data in both raw and processed forms. It must be properly contextualized; in other words, you need to execute on your SD&P Strategy. Make sure you are adding the right metadata so that your data is relatable and easily searchable.
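
To make "properly contextualized" a little more concrete, here is a minimal Python sketch of a metadata check that could run at capture time. The required fields, the `DataRecord` class, and the function names are all hypothetical illustrations, not part of any particular SD&P Strategy or product; your own strategy would define what context is mandatory.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical minimum context for a record; a real SD&P Strategy
# would define its own list.
REQUIRED_METADATA = ("project", "experiment_id", "instrument", "operator", "acquired_on")


@dataclass
class DataRecord:
    path: str   # where the raw or processed file lives
    form: str   # "raw" or "processed"
    metadata: Dict[str, str] = field(default_factory=dict)


def missing_context(record: DataRecord) -> List[str]:
    """Return the required metadata keys that are absent or empty."""
    return [key for key in REQUIRED_METADATA if not record.metadata.get(key)]


def is_contextualized(record: DataRecord) -> bool:
    """A record counts as contextualized only if no required key is missing."""
    return not missing_context(record)


if __name__ == "__main__":
    record = DataRecord("run43.raw", "raw", {"project": "P1", "operator": ""})
    # Flag the gaps at capture time, before the record turns into Trash Data.
    print(missing_context(record))
```

The point of the sketch is only that the check is automated and happens when the data is captured, so the burden does not fall on the scientist after the fact.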

This can’t all be done on the shoulders of your scientists. Instead, use smart technology and adopt standards wherever possible. You need to purposefully design the framework for the “data plumbing” in your organization. Both hot and cold data need to be routed to where they belong. And... besides an ELN, SDMS, or LIMS, they may belong in a knowledge graph where their true value can be exploited by all personas, like data scientists, computational scientists, and savvy scientists making bigger decisions! When you can accomplish this purposeful routing, you will end the broken pipes which led to data silos and disparate data! Finally, you are on the road to being a FAIR-compliant R&D organization! Findable! Accessible! Interoperable! And last, but not least, secondary use of your data - Reusable!
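
As a rough illustration of what purposeful routing might look like, the Python sketch below sends each record to one or more destinations based on simple rules. The record fields (`kind`, `temperature`, `experiment_id`), the rules themselves, and the destination names are assumptions made up for this example; a real data plumbing design would derive them from your own processes.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Tuple[Callable[[Record], bool], str]

# Hypothetical routing rules: (predicate, destination). Order determines
# the order of the returned destinations.
ROUTING_RULES: List[Rule] = [
    (lambda r: r.get("kind") == "sample_result", "LIMS"),
    (lambda r: r.get("kind") == "instrument_file", "SDMS"),
    (lambda r: r.get("temperature") == "cold", "archive"),
    (lambda r: bool(r.get("experiment_id")), "knowledge_graph"),
]


def route(record: Record) -> List[str]:
    """Return every destination whose rule matches the record.

    Records that match nothing go to a review queue instead of being
    silently dropped, so they do not quietly become Trash Data.
    """
    destinations = [dest for matches, dest in ROUTING_RULES if matches(record)]
    return destinations or ["review_queue"]


if __name__ == "__main__":
    hot_file = {"kind": "instrument_file", "temperature": "hot", "experiment_id": "EXP-1"}
    print(route(hot_file))
```

One design choice worth noting: a record may go to several places at once, for example an SDMS for storage and a knowledge graph for secondary use, which is the opposite of a silo.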

Your organization must make very important decisions about how it is going to guard some of its top assets - proprietary data and processes. All R&D organizations need their high-quality science to be reproducible and repeatable. The data and process capture must be exact for this to occur. This means that understanding both the instrument integration AND the “plumbing” or shepherding of the data into the right data repositories is critical.

Let’s consider one example. On the surface, an ELN is keeping track of the researcher’s Idea/Hypothesis through to his or her Conclusion. The ELN captures some artifacts of data including images, files, etc. However, in many cases, the ELN does not support the storage of massive amounts of experimental data. Instead, it records a pointer to this data. This “pointer” strategy prevents application data bloat and encourages the proper storage and curation of experimental data. This is just one example where “data plumbing” design comes into play in many medium to large R&D organizations. Having a platform that you can plug into that captures data, instrument, application, and process integration is a high ROI (return on investment) need.

I have worked in this space for thirty years in a plethora of roles, from individual contributor to leading strategy and teams for many years, and it has become obvious to me that this is a very big problem that will need true teamwork to combat. Many bespoke systems have been built, and maybe some have worked well, but I haven't seen it myself. I believe that we are finally able to solve this Trash Data problem once and for all. You need to partner with companies who are taking a platform approach to get into production as quickly as possible. You need partners who truly understand data diversity, contextualization, and FAIR principles. Every day you don’t have a solution in production, the Trash Data continues to pile higher. The predictions are out there: IBM is estimating a jump to 93% in the short years to come.

The platform needs to be cloud-native to provide the scalability and agility needed for future-proofing. It needs to be enterprise-grade, meeting the security and compliance needs of Life Sciences R&D. It also needs to elegantly handle the complexity of not only the automated collection of data – everyone can do that these days – but also the “plumbing” of the data to and from multiple ELNs, LIMS, SDMS, Knowledgebases/graphs for data science tools, etc. We all know that big pharma is never going to be able to consolidate to one provider across the enterprise. And it needs to harmonize all the data – RAW and processed – and significantly reduce or eliminate Trash Data. The platform needs to automate all repeatable tasks and metadata collection/capture to remove the burden from the scientists and improve data integrity. This is a serious endeavor and one you can't afford to ignore. After all, you don’t know what you don’t know, but even worse, you don’t know what you already should, and you can’t find that experiment or data in your current environment!
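
For readers who want to picture the ELN “pointer” strategy described above, here is a minimal Python sketch. The class names, the example URI scheme, and the idea of storing a checksum alongside the pointer are my own illustrative assumptions, not a description of how any specific ELN works.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DataPointer:
    uri: str         # location of the dataset in an external store (hypothetical scheme)
    sha256: str      # checksum, an added assumption, to detect later changes
    size_bytes: int  # size of the referenced payload


@dataclass
class ElnEntry:
    hypothesis: str
    conclusion: str
    # The entry holds pointers, not the experimental data itself.
    data: List[DataPointer] = field(default_factory=list)


def make_pointer(uri: str, payload: bytes) -> DataPointer:
    """Build a pointer for a payload that is stored outside the ELN."""
    digest = hashlib.sha256(payload).hexdigest()
    return DataPointer(uri=uri, sha256=digest, size_bytes=len(payload))


def verify(pointer: DataPointer, payload: bytes) -> bool:
    """Check that a retrieved payload still matches what the ELN recorded."""
    same_size = len(payload) == pointer.size_bytes
    return same_size and hashlib.sha256(payload).hexdigest() == pointer.sha256


if __name__ == "__main__":
    raw = b"retention_time,area\n1.23,456\n"
    entry = ElnEntry("Compound X elutes early", "Confirmed")
    entry.data.append(make_pointer("sdms://example/run42.csv", raw))
    print(verify(entry.data[0], raw))
```

The ELN entry stays small no matter how large the dataset is, while the pointer keeps the experiment and its data findable together.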
