From Bench Science to Data Sculpture: Crafting Master Data in Laboratory Informatics
- maurinabignotti
- Oct 2
By Steven Bates, Ph.D.

When I worked as a bench scientist, one of the truisms I learned about proteomics (and molecular biology in general) was that they’re “as much an art as a science.” In my experience, the same is true in many respects for Scientific Data Management.
I’ve lately taken to calling myself a Data Sculptor, which is only half tongue-in-cheek. The metaphor already sees some use in software development in the form of wireframes. Like a physical wireframe, an informatics platform’s data model provides the structure for holding information. Instead of living in physical 3D space, this structure exists in a kind of “idea space,” where elements relate to each other by conceptual closeness rather than literal proximity. Despite being non-physical, this informatic sculpture is a sculpture in motion.
Creating a kinetic sculpture that changes over time would require building moving parts, motor components, and other infrastructure on top of the physical wireframe. In a similar way, the next step after deploying a new laboratory informatics system is configuring master data so that it can assimilate user-entered data over time. This is an iterative process, as the sculptor becomes familiar with both the software framework and the structure of the data to be collected. A data ontology emerges as the master data is crafted, bridging the gap between the shape of the sculpture and its behavior. On a large enough scale, this data ontology fits into a larger data ecosystem, and the Data Sculptor might better be called a Data Architect.
The Data Sculptor has the most room for creativity in a paper-to-LIMS transition. I once took responsibility for the master data management component on such a project. The pharma client relied on a couple dozen contract manufacturing and testing organizations for not only production but also QC testing of their drug products and intermediates.
The trickiest portion of designing the master data was determining the number of tests to define, based on how results were allocated. Every criterion on the Product Specifications corresponded to one LIMS result field, but multiple result fields could be grouped into a single test. In some cases the grouping decision was obvious, but not always. Often the decision was guided by which result fields had the same method number indicated by the contract organization, implying they were all tested with the same method on the same run.

As the number of tests was finalized, so were descriptive test names, following a consistent convention where appropriate. For instance, tests of molecular identity and tests of molecular purity each had that classification indicated in the LIMS test name. This process also entailed harmonizing test names, which appeared under as many as four to six synonymous names across the Product Specifications and Certificates of Analysis. With the result fields grouped under appropriately named tests, a similar harmonization was applied to result field names, on a smaller scale because only a subset of result fields presented difficult decisions.
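The grouping and harmonization logic above can be sketched in a few lines of Python. The test names, method numbers, and synonym table here are purely illustrative, not the client’s actual master data:

```python
from collections import defaultdict

# Hypothetical synonym table: names as they appeared on Product
# Specifications and CoAs, mapped to one harmonized LIMS test name.
SYNONYMS = {
    "SEC-HPLC": "Purity by SEC",
    "Size Exclusion": "Purity by SEC",
    "SEC Purity": "Purity by SEC",
    "Peptide Map": "Identity by Peptide Mapping",
    "Peptide Mapping ID": "Identity by Peptide Mapping",
}

def harmonize(name: str) -> str:
    """Collapse synonymous test names to one canonical LIMS name."""
    return SYNONYMS.get(name, name)

def group_by_method(result_fields):
    """Group result fields into candidate tests by the contract org's
    method number, on the assumption that fields sharing a method
    number are run together."""
    tests = defaultdict(list)
    for field_name, method_number in result_fields:
        tests[method_number].append(field_name)
    return dict(tests)

fields = [
    ("Main Peak %", "MTH-014"),
    ("Aggregates %", "MTH-014"),
    ("pH", "MTH-022"),
]
print(group_by_method(fields))
# {'MTH-014': ['Main Peak %', 'Aggregates %'], 'MTH-022': ['pH']}
```

In practice the method-number heuristic was only a starting point; the obvious cases fell out automatically, and the ambiguous ones still required a judgment call.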
Once the master data was organized, I used an entity-relationship diagram (ERD) to plan its configuration in the LIMS, adapting the diagram from its usual role in modeling database structures. Although the dozen or so LIMS entities needed to define tests and spec evaluations corresponded to software objects, the point of view captured in the diagram was that of a user interacting through configuration screens rather than back-end database operations. The ERD highlighted which fields were mandatory for defining an entity, and which fields corresponded to report fields.
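One way to express that configuration-level view in code is with dataclasses, where required constructor arguments play the role of the ERD’s mandatory fields. The entity and attribute names below are illustrative assumptions, not the actual LIMS schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ResultField:
    name: str                        # mandatory for defining the entity
    units: Optional[str] = None
    decimal_places: Optional[int] = None
    report_field: bool = False       # flagged if it appears on reports

@dataclass
class Test:
    name: str                        # mandatory
    method_number: str               # mandatory
    result_fields: List[ResultField] = field(default_factory=list)

@dataclass
class SpecCriterion:
    test_name: str
    result_field: str
    lower_limit: Optional[float] = None
    upper_limit: Optional[float] = None

sec = Test("Purity by SEC", "MTH-014")
sec.result_fields.append(
    ResultField("Main Peak %", units="%", decimal_places=1, report_field=True)
)
```

Omitting a mandatory field raises an error at construction time, which is roughly the guarantee the ERD was meant to make visible to the configurer.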
Aside from test name conventions, it was important to develop naming conventions for LIMS Specs, LIMS Spec Criteria, virtual Samples, and Test Sites that were consistent, concise, and clear. This included standardized abbreviations for products and Test Sites. Stability Study configurations also needed a scheme to name entities according to the storage conditions of the tested samples.
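A naming scheme like that can be made mechanical once the abbreviations are fixed. This sketch assumes a hypothetical convention of product abbreviation, storage temperature, relative humidity, and timepoint; the actual scheme and abbreviations were project-specific:

```python
# Hypothetical standardized product abbreviations.
ABBREVIATIONS = {"Acme Drug Product 100 mg": "ADP100"}

def stability_spec_name(product: str, temp_c: int, rh_pct: int, months: int) -> str:
    """Build a Stability Study entity name from storage conditions,
    e.g. ADP100-25C-60RH-6M for 25 °C / 60% RH at the 6-month pull."""
    abbr = ABBREVIATIONS.get(product, product.upper().replace(" ", ""))
    return f"{abbr}-{temp_c}C-{rh_pct}RH-{months}M"

print(stability_spec_name("Acme Drug Product 100 mg", 25, 60, 6))
# ADP100-25C-60RH-6M
```

Generating names rather than typing them is what keeps a convention consistent across dozens of Specs and storage conditions.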
All of these choices were design decisions about how to structure data that had previously been unstructured on paper. The benefits included standardized naming to aid communication, reduced ambiguity in the types of results expected for report generation, and explicit specification of each field’s decimal precision, among others.
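The decimal-precision point is worth a concrete example: once precision lives in master data, reported values can be formatted from it rather than left to each analyst’s habits. A minimal sketch, with the function name and values assumed for illustration:

```python
def format_result(value: float, decimal_places: int) -> str:
    """Report a numeric result at the precision fixed in master data."""
    return f"{value:.{decimal_places}f}"

print(format_result(98.4567, 1))  # 98.5
print(format_result(7, 2))        # 7.00
```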
The entire process requires close collaboration with the client, who is the final judge of the quality of the delivered solution. But it’s the responsibility of the laboratory informatics expert to provide not only the technical knowledge but also the creativity and artistic flair that can distinguish great software solutions from merely good ones.
