(Key Outlined Priority) Advancing Reproducibility in Machine Learning: A Pathway Towards Open Science
(Additional) Exploring Markov Chain Monte Carlo (MCMC) Methods for Bayesian Inference: The Value of Data Provenance
(Key Outlined Priority) Advancing Reproducibility in Machine Learning: A Pathway Towards Open Science
The reproducibility of experiments has always been a challenge in science. The number of machine learning experiments, including computational simulations, is growing rapidly, and their inherent stochasticity makes replication harder. Addressing this challenge involves enhancing provenance: reproducibility is in decline, and robust data infrastructure is necessary to make it attainable.
Relevant research includes Improving Reproducibility in Machine Learning Research (a report from the NeurIPS 2019 Reproducibility Program). More standards should be developed; a large-scale attempt at this is the W3C's Semantic Web Standards.
Provenance data enhances scientific accountability and offers significant benefits to individual researchers. For instance, consider an ML engineer looking to optimize hyper-parameters for model performance. Tracking "runs" and provenance data makes assessing the impact of different hyper-parameters on outcomes easier, thereby streamlining the fine-tuning process and potentially leading to better results.
How provenance data is displayed is a crucial consideration. First, if the goal of provenance data is to allow for easier experiment reproducibility, those who want to audit existing research should not have difficulty doing so. Second, a researcher should be able to extract the information most valuable for hyperparameter fine-tuning. How usable data can be extracted during a workflow and presented to users as meaningful information is a (personally identified) neglected area. We can look to RDF as a framework for graphing relational (meta)data, and think about the relational nature of hyperparameters across a single experiment and its different runs.
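To make that relational framing concrete, here is a minimal sketch (in Python, using rdflib) of how hyperparameters and outcomes for two runs of a single experiment could be recorded as RDF triples. The namespace and property names are hypothetical, not drawn from any existing provenance vocabulary.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/provenance/")  # hypothetical vocabulary

g = Graph()
experiment = EX["experiment/mnist-baseline"]
g.add((experiment, RDF.type, EX.Experiment))

# Each run hangs off the experiment, carrying its hyperparameters, seed and score.
for run_id, lr, seed, accuracy in [("run-1", 0.01, 42, 0.91), ("run-2", 0.001, 43, 0.93)]:
    run = EX[f"run/{run_id}"]
    g.add((run, RDF.type, EX.Run))
    g.add((run, EX.partOf, experiment))
    g.add((run, EX.learningRate, Literal(lr)))
    g.add((run, EX.rngSeed, Literal(seed)))
    g.add((run, EX.accuracy, Literal(accuracy)))

# Serializing to Turtle gives a human-readable, interoperable view of the runs.
print(g.serialize(format="turtle"))
```

Because runs and their hyperparameters are plain triples, a question like "which runs of this experiment used seed 42?" becomes a simple graph query.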
RelationalAI is an AI coprocessor for "data clouds and language models" and has "relational knowledge graph capabilities."
One of the most common methods for simulating dynamical systems is Markov Chain Monte Carlo (MCMC) simulation, which is inherently non-deterministic: each chain depends on random sampling. Interoperable provenance data can enhance the quality of chain exploration by considering multiple MCMC runs, which can then be compared before calculating effective sample sizes.
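As a hedged illustration of that point, the sketch below (a toy Metropolis sampler targeting a standard normal, not any particular MCMC library) records the seed of each chain so that multiple runs can be regenerated and compared before effective sample sizes are computed.

```python
import numpy as np

def metropolis_chain(seed, n_steps=5000, step_size=0.5):
    # One MCMC chain targeting a standard normal; the recorded seed is the
    # provenance record that makes this stochastic run repeatable.
    rng = np.random.default_rng(seed)
    samples, x = [], 0.0
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal()
        # Metropolis acceptance for a standard normal target.
        if np.log(rng.random()) < 0.5 * (x**2 - proposal**2):
            x = proposal
        samples.append(x)
    return np.array(samples)

# Several runs, each with its seed stored; the chains can later be compared
# (for example, across-run variance) before effective sample sizes are computed.
runs = {seed: metropolis_chain(seed) for seed in (1, 2, 3, 4)}
print({seed: round(chain.mean(), 3) for seed, chain in runs.items()})
```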
A Tribuo-like provenance framework for Jupyter Notebooks. Jupyter notebooks already have interoperable capabilities: a document containing code written in their core programming languages (Python, Julia and R) exists as a JSON file. JSON is a language-independent data format, and its structure means that if the framework could retrieve relevant provenance data once it is in JSON format (for example, through the Jupyter REST API), this data could be visualized and utilized.
The hope is that the provenance data is not only extremely human-readable once transformed into Markdown, but that it can also serve as the basis of interaction between a remote user and the original Jupyter experiment (i.e. wanting to rerun a run that used the nth RNG seed).
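As a rough sketch of what such retrieval could look like, the snippet below reads a notebook file directly as JSON and pulls out a few provenance-relevant fields (kernel, language, code cells and their execution order). The file name is hypothetical.

```python
import json

# A notebook on disk is just JSON, so provenance-relevant fields can be read directly.
with open("experiment.ipynb") as f:
    notebook = json.load(f)

kernel = notebook["metadata"].get("kernelspec", {})
print("kernel:", kernel.get("name"), "| language:", kernel.get("language"))

for cell in notebook["cells"]:
    if cell["cell_type"] == "code":
        # execution_count hints at the order in which cells were actually run,
        # which matters when reconstructing how a result was produced.
        print(cell.get("execution_count"), "".join(cell["source"])[:60])
```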
Technical documentation for data provenance. There currently is no canonical tool for ML engineers who want to better track lineage in their software and experiments, but there are "parts": tools and libraries from which a keen lab or researcher could build their own. Technical documentation would lay out the existing tools, research explaining how to implement them (e.g. Seltzer's paper on MERIT, a tool built on top of the Tribuo provenance framework in Java, and rOpenSci's R packages for the sciences), the fundamentals of provenance and why it matters, etc.
Streamlining and adding robustness to research compendium formation. A research compendium is a "collection of all the digital parts of a project together into a reproducible research package." Currently, we have DOIs for papers, but we don't have persistent identifiers for the data sources and code of any given research project. What would it look like to have DOIs for codebases and data repositories associated with any given project, perhaps a single DOI for an entire compendium with sub-DOIs for each subfile: interoperable across different IDEs, identifiable and shareable (i.e. a DOI could link to a specific GitHub repo, etc.)?
The status quo is static data and code files (training code, evaluation code, (pre)trained models, libraries, etc.) that ignore the stochastic components which already make reproducibility difficult. Changes aren't tracked, which disincentivizes making changes or, more likely, results in researchers not updating journal reviewers.
There should be standardized, informative docs (ReadMe files) that include relevant reproducibility information, alongside documentation and (conference) reproducibility guidelines that establish best practices for reproducibility in machine learning (ML) workflows. Here is an initial line of reasoning that would be specified, explaining the case for TensorFlow over PyTorch.
A program written in PyTorch will yield different results when reproduced in Tinygrad. The framework used should be captured as semantic data so that this information can be easily retrieved across many papers.
TensorFlow's static graph execution model stands out as an ideal choice for standardization due to its structured approach.
With TensorFlow, users define the entire computational graph upfront before executing computations (static execution graph).
This upfront definition offers efficiency, optimization opportunities, and portability across different hardware configurations.
In contrast, PyTorch follows a dynamic computation graph paradigm, allowing for greater flexibility and ease of debugging, but provenance data would be more difficult to capture.
If building a data provenance library, it would be ideal to optimize for TensorFlow due to its structured approach and efficiency.
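As a hedged illustration of why an upfront graph definition is provenance-friendly: in TensorFlow 2, wrapping a computation in tf.function produces a concrete graph that can be serialized and inspected independently of the Python code that built it. The toy computation below is purely illustrative.

```python
import tensorflow as tf

# Toy computation: tracing it as a tf.function means the whole computational
# graph is defined upfront and can be exported as a GraphDef.
@tf.function
def forward(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

concrete = forward.get_concrete_function(
    tf.TensorSpec([None, 4], tf.float32),
    tf.TensorSpec([4, 2], tf.float32),
    tf.TensorSpec([2], tf.float32),
)

# The serialized graph is a static, portable artifact and a natural hook for
# provenance capture, in contrast to a purely dynamic execution trace.
graph_def = concrete.graph.as_graph_def()
print(len(graph_def.node), "nodes in the traced graph")
```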
Dependency specification also impacts results.
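One lightweight way to fold dependency specification into provenance is to record the interpreter and installed package versions alongside each run. The helper below is a hypothetical sketch using only the standard library.

```python
import importlib.metadata
import json
import platform
import sys

def capture_environment(packages=("numpy", "torch")):
    # Hypothetical helper: records the interpreter, platform, and pinned package
    # versions so a run's provenance includes its dependency specification.
    env = {"python": sys.version, "platform": platform.platform(), "packages": {}}
    for name in packages:
        try:
            env["packages"][name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            env["packages"][name] = None  # dependency absent in this environment
    return env

print(json.dumps(capture_environment(), indent=2))
```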
Each "node" or nanopublication would include the (provenance) attributes outlined in the Machine Learning Reproducibility Checklist (NeurIPS 2019 Reproducibility Program); a sketch of such a node follows this list. The attributes include:
details of train/validation/test splits
link to a downloadable version of the dataset or simulation environment (or a sub DOI for data and code)
additional information like annotators and methods for quality control
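A minimal sketch of what one such node might look like as a JSON document; the field names are hypothetical and only loosely follow the checklist attributes above.

```python
import json

# Hypothetical nanopublication-style node for one experiment; field names are
# illustrative, loosely following the checklist attributes listed above.
node = {
    "splits": {"train": 0.8, "validation": 0.1, "test": 0.1},
    "dataset": {
        "doi": "10.0000/placeholder-data-doi",        # sub-DOI for the data
        "download_url": "https://example.org/dataset.tar.gz",
    },
    "code": {"doi": "10.0000/placeholder-code-doi"},   # sub-DOI for the code
    "annotation": {
        "annotators": 3,
        "quality_control": "double annotation with adjudication",
    },
}
print(json.dumps(node, indent=2))
```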
In science, the ability to replicate experiments and achieve consistent results is crucial for maintaining scientific integrity and credibility. But with the rise of machine learning experiments, especially those using computer simulations, this is getting harder: these experiments have built-in randomness, making it tough to repeat them exactly the same way each time. Reproducibility matters for accountability, but it can also make model training easier and more precise, supporting capabilities in a world where algorithmic creativity will be increasingly important for breakthroughs.
This literature explores a more straightforward way of sharing scientific information. By combining computer interaction, provenance mapping, and hypermedia, the goal is to make research more straightforward, accessible and reproducible: the foundations of a scientific ecosystem that can be held accountable for its research. The field is nascent, without even a name to label it. This is a scoping literature review aiming to identify future research directions and capture the current state of the literature.
Scientific data and how it’s organized and leveraged have not gotten the attention they deserve, especially as the domain sciences become increasingly computational.
Additionally, in light of developments in software (AI) and tools that aim to support scientific discovery, we should be thinking more about how to make our data infrastructure as good as possible so we can take full advantage of what automation can offer. There are lessons from interoperability and provenance worth considering.
The scientific process generates a lot of data, from experimental data to patent data. There's a lack of standardization in how we capture, organize and share it, and that should change.
Relevant literature is broad but disparate, and the field is nascent. Hopefully, this can be a starting point for more pragmatic conversations.
An excerpt from a paper summary included in this literature review:
If we can't find good ways to share hidden knowledge in chemical procedures, communicating and checking experiments will be hard. This could prevent chemistry from reaching its full potential, and this fact generalizes to the domain sciences at large, establishing the need for more work in this research area. Provenance is crucial because it helps track and verify the origins of information, ensuring reliability and building trust in scientific advancements.
A lack of coordinated data management impedes scientific progress. We can't collaborate as efficiently, we can't track what's being worked on now or what's been worked on in the past, and data science initiatives aren't as well resourced as they could be.
There’s a shift towards open science, and UNESCO has published their recommendations. Ultimately, there are policy concerns regarding dual-use technologies and effects on commercialization and proprietorship when it comes to an open approach to data. However, laying out how to approach data management can make conversations about open science more productive.
Another paper does a good job of distilling the essence of provenance.
In an ideal world, software systems would be engineered to the highest standards. Programs would be expressed in intuitive high-level languages and their behavior would be checked against clean formal specifications. Data would be classified according to precise schemas and curated with accurate metadata. But we do not live in that ideal world. Few real-world systems meet these lofty standards. In practice, programs are often built on top of legacy code and dirty data containing errors, omissions and outright lies.
Any computer system can fail in a number of ways: there can be a hardware fault, software bug, malicious user, or simple human error. When a failure occurs, we need to know what happened, how the failure occurred, who was involved, who was to blame, and how to safeguard against similar failures in the future. Conversely, if a system is being used to make decisions upon which significant resources or lives depend, then it is important for the process leading to a particular decision to be transparent, comprehensible, and persistent. Such decisions need to be justified by explicitly showing how the results were derived, what assumptions or approximations were used, who was involved, who deserves credit, and how to reproduce the results in different circumstances.
A form of provenance is often motivated as
being “complete”, or “(more) accurate (than another)”,
as capturing information about “(causal) dependences”, “influences”, “sources”, “relevance”,
or as being an "explanation", "justification", or "evidence".
Interoperability refers to the ability of (computer) systems or software to connect and exchange information easily and without restrictions. Semantic interoperability extends this principle and refers to the ability of computer systems to exchange data unambiguously.
Researchers and the public can comprehend how a particular project fits into the broader landscape of existing research.
Having experiments and findings be verifiable, thus holding researchers accountable for their work and findings.
Making provenance a priority means we get better (and more) documentation in research. Aspects of the research process deemed tacit could become explicit.
Looking to provenance and interoperability, research areas rooted in computer science, there is an opportunity for cross-pollination; distilling current research can reveal potential synergies with broader (scientific) research. Today, computational methods are increasingly applied across disciplines, already blurring the lines.
-------
MERIT is a computational reproducibility system built on Tribuo. [2] To make ML systems more reproducible, its authors engineer ways to make research data (especially in computational fields) more interoperable.
Tribuo “is designed for compile-time, type-safe ML computation, and the models, datasets and evaluations produced via Tribuo are self-describing through the incorporation of provenance.” The main provenance mechanism it supports is the ability to re-train a model and modify its parameters, tracking developments along the way.
Models Easily Reproduce in Tribuo (MERIT) uses provenance from the Tribuo ML Library written in Java, making it possible to reproduce a model with model objects stored through the system.
Although MERIT is intended for ML research, the components and concepts can extend to the domain sciences. Today, other fields are increasingly taking advantage of ML techniques, so learnings from the paper stay especially relevant.
Tribuo creates self-describing objects which support provenance and the efficiency of development itself. Therefore, although the system is not explicitly intended for interoperability, it can be built upon and integrated into machine learning experiments, incorporating provenance into a researcher's workflow.
Monte Carlo (MC) simulations are “experiments” with an inherent stochastic component and are used across several disciplines.
MERIT works as a wrapper over an existing system, developing a snapshot of its present state by collecting information about its "hyperparameters, RNG seeds and code." Existing tools like this, such as DeepDIVA, dagger, and dtoolAI, all serve a similar function of creating snapshots for libraries such as MXNet, SparkML, scikit-learn, and PyTorch.
Example hyperparameters include: [3]
Learning rate is the rate at which an algorithm updates estimates
Learning rate decay is a gradual reduction in the learning rate over time to speed up learning
Momentum is the direction of the next step with respect to the previous step
Neural network nodes refers to the number of nodes in each hidden layer
Neural network layers refers to the number of hidden layers in a neural network
Mini-batch size is training data batch size
Epochs is the number of times the entire training dataset is shown to the network during training
Eta is step size shrinkage to prevent overfitting
In hyperparameter tuning, various methods are available for fine-tuning, including Bayesian optimization, grid search, and random search.
Storing provenance data and making it usable, allowing models to be better adapted, is well suited to grid search optimization. Traditionally, grid search is very computationally intensive, with O(n²) time complexity at best, and the cost grows further with each additional hyperparameter. Although better control of hyperparameters through provenance may not change that, it does mean grid search could become a less tedious task. A developer who can better understand the causal relationships in their models may be able to make changes more accurately and effectively, which in turn likely improves computational efficiency by a variable amount.
The same goes for random search, where groups of hyperparameters are randomly selected until a model is "fit." Amazon SageMaker (the machine learning IDE) currently facilitates model tuning and replication, but it does not prioritize preserving changes over time or enabling easy navigation of such data.
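To illustrate (a minimal sketch, not tied to SageMaker or any particular tuning library): logging each grid-search run's hyperparameters, seed and score as a provenance record makes it straightforward to revisit, compare, or re-execute any single run.

```python
import itertools
import json
import random

def train_and_score(learning_rate, batch_size, seed):
    # Stand-in for real training: a noisy score that depends on the settings.
    rng = random.Random(seed)
    return 1.0 / (1.0 + learning_rate * batch_size) + rng.uniform(-0.01, 0.01)

grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 64]}
provenance_log = []

for i, (lr, bs) in enumerate(itertools.product(*grid.values())):
    seed = 1000 + i  # one recorded seed per run
    score = train_and_score(lr, bs, seed)
    provenance_log.append(
        {"run": i, "learning_rate": lr, "batch_size": bs, "seed": seed, "score": score}
    )

# Any entry in the log carries enough provenance to rerun exactly that configuration.
best = max(provenance_log, key=lambda run: run["score"])
print(json.dumps(best, indent=2))
```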
In Tribuo we built our provenance system to make our models self-describing by which we mean they capture a complete description of the computation that produced them, solving the first issue. In v4.2, we added an automated reproducibility system that consumes the provenance data and retrains the model. As well as the reproducibility system, we added a mechanism for diffing provenance objects, allowing easy comparison between the reproduced and original models. This is because the models are only guaranteed to be identical if the data is the same, and any differences in the data will show up in the data provenance object.
Tribuo has core objects, and it tracks each of their creations. Multiple objects can point to the same provenance object. Since the objects are immutable, this is more efficient than duplicating the provenance.
Core Object Name | Provenance Object Utility
--- | ---
Data Source |
Dataset |
Trainer |
Model |
Evaluation |
Ultimately, MERIT works on top of this existing system, attempting to solve the main problems with reproducibility outlined below.
MERIT’s merit is being able to control for sources of non-determinism (like when a model is inherently stochastic through its optimization methods or structure itself) in a “multi-threaded environment and exposing the training differences between two models in a human-readable form.”
C.1 Obtaining a complete record of a model’s provenance.
Context: Language-independent formats like JSON and XML are used for provenance, making them interoperable and readable.
Issue: Code changes over time, posing a challenge due to evolving APIs, programming language syntax, etc.
Note: Although data for an object is immutable, the challenge lies in preserving provenance despite code modifications.
C.2 Re-instantiating correctly configured objects.
Benefit: Language-independent formats are beneficial for domain/biomedical scientists.
Challenge: Code changes may impact computational science work.
Observation: Immutable data objects persist, but adapting to evolving code is challenging.
C.3 Addressing non-determinism due to RNGs and parallelism
Focus: Understanding implications when working with RNGs.
MERIT Solution:
Each trainer object has an RNG requiring correct initialization.
Provenance stores RNG seeds.
Internal RNG state changes with each use.
Example: A Java program training multiple models with a single seed and different initial states, using an invocation count to set the RNG state for each model.
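A rough Python analogue of that seed-plus-invocation-count mechanism (MERIT implements this in Java on top of Tribuo; the sketch below only illustrates the idea):

```python
import numpy as np

def rng_for_model(seed, invocation_count, draws_per_model=3):
    # Recreate the RNG state used for the k-th model trained in a single program
    # run by replaying the draws consumed by the models trained before it.
    rng = np.random.default_rng(seed)
    rng.random(invocation_count * draws_per_model)  # fast-forward the state
    return rng

# Original run: three models trained from one seed, each consuming three draws.
seed = 42
original = [rng_for_model(seed, k).random(3) for k in range(3)]

# Reproduction: rebuilding only the second model's RNG state from seed + count.
reproduced = rng_for_model(seed, 1).random(3)
assert np.allclose(original[1], reproduced)
print("second model reproduced from seed and invocation count")
```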
Ultimately, it seems valuable to reinforce the norm of publicly accessible JSON/XML files for research that leverages machine learning, even when ML does not constitute the project's entire scope. However, different domains have their own reproducibility methods (e.g. psychology): parts of their work risk being left unchecked if they use ML.
(Listing 1, from paper) An example of model's provenance and reproduced model's provenance
// model's provenance
"trained-at" : {
    "original" : "2021-09-30T19:54:02.09",
    "reproduced" : "2021-09-30T19:54:03.83" // notice the time change
}

// reproduced model's provenance
"trained-at" : {
    "original" : "2021-09-30T19:54:02.09",
    "reproduced" : "2021-09-30T19:54:03.83"
}

The Tribuo API is provided to MERIT through a new Java class called ReproUtil. ReproUtil takes the Tribuo model object (provenance objects) as inputs and "produces a new model, trained identically to the original, as an output." These outputs consist of the reproducible RNGs that solve the provenance issues when dealing with models with some stochastic component.
MERIT takes the RNG seeds stored with Tribuo and ensures that a program training multiple models can replicate the training for each seed, making each run of an experiment replicable. Therefore, MERIT separates training and testing; training the new models is left to the user. MERIT then compares the new and original models, providing their timestamps and invocation counts.
Random number generators are a key component of (large-scale) agent-based modelling, which is particularly valuable in the life sciences, where computational approaches are becoming more mainstream. Although the life sciences have good norms for reproducibility, there is no reason to let those norms slip as new technology is taken up, for instance when modelling biological systems such as the immune system.
These simulations will often take the form of Markov Chain Monte Carlo Simulations which are common for the modelling of many dynamical systems across disciplines.
The paper Comprehensive benchmarking of Markov chain Monte Carlo methods for dynamical systems outlines the challenges that come with not having enough information about parameters.
For most ODE-constrained parameter estimation problems, information about the identifiability properties of parameters will not be available prior to the sampling. This is unfortunate as the sampling performance of all methods could be improved by exploiting such additional information. Models with parameter interchangeabilities such as (M1) are well studied in the context of mixture models. Tailored methods for such problems include post-processing methods or a random permutation sampler [72, 73]. We evaluated the benefit of applying a post-processing strategy for this simple ODE model. Researchers found that having access to information about the number and location of the posterior modes improved the sampling performance significantly for all sampling methods (see Additional file 1: Section 4). The presented results highlight the need to address chain exploration quality by considering multiple MCMC runs, which can be compared with each other before calculating effective sample sizes.
[4] ImmSim [17] is a framework based on cellular automata where entities interact with each other and diffuse through lattice sites. In this model, individuals consider possible interactions based on the given probability rule. The framework was developed in APL2, which due to language constraints limits the scale of simulations executed. Later, a parallel version of ImmSim, C-ImmSim [26], was developed with a focus on scalability and performance. C-ImmSim is an advanced immune system simulation based on ImmSim with added features that allow simulations at the cell and molecule levels. The framework exploits task parallelism on distributed computers to reduce simulation runtimes and enable larger-scale simulations.
In Epstein and Axtell’s book — Growing Artificial Societies: Social Science from the Bottom Up, they describe the use of randomness for mathematical epidemiological modelling through the Agent disease transmission rule, where
For each neighbor, the disease that currently afflicts the agent is selected at random and given to the neighbor.
When modelling immunological and epidemiological systems, infection and transmission have a stochastic component (generating different seeds). Therefore, a scientist using ImmSim or an epidemiologist testing a model cannot necessarily rerun their experiments by default, but provenance tools like Tribuo can change that.
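A minimal sketch of that idea, assuming a toy version of the transmission rule quoted above: recording the seed used for the stochastic step is what makes the run replayable.

```python
import random

def transmit(agent_diseases, neighbor_diseases, rng):
    # Toy version of the Agent disease transmission rule (after Epstein & Axtell):
    # one of the agent's current diseases is selected at random and given to the neighbor.
    disease = rng.choice(agent_diseases)
    neighbor_diseases.add(disease)
    return disease

# Storing the seed in the run's provenance makes this stochastic step repeatable.
seed = 20240101  # hypothetical seed recorded alongside the results
rng = random.Random(seed)
neighbor = set()
print(transmit(["flu", "measles"], neighbor, rng), "transmitted; neighbor now has", neighbor)
```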
MERIT also has unique adaptations when users want to use provenance to optimize for hyperparameters. Still, this use case falls outside the scope of reproducibility - more information about this is available in the paper.
Looking to the future, calls for standardization can help make Tribuo and MERIT more effective, serving the goal of good scientific norms: improving research. Standardization helps because technological environments differ from scientist to scientist, from experiment to experiment, and from run to run. Either more programs are developed to account for this, or there is significant convergence such that Tribuo and other systems can focus on advancing their own capabilities rather than trying to serve diverse external libraries and environments.
The impacts of provenance are project-specific and help uphold scientific integrity. But there is a question of how to weave research together, which comes back to the definition of interoperability: how information connects and remains navigable. Ted Nelson's Literary Machines delves into presentational mediums. [5] Hypermedia is a key presentational medium within this, and emphasizing it as a way to present (provenance) data could be valuable.
Computer storage and displays that are “hyperlinked” can help support the non-sequentialist scientific process.
When we say "branching," we're referring to having different things returned to you (results, additional queries, etc.) based on how you navigate a system. Within science, perhaps the most relevant branching form is represented as a "hypertext."
The hypertext is also referred to as "non-sequential writing." This is especially important in science because the ways ideas and concepts interconnect are not linear. And forcing them to be is limiting. Tangents and the respective rabbit holes that come from them are the backbones of the creative scientific process.
The hypergraph is already being experimented with in personal computing (i.e. the GraphOS). When a document is created with hyperlinks within, these serve as references to other documents; this is a process known as transclusion. When visualized, you get a network of documents as a node-link graph.
Provenance itself can exist as a hypergraph. Ultimately, when we look at Tribuo, analyzing objects without optimized tooling is difficult, and visualization is a helpful tool. The Provenance Map Orbiter has applications in ML, but the insights generalize to large datasets consisting of papers. [6]
Node-link diagrams allow for a dynamic representation of data but are difficult to navigate comfortably when the number of nodes is large.
Therefore, to reduce clutter and allow a user to be precise in what they want to look at, the diagrams can be adapted to only show relevant nodes to the user, which is decided using a filter. However, the queries, views or filters used depend on what the user inputs, and sometimes it’s valuable to have a full view of the data or, in this case, the provenance objects.
Seltzer's paper describes the Provenance Map Orbiter, a tool allowing whole-graph exploration through semantic zoom. [7] Interaction designer Alexander Obenauer describes semantic zoom as allowing us to get a "different vantage point on the data that we care about."
This “undulant interface” was made by John Underkoffler. The heresy implicit within [1] is the premise that the user, not the system, gets to define what is most important at any given moment; where to place the jeweler’s loupes for more detail, and where to show only a simple overview, within one consistent interface. Notice how when a component is expanded for more detail, the surrounding elements adjust their position, so the increased detail remains in the broader context. This contrasts sharply with how we get more detail in mainstream interfaces of the day, where modal popups obscure surrounding context, or separate screens replace it entirely. Being able to adjust the detail of different components within the singular context allows users to shape the interfaces they need in each moment of their work.
Through Semantic Zoom, users can scale large graphs, which is more navigable than the alternative.
Semantic Zoom is not an alternative to traditional node-link diagrams but instead manipulates their structure. There are primary nodes, which could be a specific organization with datasets for model fine-tuning or an experiment with provenance objects collected as it’s been run.
From there, primary nodes are treated as processes and summary nodes are constructed for each process (primary node).
Ultimately, the control flow takes on a tree-like structure that can be visually displayed.
Another property of semantic zoom is that graphs are developed incrementally as summary nodes are expanded when a process node has been interacted with.
Views allow the steps of the process to be viewed in a tree-like timeline.
There are general interactive functionalities like clicking on a node (which displays a modal with more information), searching by node attributes, providing summaries of files and processes, and version-collapse functionality.
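A minimal sketch (not the Orbiter's implementation) of the collapse-and-expand behaviour described above, where each process starts as a summary node and is only expanded into its steps when the user interacts with it:

```python
class SummaryNode:
    def __init__(self, name, steps):
        self.name = name
        self.steps = steps      # hidden detail nodes for this process
        self.expanded = False

    def expand(self):
        self.expanded = True
        return self.steps

def render(nodes):
    # Collapsed nodes show only an overview; expanded ones show their steps in place.
    for node in nodes:
        if node.expanded:
            print(f"{node.name}: " + " -> ".join(node.steps))
        else:
            print(f"{node.name} [+] ({len(node.steps)} steps hidden)")

graph = [
    SummaryNode("load-dataset", ["read csv", "clean", "split"]),
    SummaryNode("train-model", ["init RNG", "fit", "serialize"]),
]
render(graph)      # everything collapsed: a navigable overview
graph[1].expand()  # the user clicks the training process
render(graph)      # detail appears incrementally, surrounding context intact
```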
Scientists are required to imagine (and defend against) all the possible ways their experiments might fail or might mislead us into drawing flawed conclusions. Accordingly, laboratory scientists have developed careful record-keeping practices that make it easier for a scientist to satisfy herself (and to convince others) that her work is valid, repeatable, and accurate. These records anticipate scientists’ need for both success validation and failure recovery in the course of responsible conduct of research. [8]
As mentioned above, provenance could help with the replicability of experiments, particularly computational simulations. A paper titled Semantic interoperability and characterization of data provenance in computational molecular engineering explores this possibility, using computational molecular engineering as an example. It specifically explores semantic interoperability: exchanging (provenance) data while simultaneously sharing its meaning. This calls for developing a shared vocabulary (a common grammar or ontology) that remains constant irrespective of the system, counteracting terms that would otherwise be "polysemous."
All knowledge existing in a hypergraph seems utopian, but it could improve the searchability of existing and future research and allow research to be displayed in more interesting ways, as seen in a 2018 NeurIPS paper (Recurrent World Models Facilitate Policy Evolution) and the machine learning publication Distill.
Today, XML (Extensible Markup Language) and RDF (Resource Description Framework) are the current standards for maintaining semantic interoperability on the web.
It may be necessary to develop entirely new standards for use cases that exist outside of traditional applications like Internet protocols and networked devices. The types of data a researcher in the (life and physical) sciences works with are likely not in a format that makes it possible to rely solely on XML and DITA (Darwin Information Typing Architecture). This problem is encapsulated by the idea of domain-specific Document Type Definitions (DTDs) [9], outlined in a paper entitled Semantic Interoperability on the Web. [10]
Still, semantic interoperability is not a lost cause. Some interventions include:
XSLT (XSL Transformations) stylesheets, which allow documents to be translated from one format to another (as they interact with one another). XSLTs are a type of canonizer; they perform canonicalization (also known as standardization or normalization), by which data is converted to a canonical, standard form. [11]
This is computationally intensive, with resource usage on the order of O(n²). A program that does this would likely be very slow, unusably so, especially for large documents. A minimal sketch of such a transformation follows.
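The sketch below uses lxml to apply a tiny XSLT stylesheet; the element and attribute names are hypothetical stand-ins for two provenance formats being translated into one another.

```python
from lxml import etree

# Hypothetical stylesheet: translate a <run rng-seed="..."> record into an
# <experiment><seed>...</seed></experiment> record (canonicalization).
xslt = etree.XML(b"""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/run">
    <experiment>
      <seed><xsl:value-of select="@rng-seed"/></seed>
    </experiment>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt)
doc = etree.XML(b'<run rng-seed="42"/>')
result = transform(doc)
print(str(result))  # the same provenance, re-expressed in the target format
```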
RDF and RDF Schema are data models for metadata instances. In RDF Schema, nodes (for example, provenance nodes) are linked together by labelled arcs that indicate the properties of the nodes. XML syntax is still needed to leverage RDF.
In a world where all research papers were in web-parsable formats (JSON, HTML, etc.) or there was easy conversion to such formats, SHOE could be an interesting tool to leverage. This language enables Semantic Interoperability on the Web.
SHOE (Simple HTML Ontology Extensions) is an ontology-based knowledge representation language designed for the Web.
Originating in 1996, SHOE anticipated many features later incorporated into XML and RDF, aiming to imbue web content with ontological semantics.
Language Syntax and Structure:
SHOE syntax, an application of SGML, extends HTML DTD, allowing ontological elements to be embedded within web pages.
SHOE ontologies, publicly accessible on web pages, define categories, relations, and other components, fostering semantic clarity and interoperability.
Ontology Features:
Categories, akin to RDF classes, enable hierarchical taxonomies for organizing concepts, fostering structured knowledge representation.
Relations, n-ary predicates, describe object properties, enhancing expressiveness compared to RDF's binary properties.
Interoperability and Extensibility:
SHOE facilitates ontology extension and ontology renaming, promoting interoperability between ontologies and enabling domain-specific vocabulary usage.
Inference rules in SHOE allow for the deduction of implicit knowledge, enhancing the richness of semantic representations.
Implementation and Tool Support:
Annotating web pages with SHOE knowledge is facilitated by tools like the Knowledge Annotator, streamlining the process of adding semantic markup.
Tools like Exposé aid in querying and indexing SHOE-marked pages, leveraging a web-crawler approach to gather and store SHOE knowledge in a knowledge base.
Querying SHOE Knowledge:
Accessing SHOE knowledge is facilitated by query tools like SHOE Search, enabling structured queries against ontologies to retrieve relevant web content based on semantic criteria.
A promising future research avenue lies in integrating provenance-aware incremental computation techniques into adaptive data processing systems. Traces are the key feature: "intermediate" data structures that capture how the output was made from the input. Therefore, when an input changes, we can efficiently propagate the effects by replaying only part of the trace, an idea formalized in bidirectional transformations (computations).
A lens can be loosely defined as a program that captures the intermediate data, or trace: read from left to right, it shows the mapping from input to output. In the opposite direction, "it denotes an 'update translator' that takes an input together with an updated output and produces a new input that reflects the update." [12]
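A minimal sketch of a lens as a get/put pair in Python; the record fields are hypothetical, and Boomerang or quotient lenses provide much stronger guarantees. This only illustrates the round-trip idea.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Run:
    seed: int
    learning_rate: float
    notes: str

# get: project the "view" a collaborator edits (reading left to right).
def get(run: Run) -> dict:
    return {"seed": run.seed, "learning_rate": run.learning_rate}

# put: given the original input and an updated view, produce a new input that
# reflects the update while preserving fields the view does not mention.
def put(run: Run, view: dict) -> Run:
    return replace(run, seed=view["seed"], learning_rate=view["learning_rate"])

original = Run(seed=42, learning_rate=0.01, notes="baseline")
view = get(original)
view["learning_rate"] = 0.001          # collaborator updates the output
updated = put(original, view)           # the change is translated back to the input
assert updated.notes == "baseline"      # untouched data survives the round trip
print(updated)
```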
Given the shortcomings of XML outlined above, future research can explore the development of meta-programming languages that make this possible. Boomerang is an example of a language in which "well-behaved bidirectional transformations" can be written. Once again, when data formats do not fit into these clear boxes, more intricate languages that capture even more nuance are needed.
Research efforts could focus on developing optimization algorithms that prioritize parts of the computation based on provenance traces, designing mechanisms for dynamic adaptation of the computational pipeline, exploring efficient management and querying of provenance information in distributed systems, and integrating bidirectional transformation techniques for reversible data transformations.
A policy focus on ecosystem building is important because data infrastructure should be a government priority. There has been a focus on artificial intelligence governance, and 2023 was the Year of Open Science in the Biden-Harris administration. However, there is a disconnect between scientists and the government.
A Piece by the Federation for American Scientists: Develop A Digital Technology Fund To Secure And Sustain Open Source Software
Data infrastructure within the United States (and internationally) is relatively neglected compared to other scientific research priorities. Research infrastructure matters because it directly enables researchers to be more effective and efficient. So, what systems can we aim to improve and implement to make research faster and better?
As part of the Federal Year of Open Science (2023), the NSF launched a funding opportunity through the Geosciences Open Science Ecosystem. [13] The solicitation is aimed at geoscientists working on projects to improve data norms within the geosciences, and at researchers building resources that contribute to the ecosystem of tools helping research data better align with "foundational principles for open science," including
FAIR Guiding Principles for scientific data management and stewardship (Findable, Accessible, Interoperable, Reusable), the CARE Principles for Indigenous Data Governance (Collective Benefit, Authority to Control, Responsibility, and Ethics), and the TRUST Principles for digital repositories (Transparency, Responsibility, User focus, Sustainability, and Technology), as well as Reproducibility and Replicability (see NSF 23-018, “Dear Colleague Letter: Reproducibility and Replicability in Science”).
The NSF (National Science Foundation) is a major funder and steward of science and social science in the United States and abroad.
In 2022, the NSF committed $38 million to the Research Data Ecosystem (RDE): A National Resource for Reproducible, Robust and Transparent Social Science Research in the 21st Century through its Mid-scale Research Infrastructure program, which is delegating $150-$200 million over the next five years (starting in 2022(?)) to projects that serve this agenda.
This initiative serves as a starting point for public agencies taking a stance on the state of data within science, making room for even more projects to improve the ecosystem.
The NSF questions surrounding data standards include:
How to make research data interoperable.
How to make research data accessible.
How to make research data and relevant processes transparent.
How to make research and data-sharing processes efficient.
How to ensure confidentiality protection while prioritizing data accessibility.
The main goal is to help data be FAIR (Findable, Accessible, Interoperable and Reusable).
The key deliverable for the project is a Research Data Description Framework, alongside an "integrated suite of software" developed at the University of Michigan. The project will explore data archiving and software solutions to improve the state of data in research, covering its accessibility, organization, analysis methods and how academics contribute to data. Its estimated completion is January 2027.
In 2022, UNESCO published a Recommendation on Open Science. [14]
UNESCO is an international body, and scientific research is largely governed at the institutional and national levels. Therefore, their recommendations serve a greater purpose of drawing attention to what could be and giving frameworks for future regulation. This is a product of the reality that large governing bodies can’t make concrete proposals due to different legislation across (member) states.
UNESCO outlines what they hope to get out of their recommendations.
Open Science means that there’s more scientific collaboration, the sharing of research is easy, and information is accessible, available and reusable to those in and out of the scientific research community.
They define Open Scientific Knowledge as “open access to scientific publications, research data, metadata, open educational resources, software, and source code and hardware that are available in the public domain or under copyright and licensed under an open licence that allows access, re-use, repurpose, adaptation and distribution.”
Abstract—We present a toolchain for computational research consisting of Sacred and two supporting tools. Sacred is an open source Python framework which aims to provide basic infrastructure for running computational experiments independent of the methods and libraries used. Instead, it focuses on solving universal everyday problems, such as managing configurations, reproducing results, and bookkeeping. Moreover, it provides an extensible basis for other tools, two of which we present here: Labwatch helps with tuning hyperparameters, and Sacredboard provides a web-dashboard for organizing and analyzing runs and results.
Labwatch integrates a convenient unified interface to several automated hyperparameter optimizers, such as random search, RoBO, and SMAC. Sacredboard offers a web-based interface to view runs and organize results.
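A minimal sketch of the Sacred pattern described in the abstract; the experiment name, config values and storage directory here are hypothetical.

```python
from sacred import Experiment
from sacred.observers import FileStorageObserver

ex = Experiment("toy_experiment")                 # hypothetical experiment name
ex.observers.append(FileStorageObserver("runs"))  # bookkeeping: configs, seeds, results on disk

@ex.config
def config():
    learning_rate = 0.01   # captured automatically as this run's configuration
    epochs = 3

@ex.automain
def main(learning_rate, epochs, _seed):
    # Sacred injects config values (and a managed seed) into the main function,
    # so every run is recorded with the settings that produced it.
    print(f"training for {epochs} epochs at lr={learning_rate}, seed={_seed}")
    return 0.9  # stand-in for a final metric, stored with the run
```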
Solid is a project by Tim Berners-Lee, the inventor of the World Wide Web, HTML, and HTTP. [15] The Solid Project is a specification (outlining what a system should do) that "allows individuals and groups to store their data securely in decentralized data stores called Pods." Imagine you had "secure web servers" for your data.
Bits of information are retrievable at different instances and hosted on the cloud. These are pods, and the implications of this seem interesting.
The project’s key features include:
Handling various forms of data.
Individuals can access all the data they add to their “pods.”
To keep and get information in your Pod, apps use common, open, and compatible ways of organizing and sharing data.
It has relatively fleshed-out authentication and authorization systems.
Data Provenance Explorer (Cohere)
The Data Provenance Explorer evaluates text datasets, sourcing and displaying their lineage and other information such as where a dataset was sourced from, its licensing, and its creation heritage, all of which are constituents of data provenance. The Explorer is currently one of the largest audits of AI datasets. [16]
Here, provenance goes beyond transparency and making datasets more useful; it also helps mitigate the costs of training models on datasets that should not be trained on in certain use cases.
For example, WizardCoder (Luo et al., 2023) was licensed for commercial use but was trained on OpenAI's data even though OpenAI data is commercially prohibited (Arstechnica, 2023). StabilityAI and MPT-Storyteller (Frankie, 2023) have encountered copyright lawsuits after license revisions post-public release. Having provenance data that is public and navigable can prevent situations like these from occurring and the financial costs that come from them.
The Explorer focuses on the open-source fine-tuning data repositories largely used for the largest foundation models. The truly largest models keep their data closed; still, auditing publicly available datasets is valuable, and the project explores the prospect of these norms extending more widely.
“Many of the open source LLMs trained from data on these aggregators have an extremely high proportion of missing data licenses (“Unspecified”), ranging from 72 to 83 percent. This compares to 30 percent missing licenses with our annotation protocol, which categorizes licenses for datasets based on legal guidance.”
As an extension of the Provenance Map Orbiter proposed by Seltzer, the Data Provenance Explorer puts the proposed ideas into practice with an interactive UI and filtering (which Seltzer's paper argues against). The project is also open source, allowing for community contributions, accountability and accessibility.
Sensemaking Networks is a project at the Astera Institute led by Ronen Tamari (link to Making sense of science: open access science needs open access to scholarly sensemaking data). It leverages AI and semantic search while abiding by FAIR data principles.
It works by “embedding nanopublishing in social networks.” By improving search, readers can engage with the highest value content to read.
Non-traditional publications are hard to discover. An example is the paper "World Models: Can agents learn inside of their own dreams?", which exists in an interactive format on the web but was also presented at NeurIPS 2018. Such works (largely) do not appear on traditional research search engines (e.g. Google Scholar), where the vast majority of research resides. Papers that exist entirely outside the traditional realm but contain valuable insights are often neglected, and researchers miss out because of it.
This system builds on nanopublication, which focuses on capturing and utilizing provenance data in a concise format. Nanopublications serve as condensed versions of papers, presented within a Resource Description Framework (Named Graph) for easy navigation.
Each paper contains various key pieces of information, which users may search for based on their needs. These information units are categorized into searchable "bits":
Concept: The smallest, unambiguous unit of thought, uniquely identifiable.
Triple: A tuple of three concepts (subject, predicate, object).
Statement: A uniquely identifiable triple.
Annotation: A triple where the subject is a statement.
Nanopublication: A collection of annotations referring to the same statement, containing a minimum set of agreed-upon annotations within the community.
S-Evidence: All nanopublications referencing the same statement.
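A minimal sketch of those units as plain data structures, with names and fields chosen only to mirror the definitions above rather than any particular nanopublication implementation:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Concept:
    uri: str                                      # uniquely identifiable unit of thought

@dataclass(frozen=True)
class Statement:
    id: str
    triple: Tuple[Concept, Concept, Concept]      # (subject, predicate, object)

@dataclass(frozen=True)
class Annotation:
    subject: Statement                            # a triple whose subject is a statement
    predicate: Concept
    obj: Concept

@dataclass(frozen=True)
class Nanopublication:
    annotations: Tuple[Annotation, ...]           # annotations about the same statement

# Hypothetical identifiers, for illustration only.
has_provenance = Concept("http://example.org/hasProvenance")
stmt = Statement(
    "stmt-1",
    (Concept("ex:DatasetX"), Concept("ex:trainedWith"), Concept("ex:Seed42")),
)
note = Annotation(stmt, has_provenance, Concept("ex:ExperimentLog7"))
pub = Nanopublication((note,))
print(len(pub.annotations), "annotation(s) about", stmt.id)
```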
The Semantic Web Publishing ontology is explained by relating a Named Graph to an entity, establishing the relationship between data and metadata.
Nanopublications: A Growing Resource of Provenance-Centric Scientific Linked Data
Making sense of science: open access science needs open access to scholarly sensemaking data.
This paper outlines a "universal chemical programming language (χDL)," [17] which compresses chemical protocols into around 50 lines of code per chemical reaction, including "reductive amination, ring formation, esterification, carbon-carbon bond formation, and amide coupling." It is a computer- and human-readable markup language that works across chemical hardware and is compatible with many platforms. Chemical results are stored with seeds that can be shared with other researchers in different locations. With more security, scientific workflows can be effectively reduced by leveraging robotic systems that can sustain yields of 90% per step. Chemistry, like other sciences, is seeing an increase in the use of machine learning in workflows. Still, there is no open standard for recording and encoding experiments, whether or not they succeed. When machine learning is used, there is a question of how to reproduce experiments, reproducibility being an important measure of scientific validity.
Communicating and checking experiments will be hard if we can't find good ways to share hidden knowledge in chemical procedures. This could prevent chemistry from reaching its full potential, and this fact generalizes to the domain sciences at large, establishing the need for more work in this research area.
This survey hopes to scope scientific research characterized by openness, collaboration, and efficiency. As we navigate the complex landscape of data-driven exploration, adopting FAIR principles, interoperability standards, and provenance tools becomes increasingly important.
[1] https://csrc.nist.gov/glossary/term/provenance#:~:text=Definitions%3A,%2C%20component%2C%20or%20associated%20data.
[2] https://www.seltzer.com/assets/publications/TribuoReproducibility.pdf
[3] https://aws.amazon.com/fr/what-is/hyperparameter-tuning/
[4] https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-017-0433-1/figures/3
[5] https://cs.brown.edu/people/nmeyrowi/LiteraryMachinesChapter2.pdf
[6] https://www.usenix.org/legacy/events/tapp11/tech/final_files/MackoSeltzer.pdf
[7] https://alexanderobenauer.com/labnotes/038/
[8] https://www.cs.cornell.edu/~jnfoster/papers/onward-provenance.pdf
[9] https://dita4practitioners.github.io/dita-specialization-tutorials/index.html
[10] https://apps.dtic.mil/sti/tr/pdf/ADA440535.pdf
[11] http://pauillac.inria.fr/~pilkiewi/papers/quotient-lenses.pdf
[12] https://web.archive.org/web/20151227175857/http://www.cis.upenn.edu/~bcpierce/papers/boomerang.pdf
[13] https://www.whitehouse.gov/ostp/news-updates/2023/01/11/fact-sheet-biden-harris-administration-announces-new-actions-to-advance-open-and-equitable-research/
[14] https://unesdoc.unesco.org/ark:/48223/pf0000383771_eng
[15] https://www.ox.ac.uk/news/2016-10-27-sir-tim-berners-lee-joins-oxfords-department-computer-science
[16] https://arxiv.org/pdf/2310.16787.pdf
[17] https://www.nature.com/articles/s44160-023-00473-6