Verify the reproducibility of an experiment
Capturing provenance in Data Science/Machine Learning workflows

Hello everyone, my name is Jesse and I’m proud to be a fellow in the 2023 Summer of Reproducibility program, contributing to the noWorkflow project.
My proposal was accepted under the mentorship of João Felipe Pimentel and Juliana Freire. It aims to map and test the capture of provenance in typical Data Science and Machine Learning experiments.
What…
Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies and research centers. This challenge becomes even more pronounced with the emergence of analytics and AI, where scientific methodologies are extensively applied on an industrial scale. Reproducibility then plays a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.
How…
In day-to-day work, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. They become most visible when an individual or a team of scientists has to manage multiple experiments. In a typical experiment workflow, reproducibility concerns show up at several steps of the process:
- The need to track the provenance of datasets.
- The need to manage changes in hypothesis tests.
- The need to manage system hardware and OS setups.
- The need to deal with outputs from multiple experiments, including the results of various model trials.
In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debt that are difficult to address later.
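To make the four needs listed above a bit more concrete, here is a minimal hand-rolled sketch of what a provenance record for a single experiment run might contain. It is not how noWorkflow works and does not use its API; it relies only on Python's standard library. The function names (`fingerprint`, `build_record`), the record fields, and all the values passed in are assumptions made up for illustration.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies the exact dataset contents."""
    return hashlib.sha256(data).hexdigest()


def build_record(dataset: bytes, params: dict, results: dict) -> dict:
    """Gather dataset, hypothesis, environment and output details for one run."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # Dataset provenance: any change in the bytes changes the hash.
        "dataset_sha256": fingerprint(dataset),
        # Hypothesis/configuration being tested in this run.
        "params": params,
        # Hardware and OS setup the run was executed on.
        "environment": {
            "python": sys.version.split()[0],
            "os": platform.system(),
            "machine": platform.machine(),
        },
        # Outputs of this particular trial.
        "results": results,
    }


# Toy placeholder values, not real experiment data.
record = build_record(
    dataset=b"x,y\n1,2\n3,4\n",
    params={"model": "linear", "seed": 42},
    results={"rmse": 0.5},
)
print(json.dumps(record, indent=2))
```

Writing one such record per run by hand quickly becomes tedious and error-prone as the number of experiments grows, which is the kind of bookkeeping that automated provenance capture is meant to take over.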
Finally…
I believe this is a great opportunity to explore the intersection of two hot topics, AI and reproducibility! I will share more updates here throughout the summer and hope we can learn a lot together!
