This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.
In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.
In this project, we will collect research artifacts coming from various scientific conferences containing \dfile s, rebuild them periodically, and observe the variation in the resulting software environments.
This Python script\ \cite{ecg_code} takes as input a (verified) JSON representation of the Nickel artifact description, and then tries to build the \dfile\ contained in the artifact.
\item If the build is successful, gather information about the produced software environment (Sections \ref{sec:package_managers}, \ref{sec:git}, \ref{sec:misc}, and \ref{sec:pyenv})
First column is the name of the package, second is the version number given by the package manager, and third is the package manager. The actual outputs will also have a fourth column with the timestamp of when the package list was generated.
In this case, once the container built successfully, \ecg\ logs into the container and extracts the commit hash of the repository (via \texttt{git log}).
To be considered as a Git package, a package must have been downloaded using the \verb|git| command, and the repository's local directory should still have a \verb|.git| subdirectory. Otherwise, it should be considered as a \textit{misc} package, since the hash of the latest commit cannot be retrieved in that case (see below).
First column is the name of the package, second is the cryptographic hash of the latest commit in the current branch of the Git repo (used as version number), and third is the package source (Git). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.
In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.
First column is the name of the package, second is the cryptographic hash of the downloaded content (used as version number), and third is the package source (misc). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.
Even if \texttt{pip} is managed in the ``Package Managers'' section (Section \ref{sec:package_managers}), when authors use a virtual environment, \ecg\ needs to query this exact Python environment, and not the global one.
The gathering part of the \dfile s will be done right after the publication of the proceeding of a conference.
Contributors of the ``Data Curation'' phase will go through all the papers and their artifact to extract artifact containing \dfile s.
These \dfile s will then be captured with the Nickel description (see Section \ref{sec:nickel}).
To avoid mistake, at least two contributors will be assigned by paper.
If there is any difference in the Nickel description of an artifact, a discussion between the contributors will be initiated to conclude on the correct artifact description.
The second part of the analysis will be done after the first year of data collection, and will focus on the temporal evolution of properties of the artifacts.
\paragraph{Artifact Sources}
\begin{itemize}
\item Number/Proportion of artifacts that can be downloaded
\item Number/Proportion of artifacts which content has changed
\item Number/Proportion of \dfile s errors (\texttt{baseimage\_unavailable}, \texttt{job\_time\_execeed}, \texttt{unknown\_error}) for the failed builds