
\documentclass{article}
\usepackage[a4paper, margin=20mm]{geometry}
\usepackage{hyperref}
\usepackage[
  datamodel=software
]{biblatex}
\usepackage{software-biblatex}
\usepackage{todonotes}
\addbibresource{references.bib}
\usepackage{listings}
\lstset{
basicstyle=\small\ttfamily,
%columns=flexible,
frame = single,
breaklines=true
}


\newcommand{\noteqg}{\todo[backgroundcolor=blue!10,bordercolor=blue,inline,caption={}]}

%\usepackage{amssymb}
%\usepackage{booktabs}
%\usepackage{adjustbox}

\newcommand{\dfile}{\texttt{Dockerfile}}
\newcommand{\ecg}{\texttt{ecg.py}}
\newcommand{\eg}{\emph{e.g.,}}
\newcommand{\ie}{\emph{i.e.,}}

\title{Protocol: Study of the Longevity of \dfile s from Research Artifacts}

\begin{document}
\maketitle

\section{General Information}

\subsection{Title of the project}

\emph{Study of the Longevity of \dfile s from Research Artifacts}

\subsection{Current and Future Contributors}

\href{https://www.elsevier.com/researcher/author/policies-and-guidelines/credit-author-statement}{CRediT}

\begin{itemize}
  \item \href{https://orcid.org/0009-0003-7645-5044}{Quentin \textsc{Guilloteau}}: Conceptualization, Methodology, Software, Data Curation, Supervision, Project administration
  \item ...
\end{itemize}


\subsection{Description of the project}

This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.
In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.
In this project, we will collect research artifacts that contain \dfile s from various scientific conferences, rebuild them periodically, and observe how the resulting software environments vary over time.

\subsection{Related work from contributors}

\cite{acmrep24}

\section{Architecture}

\subsection{Nickel}\label{sec:nickel}

We use the Nickel configuration language to check the correctness of the descriptions of the artifacts.
This allows us to catch potential errors or inconsistencies as early as the Data Curation phase, before any attempt to build the artifacts.
The definition of the schema is archived on Software Heritage \cite{nickel_schema}.

\noindent\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption=\dfile]{Name}
  FROM ubuntu
  RUN apt-get update && apt-get install -y X Y Z
  RUN git clone https://github.com/foo/bar
  RUN cd bar; make
\end{lstlisting}
\end{minipage}
\hfill
\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption=Nickel]{Name}
{
  version = "1.0",
  artifact_url = "https://zenodo.org/record/XXXXXXX/files/code.tar.gz",
  type = "tar",
  doi = "XX.XXXX/XXXXXXX.XXXXXXX",
  virtualization = "docker",
  buildfile_dir = "path/to/dockerfile",
  package_managers = [ "dpkg" ],
  git_packages = [
    { name = "bar", location = "~/bar" }
  ],
  misc_packages = [
  ],
}
\end{lstlisting}
\end{minipage}

\subsection{\ecg}

This Python script\ \cite{ecg_code} takes as input a (verified) JSON representation of the Nickel artifact description, and then tries to build the \dfile\ contained in the artifact.

\paragraph{Workflow}

\begin{enumerate}
\item Read the JSON description of the artifact
\item Download the artifact (Section \ref{sec:download})
\item Log the cryptographic hash of the downloaded artifact (Section \ref{sec:download})
\item Extract the artifact
\item Build the Docker image (Section \ref{sec:docker_build})
\item If the build is successful, gather information about the produced software environment (Sections \ref{sec:package_managers}, \ref{sec:git}, and \ref{sec:misc})
\item If the build failed, gather information about the reason for the failure
\end{enumerate}

\noteqg{should probably be a flowgraph}

\subsubsection{Download of the Artifact}\label{sec:download}

The link to the artifact is the link provided by the authors in their Artifact Description.
\ecg\ will use this link to download the artifact.
If the download is successful, \ecg\ will compute and log the cryptographic hash of the content.
Comparing this hash between executions also gives us information about the stability and longevity of the way the artifact is shared.
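
As an illustration, the download-and-hash step could be sketched as follows in Python.
This is a minimal sketch under stated assumptions, not the actual code of \ecg: the choice of SHA-256, the function names \texttt{hash\_file} and \texttt{download\_and\_hash}, and the chunk size are all hypothetical.

\begin{lstlisting}[language=Python]
import hashlib
import urllib.request


def hash_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`.

    SHA-256 is an assumption; the protocol only says "cryptographic hash".
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so that large artifacts need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_and_hash(url, dest):
    """Download `url` to `dest`; return its hash, or None on failure."""
    try:
        urllib.request.urlretrieve(url, dest)
    except OSError:
        # URLError and HTTPError are subclasses of OSError.
        return None
    return hash_file(dest)
\end{lstlisting}

Returning \texttt{None} on failure keeps the two observations separate: whether the artifact can still be downloaded, and whether its content has changed.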

\subsubsection{Docker Build Statuses}\label{sec:docker_build}

\ecg\ captures different types of statuses for the build attempt of a \dfile:

\begin{itemize}
\item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image) could not be retrieved
\item \texttt{time\_execeed}: the image did not build within \emph{1 hour}
\item \texttt{unknown\_error}: the build failed for an unknown/unclassified reason
\end{itemize}
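
The statuses above could be derived from a build attempt along the following lines.
This is only a sketch: the \texttt{success} label, the function names, and the error-message substrings used to detect a missing base image are assumptions made for illustration, and a real classifier would likely need a more robust detection.

\begin{lstlisting}[language=Python]
import subprocess

BUILD_TIMEOUT_S = 3600  # 1 hour, as stated in the protocol

# Assumed substrings of Docker error messages hinting at a missing base image.
BASEIMAGE_HINTS = ("manifest unknown", "pull access denied")


def classify_build(returncode, stderr, timed_out=False):
    """Map the outcome of a build attempt to a status label."""
    if timed_out:
        return "time_execeed"
    if returncode == 0:
        return "success"  # assumed label; the protocol lists failures only
    if any(hint in stderr for hint in BASEIMAGE_HINTS):
        return "baseimage_unavailable"
    return "unknown_error"


def build_image(build_dir, tag):
    """Run `docker build` in `build_dir` and return the status label."""
    try:
        proc = subprocess.run(
            ["docker", "build", "-t", tag, "."],
            cwd=build_dir, capture_output=True, text=True,
            timeout=BUILD_TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return classify_build(None, "", timed_out=True)
    return classify_build(proc.returncode, proc.stderr)
\end{lstlisting}

Keeping the classification in a pure function separate from the call to Docker makes it possible to test the mapping without running any build.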

\subsubsection{Information from the Package Manager}\label{sec:package_managers}

Package managers can report the name and the version of each installed package.

\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pip}, \texttt{conda}

\paragraph{Example of Data}

Below is an example of data collected for the \texttt{gcc-8} package on an Ubuntu image:

\begin{lstlisting}
gcc-8,8.3.0-6,dpkg
\end{lstlisting}
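
For \texttt{dpkg}, one possible way to produce such lines is to parse the output of \texttt{dpkg-query -W}, which prints one package per line as a name and a version separated by a tab.
The sketch below assumes this input format; the function name \texttt{dpkg\_rows} is hypothetical and this is not necessarily how \ecg\ proceeds.

\begin{lstlisting}[language=Python]
def dpkg_rows(listing):
    """Convert `dpkg-query -W` output (name<TAB>version) to CSV rows."""
    rows = []
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, version = line.partition("\t")
        rows.append(f"{name},{version},dpkg")
    return rows
\end{lstlisting}

The listing itself could be obtained by running \texttt{dpkg-query -W} in a container started from the built image.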

\subsubsection{Git repositories (\texttt{git})}\label{sec:git}

\dfile\ authors can also install packages from source.
One way to do this is via Git.
In this case, once the image has been built successfully, \ecg\ logs into the container and extracts the commit hash of the repository (via \texttt{git log}).

\paragraph{Example of Data}

Below is an example of data collected for a Git repository called \texttt{ctf}:

\begin{lstlisting}
ctf,c3f95829628c381dc9bf631c69f08a7b17580b53,git
\end{lstlisting}
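
A possible way to extract this hash is sketched below.
The function names are hypothetical, and the sketch assumes that \texttt{git} is installed in the image, that the image does not define an entrypoint that would intercept the command, that the repository location is an absolute path, and that the repository uses 40-character SHA-1 commit hashes.

\begin{lstlisting}[language=Python]
import re
import subprocess


def git_commit_command(image, location):
    """Command printing the hash of the commit checked out at `location`."""
    return ["docker", "run", "--rm", image,
            "git", "-C", location, "log", "-n", "1", "--format=%H"]


def git_row(name, output):
    """Validate the command output and format it as a CSV row."""
    commit = output.strip()
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError(f"unexpected git output: {output!r}")
    return f"{name},{commit},git"


def git_info(image, name, location):
    """Run the command in a fresh container and return the CSV row."""
    out = subprocess.run(git_commit_command(image, location),
                         capture_output=True, text=True, check=True).stdout
    return git_row(name, out)
\end{lstlisting}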

\subsubsection{Download content (\texttt{misc})}\label{sec:misc}

In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.

\paragraph{Example of Data}

Below is an example of data collected for the downloading of the \texttt{Miniconda3} binary:

\begin{lstlisting}
Miniconda3-py37_4.12.0-Linux-x86_64,4dc4214839c60b2f5eb3efbdee1ef5d9b45e74f2c09fcae6c8934a13f36ffc3e,misc
\end{lstlisting}

\subsection{Snakemake}

\subsection{R}

\section{Data collection}

\subsection{Considered Conferences}

\begin{itemize}
\item Conference Name, Submission Date, Proceedings Publication Date
\end{itemize}

\noteqg{todo}

\subsection{Gathering of \dfile s}

The gathering of the \dfile s will be done right after the publication of the proceedings of a conference.
Contributors of the ``Data Curation'' phase will go through all the papers and their artifacts to extract the artifacts containing \dfile s.
These \dfile s will then be captured with the Nickel description (see Section \ref{sec:nickel}).
To avoid mistakes, at least two contributors will be assigned to each paper.
If there is any difference between their Nickel descriptions of an artifact, a discussion between the contributors will be initiated to agree on the correct artifact description.
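
Detecting such differences could be automated on the JSON representations of the two descriptions.
The following is a minimal sketch, assuming each description has been loaded as a Python dictionary; the function name \texttt{differing\_fields} is hypothetical and only top-level fields are compared.

\begin{lstlisting}[language=Python]
def differing_fields(desc_a, desc_b):
    """Return the sorted top-level fields on which two descriptions disagree."""
    keys = set(desc_a) | set(desc_b)
    # Sentinel: a field absent on one side counts as a difference.
    missing = object()
    return sorted(k for k in keys
                  if desc_a.get(k, missing) != desc_b.get(k, missing))
\end{lstlisting}

An empty result means that both contributors produced the same description; otherwise the returned field names can serve as the starting point of the discussion.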

\noteqg{can we do this in the workflow?}

\subsection{Building Periodicity}

The building workflow will be executed \emph{every month} for one year.
After one year, the workflow will be executed with increasing time intervals between executions.
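
The shape of such a schedule can be sketched as follows.
Everything specific here is an assumption made for illustration: the start date is a parameter, one year is taken to mean twelve monthly builds, and the later gaps of 2, 4, and 8 months are hypothetical since the actual intervals have not been decided.

\begin{lstlisting}[language=Python]
from datetime import date


def add_months(d, months):
    """Shift `d` by `months` months, clamping the day to 28 to stay valid."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, min(d.day, 28))


def planned_builds(start, later_gaps=(2, 4, 8)):
    """Monthly builds for one year, then gaps of `later_gaps` months."""
    offsets = list(range(12))  # first year: months 0 to 11
    for gap in later_gaps:  # hypothetical increasing intervals
        offsets.append(offsets[-1] + gap)
    return [add_months(start, m) for m in offsets]
\end{lstlisting}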

\noteqg{TODO: A table/list/gantt chart of all the planned executions (dates)}

\section{Analysis}

One paragraph per plot

Any statistical tests?

\subsection{Static Analysis}

The first part of the analysis can be done statically from the description of the artifacts.

\begin{itemize}
\item Number/Proportion of \dfile s using particular package managers
\item Number/Proportion of \dfile s downloading from Git repositories
\item Number/Proportion of \dfile s downloading content from the internet
\end{itemize}
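
These quantities can be computed directly from the JSON representations of the artifact descriptions.
The sketch below assumes that each description is a dictionary with the fields \texttt{package\_managers}, \texttt{git\_packages}, and \texttt{misc\_packages} shown in Section \ref{sec:nickel}; the function names are hypothetical.

\begin{lstlisting}[language=Python]
from collections import Counter


def package_manager_proportions(descriptions):
    """Proportion of artifacts whose description lists each package manager."""
    counts = Counter()
    for desc in descriptions:
        # Count each package manager at most once per artifact.
        counts.update(set(desc.get("package_managers", [])))
    total = len(descriptions)
    return {pm: n / total for pm, n in counts.items()} if total else {}


def proportion_with(descriptions, field):
    """Proportion of artifacts with a non-empty list under `field`."""
    if not descriptions:
        return 0.0
    return sum(1 for d in descriptions if d.get(field)) / len(descriptions)
\end{lstlisting}

For instance, \texttt{proportion\_with(descriptions, "git\_packages")} gives the proportion of \dfile s that install at least one package from a Git repository.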

\subsection{Dynamic Analysis}

The second part of the analysis will be done after the first year of data collection, and will focus on the temporal evolution of properties of the artifacts.

\paragraph{Artifact Sources}

\begin{itemize}
\item Number/Proportion of artifacts that can be downloaded
\item Number/Proportion of artifacts whose content has changed
\end{itemize}

\paragraph{Build Status}

\begin{itemize}
\item Number/Proportion of \dfile s that build successfully
\item Number/Proportion of each type of error (\texttt{baseimage\_unavailable}, \texttt{time\_execeed}, \texttt{unknown\_error}) among the failed builds
\end{itemize}

\paragraph{Software Environment}

\begin{itemize}
\item Number of installed packages per container
\item Number/Proportion of packages that changed version since last build
\item Package sources (package manager, Git, Misc) whose packages change the most
\end{itemize}
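
The comparison between two consecutive builds could be sketched as follows, assuming that the data of each build is stored as \texttt{name,version,source} lines like the examples of Sections \ref{sec:package_managers}, \ref{sec:git}, and \ref{sec:misc}.
The function names are hypothetical, and only packages present in both builds are compared.

\begin{lstlisting}[language=Python]
import csv
import io


def parse_rows(text):
    """Parse `name,version,source` lines into {(name, source): version}."""
    rows = (row for row in csv.reader(io.StringIO(text)) if row)
    return {(name, source): version for name, version, source in rows}


def changed_packages(previous, current):
    """Packages present in both builds whose version (or hash) differs."""
    common = previous.keys() & current.keys()
    return sorted(key for key in common if previous[key] != current[key])


def changed_proportion(previous, current):
    """Proportion of the common packages that changed since the last build."""
    common = previous.keys() & current.keys()
    if not common:
        return 0.0
    return len(changed_packages(previous, current)) / len(common)
\end{lstlisting}

Grouping the result of \texttt{changed\_packages} by its second component (the source) would indicate which package sources change the most.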

\section{Other}

\subsection{Computational Environment}

The builds of the \dfile s will be executed on the French testbed Grid'5000 \cite{grid5000}.

\noteqg{TODO: Which cluster?}

The software environment will be managed by Nix \cite{dolstra2004nix}.

\noteqg{TODO: swh link to the shells in the repo}

\subsection{Environmental Cost}

\noteqg{TODO: do an estimation}

\subsection{Data and Source Code Availability}

The collected data will be stored on Zenodo.

The source code will be archived on Software Heritage.

\printbibliography

\end{document}