\documentclass{article}
\usepackage[a4paper, margin=20mm]{geometry}
\usepackage{hyperref}
\usepackage[
datamodel=software
]{biblatex}
\usepackage{software-biblatex}
\usepackage{todonotes}
\addbibresource{references.bib}
\usepackage{listings}
\lstset{
basicstyle=\small\ttfamily,
%columns=flexible,
frame = single,
breaklines=true
}
\newcommand{\noteqg}{\todo[backgroundcolor=blue!10,bordercolor=blue,inline,caption={}]}
%\usepackage{amssymb}
\usepackage{booktabs}
%\usepackage{adjustbox}
\newcommand{\dfile}{\texttt{Dockerfile}}
\newcommand{\ecg}{\texttt{ecg.py}}
\newcommand{\eg}{\emph{e.g.,}}
\newcommand{\ie}{\emph{i.e.,}}
\title{Protocol: Study of the Longevity of \dfile s from Research Artifacts}
\begin{document}
\maketitle
\section{General Information}
\subsection{Title of the project}
\emph{Study of the Longevity of \dfile s from Research Artifacts}
\subsection{Current and Future Contributors}
\href{https://www.elsevier.com/researcher/author/policies-and-guidelines/credit-author-statement}{CRediT}
\begin{itemize}
\item \href{https://orcid.org/0009-0003-7645-5044}{Quentin \textsc{Guilloteau}}: Conceptualization, Methodology, Software, Data Curation, Supervision, Project administration
\item Antoine Waehren
\end{itemize}
\subsection{Description of the project}
This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.
In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.
In this project, we will collect research artifacts coming from various scientific conferences containing \dfile s, rebuild them periodically, and observe the variation in the resulting software environments.
We will also record any error that occurs while building the images.
\subsection{Related work from contributors}
\cite{acmrep24}
\section{Architecture}
\subsection{Nickel}\label{sec:nickel}
We use the Nickel configuration language to guarantee the correctness of the descriptions of the artifacts.
This allows us to catch potential errors or inconsistencies introduced during the Data Curation phase, even before trying to build the artifacts.
The definition of the schema is archived on Software Heritage \cite{nickel_schema}.
\noindent\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption=\dfile]{Name}
FROM ubuntu
RUN apt-get update && apt-get install X Y Z
RUN git clone https://github.com/foo/bar
RUN cd bar; make
\end{lstlisting}
\end{minipage}
\hfill
\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption=Nickel]{Name}
{
version = "1.0",
artifact_url = "https://zenodo.org/record/XXXXXXX/files/code.tar.gz",
type = "tar",
doi = "XX.XXXX/XXXXXXX.XXXXXXX",
virtualization = "docker",
buildfile_dir = "path/to/dockerfile",
package_managers = [ "dpkg" ],
git_packages = [
{ name = "bar", location = "~/bar" }
],
}
\end{lstlisting}
\end{minipage}
\subsection{\ecg}
This Python script\ \cite{ecg_code} takes as input a (verified) JSON representation of the Nickel artifact description, and then tries to build the \dfile\ contained in the artifact.
\paragraph{Workflow}
\begin{enumerate}
\item Read the JSON description of the artifact
\item Download the artifact (Section \ref{sec:download})
\item Log the cryptographic hash of the downloaded artifact (Section \ref{sec:download})
\item Extract the artifact
\item Build the Docker image (Section \ref{sec:docker_build})
\item If the build is successful, gather information about the produced software environment (Sections \ref{sec:package_managers}, \ref{sec:git}, \ref{sec:misc}, and \ref{sec:pyenv})
\item If the build failed, gather information about the reason for the failure
\end{enumerate}
\noteqg{should probably be a flowgraph}
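As a rough illustration, the workflow above can be sketched in Python as follows. This is not the actual code of \ecg: the five callables passed to \texttt{run\_artifact} are hypothetical stand-ins for the download, extraction, build, inspection, and diagnosis steps, and the shape of the returned log is an assumption. Only the field names of the description and the status strings come from this protocol.

\begin{lstlisting}[language=Python]
def run_artifact(description, download, extract, build, inspect, diagnose):
    """Run one build attempt and return a small log dictionary.

    All five callables are hypothetical stand-ins, not functions of ecg.py.
    """
    log = {"status": None, "artifact_hash": None, "packages": []}
    try:
        # Steps 2-3: download the artifact and log its cryptographic hash.
        archive, log["artifact_hash"] = download(description["artifact_url"])
        if archive is None:
            log["status"] = "artifact_unavailable"
            return log
        # Steps 4-5: extract the artifact and build the Docker image.
        build_dir = extract(archive, description["buildfile_dir"])
        succeeded, output = build(build_dir)
        if succeeded:
            # Step 6: gather information about the software environment.
            log["status"] = "success"
            log["packages"] = inspect(description)
        else:
            # Step 7: gather information about the reason for the failure.
            log["status"] = diagnose(output)
    except Exception:
        # Any unexpected error raised by the script itself.
        log["status"] = "script_crash"
    return log
\end{lstlisting}

Passing the steps as arguments keeps the sketch independent of how each step is really implemented, and makes the control flow easy to exercise with dummy functions.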
\subsubsection{Download of the Artifact}\label{sec:download}
The link to the artifact is the one provided by the authors in their Artifact Description.
\ecg\ will use this link to download the artifact.
If the download is successful, \ecg\ will log the cryptographic hash of the content.
This also gives us information about the stability and longevity of the artifact sharing: a failed download or a changed hash shows that the shared artifact is no longer available or no longer identical.
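A minimal Python sketch of this step is given below. It assumes SHA-256 as the cryptographic hash (the protocol does not name the algorithm) and uses only the standard library. The function names are illustrative and are not taken from \ecg.

\begin{lstlisting}[language=Python]
import hashlib
import urllib.request

def file_sha256(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA-256 digest of the file at path."""
    digest = hashlib.sha256()  # assumption: SHA-256 is the hash in use
    with open(path, "rb") as stream:
        # Read in chunks so that large archives do not fill the memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def download_and_hash(url, destination):
    """Download url to destination; return its hash, or None on failure."""
    try:
        urllib.request.urlretrieve(url, destination)
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None
    return file_sha256(destination)
\end{lstlisting}

Returning \texttt{None} on a failed download lets the caller record the \texttt{artifact\_unavailable} status instead of a hash.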
\subsubsection{Docker Build Statuses}\label{sec:docker_build}
\ecg\ captures different types of statuses for the build attempt of a \dfile:
\begin{itemize}
\item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image) is not available.
\item \texttt{job\_time\_exceeded}: when running on a batch system such as OAR, this status indicates that the \dfile\ did not build within \emph{1 hour}.
\item \texttt{success}: the \dfile\ has been built successfully.
\item \texttt{package\_install\_failed}: a command requested the installation of a package, and that installation failed.
\item \texttt{artifact\_unavailable}: the artifact could not be downloaded.
\item \texttt{dockerfile\_not\_found}: no \dfile\ has been found in the location specified in the configuration file.
\item \texttt{script\_crash}: an error has occurred in the script itself.
\item \texttt{unknown\_error}: the \dfile\ could not be built for an unknown reason.
\end{itemize}
\subsubsection{Information from the Package Manager}\label{sec:package_managers}
Package managers can report the name and the version of every installed package.
\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pacman}, \texttt{pip}, \texttt{conda}
\paragraph{Example of Data}
Below is an example of data collected for the \texttt{gcc-8} package on an Ubuntu image:
\begin{lstlisting}
gcc-8,8.3.0-6,dpkg
\end{lstlisting}
The first column is the name of the package, the second is the version number given by the package manager, and the third is the package manager. The actual outputs will also have a fourth column with the timestamp of when the package list was generated.
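Such rows are straightforward to load for later analysis. The sketch below parses them with the \texttt{csv} module of the Python standard library. The \texttt{PackageRecord} type is illustrative, and the timestamp is kept as a plain string because its format is not specified in this protocol.

\begin{lstlisting}[language=Python]
import csv
from typing import NamedTuple

class PackageRecord(NamedTuple):
    name: str
    version: str
    source: str          # package manager, "git", or "misc"
    timestamp: str = ""  # assumption: format unspecified, kept as text

def parse_package_list(lines):
    """Parse rows of the form name,version,source[,timestamp]."""
    return [PackageRecord(*row) for row in csv.reader(lines) if row]
\end{lstlisting}

For instance, \texttt{parse\_package\_list(["gcc-8,8.3.0-6,dpkg"])} yields one record whose \texttt{version} field is \texttt{8.3.0-6}.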
\subsubsection{Git repositories (\texttt{git})}\label{sec:git}
\dfile\ authors can also install packages from source.
One way to do this is via Git.
In this case, once the image has been built successfully, \ecg\ enters the container and extracts the hash of the latest commit of the repository (via \texttt{git log}).
To be considered as a Git package, a package must have been downloaded using the \verb|git| command, and the repository's local directory should still have a \verb|.git| subdirectory. Otherwise, it should be considered as a \textit{misc} package, since the hash of the latest commit cannot be retrieved in that case (see below).
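One possible way to query the commit hash is sketched below in Python. It is an assumption about how such a query could be written, not the code of \ecg: the container name is hypothetical, and the repository path is assumed to be absolute, since no shell is involved to expand a leading \texttt{\textasciitilde}. The options used are standard ones of the Docker and Git command-line tools.

\begin{lstlisting}[language=Python]
import subprocess

def git_head_command(container, repo_dir):
    """Build a `docker exec` command printing the latest commit hash."""
    return ["docker", "exec", container,
            "git", "-C", repo_dir, "log", "-n", "1", "--pretty=format:%H"]

def git_head_hash(container, repo_dir):
    """Run the command; return the commit hash, or None on failure."""
    result = subprocess.run(git_head_command(container, repo_dir),
                            capture_output=True, text=True)
    if result.returncode != 0:
        # For example, the directory has no .git subdirectory.
        return None
    return result.stdout.strip()
\end{lstlisting}

A \texttt{None} result corresponds to the case described above where the package has to be treated as \textit{misc}.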
\paragraph{Example of Data}
Below is an example of data collected for a Git repository called \texttt{ctf}:
\begin{lstlisting}
ctf,c3f95829628c381dc9bf631c69f08a7b17580b53,git
\end{lstlisting}
The first column is the name of the package, the second is the cryptographic hash of the latest commit in the current branch of the Git repository (used as version number), and the third is the package source (Git). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.
\subsubsection{Downloaded content (\texttt{misc})}\label{sec:misc}
In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.
\paragraph{Example of Data}
Below is an example of data collected for the download of the \texttt{Miniconda3} binary:
\begin{lstlisting}
Miniconda3-py37_4.12.0-Linux-x86_64,4dc4214839c60b2f5eb3efbdee1ef5d9b45e74f2c09fcae6c8934a13f36ffc3e,misc
\end{lstlisting}
The first column is the name of the package, the second is the cryptographic hash of the downloaded content (used as version number), and the third is the package source (misc). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.
\subsubsection{Python Virtual Environment (\texttt{pyenv})}\label{sec:pyenv}
Although \texttt{pip} is already covered as a package manager (Section \ref{sec:package_managers}), when authors use a virtual environment, \ecg\ needs to query that exact Python environment, and not the global one.
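The following Python sketch shows one way to target a specific virtual environment: it calls the interpreter located inside the environment rather than the global one. The \texttt{bin/python} layout is the usual one on Linux, the function names are illustrative, and whether \ecg\ proceeds this way is an assumption.

\begin{lstlisting}[language=Python]
def venv_pip_command(venv_dir):
    """Command listing the packages of the environment at venv_dir."""
    # Using the interpreter of the environment selects its own packages.
    return [f"{venv_dir}/bin/python", "-m", "pip", "list", "--format=freeze"]

def parse_freeze(output):
    """Turn name==version lines into a {name: version} dictionary."""
    packages = {}
    for line in output.splitlines():
        name, separator, version = line.strip().partition("==")
        if separator:  # skip lines that are not of the form name==version
            packages[name] = version
    return packages
\end{lstlisting}

The command would have to be run inside the container, and its output passed to \texttt{parse\_freeze}.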
\subsection{Snakemake}
\subsection{R}
\section{Data collection}
\subsection{Has the data collection started?}
No.
\subsection{Considered Conferences}
\begin{table}
\centering
\begin{tabular}{lrr}
\toprule
Conference Name & Submission Date & Proceedings Publication Date\\
\midrule
EuroPar 2024 & March 2024 & September 2024\\
SuperComputing 2024 & April 2024 & November 2024 \\
\bottomrule
\end{tabular}
\caption{Considered Conferences and associated important dates}
\label{tab:conferences}
\end{table}
Table \ref{tab:conferences} summarizes the considered conferences and their important dates.
\noteqg{todo}
\subsection{Gathering of \dfile s}
The \dfile s will be gathered right after the publication of the proceedings of a conference.
Contributors of the ``Data Curation'' phase will go through all the papers and their artifacts to extract the artifacts containing \dfile s.
These \dfile s will then be captured with the Nickel description (see Section \ref{sec:nickel}).
To avoid mistakes, at least two contributors will be assigned to each paper.
If their Nickel descriptions of an artifact differ, the contributors will discuss the differences and agree on the correct artifact description.
\noteqg{can we do this in the workflow?}
\subsection{Building Periodicity}
The building workflow will be executed \emph{every month} for one year.
After one year, the workflow will be executed with increasing time intervals between executions.
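To make this schedule concrete, the Python sketch below generates candidate build dates. Several choices are assumptions that this protocol does not fix: the start date is supplied by the caller, the first year counts both its first and last month, and the later intervals double (2, 4, 8 months, and so on) for a configurable number of rounds.

\begin{lstlisting}[language=Python]
from datetime import date

def add_months(day, months):
    """Shift a date by whole months, clamping the day to 28 to stay valid."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, min(day.day, 28))

def planned_builds(start, extra_rounds=3):
    """Monthly builds for one year, then gaps of 2, 4, 8, ... months."""
    offsets = list(range(13))  # months 0 to 12 of the first year
    gap = 2                    # assumption: intervals double after year one
    for _ in range(extra_rounds):
        offsets.append(offsets[-1] + gap)
        gap *= 2
    return [add_months(start, offset) for offset in offsets]
\end{lstlisting}

With a hypothetical start on 1 October 2024, the monthly phase ends on 1 October 2025 and the next three builds fall on 1 December 2025, 1 April 2026, and 1 December 2026.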
\noteqg{TODO: A table/list/gantt chart of all the planned executions (dates)}
\section{Analysis}
One paragraph per plot
Any statistical tests?
\subsection{Static Analysis}
The first part of the analysis can be done statically from the description of the artifacts.
\begin{itemize}
\item Number/Proportion of \dfile s using particular package managers
\item Number/Proportion of \dfile s downloading content from Git repositories
\item Number/Proportion of \dfile s downloading content from the internet
\end{itemize}
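These counts can be derived directly from the JSON descriptions. The Python sketch below assumes that each description has been loaded as a dictionary with the \texttt{package\_managers} and \texttt{git\_packages} fields of the Nickel example. The name of the field listing other downloaded content does not appear in this protocol, so that count is left out, and the function name is illustrative.

\begin{lstlisting}[language=Python]
from collections import Counter

def static_summary(descriptions):
    """Count package managers and Git usage over description dicts."""
    managers = Counter()
    with_git = 0
    for description in descriptions:
        # A set avoids counting a manager twice for the same artifact.
        managers.update(set(description.get("package_managers", [])))
        if description.get("git_packages"):
            with_git += 1
    total = len(descriptions)
    return {
        "total": total,
        "package_managers": dict(managers),
        "git_proportion": with_git / total if total else 0.0,
    }
\end{lstlisting}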
\subsection{Dynamic Analysis}
The second part of the analysis will be done after the first year of data collection, and will focus on the temporal evolution of properties of the artifacts.
\paragraph{Artifact Sources}
\begin{itemize}
\item Number/Proportion of artifacts that can be downloaded
\item Number/Proportion of artifacts whose content has changed
\end{itemize}
\paragraph{Build Status}
\begin{itemize}
\item Number/Proportion of \dfile s that build successfully
\item Number/Proportion of each \dfile\ build error (\texttt{baseimage\_unavailable}, \texttt{job\_time\_exceeded}, \texttt{unknown\_error}) among the failed builds
\end{itemize}
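As an illustration, the proportions for one execution of the workflow could be computed as in the Python sketch below. It assumes that the statuses of all build attempts are available as a list of strings; the function names are illustrative.

\begin{lstlisting}[language=Python]
from collections import Counter

def status_proportions(statuses):
    """Map each build status to its share of all build attempts."""
    counts = Counter(statuses)
    return {status: count / len(statuses)
            for status, count in counts.items()}

def failure_breakdown(statuses):
    """Same proportions, restricted to the builds that did not succeed."""
    return status_proportions([s for s in statuses if s != "success"])
\end{lstlisting}

Computing the breakdown over failed builds only separates the question of how often builds fail from the question of why they fail.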
\paragraph{Software Environment}
\begin{itemize}
\item Number of installed packages per container
\item Number/Proportion of packages that changed version since last build
\item Package sources (package manager, Git, misc) whose packages change the most
\end{itemize}
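The comparison between two consecutive builds can be sketched in Python as follows. The sketch assumes that the package list of each build has been loaded into a dictionary mapping a (name, source) pair to a version; this representation and the function name are illustrative choices.

\begin{lstlisting}[language=Python]
def changed_packages(previous, current):
    """Compare two {(name, source): version} mappings.

    Returns the packages whose version changed (with old and new
    versions), the packages that appeared, and those that disappeared.
    """
    common = previous.keys() & current.keys()
    changed = {key: (previous[key], current[key])
               for key in common if previous[key] != current[key]}
    added = current.keys() - previous.keys()
    removed = previous.keys() - current.keys()
    return changed, added, removed
\end{lstlisting}

Keeping the source in the key makes it possible to group the changed packages by source afterwards.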
\section{Other}
\subsection{Computational Environment}
The builds of the \dfile s will be executed on the French testbed Grid'5000 \cite{grid5000}.
\noteqg{TODO: Which cluster?}
The software environment will be managed by Nix \cite{dolstra2004nix}.
\noteqg{TODO: swh link to the shells in the repo}
\subsection{Environmental Cost}
\noteqg{TODO: do an estimation}
\subsection{Data and Source Code Long-term Availability}
The collected data will be stored on Zenodo.
The source code will be archived on Software Heritage.
\printbibliography
\end{document}