start writing protocol

Quentin Guilloteau 2024-08-01 17:49:41 +02:00
parent 114d0e5816
commit a24091b390
4 changed files with 329 additions and 0 deletions

1
.gitignore vendored

@@ -7,3 +7,4 @@ artifacts/json/*
pkglist.csv
log.txt
build_status.csv
*.pdf


@@ -29,6 +29,12 @@
]))
];
};
latex = pkgs.mkShell {
packages = with pkgs; [
texliveFull
rubber
];
};
};
});
}

259
protocol/protocol.tex Normal file

@@ -0,0 +1,259 @@
\documentclass{article}
\usepackage[a4paper, margin=20mm]{geometry}
\usepackage{hyperref}
\usepackage[
datamodel=software
]{biblatex}
\usepackage{software-biblatex}
\usepackage{todonotes}
\addbibresource{references.bib}
\usepackage{listings}
\lstset{
basicstyle=\small\ttfamily,
%columns=flexible,
frame = single,
breaklines=true
}
\newcommand{\noteqg}{\todo[backgroundcolor=blue!10,bordercolor=blue,inline,caption={}]}
%\usepackage{amssymb}
%\usepackage{booktabs}
%\usepackage{adjustbox}
\newcommand{\dfile}{\texttt{Dockerfile}}
\newcommand{\ecg}{\texttt{ecg.py}}
\newcommand{\eg}{\emph{e.g.,}}
\newcommand{\ie}{\emph{i.e.,}}
\title{Protocol: Study of the Longevity of \dfile s from Research Artifacts}
\begin{document}
\maketitle
\section{General Information}
\subsection{Title of the project}
\emph{Study of the Longevity of \dfile s from Research Artifacts}
\subsection{Current and Future Contributors}
\href{https://www.elsevier.com/researcher/author/policies-and-guidelines/credit-author-statement}{CRediT}
\begin{itemize}
\item \href{https://orcid.org/0009-0003-7645-5044}{Quentin \textsc{Guilloteau}}: Conceptualization, Methodology, Software, Data Curation, Supervision, Project administration
\item ...
\end{itemize}
\subsection{Description of the project}
This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.
In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.
In this project, we will collect research artifacts coming from various scientific conferences containing \dfile s.
Once collected, we will \emph{periodically} build these \dfile s \emph{without cache}.
If the build is successful, the produced software environment will be extracted (package names and versions).
If the build fails, we will categorize the reason for the failure.
This is too much detail for this section!
\subsection{Related work from contributors}
\cite{acmrep24}
\section{Architecture}
\subsection{Nickel}
We use the Nickel configuration language to guarantee the correctness of the descriptions of the artifacts.
This allows us to catch potential errors or inconsistencies as early as the Data Curation phase, even before trying to build the artifacts.
The definition of the schema is archived on Software Heritage \cite{nickel_schema}.
\noteqg{example \dfile\ and nickel}
\noindent\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption={\dfile}]
FROM ubuntu
RUN apt-get update && apt-get install -y X Y Z
RUN git clone https://github.com/foo/bar
RUN cd bar && make
\end{lstlisting}
\end{minipage}
\hfill
\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption={Nickel}]
{
version = "1.0",
artifact_url = "https://zenodo.org/record/XXXXXXX/files/code.tar.gz",
type = "tar",
doi = "XX.XXXX/XXXXXXX.XXXXXXX",
virtualization = "docker",
buildfile_dir = "path/to/dockerfile",
package_managers = [ "dpkg" ],
git_packages = [
{ name = "bar", location = "~/bar" }
],
misc_packages = [
],
}
\end{lstlisting}
\end{minipage}
\subsection{\ecg}
This Python script\ \cite{ecg_code} takes as input a JSON representation of the Nickel artifact description, and then tries to build the \dfile\ contained in the artifact.
\paragraph{Workflow}
\begin{enumerate}
\item Read the JSON description of the artifact
\item Download the artifact
\item Log the cryptographic hash of the downloaded artifact
\item Extract the artifact
\item Build the Docker image
\item If the build is successful, gather information about the produced software environment
\item If the build fails, gather information about the reason for the failure
\end{enumerate}
\noteqg{should probably be a flowgraph}
\subsubsection{Download of the Artifact}
The link to the artifact is the link provided by the authors in their Artifact Description.
\ecg\ will use this link to download the artifact.
If the download is successful, \ecg\ will check the cryptographic hash of the content.
Comparing this hash across executions also tells us whether the shared artifact remains available and unchanged over time, \ie\ how stable the artifact sharing is.
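
The hashing step can be sketched as follows.
This is a minimal illustration and not the code of \ecg: the function names \texttt{file\_sha256} and \texttt{artifact\_changed} are hypothetical, and SHA-256 is an assumption, as the protocol does not name the hash function.

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`.

    SHA-256 is an assumption: the protocol only says "cryptographic hash".
    The file is read in chunks so that large archives fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_changed(path, previous_hash):
    """Return True if the downloaded artifact differs from an earlier download."""
    return file_sha256(path) != previous_hash
```

Storing the digest at each execution is enough to detect, at the next execution, that the content behind the same link has changed.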
\subsubsection{Docker Build Statuses}
\ecg\ captures different types of statuses for the build attempt of a \dfile:
\begin{itemize}
\item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image) is not available
\item \texttt{time\_execeed}: the image did not finish building within \emph{1 hour}
\item \texttt{unknown\_error}: the build failed for an unknown/unclassified reason
\end{itemize}
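
The statuses above could be derived from a build attempt roughly as follows.
This is a hedged sketch, not the implementation of \ecg: the function names, the \texttt{success} label, and the error substrings used to detect a missing base image are assumptions made for illustration.

```python
import subprocess

BUILD_TIMEOUT_S = 3600  # the 1-hour limit stated in the protocol

# Assumption: substrings that Docker typically prints when the FROM image
# cannot be pulled. A real classifier may need a longer list.
BASE_IMAGE_HINTS = ("pull access denied", "manifest unknown")


def classify_build(returncode, output, timed_out):
    """Map the outcome of a build attempt to a status string."""
    if timed_out:
        return "time_execeed"
    if returncode == 0:
        return "success"  # hypothetical label, not listed in the protocol
    if any(hint in output for hint in BASE_IMAGE_HINTS):
        return "baseimage_unavailable"
    return "unknown_error"


def build_dockerfile(build_dir, tag):
    """Build the Dockerfile in `build_dir` without cache and classify the result."""
    try:
        proc = subprocess.run(
            ["docker", "build", "--no-cache", "-t", tag, build_dir],
            capture_output=True,
            text=True,
            timeout=BUILD_TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return classify_build(None, "", timed_out=True)
    return classify_build(proc.returncode, proc.stdout + proc.stderr, timed_out=False)
```

Keeping the classification in a pure function separate from the call to Docker makes it possible to test the status logic without building any image.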
\subsubsection{Information from the Package Manager}
Package managers can provide information about the installed packages: package name and package version.
\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pip}, \texttt{conda}
\paragraph{Example of Data}
Below is an example of data collected for the \texttt{gcc-8} package on an Ubuntu image:
\begin{lstlisting}
gcc-8,8.3.0-6,dpkg
\end{lstlisting}
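
As an illustration, rows in this format could be produced from the output of \texttt{dpkg-query -W}, which by default prints one tab-separated package name and version per line.
The function name \texttt{dpkg\_to\_csv} is hypothetical, and this sketch is not necessarily how \ecg\ queries the package manager.

```python
def dpkg_to_csv(dpkg_query_output):
    """Convert `dpkg-query -W` output into `name,version,dpkg` rows.

    Each input line is expected to be `<package>\t<version>`, which is the
    default output format of `dpkg-query -W`.
    """
    rows = []
    for line in dpkg_query_output.splitlines():
        if not line.strip():
            continue  # skip empty lines
        name, _, version = line.partition("\t")
        rows.append(f"{name},{version},dpkg")
    return rows
```

The trailing field records the source of the package, so that rows coming from different package managers can be stored in the same file.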
\subsubsection{Git repositories (\texttt{git})}
\dfile\ authors can also install packages from source.
One way to do this is via Git.
In this case, once the container is built successfully, \ecg\ logs into the container and extracts the commit hash of the repository (via \texttt{git log}).
\paragraph{Example of Data}
Below is an example of data collected for a Git repository called \texttt{ctf}:
\begin{lstlisting}
ctf,c3f95829628c381dc9bf631c69f08a7b17580b53,git
\end{lstlisting}
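
A possible way to obtain such a row is sketched below.
The helper names are hypothetical, and running \texttt{git log} through \texttt{docker exec} in a running container is an assumption about how one might ``log into'' the container; the location is passed unquoted to the shell so that a leading \texttt{\textasciitilde} expands, which assumes it comes from a trusted artifact description.

```python
import re

FULL_SHA1 = re.compile(r"[0-9a-f]{40}")


def git_hash_command(container, location):
    """Build a command that prints the HEAD commit hash of `location`.

    Assumes a running container named `container` that provides `sh` and `git`.
    """
    script = f"cd {location} && git log -n 1 --pretty=format:%H"
    return ["docker", "exec", container, "sh", "-c", script]


def format_git_row(name, git_log_output):
    """Validate the hash printed by `git log` and format a `name,hash,git` row."""
    commit = git_log_output.strip()
    if not FULL_SHA1.fullmatch(commit):
        raise ValueError(f"not a full commit hash: {commit!r}")
    return f"{name},{commit},git"
```

Validating the hash before logging it avoids recording an error message from \texttt{git} as if it were a commit.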
\subsubsection{Download content (\texttt{misc})}
In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.
\paragraph{Example of Data}
Below is an example of data collected for the download of the \texttt{Miniconda3} binary:
\begin{lstlisting}
Miniconda3-py37_4.12.0-Linux-x86_64,4dc4214839c60b2f5eb3efbdee1ef5d9b45e74f2c09fcae6c8934a13f36ffc3e,misc
\end{lstlisting}
\subsection{Snakemake}
\subsection{R}
\section{Data collection}
\subsection{Periodicity}
The workflow will be executed \emph{every month} for one year.
After one year, the workflow will be executed with increasing time intervals between executions.
\noteqg{TODO: A table/list of all the planned executions (dates)}
\section{Analysis}
One paragraph per plot
Any statistical tests?
\subsection{Static Analysis}
The first part of the analysis can be done statically from the description of the artifacts.
\begin{itemize}
\item Number/Proportion of \dfile s using particular package managers
\item Number/Proportion of \dfile s downloading from Git repositories
\item Number/Proportion of \dfile s downloading content from the internet
\end{itemize}
\subsection{Dynamic Analysis}
The second part of the analysis will be done after the first year of data collection, and will focus on the temporal evolution of properties of the artifacts.
\paragraph{Artifact Sources}
\begin{itemize}
\item Number/Proportion of artifacts that can be downloaded
\item Number/Proportion of artifacts whose content has changed
\end{itemize}
\paragraph{Build Status}
\begin{itemize}
\item Number/Proportion of \dfile s that build successfully
\item Number/Proportion of each error type (\texttt{baseimage\_unavailable}, \texttt{time\_execeed}, \texttt{unknown\_error}) among the failed builds
\end{itemize}
\paragraph{Software Environment}
\begin{itemize}
\item Number of installed packages per container
\item Number/Proportion of packages that changed version since last build
\item Package sources (package manager, Git, misc) whose packages change the most
\end{itemize}
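
To make the second metric concrete, the proportion of packages that changed version between two builds could be computed from two package lists in the \texttt{name,version,source} format shown earlier.
This is only a sketch of one possible definition: the function names are hypothetical, and restricting the comparison to packages present in both builds is an assumption.

```python
import csv
import io


def load_packages(csv_text):
    """Parse `name,version,source` rows into a {(name, source): version} dict."""
    packages = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # ignore empty or malformed rows
        name, version, source = row
        packages[(name, source)] = version
    return packages


def changed_proportion(previous, current):
    """Proportion of packages present in both builds whose version differs."""
    common = previous.keys() & current.keys()
    if not common:
        return 0.0
    changed = sum(1 for key in common if previous[key] != current[key])
    return changed / len(common)
```

Packages that appear or disappear between two builds are not counted here; they could be reported separately by comparing the key sets of the two dictionaries.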
\section{Other}
\subsection{Computational Environment}
The builds of the \dfile s will be executed on the French testbed Grid'5000 \cite{grid5000}.
\noteqg{TODO: Which cluster?}
The software environment will be managed by Nix \cite{dolstra2004nix}.
\noteqg{TODO: swh link to the shells in the repo}
\subsection{Environmental Cost}
\noteqg{TODO: do an estimation}
\subsection{Data and Source Code Availability}
The collected data will be stored on Zenodo.
The source code will be archived on Software Heritage.
\printbibliography
\end{document}

63
protocol/references.bib Normal file

@@ -0,0 +1,63 @@
@inproceedings{acmrep24,
TITLE = {{Longevity of Artifacts in Leading Parallel and Distributed Systems Conferences: a Review of the State of the Practice in 2023}},
AUTHOR = {Guilloteau, Quentin and Ciorba, Florina M and Poquet, Millian and Goepp, Dorian and Richard, Olivier},
URL = {https://hal.science/hal-04562691},
BOOKTITLE = {{REP 2024 - ACM Conference on Reproducibility and Replicability}},
ADDRESS = {Rennes, France},
ORGANIZATION = {{ACM}},
PAGES = {1-14},
YEAR = {2024},
MONTH = Jun,
DOI = {10.1145/3641525.3663631},
KEYWORDS = {Empirical studies Reproducibility ; Artifact Evaluation ; Badges ; Longevity},
PDF = {https://hal.science/hal-04562691/file/rep24_longevity_artifacts.pdf},
HAL_ID = {hal-04562691},
HAL_VERSION = {v1},
}
@inproceedings{grid5000,
title={Adding virtualization capabilities to the {Grid'5000} testbed},
author={Balouek, Daniel and Amarie, Alexandra Carpen and Charrier, Ghislain and Desprez, Fr{\'e}d{\'e}ric and Jeannot, Emmanuel and Jeanvoine, Emmanuel and L{\`e}bre, Adrien and Margery, David and Niclausse, Nicolas and Nussbaum, Lucas and others},
booktitle={Cloud Computing and Services Science: Second International Conference, CLOSER 2012, Porto, Portugal, April 18-21, 2012. Revised Selected Papers 2},
pages={3--20},
year={2013},
organization={Springer}
}
@inproceedings{dolstra2004nix,
title={Nix: A Safe and Policy-Free System for Software Deployment.},
author={Dolstra, Eelco and De Jonge, Merijn and Visser, Eelco and others},
booktitle={LISA},
volume={4},
pages={79--92},
year={2004}
}
@software{ecg,
title = {{ecg.py}},
author = {Guilloteau, Quentin and Waehren, Antoine},
date = {2024},
license = {GPL-3.0},
url = {https://forge.chapril.org/GuilloteauQ/study-docker-repro-longevity},
repository= {https://forge.chapril.org/GuilloteauQ/study-docker-repro-longevity},
}
@codefragment{nickel_schema,
subtitle = {Nickel Artifact Schema},
swhid = {
swh:1:cnt:bc691d4b5c87458bff498223185b2956d182a766;
origin=https://forge.chapril.org/GuilloteauQ/study-docker-repro-longevity;
visit=swh:1:snp:e6702465c82c96864a6d4a49d51789f398a8a336;
anchor=swh:1:rev:114d0e58163d6e01152c683bddb0a7b79269d4fe;
path=/workflow/nickel/artifact_contract.ncl
},
crossref = {ecg}
}
@codefragment{ecg_code,
subtitle = {{ecg.py} Source code},
swhid = {
swh:1:cnt:7c7a8497fed8f31052ef9492a99862678a06d6d4;origin=https://forge.chapril.org/GuilloteauQ/study-docker-repro-longevity;visit=swh:1:snp:e6702465c82c96864a6d4a49d51789f398a8a336;anchor=swh:1:rev:114d0e58163d6e01152c683bddb0a7b79269d4fe;path=/ecg.py
},
crossref = {ecg}
}