study-docker-repro-longevity/protocol/protocol.tex

\documentclass{article}
\usepackage[a4paper, margin=20mm]{geometry}
\usepackage{hyperref}
\usepackage[
  datamodel=software
]{biblatex}
\usepackage{software-biblatex}
\usepackage{todonotes}
\addbibresource{references.bib}
\usepackage{listings}
\lstset{
basicstyle=\small\ttfamily,
%columns=flexible,
frame = single,
breaklines=true
}

\newcommand{\noteqg}{\todo[backgroundcolor=blue!10,bordercolor=blue,inline,caption={}]}

%\usepackage{amssymb}
\usepackage{booktabs}
%\usepackage{adjustbox}

\newcommand{\dfile}{\texttt{Dockerfile}}
\newcommand{\ecg}{\texttt{ecg.py}}
\newcommand{\eg}{\emph{e.g.,}}
\newcommand{\ie}{\emph{i.e.,}}

\title{Protocol: Study of the Longevity of \dfile s from Research Artifacts}

\begin{document}
\maketitle

\section{General Information}

\subsection{Title of the project}

\emph{Study of the Longevity of \dfile s from Research Artifacts}

\subsection{Current and Future Contributors}

\href{https://www.elsevier.com/researcher/author/policies-and-guidelines/credit-author-statement}{CRediT}

\begin{itemize}
  \item \href{https://orcid.org/0009-0003-7645-5044}{Quentin \textsc{Guilloteau}}: Conceptualization, Methodology, Software, Data Curation, Supervision, Project administration
  \item Antoine Waehren
\end{itemize}


\subsection{Description of the project}

This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.
In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.
In this project, we will collect research artifacts coming from various scientific conferences containing \dfile s, rebuild them periodically, and observe the variation in the resulting software environments.
We will also catch any error that could occur during the building of the image.

\subsection{Related work from contributors}

\cite{acmrep24}

\section{Architecture}

\subsection{Nickel}\label{sec:nickel}

We use the Nickel configuration language to guarantee the correctness of the descriptions of the artifacts.
This allows us to catch potential errors or incoherencies, from the Data Curation phase, even before trying to build the artifacts.
The definition of the schema is archived on Software Heritage \cite{nickel_schema}.

\noindent\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption=\dfile]{Name}
  FROM ubuntu
  RUN apt-get update && apt-get install X Y Z
  RUN git clone https://github.com/foo/bar
  RUN cd bar; make
\end{lstlisting}
\end{minipage}
\hfill
\begin{minipage}{.49\textwidth}
\begin{lstlisting}[caption=Nickel]{Name}
{
  version = "1.0",
  artifact_url = "https://zenodo.org/record/XXXXXXX/files/code.tar.gz",
  type = "tar",
  doi = "XX.XXXX/XXXXXXX.XXXXXXX",
  virtualization = "docker",
  buildfile_dir = "path/to/dockerfile",
  package_managers = [ "dpkg" ],
  git_packages = [
    { name = "bar", location = "~/bar" }
  ],
}
\end{lstlisting}
\end{minipage}

\subsection{\ecg}

This Python script\ \cite{ecg_code} takes as input a (verified) JSON representation of the Nickel artifact description, and then tries to build the \dfile\ contained in the artifact.

\paragraph{Workflow}

\begin{enumerate}
\item Read the JSON description of the artifact
\item Download the artifact (Section \ref{sec:download})
\item Log the cryptographic hash of the downloaded artifact (Section \ref{sec:download})
\item Extract the artifact
\item Build the Docker image (Section \ref{sec:docker_build})
\item If the build is successful, gather information about the produced software environment (Sections \ref{sec:package_managers}, \ref{sec:git}, \ref{sec:misc}, and \ref{sec:pyenv})
\item If the build failed, gather information about the reason of the failure
\end{enumerate}

\noteqg{should probably be a flowgraph}

\subsubsection{Download of the Artifact}\label{sec:download}

The link to the to artifact is the link provided by the authors in their Artifact Description.
\ecg\ will use this link to download the artifact.
If the download is successful, \ecg\ will log the cryptographic hash of the content.
This allows us to also have information about the stability/longevity of the artifact sharing.

\subsubsection{Docker Build Statuses}\label{sec:docker_build}

\ecg\ captures different types of statuses for the build attempt of a \dfile:

\begin{itemize}
  \item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image) is not available.
  \item \texttt{job\_time\_exceeded}: when running on a batch system such as OAR, this error indicates that the \dfile\ did not build under \emph{1 hour}
  \item \texttt{success}: the \dfile\ has been built successfully
  \item \texttt{package_install_failed}: a command requested the installation of a package that failed
  \item \texttt{artifact_unavailable}: the artifact could not be downloaded
  \item \texttt{dockerfile_not_found}: no \dfile\ has been found in the location specified in the configuration file
  \item \texttt{script_crash}: an error has occurred with the script itself
  \item \texttt{unknown_error}: the \dfile\ could not be built for an unknown reason
\end{itemize}

\subsubsection{Information from the Package Manager}\label{sec:package_managers}

Package Managers can provide information about the packages installed: package name and package version.

\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pacman}, \texttt{pip}, \texttt{conda}

\paragraph{Example of Data}

Below is an example of data collected for the \texttt{gcc-8} package on a Ubuntu image:

\begin{lstlisting}
gcc-8,8.3.0-6,dpkg
\end{lstlisting}

First column is the name of the package, second is the version number given by the package manager, and third is the package manager. The actual outputs will also have a fourth column with the timestamp of when the package list was generated.

\subsubsection{Git repositories (\texttt{git})}\label{sec:git}

\dfile\ authors can also install packages from source.
One way to do this is via Git.
In this case, once the container built successfully, \ecg\ logs into the container and extracts the commit hash of the repository (via \texttt{git log}).
To be considered as a Git package, a package must have been downloaded using the \verb|git| command, and the repository's local directory should still have a \verb|.git| subdirectory. Otherwise, it should be considered as a \textit{misc} package, since the hash of the latest commit cannot be retrieved in that case (see below).

\paragraph{Example of Data}

Below is an example of data collected for a Git repository called \texttt{ctf}:

\begin{lstlisting}
ctf,c3f95829628c381dc9bf631c69f08a7b17580b53,git
\end{lstlisting}

First column is the name of the package, second is the cryptographic hash of the latest commit in the current branch of the Git repo (used as version number), and third is the package source (Git). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.

\subsubsection{Downloaded content (\texttt{misc})}\label{sec:misc}

In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.

\paragraph{Example of Data}

Below is an example of data collected for the downloading of the \texttt{Miniconda3} binary:

\begin{lstlisting}
Miniconda3-py37_4.12.0-Linux-x86_64,4dc4214839c60b2f5eb3efbdee1ef5d9b45e74f2c09fcae6c8934a13f36ffc3e,misc
\end{lstlisting}

First column is the name of the package, second is the cryptographic hash of the downloaded content (used as version number), and third is the package source (misc). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.

\subsubsection{Python Virtual Environment (\texttt{pyenv})}\label{sec:pyenv}

Even if \texttt{pip} is managed in the ``Package Managers'' section (Section \ref{sec:package_managers}), when authors use a virtual environment, \ecg\ needs to query this exact Python environment, and not the global one.

\subsection{Snakemake}

\subsection{R}

\section{Data collection}

\subsection{Has the data collection started?}

No.

\subsection{Considered Conferences}

\begin{table}
  \centering
  \begin{tabular}{lrr}
    \toprule
    Conference Name     & Submission Date & Proceedings Publication Date\\
    \midrule
    EuroPar 2024        & March 2024      & September 2024\\
    SuperComputing 2024 & April 2024      & November 2024 \\
    \bottomrule
\end{tabular}
  \caption{Considered Conferences and associated important dates}
  \label{tab:conferences}
\end{table}

Table \ref{tab:conferences} summarizes the considered conferences and their important dates.

\noteqg{todo}

\subsection{Gathering of \dfile s}

The gathering part of the \dfile s will be done right after the publication of the proceeding of a conference.
Contributors of the ``Data Curation'' phase will go through all the papers and their artifact to extract artifact containing \dfile s.
These \dfile s will then be captured with the Nickel description (see Section \ref{sec:nickel}).
To avoid mistake, at least two contributors will be assigned by paper.
If there is any difference in the Nickel description of an artifact, a discussion between the contributors will be initiated to conclude on the correct artifact description.

\noteqg{can we do this in the workflow?}

\subsection{Building Periodicity}

The building workflow will be executed \emph{every month} for one year.
After one year, the workflow will be executed with increasing time intervals between execution.

\noteqg{TODO: A table/list/gantt chart of all the planned executions (dates)}

\section{Analysis}

One paragraph per plot

Any statistical tests?

\subsection{Static Analysis}

The first part of the analysis can be done statically from the description of the artifacts.

\begin{itemize}
\item Number/Proportion of \dfile s using particular package managers
\item Number/Proportion of \dfile s downloading content from Git repositories
\item Number/Proportion of \dfile s downloading content from internet
\end{itemize}

\subsection{Dynamic Analysis}

The second part of the analysis will be done after the first year of data collection, and will focus on the temporal evolution of properties of the artifacts.

\paragraph{Artifact Sources}

\begin{itemize}
\item Number/Proportion of artifacts that can be downloaded
\item Number/Proportion of artifacts which content has changed
\end{itemize}

\paragraph{Build Status}

\begin{itemize}
\item Number/Proportion of \dfile s that build successfully
\item Number/Proportion of \dfile s errors (\texttt{baseimage\_unavailable}, \texttt{job\_time\_execeed}, \texttt{unknown\_error}) for the failed builds
\end{itemize}

\paragraph{Software Environment}

\begin{itemize}
\item Number of installed packages per container
\item Number/Proportion of packages that changed version since last build
\item Package sources (package manager, Git, misc) from where packages are changing the most
\end{itemize}

\section{Other}

\subsection{Computational Environment}

The builds of the \dfile s will be executed on the french testbed Grid'5000 \cite{grid5000}.

\noteqg{TODO: Which cluster?}

The software environment will be managed by Nix \cite{dolstra2004nix}.

\noteqg{TODO: swh link to the shells in the repo}

\subsection{Environmental Cost}

\noteqg{TODO: do an estimation}

\subsection{Data and Source Code Long-term Availability}

The collected data will be stored on Zenodo.
The Source code will be archived on Software-Heritage.

\printbibliography

\end{document}
start writing protocol 2024-08-01 17:49:41 +02:00			`\documentclass{article}`
			`\usepackage[a4paper, margin=20mm]{geometry}`
			`\usepackage{hyperref}`
			`\usepackage[`
			`datamodel=software`
			`]{biblatex}`
			`\usepackage{software-biblatex}`
			`\usepackage{todonotes}`
			`\addbibresource{references.bib}`
			`\usepackage{listings}`
			`\lstset{`
			`basicstyle=\small\ttfamily,`
			`%columns=flexible,`
			`frame = single,`
			`breaklines=true`
			`}`

Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\newcommand{\noteqg}{\todo[backgroundcolor=blue!10,bordercolor=blue,inline,caption={}]}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`%\usepackage{amssymb}`
minor improvement to protocol 2024-08-02 17:49:55 +02:00			`\usepackage{booktabs}`
start writing protocol 2024-08-01 17:49:41 +02:00			`%\usepackage{adjustbox}`

			`\newcommand{\dfile}{\texttt{Dockerfile}}`
			`\newcommand{\ecg}{\texttt{ecg.py}}`
			`\newcommand{\eg}{\emph{e.g.,}}`
			`\newcommand{\ie}{\emph{i.e.,}}`

			`\title{Protocol: Study of the Longevity of \dfile s from Research Artifacts}`

			`\begin{document}`
			`\maketitle`

			`\section{General Information}`

			`\subsection{Title of the project}`

			`\emph{Study of the Longevity of \dfile s from Research Artifacts}`

			`\subsection{Current and Future Contributors}`

			`\href{https://www.elsevier.com/researcher/author/policies-and-guidelines/credit-author-statement}{CRediT}`

			`\begin{itemize}`
			`\item \href{https://orcid.org/0009-0003-7645-5044}{Quentin \textsc{Guilloteau}}: Conceptualization, Methodology, Software, Data Curation, Supervision, Project administration`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\item Antoine Waehren`
start writing protocol 2024-08-01 17:49:41 +02:00			`\end{itemize}`


			`\subsection{Description of the project}`

			`This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.`
			`In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.`
Continue writing the protocol 2024-08-02 11:04:30 +02:00			`In this project, we will collect research artifacts coming from various scientific conferences containing \dfile s, rebuild them periodically, and observe the variation in the resulting software environments.`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`We will also catch any error that could occur during the building of the image.`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\subsection{Related work from contributors}`

			`\cite{acmrep24}`

			`\section{Architecture}`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\subsection{Nickel}\label{sec:nickel}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`We use the Nickel configuration language to guarantee the correctness of the descriptions of the artifacts.`
			`This allows us to catch potential errors or incoherencies, from the Data Curation phase, even before trying to build the artifacts.`
			`The definition of the schema is archived on Software Heritage \cite{nickel_schema}.`

			`\noindent\begin{minipage}{.49\textwidth}`
			`\begin{lstlisting}[caption=\dfile]{Name}`
			`FROM ubuntu`
			`RUN apt-get update && apt-get install X Y Z`
			`RUN git clone https://github.com/foo/bar`
			`RUN cd bar; make`
			`\end{lstlisting}`
			`\end{minipage}`
			`\hfill`
			`\begin{minipage}{.49\textwidth}`
			`\begin{lstlisting}[caption=Nickel]{Name}`
			`{`
			`version = "1.0",`
			`artifact_url = "https://zenodo.org/record/XXXXXXX/files/code.tar.gz",`
			`type = "tar",`
			`doi = "XX.XXXX/XXXXXXX.XXXXXXX",`
			`virtualization = "docker",`
			`buildfile_dir = "path/to/dockerfile",`
			`package_managers = [ "dpkg" ],`
			`git_packages = [`
			`{ name = "bar", location = "~/bar" }`
			`],`
			`}`
			`\end{lstlisting}`
			`\end{minipage}`

			`\subsection{\ecg}`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`This Python script\ \cite{ecg_code} takes as input a (verified) JSON representation of the Nickel artifact description, and then tries to build the \dfile\ contained in the artifact.`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\paragraph{Workflow}`

			`\begin{enumerate}`
			`\item Read the JSON description of the artifact`
Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\item Download the artifact (Section \ref{sec:download})`
			`\item Log the cryptographic hash of the downloaded artifact (Section \ref{sec:download})`
start writing protocol 2024-08-01 17:49:41 +02:00			`\item Extract the artifact`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\item Build the Docker image (Section \ref{sec:docker_build})`
minor improvement to protocol 2024-08-02 17:49:55 +02:00			`\item If the build is successful, gather information about the produced software environment (Sections \ref{sec:package_managers}, \ref{sec:git}, \ref{sec:misc}, and \ref{sec:pyenv})`
start writing protocol 2024-08-01 17:49:41 +02:00			`\item If the build failed, gather information about the reason of the failure`
			`\end{enumerate}`

			`\noteqg{should probably be a flowgraph}`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\subsubsection{Download of the Artifact}\label{sec:download}`
start writing protocol 2024-08-01 17:49:41 +02:00
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`The link to the to artifact is the link provided by the authors in their Artifact Description.`
start writing protocol 2024-08-01 17:49:41 +02:00			`\ecg\ will use this link to download the artifact.`
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`If the download is successful, \ecg\ will log the cryptographic hash of the content.`
start writing protocol 2024-08-01 17:49:41 +02:00			`This allows us to also have information about the stability/longevity of the artifact sharing.`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\subsubsection{Docker Build Statuses}\label{sec:docker_build}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\ecg\ captures different types of statuses for the build attempt of a \dfile:`

			`\begin{itemize}`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image) is not available.`
			`\item \texttt{job\_time\_exceeded}: when running on a batch system such as OAR, this error indicates that the \dfile\ did not build under \emph{1 hour}`
			`\item \texttt{success}: the \dfile\ has been built successfully`
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`\item \texttt{package_install_failed}: a command requested the installation of a package that failed`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\item \texttt{artifact_unavailable}: the artifact could not be downloaded`
			`\item \texttt{dockerfile_not_found}: no \dfile\ has been found in the location specified in the configuration file`
			`\item \texttt{script_crash}: an error has occurred with the script itself`
			`\item \texttt{unknown_error}: the \dfile\ could not be built for an unknown reason`
start writing protocol 2024-08-01 17:49:41 +02:00			`\end{itemize}`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\subsubsection{Information from the Package Manager}\label{sec:package_managers}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`Package Managers can provide information about the packages installed: package name and package version.`

Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pacman}, \texttt{pip}, \texttt{conda}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\paragraph{Example of Data}`

			`Below is an example of data collected for the \texttt{gcc-8} package on a Ubuntu image:`

			`\begin{lstlisting}`
			`gcc-8,8.3.0-6,dpkg`
			`\end{lstlisting}`

Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`First column is the name of the package, second is the version number given by the package manager, and third is the package manager. The actual outputs will also have a fourth column with the timestamp of when the package list was generated.`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\subsubsection{Git repositories (\texttt{git})}\label{sec:git}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\dfile\ authors can also install packages from source.`
			`One way to do this is via Git.`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`In this case, once the container built successfully, \ecg\ logs into the container and extracts the commit hash of the repository (via \texttt{git log}).`
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`To be considered as a Git package, a package must have been downloaded using the \verb\|git\| command, and the repository's local directory should still have a \verb\|.git\| subdirectory. Otherwise, it should be considered as a \textit{misc} package, since the hash of the latest commit cannot be retrieved in that case (see below).`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\paragraph{Example of Data}`

			`Below is an example of data collected for a Git repository called \texttt{ctf}:`

			`\begin{lstlisting}`
			`ctf,c3f95829628c381dc9bf631c69f08a7b17580b53,git`
			`\end{lstlisting}`

Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`First column is the name of the package, second is the cryptographic hash of the latest commit in the current branch of the Git repo (used as version number), and third is the package source (Git). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.`

			`\subsubsection{Downloaded content (\texttt{misc})}\label{sec:misc}`
start writing protocol 2024-08-01 17:49:41 +02:00
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\paragraph{Example of Data}`

			`Below is an example of data collected for the downloading of the \texttt{Miniconda3} binary:`

			`\begin{lstlisting}`
			`Miniconda3-py37_4.12.0-Linux-x86_64,4dc4214839c60b2f5eb3efbdee1ef5d9b45e74f2c09fcae6c8934a13f36ffc3e,misc`
			`\end{lstlisting}`

Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`First column is the name of the package, second is the cryptographic hash of the downloaded content (used as version number), and third is the package source (misc). The actual outputs will also have a fourth column with the timestamp of when the package list was generated.`

minor improvement to protocol 2024-08-02 17:49:55 +02:00			`\subsubsection{Python Virtual Environment (\texttt{pyenv})}\label{sec:pyenv}`

			Even if \texttt{pip} is managed in the ``Package Managers'' section (Section \ref{sec:package_managers}), when authors use a virtual environment, \ecg\ needs to query this exact Python environment, and not the global one.

start writing protocol 2024-08-01 17:49:41 +02:00			`\subsection{Snakemake}`

			`\subsection{R}`

			`\section{Data collection}`

minor improvement to protocol 2024-08-02 17:49:55 +02:00			`\subsection{Has the data collection started?}`

			`No.`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\subsection{Considered Conferences}`

minor improvement to protocol 2024-08-02 17:49:55 +02:00			`\begin{table}`
			`\centering`
			`\begin{tabular}{lrr}`
			`\toprule`
			`Conference Name & Submission Date & Proceedings Publication Date\\`
			`\midrule`
			`EuroPar 2024 & March 2024 & September 2024\\`
			`SuperComputing 2024 & April 2024 & November 2024 \\`
			`\bottomrule`
			`\end{tabular}`
			`\caption{Considered Conferences and associated important dates}`
			`\label{tab:conferences}`
			`\end{table}`

			`Table \ref{tab:conferences} summarizes the considered conferences and their important dates.`
Continue writing the protocol 2024-08-02 11:04:30 +02:00
			`\noteqg{todo}`

			`\subsection{Gathering of \dfile s}`

			`The gathering part of the \dfile s will be done right after the publication of the proceeding of a conference.`
			Contributors of the ``Data Curation'' phase will go through all the papers and their artifact to extract artifact containing \dfile s.
			`These \dfile s will then be captured with the Nickel description (see Section \ref{sec:nickel}).`
			`To avoid mistake, at least two contributors will be assigned by paper.`
			`If there is any difference in the Nickel description of an artifact, a discussion between the contributors will be initiated to conclude on the correct artifact description.`

			`\noteqg{can we do this in the workflow?}`

			`\subsection{Building Periodicity}`
start writing protocol 2024-08-01 17:49:41 +02:00
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`The building workflow will be executed \emph{every month} for one year.`
start writing protocol 2024-08-01 17:49:41 +02:00			`After one year, the workflow will be executed with increasing time intervals between execution.`

Continue writing the protocol 2024-08-02 11:04:30 +02:00			`\noteqg{TODO: A table/list/gantt chart of all the planned executions (dates)}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`\section{Analysis}`

			`One paragraph per plot`

			`Any statistical tests?`

			`\subsection{Static Analysis}`

			`The first part of the analysis can be done statically from the description of the artifacts.`

			`\begin{itemize}`
			`\item Number/Proportion of \dfile s using particular package managers`
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`\item Number/Proportion of \dfile s downloading content from Git repositories`
			`\item Number/Proportion of \dfile s downloading content from internet`
start writing protocol 2024-08-01 17:49:41 +02:00			`\end{itemize}`

			`\subsection{Dynamic Analysis}`

			`The second part of the analysis will be done after the first year of data collection, and will focus on the temporal evolution of properties of the artifacts.`

			`\paragraph{Artifact Sources}`

			`\begin{itemize}`
			`\item Number/Proportion of artifacts that can be downloaded`
			`\item Number/Proportion of artifacts which content has changed`
			`\end{itemize}`

			`\paragraph{Build Status}`

			`\begin{itemize}`
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`\item Number/Proportion of \dfile s that build successfully`
Removed the headers from the output of the analysis, because they will have to be combined. So I made the columns deterministic. Added the supported package sources to the doc. Updated the protocol with the information from the doc + typos. 2024-08-07 19:51:21 +02:00			`\item Number/Proportion of \dfile s errors (\texttt{baseimage\_unavailable}, \texttt{job\_time\_execeed}, \texttt{unknown\_error}) for the failed builds`
start writing protocol 2024-08-01 17:49:41 +02:00			`\end{itemize}`

			`\paragraph{Software Environment}`

			`\begin{itemize}`
			`\item Number of installed packages per container`
			`\item Number/Proportion of packages that changed version since last build`
Updated protocol with some details and current state of the workflow. 2024-08-26 15:12:08 +02:00			`\item Package sources (package manager, Git, misc) from where packages are changing the most`
start writing protocol 2024-08-01 17:49:41 +02:00			`\end{itemize}`

			`\section{Other}`

			`\subsection{Computational Environment}`

			`The builds of the \dfile s will be executed on the french testbed Grid'5000 \cite{grid5000}.`

			`\noteqg{TODO: Which cluster?}`

			`The software environment will be managed by Nix \cite{dolstra2004nix}.`

			`\noteqg{TODO: swh link to the shells in the repo}`

			`\subsection{Environmental Cost}`

			`\noteqg{TODO: do an estimation}`

minor improvement to protocol 2024-08-02 17:49:55 +02:00			`\subsection{Data and Source Code Long-term Availability}`
start writing protocol 2024-08-01 17:49:41 +02:00
			`The collected data will be stored on Zenodo.`
			`The Source code will be archived on Software-Heritage.`

			`\printbibliography`

			`\end{document}`