Removed the headers from the analysis output, because the output files will have to be combined, and made the columns deterministic so they can still be identified. Added the supported package sources to the doc. Updated the protocol with the information from the doc and fixed typos.
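
To illustrate why fixed columns make header-less output safe to combine, here is a minimal Python sketch. It is not the repository's code: `STATUSES`, `count_statuses`, `write_row`, `read_combined`, the sample rows, and the choice to fold unexpected statuses into `unknown_error` are all assumptions made for this example. Only the status names and the "third column is the result" convention come from the diff.

```python
import csv
import io

# Fixed column order, mirroring the build statuses listed in this commit.
STATUSES = [
    "success", "package_unavailable", "baseimage_unavailable",
    "artifact_unavailable", "dockerfile_not_found", "script_crash",
    "job_time_exceeded", "unknown_error",
]
FIELDNAMES = STATUSES + ["timestamp"]


def count_statuses(rows):
    """Count build statuses (third column) into a dict with a fixed key order."""
    counts = dict.fromkeys(STATUSES, 0)
    for row in rows:
        # Assumption for this sketch: unexpected statuses are folded into
        # "unknown_error" so that the number of columns never changes.
        status = row[2] if row[2] in counts else "unknown_error"
        counts[status] += 1
    return counts


def write_row(counts, timestamp, out):
    """Append one header-less CSV row: the counts followed by a timestamp."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writerow({**counts, "timestamp": timestamp})


def read_combined(inp):
    """Read concatenated header-less rows back using the known column order."""
    return list(csv.DictReader(inp, fieldnames=FIELDNAMES))


# Two runs appended to the same buffer, standing in for concatenated files.
# The first two columns of each sample row are placeholders.
buffer = io.StringIO()
run1 = [["a", "h1", "success"], ["b", "h2", "unknown_error"]]
run2 = [["a", "h1", "success"], ["b", "h2", "success"]]
write_row(count_statuses(run1), "2024-08-07", buffer)
write_row(count_statuses(run2), "2024-08-08", buffer)
buffer.seek(0)
combined = read_combined(buffer)
print([row["success"] for row in combined])
```

Because every row has the same columns in the same order, rows from different runs can simply be appended to one file and read back with the column names supplied by the reader instead of a header line.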

2024-08-07 19:51:21 +02:00
commit e2903ffac1
parent c816cbde2c
7 changed files with 77 additions and 35 deletions
--- a/README.md
+++ b/README.md
@@ -16,12 +16,21 @@ This repository contains configuration files for multiple artifacts from HPC con

 ### ECG

-ECG is a program that automates software environment checking for scientific artifacts that use Docker. It takes as input a JSON configuration telling where to download the artifact, where to find the Dockerfile to build in the artifact, and which package managers are used by the Docker container.
+ECG is a program that automates software environment checking for scientific artifacts that use Docker. It takes as input a JSON configuration telling where to download the artifact, where to find the Dockerfile to build in the artifact, and which package sources are used by the Docker container.

 It will then download the artifact, build the Dockerfile, and then create a list of the installed packages in the Docker container (if it was built successfully). It also stores the potential errors encountered when building the Dockerfile, and logs the hash of the artifact for future comparison.

 It is meant to be executed periodically to analyze variations in the software environment of the artifact through time.

+Supported package sources:
+- `dpkg`
+- `rpm`
+- `pacman`
+- `pip`
+- `conda`
+- `git`
+- `misc` *(miscellaneous packages are packages that have been installed outside of a package manager or VCS such as Git)*
+
 ### Analysis

 Multiple types of analysis are performed on the output of ECG to create tables that can later be plotted. The analyses done for this study are software environment, artifact, and build status analysis. Each type of analysis is done through a different script.
--- a/analysis/artifact_analysis.py
+++ b/analysis/artifact_analysis.py
@@ -159,7 +159,7 @@ def main():
    # Writing analysis to output file:
    output_file = open(output_path, "w+")
    dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
-    dict_writer.writeheader()
+    # dict_writer.writeheader()
    dict_writer.writerow(output_dict)
    output_file.close()

--- a/analysis/buildstatus_analysis.py
+++ b/analysis/buildstatus_analysis.py
@@ -25,12 +25,12 @@ def analysis(input_table):
    dict
        Output table of the analysis in the form of a dict with headers as keys.
    """
-    buildstatus = {}
+    # All build statuses, initialized to 0.
+    # This is required to make the columns of the result table deterministic,
+    # so they can be identified without the header in the CSV file.
+    buildstatus = {"success":0, "package_unavailable":0, "baseimage_unavailable":0, "artifact_unavailable":0, "dockerfile_not_found":0, "script_crash":0, "job_time_exceeded":0, "unknown_error":0}
    for row in input_table:
        # Third column is the result:
-        if row[2] not in buildstatus:
-            buildstatus[row[2]] = 1
-        else:
        buildstatus[row[2]] += 1
    return buildstatus

@@ -92,7 +92,7 @@ def main():
    # Writing analysis to output file:
    output_file = open(output_path, "w+")
    dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
-    dict_writer.writeheader()
+    # dict_writer.writeheader()
    dict_writer.writerow(output_dict)
    output_file.close()

--- a/analysis/softenv_analysis.py
+++ b/analysis/softenv_analysis.py
@@ -11,7 +11,12 @@ import csv
 import os
 import datetime

-def sources_stats(input_table):
+# All possible package sources, initialized to 0.
+# This is required to make the columns of the result table deterministic,
+# so they can be identified without the header in the CSV file.
+pkgsources = {"dpkg":0, "rpm":0, "pacman":0, "pip":0, "conda":0, "git":0, "misc":0}
+
+def sources_stats(input_table, pkgsources):
    """
    Analyzes the given package lists table to determine the number of artifacts
    using a package manager, Git packages or misc packages.
@@ -21,20 +26,23 @@ def sources_stats(input_table):
    input_table: str
        Table to analyse.

+    pkgsources: dict
+        A dictionary that contains all the possible package sources as keys,
+        with every value initialized to 0.
+
    Returns
    -------
    dict
        Output table of the analysis in the form of a dict with headers as keys.
    """
-    pkgmgr = {}
    i = 0
    for row in input_table:
        # Third column is the package source:
-        if row[2] not in pkgmgr:
-            pkgmgr[row[2]] = 1
+        if row[2] not in pkgsources:
+            pkgsources[row[2]] = 1
        else:
-            pkgmgr[row[2]] += 1
-    return pkgmgr
+            pkgsources[row[2]] += 1
+    return pkgsources

 def pkg_changed(table, artifact_name, pkgname, pkgsource):
    """
@@ -78,7 +86,7 @@ def pkg_changed(table, artifact_name, pkgname, pkgsource):
        i += 1
    return changed

-def pkgs_changes(input_table):
+def pkgs_changes(input_table, pkgsources):
    """
    Analyzes the given package lists table to determine the number of packages
    that changed for every package source.
@@ -88,12 +96,15 @@ def pkgs_changes(input_table):
    input_table: str
        Table to analyse.

+    pkgsources: dict
+        A dictionary that contains all the possible package sources as keys,
+        with every value initialized to 0.
+
    Returns
    -------
    dict
        Output table of the analysis in the form of a dict with headers as keys.
    """
-    pkgchanges_dict = {}
    # Key is the artifact name, and value is a list of tuples constituted
    # of the package that has been checked and its source for this artifact:
    # FIXME: Memory usage?
@@ -106,12 +117,10 @@ def pkgs_changes(input_table):
        pkgname = row[0] # Package name is in the first column
        pkgsource = row[2] # Package source is in the 3rd column
        if (pkgname, pkgsource) not in checked_artifacts[artifact_name]:
-            if pkgsource not in pkgchanges_dict:
-                pkgchanges_dict[pkgsource] = 0
            if pkg_changed(input_table, artifact_name, pkgname, pkgsource):
-                pkgchanges_dict[pkgsource] += 1
+                pkgsources[pkgsource] += 1
            checked_artifacts[artifact_name].append((pkgname, pkgsource))
-    return pkgchanges_dict
+    return pkgsources

 def pkgs_per_container(input_table):
    print("ERROR: Not implemented!")
@@ -181,9 +190,9 @@ def main():

    # Analyzing the inputs:
    if analysis_type == "sources-stats":
-        output_dict = sources_stats(input_table)
+        output_dict = sources_stats(input_table, pkgsources)
    elif analysis_type == "pkgs-changes":
-        output_dict = pkgs_changes(input_table)
+        output_dict = pkgs_changes(input_table, pkgsources)
    elif analysis_type == "pkgs-per-container":
        output_dict = pkgs_per_container(input_table)
    # Adding the current time to every row:
@@ -194,7 +203,7 @@ def main():
    # Writing analysis to output file:
    output_file = open(output_path, "w+")
    dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
-    dict_writer.writeheader()
+    # dict_writer.writeheader()
    dict_writer.writerow(output_dict)
    output_file.close()

--- a/ecg.py
+++ b/ecg.py
@@ -336,7 +336,7 @@ def check_env(config, src_dir, artifact_name, pkglist_path):
        pkglist_file.write(f"{repo_row}\n")

    # Misc packages:
-    logging.info("Checking packages obtained outside of a package manager or VCS")
+    logging.info("Checking miscellaneous packages")
    for pkg in config["misc_packages"]:
        logging.info(f"Downloading package {pkg['name']} from {pkg['url']}")
        pkg_file = tempfile.NamedTemporaryFile()
--- a/plot/line_plot.r
+++ b/plot/line_plot.r
@@ -0,0 +1,18 @@
+#!/usr/bin/env Rscript
+
+# Libraries:
+library(ggplot2)
+library(reshape2)
+
+# Parsing command line arguments:
+options <- commandArgs(trailingOnly = TRUE)
+
+# Loading files:
+table = read.csv(options[1], header = FALSE)
+
+colnames(table) = c("dpkg", "rpm", "pacman", "pip", "conda", "git", "misc", "timestamp")
+
+melted_table = melt(table, id.vars = "timestamp", variable.name = "category")
+
+# Plotting:
+ggplot(melted_table, aes(timestamp, value)) + geom_line(aes(colour = category))
--- a/protocol/protocol.tex
+++ b/protocol/protocol.tex
@@ -43,7 +43,7 @@ breaklines=true

 \begin{itemize}
  \item \href{https://orcid.org/0009-0003-7645-5044}{Quentin \textsc{Guilloteau}}: Conceptualization, Methodology, Software, Data Curation, Supervision, Project administration
-  \item ...
+  \item Antoine Waehren
 \end{itemize}


@@ -52,6 +52,7 @@ breaklines=true
 This project aims to show the limitations of using Docker containers as a reliable reproducibility tool.
 In particular, as Docker relies on non-reproducible tools, it is difficult to construct a \dfile\ that will rebuild the \emph{exact} same software environment in the future.
 In this project, we will collect research artifacts coming from various scientific conferences containing \dfile s, rebuild them periodically, and observe the variation in the resulting software environments.
+We will also catch any error that could occur during the building of the image.

 \subsection{Related work from contributors}

@@ -102,7 +103,7 @@ This Python script\ \cite{ecg_code} takes as input a (verified) JSON representat
 \item Download the artifact (Section \ref{sec:download})
 \item Log the cryptographic hash of the downloaded artifact (Section \ref{sec:download})
 \item Extract the artifact
-\item Build the docker image (Section \ref{sec:docker_build})
+\item Build the Docker image (Section \ref{sec:docker_build})
 \item If the build is successful, gather information about the produced software environment (Sections \ref{sec:package_managers}, \ref{sec:git}, \ref{sec:misc}, and \ref{sec:pyenv})
 \item If the build failed, gather information about the reason for the failure
 \end{enumerate}
@@ -111,7 +112,7 @@ This Python script\ \cite{ecg_code} takes as input a (verified) JSON representat

 \subsubsection{Download of the Artifact}\label{sec:download}

-The link the to artifact is the link provided by the authors in their Artifact Description.
+The link to the artifact is the link provided by the authors in their Artifact Description.
 \ecg\ will use this link to download the artifact.
 If the download is successful, \ecg\ will check the cryptographic hash of the content.
 This allows us to also have information about the stability/longevity of the artifact sharing.
@@ -121,16 +122,21 @@ This allows us to also have information about the stability/longevity of the art
 \ecg\ captures different types of statuses for the build attempt of a \dfile:

 \begin{itemize}
-\item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image)
-\item \texttt{time\_execeed}: the container did not build under \emph{1 hour}
-\item \texttt{unknown\_error}: the build failed for an unknown/classified reason
+  \item \texttt{baseimage\_unavailable}: the base image of the \dfile\ (\texttt{FROM} image) is not available.
+  \item \texttt{job\_time\_exceeded}: when running on a batch system such as OAR, this error indicates that the \dfile\ did not build under \emph{1 hour}
+  \item \texttt{success}: the \dfile\ has been built successfully
+  \item \texttt{package\_unavailable}: a command requested the installation of a package that is not available
+  \item \texttt{artifact\_unavailable}: the artifact could not be downloaded
+  \item \texttt{dockerfile\_not\_found}: no \dfile\ has been found in the location specified in the configuration file
+  \item \texttt{script\_crash}: an error has occurred with the script itself
+  \item \texttt{unknown\_error}: the \dfile\ could not be built for an unknown reason
 \end{itemize}

 \subsubsection{Information from the Package Manager}\label{sec:package_managers}

 Package Managers can provide information about the packages installed: package name and package version.

-\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pip}, \texttt{conda}
+\paragraph{Supported Package Managers} \texttt{dpkg}, \texttt{rpm}, \texttt{pacman}, \texttt{pip}, \texttt{conda}

 \paragraph{Example of Data}

@@ -144,7 +150,7 @@ gcc-8,8.3.0-6,dpkg

 \dfile\ authors can also install packages from source.
 One way to do this is via Git.
-In this case, once the container built sucessfully, \ecg\ logs into the container and extract the commit hash of the repository (via \texttt{git log}).
+In this case, once the container built successfully, \ecg\ logs into the container and extracts the commit hash of the repository (via \texttt{git log}).

 \paragraph{Example of Data}

@@ -156,7 +162,7 @@ ctf,c3f95829628c381dc9bf631c69f08a7b17580b53,git

 \subsubsection{Download content (\texttt{misc})}\label{sec:misc}

-In the case where the \dfile\ download content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.
+In the case where the \dfile\ downloads content from the internet (\eg\ archives, binaries), \ecg\ will download the same content on the host machine (\ie\ not in the container) and then compute the cryptographic hash of the downloaded content.

 \paragraph{Example of Data}

@@ -248,7 +254,7 @@ The second part of the analysis will be done after the first year of data collec

 \begin{itemize}
 \item Number/Proportion of \dfile s that build successfully
-\item Number/Proportion of \dfile s errors (\texttt{baseimage\_unavailable}, \texttt{time\_execeed}, \texttt{unknown\_error}) for the failed builds
+\item Number/Proportion of \dfile s errors (\texttt{baseimage\_unavailable}, \texttt{job\_time\_exceeded}, \texttt{unknown\_error}) for the failed builds
 \end{itemize}

 \paragraph{Software Environment}