The aim of this study is to document current practices around Docker in artifacts from High Performance Computing (HPC) conferences.
This repository contains the code and the artifacts used for the study.
## Components of the workflow
The workflow is managed by Snakemake. Below are brief explanations of the various components of this workflow. Further details can be found in the "Usage" chapter.
### Artifact configuration files
This repository contains configuration files for multiple artifacts from HPC conferences. These configuration files contain the information that ECG uses to build the Docker container and perform a software environment analysis.
ECG is a program that automates software environment checking for scientific artifacts that use Docker. It takes as input a JSON configuration telling where to download the artifact, where to find the Dockerfile to build in the artifact, and which package sources are used by the Docker container.
ECG then downloads the artifact, builds the Dockerfile, and, if the build succeeds, creates a list of the packages installed in the Docker container. It also stores any errors encountered when building the Dockerfile, and logs the hash of the artifact for future comparison.
Several types of analysis are performed on the output of ECG to create tables that can later be plotted. The analyses done for this study are software environment, artifact, and build status analysis. Each type of analysis is done by a different script.
First, open `config/config.yaml` and set `system` to `local` to run the workflow on your local machine, or to `g5k` to run it on the Grid'5000 testbed.
Where `<nb_cores>` is the number of cores you want to assign to the workflow. The number of cores determines the number of instances of ECG that can run in parallel, so you may want to assign as many cores as possible here.
Under `artifacts/nickel`, you will find some configuration files in the Nickel format. You will need to run the following command to convert a configuration file to the JSON format, which makes it readable by ECG and checks it for errors:
- `<config_file>` is the configuration file of the artifact in JSON format. WARNING: The name of the file (without the extension) must comply with the Docker image naming convention: the only allowed characters are lowercase letters and digits, separated by at most one ".", at most two "_", or any number of "-", and the name must not exceed 128 characters.
- `<cache_directory>` is the path to the cache directory, where downloaded artifacts will be stored for future use. If not specified, caching is disabled.
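
The naming rule for `<config_file>` can be checked before launching ECG. The following is a minimal sketch, not part of ECG: the function name `is_valid_config_name` is hypothetical, and the regular expression is one reading of the rule stated above (runs of lowercase letters and digits, joined by a single ".", one or two "_", or any number of "-").

```python
import re
from pathlib import Path

# Runs of [a-z0-9], joined by ".", "_", "__", or one or more "-".
# This pattern is an interpretation of the rule above, not taken from ECG.
_NAME_RE = re.compile(r"[a-z0-9]+(?:(?:\.|_{1,2}|-+)[a-z0-9]+)*")
MAX_NAME_LENGTH = 128


def is_valid_config_name(config_file: str) -> bool:
    """Return True if the file name (without extension) follows the naming rule."""
    name = Path(config_file).stem  # drops the final extension, e.g. ".json"
    return len(name) <= MAX_NAME_LENGTH and _NAME_RE.fullmatch(name) is not None
```

For example, `is_valid_config_name("sc22-my_artifact.json")` is accepted, while a name containing uppercase letters or three consecutive underscores is rejected.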
The list of packages installed in the container, according to the sources (a package manager, `git` or `misc`) given in the config file, is written as a CSV file with the following columns, in order:
For Git packages, the hash of the last commit is used as the version number. For miscellaneous packages, the hash of the file that was used to install the package is used as the version number. The timestamp corresponds to the time when ECG started building the package list, so it is the same for every package logged during the same execution of ECG.
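
To illustrate how a file hash can stand in for a version number, here is a minimal sketch. The hash algorithm used by ECG is not stated in this document, so SHA-256 is an assumption, and `file_hash_version` is a hypothetical helper name.

```python
import hashlib


def file_hash_version(path: str, algorithm: str = "sha256") -> str:
    """Return the hex digest of a file, usable as a pseudo version number.

    The default algorithm is an assumption; ECG may use a different one.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two downloads of the same file yield the same digest, so a changed digest between two executions indicates that the installed file changed.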
- `job_time_exceeded`: When running on a batch system such as OAR, this error indicates that the script exceeded the allocated run time and had to be terminated.
The timestamp corresponds to when the hash was logged, not to when the artifact was downloaded. If the artifact couldn't be downloaded, the hash is equal to `-1`.
Under the folder `analysis`, you will find multiple analysis scripts. These scripts take as input the outputs of ECG to generate tables that can then be plotted by another program.
The outputs are CSV files with the following structure:
| Category 1 | Category 2 | ... | Timestamp |
|------------|------------|-----|-----------|
Where `Category 1`, `Category 2`, ... are the categories (package sources, build status or artifact status) being measured by the analysis functions. For each category, the number of entities (packages, containers or artifacts) belonging to this category is given in the respective column. Categories are clarified below for each type of analysis.
The timestamp corresponds to the time when the output file is being written.
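
The row layout described above (one count per category, followed by a timestamp) can be sketched as follows. This is an illustration only: `build_row` and `append_row` are hypothetical names, and the use of an integer Unix timestamp and the absence of a header row are assumptions.

```python
import csv
import time


def build_row(categories, counts, timestamp=None):
    """Return one output row: a count per category, in order, then the timestamp."""
    if timestamp is None:
        # Assumption: the timestamp is an integer Unix time taken at write time.
        timestamp = int(time.time())
    # Categories with no entity are reported as 0.
    return [counts.get(category, 0) for category in categories] + [timestamp]


def append_row(stream, row):
    """Write one row to an already opened text stream in CSV format."""
    csv.writer(stream).writerow(row)
```

Keeping the category order fixed in one list means every row of the output file has its columns in the same position, which is what a plotting program reading the table relies on.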
- `pkgs-changes`: Number of packages that changed over time (`0` if only one file is given, since it will only include the package list of a single execution).
The script `artifact_analysis.py` performs an artifact analysis by parsing one or more artifact hash logs generated by ECG.
The table generated by this script gives the number of artifacts that are available or not available, and the number of artifacts that have been modified over time.
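
The counting described above can be sketched as follows. This is not the code of `artifact_analysis.py`: the input format (a chronological list of `(artifact_name, hash)` pairs), the category names, and the rule that the most recent entry decides availability are all assumptions. Only the `-1` marker for a failed download comes from the log description.

```python
from collections import defaultdict

UNAVAILABLE_HASH = "-1"  # logged when the artifact could not be downloaded


def summarize_artifacts(entries):
    """Count available, unavailable and changed artifacts.

    `entries` is an iterable of (artifact_name, hash) pairs in chronological
    order (assumed layout, for illustration only).
    """
    history = defaultdict(list)
    for name, artifact_hash in entries:
        history[name].append(artifact_hash)

    summary = {"available": 0, "unavailable": 0, "changed": 0}
    for hashes in history.values():
        # Assumption: availability is decided by the most recent log entry.
        if hashes[-1] == UNAVAILABLE_HASH:
            summary["unavailable"] += 1
        else:
            summary["available"] += 1
        # An artifact counts as changed if two successful downloads differ.
        if len({h for h in hashes if h != UNAVAILABLE_HASH}) > 1:
            summary["changed"] += 1
    return summary
```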
The script `buildstatus_analysis.py` performs a build status analysis by parsing one or more build status logs generated by ECG.
The table generated by this script gives the number of images that have been built successfully, and the number of images that failed to build, for each category of error.
The categories are all the build statuses supported by ECG, in the following order: `success, package_install_failed, baseimage_unavailable, artifact_unavailable, dockerfile_not_found, script_crash, job_time_exceeded, unknown_error`
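
As an illustration of how such a row can be derived, the sketch below counts statuses in the order listed above. It is not the code of `buildstatus_analysis.py`: the function name `count_build_statuses` and the input format (a plain list of status strings, one per build) are assumptions.

```python
from collections import Counter

# Order taken from the list of build statuses above.
BUILD_STATUSES = [
    "success",
    "package_install_failed",
    "baseimage_unavailable",
    "artifact_unavailable",
    "dockerfile_not_found",
    "script_crash",
    "job_time_exceeded",
    "unknown_error",
]


def count_build_statuses(statuses):
    """Return the number of builds per status, in the order of BUILD_STATUSES."""
    counts = Counter(statuses)
    unexpected = set(counts) - set(BUILD_STATUSES)
    if unexpected:
        # Fail loudly rather than silently dropping statuses we do not know.
        raise ValueError(f"unsupported build status: {sorted(unexpected)}")
    return [counts[status] for status in BUILD_STATUSES]
```

For instance, two successful builds and one `script_crash` give `[2, 0, 0, 0, 0, 1, 0, 0]`.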