
Study of the Reproducibility and Longevity of Dockerfiles

ECG is a program that automates software environment checking for scientific artifacts that use Docker.

It is meant to be executed periodically to analyze variations in the software environment of the artifact through time.

How it works

ECG takes as input a JSON configuration file that specifies where to download the artifact, where to find the Dockerfile to build inside the artifact, and which package managers are used in the Docker container.

It then downloads the artifact, builds the Dockerfile, and creates a list of the packages installed in the Docker container. It also records any errors encountered while building the Dockerfile, and logs the hash of the artifact for future comparison.

Setup

A Linux operating system and the following packages are required:

  • python
  • docker
  • snakemake
  • gawk
  • nickel
  • sed

The following Python package is also required:

  • requests

Alternatively, you can use the Nix package manager and run nix develop in this directory to set up the full software environment.

Usage

Run ecg.py as follows:

python3 ecg.py <config_file> -p <pkglist_path> -l <log_file> -b <build_status_file> -a <artifact_hash_log> -c <cache_directory>

Where:

  • <config_file> is the configuration file of the artifact in JSON format. A template of the Nickel file used to produce the JSON config file is given in artifacts/nickel/template.ncl. WARNING: The name of the file (without the extension) must comply with the Docker image naming convention: the only allowed characters are lowercase letters and digits, separated by at most one ".", at most two "_", or any number of "-", and the name must not exceed 128 characters.
  • <pkglist_path> is the path to the file where the package list generated by the program should be written.
  • <log_file> is the path to the file where to log the output of the program.
  • <build_status_file> is the path to the file where to write the build status of the Docker image given in the configuration file.
  • <artifact_hash_log> is the path to the file where to log the hash of the downloaded artifact.
  • <cache_directory> is the path to the cache directory, where downloaded artifacts will be stored for future usage. If not specified, cache is disabled.
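The naming rule for <config_file> can be checked before launching a run. The following is a minimal sketch, not part of ECG: the function name is hypothetical, and the regular expression is one reading of the rule stated above (groups of lowercase letters and digits joined by one ".", one or two "_", or any number of "-").

```python
import re
from pathlib import Path

# Assumed reading of the naming rule: runs of lowercase letters/digits,
# joined by a single ".", one or two "_", or any number of "-".
_NAME_RE = re.compile(r"[a-z0-9]+(?:(?:\.|_{1,2}|-+)[a-z0-9]+)*")
MAX_NAME_LENGTH = 128


def is_valid_config_name(config_path: str) -> bool:
    """Return True if the file name (without its extension) fits the rule."""
    name = Path(config_path).stem
    if len(name) > MAX_NAME_LENGTH:
        return False
    return _NAME_RE.fullmatch(name) is not None
```

For example, is_valid_config_name("my-artifact.json") is accepted under this reading, while "My_Artifact.json" is rejected because of the uppercase letters.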

You can also use --docker-cache to enable the cache of the Docker layers, and -v to show the full output of the script in your terminal (by default, it is only written to the specified log_file).
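When ECG is launched from another script, the command line above can be assembled programmatically. The sketch below only uses the flags documented in this section; the helper name and the output file names (such as "_pkglist.csv") are illustrative assumptions, not conventions of ECG.

```python
from pathlib import Path


def build_ecg_command(config_file, output_dir, cache_dir=None,
                      docker_cache=False, verbose=False):
    """Assemble the argument list for one ECG run.

    Output file names are hypothetical: they are derived from the
    config file name only to keep the example self-contained.
    """
    out = Path(output_dir)
    name = Path(config_file).stem
    cmd = [
        "python3", "ecg.py", str(config_file),
        "-p", str(out / f"{name}_pkglist.csv"),
        "-l", str(out / f"{name}.log"),
        "-b", str(out / f"{name}_build_status.csv"),
        "-a", str(out / f"{name}_artifact_hash.csv"),
    ]
    if cache_dir is not None:
        # Without -c, the artifact cache is disabled.
        cmd += ["-c", str(cache_dir)]
    if docker_cache:
        cmd.append("--docker-cache")
    if verbose:
        cmd.append("-v")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains ecg.py.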

Output

Package list

A CSV file listing the packages installed in the container, for each source (a package manager, git or misc) given in the config file. It has the following columns, in order:

Package name | Version | Source | Config name | Timestamp

For Git packages, the hash of the last commit is used as the version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as the version number. The timestamp corresponds to the time when ECG started building the package list, so it is the same for every package logged during the same execution of ECG.
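Because every row from one execution shares a timestamp, package lists can be grouped per run and compared. The following sketch assumes the CSV has no header row and treats the timestamp as an opaque string; the function names and the short column keys are hypothetical.

```python
import csv
from collections import defaultdict

# Column order as documented above; the short keys are illustrative.
PKGLIST_COLUMNS = ["package", "version", "source", "config_name", "timestamp"]


def read_package_list(stream):
    """Read a package list CSV (assumed to have no header row) into dicts."""
    return list(csv.DictReader(stream, fieldnames=PKGLIST_COLUMNS))


def group_by_run(rows):
    """Map each timestamp (one per ECG execution) to its {package: version}."""
    runs = defaultdict(dict)
    for row in rows:
        runs[row["timestamp"]][row["package"]] = row["version"]
    return dict(runs)


def changed_packages(old, new):
    """Return packages whose version differs, or that exist in only one run."""
    names = set(old) | set(new)
    return sorted(n for n in names if old.get(n) != new.get(n))
```

Given two runs, changed_packages(runs[t1], runs[t2]) lists the packages that were added, removed, or changed version between them.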

Output log

A plain-text file containing the output of the script.

Build status file

The log of the attempts to build the Docker image, in the form of a CSV file, with the following columns in order:

Config name | Timestamp | Result

The timestamp corresponds to when the result is being logged, not to when it happened.

The following are the possible results of the build:

  • success: The Docker image has been built successfully.
  • package_unavailable: A command requested the installation of a package that is not available.
  • baseimage_unavailable: The base image needed for this container is not available.
  • artifact_unavailable: The artifact could not be downloaded.
  • dockerfile_not_found: No Dockerfile has been found in the location specified in the configuration file.
  • script_crash: An error has occurred with the script itself.
  • job_time_exceeded: When running on a batch system such as OAR, this error indicates that the script exceeded the allocated run time and had to be terminated.
  • unknown_error: Any other error.
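A build status file can be summarized by counting how often each result occurs. This is a minimal sketch that assumes the CSV has no header row; the function name is hypothetical, while the set of results is copied from the list above.

```python
import csv
from collections import Counter

# Result values documented above.
BUILD_RESULTS = {
    "success",
    "package_unavailable",
    "baseimage_unavailable",
    "artifact_unavailable",
    "dockerfile_not_found",
    "script_crash",
    "job_time_exceeded",
    "unknown_error",
}


def tally_build_results(stream):
    """Count build results in a status CSV (columns: config, timestamp, result)."""
    counts = Counter()
    for row in csv.reader(stream):
        if not row:
            continue  # skip blank lines
        _config_name, _timestamp, result = row
        if result not in BUILD_RESULTS:
            raise ValueError(f"unexpected build result: {result!r}")
        counts[result] += 1
    return counts
```

Raising on an undocumented value is a design choice for this sketch: it surfaces a mismatch with the list above instead of silently counting it.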

Artifact hash log

The log of the hash of the artifact archive file, in the form of a CSV file, with the following columns in order:

Timestamp | Hash | Config name

The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded. If the artifact couldn't be downloaded, the hash is equal to -1.
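The hash log makes it possible to detect when an artifact changed between executions. The sketch below assumes rows are in chronological order, that the CSV has no header row, and that failed downloads (hash -1) should be ignored rather than counted as changes; the function name is hypothetical.

```python
import csv

DOWNLOAD_FAILED = "-1"  # documented marker for an artifact that could not be downloaded


def detect_hash_changes(stream):
    """Return (config_name, timestamp, old_hash, new_hash) for each change.

    Rows are expected as: timestamp, hash, config name, oldest first.
    """
    last_seen = {}
    changes = []
    for row in csv.reader(stream):
        if not row:
            continue  # skip blank lines
        timestamp, artifact_hash, config_name = row
        if artifact_hash == DOWNLOAD_FAILED:
            continue  # a failed download says nothing about the content
        previous = last_seen.get(config_name)
        if previous is not None and previous != artifact_hash:
            changes.append((config_name, timestamp, previous, artifact_hash))
        last_seen[config_name] = artifact_hash
    return changes
```

An empty result means that every successfully downloaded copy of each artifact had the same hash.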

License

This project is licensed under the GNU General Public License version 3. You can find the terms of the license in the file LICENSE.