Updated doc with more details.
This commit is contained in:
parent
56c3682124
commit
3d6b6d1ade
102
README.md
@ -1,14 +1,34 @@
|
||||
# Study of the Reproducibility and Longevity of Dockerfiles
|
||||
|
||||
ECG is a program that automates software environment checking for scientific artifacts that use Docker.
|
||||
The aim of this study is to show the current practices with Docker in High Performance Computing (HPC) conferences' artifacts.
|
||||
|
||||
This repository contains the code and the artifacts used for the study.
|
||||
|
||||
## Components of the workflow
|
||||
|
||||
The workflow is managed by Snakemake. Below are brief explanations of the components of this workflow. Further details can be found in the "Usage" section.
|
||||
|
||||
### Artifact configuration files
|
||||
|
||||
This repository contains configuration files for multiple artifacts from HPC conferences. These configuration files contain the information that ECG uses to build the Docker container and perform a software environment analysis.
|
||||
|
||||
*TODO: talk about the protocol to create one's own config file*
|
||||
|
||||
### ECG
|
||||
|
||||
ECG is a program that automates software environment checking for scientific artifacts that use Docker. It takes as input a JSON configuration telling where to download the artifact, where to find the Dockerfile to build in the artifact, and which package managers are used by the Docker container.
|
||||
|
||||
It then downloads the artifact, builds the Dockerfile, and, if the build succeeds, creates a list of the packages installed in the Docker container. It also stores any errors encountered when building the Dockerfile, and logs the hash of the artifact for future comparison.
|
||||
|
||||
It is meant to be executed periodically to analyze variations in the software environment of the artifact over time.
|
||||
|
||||
## How it works
|
||||
### Analysis
|
||||
|
||||
ECG takes as input a JSON configuration telling where to download the artifact, where to find the Dockerfile to build in the artifact, and which package managers are used by the Docker container.
|
||||
Several types of analysis are performed on the output of ECG to create tables that can later be plotted. The analyses performed for this study are software environment, artifact, and build status analysis. Each type of analysis is performed by a different script.
|
||||
|
||||
It will then download the artifact, build the Dockerfile, and then create a list of the installed packages in the Docker container. It also stores the potential errors encountered when building the Dockerfile, and logs the hash of the artifact for future comparison.
|
||||
### Plots with R
|
||||
|
||||
*TODO: write the code...*
|
||||
|
||||
## Setup
|
||||
|
||||
@ -27,6 +47,26 @@ Otherwise, you can use the Nix package manager and run `nix develop` in this dir
|
||||
|
||||
## Usage
|
||||
|
||||
### Running the whole workflow
|
||||
|
||||
*TODO: write the doc*
|
||||
|
||||
### Running each component separately
|
||||
|
||||
#### Artifact configuration files
|
||||
|
||||
Under `artifacts/nickel`, you will find configuration files in the Nickel format. Run the following command to check a configuration file for errors and convert it to JSON, the format that ECG reads:
|
||||
|
||||
```
|
||||
nickel export --format json --output <output_path> <<< 'let {Artifact, ..} = import "'workflow/nickel/artifact_contract.ncl'" in ((import "'<input_config>'") | Artifact)'
|
||||
```
|
||||
|
||||
Where:
|
||||
- `<input_config>` is the configuration file in the Nickel format to check and convert to JSON.
|
||||
- `<output_path>` is the path where the converted configuration file will be stored.
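
To convert several configuration files, the command above can be driven from Python. The following is a minimal sketch, assuming `nickel` is available on the `PATH` and that the script is run from the root of this repository. The function names `build_nickel_program` and `export_config` are hypothetical and not part of the workflow; only the `nickel export` options shown above are used.

```python
import subprocess

# Contract path taken from the command above, relative to the repository root.
CONTRACT_PATH = "workflow/nickel/artifact_contract.ncl"


def build_nickel_program(input_config: str, contract_path: str = CONTRACT_PATH) -> str:
    """Return the Nickel expression that applies the Artifact contract.

    Hypothetical helper: it builds the same text that the shell command
    passes to nickel on standard input. Paths containing double quotes
    are not handled.
    """
    return (
        f'let {{Artifact, ..}} = import "{contract_path}" '
        f'in ((import "{input_config}") | Artifact)'
    )


def export_config(input_config: str, output_path: str) -> None:
    """Check one Nickel configuration file and write it out as JSON."""
    subprocess.run(
        ["nickel", "export", "--format", "json", "--output", output_path],
        input=build_nickel_program(input_config),  # sent to nickel on stdin
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
```

Calling `export_config("<input_config>", "<output_path>")` with real paths is then intended to be equivalent to running the shell command by hand.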
|
||||
|
||||
#### ECG
|
||||
|
||||
Run `ecg.py` as follows:
|
||||
|
||||
```
|
||||
@ -34,7 +74,7 @@ python3 ecg.py <config_file> -p <pkglist_path> -l <log_file> -b <build_status_fi
|
||||
```
|
||||
|
||||
Where:
|
||||
- `<config_file>` is the configuration file of the artifact in JSON format. A template of the Nickel file used to produce the JSON config file is given in `artifacts/nickel/template.ncl`. WARNING: The name of the file (without the extension) must comply with the Docker image naming convention: the only characters allowed are lowercase letters and digits, separated by at most one ".", at most two "_", or any number of "-", and the name must be at most 128 characters long.
|
||||
- `<config_file>` is the configuration file of the artifact in JSON format. WARNING: The name of the file (without the extension) must comply with the Docker image naming convention: only characters allowed are lowercase letters and numbers, separated with either one "." maximum, or two "_" maximum, or an unlimited number of "-", and should be of 128 characters maximum.
|
||||
- `<pkglist_path>` is the path to the file where the package list generated by the program should be written.
|
||||
- `<log_file>` is the path to the file where the output of the program is logged.
|
||||
- `<build_status_file>` is the path to the file where the build status of the Docker image given in the configuration file is written.
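
The naming rule for `<config_file>` can be checked before running ECG. The following is a minimal sketch that encodes the rule exactly as worded in the warning above; the function `is_valid_config_name` and the constants are hypothetical and are not part of `ecg.py`.

```python
import re
from pathlib import Path

# Runs of lowercase letters and digits, separated by one ".", one or two "_",
# or any number of "-" (the rule as stated in the warning above).
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:(?:\.|_{1,2}|-+)[a-z0-9]+)*")
MAX_NAME_LENGTH = 128


def is_valid_config_name(config_file: str) -> bool:
    """Return True if the file name, without its extension, follows the rule."""
    name = Path(config_file).stem  # drops the directory and the last extension
    return len(name) <= MAX_NAME_LENGTH and NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_config_name("my-artifact_1.json")` is accepted, while a name containing uppercase letters or three consecutive underscores is rejected.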
|
||||
@ -43,9 +83,9 @@ Where:
|
||||
|
||||
You can also use `--docker-cache` to enable the cache of the Docker layers, and `-v` to show the full output of the script in your terminal (by default, it is only written to the specified `log_file`).
|
||||
|
||||
## Output
|
||||
##### Outputs
|
||||
|
||||
### Package list
|
||||
###### Package list
|
||||
|
||||
The list of packages installed in the container, depending on the sources (a package manager, `git` or `misc`) given in the config file, in the form of a CSV file, with the following columns in order:
|
||||
|
||||
@ -54,11 +94,11 @@ The list of packages installed in the container, depending on the sources (a pac
|
||||
|
||||
For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number. The timestamp corresponds to the time when ECG started building the package list, so it will be the same for each package that has been logged during the same execution of ECG.
|
||||
|
||||
### Output log
|
||||
###### Output log
|
||||
|
||||
A plain text file containing the output of the script.
|
||||
|
||||
### Build status file
|
||||
###### Build status file
|
||||
|
||||
The log of the attempts to build the Docker image, in the form of a CSV file, with the following columns in order:
|
||||
|
||||
@ -77,7 +117,7 @@ The following are the possible results of the build:
|
||||
- `job_time_exceeded`: When running on a batch system such as OAR, this error indicates that the script exceeded the allocated run time and had to be terminated.
|
||||
- `unknown_error`: Any other error.
|
||||
|
||||
### Artifact hash log
|
||||
###### Artifact hash log
|
||||
|
||||
The log of the hash of the artifact archive file, in the form of a CSV file, with the following columns in order:
|
||||
|
||||
@ -86,6 +126,48 @@ The log of the hash of the artifact archive file, in the form of a CSV file, wit
|
||||
|
||||
The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded. If the artifact couldn't be downloaded, the hash is equal to `-1`.
|
||||
|
||||
#### Analysis
|
||||
|
||||
Under the folder `analysis`, you will find multiple analysis scripts. These scripts take as input the outputs of ECG to generate tables that can then be plotted by another program.
|
||||
|
||||
All the analysis scripts can be run the same way:
|
||||
|
||||
```
|
||||
python3 <analysis_script> -i <input_table1> -i <input_table2> ... -o <output_table>
|
||||
```
|
||||
|
||||
Where:
|
||||
- `<analysis_script>` is one of the analysis scripts described below.
|
||||
- `<input_table1>`, `<input_table2>`... are one or more output tables from ECG. Which ECG output is required depends on the analysis script (see below).
|
||||
- `<output_table>` is the path where the table generated by the analysis script will be stored.
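
A command line of this shape, with a repeatable `-i` option and a single `-o` option, can be declared with `argparse`. The following is a minimal sketch and not the code of the actual analysis scripts; the function `build_parser` and the attribute names `inputs` and `output` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts several -i options and one -o option."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis script.")
    parser.add_argument(
        "-i",
        dest="inputs",
        action="append",  # each -i adds one more path to the list
        required=True,
        help="an output table from ECG (can be given several times)",
    )
    parser.add_argument(
        "-o",
        dest="output",
        required=True,
        help="path where the generated table is stored",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.inputs, args.output)
```

With `action="append"`, the order in which the input tables are given on the command line is preserved in `args.inputs`.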
|
||||
|
||||
##### Software environment analysis
|
||||
|
||||
The script `softenv_analysis.py` performs a software environment analysis by parsing one or more package lists generated by ECG.
|
||||
|
||||
Depending on the type of analysis, multiple tables can be generated:
|
||||
- `sources-stats`: Number of packages per source (a package manager, `git` or `misc`).
|
||||
- `pkg-changes`: Number of packages that changed over time (`0` if only one file is given, since it will only include the package list of a single execution).
|
||||
- `pkgs-per-container`: Number of packages per container.
|
||||
|
||||
The type of analysis can be specified using the option `-t`.
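
To illustrate the idea behind `pkg-changes`, the following minimal sketch counts the packages that differ between two executions of ECG. It is not the code of `softenv_analysis.py`. It assumes that each package list has already been loaded into a dictionary mapping a package name to its version, and that a package counts as changed when it was added, removed, or given a different version. The function `count_package_changes` is hypothetical.

```python
def count_package_changes(before: dict[str, str], after: dict[str, str]) -> int:
    """Count packages added, removed, or whose version differs.

    Both arguments map a package name to its version (assumed layout).
    """
    names = before.keys() | after.keys()  # every package seen in either run
    return sum(1 for name in names if before.get(name) != after.get(name))


first_run = {"gcc": "12.2", "make": "4.3", "curl": "7.88"}
second_run = {"gcc": "13.1", "make": "4.3", "git": "2.39"}
changes = count_package_changes(first_run, second_run)
```

Here `changes` is `3`: `gcc` has a new version, `curl` was removed, and `git` was added. Comparing a package list with itself gives `0`, which matches the single-file case described above.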
|
||||
|
||||
##### Artifact analysis
|
||||
|
||||
The script `artifact_analysis.py` performs an artifact analysis by parsing one or more artifact hash logs generated by ECG.
|
||||
|
||||
The table generated by this script gives the number of artifacts that are available or not available, and the number of artifacts that have been modified over time.
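
The counting described above can be sketched as follows. This is not the code of `artifact_analysis.py`. It assumes that the hash logs have already been grouped into a dictionary mapping each artifact to its logged hashes in chronological order, that an artifact counts as available when its most recent hash is not `-1`, and that it counts as modified when at least two different real hashes were logged. The function `summarize_artifacts` and the keys of its result are hypothetical.

```python
UNAVAILABLE = "-1"  # value logged by ECG when the artifact could not be downloaded


def summarize_artifacts(hash_history: dict[str, list[str]]) -> dict[str, int]:
    """Count available, unavailable, and modified artifacts."""
    summary = {"available": 0, "unavailable": 0, "modified": 0}
    for hashes in hash_history.values():
        if not hashes:
            continue  # nothing was logged for this artifact
        if hashes[-1] == UNAVAILABLE:
            summary["unavailable"] += 1
        else:
            summary["available"] += 1
        # Ignore failed downloads when looking for a change in content.
        distinct = {h for h in hashes if h != UNAVAILABLE}
        if len(distinct) > 1:
            summary["modified"] += 1
    return summary
```

Under these assumptions, an artifact whose download failed once but whose hash is otherwise unchanged is not counted as modified.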
|
||||
|
||||
##### Build status analysis
|
||||
|
||||
The script `buildstatus_analysis.py` performs a build status analysis by parsing one or more build status logs generated by ECG.
|
||||
|
||||
The table generated by this script gives the number of images that have been built successfully, and the number of images that failed to build, for each category of error.
|
||||
|
||||
#### Plots with R
|
||||
|
||||
*TODO: Write the code...*
|
||||
|
||||
## License
|
||||
|
||||
This project is licensed under the GNU General Public License version 3. You can find the terms of the license in the file LICENSE.
|
@ -3,10 +3,6 @@
|
||||
"""
|
||||
This script performs an artifact analysis on the outputs of the workflow
|
||||
to generate tables that can then be plotted by another program.
|
||||
|
||||
The generated table gives the amount of artifacts that are available
|
||||
or not available, and the amount of artifacts that have been modified
|
||||
over time.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
|
@ -3,10 +3,6 @@
|
||||
"""
|
||||
This script performs a build status analysis on the outputs of the workflow
|
||||
to generate tables that can then be plotted by another program.
|
||||
|
||||
The generated table gives the amount of images that have been built
|
||||
sucessfully, and the amount of images that failed to build, for each
|
||||
category of error.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
@ -46,7 +42,7 @@ def main():
|
||||
This script performs a build status analysis on the outputs of the
|
||||
workflow to generate tables that can then be plotted by another program.
|
||||
The generated table gives the amount of images that have been
|
||||
built sucessfully, and the amount of images that failed to build,
|
||||
built successfully, and the amount of images that failed to build,
|
||||
for each category of error.
|
||||
"""
|
||||
)
|
||||
|
@ -4,13 +4,6 @@
|
||||
This script performs a software environment analysis on the outputs
|
||||
of the workflow to generate tables that can then be plotted by another
|
||||
program.
|
||||
|
||||
Depending on the type of analysis, multiple tables can be generated:
|
||||
- `sources-stats`: Number of packages per source (a package manager, git or
|
||||
misc)
|
||||
- `pkg-changes`: Number of packages that changed over time (0 if only one file
|
||||
is given, since it will only include the package list of a single execution)
|
||||
- `pkgs-per-container`: Number of packages per container
|
||||
"""
|
||||
|
||||
import argparse
|
||||
|