Modified analysis to account for the fact that logs for multiple executions of ECG are no longer appended to the same file, but written to a new file every time. Added a column for the artifact name in the artifact hash log for that reason. Updated README. Added sed as a dependency for Nix. Started writing the package changes analysis.

This commit is contained in:
antux18 2024-08-07 11:22:54 +02:00
parent eae7c40d59
commit 4b91a6cb5d
6 changed files with 135 additions and 85 deletions

View File

@ -1,6 +1,6 @@
# Study of the Reproducibility and Longevity of Dockerfiles
ECG is a program that automates software environment checking for scientific artifacts.
ECG is a program that automates software environment checking for scientific artifacts that use Docker.
It is meant to be executed periodically to analyze variations in the software environment of the artifact through time.
@ -41,16 +41,18 @@ Where:
- `<artifact_hash_log>` is the path to the file where to log the hash of the downloaded artifact.
- `<cache_directory>` is the path to the cache directory, where downloaded artifacts will be stored for future usage. If not specified, cache is disabled.
You can also use `--docker-cache` to enable caching of the Docker layers, and `-v` to show the full output of the script in your terminal (by default, it is only written to the specified `log_file`).
## Output
### Package list
The list of packages installed in the container, depending on the package managers, Git packages and other miscellaneous packages given in the config file, in the form of a CSV file, with the following columns in order:
The list of packages installed in the container, depending on the sources (a package manager, `git` or `misc`) given in the config file, in the form of a CSV file, with the following columns in order:
| Package name | Version | Package manager |
|--------------|---------|-----------------|
| Package name | Version | Source | Config name | Timestamp |
|--------------|---------|-----------------|-------------|-----------|
For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number.
For Git packages, the hash of the last commit is used as the version number. For miscellaneous packages, the hash of the file used to install the package is used as the version number. The timestamp corresponds to the time when ECG started building the package list, so it is the same for every package logged during a single execution of ECG.
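A hypothetical example of two rows (values are illustrative, not from a real run; hashes shortened):

```
numpy,1.26.4,pip,my_artifact,1722936174.0
somelib,4f2a9c1,git,my_artifact,1722936174.0
```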
### Output log
@ -60,8 +62,8 @@ Just a plain text file containing the output of the script.
The log of the attempts to build the Docker image, in the form of a CSV file, with the following columns in order:
| Config file path | Timestamp | Result |
|------------------|-----------|-----------------|
| Config name | Timestamp | Result |
|-------------|-----------|-----------------|
The timestamp corresponds to when the result is logged, not to when it happened.
@ -79,10 +81,10 @@ The following are the possible results of the build:
The log of the hash of the artifact archive file, in the form of a CSV file, with the following columns in order:
| Timestamp | Hash |
|-----------|------|
| Timestamp | Hash | Config name |
|-----------|------|-------------|
The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded.
The timestamp corresponds to when the hash was logged, not to when the artifact was downloaded. If the artifact could not be downloaded, the hash is equal to `-1`.
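For instance, a hypothetical excerpt where the artifact was downloaded successfully once and then became unavailable (timestamps and hash are made up, hash shortened):

```
1722936174.0,4f2a9c1d8e,my_artifact
1723022574.0,-1,my_artifact
```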
## License

View File

@ -13,60 +13,76 @@ import argparse
import csv
import os
def artifact_changed(table):
def artifact_changed(table, name):
"""
Indicates whether the artifact involved in the given hash log table
has changed over time.
Indicates whether the artifact of the given name has changed over time.
An artifact becoming unavailable is considered modified.
Parameters
----------
table: list
Artifact hash log table.
name: str
Name of the artifact to check.
Returns
-------
bool
True if artifact changed, False otherwise.
"""
changed = False
# Hash is in the 2nd column:
artifact_hash = table[0][1]
i = 0
artifact_hash = ""
while i < len(table) and not changed:
if table[i][1] != artifact_hash:
changed = True
row = table[i]
if row[2] == name:
# If the first hash has not been saved yet:
if artifact_hash == "":
artifact_hash = row[1] # Hash is in the 2nd column
elif row[1] != artifact_hash:
changed = True
i += 1
return changed
def artifact_available(table):
def artifact_available(table, name):
"""
Indicates whether the artifact involved in the given hash log table
is still available.
Indicates whether the artifact of the given name is still available.
Parameters
----------
table: list
Artifact hash log table.
name: str
Name of the artifact to check.
Returns
-------
bool
True if artifact is still available, False otherwise.
"""
available = True
# We check the last line to check current availability:
if table[-1][1] == "":
available = False
for row in table:
if row[2] == name:
if row[1] == "-1":
# -1 means the artifact could not be downloaded. Otherwise,
# this column would contain the hash of the artifact.
available = False
else:
available = True
# The last log of the artifact hash will determine if the artifact is
# currently available or not.
return available
def analysis(input_tables):
def analysis(input_table):
"""
Analyzes the given artifact hash tables to determine if the artifacts are
Analyzes the given artifact hash table to determine if the artifacts are
still available and didn't change, changed, or aren't available anymore.
Parameters
----------
input_tables: str
input_table: list
Table to analyse.
Returns
@ -75,13 +91,17 @@ def analysis(input_tables):
Output table of the analysis in the form of a dict with headers as keys.
"""
artifacts = {"available":0, "unavailable":0, "changed":0}
for table in input_tables:
if artifact_available(table):
artifacts["available"] += 1
else:
artifacts["unavailable"] += 1
if artifact_changed(table):
artifacts["changed"] += 1
checked = [] # Artifacts that have been checked already
for row in input_table:
artifact_name = row[2] # Name of the artifact in the 3rd column
if artifact_name not in checked:
if artifact_available(input_table, artifact_name):
artifacts["available"] += 1
else:
artifacts["unavailable"] += 1
if artifact_changed(input_table, artifact_name):
artifacts["changed"] += 1
checked.append(artifact_name)
return artifacts
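A minimal usage sketch of the new per-artifact logic, using a hypothetical merged hash log (columns: timestamp, hash, artifact name):

```python
table = [
    ["1722936174.0", "abc123", "foo"],  # first hash logged for foo
    ["1723022574.0", "abc123", "foo"],  # same hash again: unchanged
    ["1722936174.0", "def456", "bar"],  # first hash logged for bar
    ["1723022574.0", "-1", "bar"],      # download failed: unavailable and changed
]
print(analysis(table))
# {'available': 1, 'unavailable': 1, 'changed': 1}
```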
def main():
@ -126,18 +146,17 @@ def main():
output_path = args.output
# Parsing the input files:
input_tables = []
input_table = []
for path in input_paths:
input_file = open(path)
input_tables.append(list(csv.reader(input_file)))
input_table += list(csv.reader(input_file))
input_file.close()
# Analyzing the inputs:
output_file = open(output_path, "w+")
output_dict = {}
output_dict = analysis(input_tables)
output_dict = analysis(input_table)
# Writing analysis to output file:
output_file = open(output_path, "w+")
dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
dict_writer.writeheader()
dict_writer.writerow(output_dict)

View File

@ -13,15 +13,15 @@ import argparse
import csv
import os
def analysis(input_tables):
def analysis(input_table):
"""
Analyzes the given build status tables to count the results of the building
Analyzes the given build status table to count the results of the building
of the Dockerfile for each category.
Parameters
----------
input_tables: str
Tables to analyse.
input_table: list
Table to analyse.
Returns
-------
@ -29,21 +29,12 @@ def analysis(input_tables):
Output table of the analysis in the form of a dict with headers as keys.
"""
buildstatus = {}
for table in input_tables:
# # There has never been any error:
# if table == [[]]:
# if "never_failed" not in buildstatus:
# buildstatus["never_failed"] = 1
# else:
# buildstatus["never_failed"] += 1
# # There has been an error at least once:
# else:
for row in table:
# Third column is the result:
if row[2] not in buildstatus:
buildstatus[row[2]] = 1
else:
buildstatus[row[2]] += 1
for row in input_table:
# Third column is the result:
if row[2] not in buildstatus:
buildstatus[row[2]] = 1
else:
buildstatus[row[2]] += 1
return buildstatus
def main():
@ -88,18 +79,17 @@ def main():
output_path = args.output
# Parsing the input files:
input_tables = []
input_table = []
for path in input_paths:
input_file = open(path)
input_tables.append(list(csv.reader(input_file)))
input_table += list(csv.reader(input_file))
input_file.close()
# Analyzing the inputs:
output_file = open(output_path, "w+")
output_dict = {}
output_dict = analysis(input_tables)
output_dict = analysis(input_table)
# Writing analysis to output file:
output_file = open(output_path, "w+")
dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
dict_writer.writeheader()
dict_writer.writerow(output_dict)

View File

@ -6,26 +6,26 @@
program.
Depending on the type of analysis, multiple tables can be generated:
- sources-stats: Number of packages per source (a package manager, git or
- `sources-stats`: Number of packages per source (a package manager, git or
misc)
- pkg-changes: Number of packages that changed over time (0 if only one file
- `pkgs-changes`: Number of packages that changed over time (0 if only one file
is given, since it will only include the package list of a single execution)
- pkg-per-container: Number of packages per container
- `pkgs-per-container`: Number of packages per container
"""
import argparse
import csv
import os
def sources_stats(input_tables):
def sources_stats(input_table):
"""
Analyzes the given package lists tables to determine the number of artifacts
Analyzes the given package lists table to determine the number of artifacts
using a package manager, Git packages or misc packages.
Parameters
----------
input_tables: str
Tables to analyse.
input_table: list
Table to analyse.
Returns
-------
@ -34,15 +34,46 @@ def sources_stats(input_tables):
"""
pkgmgr = {}
i = 0
for table in input_tables:
for row in table:
# Third column is the package source:
if row[2] not in pkgmgr:
pkgmgr[row[2]] = 1
else:
pkgmgr[row[2]] += 1
for row in input_table:
# Third column is the package source:
if row[2] not in pkgmgr:
pkgmgr[row[2]] = 1
else:
pkgmgr[row[2]] += 1
return pkgmgr
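Incidentally, this counting loop (and the identical one in the build status analysis above) could be written more compactly with `collections.Counter`; a behavior-preserving sketch:

```python
from collections import Counter

def sources_stats(input_table):
    # Tally rows by their third column, the package source:
    return dict(Counter(row[2] for row in input_table))
```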
# TODO: per-package change detection (pkg_changed) is not implemented yet.
def pkgs_changes(input_table):
"""
Analyzes the given package lists table to determine the number of packages
that changed for every package source.
Parameters
----------
input_table: list
Table to analyse.
Returns
-------
dict
Output table of the analysis in the form of a dict with headers as keys.
"""
pkgmgr = {}
i = 0
for row in input_table:
# Third column is the package source:
if row[2] not in pkgmgr:
pkgmgr[row[2]] = 1
else:
pkgmgr[row[2]] += 1
return pkgmgr
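As committed, `pkgs_changes` has the same body as `sources_stats`: it counts packages per source rather than changes, which matches the commit message's note that this analysis is only started. A hedged sketch of what the finished version might look like, assuming rows of the form (name, version, source, config name, timestamp):

```python
def pkgs_changes(input_table):
    # Record every version seen for each package, where a package is
    # identified by its config name, source and name:
    versions = {}
    for name, version, source, config, _timestamp in input_table:
        versions.setdefault((config, source, name), set()).add(version)
    # A package changed if more than one version was logged for it;
    # tally those packages per source:
    changes = {}
    for (config, source, name), seen in versions.items():
        if len(seen) > 1:
            changes[source] = changes.get(source, 0) + 1
    return changes
```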
def pkgs_per_container(input_table):
"""
"""
pass
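`pkgs_per_container` is likewise still a stub. A possible implementation, assuming the config name (4th column) identifies the container:

```python
def pkgs_per_container(input_table):
    # Count how many packages were logged for each config name:
    counts = {}
    for row in input_table:
        config = row[3]  # Config name is in the 4th column
        counts[config] = counts.get(config, 0) + 1
    return counts
```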
def main():
# Command line arguments parsing:
parser = argparse.ArgumentParser(
@ -72,7 +103,7 @@ def main():
of a single execution) by using `pkgs-changes`,
the number of packages per container by specifying `pkgs-per-container`.
""",
choices = ["sources-stats", "pkg-changes", "pkgs-per-container"],
choices = ["sources-stats", "pkgs-changes", "pkgs-per-container"],
required = True
)
parser.add_argument(
@ -100,18 +131,22 @@ def main():
analysis_type = args.analysis_type
# Parsing the input files:
input_tables = []
input_table = []
for path in input_paths:
input_file = open(path)
input_tables.append(list(csv.reader(input_file)))
input_table += list(csv.reader(input_file))
input_file.close()
# Analyzing the inputs:
output_file = open(output_path, "w+")
if analysis_type == "sources-stats":
output_dict = sources_stats(input_tables)
output_dict = sources_stats(input_table)
elif analysis_type == "pkgs-changes":
output_dict = pkgs_changes(input_table)
elif analysis_type == "pkgs-per-container":
output_dict = pkgs_per_container(input_table)
# Writing analysis to output file:
output_file = open(output_path, "w+")
dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
dict_writer.writeheader()
dict_writer.writerow(output_dict)

ecg.py
View File

@ -80,7 +80,7 @@ def download_file(url, dest):
pass
return file_hash
def download_sources(config, arthashlog_path, dl_dir, use_cache):
def download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name):
"""
Downloads the source of the artifact in 'config'.
@ -98,6 +98,9 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
use_cache: bool
Indicates whether the cache should be used or not.
artifact_name: str
Name of the artifact, for the artifact hash log.
Returns
-------
temp_dir: str
@ -134,7 +137,7 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
now = datetime.datetime.now()
timestamp = str(datetime.datetime.timestamp(now))
# Artifact hash will be "-1" if the download failed:
arthashlog_file.write(f"{timestamp},{artifact_hash}\n")
arthashlog_file.write(f"{timestamp},{artifact_hash},{artifact_name}\n")
arthashlog_file.close()
else:
logging.info(f"Cache found for {url}, skipping download")
@ -462,10 +465,10 @@ def main():
else:
use_cache = True
dl_dir = cache_dir
artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache)
artifact_name = os.path.splitext(os.path.basename(config_path))[0]
artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name)
# If download was successful:
if artifact_dir != "":
artifact_name = os.path.splitext(os.path.basename(config_path))[0]
return_code, build_output = build_image(config, artifact_dir, artifact_name, args.docker_cache)
if return_code == 0:
status = "success"
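For reference, `os.path.splitext(os.path.basename(...))` turns the config path into the artifact name; with a hypothetical config path:

```python
import os

os.path.splitext(os.path.basename("configs/foo.yaml"))[0]  # -> "foo"
```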

View File

@ -20,6 +20,7 @@
packages = with pkgs; [
snakemake
gawk
gnused # sed, added as a dependency of ECG
nickel
graphviz
# TODO separate into several shells