Modified the analysis scripts to account for the fact that logs from multiple executions of ECG are no longer appended to the same file, but written to a new file every time. Added a column for the artifact name to the artifact hash log for that reason. Updated the README. Added sed as a dependency for Nix. Started writing the package changes analysis.

antux18 2024-08-07 11:22:54 +02:00
parent eae7c40d59
commit 4b91a6cb5d
6 changed files with 135 additions and 85 deletions

View File

@@ -1,6 +1,6 @@
 # Study of the Reproducibility and Longevity of Dockerfiles
 
-ECG is a program that automates software environment checking for scientific artifacts.
+ECG is a program that automates software environment checking for scientific artifacts that use Docker.
 
 It is meant to be executed periodically to analyze variations in the software environment of the artifact through time.
@@ -41,16 +41,18 @@ Where:
 - `<artifact_hash_log>` is the path to the file where to log the hash of the downloaded artifact.
 - `<cache_directory>` is the path to the cache directory, where downloaded artifacts will be stored for future usage. If not specified, cache is disabled.
 
+You can also use `--docker-cache` to enable the cache of the Docker layers, and `-v` to show the full output of the script in your terminal (by default, it is only written to the specified `log_file`).
+
 ## Output
 
 ### Package list
 
-The list of packages installed in the container, depending on the package managers, Git packages and other miscellaneous packages given in the config file, in the form of a CSV file, with the following columns in order:
+The list of packages installed in the container, depending on the sources (a package manager, `git` or `misc`) given in the config file, in the form of a CSV file, with the following columns in order:
 
-| Package name | Version | Package manager |
-|--------------|---------|-----------------|
+| Package name | Version | Source | Config name | Timestamp |
+|--------------|---------|--------|-------------|-----------|
 
-For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number.
+For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number. The timestamp corresponds to the time when ECG started building the package list, so it will be the same for each package that has been logged during the same execution of ECG.
 
 ### Output log
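For illustration, rows of the new package list format could look like the following (package names, sources, config name and timestamp value are all made up; ECG writes one such CSV file per execution):

    make,4.3,dpkg,artifact_xyz,1722935147.0
    mypkg,2a8b1c3d,git,artifact_xyz,1722935147.0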
@@ -60,8 +62,8 @@ Just a plain text file containing the output of the script.
 The log of the attempts to build the Docker image, in the form of a CSV file, with the following columns in order:
 
-| Config file path | Timestamp | Result |
-|------------------|-----------|--------|
+| Config name | Timestamp | Result |
+|-------------|-----------|--------|
 
 The timestamp corresponds to when the result is being logged, not to when it happened.
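A made-up example row under the renamed first column (`success` is the status logged on a successful build, as seen in `ecg.py` below; name and timestamp are hypothetical):

    artifact_xyz,1722935147.0,success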
@@ -79,10 +81,10 @@ The following are the possible results of the build:
 The log of the hash of the artifact archive file, in the form of a CSV file, with the following columns in order:
 
-| Timestamp | Hash |
-|-----------|------|
+| Timestamp | Hash | Config name |
+|-----------|------|-------------|
 
-The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded.
+The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded. If the artifact couldn't be downloaded, the hash is equal to `-1`.
 
 ## License
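Two made-up example rows of the artifact hash log with the new `Config name` column, the second one recording a failed download with the `-1` hash:

    1722935147.0,e3b0c44298fc1c149afbf4c8996fb924,artifact_xyz
    1723021547.0,-1,artifact_xyz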

View File

@@ -13,60 +13,76 @@ import argparse
 import csv
 import os
 
-def artifact_changed(table):
+def artifact_changed(table, name):
     """
-    Indicates whether the artifact involved in the given hash log table
-    has changed over time.
+    Indicates whether the artifact of the given name has changed over time.
+    An artifact becoming unavailable is considered as modified.
 
     Parameters
     ----------
     table: list
         Artifact hash log table.
+    name: str
+        Name of the artifact to check.
 
     Returns
     -------
     bool
         True if artifact changed, False otherwise.
     """
     changed = False
-    # Hash is in the 2nd column:
-    artifact_hash = table[0][1]
     i = 0
+    artifact_hash = ""
     while i < len(table) and not changed:
-        if table[i][1] != artifact_hash:
-            changed = True
+        row = table[i]
+        if row[2] == name:
+            # If the first hash has not been saved yet:
+            if artifact_hash == "":
+                artifact_hash = row[1] # Hash is in the 2nd column
+            elif row[1] != artifact_hash:
+                changed = True
         i += 1
     return changed
 
-def artifact_available(table):
+def artifact_available(table, name):
     """
-    Indicates whether the artifact involved in the given hash log table
-    is still available.
+    Indicates whether the artifact of the given name is still available.
 
     Parameters
     ----------
     table: list
         Artifact hash log table.
+    name: str
+        Name of the artifact to check.
 
     Returns
     -------
     bool
         True if artifact is still available, False otherwise.
     """
     available = True
-    # We check the last line to check current availability:
-    if table[-1][1] == "":
-        available = False
+    for row in table:
+        if row[2] == name:
+            if row[1] == "-1":
+                # -1 means the artifact could not be downloaded. Otherwise,
+                # this column would contain the hash of the artifact.
+                available = False
+            else:
+                available = True
+    # The last log of the artifact hash will determine if the artifact is
+    # currently available or not.
     return available
 
-def analysis(input_tables):
+def analysis(input_table):
     """
-    Analyzes the given artifact hash tables to determine if the artifacts are
+    Analyzes the given artifact hash table to determine if the artifacts are
     still available and didn't change, changed, or aren't available anymore.
 
     Parameters
     ----------
-    input_tables: str
+    input_table: str
         Table to analyse.
 
     Returns
@@ -75,13 +91,17 @@ def analysis(input_tables):
         Output table of the analysis in the form of a dict with headers as keys.
     """
     artifacts = {"available":0, "unavailable":0, "changed":0}
-    for table in input_tables:
-        if artifact_available(table):
-            artifacts["available"] += 1
-        else:
-            artifacts["unavailable"] += 1
-        if artifact_changed(table):
-            artifacts["changed"] += 1
+    checked = [] # Artifacts that have been checked already
+    for row in input_table:
+        artifact_name = row[2] # Name of the artifact in the 3rd column
+        if artifact_name not in checked:
+            if artifact_available(input_table, artifact_name):
+                artifacts["available"] += 1
+            else:
+                artifacts["unavailable"] += 1
+            if artifact_changed(input_table, artifact_name):
+                artifacts["changed"] += 1
+            checked.append(artifact_name)
     return artifacts
 
 def main():
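Since the hash logs of all executions are now merged into a single table, the per-artifact filtering above is what keeps artifacts with several log rows from being counted more than once. A minimal sketch of the expected behaviour, assuming `analysis` and the two helpers it calls are in scope, with made-up hashes:

    import csv
    import io

    # Two artifacts in the new "timestamp,hash,name" row format: "bar" kept
    # the same hash across two executions, "baz" later became unavailable (-1):
    log = "1000.0,aaa,bar\n2000.0,aaa,bar\n1000.0,bbb,baz\n2000.0,-1,baz\n"
    table = list(csv.reader(io.StringIO(log)))

    print(analysis(table))
    # Expected: {'available': 1, 'unavailable': 1, 'changed': 1}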
@@ -126,18 +146,17 @@ def main():
     output_path = args.output
 
     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()
 
     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
-    output_dict = {}
-    output_dict = analysis(input_tables)
+    output_dict = analysis(input_table)
 
     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)

View File

@@ -13,15 +13,15 @@ import argparse
 import csv
 import os
 
-def analysis(input_tables):
+def analysis(input_table):
     """
-    Analyzes the given build status tables to count the results of the building
+    Analyzes the given build status table to count the results of the building
     of the Dockerfile for each category.
 
     Parameters
     ----------
-    input_tables: str
-        Tables to analyse.
+    input_table: str
+        Table to analyse.
 
     Returns
     -------
@@ -29,21 +29,12 @@ def analysis(input_tables):
         Output table of the analysis in the form of a dict with headers as keys.
     """
     buildstatus = {}
-    for table in input_tables:
-        # # There has never been any error:
-        # if table == [[]]:
-        #     if "never_failed" not in buildstatus:
-        #         buildstatus["never_failed"] = 1
-        #     else:
-        #         buildstatus["never_failed"] += 1
-        # # There has been an error at least once:
-        # else:
-        for row in table:
-            # Third column is the result:
-            if row[2] not in buildstatus:
-                buildstatus[row[2]] = 1
-            else:
-                buildstatus[row[2]] += 1
+    for row in input_table:
+        # Third column is the result:
+        if row[2] not in buildstatus:
+            buildstatus[row[2]] = 1
+        else:
+            buildstatus[row[2]] += 1
     return buildstatus
 
 def main():
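The simplification above works because each row of the merged table already carries its result in the third column, so a flat tally is enough. A quick sketch of that counting on made-up rows (`build_fail` is a hypothetical result name; `dict.get` is an equivalent shorthand for the if/else in the script):

    input_table = [
        ["artifact_xyz", "1000.0", "success"],
        ["artifact_xyz", "2000.0", "build_fail"],
        ["other_conf", "1500.0", "success"],
    ]

    buildstatus = {}
    for row in input_table:
        # Third column is the result:
        buildstatus[row[2]] = buildstatus.get(row[2], 0) + 1

    print(buildstatus)  # {'success': 2, 'build_fail': 1}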
@@ -88,18 +79,17 @@ def main():
     output_path = args.output
 
     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()
 
     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
-    output_dict = {}
-    output_dict = analysis(input_tables)
+    output_dict = analysis(input_table)
 
     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)

View File

@@ -6,26 +6,26 @@
 program.
 
 Depending on the type of analysis, multiple tables can be generated:
-- sources-stats: Number of packages per source (a package manager, git or
+- `sources-stats`: Number of packages per source (a package manager, git or
   misc)
-- pkg-changes: Number of packages that changed over time (0 if only one file
+- `pkg-changes`: Number of packages that changed over time (0 if only one file
   is given, since it will only include the package list of a single execution)
-- pkg-per-container: Number of packages per container
+- `pkgs-per-container`: Number of packages per container
 """
 
 import argparse
 import csv
 import os
 
-def sources_stats(input_tables):
+def sources_stats(input_table):
     """
-    Analyzes the given package lists tables to determine the number of artifacts
+    Analyzes the given package lists table to determine the number of artifacts
     using a package manager, Git packages or misc packages.
 
     Parameters
     ----------
-    input_tables: str
-        Tables to analyse.
+    input_table: str
+        Table to analyse.
 
     Returns
     -------
@@ -34,15 +34,46 @@ def sources_stats(input_tables):
     """
     pkgmgr = {}
     i = 0
-    for table in input_tables:
-        for row in table:
-            # Third column is the package source:
-            if row[2] not in pkgmgr:
-                pkgmgr[row[2]] = 1
-            else:
-                pkgmgr[row[2]] += 1
+    for row in input_table:
+        # Third column is the package source:
+        if row[2] not in pkgmgr:
+            pkgmgr[row[2]] = 1
+        else:
+            pkgmgr[row[2]] += 1
     return pkgmgr
 
+# def pkg_changed(pkgname, )
+
+def pkgs_changes(input_table):
+    """
+    Analyzes the given package lists table to determine the number of packages
+    that changed for every package source.
+
+    Parameters
+    ----------
+    input_table: str
+        Table to analyse.
+
+    Returns
+    -------
+    dict
+        Output table of the analysis in the form of a dict with headers as keys.
+    """
+    pkgmgr = {}
+    i = 0
+    for row in input_table:
+        # Third column is the package source:
+        if row[2] not in pkgmgr:
+            pkgmgr[row[2]] = 1
+        else:
+            pkgmgr[row[2]] += 1
+    return pkgmgr
+
+def pkgs_per_container(input_table):
+    """
+    """
+    pass
+
 def main():
     # Command line arguments parsing:
     parser = argparse.ArgumentParser(
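For illustration, `sources_stats` tallies the third column of the merged package list. On made-up rows in the new five-column format (package manager names are hypothetical), assuming the function is in scope:

    input_table = [
        ["make", "4.3", "dpkg", "artifact_xyz", "1000.0"],
        ["numpy", "1.24.2", "pip", "artifact_xyz", "1000.0"],
        ["mypkg", "2a8b1c3d", "git", "artifact_xyz", "1000.0"],
        ["grep", "3.8", "dpkg", "other_conf", "1000.0"],
    ]

    print(sources_stats(input_table))
    # Expected: {'dpkg': 2, 'pip': 1, 'git': 1}

Note that `pkgs_changes` currently duplicates this counting; per the commit message, the package changes analysis is only started here.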
@@ -72,7 +103,7 @@ def main():
         of a single execution) by using `pkg-changes`,
         the number of packages per container by specifying `pkgs-per-container`.
         """,
-        choices = ["sources-stats", "pkg-changes", "pkgs-per-container"],
+        choices = ["sources-stats", "pkgs-changes", "pkgs-per-container"],
         required = True
     )
     parser.add_argument(
@@ -100,18 +131,22 @@ def main():
     analysis_type = args.analysis_type
 
     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()
 
     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
     if analysis_type == "sources-stats":
-        output_dict = sources_stats(input_tables)
+        output_dict = sources_stats(input_table)
+    elif analysis_type == "pkgs-changes":
+        output_dict = pkgs_changes(input_table)
+    elif analysis_type == "pkgs-per-container":
+        output_dict = pkgs_per_container(input_table)
 
     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)

ecg.py
View File

@@ -80,7 +80,7 @@ def download_file(url, dest):
         pass
     return file_hash
 
-def download_sources(config, arthashlog_path, dl_dir, use_cache):
+def download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name):
     """
     Downloads the source of the artifact in 'config'.
 
@@ -98,6 +98,9 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
     use_cache: bool
         Indicates whether the cache should be used or not.
 
+    artifact_name: str
+        Name of the artifact, for the artifact hash log.
+
     Returns
     -------
     temp_dir: str
@@ -134,7 +137,7 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
             now = datetime.datetime.now()
             timestamp = str(datetime.datetime.timestamp(now))
             # Artifact hash will be an empty string if download failed:
-            arthashlog_file.write(f"{timestamp},{artifact_hash}\n")
+            arthashlog_file.write(f"{timestamp},{artifact_hash},{artifact_name}\n")
             arthashlog_file.close()
         else:
             logging.info(f"Cache found for {url}, skipping download")
@@ -462,10 +465,10 @@ def main():
     else:
         use_cache = True
         dl_dir = cache_dir
-    artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache)
+    artifact_name = os.path.splitext(os.path.basename(config_path))[0]
+    artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name)
 
     # If download was successful:
     if artifact_dir != "":
-        artifact_name = os.path.splitext(os.path.basename(config_path))[0]
         return_code, build_output = build_image(config, artifact_dir, artifact_name, args.docker_cache)
         if return_code == 0:
             status = "success"
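The artifact name is now derived from the config file name before the download, instead of after it, so it can be handed to `download_sources` for the hash log. A one-liner sketch of that derivation, with a made-up config path and extension:

    import os

    config_path = "artifacts/artifact_xyz.ncl"  # hypothetical path
    artifact_name = os.path.splitext(os.path.basename(config_path))[0]
    print(artifact_name)  # artifact_xyz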

View File

@@ -20,6 +20,7 @@
       packages = with pkgs; [
         snakemake
         gawk
+        gnused
         nickel
         graphviz
         # TODO separate into several shells