Modified the analysis to account for the fact that logs from multiple executions of ECG are no longer appended to the same file, but written to a new file every time. For that reason, added a column for the artifact name in the artifact hash log. Updated the README. Added sed as a dependency for Nix. Started writing the package changes analysis.
commit 4b91a6cb5d (parent: eae7c40d59)
README.md (22 lines changed)
@@ -1,6 +1,6 @@
 # Study of the Reproducibility and Longevity of Dockerfiles

-ECG is a program that automates software environment checking for scientific artifacts.
+ECG is a program that automates software environment checking for scientific artifacts that use Docker.

 It is meant to be executed periodically to analyze variations in the software environment of the artifact through time.
@@ -41,16 +41,18 @@ Where:
 - `<artifact_hash_log>` is the path to the file where to log the hash of the downloaded artifact.
 - `<cache_directory>` is the path to the cache directory, where downloaded artifacts will be stored for future usage. If not specified, cache is disabled.

+You can also use `--docker-cache` to enable the cache of the Docker layers, and `-v` to show the full output of the script in your terminal (by default, it is only written to the specified `log_file`).
+
 ## Output

 ### Package list

-The list of packages installed in the container, depending on the package managers, Git packages and other miscellaneous packages given in the config file, in the form of a CSV file, with the following columns in order:
+The list of packages installed in the container, depending on the sources (a package manager, `git` or `misc`) given in the config file, in the form of a CSV file, with the following columns in order:

-| Package name | Version | Package manager |
-|--------------|---------|-----------------|
+| Package name | Version | Source | Config name | Timestamp |
+|--------------|---------|--------|-------------|-----------|

-For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number.
+For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number. The timestamp corresponds to the time when ECG started building the package list, so it will be the same for each package that has been logged during the same execution of ECG.

 ### Output log
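For reference, a minimal sketch (not part of the commit) of producing one row matching the new five-column package list; the package, source, and config name values here are hypothetical:

```python
# Sketch only: one row of the new package list format (hypothetical values).
import csv
import datetime
import sys

# ECG logs one timestamp per execution, taken when it starts building the list:
timestamp = str(datetime.datetime.timestamp(datetime.datetime.now()))
writer = csv.writer(sys.stdout)
# Columns: Package name, Version, Source, Config name, Timestamp
writer.writerow(["openmpi", "4.1.2", "dpkg", "my_artifact", timestamp])
```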
@@ -60,8 +62,8 @@ Just a plain text file containing the output of the script.

 The log of the attempts to build the Docker image, in the form of a CSV file, with the following columns in order:

-| Config file path | Timestamp | Result |
-|------------------|-----------|-----------------|
+| Config name | Timestamp | Result |
+|-------------|-----------|-----------------|

 The timestamp corresponds to when the result is being logged, not to when it happened.
@@ -79,10 +81,10 @@ The following are the possible results of the build:

 The log of the hash of the artifact archive file, in the form of a CSV file, with the following columns in order:

-| Timestamp | Hash |
-|-----------|------|
+| Timestamp | Hash | Config name |
+|-----------|------|-------------|

-The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded.
+The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded. If the artifact couldn't be downloaded, the hash is equal to `-1`.

 ## License
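Since each execution now writes its own hash log, per-run files have to be concatenated before analysis. A hedged sketch of reading several logs into one table and grouping hashes by the new `Config name` column; the file names are hypothetical:

```python
# Sketch: merge per-execution hash logs and group hashes per artifact name.
import csv

table = []
for path in ["hashes_run1.csv", "hashes_run2.csv"]:  # hypothetical file names
    with open(path) as input_file:
        table += list(csv.reader(input_file))

# Columns are: Timestamp, Hash, Config name.
hashes_by_artifact = {}
for timestamp, artifact_hash, name in table:
    hashes_by_artifact.setdefault(name, []).append(artifact_hash)
```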
@@ -13,60 +13,76 @@ import argparse
 import csv
 import os

-def artifact_changed(table):
+def artifact_changed(table, name):
     """
-    Indicates whether the artifact involved in the given hash log table
-    has changed over time.
+    Indicates whether the artifact of the given name has changed over time.
+    An artifact becoming unavailable is considered as modified.

     Parameters
     ----------
     table: list
         Artifact hash log table.

+    name: str
+        Name of the artifact to check.
+
     Returns
     -------
     bool
         True if artifact changed, False otherwise.
     """
     changed = False
-    # Hash is in the 2nd column:
-    artifact_hash = table[0][1]
     i = 0
+    artifact_hash = ""
     while i < len(table) and not changed:
-        if table[i][1] != artifact_hash:
-            changed = True
+        row = table[i]
+        if row[2] == name:
+            # If the first hash has not been saved yet:
+            if artifact_hash == "":
+                artifact_hash = row[1] # Hash is in the 2nd column
+            elif row[1] != artifact_hash:
+                changed = True
         i += 1
     return changed

-def artifact_available(table):
+def artifact_available(table, name):
     """
-    Indicates whether the artifact involved in the given hash log table
-    is still available.
+    Indicates whether the artifact of the given name is still available.

     Parameters
     ----------
     table: list
         Artifact hash log table.

+    name: str
+        Name of the artifact to check.
+
     Returns
     -------
     bool
         True if artifact is still available, False otherwise.
     """
     available = True
-    # We check the last line to check current availability:
-    if table[-1][1] == "":
-        available = False
+    for row in table:
+        if row[2] == name:
+            if row[1] == "-1":
+                # -1 means the artifact could not be downloaded. Otherwise,
+                # this column would contain the hash of the artifact.
+                available = False
+            else:
+                available = True
+    # The last log of the artifact hash will determine if the artifact is
+    # currently available or not.
     return available

-def analysis(input_tables):
+def analysis(input_table):
     """
-    Analyzes the given artifact hash tables to determine if the artifacts are
+    Analyzes the given artifact hash table to determine if the artifacts are
     still available and didn't change, changed, or aren't available anymore.

     Parameters
     ----------
-    input_tables: str
+    input_table: str
         Table to analyse.

     Returns
@@ -75,13 +91,17 @@ def analysis(input_tables):
         Output table of the analysis in the form of a dict with headers as keys.
     """
     artifacts = {"available":0, "unavailable":0, "changed":0}
-    for table in input_tables:
-        if artifact_available(table):
-            artifacts["available"] += 1
-        else:
-            artifacts["unavailable"] += 1
-        if artifact_changed(table):
-            artifacts["changed"] += 1
+    checked = [] # Artifacts that have been checked already
+    for row in input_table:
+        artifact_name = row[2] # Name of the artifact in the 3rd column
+        if artifact_name not in checked:
+            if artifact_available(input_table, artifact_name):
+                artifacts["available"] += 1
+            else:
+                artifacts["unavailable"] += 1
+            if artifact_changed(input_table, artifact_name):
+                artifacts["changed"] += 1
+            checked.append(artifact_name)
     return artifacts

 def main():
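To make the new per-name logic concrete, a small illustrative call (values made up, and it assumes the functions above are in scope); note that an artifact whose hash becomes `-1` counts as both unavailable and changed:

```python
# Illustration only: "foo" changed hash between runs, "bar" became
# unavailable (hash -1), which also counts as a change.
table = [
    ["1718000000.0", "abc123", "foo"],
    ["1718100000.0", "def456", "foo"],
    ["1718000000.0", "aaa111", "bar"],
    ["1718100000.0", "-1", "bar"],
]
print(analysis(table))  # {'available': 1, 'unavailable': 1, 'changed': 2}
```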
@@ -126,18 +146,17 @@ def main():
     output_path = args.output

     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()

     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
-    output_dict = {}
-    output_dict = analysis(input_tables)
+    output_dict = analysis(input_table)

     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)
@@ -13,15 +13,15 @@ import argparse
 import csv
 import os

-def analysis(input_tables):
+def analysis(input_table):
     """
-    Analyzes the given build status tables to count the results of the building
+    Analyzes the given build status table to count the results of the building
     of the Dockerfile for each category.

     Parameters
     ----------
-    input_tables: str
-        Tables to analyse.
+    input_table: str
+        Table to analyse.

     Returns
     -------
@@ -29,21 +29,12 @@ def analysis(input_tables):
         Output table of the analysis in the form of a dict with headers as keys.
     """
     buildstatus = {}
-    for table in input_tables:
-        # # There has never been any error:
-        # if table == [[]]:
-        #     if "never_failed" not in buildstatus:
-        #         buildstatus["never_failed"] = 1
-        #     else:
-        #         buildstatus["never_failed"] += 1
-        # # There has been an error at least once:
-        # else:
-        for row in table:
-            # Third column is the result:
-            if row[2] not in buildstatus:
-                buildstatus[row[2]] = 1
-            else:
-                buildstatus[row[2]] += 1
+    for row in input_table:
+        # Third column is the result:
+        if row[2] not in buildstatus:
+            buildstatus[row[2]] = 1
+        else:
+            buildstatus[row[2]] += 1
     return buildstatus

 def main():
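The rewritten loop is a plain frequency count over the third CSV column; an equivalent standard-library formulation (a sketch, not part of the commit) would be:

```python
from collections import Counter

def analysis_counter(input_table):
    # Count build results (third column); keys are whatever result
    # strings appear in the log, e.g. {"success": 3, ...}.
    return dict(Counter(row[2] for row in input_table))
```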
@@ -88,18 +79,17 @@ def main():
     output_path = args.output

     # Parsing the input files:
-    input_tables = []
+    input_table = []
    for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()

     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
-    output_dict = {}
-    output_dict = analysis(input_tables)
+    output_dict = analysis(input_table)

     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)
@ -6,26 +6,26 @@
|
|||||||
program.
|
program.
|
||||||
|
|
||||||
Depending on the type of analysis, multiple tables can be generated:
|
Depending on the type of analysis, multiple tables can be generated:
|
||||||
- sources-stats: Number of packages per source (a package manager, git or
|
- `sources-stats`: Number of packages per source (a package manager, git or
|
||||||
misc)
|
misc)
|
||||||
- pkg-changes: Number of packages that changed over time (0 if only one file
|
- `pkg-changes`: Number of packages that changed over time (0 if only one file
|
||||||
is given, since it will only include the package list of a single execution)
|
is given, since it will only include the package list of a single execution)
|
||||||
- pkg-per-container: Number of packages per container
|
- `pkgs-per-container`: Number of packages per container
|
||||||
"""
|
"""
|
||||||
|
|
||||||
import argparse
|
import argparse
|
||||||
import csv
|
import csv
|
||||||
import os
|
import os
|
||||||
|
|
||||||
def sources_stats(input_tables):
|
def sources_stats(input_table):
|
||||||
"""
|
"""
|
||||||
Analyzes the given package lists tables to determine the number of artifacts
|
Analyzes the given package lists table to determine the number of artifacts
|
||||||
using a package manager, Git packages or misc packages.
|
using a package manager, Git packages or misc packages.
|
||||||
|
|
||||||
Parameters
|
Parameters
|
||||||
----------
|
----------
|
||||||
input_tables: str
|
input_table: str
|
||||||
Tables to analyse.
|
Table to analyse.
|
||||||
|
|
||||||
Returns
|
Returns
|
||||||
-------
|
-------
|
||||||
@@ -34,15 +34,46 @@ def sources_stats(input_tables):
     """
     pkgmgr = {}
     i = 0
-    for table in input_tables:
-        for row in table:
-            # Third column is the package source:
-            if row[2] not in pkgmgr:
-                pkgmgr[row[2]] = 1
-            else:
-                pkgmgr[row[2]] += 1
+    for row in input_table:
+        # Third column is the package source:
+        if row[2] not in pkgmgr:
+            pkgmgr[row[2]] = 1
+        else:
+            pkgmgr[row[2]] += 1
     return pkgmgr

+# def pkg_changed(pkgname, )
+
+def pkgs_changes(input_table):
+    """
+    Analyzes the given package lists table to determine the number of packages
+    that changed for every package source.
+
+    Parameters
+    ----------
+    input_table: str
+        Table to analyse.
+
+    Returns
+    -------
+    dict
+        Output table of the analysis in the form of a dict with headers as keys.
+    """
+    pkgmgr = {}
+    i = 0
+    for row in input_table:
+        # Third column is the package source:
+        if row[2] not in pkgmgr:
+            pkgmgr[row[2]] = 1
+        else:
+            pkgmgr[row[2]] += 1
+    return pkgmgr
+
+def pkgs_per_container(input_table):
+    """
+    """
+    pass
+
 def main():
     # Command line arguments parsing:
     parser = argparse.ArgumentParser(
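As the commit message says, the package changes analysis is only started here: `pkgs_changes` currently repeats the per-source count from `sources_stats`, and `pkgs_per_container` is a stub. One hedged sketch of what counting changed packages per source might eventually look like, assuming the five-column package list documented in the README (name, version, source, config name, timestamp); `pkgs_changes_sketch` is a hypothetical helper, not code from the commit:

```python
def pkgs_changes_sketch(input_table):
    """Count, per source, the packages logged with more than one version."""
    versions = {}
    for name, version, source, config, timestamp in input_table:
        # The same package name can appear in several artifacts, so the key
        # includes the config name as well as the source:
        versions.setdefault((config, name, source), set()).add(version)
    changes = {}
    for (config, name, source), seen in versions.items():
        if len(seen) > 1:
            changes[source] = changes.get(source, 0) + 1
    return changes
```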
@@ -72,7 +103,7 @@ def main():
         of a single execution) by using `pkg-changes`,
         the number of packages per container by specifying `pkgs-per-container`.
         """,
-        choices = ["sources-stats", "pkg-changes", "pkgs-per-container"],
+        choices = ["sources-stats", "pkgs-changes", "pkgs-per-container"],
         required = True
     )
     parser.add_argument(
@@ -100,18 +131,22 @@ def main():
     analysis_type = args.analysis_type

     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()

     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
     if analysis_type == "sources-stats":
-        output_dict = sources_stats(input_tables)
+        output_dict = sources_stats(input_table)
+    elif analysis_type == "pkgs-changes":
+        output_dict = pkgs_changes(input_table)
+    elif analysis_type == "pkgs-per-container":
+        output_dict = pkgs_per_container(input_table)

     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)
ecg.py (11 lines changed)
@@ -80,7 +80,7 @@ def download_file(url, dest):
         pass
     return file_hash

-def download_sources(config, arthashlog_path, dl_dir, use_cache):
+def download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name):
     """
     Downloads the source of the artifact in 'config'.

@@ -98,6 +98,9 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
     use_cache: bool
         Indicates whether the cache should be used or not.

+    artifact_name: str
+        Name of the artifact, for the artifact hash log.
+
     Returns
     -------
     temp_dir: str
@@ -134,7 +137,7 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
             now = datetime.datetime.now()
             timestamp = str(datetime.datetime.timestamp(now))
             # Artifact hash will be an empty string if download failed:
-            arthashlog_file.write(f"{timestamp},{artifact_hash}\n")
+            arthashlog_file.write(f"{timestamp},{artifact_hash},{artifact_name}\n")
             arthashlog_file.close()
         else:
             logging.info(f"Cache found for {url}, skipping download")
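One thing to note about the new log line: the fields are joined by hand, so an artifact name containing a comma would corrupt the CSV. A hedged alternative using `csv.writer` (a sketch, not what the commit does, and assuming the log is opened in append mode) would quote such values:

```python
# Sketch: write the same three columns through the csv module, which quotes
# values containing commas; the variable names mirror download_sources.
import csv

with open(arthashlog_path, "a") as arthashlog_file:
    csv.writer(arthashlog_file).writerow([timestamp, artifact_hash, artifact_name])
```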
@@ -462,10 +465,10 @@ def main():
     else:
         use_cache = True
         dl_dir = cache_dir
-    artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache)
+    artifact_name = os.path.splitext(os.path.basename(config_path))[0]
+    artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name)
     # If download was successful:
     if artifact_dir != "":
-        artifact_name = os.path.splitext(os.path.basename(config_path))[0]
         return_code, build_output = build_image(config, artifact_dir, artifact_name, args.docker_cache)
         if return_code == 0:
             status = "success"
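The moved `artifact_name` line strips the directory and extension from the config path, so the name is now available for the hash log even when the download fails. A quick illustration with a hypothetical path:

```python
import os

config_path = "artifacts/nixos.yaml"  # hypothetical config path
artifact_name = os.path.splitext(os.path.basename(config_path))[0]
print(artifact_name)  # nixos
```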