Modified the analysis scripts to account for the fact that logs from multiple executions of ECG are no longer appended to the same file, but written to a new file every time. Added a column for the artifact name to the artifact hash log for that reason. Updated the README. Added sed as a dependency for Nix. Started writing the package changes analysis.

antux18 2024-08-07 11:22:54 +02:00
parent eae7c40d59
commit 4b91a6cb5d
6 changed files with 135 additions and 85 deletions

View File

@@ -1,6 +1,6 @@
 # Study of the Reproducibility and Longevity of Dockerfiles
 
-ECG is a program that automates software environment checking for scientific artifacts.
+ECG is a program that automates software environment checking for scientific artifacts that use Docker.
 
 It is meant to be executed periodically to analyze variations in the software environment of the artifact through time.
@@ -41,16 +41,18 @@ Where:
 - `<artifact_hash_log>` is the path to the file where to log the hash of the downloaded artifact.
 - `<cache_directory>` is the path to the cache directory, where downloaded artifacts will be stored for future usage. If not specified, cache is disabled.
 
+You can also use `--docker-cache` to enable the cache of the Docker layers, and `-v` to show the full output of the script in your terminal (by default, it is only written to the specified `log_file`).
+
 ## Output
 
 ### Package list
 
-The list of packages installed in the container, depending on the package managers, Git packages and other miscellaneous packages given in the config file, in the form of a CSV file, with the following columns in order:
+The list of packages installed in the container, depending on the sources (a package manager, `git` or `misc`) given in the config file, in the form of a CSV file, with the following columns in order:
 
-| Package name | Version | Package manager |
-|--------------|---------|-----------------|
+| Package name | Version | Source | Config name | Timestamp |
+|--------------|---------|--------|-------------|-----------|
 
-For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number.
+For Git packages, the hash of the last commit is used as version number. For miscellaneous packages, the hash of the file that has been used to install the package is used as version number. The timestamp corresponds to the time when ECG started building the package list, so it will be the same for each package that has been logged during the same execution of ECG.
 
 ### Output log
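For illustration, rows of the new package list format could look like the following (package names, sources, config name and timestamp value are all made up; ECG writes one such CSV file per execution):

    make,4.3,dpkg,artifact_xyz,1722935147.0
    mypkg,2a8b1c3d,git,artifact_xyz,1722935147.0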
@@ -60,8 +62,8 @@ Just a plain text file containing the output of the script.
 The log of the attempts to build the Docker image, in the form of a CSV file, with the following columns in order:
 
-| Config file path | Timestamp | Result |
-|------------------|-----------|--------|
+| Config name | Timestamp | Result |
+|-------------|-----------|--------|
 
 The timestamp corresponds to when the result is being logged, not to when it happened.
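A made-up example row under the renamed first column (`success` is the status logged on a successful build, as seen in `ecg.py` below; name and timestamp are hypothetical):

    artifact_xyz,1722935147.0,success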
@@ -79,10 +81,10 @@ The following are the possible results of the build:
 The log of the hash of the artifact archive file, in the form of a CSV file, with the following columns in order:
 
-| Timestamp | Hash |
-|-----------|------|
+| Timestamp | Hash | Config name |
+|-----------|------|-------------|
 
-The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded.
+The timestamp corresponds to when the hash has been logged, not to when the artifact has been downloaded. If the artifact couldn't be downloaded, the hash is equal to `-1`.
 
 ## License
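Two made-up example rows of the artifact hash log with the new `Config name` column, the second one recording a failed download with the `-1` hash:

    1722935147.0,e3b0c44298fc1c149afbf4c8996fb924,artifact_xyz
    1723021547.0,-1,artifact_xyz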

View File

@@ -13,60 +13,76 @@ import argparse
 import csv
 import os
 
-def artifact_changed(table):
+def artifact_changed(table, name):
     """
-    Indicates whether the artifact involved in the given hash log table
-    has changed over time.
+    Indicates whether the artifact of the given name has changed over time.
+    An artifact becoming unavailable is considered as modified.
 
     Parameters
     ----------
     table: list
         Artifact hash log table.
+    name: str
+        Name of the artifact to check.
 
     Returns
     -------
     bool
         True if artifact changed, False otherwise.
     """
     changed = False
-    # Hash is in the 2nd column:
-    artifact_hash = table[0][1]
     i = 0
+    artifact_hash = ""
     while i < len(table) and not changed:
-        if table[i][1] != artifact_hash:
-            changed = True
+        row = table[i]
+        if row[2] == name:
+            # If the first hash has not been saved yet:
+            if artifact_hash == "":
+                artifact_hash = row[1] # Hash is in the 2nd column
+            elif row[1] != artifact_hash:
+                changed = True
         i += 1
     return changed
 
-def artifact_available(table):
+def artifact_available(table, name):
     """
-    Indicates whether the artifact involved in the given hash log table
-    is still available.
+    Indicates whether the artifact of the given name is still available.
 
     Parameters
     ----------
     table: list
         Artifact hash log table.
+    name: str
+        Name of the artifact to check.
 
     Returns
     -------
     bool
         True if artifact is still available, False otherwise.
     """
     available = True
-    # We check the last line to check current availability:
-    if table[-1][1] == "":
-        available = False
+    for row in table:
+        if row[2] == name:
+            if row[1] == "-1":
+                # -1 means the artifact could not be downloaded. Otherwise,
+                # this column would contain the hash of the artifact.
+                available = False
+            else:
+                available = True
+    # The last log of the artifact hash will determine if the artifact is
+    # currently available or not.
     return available
 
-def analysis(input_tables):
+def analysis(input_table):
     """
-    Analyzes the given artifact hash tables to determine if the artifacts are
+    Analyzes the given artifact hash table to determine if the artifacts are
     still available and didn't change, changed, or aren't available anymore.
 
     Parameters
     ----------
-    input_tables: str
+    input_table: str
         Table to analyse.
 
     Returns
@@ -75,13 +91,17 @@ def analysis(input_tables):
         Output table of the analysis in the form of a dict with headers as keys.
     """
     artifacts = {"available":0, "unavailable":0, "changed":0}
-    for table in input_tables:
-        if artifact_available(table):
-            artifacts["available"] += 1
-        else:
-            artifacts["unavailable"] += 1
-        if artifact_changed(table):
-            artifacts["changed"] += 1
+    checked = [] # Artifacts that have been checked already
+    for row in input_table:
+        artifact_name = row[2] # Name of the artifact in the 3rd column
+        if artifact_name not in checked:
+            if artifact_available(input_table, artifact_name):
+                artifacts["available"] += 1
+            else:
+                artifacts["unavailable"] += 1
+            if artifact_changed(input_table, artifact_name):
+                artifacts["changed"] += 1
+            checked.append(artifact_name)
     return artifacts
 
 def main():
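Since the hash logs of all executions are now merged into a single table, the per-artifact filtering above is what keeps artifacts with several log rows from being counted more than once. A minimal sketch of the expected behaviour, assuming `analysis` and the two helpers it calls are in scope, with made-up hashes:

    import csv
    import io

    # Two artifacts in the new "timestamp,hash,name" row format: "bar" kept
    # the same hash across two executions, "baz" later became unavailable (-1):
    log = "1000.0,aaa,bar\n2000.0,aaa,bar\n1000.0,bbb,baz\n2000.0,-1,baz\n"
    table = list(csv.reader(io.StringIO(log)))

    print(analysis(table))
    # Expected: {'available': 1, 'unavailable': 1, 'changed': 1}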
@@ -126,18 +146,17 @@ def main():
     output_path = args.output
 
     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()
 
     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
-    output_dict = {}
-    output_dict = analysis(input_tables)
+    output_dict = analysis(input_table)
 
     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)

View File

@@ -13,15 +13,15 @@ import argparse
 import csv
 import os
 
-def analysis(input_tables):
+def analysis(input_table):
     """
-    Analyzes the given build status tables to count the results of the building
+    Analyzes the given build status table to count the results of the building
     of the Dockerfile for each category.
 
     Parameters
     ----------
-    input_tables: str
-        Tables to analyse.
+    input_table: str
+        Table to analyse.
 
     Returns
     -------
@@ -29,21 +29,12 @@ def analysis(input_tables):
         Output table of the analysis in the form of a dict with headers as keys.
     """
     buildstatus = {}
-    for table in input_tables:
-        # # There has never been any error:
-        # if table == [[]]:
-        #     if "never_failed" not in buildstatus:
-        #         buildstatus["never_failed"] = 1
-        #     else:
-        #         buildstatus["never_failed"] += 1
-        # # There has been an error at least once:
-        # else:
-        for row in table:
-            # Third column is the result:
-            if row[2] not in buildstatus:
-                buildstatus[row[2]] = 1
-            else:
-                buildstatus[row[2]] += 1
+    for row in input_table:
+        # Third column is the result:
+        if row[2] not in buildstatus:
+            buildstatus[row[2]] = 1
+        else:
+            buildstatus[row[2]] += 1
     return buildstatus
 
 def main():
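The simplification above works because each row of the merged table already carries its result in the third column, so a flat tally is enough. A quick sketch of that counting on made-up rows (`build_fail` is a hypothetical result name; `dict.get` is an equivalent shorthand for the if/else in the script):

    input_table = [
        ["artifact_xyz", "1000.0", "success"],
        ["artifact_xyz", "2000.0", "build_fail"],
        ["other_conf", "1500.0", "success"],
    ]

    buildstatus = {}
    for row in input_table:
        # Third column is the result:
        buildstatus[row[2]] = buildstatus.get(row[2], 0) + 1

    print(buildstatus)  # {'success': 2, 'build_fail': 1}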
@@ -88,18 +79,17 @@ def main():
     output_path = args.output
 
     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()
 
     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
-    output_dict = {}
-    output_dict = analysis(input_tables)
+    output_dict = analysis(input_table)
 
     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)

View File

@@ -6,26 +6,26 @@
 program.
 
 Depending on the type of analysis, multiple tables can be generated:
-- sources-stats: Number of packages per source (a package manager, git or
+- `sources-stats`: Number of packages per source (a package manager, git or
   misc)
-- pkg-changes: Number of packages that changed over time (0 if only one file
+- `pkg-changes`: Number of packages that changed over time (0 if only one file
   is given, since it will only include the package list of a single execution)
-- pkg-per-container: Number of packages per container
+- `pkgs-per-container`: Number of packages per container
 """
 
 import argparse
 import csv
 import os
 
-def sources_stats(input_tables):
+def sources_stats(input_table):
     """
-    Analyzes the given package lists tables to determine the number of artifacts
+    Analyzes the given package lists table to determine the number of artifacts
     using a package manager, Git packages or misc packages.
 
     Parameters
     ----------
-    input_tables: str
-        Tables to analyse.
+    input_table: str
+        Table to analyse.
 
     Returns
     -------
@@ -34,15 +34,46 @@ def sources_stats(input_tables):
     """
     pkgmgr = {}
     i = 0
-    for table in input_tables:
-        for row in table:
-            # Third column is the package source:
-            if row[2] not in pkgmgr:
-                pkgmgr[row[2]] = 1
-            else:
-                pkgmgr[row[2]] += 1
+    for row in input_table:
+        # Third column is the package source:
+        if row[2] not in pkgmgr:
+            pkgmgr[row[2]] = 1
+        else:
+            pkgmgr[row[2]] += 1
     return pkgmgr
 
+# def pkg_changed(pkgname, )
+
+def pkgs_changes(input_table):
+    """
+    Analyzes the given package lists table to determine the number of packages
+    that changed for every package source.
+
+    Parameters
+    ----------
+    input_table: str
+        Table to analyse.
+
+    Returns
+    -------
+    dict
+        Output table of the analysis in the form of a dict with headers as keys.
+    """
+    pkgmgr = {}
+    i = 0
+    for row in input_table:
+        # Third column is the package source:
+        if row[2] not in pkgmgr:
+            pkgmgr[row[2]] = 1
+        else:
+            pkgmgr[row[2]] += 1
+    return pkgmgr
+
+def pkgs_per_container(input_table):
+    """
+    """
+    pass
+
 def main():
     # Command line arguments parsing:
     parser = argparse.ArgumentParser(
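For illustration, `sources_stats` tallies the third column of the merged package list. On made-up rows in the new five-column format (package manager names are hypothetical), assuming the function is in scope:

    input_table = [
        ["make", "4.3", "dpkg", "artifact_xyz", "1000.0"],
        ["numpy", "1.24.2", "pip", "artifact_xyz", "1000.0"],
        ["mypkg", "2a8b1c3d", "git", "artifact_xyz", "1000.0"],
        ["grep", "3.8", "dpkg", "other_conf", "1000.0"],
    ]

    print(sources_stats(input_table))
    # Expected: {'dpkg': 2, 'pip': 1, 'git': 1}

Note that `pkgs_changes` currently duplicates this counting; per the commit message, the package changes analysis is only started here.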
@@ -72,7 +103,7 @@ def main():
         of a single execution) by using `pkg-changes`,
         the number of packages per container by specifying `pkgs-per-container`.
         """,
-        choices = ["sources-stats", "pkg-changes", "pkgs-per-container"],
+        choices = ["sources-stats", "pkgs-changes", "pkgs-per-container"],
         required = True
     )
     parser.add_argument(
@@ -100,18 +131,22 @@ def main():
     analysis_type = args.analysis_type
 
     # Parsing the input files:
-    input_tables = []
+    input_table = []
     for path in input_paths:
         input_file = open(path)
-        input_tables.append(list(csv.reader(input_file)))
+        input_table += list(csv.reader(input_file))
         input_file.close()
 
     # Analyzing the inputs:
-    output_file = open(output_path, "w+")
     if analysis_type == "sources-stats":
-        output_dict = sources_stats(input_tables)
+        output_dict = sources_stats(input_table)
+    elif analysis_type == "pkgs-changes":
+        output_dict = pkgs_changes(input_table)
+    elif analysis_type == "pkgs-per-container":
+        output_dict = pkgs_per_container(input_table)
 
     # Writing analysis to output file:
+    output_file = open(output_path, "w+")
     dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
     dict_writer.writeheader()
     dict_writer.writerow(output_dict)

ecg.py
View File

@@ -80,7 +80,7 @@ def download_file(url, dest):
         pass
     return file_hash
 
-def download_sources(config, arthashlog_path, dl_dir, use_cache):
+def download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name):
     """
     Downloads the source of the artifact in 'config'.
 
@@ -98,6 +98,9 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
     use_cache: bool
         Indicates whether the cache should be used or not.
 
+    artifact_name: str
+        Name of the artifact, for the artifact hash log.
+
     Returns
     -------
     temp_dir: str
@@ -134,7 +137,7 @@ def download_sources(config, arthashlog_path, dl_dir, use_cache):
             now = datetime.datetime.now()
             timestamp = str(datetime.datetime.timestamp(now))
             # Artifact hash will be an empty string if download failed:
-            arthashlog_file.write(f"{timestamp},{artifact_hash}\n")
+            arthashlog_file.write(f"{timestamp},{artifact_hash},{artifact_name}\n")
             arthashlog_file.close()
         else:
             logging.info(f"Cache found for {url}, skipping download")
@@ -462,10 +465,10 @@ def main():
     else:
         use_cache = True
         dl_dir = cache_dir
-    artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache)
+    artifact_name = os.path.splitext(os.path.basename(config_path))[0]
+    artifact_dir = download_sources(config, arthashlog_path, dl_dir, use_cache, artifact_name)
 
     # If download was successful:
     if artifact_dir != "":
-        artifact_name = os.path.splitext(os.path.basename(config_path))[0]
         return_code, build_output = build_image(config, artifact_dir, artifact_name, args.docker_cache)
         if return_code == 0:
             status = "success"
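The artifact name is now derived from the config file name before the download, instead of after it, so it can be handed to `download_sources` for the hash log. A one-liner sketch of that derivation, with a made-up config path and extension:

    import os

    config_path = "artifacts/artifact_xyz.ncl"  # hypothetical path
    artifact_name = os.path.splitext(os.path.basename(config_path))[0]
    print(artifact_name)  # artifact_xyz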

View File

@@ -20,6 +20,7 @@
       packages = with pkgs; [
         snakemake
         gawk
+        gnused
         nickel
         graphviz
         # TODO separate into several shells