study-docker-repro-longevity/analysis/artifact_analysis.py

#!/usr/bin/env python3

"""
    This script performs an artifact analysis on the outputs of the workflow
    to generate tables that can then be plotted by another program.
"""

import argparse
import csv
import datetime

def artifact_changed(table, name):
    """
    Indicates whether the artifact of the given name has changed over time.
    An artifact becoming unavailable is considered as modified.

    Parameters
    ----------
    table: list
        Artifact hash log table.

    name: str
        Name of the artifact to check.

    Returns
    -------
    bool
        True if artifact changed, False otherwise.
    """
    artifact_hash = None
    for row in table:
        if row[2] == name:
            if artifact_hash is None:
                # Remember the first logged hash (2nd column) as the reference.
                artifact_hash = row[1]
            elif row[1] != artifact_hash:
                # Any later hash that differs means the artifact changed.
                return True
    return False

def artifact_available(table, name):
    """
    Indicates whether the artifact of the given name is still available.

    Parameters
    ----------
    table: list
        Artifact hash log table.

    name: str
        Name of the artifact to check.

    Returns
    -------
    bool
        True if artifact is still available, False otherwise.
    """
    available = True
    for row in table:
        if row[2] == name:
            # "-1" means the artifact could not be downloaded. Otherwise,
            # this column contains the hash of the artifact.
            available = row[1] != "-1"
    # The last logged entry for the artifact determines whether it is
    # currently available.
    return available

def analysis(input_table):
    """
    Analyzes the given artifact hash table to count the artifacts that are
    still available, those that are no longer available, and those that
    changed over time.

    Parameters
    ----------
    input_table: list
        Artifact hash log table to analyze.

    Returns
    -------
    dict
        Output table of the analysis in the form of a dict with headers as keys.
    """
    artifacts = {"available":0, "unavailable":0, "changed":0}
    checked = [] # Artifacts that have been checked already
    for row in input_table:
        artifact_name = row[2] # Name of the artifact in the 3rd column
        if artifact_name not in checked:
            if artifact_available(input_table, artifact_name):
                artifacts["available"] += 1
            else:
                artifacts["unavailable"] += 1
            if artifact_changed(input_table, artifact_name):
                artifacts["changed"] += 1
            checked.append(artifact_name)
    return artifacts
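
# A minimal single-pass sketch of the same counting, offered as an illustration
# rather than a replacement for analysis(). The name summarize_artifacts is
# hypothetical (not part of the original script). It assumes the same row
# layout as above: hash (or "-1") in the 2nd column, artifact name in the 3rd.
# The 1st column is not used here.
def summarize_artifacts(rows):
    first_hash = {}  # artifact name -> first hash logged
    last_hash = {}   # artifact name -> most recent hash logged
    changed = set()  # names whose hash differed from the first one at least once
    for row in rows:
        artifact_hash, name = row[1], row[2]
        first_hash.setdefault(name, artifact_hash)
        if artifact_hash != first_hash[name]:
            changed.add(name)
        last_hash[name] = artifact_hash
    # An artifact is unavailable when its last logged hash is "-1".
    unavailable = sum(1 for h in last_hash.values() if h == "-1")
    return {
        "available": len(last_hash) - unavailable,
        "unavailable": unavailable,
        "changed": len(changed),
    }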

def main():
    # Command line arguments parsing:
    parser = argparse.ArgumentParser(
        prog = "artifact_analysis",
        description =
        """
        This script performs an artifact analysis on the outputs of the workflow
        to generate tables that can then be plotted by another program.
        The generated table gives the number of artifacts that are available
        or not available, and the number of artifacts that have been modified
        over time.
        """
    )
    parser.add_argument(
        "-v", "--verbose",
        action = "store_true",
        help = "Shows more details on what is being done."
    )
    parser.add_argument(
        "-i", "--input",
        action = "append",
        nargs = "+",
        help =
        """
        The CSV file used as input for the analysis function. Multiple files
        can be specified at once by separating them with a space.
        All the input files must be artifact hash logs generated by ECG.
        """,
        required = True
    )
    parser.add_argument(
        "-o", "--output",
        help =
        """
        Path to the output CSV file that will be created by the analysis function.
        """,
        required = True
    )
    args = parser.parse_args()
    inputs = args.input
    output_path = args.output

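    # Parsing the input files. With action="append" and nargs="+", args.input
    # is a list of lists: one inner list of paths per "-i" occurrence.
    input_table = []
    for paths in inputs:
        for path in paths:
            with open(path, newline="") as input_file:
                input_table += list(csv.reader(input_file))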

    # Analyzing the inputs:
    output_dict = analysis(input_table)
    # Adding the current time (Unix timestamp) to the output row:
    timestamp = str(datetime.datetime.now().timestamp())
    output_dict["timestamp"] = timestamp

    # Writing analysis to output file:
    with open(output_path, "w", newline="") as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
        # dict_writer.writeheader()
        dict_writer.writerow(output_dict)

if __name__ == "__main__":
    main()