# study-docker-repro-longevity/analysis/buildstatus_analysis.py
#!/bin/python3
"""
This script performs a build status analysis on the outputs of the workflow
to generate tables that can then be plotted by another program.
"""
import argparse
import csv
import os
import datetime
def analysis(input_table):
    """
    Analyzes the given build status table to count the results of the building
    of the Dockerfile for each category.
    Parameters
    ----------
    input_table: list
        Table to analyse, as a list of CSV rows (lists of strings) whose third
        column holds the build status.
    Returns
    -------
    dict
        Output table of the analysis in the form of a dict with headers as keys.
    """
    # All build status, initialized to 0.
    # This is required to make the column of the result table deterministic,
    # so they can be determined without the header in the CSV file.
    buildstatus = {"success":0, "package_install_failed":0, "baseimage_unavailable":0, "artifact_unavailable":0, "dockerfile_not_found":0, "script_crash":0, "job_time_exceeded":0, "unknown_error":0}
    for row in input_table:
        # Skip blank or malformed rows (e.g. a trailing empty line in a CSV
        # file yields an empty row) that would otherwise raise an IndexError.
        if len(row) < 3:
            continue
        # Third column is the result. A status outside the known set is
        # counted as "unknown_error" rather than crashing with a KeyError.
        status = row[2]
        if status not in buildstatus:
            status = "unknown_error"
        buildstatus[status] += 1
    return buildstatus
def main():
    """
    Entry point: parses the command line arguments, reads the input build
    status CSV files, runs the analysis and writes the resulting single-row
    table (plus a timestamp column) to the output CSV file.
    """
    # Command line arguments parsing:
    parser = argparse.ArgumentParser(
        prog = "buildstatus_analysis",
        description =
        """
        This script performs a build status analysis on the outputs of the
        workflow to generate tables that can then be plotted by another program.
        The generated table gives the amount of images that have been
        built successfully, and the amount of images that failed to build,
        for each category of error.
        """
    )
    parser.add_argument(
        "-v", "--verbose",
        action = "store_true",
        help = "Shows more details on what is being done."
    )
    parser.add_argument(
        "-i", "--input",
        action = "append",
        nargs = "+",
        help =
        """
        The CSV file used as input for the analysis function. Multiple files
        can be specified at once by separating them with a space.
        All the input files must be build status logs generated by ECG.
        """,
        required = True
    )
    parser.add_argument(
        "-o", "--output",
        help =
        """
        Path to the output CSV file that will be created by the analysis function.
        """,
        required = True
    )
    args = parser.parse_args()
    inputs = args.input
    output_path = args.output
    # Parsing the input files:
    # `args.input` is a list of lists because of action="append" + nargs="+",
    # hence the nested loop.
    input_table = []
    for input_group in inputs:
        for path in input_group:
            if args.verbose:
                print(f"Reading input file: {path}")
            # `newline=""` is the csv module's documented way to open files;
            # the `with` block guarantees the file is closed even on error.
            with open(path, newline="") as input_file:
                input_table += list(csv.reader(input_file))
    # Analyzing the inputs:
    output_dict = analysis(input_table)
    # Adding the current time to every row:
    output_dict["timestamp"] = str(datetime.datetime.now().timestamp())
    # Writing analysis to output file:
    if args.verbose:
        print(f"Writing analysis results to: {output_path}")
    with open(output_path, "w", newline="") as output_file:
        dict_writer = csv.DictWriter(output_file, fieldnames=output_dict.keys())
        # The header is intentionally not written: downstream consumers rely
        # on the deterministic column order instead (see `analysis`).
        dict_writer.writerow(output_dict)
if __name__ == "__main__":
    main()