Analyzing the output of ECG #26

Closed
opened 2024-07-25 17:31:40 +02:00 by antux18 · 11 comments
Collaborator

In order to study the reproducibility of the Dockerfiles, we need a script that can analyze the outputs of ECG to extract interesting data, and another script that plots the extracted data.

Some ideas of what we could plot:

  • Software environment:
    • Number of installed packages.
    • Proportion/number of packages that change version.
    • Package sources from where packages are changing the most.
    • Proportion/number of Dockerfiles using package managers/Git/misc.
    • Proportion/number of Dockerfiles using dpkg/rpm/pip/conda/other known package managers.
  • Dockerfiles build status:
    • Proportion/number of Dockerfiles for which build is successful.
    • Proportion/number of Dockerfiles for which build is successful/baseimage_unavailable/other known error types.
  • Artifacts' source:
    • Proportion/number of artifacts that can be downloaded.
    • Proportion/number of artifacts that have changed.
    • Proportion/number of artifacts that are available/unavailable/modified.

All of these plots can be done either on the outputs of a single execution, or on outputs generated over time by multiple executions (except for plots that measure change).

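To give an idea of what the build status plots would need, here is a minimal Python sketch that counts build results per category and normalizes them into proportions. The CSV log, its column names (`artifact`, `result`), and the error categories are assumptions for illustration, not ECG's actual output format:

```python
import csv
import io
from collections import Counter

# Hypothetical build status log; column names and categories are assumptions.
sample_log = """artifact,result
artifact-a,success
artifact-b,baseimage_unavailable
artifact-c,success
artifact-d,unknown_error
"""

# Count one entry per Dockerfile build, keyed by its result category.
counts = Counter(row["result"] for row in csv.DictReader(io.StringIO(sample_log)))
total = sum(counts.values())
proportions = {category: n / total for category, n in counts.items()}

print(counts["success"])       # number of successful builds -> 2
print(proportions["success"])  # proportion of successful builds -> 0.5
```

The same counting pattern would apply to the package-source and artifact-status plots, only with a different column being counted.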
Author
Collaborator

Regarding the build status analysis: it was decided not to log the build status of Dockerfiles that build successfully, but this makes analysis harder: how can we determine the number of Dockerfiles that built successfully if their success hasn't been logged?

Shouldn't we log the status when the build is successful?
antux18 self-assigned this 2024-07-26 17:08:54 +02:00
Owner

i would say that the ones that built successfully are the ones not present in the `blacklist`
Author
Collaborator

My question is rather how to count them if they are not in the file?
Owner

number of artifacts in the `artifacts/` folder minus the ones in the blacklist?
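A rough sketch of this counting approach, assuming a hypothetical layout where each artifact is one file in `artifacts/` and the blacklist lists one artifact name per line:

```python
from pathlib import Path

def count_successful_builds(artifacts_dir: str, blacklist_path: str) -> int:
    """Count the builds assumed successful: every artifact in artifacts_dir
    whose name does not appear in the blacklist file. The file layout and
    naming are assumptions about the ECG workspace, not its actual format."""
    # One artifact definition per file; use the file stem as the artifact name.
    artifacts = {p.stem for p in Path(artifacts_dir).iterdir() if p.is_file()}
    # Blacklist: one failed artifact name per line.
    blacklisted = set(Path(blacklist_path).read_text().split())
    return len(artifacts - blacklisted)
```

As the next comments point out, this needs the analysis script to know the artifacts folder in addition to the blacklist.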
Author
Collaborator

Sure, but wouldn't it be simpler to just put them in the build status file? Your solution would require another argument for the analysis script (the artifacts folder), and then the script would have to count them and subtract the number of artifacts that failed... It looks a bit cumbersome for nothing to me...
Owner

So we can log the successful build status; we just need to generate the blacklist correctly afterward.
For now the blacklist is just a concatenation of all the build status files.
If you feel it helps the analysis to have the success status, go for it 🙂
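The filtering step described above could be sketched like this (the build status logs are given as CSV text, and the column names are assumptions about ECG's format):

```python
import csv
import io

def build_blacklist(status_logs: list[str]) -> list[str]:
    """Concatenate build status logs and keep only artifacts whose build
    failed. Column names ("artifact", "result") are assumptions, not
    ECG's actual schema."""
    failed = []
    for text in status_logs:
        for row in csv.DictReader(io.StringIO(text)):
            # Successful builds are logged but filtered out of the blacklist.
            if row["result"] != "success":
                failed.append(row["artifact"])
    return failed
```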
Author
Collaborator

Oh, I see... It would complicate the generation of the blacklist?
Owner

not much i would say.
we would just need to filter out the successful builds
Author
Collaborator

Okay, then I'm going to do that in the Snakefile; it looks like it would indeed be simpler for the analysis.
antux18 added the Priority/Critical and Kind/Thinking labels 2024-08-05 16:03:16 +02:00
Author
Collaborator

We can summarize the output analysis as follows:

  • Only 5 tables are needed to draw the plots:
    1. Number of packages per source (each of the supported package managers presented separately, or git, or misc).
    2. Number of build results per category (either "success" or the error category).
    3. Number of artifact statuses per category (either "available" or not, and whether it was modified over time).
    4. Number of packages per container.
    5. Number of packages that changed version, per source.
  • These 5 tables only need the package list, the artifact hash log, or the build status log to be generated. This means we need to add a column to the package list identifying the time when the list was created, so it can be used to spot changes in the version number of each package over time (see #38).
  • For each of these 5 tables, there will be one file per row, which is easier to generate with Snakemake than a single file to which a new row is appended on every analysis. It also avoids having to trust the state of the aggregating file before appending a new row.
  • We could count not only the number of packages that changed, but also the number of times each has changed (same for the artifact status). But this is not essential for the study, so it likely won't be implemented. See #39.
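The one-file-per-row aggregation could look like this minimal sketch (in practice Snakemake would drive it; the file layout, with each run writing one single-row CSV into a directory, is an assumption):

```python
import csv
from pathlib import Path

def aggregate_rows(rows_dir: str, out_csv: str) -> None:
    """Merge single-row CSV files (one per analysis run) into one table.
    Each input file is assumed to hold a header line plus one data row;
    the first file's header is reused for the aggregate. Since every run
    writes its own file, no pre-existing aggregate has to be trusted."""
    row_files = sorted(Path(rows_dir).glob("*.csv"))
    writer = None
    with open(out_csv, "w", newline="") as out:
        for path in row_files:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                writer.writerows(reader)
```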
Author
Collaborator

The analysis part should be done now, closing this issue.
Reference: GuilloteauQ/study-docker-repro-longevity#26