Analyzing the output of ECG #26

Closed
opened 2024-07-25 17:31:40 +02:00 by antux18 · 11 comments
Collaborator

In order to study the reproducibility of the Dockerfiles, we need a script that can analyze the outputs of ECG to extract interesting data, and another script that plots the extracted data.

Some ideas of what we could plot:

  • Software environment:
    • Number of installed packages.
    • Proportion/number of packages that change version.
    • Package sources from where packages are changing the most.
    • Proportion/number of Dockerfiles using package managers/Git/misc.
    • Proportion/number of Dockerfiles using dpkg/rpm/pip/conda/other known package managers.
  • Dockerfiles build status:
    • Proportion/number of Dockerfiles for which build is successful.
    • Proportion/number of Dockerfiles for which build is successful/baseimage_unavailable/other known error types.
  • Artifacts' source:
    • Proportion/number of artifacts that can be downloaded.
    • Proportion/number of artifacts that have changed.
    • Proportion/number of artifacts that are available/unavailable/modified.

All of these plots can be done either on the outputs of a single execution, or on outputs generated over time by multiple executions (except for plots that measure change).

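To give an idea of what the build status plots would need, here is a minimal Python sketch that counts build results per category and normalizes them into proportions. The CSV log, its column names (`artifact`, `result`), and the error categories are assumptions for illustration, not ECG's actual output format:

```python
import csv
import io
from collections import Counter

# Hypothetical build status log; column names and categories are assumptions.
sample_log = """artifact,result
artifact-a,success
artifact-b,baseimage_unavailable
artifact-c,success
artifact-d,unknown_error
"""

# Count one entry per Dockerfile build, keyed by its result category.
counts = Counter(row["result"] for row in csv.DictReader(io.StringIO(sample_log)))
total = sum(counts.values())
proportions = {category: n / total for category, n in counts.items()}

print(counts["success"])       # number of successful builds -> 2
print(proportions["success"])  # proportion of successful builds -> 0.5
```

The same counting pattern would apply to the package-source and artifact-status plots, only with a different column being counted.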
Author
Collaborator

Regarding the build status analysis: it was decided not to log the build status of Dockerfiles that build successfully, but this makes analysis harder: how can we determine the number of Dockerfiles that built successfully if their success hasn't been logged?

Shouldn't we log the status when the build is successful?
antux18 self-assigned this 2024-07-26 17:08:54 +02:00
Owner

i would say that the ones that built successfully are the ones not present in the `blacklist`
Author
Collaborator

My question is rather how to count them if they are not in the file?
Owner

number of artifacts in the `artifacts/` folder minus the ones in the blacklist?
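A rough sketch of this counting approach, assuming a hypothetical layout where each artifact is one file in `artifacts/` and the blacklist lists one artifact name per line:

```python
from pathlib import Path

def count_successful_builds(artifacts_dir: str, blacklist_path: str) -> int:
    """Count the builds assumed successful: every artifact in artifacts_dir
    whose name does not appear in the blacklist file. The file layout and
    naming are assumptions about the ECG workspace, not its actual format."""
    # One artifact definition per file; use the file stem as the artifact name.
    artifacts = {p.stem for p in Path(artifacts_dir).iterdir() if p.is_file()}
    # Blacklist: one failed artifact name per line.
    blacklisted = set(Path(blacklist_path).read_text().split())
    return len(artifacts - blacklisted)
```

As the next comments point out, this needs the analysis script to know the artifacts folder in addition to the blacklist.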
Author
Collaborator

Sure, but wouldn't it be simpler to just put them in the build status file? Your solution would require another argument for the analysis script (the artifacts folder), and then the script would have to count them and subtract the number of artifacts that failed... It looks a bit cumbersome for nothing to me...
Owner

So we can log the successful build status; we just need to generate the blacklist correctly afterward.
For now the blacklist is just a concatenation of all the build status files.
If you feel it helps the analysis to have the success status, go for it 🙂
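The filtering step described above could be sketched like this (the build status logs are given as CSV text, and the column names are assumptions about ECG's format):

```python
import csv
import io

def build_blacklist(status_logs: list[str]) -> list[str]:
    """Concatenate build status logs and keep only artifacts whose build
    failed. Column names ("artifact", "result") are assumptions, not
    ECG's actual schema."""
    failed = []
    for text in status_logs:
        for row in csv.DictReader(io.StringIO(text)):
            # Successful builds are logged but filtered out of the blacklist.
            if row["result"] != "success":
                failed.append(row["artifact"])
    return failed
```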
Author
Collaborator

Oh, I see... It would complicate the generation of the blacklist?
Owner

not much i would say.
we would just need to filter out the successful builds
Author
Collaborator

Okay, then I'm going to do that in the Snakefile; it looks like it would indeed be simpler for the analysis.
antux18 added the Priority/Critical and Kind/Thinking labels 2024-08-05 16:03:16 +02:00
Author
Collaborator

We can summarize the output analysis as follows:

  • Only 5 tables are needed to draw the plots:
    1. Number of packages per source (each of the supported package managers presented separately, or git, or misc).
    2. Number of build results per category (either "success" or the error category).
    3. Number of artifact statuses per category (either "available" or not, and whether it was modified over time).
    4. Number of packages per container.
    5. Number of packages that changed version, per source.
  • These 5 tables only need the package list, the artifact hash log, or the build status log to be generated. This means we need to add a column to the package list identifying the time when the list was created, so it can be used to spot changes in the version number of each package over time (see #38).
  • For each of these 5 tables, there will be one file per row, which is easier to generate with Snakemake than a single file to which a new row is appended on every analysis. It also avoids having to trust the state of the aggregating file before appending a new row.
  • We could count not only the number of packages that changed, but also the number of times each has changed (same for the artifact status). But this is not essential for the study, so it likely won't be implemented. See #39.
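The one-file-per-row aggregation could look like this minimal sketch (in practice Snakemake would drive it; the file layout, with each run writing one single-row CSV into a directory, is an assumption):

```python
import csv
from pathlib import Path

def aggregate_rows(rows_dir: str, out_csv: str) -> None:
    """Merge single-row CSV files (one per analysis run) into one table.
    Each input file is assumed to hold a header line plus one data row;
    the first file's header is reused for the aggregate. Since every run
    writes its own file, no pre-existing aggregate has to be trusted."""
    row_files = sorted(Path(rows_dir).glob("*.csv"))
    writer = None
    with open(out_csv, "w", newline="") as out:
        for path in row_files:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                writer.writerows(reader)
```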
Author
Collaborator

The analysis part should be done now, closing this issue.
Reference: GuilloteauQ/study-docker-repro-longevity#26