---
title: "Study: Many Archived Packages Return to CRAN"
execute:
  freeze: auto
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE,
  echo = FALSE
)
```

```{r package-dependencies, include = FALSE}
## Install package dependencies, if missing
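pkgs <- c("dplyr", "tidyr", "ggplot2", "patchwork", "lubridate", "forcats",
          "scales", "rversions")
pkgs <- pkgs[!vapply(pkgs, FUN = requireNamespace, FUN.VALUE = FALSE, quietly = TRUE)]
# install.packages() has no 'character.only' argument; pass the names directly
if (length(pkgs) > 0) install.packages(pkgs)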
if (!requireNamespace("cransays")) {
  if (!requireNamespace("remotes")) install.packages("remotes")
  remotes::install_github("r-hub/cransays")
}
```

CRAN packages are archived all the time, but a large portion of them
eventually get fixed and return to CRAN.  Using public data available
from different resources[^1] on CRAN, we have found that 36% of the
archived packages get unarchived at some point [@revilla_2022]. The
median time for these packages to return to CRAN is about 33 days.

[^1]: Data sources used are `tools:::CRAN_current_db()`,
`tools:::CRAN_archive_db()`, and [PACKAGES.in]. The first holds a data
frame of packages currently on CRAN, with information on the package
name, the package version, and the publishing timestamp.  The second
holds a list of data frames, each comprising the same package
information for all versions ever published on CRAN, except the
currently available version.  The third holds information on events
for packages that have ever been archived, removed, orphaned, etc.

[PACKAGES.in]: https://cran.r-project.org/src/contrib/PACKAGES.in
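
To make the structure of these event records concrete, here is a minimal sketch that parses a made-up `PACKAGES.in`-style entry and extracts the date and the action from its comment. The package name `examplepkg` and the comment text are hypothetical; real entries vary in wording, so the patterns here are only an approximation.

```{r sketch-dcf, echo = TRUE, eval = FALSE}
# A made-up entry in DCF format (hypothetical package name and comment)
dcf <- c(
  "Package: examplepkg",
  "X-CRAN-Comment: Archived on 2023-01-15 as issues were not corrected",
  "  in time."
)
con <- textConnection(dcf)
entry <- as.data.frame(read.dcf(con))
close(con)

# Extract the first ISO date and the first action keyword from the comment
date <- strcapture("([0-9]{4}-[0-9]{2}-[0-9]{2})", entry$`X-CRAN-Comment`,
                   proto = data.frame(date = Sys.Date()[0]))
action <- strcapture("([Uu]narchived|[Aa]rchived)", entry$`X-CRAN-Comment`,
                     proto = data.frame(action = character()))
cbind(package = entry$Package, date, action)
```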

```{r cran-history, echo = FALSE}
# Searches packages in PACKAGES.in file and extracts relevant information.
library("dplyr")
options(repos = "https://cloud.r-project.org")
Sys.setLanguage("en")

url <- "https://cran.r-project.org/src/contrib/PACKAGES.in"
con <- url(url)
file <- read.dcf(con) |> 
  as.data.frame()

# Extract multiline comments
comments_l <- lapply(file$`X-CRAN-Comment`, function(x) {
  trimws(unlist(strsplit(x, "[\n]+"), FALSE, FALSE))
})
comments_c <- unlist(comments_l, FALSE, FALSE)
df <- data.frame(package = rep(file$Package, lengths(comments_l)),
                 comment = comments_c)
regex_date <- "([0-9]{4}-[0-9]{2}-[0-9]{2})"
regex_action <- "([Uu]narchived?|[Aa]rchived?|[Rr]enamed?|[Oo]rphaned?|[Rr]eplaced?|[Rr]emoved?)"
comments_df <- cbind(df, 
                     strcapture(pattern = regex_date, x = df$comment, 
                                proto = data.frame(date = Sys.Date()[0])),
                     strcapture(pattern = regex_action, x = df$comment,
                                proto = data.frame(action = character()))
) |> 
  filter(!is.na(comment)) |> 
  mutate(action = tolower(action))
# Check that count(comments_df, !is.na(date), !is.na(action), sort = TRUE) makes sense
# Handle rolled and no keyword used
comments_df$action[!is.na(comments_df$date) & is.na(comments_df$action)] <- "archived"

# filter(comments_df, !is.na(action) & is.na(date)) |> count(tolower(comment)) |> View("a")
# filter(comments_df, is.na(action) & is.na(date)) |> count(tolower(comment)) |> View("b")
# Handle CRAN-history
history_l <- lapply(file$`X-CRAN-History`, function(x) {
  trimws(unlist(strsplit(x, "[\n]+"), FALSE, FALSE))
})
history_c <- unlist(history_l)

history_df <- data.frame(package = rep(file$Package, lengths(history_l)),
                         comment = history_c) |> 
  filter(!is.na(comment))

history_df <- cbind(history_df,
                    strcapture(pattern = regex_date, x = history_df$comment, 
                               proto = data.frame(date = Sys.Date()[0])),
                    strcapture(pattern = regex_action, x = history_df$comment,
                               proto = data.frame(action = character()))
) |> 
  mutate(action = tolower(action))
history_df$action[grep("Back on CRAN", history_df$comment, ignore.case = TRUE)] <- "unarchived"

full_history <- rbind(comments_df, history_df) |>
  ## Manually fix typos
  mutate(action = gsub(pattern = "e$", replacement = "ed", action)) |> 
  arrange(package) |>
  relocate(date) |>
  relocate(comment, .after = last_col())
# Keeping only the history with a recognized event (even if it is not interesting)
history <- filter(full_history, 
                  !is.na(action),
                  !is.na(date)) 
```

```{r search-cran-archive, echo = FALSE}
archive <- tools:::CRAN_archive_db()
pkges <- unique(history$package[history$action %in% c("archived", "unarchived")])

# packages in archive
pkgs <- intersect(pkges, names(archive))
# all_other <- intersect(names(archive), pkges)
# lapply(archive[all_other], function(x)x[1, ])
relevant_archive <- archive[pkgs]
archive_df <- do.call(rbind, relevant_archive)

archives <- vapply(relevant_archive, nrow, numeric(1))
pkg <- rep(names(relevant_archive), times = archives)
archive_df$package <- pkg

current <- tools:::CRAN_current_db()
current_packages <- gsub("(.*)_.*\\.tar\\.gz$", "\\1", 
                                 rownames(current))
current$package <- current_packages
relevant_current <- current[current$package %in% pkgs, ]

packages <- rbind(archive_df, relevant_current) |> 
  mutate(date = as.Date(mtime), action = "accepted") |> 
  arrange(package, date) |> 
  select(date, package, action)

rownames(packages) <- NULL
```

```{r merge, echo = FALSE}
# Merge history and packages in archive
pkg_history <- merge(packages, history, all = TRUE, sort = FALSE) |> 
  arrange(package, date) 
```


## Data quality checks

To make sure our assumptions about the raw input data are valid, we
will run some initial quality checks based on the data that are
available as of `r format(Sys.Date(), format = "%F")`.

```{r over-unarchived, echo = FALSE}
# Packages with problems with annotation about being archived or recording
# the submission
over_unarchived <- pkg_history |> 
  summarise(.by = package,
            missing = sum(action == "unarchived") > sum(action == "archived")) |> 
  filter(missing) |> 
  pull(package)
```

A package should never be unarchived more times than it is
archived. However, there are currently `r length(over_unarchived)`
packages unarchived more times than archived. There could be several
reasons for this:

 - one of the previous issues with the package led to its removal.
 - the term used for annotating an event was different ('orphaned'
   instead of 'archived').
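
As an illustration of this check, the following sketch applies the same comparison to a toy event log. The packages `pkgA` and `pkgB` are hypothetical, and the sketch assumes dplyr 1.1.0 or later for the `.by` argument.

```{r sketch-over-unarchived, echo = TRUE, eval = FALSE}
library("dplyr")

# Toy event log (hypothetical packages)
events <- data.frame(
  package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  action  = c("accepted", "archived", "unarchived", "accepted", "unarchived")
)

# Flag packages with more 'unarchived' than 'archived' events
flagged <- events |>
  summarise(.by = package,
            missing = sum(action == "unarchived") > sum(action == "archived")) |>
  filter(missing) |>
  pull(package)
flagged
#> [1] "pkgB"
```

Here `pkgB` is flagged because its 'unarchived' event has no matching 'archived' record.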


```{r start}
#| fig-cap: "**Figure: First recorded action of a package.** Sorted by date, the first
#|  recorded action for most packages is being added to CRAN. For some it isn't."
# Packages not registered when they were first included
pkg_history |> 
  arrange(package, date) |> 
  summarise(.by = package, first_action = first(action)) |> 
  count(first_action, sort = TRUE) |> 
  filter(first_action != "accepted") |> 
  knitr::kable(col.names = c("First action", "Packages"))

weird_first <- pkg_history |> 
  arrange(package, date) |> 
  summarise(.by = package, first_action = first(action)) |> 
  filter(first_action != "accepted",
         first_action != "removed") |> 
  pull(package)
no_accepted <- pkg_history |> 
  summarize(.by = package, 
            no_accepted = any(action == "accepted")) |> 
  filter(!no_accepted) |> 
  pull(package)
```

We also check the first recorded event for each package.  If the
first action recorded for a package is not that it is 'accepted', this
can indicate problems in the data that could affect
the conclusions.
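
A minimal sketch of this check on toy data follows; the package names and dates are made up for illustration, and `.by` assumes dplyr 1.1.0 or later.

```{r sketch-first-action, echo = TRUE, eval = FALSE}
library("dplyr")

# Toy event log (hypothetical packages and dates)
events <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgB"),
  date    = as.Date(c("2020-01-10", "2021-03-01", "2019-05-20", "2019-06-02")),
  action  = c("accepted", "archived", "archived", "accepted")
)

# First recorded action per package, in chronological order
first_actions <- events |>
  arrange(package, date) |>
  summarise(.by = package, first_action = first(action))

# Packages whose history does not start with an acceptance
suspicious <- filter(first_actions, first_action != "accepted")
suspicious$package
#> [1] "pkgB"
```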

A special mention of the 'removed' action: This action is usually
reserved for copyright issues, and it is normal for it to be the first
action on record for a package because previous records (the package
source code) are removed from CRAN too.

By contrast, we should not expect a lack of records on CRAN of
'accepted' packages. Based on the current data, this is the case for
`r length(no_accepted)` packages. This could indicate that packages
have been 'renamed' or 'removed'.  Another explanation could be that
there was a dialogue between the package maintainers and the CRAN Team
that led to the package being 'unarchived' without new 'accepted'
packages.  It could also be because of a missing entry in the CRAN
data.

```{r multiple-actions}
#| tbl-cap: "**Table: Events per package on the same date.** Most packages have just 
#|  one action per day. Unarchiving usually requires a new version of the package 
#|  to be accepted the same day."
# Count dates that have multiple actions/events.
# Observations:
# - The 'AMR' package is 'archived' multiple times and new packages dates
#   match with 'unarchived' dates
# - The 'ACEP' package is 'archived' and 'unarchived', but there is no new
#   entry for it
hfdd <- summarise(pkg_history, .by = c(package, date), multi = n_distinct(action))

# Merge data with itself
multiple_actions <- merge(pkg_history, hfdd, all = TRUE, sort = FALSE)
library("tidyr")
library("scales")
library("ggplot2")
ma <- multiple_actions |> 
  count(multi) |> 
  rename(events = n)
multiple_actions |> 
  count(action, multi, sort = TRUE) |> 
  pivot_wider(names_from = action, values_from = n) |> 
  full_join(ma, by = join_by(multi)) |> 
  mutate(across(accepted:renamed, function(x){scales::percent(x/sum(x, na.rm = TRUE))}),
         events_percent = scales::percent(events/sum(events, na.rm = TRUE))) |> 
  select(multiple_actions = multi, events, events_percent) |> 
  knitr::kable(align = "c",
               col.names = c("Multiple actions", "Events", "% events"))

no_multiples_actions_pack <- multiple_actions |> 
  filter(action == "unarchived") |> 
  group_by(package) |> 
  count(action_not_same_date = action == "unarchived" & multi == 1L) |> 
  ungroup() |> 
  filter(action_not_same_date) |> 
  pull(package)

no_muliples_actions <- multiple_actions |> 
  filter(action == "unarchived" & multi == 1L) |> 
  nrow()
```

In total there are `r no_muliples_actions` 'unarchived' events without a corresponding 'accepted' event on the same date.  Currently, this is the case for `r length(no_multiples_actions_pack)` packages out of `r n_distinct(pkg_history$package)` (`r percent(length(no_multiples_actions_pack)/n_distinct(pkg_history$package))`).
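
The same-date check can be sketched on toy data as follows. The package names and dates are hypothetical, and `.by` assumes dplyr 1.1.0 or later.

```{r sketch-same-day, echo = TRUE, eval = FALSE}
library("dplyr")

# Toy event log (hypothetical packages and dates)
events <- data.frame(
  package = c("pkgA", "pkgA", "pkgB"),
  date    = as.Date(c("2022-02-01", "2022-02-01", "2022-04-15")),
  action  = c("accepted", "unarchived", "unarchived")
)

# Number of distinct actions recorded per package and date
per_day <- events |>
  mutate(.by = c(package, date), multi = n_distinct(action))

# 'unarchived' events that are the only action recorded on that date
lonely <- filter(per_day, action == "unarchived", multi == 1L)
lonely$package
#> [1] "pkgB"
```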

Conversely, some events are not expected to happen
on the same day:

```{r combined-actions}
#| tbl-cap: "**Table: Actions that happened on the same date for a given package.** In most cases a new acceptance leads to a package being unarchived. Occasionally other actions coincide."
multiple_actions |> 
  filter(multi != 1) |> 
  group_by(package, date) |> 
  mutate(cg = cur_group_id()) |> 
  ungroup() |> 
  summarise(.by = cg, 
            type = paste(sort(unique(action)), sep = " ", collapse = " & ")) |> 
  count(type, sort = TRUE) |> 
  mutate(percent = percent(n/sum(n))) |>
  knitr::kable(align = "c", col.names = c("Multiple actions", "Events", "%"))
```


Those with three different actions imply that there have been multiple
revisions by the CRAN Team on the same day.

```{r current}
all_packages <- unique(c(current_packages, names(archive)))
# length(all_packages)
# To account for packages removed too:
all_packages2 <- unique(c(all_packages, pkg_history$package))
# length(all_packages2)

# packages_removed <- out |> 
#   filter(action == "removed") |> 
#   distinct(package) |> 
#   pull(package)
# setdiff(packages_removed, all_packages2)
# setdiff(packages_removed, all_packages) |> length()
```


```{r qc}
individually <- length(no_multiples_actions_pack) +
  length(over_unarchived) +
  length(no_accepted) +
  length(weird_first)
all_failing_qc <- c(no_multiples_actions_pack,
        over_unarchived,
        no_accepted,
        weird_first)
pkg_failing_qc <- unique(all_failing_qc)

tb <- as.data.frame(table(table(all_failing_qc))) |> 
  filter(Var1 != 1) |> 
  summarise(n = sum(Freq)) |> 
  pull(n)
```

In total there are `r length(pkg_failing_qc)` different packages
identified with problematic records/processing out of `r n_distinct(pkg_history$package)`.  Out of these, `r tb` were found to have two or more different issues.  Depending on the issue and the question we are trying to answer, these packages are either corrected to the best of our abilities or discarded.
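
As a small illustration of how such overlaps can be counted, the sketch below pools the results of four hypothetical checks and tabulates how often each package appears. The vectors `check_1` to `check_4` are made-up stand-ins for the outputs of the quality checks.

```{r sketch-qc-overlap, echo = TRUE, eval = FALSE}
# Hypothetical outputs of four quality checks
check_1 <- c("pkgA", "pkgB")
check_2 <- c("pkgB")
check_3 <- c("pkgC")
check_4 <- c("pkgB", "pkgC")

all_flagged <- c(check_1, check_2, check_3, check_4)

# How many checks each package failed
issues_per_package <- table(all_flagged)

n_flagged  <- length(unique(all_flagged))    # distinct packages flagged
n_multiple <- sum(issues_per_package >= 2)   # packages failing two or more checks
c(flagged = n_flagged, multiple = n_multiple)
```

In this toy case three distinct packages are flagged, and two of them (`pkgB` and `pkgC`) fail more than one check.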


## Analysis

Now that we have looked into the data quality, we can start trying to
answer some questions:
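
One way to measure time off CRAN is to pair each 'archived' event with the event that follows it for the same package. The sketch below shows the idea on toy data; it is a simplification under stated assumptions (hypothetical packages and dates, no quality filtering), not the full procedure used for the results in this document.

```{r sketch-delay, echo = TRUE, eval = FALSE}
library("dplyr")

# Toy event log (hypothetical packages and dates)
events <- data.frame(
  package = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  date    = as.Date(c("2020-01-10", "2021-03-01", "2021-03-31",
                      "2019-05-20", "2022-08-01")),
  action  = c("accepted", "archived", "accepted", "accepted", "archived")
)

delays <- events |>
  arrange(package, date) |>
  # Look one event ahead within each package
  mutate(.by = package,
         next_action = lead(action),
         next_date   = lead(date)) |>
  filter(action == "archived") |>
  mutate(returned = next_action %in% c("accepted", "unarchived"),
         delay    = difftime(next_date, date, units = "days"))

delays[, c("package", "returned", "delay")]
median(as.numeric(delays$delay), na.rm = TRUE)
#> [1] 30
```

In this example `pkgA` returns after 30 days, while `pkgB` has no later event and therefore no delay.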

```{r history-back, echo = FALSE}
library("tidyr")
times_unarchived <- pkg_history |> 
  filter(!package %in% c(over_unarchived, no_accepted)) |> 
  group_by(package) |> 
  # Keep those packages archived once
  filter(cumsum(action %in% c("archived")) >= 1,
         # Removes some 5k packages that only have one remaining action.
         n() >= 2,
         sum(action %in% c("unarchived", "accepted")) >= sum(action == "archived")
         ) |>
  mutate(lead = lead(action, default = NA),
         lag = lag(action, default = NA),
         n = 1:n()) |> 
  filter((action == "archived" & lead %in% c("unarchived", "accepted")) | 
           (action %in% c("unarchived", "accepted") & lag == "archived")) |>
  # Index of how many times a given package was archived (and a new version followed)
  mutate(time_archived = rep(1:n(), each = 2, length.out = n())) |> 
  ungroup() |> 
  # Checking that the actions are consecutive
  group_by(package, time_archived) |> 
  filter(last(n) - first(n) == 1) |> 
  ungroup() |> 
  select(package, time_archived, action, date) |> 
  pivot_wider(names_from = action, values_from = date) |> 
  mutate(
    # Assume all the ones unarchived but without accepted are a new submission 
    # or a repeal
    accepted = if_else(is.na(accepted) & unarchived != archived, unarchived, accepted)) |> 
  select(-unarchived)
```

```{r resubmissions-data}
if (file.exists("cdh.rds")) {
  cdh <- readRDS("cdh.rds")
} else {
  # download of ~31MB and long task of organizing it to discard most data.
  cdh <- cransays::download_history()
}

library("lubridate", warn.conflicts = FALSE)
cran_submissions <- cdh |> 
  filter(package %in% pkg_history$package) |> 
  distinct(package, version, snapshot_time, submission_time) |> 
  arrange(package, snapshot_time) |> 
  group_by(package, version) |> 
  summarise(initial_resubmission = as.Date(as.POSIXct(
    min(snapshot_time, submission_time, na.rm = TRUE),
    tz = tz(cdh$snapshot_time)
    ))) |> 
  ungroup() |> 
  select(-version)
# Date we started recording the CRAN submission queue.
start_archive <- min(as.Date(cdh$snapshot_time))
```


```{r all-data}
archived <- pkg_history |> 
  mutate(.by = package,
         v_archived = cumsum(action == "archived"),
         v_accepted = cumsum(action == "accepted")) |> 
  arrange(package, v_accepted, v_archived, date) |> 
  mutate(.by = package,
         first_accepted = min(date[action == "accepted"])) |> 
  mutate(.by = c(package, v_accepted),
         last_accepted = max(date[action == "accepted"])) |>  
  mutate(.by = package,
         previous_accepted = pmax(first_accepted, last_accepted)) |> 
  mutate(last_accepted = if_else(is.infinite(last_accepted), NA, last_accepted),
         previous_accepted = if_else(is.infinite(previous_accepted), NA, previous_accepted)) |> 
  filter(v_archived >= 1, action == "archived")

archived_resubmitted <- merge(archived, cran_submissions, all.x = TRUE) |> 
  filter(# This removes cases when submission was before archived
         is.na(initial_resubmission) | 
           !is.na(initial_resubmission) & initial_resubmission >= date) |> 
  mutate(
    not_addressed = grepl("not addressed", comment, fixed = TRUE ),
    not_corrected = grepl("not corrected", comment, fixed = TRUE ),
    not_fixed = grepl("not fixed", comment, fixed = TRUE ),
    depends_on = grepl("depend(s|ed) on", comment),
    requires = grepl("as require[sd]", comment),
    archived = grepl("archived package", comment, fixed = TRUE),
    maintainer_address = grepl("maintainer address", comment, fixed = TRUE),
    email_bounced = grepl("email bounced", comment, fixed = TRUE),
    requested = grepl("requested", comment, fixed = TRUE),
    policy_violation = grepl("policy violation", comment, fixed = TRUE),
    # From WRE: "This should contain only (ASCII) letters, numbers and dot, 
    # have at least two characters and start with a letter and not end in a dot. "
    # Not fully compliant but close enough ( including quotations to find them)
    # gm <- gregexpr("['\\\"]([[:alnum:]\\.]+)['\\\"]", out_depends$comment)
    position_package = gregexpr("['\\\"]([[:alnum:]\\.]+)['\\\"]", comment),
    possible_packages = regmatches(comment, position_package),
    dependency_package = lapply(possible_packages, gsub, pattern = "\\\"|'", replacement = ""),
    # accepted = date[which(action %in% c("unarchived", "accepted"))],
    # FIXME find the package closest to the archived word (in the 2 cases it affects)
    # For the moment pick the first package
    dp = sapply(dependency_package, function(x) {x[1]}),
    dp = if_else(dp %in% c("README", "\\donttest"), NA, dp)
         ) |> 
  summarise(.by = c(package, date),
            initial_resubmission = min(initial_resubmission),
            maintainer_address = any(maintainer_address | email_bounced),
            not_fixed = any(not_addressed | not_corrected | not_fixed),
            archived_by_dep = any(depends_on | requires | archived),
            requested = any(requested),
            policy_violation = any(policy_violation),
            first_accepted = unique(first_accepted),
            last_accepted = unique(last_accepted),
            previous_accepted = unique(previous_accepted),
            version = max(v_accepted),
            # accepted = tail(accepted, 1),
            dp = unique(dp)
            ) |> 
  # Remove those submissions that are after some already archived version.
  mutate(.by = package,
         initial_resubmission = if_else(
    initial_resubmission > lead(date, default = Sys.Date()), NA, initial_resubmission))

archived_all <- merge(times_unarchived, archived_resubmitted, all = TRUE,
                      by.x = c("package", "archived"),
                      by.y = c("package", "date")) |> 
  mutate(.by = package, first_reacceptance = min(accepted)) |>
  mutate(before_registry = (archived < start_archive | 
                              initial_resubmission < start_archive) & 
           !(archived > start_archive),
         
         initial_resubmission = if_else(is.na(initial_resubmission) &
                                          !is.na(accepted) &
                                          accepted > start_archive,
                                        accepted, initial_resubmission),
  # Decision tree
  #                       Accepted => Accepted
  #             Submitted
  # Archived              Rejected => Not accepted
  #          No submitted          => Never resubmitted
  #          
  # Archived before resubmission archive
  #          Accepted              => Accepted  
  # Archived
  #          No record             => Unknown (Never submitted or rejected)
  # 
         back_on_cran = case_when(
           !is.na(accepted) ~ "Accepted",
           is.na(initial_resubmission) & archived < start_archive ~ "Unknown",
           is.na(initial_resubmission) & archived >= start_archive ~ "Never resubmitted",
           is.na(accepted) & !is.na(initial_resubmission) ~ "Not accepted",
           .default = "Unknown"),
         back_on_cran = factor(back_on_cran, 
                               levels = c("Accepted", "Never resubmitted", 
                                          "Not accepted", "Unknown")),
         delay_submission = difftime(initial_resubmission, archived, units = "days"),
         delay_accepted = difftime(accepted, archived, units = "days"),
         delay_acceptance = difftime(accepted, initial_resubmission, units = "days"),
         time_since_first_v = difftime(archived, first_accepted, units = "weeks"),
         time_archived = if_else(is.na(time_archived), 1, time_archived))
# For future references
legend <- c("Accepted" = "green", "Never resubmitted" = "orange", 
            "Not accepted" = "red", "Unknown" = "grey50")
```


```{r archived-version}
rver <- rversions::r_versions() |> 
  mutate(date = as.Date(date)) |> 
  filter(endsWith(version, ".0"))

r_next <- function(rver, date) {
  w <- max(which(unique(date) > rver$date), na.rm = TRUE) + 1
  if (w > length(rver$date)) {
    return(NA)
  }
  rver$date[w]
}

r_release <- function(rver, date) {
  # The latest release that is earlier than the date.
  w <- max(which(unique(date) > rver$date), na.rm = TRUE)
  rver$date[w]
}

archived_all_versions <- archived_all |> 
  arrange(archived) |> 
  group_by(archived) |>  
  mutate(date_r_rel = r_release(rver, archived), 
         date_r_next = r_next(rver, archived),
         time_since_rel = difftime(archived, date_r_rel, units = "week"),
         time_before_next = difftime(archived, date_r_next, units = "week"),
         time_since_previous = difftime(archived, previous_accepted, units = "weeks")
         ) |> 
  ungroup() |> 
  filter(!is.na(date_r_rel))
```

### Summary of how long it takes packages to be unarchived

```{r table-times}
#| tbl-cap: "**Table: Summary statistics of time to get back to CRAN after
#|  being archived.** Median time is 33 days."
fiu <- function(x){is(x, "difftime")}
archived_all_versions |> 
  filter(!is.na(accepted),
         !package %in% all_failing_qc) |> 
  summarise(.by = time_archived, 
            packages = n(), 
            min = min(delay_accepted, na.rm = TRUE), 
            q1 = quantile(delay_accepted, 0.25, na.rm = TRUE), 
            median = median(delay_accepted, na.rm = TRUE), 
            mean = mean(delay_accepted, na.rm = TRUE), 
            q3 = quantile(delay_accepted, 0.75, na.rm = TRUE), 
            max = max(delay_accepted, na.rm = TRUE)) |> 
  mutate(across(where(fiu), round)) |> 
  as.data.frame() |> 
  knitr::kable(align = "c",
               col.names = c("Times archived", "Packages", "Min.", "1st Qu.", 
               "Median", "Mean", "3rd Qu.", "Max."))
```


```{r plot-times}
#| fig-cap: "**Figure: Summary statistics of time to get back to CRAN after
#| being archived.** Graphical representation of the previous table."
archived_all_versions |> 
  filter(!is.na(accepted), !package %in% all_failing_qc) |>
  ggplot() +
  geom_boxplot(aes(as.factor(time_archived), as.numeric(delay_accepted))) +
  scale_y_log10(labels = label_log(),
                expand = expansion(c(0, NA), c(0, NA)),
                sec.axis = dup_axis(breaks = c(1, 7, 15, 30*1:5),
                                    labels = c(1, 7, 15, 30*1:5),
                                    name = "days")) +
  theme_minimal() +
  labs(y = "Number of days before package returned on CRAN", 
       x = "Number of times a package has been archived",
       title = "Time to return to CRAN by number of times archived")
```


### Return time for packages archived only once in their lifetime

```{r plot-ecdf}
#| fig-cap: "**Figure: Empirical distribution of the time it takes packages to get
#|  unarchived as a function of number of days since being archived on CRAN for the first time.**"
if (!requireNamespace("ggarrow", quietly = TRUE)) {
  install.packages("ggarrow")
}
library("ggarrow")
archived_all_versions |> 
  filter(time_archived == "1",
         !package %in% all_failing_qc) |>
  ggplot() +
  stat_ecdf(aes(delay_accepted)) +
  coord_cartesian(xlim =  c(0, 365), expand = FALSE) +
  scale_y_continuous(labels = scales::label_percent(), 
                     sec.axis = dup_axis(name = element_blank())) +
  geom_arrow_segment(y = 0.5, x = 0, xend = 35, col = "red", linewidth = 2) + 
  geom_arrow_segment(y = 0.5, x = 35, yend = 0, col = "red", linewidth = 2) + 
  scale_x_continuous(breaks = c(0, 30*1:12)) +
  theme_minimal() +
  labs(title = "Half of the packages are back on CRAN within 33 days",
       subtitle = "Focusing on packages returning in less than a year",
       y = "Percentage of packages back on CRAN",
       x = "Number of days until packages are back on CRAN") +
  theme(plot.title.position = "plot", axis.title.x = element_text(hjust = 0))
```


### Return time for packages archived

```{r plot-ecdf-all}
#| fig-cap: "**Figure: Empirical distribution of the time it takes packages to get
#|  unarchived as a function of number of days since being archived on CRAN.**"
archived_all_versions |> 
  filter(!is.na(accepted),
         !package %in% all_failing_qc) |>
  ggplot() +
  stat_ecdf(aes(delay_accepted)) +
  coord_cartesian(xlim =  c(0, 365), expand = FALSE) +
  scale_y_continuous(labels = scales::label_percent(), 
                     sec.axis = dup_axis(name = element_blank())) +
  geom_arrow_segment(y = 0.5, x = 0, xend = 30, col = "red", linewidth = 2) + 
  geom_arrow_segment(y = 0.5, x = 30, yend = 0, col = "red", linewidth = 2) + 
  scale_x_continuous(breaks = c(0, 30*1:12)) +
  theme_minimal() +
  labs(title = "Time for packages to be back on CRAN",
       y = "Percentage of packages back on CRAN",
       x = "Days till packages are back on CRAN")

```


```{r return-time, eval=FALSE}
library("forcats")
archived_all_versions |> 
  filter(!is.na(accepted), !package %in% all_failing_qc) |>
  mutate(package = fct_reorder(package, archived, .fun = min)) |> 
  ggplot() +
  # geom_point(aes(archived, package), shape = "square") +
  # geom_point(aes(accepted, package)) +
  geom_segment(aes(y = package, x = accepted, xend = archived, col = delay_accepted)) +
  scale_x_date(date_breaks = "2 years", date_labels = "'%y") +
  scale_color_continuous(transform = "reverse") +
  labs(x = "Accepted date", y = "Package", col = "Time (days)",
       title = "Packages sorted by date of acceptance and their time off CRAN") +
  theme_minimal() +
  theme(axis.text.y = element_blank(), 
        panel.grid.major.y = element_blank(), 
        panel.grid.minor.y = element_blank())
```


```{r return-time2, eval=FALSE}
#| fig-cap: ""
archived_all_versions |> 
  filter(!is.na(time_since_first_v), !package %in% all_failing_qc) |>
  ggplot() +
  geom_count(aes(time_since_first_v, time_since_previous, col = time_archived)) +
  labs(x = "Time from publication to archival (weeks)",
       y = "Time since archival to new accepted package (weeks)",
       col = "Times archived",
       # FIXME
       title = "Newer packages are archived sooner and take longer to be fixed",
       subtitle = "Not adjusted to the number of packages published") +
  scale_x_continuous(expand = expansion(add = NA_integer_)) +
  scale_y_continuous(expand = expansion(add = c(0, NA), mult = c(0, NA))) +
  theme_minimal() +
  theme(legend.position = "inside", legend.position.inside = c(0.1, 0.7),
        legend.background = element_rect(), plot.title.position = "plot")
```


```{r return-time3}
#| fig-cap: "**Figure: Packages archived and time since the previous release.** The 
#|  color indicates if a given package was archived multiple times; the more 
#|  times it has been archived the lighter the point is."
archived_all_versions |> 
  filter(!is.na(time_since_previous), !package %in% all_failing_qc) |>
  ggplot() +
  geom_point(aes(archived, time_since_previous, col = time_archived)) +
  geom_abline(intercept = Sys.Date(), slope = -2) +
  scale_y_continuous(expand = expansion(mult = c(0, NA), add = c(0, NA)), 
                     sec.axis = dup_axis(name = element_blank())) +
  scale_x_date(date_breaks = "2 years", date_labels = "%Y",
               expand = expansion()) +
  theme_minimal() +
  labs(title = "Time for packages to be back on CRAN",
       y = "Time since last release (weeks)",
       x = "Date of archival",
       col = "Times archived") +
  theme(legend.position = "inside", legend.position.inside = c(0.1, 0.7),
        legend.background = element_rect(), plot.title.position = "plot")
```


### Cumulative number of archived packages over the years

```{r plot-cumulative}
#| fig-cap: "**Figure: Package actions taken by the CRAN Team over time**.
#|  The CRAN Team may take different actions for packages currently on
#|  CRAN, e.g. archived (solid red), orphaned (dotted yellow), removed
#|  (dashed green), renamed (dashed blue), and unarchived (dotted purple).
#|  Presented is the cumulative number of such events over time on the linear
#|  (left) and the logarithmic (right) scale."
library("patchwork")
p_type <- full_history |> 
  filter(!is.na(action), !is.na(date)) |> 
  arrange(date) |> 
  select(-comment) |> 
  group_by(action) |> 
  mutate(n = seq_len(n())) |> 
  ungroup() |> 
  ggplot() +
  geom_line(aes(date, n, col = action, linetype = action)) +
  scale_x_date(date_breaks = "2 year", date_labels = "'%y",
               expand = expansion()) +
  theme_minimal() +
  labs(x = "Date of the archive",
       y = "Total number of packages"
  )
p_type + scale_y_continuous(expand = expansion()) + 
  p_type + scale_y_log10(guide = "axis_logticks", expand = expansion(), 
                    breaks = c(1, 100, 2500, 5000, 7500, 10000)) +
  plot_annotation(
    title = "Accumulation of actions on packages",
  ) +
  plot_layout(guides = 'collect', axes = "collect") &
  theme(legend.position = 'bottom') &
  labs(col = "Event", linetype = "Event")
```


### Days to return versus date when archived

```{r plot-events}
#| fig-cap: "**Figure: Packages being archived and returning to CRAN.**
#|    Each data point represents when a CRAN package was archived (horizontal
#|    axis) and when it was unarchived (vertical axis).
#|    If more than one package was archived and unarchived on the same dates,
#|    the corresponding data point is presented as a larger disk.
#|    The gray dashed line is the event horizon."
archived_all_versions |> 
  filter(!is.na(delay_accepted)) |> 
  ggplot() +
  geom_count(aes(archived, delay_accepted)) +
  geom_abline(slope = -1, intercept = Sys.Date(), linetype = 2, col = "gray") +
  geom_rug(aes(archived, delay_accepted), sides = "b", outside = TRUE, 
           length = unit(0.015, "npc"), 
           col = "gray") +
  theme_minimal() +
  coord_cartesian(clip = "off") +
  scale_y_continuous(expand = expansion(c(0, NA), c(0, NA))) +
  annotate("text", x = as.Date("2018-06-01"), y = 2700, 
           label = "Event horizon", col = "gray") +
  labs(x = "Date when the package was archived",
       y = "Time until it went back to CRAN",
       title = "Time till archived packages are back to CRAN",
       size = "Packages")
```


### Distribution of number of days for packages to return to CRAN

```{r plot-distribution}
#| fig-cap: "**Figure: Histogram of how long packages remain archived on CRAN**. 
#|  Each bar represents a week. Most packages return to CRAN within a month."
p1 <- archived_all_versions |> 
  filter(!is.na(delay_accepted)) |> 
  ggplot() +
  geom_histogram(aes(as.numeric(delay_accepted)), binwidth = 7) +
  theme_minimal() +
  scale_y_continuous(expand = expansion(c(0, NA), c(0, NA))) +
  labs(y = "Packages that got back",
       x = "Time from archival to acceptance (days)",
       title = "Time till packages are back to CRAN")

p2 <- archived_all_versions |> 
  filter(!is.na(delay_accepted), !package %in% all_failing_qc) |>
  filter(delay_accepted <= 366) |> 
  ggplot() +
  geom_histogram(aes(as.numeric(delay_accepted)), bins = 52) +
  scale_y_log10(expand = expansion(), breaks = c(1, 10, 100, 200, 400, 600, 800)) +
  scale_x_continuous(expand = expansion()) +
  labs(y = "Packages that got back",
       x = "Time from archival to acceptance (days)",
       title = "Focusing on the first year") +
  theme_minimal() +
  theme(plot.background = element_rect())

# The x-axis of both panels is in days from archival to acceptance,
# so the labels set on p1 and p2 are kept as they are.
p1 + 
  inset_element(p2, 0.2, 0.2, 1, 1)
```

### Packages archived overall

At least `r n_distinct(pkg_history$package[pkg_history$action == "archived"])` packages have been archived from CRAN, 
out of a total of `r length(archive)` in its whole history. 
This means that `r scales::percent(n_distinct(pkg_history$package[pkg_history$action == "archived"])/length(archive))` of all packages ever on CRAN were archived at some point.
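
The share reported above is a simple ratio of distinct archived packages to all known packages. A minimal sketch with made-up inputs follows; `events` and `all_pkgs` are hypothetical stand-ins, and the choice of denominator is an assumption that affects the resulting percentage.

```{r sketch-share, echo = TRUE, eval = FALSE}
# Toy event log and package universe (hypothetical)
events <- data.frame(
  package = c("pkgA", "pkgA", "pkgB", "pkgC"),
  action  = c("accepted", "archived", "accepted", "archived")
)
all_pkgs <- c("pkgA", "pkgB", "pkgC", "pkgD")

# Distinct packages with at least one 'archived' event
n_archived <- length(unique(events$package[events$action == "archived"]))

share <- n_archived / length(all_pkgs)
sprintf("%.0f%%", 100 * share)
#> [1] "50%"
```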

```{r packages-archived}
#| fig-cap: "**Figure: Packages are archived multiple times**.
#|  Packages archived are sometimes back on CRAN and archived again."
library("forcats")
pa_histo <- archived_all_versions |> 
  summarise(.by = c(package, back_on_cran), time_archived = max(time_archived)) |> 
  mutate(back_on_cran = fct_relevel(back_on_cran, "Unknown", "Never resubmitted", "Not accepted", "Accepted")) |> 
  ggplot() +
  geom_histogram(aes(time_archived, fill = back_on_cran), binwidth = 1) +
  scale_x_continuous(breaks = 1:6, expand = expansion()) +
  scale_y_continuous(expand = expansion(mult = c(0, NA), add = c(0, NA))) +
  scale_fill_manual(values = legend) +
  theme_minimal() +
  labs(x = "Times archived", 
       y = "Packages",
       fill = "Resubmission process",
       title = "Times a package has been archived")
pa_bar <- archived_all_versions |> 
  summarise(.by = c(package, back_on_cran),
            time_archived = max(time_archived)) |> 
  count(back_on_cran, time_archived, name = "packages") |> 
  mutate(rel = packages / length(all_packages2),
         back_on_cran = fct_relevel(back_on_cran, "Unknown", "Never resubmitted", 
                                    "Not accepted", "Accepted")) |> 
  arrange(time_archived, back_on_cran) |> 
  mutate(rel_accum = cumsum(rel),
         ymin = rel_accum - rel) |> 
  ggplot() +
  geom_rect(aes(xmin = time_archived - 0.5, xmax = time_archived + 0.5, 
                ymin = ymin, ymax = rel_accum,
                fill = back_on_cran)) +
  # geom_col(aes(time_archived, rel, fill = back_on_cran, 
  #              group = time_archived)) +
  scale_y_continuous(labels = scales::label_percent(), 
                     expand = expansion(),
                     limits = c(0, 1)) +
  scale_x_continuous(breaks = 1:6, expand = expansion()) +
  scale_fill_manual(values = legend) +
  scale_color_continuous(transform = "reverse") +
  theme_minimal() +
  labs(x = "Times archived", y = "Percentage of all packages ever on CRAN", 
       title = "Percentage of archived packages", 
       fill = "Resubmission process") +
  theme(legend.position = "inside", legend.position.inside = c(0.2, 0.5))
pa_histo + pa_bar + 
  plot_layout(guides = 'collect') & 
  theme(plot.title.position = "plot", legend.position = "bottom")
```

Most packages are never archived, and those that are archived are mostly archived only once. 
This is probably because about 50% of the archived packages never get back to CRAN.

```{r packages-archived-unarchived}
#| fig-cap: "**Figure: Most packages are not back to CRAN after being archived**.
#|  Packages that got archived sometimes go back on CRAN."
pa <- archived_all_versions |> 
  filter(!package %in% all_failing_qc) |>
  summarise(.by = package, 
            archived = sum(!is.na(archived)),
            unarchived = sum(!is.na(accepted))) |> 
  # Some corrections to make sure it looks well 
  mutate(archived = if_else(archived - unarchived < 0,  unarchived, archived)) |>
  mutate(unarchived = if_else(archived - unarchived > 1,  archived - 1, unarchived)) |>
  filter(archived >= 1) |> 
  count(archived, unarchived) |> 
  mutate(rel = n/sum(n)) 

pa |> 
  ggplot() +
  geom_point(aes(archived, unarchived, size = n, col = rel)) +
  scale_color_continuous(labels = percent, breaks = 0.1*0:6) + 
  scale_x_continuous(breaks = 0:6) +
  scale_y_continuous(breaks = 0:6) +
  theme_minimal() +
  labs(x = "Times archived", y = "Times unarchived", col = "Percentage",
       size = "Packages",
       title = "Packages archived rarely get back to CRAN")
pa_percentage <- pa |> 
  group_by(back = unarchived >= archived) |> 
  summarise(n = sum(n)) |> 
  mutate(rel = n/sum(n)) |> 
  ungroup() |> 
  filter(back) |> 
  pull(rel)
```

Approximately `r percent(pa_percentage)` of the archived packages get back on CRAN.
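
The idea behind this percentage can be sketched with base R: a package counts as back when it has been accepted at least as many times as it has been archived. The `toy` data frame below is hypothetical and much simpler than `archived_all_versions`; it only assumes one row per archival event with an `accepted` date that is `NA` when the package did not return.

```r
# Hypothetical archival events (one row per event; dates are made up)
toy <- data.frame(
  package  = c("pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  archived = as.Date(c("2021-01-10", "2022-03-01", "2021-06-15",
                       "2020-02-01", "2023-05-20")),
  accepted = as.Date(c("2021-02-01", "2022-03-20", NA,
                       "2020-03-01", NA))
)

# Number of archival and acceptance events per package
n_archived   <- tapply(!is.na(toy$archived), toy$package, sum)
n_unarchived <- tapply(!is.na(toy$accepted), toy$package, sum)

# A package is back when every archival was followed by an acceptance
back <- n_unarchived >= n_archived

# Share of archived packages that are back
mean(back)
```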


### Package resubmission

```{r archived-type}
#| fig-cap: "**Figure: Resubmission process by date of being archived.** Events in black
#|  are those for which we cannot tell whether they were never resubmitted or were
#|  rejected. The dashed gray lines are the dates of R minor releases."

rver2 <- rver |> 
  filter(date >= min(archived$date)) |> 
  mutate(version = substr(version, 1, 4),
         version = sub("\\.$", "", version))
archived_all_versions |> 
  # summarise(.by = c(archived, back_on_cran),
  #           n = n()) |> 
  group_by(month = floor_date(archived, unit = "month"),
           back_on_cran) |> 
  summarise(n = n()) |> 
  ggplot() +
  geom_vline(xintercept = rver$date, linetype = 2, col = "darkgray") +
  # annotate("text", 
  #          x = rver2$date, 
  #          y = 210, 
  #          label = rver2$version,
  #          col = "darkgray", 
  #          hjust = 0, vjust = runif(nrow(rver2))
  #          ) +
  geom_col(aes(month, n, fill = back_on_cran, col = back_on_cran)) +
  geom_text(aes(x = date, y = 210, label = version), col = "darkgray", data = rver2,
            position = position_jitter()) +
  # facet_zoom(x = archived > (Sys.Date() - 30), ylim = c(0, 125), 
             # zoom.size = 1, horizontal = FALSE, show.area = FALSE) +
  scale_x_date(expand = expansion(), date_labels = "%Y", date_breaks = "1 year") +
  scale_y_continuous(expand = expansion(), limits = c(0, 230)) +
  scale_fill_manual(values = legend) +
  scale_color_manual(values = legend) +
  labs(y = "Events", x = "Date", title = "Packages archived",
       fill = "Resubmission process", col = "Resubmission process") +
  theme_minimal() +
  theme(legend.direction = "horizontal", legend.position = "bottom")
```

Notice that some packages submitted a new version to CRAN after `r start_archive` but were archived long before (those that are in red before that date).
The maintainers of those packages might need help to get them back on CRAN.

```{r archived2resubmit}
#| fig-cap: "**Figure: Submissions are slightly faster for those that are accepted.** Percentage of packages that are submitted before a given time."
resubm_t <- archived_all_versions |> 
  filter(!is.na(delay_submission)) |> 
  ggplot() +
  stat_ecdf(aes(delay_submission, col = back_on_cran)) +
  scale_color_manual(values = legend) +
  scale_y_continuous(labels = label_percent(), expand = expansion(c(0, 0.005), c(0, 0.005))) +
  scale_x_continuous(expand = expansion()) +
  theme_minimal() +
  labs(col = "Resubmission?", 
       y = "Events %",
       x = "From being archived to resubmit (weeks)",
       title = "How fast are new packages re-submitted to CRAN?") +
  theme(plot.title.position = "plot",
        legend.position = "inside", legend.position.inside = c(0.8, 0.8),
        legend.box.background = element_rect(color = "white"))
resubm_t_focus <- resubm_t +
  coord_cartesian(xlim = c(0, 90), ylim = c(0, .60)) +
  theme_minimal() + 
  labs(x = "days", y = element_blank(), title = "First 90 days") +
  theme(plot.background = element_rect(), plot.title.position = "plot")

resubm_t + 
  inset_element(resubm_t_focus, 0.2, 0.01, 1, 0.7)
```
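
The curves in this figure are empirical cumulative distribution functions: for each delay, the share of events resubmitted within that time. A minimal sketch with base R's `ecdf()` and made-up delays (the `delay_days` values are hypothetical):

```r
# Hypothetical delays, in days, between archival and resubmission
delay_days <- c(3, 10, 14, 21, 30, 45, 90, 200)

# ecdf() returns a function giving the share of values <= its argument
resubmitted_by <- ecdf(delay_days)

# Share of resubmissions made within 30 days and within 90 days
resubmitted_by(c(30, 90))
```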


# Slowdown due to resubmission

```{r resubmissions}
# We only have data of submissions since 2020-09-12:
# min(cdh$snapshot_time) === start_archive
sample_size <- sum(!archived_all$before_registry)/nrow(archived_all)
# `% submitted` = events/sum(events[!before_registry & new_submission]),
#          `% submitted` = if_else(!(!before_registry & new_submission), NA, `% submitted`),
#          `% submitted` = scales::percent(`% submitted`)
archived_all_versions |> 
  filter(!package %in% pkg_failing_qc) |> 
  # filter(archived >= as.Date("2020-09-12")) |> 
  count(back_on_cran, before_registry, sort = TRUE,
        name = "events") |> 
  filter(!before_registry) |> 
  mutate(`%` = scales::percent(events/sum(events))) |> 
  mutate(submitted = !back_on_cran %in% c("Never resubmitted")) |> 
  mutate(.by = submitted,
         `% submitted` = scales::percent(events/sum(events))) |> 
  mutate(`% submitted` = if_else(!submitted, NA, `% submitted`)) |> 
  select(-submitted, -before_registry) |> 
  knitr::kable(align = "c", col.names = c("Back to CRAN?", "Events", "%", "% submitted"))
```

Based on the latest data available, which covers `r percent(sample_size)` of the archived packages, slightly more than half of the packages try to get back to CRAN.
Almost all of those that try eventually get back to CRAN.

But how fast is the process of being back on CRAN?

```{r speed-resubmission}
#| fig-cap: "**Figure: Resubmission delay: % of time spent in review after submitting a new version until it is accepted.** Only includes packages that were accepted at least one day after being submitted (to avoid dividing by 0)."
subm_rel <- archived_all_versions |> 
  filter(!is.na(initial_resubmission) & !is.na(accepted),
         delay_acceptance > 0) |> 
  mutate(time_rel = as.numeric(delay_acceptance)/as.numeric(delay_accepted),
         time_perc = scales::percent(time_rel)) |> 
  select(package, archived, initial_resubmission, accepted, 
         delay_acceptance, delay_accepted,
         time_rel, time_perc)
p_subm_rel <- subm_rel |> 
  ggplot() +
  geom_histogram(aes(time_rel), fill = "green", bins = 100) +
  scale_x_continuous(labels = scales::label_percent(), expand = expansion()) +
  scale_y_continuous(expand = expansion(), sec.axis = dup_axis(name = element_blank())) +
  labs(title = "Percentage of time to be back on CRAN spent in review",
       y = "Events", x = "Time since resubmission to be back on CRAN") +
  theme_minimal()
p_subm_rel
subm_rel_n <- subm_rel |> 
  count(tim = time_rel > 0.5) |> 
  mutate(n_rel = n/sum(n)) |> 
  filter(tim) |> 
  pull(n_rel)
```

Most packages that get back to CRAN are accepted soon after submitting the new version that fixes the problems that led to them being archived.
But some (~ `r percent(subm_rel_n)`) spend most of the time trying to pass checks to comply with CRAN policies: maybe fixing issues detected on submission or waiting for feedback from the CRAN maintainers.
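
To make the ratio concrete, here is a worked example for a single hypothetical package. It assumes that `delay_accepted` is the time from archival to acceptance and that `delay_acceptance` is the time from the first resubmission to acceptance; the dates are made up.

```r
# Hypothetical dates for one archived package
archived             <- as.Date("2023-01-10")
initial_resubmission <- as.Date("2023-02-20")
accepted             <- as.Date("2023-03-02")

# Total time off CRAN and time spent in review (both in days)
delay_accepted   <- accepted - archived
delay_acceptance <- accepted - initial_resubmission

# Share of the time off CRAN that was spent in review
time_rel <- as.numeric(delay_acceptance) / as.numeric(delay_accepted)
time_rel
```

Here the package was off CRAN for 51 days and in review for 10 of them, so roughly a fifth of the time was spent in review. A value above 0.5 would place it among the packages that spend most of the time in review.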

### Packages not addressed in time

```{r history-failed}
multiple_archived <- archived_all_versions |> 
  group_by(package) |> 
  filter(any(not_fixed)) |> 
  ungroup() |> 
  count(mult = time_archived > 1) |> 
  filter(mult == TRUE) |> 
  pull(n)
```


Packages are archived because their problems are not addressed or corrected in time.
If we look at these packages in more detail, we see that `r length(unique(archived_all_versions$package[archived_all_versions$not_fixed]))` packages failed to correct their problems in time. 
Most of them were archived once, but some of them (`r multiple_archived`) were archived multiple times.


```{r failed-check-archived}
#| fig-cap: "**Figure: Packages archived because problems were not fixed in time are mostly back.**
#|  Packages that got archived because maintainers couldn't fix them in time mostly got back on CRAN."

archived_all_versions |> 
  filter(not_fixed) |> 
  ggplot() +
  geom_histogram(aes(archived, fill = back_on_cran), bins = 8*12) +
  scale_fill_manual(values = legend) +
  scale_y_continuous(expand = expansion(add = c(0, NA), mult = c(0, NA)), 
                     sec.axis = dup_axis(name = element_blank())) +
  scale_x_date(date_breaks = "6 months", date_labels = "%Y-%m", 
               expand = expansion(add = NA_integer_, mult = 0)) +
  labs(y = "Events", x = element_blank(), fill = "Back on CRAN?",
       title = "Packages with failed checks go back to CRAN?") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90), 
        panel.grid.minor.x = element_blank(),
        legend.position = "inside", legend.position.inside = c(0.2, 0.75),
        legend.background = element_rect()) 
```


```{r tbl-not-fixed}
#| tbl-cap: "**Table: Tally of reasons for packages being archived**. Many packages are 
#|  archived because problems were not fixed or because the maintainer address fails."
archived_all_versions |> 
  filter(!is.na(not_fixed) & !requested) |>
  count(not_fixed, dependency = !is.na(dp),
        maintainer_address, policy_violation, sort = TRUE) |> 
  knitr::kable(align = "c", col.names = c("Not fixed", "Dependencies", "Maintainer address", "Policy violation", "Packages"))
```

The most common cause of archiving is that problems in the package were not fixed.
The second cause is not clear, as it seems to be a mix of circumstances and difficulties parsing the cause. 
The third cause is that a package it depends on was archived, and the fourth is that the package didn't comply with CRAN's policy.
The fifth most common reason is that the email address of the maintainer fails to receive emails.

#### Linked to R-releases?

```{r r-versions}
#| fig-cap: "**Figure: More packages are archived the closer the release date is.** 
#|  On the left side, packages archived long before the next R minor release; on the 
#|  right side, packages archived closer to the R minor release, colored by their 
#|  submission process after being archived."
archived_all_versions |> 
  ggplot() +
  geom_histogram(aes(abs(time_before_next), fill = back_on_cran), 
                 col = "lightgray", bins = 52) +
  theme_minimal() +
  scale_y_continuous(expand = expansion(add = 0, mult = c(0, NA)), 
                     sec.axis = dup_axis(name = element_blank())) +
  scale_x_continuous(expand = expansion(), transform = "reverse") +
  annotate("text", x = 0, y = 350, label = "R release", hjust = 1,
           vjust = 0) +
  scale_fill_manual(values = legend) +
  labs(y = "Archived packages", fill = "Submitted to CRAN?", 
       x = "Time before next release (weeks)",
       title = "Packages archived with failing text before next release") +
  theme(legend.position = "inside", legend.position.inside = c(0.5, 0.79),
        plot.title.position = "plot", legend.background = element_rect())
```


```{r r-versions2}
#| fig-cap: "**Figure: Packages are usually archived weeks after a release.**
#|  On the left side, packages archived right after an R minor release; on the 
#|  right side, packages archived long after an R minor release, colored by their 
#|  submission process after being archived."
archived_all_versions |> 
  ggplot() +
  geom_histogram(aes(time_since_rel, fill = back_on_cran), col = "lightgray", bins = 52) +
  scale_y_continuous(expand = expansion(add = 0, mult = c(0, NA)),
                     sec.axis = dup_axis()) +
  scale_x_continuous(expand = expansion()) +
  scale_fill_manual(values = legend) + 
  annotate("text", x = 0, y = 450, label = "R release", vjust = 0, hjust = 0) +
  labs(y = "Archived packages", fill = "Submitted to CRAN?", 
       x = "Time since previous release (weeks)",
       title = "Packages archived failing since last release") +
  theme_minimal() +
  theme(legend.position = "inside", legend.position.inside = c(0.5, 0.79),
        plot.title.position = "plot", legend.background = element_rect()) 
```

```{r r-versions3}
#| fig-cap: "**Figure: Packages tend to be archived right before a release or 
#|  after it but not in the middle of releases.** On the left side, packages 
#|  archived after an R minor release; on the right side, packages archived before 
#|  the next R minor release, colored by their submission process after being 
#|  archived."
ggplot(archived_all_versions) +
  geom_histogram(aes(time_before_next + time_since_rel, fill = back_on_cran), 
                 col = "lightgray", binwidth = 4) +
  annotate("text", x = c(-54, 54), y = 610, 
           label = c("Next release", "Previous release"),
           vjust = 0, hjust = c(1, 0)) +
  labs(y = "Packages", x = "Time to R release (weeks)", fill = "Submitted to CRAN?",
       title = "Packages archived in relation to R releases") +
  scale_x_continuous(expand = expansion(), transform = "reverse") +
  scale_y_continuous(expand = expansion(add = c(0, NA), mult = c(0, NA))) +
  scale_fill_manual(values = legend) + 
  theme_minimal() +
  theme(legend.position = "inside", legend.position.inside = c(0.5, 0.75),
        plot.title.position = "plot", legend.background = element_rect())
```
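
The horizontal axes in the figures above measure the distance, in weeks, between an archival date and the surrounding R releases. The sketch below shows one way to compute such distances with base R. The release dates are hard-coded for illustration rather than read from `rver`, the archival dates are made up, and the sign conventions of the real `time_since_rel` and `time_before_next` columns may differ.

```r
# Illustrative R minor release dates (assumed here, not taken from `rver`)
releases <- as.Date(c("2022-04-22", "2023-04-21", "2024-04-24"))

# Hypothetical archival dates
archived <- as.Date(c("2022-05-06", "2023-04-07", "2023-10-20"))

# Index of the latest release on or before each archival date
prev <- findInterval(as.numeric(archived), as.numeric(releases))

# Weeks since the previous release and weeks until the next one
time_since_rel   <- as.numeric(archived - releases[prev]) / 7
time_before_next <- as.numeric(releases[prev + 1] - archived) / 7

data.frame(archived, time_since_rel, time_before_next)
```

The first package was archived 2 weeks after a release, the second 2 weeks before the next one, and the third roughly midway between two releases.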

Only in 2022 and 2023 has there been a clear trend: packages were archived closer to the next release and shortly after a release, respectively:

```{r r-versions4}
#| fig-cap: "**Figure: Trend by year of packages being archived.** The closer to the left, the sooner after a release they are archived; the further to the right, the closer they are archived before an R release."
ggplot(archived_all_versions) +
  geom_histogram(aes(time_before_next + time_since_rel), 
                 binwidth = 4) +
  labs(y = "Packages", x = "Time to R release (weeks)", fill = "Submitted to CRAN?",
       title = "Packages archived in relation to R releases") +
  scale_x_continuous(expand = expansion(), transform = "reverse") +
  scale_y_continuous(expand = expansion(add = c(0, NA), mult = c(0, NA))) +
  scale_fill_manual(values = legend) + 
  facet_wrap(~year(archived), scales = "free_y") +
  theme_minimal() +
  theme(
    # legend.position = "inside", legend.position.inside = c(0.5, 0.75),
    plot.title.position = "plot", legend.background = element_rect(),
    legend.position = "bottom", legend.direction = "horizontal")
```


#### Age of archival

```{r aged}
#| fig-cap: "**Figure: Time since being accepted to being archived.** Most 
#|  packages are archived soon after being accepted. There are packages that keep 
#|  the initial version for several years without problems."
# Some are simply impossible:
# time_since_first shouldn't be negative. DirichletReg, CDS, AgroR: the archives are missing versions
# Probably there is a typo in the date DirichletReg could have been archived in 2020-05-03 instead of 2010-05-03 and unarchived on 2020-05-29 (27 days later instead of 10)
impossible_dates <- archived_all_versions |> 
  filter(time_since_first_v <= 0 & time_since_previous > 50) |> 
  pull(package) |> 
  setdiff(pkg_failing_qc)
# DirichletReg WhopGenome permGPU IOHanalyzer NACHO testextra hpa transfR symengine TDA AgroR Matching
archived_all_versions |> 
  filter(!package %in% c(impossible_dates, pkg_failing_qc),
         !is.na(time_since_previous)) |> 
  ggplot() +
  geom_count(aes(as.numeric(time_since_first_v), as.numeric(time_since_previous))) +
  theme_minimal() +
  labs(x = "Time since first version (weeks)", y = "Time since previous release (weeks)",
       size = "Packages")
```


```{r aged2}
#| fig-cap: "**Figure: Time since being accepted for the first time to being 
#|  archived.** Most packages are archived soon after being accepted for the 
#|  first time."
p_archived <- archived_all_versions |> 
  filter(!package %in% c(impossible_dates, pkg_failing_qc),
         !is.na(time_since_previous)) |> 
  ggplot() +
  geom_histogram(aes(time_since_first_v), bins = 52*23) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 1200, 52)) +
  scale_y_continuous(expand = expansion(), breaks = seq(0, 1200, 50)) +
  labs(x = "Time since first accepted version to being archived (weeks)",
       y = "Events",
       title = "Most packages are archived shortly after being accepted") +
  theme_minimal()

p_archived_zoom <- p_archived +
  coord_cartesian(xlim = c(0, 52)) +
  scale_x_continuous(expand = expansion(), breaks = seq(1, 52, 4)) +
  theme(plot.background = element_rect()) +
  labs(title = element_blank())
p_archived +
  inset_element(p_archived_zoom, 0.1, 0.2, 1, 1)
```

Packages are often archived shortly after being first accepted. 
There is a peak of archived packages 2 weeks after acceptance, but there are also packages archived before the usual 2-week period to fix issues.
Passing the first month seems critical for packages, as the rates later on seem more stable. 

```{r aged-public}
#| fig-cap: "**Figure: Time since last version till the package is archived.** Most 
#|  packages are archived shortly after a release, which might indicate that 
#|  problems are only found after being accepted via additional checks not 
#|  available to maintainers."
tsp <- archived_all_versions |> 
  filter(!package %in% c(impossible_dates, pkg_failing_qc)) |> 
  ggplot() +
  geom_histogram(aes(time_since_previous, group = version, fill = version), bins = 52*27) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 1200, 52)) +
  scale_y_continuous(expand = expansion(), breaks = seq(0, 1200, 50)) +
  theme_minimal()
tsp_inset <- tsp +
  coord_cartesian(xlim = c(0, 52)) +
  scale_x_continuous(expand = expansion(), breaks = seq(0, 52, 4)) +
    theme(plot.background = element_rect())

tsp + 
  labs(title = "Time since acceptance till archived") +
  inset_element(tsp_inset, 0.15, 0.1, 1, 1) & 
  labs(x = "Time since last version (weeks)") &
  plot_layout(guides = "collect")
```

Packages that were already on CRAN are also archived soon after a new release. 
This matches the trend for packages accepted for the first time.
If anything, the tendency to archive packages soon after being accepted is stronger. 

```{r aged-public1}
#| fig-cap: "**Figure: Age of packages being archived.** Older packages are increasingly being 
#|  archived but not at the same rate as time passes."
archived_all_versions |> 
  filter(!package %in% c(impossible_dates, pkg_failing_qc)) |> 
  ggplot(aes(archived, time_since_first_v)) +
  geom_vline(xintercept = rver$date, linetype = 2, col = "darkgray") +
  geom_text(aes(x = date, y = 1050, label = version), col = "darkgray", data = rver2,
            position = position_jitter()) +
  geom_count(aes(col = time_since_previous)) +
  geom_smooth(span = 0.1, method = "loess") +
  scale_x_date(expand = expansion()) +
  scale_y_continuous(expand = expansion(), 
                     sec.axis = sec_axis(transform = ~./52, name = "~Years",
                                         breaks = seq(0, 24, by = 2))) +
  theme_minimal() +
  labs(x = "Date a package was archived", y = "Time since inclusion in CRAN (weeks)",
       size = "Events", col = "Since prev. (w)",
       title = "Age of packages archived")
```

Age of packages archived is increasing, sometimes changes in TODO

```{r aged-public2}
#| fig-cap: "**Figure: Time since previous releases of packages being archived.** Almost
#|  a constant rate except at a few specific moments."
archived_all_versions |> 
  filter(!package %in% c(impossible_dates, pkg_failing_qc)) |> 
  ggplot() +
  geom_vline(xintercept = rver$date, linetype = 2, col = "darkgray") +
  geom_text(aes(x = date, y = 800, label = version), col = "darkgray", data = rver2,
            position = position_jitter()) +
  geom_count(aes(archived, time_since_previous, col = time_since_previous)) +
  geom_smooth(aes(archived, time_since_previous), method = "loess", span = 0.1) +
  scale_color_continuous(transform = "reverse") +
  scale_x_date(expand = expansion()) +
  scale_y_continuous(expand = expansion(), 
                     sec.axis = sec_axis(transform = ~./52, name = "~Years",
                                         breaks = seq(0, 14, by = 2))) +
  theme_minimal() +
  labs(x = "Date a package was archived",
       y = "Time since previous release (weeks)",
       size = "Events",
       col = "Since prev. (w)",
       title = "Time since previous release") 
```

Archived packages had been updated at an almost constant rate before being archived. 

### Archived because of dependencies on other packages

When a package is going to be archived, CRAN sends an email to the maintainer of the package in trouble and to the maintainers of all the packages that depend on it. 
This often results in people stepping up and fixing the package.
When this doesn't happen, the dependent packages will be archived together with their dependency.

```{r depending-on}
#| tbl-cap: "**Table: Dependencies impact**. Most packages archived result in another package archived."
count_deps <- archived_all_versions |> 
  filter(!is.na(dp)) |> 
  count(dp, back_on_cran, sort = TRUE) |> 
  mutate(.by = dp, 
         `Affected packages` = sum(n))
count_deps |> 
  summarise(.by = `Affected packages`, 
            Times = n()) |> 
  arrange(`Affected packages`) |> 
  knitr::kable(align = "c")
```


The dependency that affected the most packages led to 18 packages being archived.
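
The tally in the table above can be sketched with base R on made-up data. The `toy_events` data frame is hypothetical; it only assumes, as in `archived_all_versions`, a `dp` column naming the archived dependency that caused the event (`NA` when the archival was unrelated to a dependency).

```r
# Hypothetical archival events (package and dependency names are made up)
toy_events <- data.frame(
  package = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  dp      = c("depX", "depX", "depX", "depY", NA, "depZ")
)

# Packages archived because of each dependency (table() drops NA by default)
affected <- table(toy_events$dp)

# How many dependencies took down 1, 2, 3, ... packages
times <- table(as.vector(affected))
times

# Largest number of packages archived because of a single dependency
max(affected)
```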

```{r depending-on2}
#| fig-cap: "**Figure: Archived packages due to a dependency often come back.** In 
#|  absolute numbers (left) and in percentage (right). Many packages that are 
#|  archived result in another package being archived, which are usually back to
#|  CRAN."
p_deps <- count_deps |> 
  ggplot() +
  geom_col(aes(`Affected packages`, n, fill = back_on_cran)) +
  scale_fill_manual(values = legend) +
  theme_minimal() +
  scale_y_continuous(expand = expansion(add = c(0, NA), mult = c(0, NA))) +
  scale_x_continuous( expand = expansion()) +
  labs(x = "Dependencies affected", y = "Events", fill = "Submitted to CRAN?",
       title = "Missing dependencies") +
  theme(panel.grid.minor.x = element_blank(), legend.position.inside = c(0.8, 0.8),
        legend.position = "inside", legend.background = element_rect())
p_deps_rel <- count_deps |> 
  mutate(.by = `Affected packages`,
       rel = n/sum(n)) |> 
  ggplot() +
  geom_col(aes(`Affected packages`, rel, fill = back_on_cran), show.legend = FALSE) +
  scale_fill_manual(values = legend) +
  scale_x_continuous(expand = expansion()) +
  scale_y_continuous(labels = label_percent(), expand = expansion()) +
  theme_minimal() +
  labs(x = "Dependencies affected", y = "Percentage of packages affected", 
       fill = "Resubmission process",
       title = "Percentage of packages")
p_deps + p_deps_rel
```

Packages that were archived because of a dependency were mostly back on CRAN.

### Maintainers

#### Failing email

Sometimes the problem is with the maintainer's email.

```{r maintainers-emails}
#| fig-cap: "**Figure: Packages with an unresponsive maintainer address are archived 
#|  later after the last version.** The line shows a smoothed fit of these two variables."
archived_all_versions |> 
  filter(maintainer_address) |> 
  ggplot(aes(archived, time_since_previous)) +
  geom_count() +
  geom_smooth() +
  scale_x_date(expand = expansion()) +
  scale_size(range = c(2, 6)) +
  scale_y_continuous(expand = expansion(), 
                     sec.axis = sec_axis(transform = ~./52, name = "~Years",
                                         breaks = seq(0, 14, by = 2))) +
  theme_minimal() +
  labs(x = "Date package was archived",
       y = "Time since previous version (weeks)",
       size = "Events",
       title = "Maintainers keep their emails")
```

The time between the last version and being archived because of a failing email is increasing, which seems to indicate that maintainers are now more careful with the email address they provide.