diff --git a/_posts/oaire_graph_2020/oaire_graph_post.Rmd b/_posts/oaire_graph_2020/oaire_graph_post.Rmd index a5e7aae..17ac302 100644 --- a/_posts/oaire_graph_2020/oaire_graph_post.Rmd +++ b/_posts/oaire_graph_2020/oaire_graph_post.Rmd @@ -2,12 +2,12 @@ title: "Accessing and analysing the OpenAIRE Research Graph data dumps" description: | The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020. -draft: true author: - name: Najko Jahn url: https://twitter.com/najkoja affiliation: State and University Library Göttingen affiliation_url: https://www.sub.uni-goettingen.de/ +date: "`r Sys.Date()`" output: distill::distill_article bibliography: literature.bib resources: @@ -145,6 +145,8 @@ In this use case, I will illustrate how to make use of the OpenAIRE Research Gra As a start, I load a dataset, which was compiled following the above-described methods using the whole `h2020_results.gz` dump. + + ```{r} oaire_df <- jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>% diff --git a/_posts/oaire_graph_2020/oaire_graph_post.html b/_posts/oaire_graph_2020/oaire_graph_post.html index f2d75a2..aaed58c 100644 --- a/_posts/oaire_graph_2020/oaire_graph_post.html +++ b/_posts/oaire_graph_2020/oaire_graph_post.html @@ -25,8 +25,8 @@ - - + + @@ -55,7 +55,7 @@ @@ -5092,7 +5092,7 @@

This article is in review.

@@ -5109,7 +5109,7 @@

Accessing and analysing the OpenAIRE Research Graph data dumps

Najko Jahn https://twitter.com/najkoja (State and University Library Göttingen)https://www.sub.uni-goettingen.de/ -
04-02-2020 +
2020-04-07
@@ -5202,7 +5202,7 @@

Parsing OpenAIRE Research Graph out }) toc() -#> 30.559 sec elapsed +#> 42.859 sec elapsed oaire_df <- dplyr::bind_rows(oaire_data)

A note on performance: Parsing the whole dump h2020_results using these parsers took me around 2 hours on my MacBook Pro (Early 2015, 2.9 GHz Intel Core i5, 8 GB RAM, 256 GB SSD). I therefore recommend backing up the resulting data instead of unpacking the whole dump for each analysis. jsonlite::stream_out() writes the data frame to a text-based JSON file, where list-columns are preserved per row.

@@ -5218,6 +5218,9 @@

As a start, I load a dataset, which was compiled following the above-described methods using the whole h2020_results.gz dump.

+

 oaire_df <-
@@ -5437,14 +5440,14 @@ 

-
- +
+

Figure 3: Open Access Compliance Rates of Horizon 2020 projects affiliated with the University of Göttingen (purple dots) relative to the overall performance of the funding activity, visualised as a box plot. Only projects with at least five publications were considered. Data: OpenAIRE Research Graph(Manghi, Atzori, et al. 2019)

-

Figure 3 shows that many H2020-projects with University of Göttingen participation have an uptake of open access to grant-supported publications that is above the average in the peer group. At the same time, some perform below expectation. Together, this provides a valuable insight into open access compliance at the university-level, especially for research support librarians who are in charge of helping grantees to make their work open access. They can, for instance, point grantees to OpenAIRE-compliant repositoires for self-archiving their works. # How does knowing how projects compare with others funded by the same institutions help to help grantees make their own work open access? To my knowledge the availability of outlets of acceptable quality for publication is highly field specific and I don’t really see how the funder comes into play, unless funders only fund certain fields. # NJ: self-archiving is also possible to comply with the EC’s oa mandate, added a sentence

+

Figure 3 shows that many H2020 projects with University of Göttingen participation have an uptake of open access to grant-supported publications that is above the average in the peer group. At the same time, some perform below expectation. Together, this provides a valuable insight into open access compliance at the university level, especially for research support librarians who are in charge of helping grantees to make their work open access. They can, for instance, point grantees to OpenAIRE-compliant repositories for self-archiving their works.

Discussion and conclusion

Using data from the OpenAIRE Research Graph dumps makes it possible to put the results of a specific data analysis into context. Open access compliance rates of H2020 projects vary. These variations should be considered when reporting compliance rates of specific projects under the same open access mandate.

Although the OpenAIRE Research Graph is a large collection of scholarly data, it is likely that it still does not provide the whole picture. OpenAIRE mainly collects data from open sources. It is still unknown how the OpenAIRE Research Graph compares to well-established toll-access bibliometrics data sources like the Web of Science in terms of coverage and data quality.

diff --git a/docs/index.html b/docs/index.html index 4752492..3f9d294 100644 --- a/docs/index.html +++ b/docs/index.html @@ -21,36 +21,36 @@ Scholarly Communication Analytics: Blog | Scholarly Communication Analytics with R - - + + - - + + - + - + - + - + - + - + - + - + - + @@ -1268,22 +1268,22 @@ font-size: 15px; font-weight: 300; } - + .distill-site-nav a { color: inherit; text-decoration: none; } - + .distill-site-nav a:hover { color: white; } - + .distill-site-header { } - + .distill-site-footer { } - + @media print { .distill-site-nav { display: none; @@ -1644,6 +1644,19 @@

Blog | Scholarly Communication Analytics with R

+ + + +
+ +
+
+

Accessing and analysing the OpenAIRE Research Graph data dumps

+

The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020.

+
+
+ diff --git a/docs/index.xml b/docs/index.xml index 6dcd353..f5fe1aa 100644 --- a/docs/index.xml +++ b/docs/index.xml @@ -11,14 +11,23 @@ to publish case-studies rapidely showing how to support data-driven workflows an decision-making around scholarly communication in libraries using R. Distill - Mo, 30 Mär 2020 00:00:00 +0000 + Tue, 07 Apr 2020 00:00:00 +0000 + + Accessing and analysing the OpenAIRE Research Graph data dumps + Najko Jahn + https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020 + The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020. + https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020 + Tue, 07 Apr 2020 00:00:00 +0000 + + Exploring the Open Access Evidence base in Unpaywall with Python Nick Haupka https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_python Open Access evidence sources constantly change. In this blog post, I present a Python based approach for analysing the most recent snapshots from the open access discovery service Unpaywall. Results shows a growth in open access content, partly because of newly introduced evidence sources like Semantic Scholar. https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_python - Mo, 30 Mär 2020 00:00:00 +0000 + Mon, 30 Mar 2020 00:00:00 +0000 @@ -27,7 +36,7 @@ decision-making around scholarly communication in libraries using R. https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice Publishers rarely make publication fee spending for hybrid journals transparent. 
Elsevier is a remarkable exception, as the publisher provides open and machine-readable data relative to its central invoicing with funding bodies and fee waivers at the article level. This blogpost illustrates how to mine Elsevier full-texts for these data with the data science tool R and presents new insights by analysing the resulting dataset: of 70,657 articles published open access in 1,753 hybrid journals from 2015 to date, around one third of the publication fees were paid through central agreements. Nevertheless, the majority of funding sources for hybrid open access remains unclear. https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice - Mo, 25 Nov 2019 00:00:00 +0000 + Mon, 25 Nov 2019 00:00:00 +0000 @@ -36,7 +45,7 @@ decision-making around scholarly communication in libraries using R. https://subugoe.github.io/scholcomm_analytics/posts/datacite_graph The PID Graph from DataCite interlinks persistent identifiers (PID) in research. In this blog post, I will present how to interface this graph using the DataCite GraphQL API with R. To illustrate it, I will visualise the research information network of a person. https://subugoe.github.io/scholcomm_analytics/posts/datacite_graph - Do, 24 Okt 2019 00:00:00 +0000 + Thu, 24 Oct 2019 00:00:00 +0000 @@ -46,7 +55,7 @@ decision-making around scholarly communication in libraries using R. https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence We investigated more than 31 million scholarly journal articles published between 2008 and 2018 that are indexed in Unpaywall, a widely used open access discovery tool. Using Google BigQuery and R, we determined over 11.6 million journal articles with open access full-text links in Unpaywall, corresponding to an open access share of 37 %. 
Our data analysis revealed various open access location and evidence types, as well as large overlaps between them, raising important questions about how to responsibly re-use Unpaywall data in bibliometric research and open access monitoring. https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence - Di, 07 Mai 2019 00:00:00 +0000 + Tue, 07 May 2019 00:00:00 +0000 diff --git a/docs/posts/oaire_graph_2020/distill-preview.png b/docs/posts/oaire_graph_2020/distill-preview.png new file mode 100644 index 0000000..83a5204 Binary files /dev/null and b/docs/posts/oaire_graph_2020/distill-preview.png differ diff --git a/docs/posts/oaire_graph_2020/index.html b/docs/posts/oaire_graph_2020/index.html new file mode 100644 index 0000000..a6788ce --- /dev/null +++ b/docs/posts/oaire_graph_2020/index.html @@ -0,0 +1,6068 @@ + + + + + + + + + + + + + + + + +Scholarly Communication Analytics: Accessing and analysing the OpenAIRE Research Graph data dumps + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

Accessing and analysing the OpenAIRE Research Graph data dumps

+

The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020.

+
+ + + +
+
+ +
+

OpenAIRE has collected and interlinked scholarly data from various openly available sources for over ten years. In December 2019, this open science network released the OpenAIRE Research Graph(Manghi, Atzori, et al. 2019), a big scholarly data dump that contains metadata about more than 100 million research publications and 8 million datasets, as well as the relationships between them. These metadata are furthermore connected to open access locations and disambiguated information about persons, organisations and funders.

+

Like most big scholarly data dumps, the OpenAIRE Research Graph offers many data analytics opportunities, but working with it is challenging. One reason is the size of the dump. Although the OpenAIRE Research Graph is already split into several files, most of these data files are too large to fit into the memory of a moderately equipped laptop when imported directly into computing environments like R. Another challenge is the format. The dump consists of compressed XML files following the comprehensive OpenAIRE data model (Manghi, Bardi, et al. 2019), from which only certain elements may be needed for a specific data analysis.

+

In this blog post, I introduce the R package openairegraph, an experimental effort that helps to transform the large OpenAIRE Research Graph dumps into small, relevant datasets for analysis. These tools are aimed at data analysts and researchers alike who wish to conduct their own analysis using the OpenAIRE Research Graph, but are wary of handling its large data dumps. Focusing on grant-supported research results from the European Commission’s Horizon 2020 framework programme (H2020), I present how to subset and analyse the graph using openairegraph. My analytical use case is to benchmark the open access activities of grant-supported projects affiliated with the University of Göttingen against the overall uptake across the H2020 funding activities.

+

What is the R package openairegraph about?

+

So far, the R package openairegraph, which is available on GitHub as a development version, has two sets of functions. The first set provides helpers to split a large OpenAIRE Research Graph data dump into separate, de-coded XML records that can be stored individually. The other set consists of parsers that convert data from these XML files to a table-like representation following the tidyverse philosophy, a popular approach and toolset for doing data analysis with R (Wickham et al. 2019). Splitting, de-coding and parsing are essential steps before analysing the OpenAIRE Research Graph.

+

Installation

+

openairegraph can be installed from GitHub using the remotes(Hester et al. 2019) package:

+

+library(remotes)
+remotes::install_github("subugoe/openairegraph")
+

Loading a dump into R

+

Several dumps from the OpenAIRE Research Graph are available on Zenodo (Manghi, Atzori, et al. 2019). So far, I have tested openairegraph with the dump h2020_results.gz, which comprises research outputs funded by the European Commission’s Horizon 2020 funding programme (H2020).

+

After downloading it, the file can be imported into R using the jsonlite package (Ooms 2014). The following example shows that each line contains a record identifier and the corresponding Base64-encoded XML file. Base64 is a standard for representing binary data, such as compressed files, as plain text.

+
+

+library(jsonlite) # tools to work with json files
+library(tidyverse) # tools from the tidyverse useful for data analysis
+# download the file from Zenodo and store it locally
+oaire <- jsonlite::stream_in(file("data/h2020_results.gz"), verbose = FALSE) %>%
+  tibble::as_tibble()
+oaire
+#> # A tibble: 92,218 x 2
+#>    `_id`$`$oid`       body$`$binary`                          $`$type`
+#>    <chr>              <chr>                                   <chr>   
+#>  1 5dbc22f81e82127b5… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  2 5dbc22f9b531c546e… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  3 5dbc22fa45e3122d9… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  4 5dbc22fa45e3122d9… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  5 5dbc22fa4e0c061a4… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  6 5dbc22fb81f3c12c0… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  7 5dbc22fb895be1246… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  8 5dbc22fbe56570673… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#>  9 5dbc22fc81f3c12bf… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
+#> 10 5dbc22fcb531c546e… UEsDBBQACAgIAIZiYU8AAAAAAAAAAAAAAAAEAA… 00      
+#> # … with 92,208 more rows
+
+
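To make the encoding more tangible, here is a minimal sketch that decodes the first eight characters of the `body$binary` column printed above, using `jsonlite::base64_dec()`. The variable names are my own and only serve the illustration. The decoded bytes start with `PK`, the signature of a zip archive, which suggests that each record is a zipped file wrapped in Base64 text.

```r
library(jsonlite)

# first eight Base64 characters shared by the records printed above
encoded_prefix <- "UEsDBBQA"

# Base64 maps every four characters back to three raw bytes
decoded <- jsonlite::base64_dec(encoded_prefix)
decoded

# the first two bytes spell the zip signature
rawToChar(decoded[1:2])

# encoding the raw bytes again returns the original text
jsonlite::base64_enc(decoded)
```

The round trip shows that Base64 only changes the representation of the bytes. Any compression comes from the zip container inside it.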

De-coding and storing OpenAIRE Research Graph records

+

The function openairegraph::oarg_decode() splits and de-codes each record. Storing the records individually allows processing the files independently of each other, which is a common approach when working with big data.

+
+

+library(openairegraph)
+openairegraph::oarg_decode(oaire, records_path = "data/records/", 
+  limit = 500, verbose = FALSE)
+
+

openairegraph::oarg_decode() writes out each XML-formatted record as a zip file to a specified folder. Because the dumps are quite large, the function furthermore has a parameter that allows setting a limit, which is helpful for inspecting the output first. By default, a progress bar presents the current state of the process.

+

Parsing OpenAIRE Research Graph records

+

So far, there are four parsers available to consume the H2020 results set:

+
    +
  • openairegraph::oarg_publications_md() retrieves basic publication metadata complemented by author details and access status
  • +
  • openairegraph::oarg_linked_projects() parses grants linked to publications
  • +
  • openairegraph::oarg_linked_ftxt() gives full-text links including access information
  • +
  • openairegraph::oarg_linked_affiliations() parses affiliation data
  • +
+

These parsers can be used alone, or together like this:

+

First, I obtain the locations of the de-coded XML records.

+
+

+openaire_records <- list.files("data/records", full.names = TRUE)
+
+

After that, I read each XML file using the xml2(Wickham, Hester, and Ooms 2019) package, and apply three parsers: openairegraph::oarg_publications_md(), openairegraph::oarg_linked_projects() and openairegraph::oarg_linked_ftxt(). I use the future(Bengtsson 2020b) and future.apply(Bengtsson 2020a) packages to enable reading and parsing these records simultaneously with multiple R sessions. Running code in parallel reduces the execution time.

+ +
+

+library(xml2) # working with xml files
+library(future) # parallel computing
+library(future.apply) # functional programming with parallel computing
+library(tictoc) # timing functions
+
+openaire_records <- list.files("data/records", full.names = TRUE)
+
+future::plan(multisession)
+tic()
+oaire_data <- future.apply::future_lapply(openaire_records, function(files) {
+  # load xml file
+  doc <- xml2::read_xml(files)
+  # parser
+  out <- oarg_publications_md(doc)
+  out$linked_projects <- list(oarg_linked_projects(doc))
+  out$linked_ftxt <- list(oarg_linked_ftxt(doc))
+  # use file path as id
+  out$id <- files
+  out
+})
+toc()
+#> 42.859 sec elapsed
+oaire_df <- dplyr::bind_rows(oaire_data)
+
+
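The parsing loop above relies on list-columns: each parsed record is a one-row data frame, and the nested project table is wrapped in `list()` so that it fits into a single cell. The following sketch reproduces this pattern with toy data. The function `parse_toy_record()` and all values are hypothetical stand-ins, not part of openairegraph.

```r
library(tibble)
library(dplyr)
library(tidyr)

# hypothetical stand-in for a parser: returns one row per record
parse_toy_record <- function(record_id) {
  out <- tibble(id = record_id, type = "publication")
  # wrap the nested table in list() so it occupies a single cell
  out$linked_projects <- list(
    tibble(project_code = c("P1", "P2"), funder = "Toy funder")
  )
  out
}

toy_records <- lapply(c("rec_1", "rec_2"), parse_toy_record)

# one row per record, with the project table kept as a list-column
toy_df <- bind_rows(toy_records)
dim(toy_df)

# one row per record-project pair after unnesting
toy_long <- unnest(toy_df, linked_projects)
dim(toy_long)
```

Keeping one row per record makes it straightforward to combine the results of parallel workers with `dplyr::bind_rows()`, while `tidyr::unnest()` expands the nested tables only when they are needed.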

A note on performance: Parsing the whole dump h2020_results using these parsers took me around 2 hours on my MacBook Pro (Early 2015, 2.9 GHz Intel Core i5, 8 GB RAM, 256 GB SSD). I therefore recommend backing up the resulting data instead of unpacking the whole dump for each analysis. jsonlite::stream_out() writes the data frame to a text-based JSON file, where list-columns are preserved per row.

+
+

+jsonlite::stream_out(oaire_df, file("data/h2020_parsed_short.json"))
+#> 
+Processed 500 rows...
+Complete! Processed total of 500 rows.
+
+
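As a small illustration of this backup step, the sketch below writes a toy data frame with a list-column to a temporary file and reads it back. The data and object names are made up for this example. Only `jsonlite::stream_out()` and `jsonlite::stream_in()` are taken from the workflow above.

```r
library(jsonlite)

# toy data frame with a nested table per row (hypothetical values)
toy_df <- data.frame(id = c("a", "b"), stringsAsFactors = FALSE)
toy_df$linked_projects <- list(
  data.frame(project_code = c("P1", "P2"), stringsAsFactors = FALSE),
  data.frame(project_code = "P3", stringsAsFactors = FALSE)
)

# write one JSON document per row to a temporary file
backup_file <- tempfile(fileext = ".json")
jsonlite::stream_out(toy_df, file(backup_file), verbose = FALSE)

# each line of the file holds one record
length(readLines(backup_file))

# read the backup again; the nested tables come back as a list-column
restored <- jsonlite::stream_in(file(backup_file), verbose = FALSE)
str(restored, max.level = 1)
```

Because every row is stored on its own line, such a file can also be processed in chunks later, which helps when the parsed data grows large.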

Use case: Monitoring the Open Access Compliance across H2020 grant-supported projects at the institutional level

+

Usually, it is not individual researchers who sign grant agreements with the European Commission (EC), but the institution they are affiliated with. Universities and other research institutions hosting EC-funded projects are therefore looking for ways to monitor the institution’s overall compliance with funder rules. In the case of the open access mandate in Horizon 2020 (H2020), librarians are often assigned this task. Moreover, quantitative science studies have started to investigate the efficacy of funders’ open-access mandates (Larivière and Sugimoto 2018).

+

In this use case, I will illustrate how to make use of the OpenAIRE Research Graph, which links grants to publications and open access full-texts, to benchmark compliance with the open access mandate against other H2020 funding activities.

+

Overview

+

As a start, I load a dataset, which was compiled following the above-described methods using the whole h2020_results.gz dump.

+ +
+

+oaire_df <-
+  jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>%
+  tibble::as_tibble()
+
+

It contains 92,218 grant-supported research outputs. Here, I will focus on the prevalence of open access across H2020 projects using metadata about the open access status of a publication and related project information stored in the list-column linked_projects.

+
+

+pubs_projects <- oaire_df %>%
+  filter(type == "publication") %>%
+  select(id, type, best_access_right, linked_projects) %>%
+  # transform to a regular data frame with a row for each project
+  unnest(linked_projects) 
+
+

The dataset contains 84,781 literature publications from 9,008 H2020 projects. Which H2020 funding activity published the most?

+
+

+library(cowplot)
+library(scales)
+pubs_projects %>%
+  filter(funding_level_0 == "H2020") %>% 
+  mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
+  group_by(funding_scheme) %>%
+  summarise(n = n_distinct(id)) %>%
+  mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
+  mutate(highlight = ifelse(funding_scheme %in% c("ERC", "RIA"), "yes", "no")) %>%
+  ggplot(aes(reorder(funding_fct, n), n, fill = highlight)) +
+  geom_bar(stat = "identity") +
+  coord_flip() +
+  scale_fill_manual(
+    values = c("#B0B0B0D0", "#56B4E9D0"),
+    name = NULL) +
+  scale_y_continuous(
+    labels = scales::number_format(big.mark = ","),
+    expand = expansion(mult = c(0, 0.05)),
+    breaks =  scales::extended_breaks()(0:25000)
+    ) +
+  labs(x = NULL, y = "Publications", caption = "Data: OpenAIRE Research Graph") +
+  theme_minimal_vgrid(font_family = "Roboto") +
+  theme(legend.position = "none")
+
+Publication Output of Horizon 2020 funding activities captured by the OpenAIRE Research Graph, released in December 2019. +

+Figure 1: Publication Output of Horizon 2020 funding activities captured by the OpenAIRE Research Graph, released in December 2019. +

+
+
+

Figure 1 shows that most publications in the OpenAIRE Research Graph originate from the European Research Council (ERC), Research and Innovation Actions (RIA) and Marie Skłodowska-Curie Actions (MSCA). On average, 10 articles were published per project. However, the publication performance per H2020 funding activity varies considerably (SD = 33).

+

The European Commission mandates open access to publications. Let’s measure the compliance with this policy per project using the OpenAIRE Research Graph:

+
+

+library(rmarkdown)
+oa_monitor_ec <- pubs_projects %>%
+  filter(funding_level_0 == "H2020") %>%
+  mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
+  group_by(funding_scheme,
+           project_code,
+           project_acronym,
+           best_access_right) %>%
+  summarise(oa_n = n_distinct(id)) %>% # per pub
+  mutate(oa_prop = oa_n / sum(oa_n)) %>%
+  filter(best_access_right == "Open Access") %>%
+  ungroup() %>%
+  mutate(all_pub = as.integer(round(oa_n / oa_prop))) # round() guards against floating-point truncation
+rmarkdown::paged_table(oa_monitor_ec)
+
+ +
+
+
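The open access share per project can be traced with a toy example. The sketch below uses made-up publications and project codes. It follows the same steps as above, with the grouping behaviour of `summarise()` spelled out through the `.groups` argument, which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# hypothetical publications linked to three projects
toy_pubs <- tibble::tibble(
  project_code = c("P1", "P1", "P1", "P1", "P2", "P2", "P3"),
  id = c("a", "b", "c", "d", "e", "f", "g"),
  best_access_right = c(
    "Open Access", "Open Access", "Open Access", "Restricted",
    "Open Access", "Closed Access", "Restricted"
  )
)

toy_oa <- toy_pubs %>%
  group_by(project_code, best_access_right) %>%
  # count publications per project and access status;
  # dropping the last grouping level leaves the data grouped by project
  summarise(oa_n = n_distinct(id), .groups = "drop_last") %>%
  # share of each access status within a project
  mutate(oa_prop = oa_n / sum(oa_n)) %>%
  filter(best_access_right == "Open Access") %>%
  ungroup()
toy_oa
```

Project P1 ends up with an open access share of 75% and P2 with 50%. Note that P3, which has no open access publication at all, drops out after the filter step, so projects without any open access output are not part of such a summary.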

In the following, this aggregated data, oa_monitor_ec, will provide the basis to explore variations among and within H2020 funding programmes.

+
+

+oa_monitor_ec %>%
+  mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
+  # only projects with at least five publications
+  filter(all_pub >= 5) %>%
+  ggplot(aes(fct_rev(funding_fct), oa_prop)) +
+  geom_boxplot() +
+  geom_hline(aes(
+    yintercept = mean(oa_prop),
+    color = paste0("Mean=", as.character(round(
+      mean(oa_prop) * 100, 0
+    )), "%")
+  ),
+  linetype = "dashed",
+  size = 1) +
+  geom_hline(aes(
+    yintercept = median(oa_prop),
+    color = paste0("Median=", as.character(round(
+      median(oa_prop) * 100, 0
+    )), "%")
+  ),
+  linetype = "dashed",
+  size = 1) +
+  scale_color_manual("H2020 OA Compliance", values = c("orange", "darkred")) +
+  coord_flip() +
+  scale_y_continuous(labels = scales::percent_format(accuracy = 5L),
+                     expand = expansion(mult = c(0, 0.05))) +
+  labs(x = NULL,
+       y = "Open Access Percentage",
+       caption = "Data: OpenAIRE Research Graph") +
+  theme_minimal_vgrid(font_family = "Roboto") +
+  theme(legend.position = "top",
+        legend.justification = "right")
+
+Open Access Compliance Rates of Horizon 2020 projects relative to funding activities, visualised as box plot. Only projects with at least five publications are shown individually. +

+Figure 2: Open Access Compliance Rates of Horizon 2020 projects relative to funding activities, visualised as box plot. Only projects with at least five publications are shown individually. +

+
+
+

About 77% of research publications under the H2020 open access mandate are openly available. Figure 2 highlights a generally high rate of compliance with the open access mandate. However, uptake levels vary across the funding schemes. In particular, ERC grants and Marie Skłodowska-Curie activities show higher levels of compliance compared to the overall average.

+ +

Because of their large variations, I want to put the open access rates of H2020-funded projects in context when presenting the share for projects affiliated with the University of Göttingen. Again, the data analysis starts with loading the previously backed up file with decoded and parsed data, choosing project and access information from it.

+
+

+oaire_df <- jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>%
+  tibble::as_tibble()
+
+pubs_projects <- oaire_df %>%
+  select(id, type, best_access_right, linked_projects) %>%
+  unnest(linked_projects) 
+pubs_projects
+#> # A tibble: 136,298 x 12
+#>    id    type  best_access_rig… to    project_title funder
+#>    <chr> <chr> <chr>            <chr> <chr>         <chr> 
+#>  1 data… publ… Open Access      proj… Planning and… Europ…
+#>  2 data… publ… Open Access      proj… Cortical alg… Europ…
+#>  3 data… publ… Open Access      proj… Human Brain … Europ…
+#>  4 data… publ… Restricted       proj… Implementati… Europ…
+#>  5 data… publ… Open Access      proj… The power of… Europ…
+#>  6 data… publ… Open Access      proj… A psychologi… Wellc…
+#>  7 data… publ… Open Access      proj… Effects of N… Europ…
+#>  8 data… publ… Open Access      proj… Aggression s… Europ…
+#>  9 data… publ… Open Access      proj… Global trend… Europ…
+#> 10 data… publ… Open Access      proj… Mapping grav… Europ…
+#> # … with 136,288 more rows, and 6 more variables:
+#> #   funding_level_0 <chr>, funding_level_1 <chr>, project_code <chr>,
+#> #   project_acronym <chr>, contract_type <chr>, funding_level_2 <chr>
+
+

Next, I want to identify H2020 projects with participation from the university. There are at least two ways to obtain links between projects and organisations: One is the OpenAIRE Research Graph. It provides project details from 29 funders in a separate dump, project.gz. Another option is to relate our dataset to open data provided by CORDIS, the European Commission’s research information portal. For convenience, I am going to follow the second option.

+
+

+# load local copy downloaded from the EC open data portal
+cordis_org <-
+  readr::read_delim(
+    "data/cordis-h2020organizations.csv",
+    delim = ";",
+    locale = locale(decimal_mark = ",")
+  ) %>%
+  # data cleaning
+  mutate_if(is.double, as.character) 
+
+

After loading the file, I am able to tag projects affiliated with the University of Göttingen.

+
+

+ugoe_projects <- cordis_org %>%
+  filter(shortName %in% c("UGOE", "UMG-GOE")) %>% 
+  select(project_id = projectID, role, project_acronym = projectAcronym)
+
+pubs_projects_ugoe <- pubs_projects %>%
+  mutate(ugoe_project = funding_level_0 == "H2020" & project_code %in% ugoe_projects$project_id)
+
+
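The tagging step combines two conditions: the publication must be funded through H2020, and its project code must appear in the list of university projects. The following sketch, based on invented identifiers, shows why both conditions matter when project codes from different funding programmes could coincide.

```r
library(dplyr)

# hypothetical publications with funder programme and project code
toy_pubs <- tibble::tibble(
  id = c("pub_1", "pub_2", "pub_3"),
  funding_level_0 = c("H2020", "H2020", "FP7"),
  project_code = c("100001", "100002", "100001")
)

# hypothetical project identifiers of the institution
toy_project_ids <- c("100001", "100003")

toy_tagged <- toy_pubs %>%
  mutate(
    is_local_project = funding_level_0 == "H2020" &
      project_code %in% toy_project_ids
  )
toy_tagged$is_local_project
```

Only `pub_1` is tagged. `pub_3` shares the project code but belongs to a different programme, and `pub_2` is an H2020 publication from a project without local participation.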

Let’s put it all together and benchmark the rates of compliance with the H2020 open access mandate using data from the OpenAIRE Research Graph. The package plotly(Sievert 2018) allows presenting the figure as an interactive chart.

+
+

+# funding programmes with Uni Göttingen participation
+ugoe_funding_programme <- pubs_projects_ugoe %>% 
+  filter(ugoe_project == TRUE) %>%
+  group_by(funding_level_1, project_code) %>% 
+  # min 5 pubs
+  summarise(n = n_distinct(id)) %>%
+  filter(n >= 5) %>%
+  distinct(funding_level_1, project_code)
+goe_oa <- oa_monitor_ec %>%
+  # min 5 pubs
+  filter(all_pub >=5) %>%
+  filter(funding_scheme %in% ugoe_funding_programme$funding_level_1) %>%
+  mutate(ugoe = project_code %in% ugoe_funding_programme$project_code) %>%
+  mutate(`H2020 project` = paste0(project_acronym, " | OA share: ", round(oa_prop * 100, 0), "%"))
+# plot as interactive graph using plotly
+library(plotly)
+p <- ggplot(goe_oa, aes(funding_scheme, oa_prop)) +
+  geom_boxplot() +
+  geom_jitter(data = filter(goe_oa, ugoe == TRUE),
+               aes(label = `H2020 project`),
+             colour = "#AF42AE",
+             alpha = 0.9,
+             size = 3,
+             width = 0.25) +
+  geom_hline(aes(
+    yintercept = mean(oa_prop),
+    color = paste0("Mean=", as.character(round(
+      mean(oa_prop) * 100, 0
+    )), "%")
+  ),
+  linetype = "dashed",
+  size = 1) +
+  geom_hline(aes(
+    yintercept = median(oa_prop),
+    color = paste0("Median=", as.character(round(
+      median(oa_prop) * 100, 0
+    )), "%")
+  ),
+  linetype = "dashed",
+  size = 1) +
+  scale_color_manual(NULL, values = c("orange", "darkred")) +
+  scale_y_continuous(labels = scales::percent_format(accuracy = 5L)) +
+  labs(x = NULL,
+       y = "Open Access Percentage",
+       caption = "Data: OpenAIRE Research Graph") +
+  theme_minimal(base_family = "Roboto") +
+  theme(legend.position = "top",
+        legend.justification = "right")
+plotly::ggplotly(p, tooltip = c("label"))
+
+
+ +

+Figure 3: Open Access Compliance Rates of Horizon 2020 projects affiliated with the University of Göttingen (purple dots) relative to the overall performance of the funding activity, visualised as a box plot. Only projects with at least five publications were considered. Data: OpenAIRE Research Graph(Manghi, Atzori, et al. 2019) +

+
+
+

Figure 3 shows that many H2020 projects with University of Göttingen participation have an uptake of open access to grant-supported publications that is above the average in the peer group. At the same time, some perform below expectation. Together, this provides a valuable insight into open access compliance at the university level, especially for research support librarians who are in charge of helping grantees to make their work open access. They can, for instance, point grantees to OpenAIRE-compliant repositories for self-archiving their works.

+

Discussion and conclusion

+

Using data from the OpenAIRE Research Graph dumps makes it possible to put the results of a specific data analysis into context. Open access compliance rates of H2020 projects vary. These variations should be considered when reporting compliance rates of specific projects under the same open access mandate.

+

Although the OpenAIRE Research Graph is a large collection of scholarly data, it is likely that it still does not provide the whole picture. OpenAIRE mainly collects data from open sources. It is still unknown how the OpenAIRE Research Graph compares to well-established toll-access bibliometrics data sources like the Web of Science in terms of coverage and data quality.

+

As SUB Göttingen is a member of the OpenAIRE consortium, improving the re-use of the OpenAIRE Research Graph dumps has become one of its working priorities. In the scholarly communication analysts team, we want to support this with a number of data analyses and outreach activities. In doing so, we will add more helper functions to the openairegraph R package. It targets data analysts and researchers who wish to conduct their own analysis using the OpenAIRE Research Graph, but are wary of handling its large data dumps.

+

If you would like to contribute, head on over to the package’s source code repository and get started!

+
+
+

Bengtsson, Henrik. 2020a. Future.apply: Apply Function to Elements in Parallel Using Futures. https://CRAN.R-project.org/package=future.apply.

+
+
+

———. 2020b. Future: Unified Parallel and Distributed Processing in R for Everyone. https://CRAN.R-project.org/package=future.

+
+
+

Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan Tenenbaum. 2019. Remotes: R Package Installation from Remote Repositories, Including ’GitHub’. https://CRAN.R-project.org/package=remotes.

+
+
+

Larivière, Vincent, and Cassidy R. Sugimoto. 2018. “Do Authors Comply When Funders Enforce Open Access to Research?” Nature 562 (7728): 483–86. https://doi.org/10.1038/d41586-018-07101-w.

+
+
+

Manghi, Paolo, Claudio Atzori, Alessia Bardi, Jochen Schirrwagen, Harry Dimitropoulos, Sandro La Bruzzo, Ioannis Foufoulas, et al. 2019. “OpenAIRE Research Graph Dump.” Zenodo. https://doi.org/10.5281/zenodo.3516918.

+
+
+

Manghi, Paolo, Alessia Bardi, Claudio Atzori, Miriam Baglioni, Natalia Manola, Jochen Schirrwagen, and Pedro Principe. 2019. “The OpenAIRE Research Graph Data Model.” Zenodo. https://doi.org/10.5281/zenodo.2643199.

+
+
+

Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.

+
+
+

Sievert, Carson. 2018. Plotly for R. https://plotly-r.com.

+
+
+

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

+
+
+

Wickham, Hadley, Jim Hester, and Jeroen Ooms. 2019. Xml2: Parse XML. https://CRAN.R-project.org/package=xml2.

+
+

Corrections

+

If you see mistakes or want to suggest changes, please create an issue on the source repository.

+

Reuse

+

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/subugoe/scholcomm_analytics, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

+

Citation

+

For attribution, please cite this work as

+
Jahn (2020, April 7). Scholarly Communication Analytics: Accessing and analysing the OpenAIRE Research Graph data dumps. Retrieved from https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020/
+

BibTeX citation

+
@misc{jahn2020accessing,
+  author = {Jahn, Najko},
+  title = {Scholarly Communication Analytics: Accessing and analysing the OpenAIRE Research Graph data dumps},
+  url = {https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020/},
+  year = {2020}
+}
+
diff --git a/docs/posts/oaire_graph_2020/literature.bib b/docs/posts/oaire_graph_2020/literature.bib
new file mode 100644
index 0000000..071da4f
--- /dev/null
+++ b/docs/posts/oaire_graph_2020/literature.bib
@@ -0,0 +1,137 @@
+@Manual{future,
+  title = {future: Unified Parallel and Distributed Processing in R for Everyone},
+  author = {Henrik Bengtsson},
+  year = {2020},
+  note = {R package version 1.16.0},
+  url = {https://CRAN.R-project.org/package=future},
+  }
+
+
+@article{jsonlite,
+  title = {The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects},
+  author = {Jeroen Ooms},
+  journal = {arXiv:1403.2805 [stat.CO]},
+  year = {2014},
+  url = {https://arxiv.org/abs/1403.2805},
+  }
+
+@article{tidyverse,
+  title = {Welcome to the tidyverse},
+  author = {Hadley Wickham and Mara Averick and Jennifer Bryan and Winston Chang and Lucy D'Agostino McGowan and Romain François and Garrett Grolemund and Alex Hayes and Lionel Henry and Jim Hester and Max Kuhn and Thomas Lin Pedersen and Evan Miller and Stephan Milton Bache and Kirill Müller and Jeroen Ooms and David Robinson and Dana Paige Seidel and Vitalie Spinu and Kohske Takahashi and Davis Vaughan and Claus Wilke and Kara Woo and Hiroaki Yutani},
+  year = {2019},
+  journal = {Journal of Open Source Software},
+  volume = {4},
+  number = {43},
+  pages = {1686},
+  doi = {10.21105/joss.01686},
+  }
+
+
+@dataset{manghi_paolo_2019_3516918,
+  author = {Manghi, Paolo and
+            Atzori, Claudio and
+            Bardi, Alessia and
+            Schirrwagen, Jochen and
+            Dimitropoulos, Harry and
+            La Bruzzo, Sandro and
+            Foufoulas, Ioannis and
+            Löhden, Aenne and
+            Bäcker, Amelie and
+            Mannocci, Andrea and
+            Horst, Marek and
+            Baglioni, Miriam and
+            Czerniak, Andreas and
+            Kiatropoulou, Katerina and
+            Kokogiannaki, Argiro and
+            De Bonis, Michele and
+            Artini, Michele and
+            Ottonello, Enrico and
+            Lempesis, Antonis and
+            Nielsen, Lars Holm and
+            Ioannidis, Alexandros and
+            Bigarella, Chiara and
+            Summan, Friedrich},
+  title = {OpenAIRE Research Graph Dump},
+  month = dec,
+  year = 2019,
+  publisher = {Zenodo},
+  version = {1.0.0-beta},
+  doi = {10.5281/zenodo.3516918},
+  url = {https://doi.org/10.5281/zenodo.3516918}
+}
+
+ @Manual{future_apply,
+  title = {future.apply: Apply Function to Elements in Parallel using Futures},
+  author = {Henrik Bengtsson},
+  year = {2020},
+  note = {R package version 1.4.0},
+  url = {https://CRAN.R-project.org/package=future.apply},
+  }
+
+ @Manual{xml2,
+  title = {xml2: Parse XML},
+  author = {Hadley Wickham and Jim Hester and Jeroen Ooms},
+  year = {2019},
+  note = {R package version 1.2.2},
+  url = {https://CRAN.R-project.org/package=xml2},
+  }
+
+ @Manual{plotly,
+  title = {plotly for R},
+  author = {Carson Sievert},
+  year = {2018},
+  url = {https://plotly-r.com},
+  }
+
+ @article{Hicks_2015,
+  doi = {10.1038/520429a},
+  url = {https://doi.org/10.1038%2F520429a},
+  year = 2015,
+  month = {apr},
+  publisher = {Springer Science and Business Media {LLC}},
+  volume = {520},
+  number = {7548},
+  pages = {429--431},
+  author = {Diana Hicks and Paul Wouters and Ludo Waltman and Sarah de Rijcke and Ismael Rafols},
+  title = {Bibliometrics: The Leiden Manifesto for research metrics},
+  journal = {Nature}
+ }
+
+ @Manual{remotes,
+  title = {remotes: R Package Installation from Remote Repositories, Including
+'GitHub'},
+  author = {Jim Hester and Gábor Csárdi and Hadley Wickham and Winston Chang and Martin Morgan and Dan Tenenbaum},
+  year = {2019},
+  note = {R package version 2.1.0},
+  url = {https://CRAN.R-project.org/package=remotes},
+  }
+
+@article{Larivi_re_2018,
+  doi = {10.1038/d41586-018-07101-w},
+  url = {https://doi.org/10.1038%2Fd41586-018-07101-w},
+  year = 2018,
+  month = {oct},
+  publisher = {Springer Science and Business Media {LLC}},
+  volume = {562},
+  number = {7728},
+  pages = {483--486},
+  author = {Vincent Larivière and Cassidy R. Sugimoto},
+  title = {Do authors comply when funders enforce open access to research?},
+  journal = {Nature}}
+
+@misc{manghi_paolo_2019_2643199,
+  author = {Manghi, Paolo and
+            Bardi, Alessia and
+            Atzori, Claudio and
+            Baglioni, Miriam and
+            Manola, Natalia and
+            Schirrwagen, Jochen and
+            Principe, Pedro},
+  title = {The OpenAIRE Research Graph Data Model},
+  month = apr,
+  year = 2019,
+  publisher = {Zenodo},
+  version = {1.3},
+  doi = {10.5281/zenodo.2643199},
+  url = {https://doi.org/10.5281/zenodo.2643199}
+}
\ No newline at end of file
diff --git a/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/activity-1.png b/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/activity-1.png
new file mode 100644
index 0000000..d0b40b7
Binary files /dev/null and b/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/activity-1.png differ
diff --git a/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/ugoe-1.png b/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/ugoe-1.png
new file mode 100644
index 0000000..f98b1a9
Binary files /dev/null and b/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/ugoe-1.png differ
diff --git a/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/variations-1.png b/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/variations-1.png
new file mode 100644
index 0000000..8a2fe91
Binary files /dev/null and b/docs/posts/oaire_graph_2020/oaire_graph_post_files/figure-docx/variations-1.png differ
diff --git a/docs/posts/posts.json b/docs/posts/posts.json
index 8c4e726..932206c 100644
--- a/docs/posts/posts.json
+++ b/docs/posts/posts.json
@@ -1,4 +1,21 @@
 [
+  {
+    "path": "posts/oaire_graph_2020/",
+    "title": "Accessing and analysing the OpenAIRE Research Graph data dumps",
+    "description": "The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020.",
+    "author": [
+      {
+        "name": "Najko Jahn",
+        "url": "https://twitter.com/najkoja"
+      }
+    ],
+    "date": "2020-04-07",
+    "categories": [],
+    "preview": "posts/oaire_graph_2020/distill-preview.png",
+    "last_modified": "2020-04-07T11:26:11+02:00",
+    "preview_width": 1248,
+    "preview_height": 768
+  },
   {
     "path": "posts/unpaywall_python/",
     "title": "Exploring the Open Access Evidence base in Unpaywall with Python",
@@ -12,7 +29,7 @@
     "date": "2020-03-30",
     "categories": [],
     "preview": "posts/unpaywall_python/distill-preview.png",
-    "last_modified": "2020-03-30T12:41:37+02:00",
+    "last_modified": "2020-04-01T15:54:21+02:00",
     "preview_width": 3385,
     "preview_height": 1256
   },
@@ -29,7 +46,7 @@
     "date": "2019-11-25",
     "categories": [],
     "preview": "posts/elsevier_invoice/distill-preview.png",
-    "last_modified": "2020-03-30T12:30:46+02:00",
+    "last_modified": "2020-02-27T08:55:02+01:00",
     "preview_width": 1248,
     "preview_height": 768
   },
@@ -46,7 +63,7 @@
     "date": "2019-10-24",
     "categories": [],
     "preview": "posts/datacite_graph/distill-preview.png",
-    "last_modified": "2020-03-30T12:30:46+02:00",
+    "last_modified": "2020-02-27T08:59:43+01:00",
     "preview_width": 3900,
     "preview_height": 2400
   },
@@ -67,7 +84,7 @@
     "date": "2019-05-07",
     "categories": [],
     "preview": "posts/unpaywall_evidence/distill-preview.png",
-    "last_modified": "2020-03-30T12:30:46+02:00",
+    "last_modified": "2019-11-18T13:55:18+01:00",
     "preview_width": 1248,
     "preview_height": 768
   }
diff --git a/docs/sitemap.xml b/docs/sitemap.xml
index 50b9b42..b9ca500 100644
--- a/docs/sitemap.xml
+++ b/docs/sitemap.xml
@@ -2,26 +2,30 @@
 https://subugoe.github.io/scholcomm_analytics/about.html
- 2020-03-30T12:30:46+02:00
+ 2019-11-18T13:55:18+01:00
 https://subugoe.github.io/scholcomm_analytics/
- 2020-03-30T12:30:46+02:00
+ 2019-11-18T13:55:18+01:00
+ https://subugoe.github.io/scholcomm_analytics/posts/oaire_graph_2020/
+ 2020-04-07T11:26:11+02:00
 https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_python/
- 2020-03-30T12:41:37+02:00
+ 2020-04-01T15:54:21+02:00
 https://subugoe.github.io/scholcomm_analytics/posts/elsevier_invoice/
- 2020-03-30T12:30:46+02:00
+ 2020-02-27T08:55:02+01:00
 https://subugoe.github.io/scholcomm_analytics/posts/datacite_graph/
- 2020-03-30T12:30:46+02:00
+ 2020-02-27T08:59:43+01:00
 https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence/
- 2020-03-30T12:30:46+02:00
+ 2019-11-18T13:55:18+01:00