This report was generated on 2019-03-05 14:22:20. R version: 3.5.0 on x86_64-apple-darwin15.6.0. For this report, CRAN packages as of 2018-09-01 were used.
The preprocessing and analysis of the data were conducted in the R project for statistical computing. The RMarkdown script used to generate this document and all the resulting data can be downloaded from this link. By executing `main.Rmd`, the process described herein can be reproduced and this document can be generated. In the course of this, data from the folder `input` is processed and results are written to `output`. There are multiple `RData` files in the folder `rdata`. They are emitted in the chunks of `main.Rmd` and saved so that subsequent executions run faster. The download stored to `input/ignore/pubmed` could not be saved in this git repository because of its size (1.4 GB). Find out how to get your own version of the xml file below.
The code for the herein described process can also be freely downloaded from https://github.com/srfdata/2019-02-clinical-trials.
2019-02-clinical-trials by SRF Data is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
trials.csv
Attribute | Type | Description |
---|---|---|
unique_id | String | Unique identifier assigned by SRF |
primary | String | Trial identification number |
size | Numeric | Number of people participating in the study (total, worldwide) |
study_type | Factor | Type of study, either Interventional or Observational |
status | Factor | Status of the trial, either Completed, Ongoing, Terminated or Unknown |
primary_sponsor | String | The primary sponsor of the trial (responsible for updating register information) |
leading_institution | String | The (scientifically) leading institution of the trial |
registration_date | Date | Date the trial was registered in the registry |
start_date | Date | Date on which the enrollment process started |
end_date | Date | Date on which the enrollment process ended |
results_submitted | Boolean | Equals true if a result was submitted to the trial registry |
publication_found | Boolean | Equals true if the trial ID was found in Cochrane/PubMed |
source | Factor | Shows in which registry this trial was found, either NCT, EUCTR or DRKS |
has_g_scholar_results | Boolean | Equals true if Google Scholar returned any results when queried with the trial ID |
primary_sponsor_simple | String | Simplified name of primary sponsor (mainly done for Swiss institutions) |
leading_institution_simple | String | Simplified name of leading institution (mainly done for Swiss institutions) |
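Not part of the pipeline: a minimal sketch of how the published `trials.csv` could be read back in with readr, using the column types from the table above. The path `output/trials.csv` is an assumption based on the folder description at the top of this report.
library(readr)
trials <- read_csv(
  "output/trials.csv", # assumed location inside the output folder
  col_types = cols(
    unique_id = col_character(),
    primary = col_character(),
    size = col_double(),
    study_type = col_factor(levels = c("interventional", "observational")),
    status = col_factor(levels = c("Completed", "Ongoing", "Terminated", "Unknown")),
    primary_sponsor = col_character(),
    leading_institution = col_character(),
    registration_date = col_date(),
    start_date = col_date(),
    end_date = col_date(),
    results_submitted = col_logical(),
    publication_found = col_logical(),
    source = col_factor(levels = c("NCT", "EUCTR", "DRKS")),
    has_g_scholar_results = col_logical(),
    primary_sponsor_simple = col_character(),
    leading_institution_simple = col_character()
  )
)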
In the flow chart below, we describe how we tried to find all trials related to Switzerland and how we evaluated whether a publication for each trial exists.
We searched the following registries by selecting Switzerland OR Schweiz as the country of recruitment:
→ http://apps.who.int/trialsearch/AdvSearch.aspx
Downloaded 01/14/2019 at 12:35 CET by setting the recruitment status to ALL and choosing Export all trials to XML.
→ https://clinicaltrials.gov/ct2/results?cond=&term=Switzerland&cntry=&state=&city=&dist=
Downloaded 01/14/2019 at 12:31 CET via Download > For Advanced Users: Full Study Record XML Download.
→ https://www.drks.de/drks_web/navigate.do?navigationId=search&reset=true
Downloaded 01/24/2019 at 13:20 CET as CSV.
You can query PubMed for all publications with a linked NCT ClinicalTrials.gov ID:
→ https://www.ncbi.nlm.nih.gov/pubmed/?cmd=Search&term=clinicaltrials.gov%5bsi%5d
We also searched for the following terms and saved the result as XML by choosing Send to > File > XML:
EUDRACT*
EUCTR*
NCT0*
DRKS*
https://www.ncbi.nlm.nih.gov/pubmed?term=(((EUDRACT*)%20OR%20EUCTR*)%20OR%20NCT0*)%20OR%20DRKS*
Downloaded 01/14/2019 at 10:56 CET.
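As an illustration (not part of the original pipeline), the same PubMed query can also be run programmatically against the NCBI E-utilities, for example with the rentrez package:
library(rentrez) # client for the NCBI E-utilities
# count all PubMed records that carry a linked ClinicalTrials.gov ID;
# [si] is PubMed's "secondary source ID" field
res <- entrez_search(db = "pubmed", term = "clinicaltrials.gov[si]", retmax = 0)
res$count # total number of matching publications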
You can read more about finding results of studies on ClinicalTrials.gov.
Downloaded 01/14/2019 at 10:25 CET by searching ID wildcards, e.g. *NCT0*, in ti, ab, kw (Title, Abstract, Keywords).
Limit the years to get the total number of trials below the export limit of 20k.
Search terms:
NCT0*
EUCTR*
EUDRACT*
Switzerland
Read more about the Cochrane CENTRAL database here.
The Swiss registry for trials is only a secondary registry. All trials in it must also be registered in one of the primary registries (the European, German or American registers). It does not contain any information about whether a trial was started or completed.
## [1] "package package:rmarkdown detached"
# from https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# if you don't need a package, remove it from here (commenting is probably not sufficient)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(rstudioapi)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(glue) # interpreted string literals
library(ggrepel) # repelling text labels for ggplot
library(xml2) # read xml files
library(readxl) # read xls(x) files
library(rvest) # used for scraping
library(magrittr) # pipes
library(foreach) # parallelization
library(doParallel) # parallelization
library(stringr) # string manipulation
library(scales) # scales for ggplot2
library(jsonlite) # json
library(lubridate) # date handling
library(forcats) # easier factor handling,
library(lintr) # code linting, do not remove under any circumstances ;-)
library(styler) # code formatting
library(googlesheets) # googlesheets (replace with tidyverse/googlesheets4 asap)
library(rmarkdown) # needed for automated knitting
library(R.utils) # needed for withTimeout",
file = "manifest.R"
)
# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
if (!require(devtools)) {
install.packages("devtools", repos = "http://cran.us.r-project.org")
require(devtools)
}
devtools::install_github("RevolutionAnalytics/checkpoint",
ref = "v0.3.2", # could be adapted later,
# as of now (beginning of July 2017
# this is the current release on CRAN)
repos = "http://cran.us.r-project.org")
require(checkpoint)
}
# nolint start
if (!dir.exists("~/.checkpoint")) {
dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
project = path_to_wd,
verbose = TRUE,
scanForPackages = TRUE,
use.knitr = FALSE,
R.version = R_version)
rm(package_date)
source("manifest.R")
unlink("manifest.R")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_CH.UTF-8/de_CH.UTF-8/de_CH.UTF-8/C/de_CH.UTF-8/de_CH.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] R.utils_2.7.0 R.oo_1.22.0 R.methodsS3_1.7.1
## [4] rmarkdown_1.10 googlesheets_0.3.0 styler_1.0.2
## [7] lintr_1.0.2 lubridate_1.7.4 jsonlite_1.5
## [10] scales_1.0.0 doParallel_1.0.11 iterators_1.0.10
## [13] foreach_1.4.4 magrittr_1.5 rvest_0.3.2
## [16] readxl_1.1.0 xml2_1.2.0 ggrepel_0.8.0
## [19] glue_1.3.0 forcats_0.3.0 stringr_1.3.1
## [22] dplyr_0.7.6 purrr_0.2.5 readr_1.1.1
## [25] tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0
## [28] tidyverse_1.2.1 rstudioapi_0.7 checkpoint_0.4.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.4 haven_1.1.2 lattice_0.20-35 colorspace_1.3-2
## [5] htmltools_0.3.6 yaml_2.2.0 rlang_0.2.2 pillar_1.3.0
## [9] withr_2.1.2 modelr_0.1.2 bindrcpp_0.2.2 bindr_0.1.1
## [13] plyr_1.8.4 munsell_0.5.0 gtable_0.2.0 cellranger_1.1.0
## [17] codetools_0.2-15 evaluate_0.11 knitr_1.20 rex_1.1.2
## [21] broom_0.5.0 Rcpp_0.12.18 backports_1.1.2 hms_0.4.2
## [25] digest_0.6.16 stringi_1.2.4 grid_3.5.0 rprojroot_1.3-2
## [29] cli_1.0.0 tools_3.5.0 lazyeval_0.2.1 crayon_1.3.4
## [33] pkgconfig_2.0.2 assertthat_0.2.0 httr_1.3.1 R6_2.2.2
## [37] nlme_3.1-137 compiler_3.5.0
We read the data from the files (one XML file per study) and collect it in multiple “lookup tables” (because one trial can have multiple locations and sponsors, a 1:n relationship). You can see which nodes of the XML file we use in the code below.
The XML files are not included in this git repo. Read the chapter data sources and download your own version of the files if you wish to reproduce the report, or work with the provided `rdata` files.
# load data from rdata files if they are present
if (
file.exists("rdata/nct_trials.rdata") &
file.exists("rdata/nct_locations.rdata") &
file.exists("rdata/nct_sponsors.rdata") &
file.exists("rdata/nct_citations.rdata") &
file.exists("rdata/nct_officials.rdata")
) {
load(file = "rdata/nct_trials.rdata")
load(file = "rdata/nct_locations.rdata")
load(file = "rdata/nct_sponsors.rdata")
load(file = "rdata/nct_citations.rdata")
load(file = "rdata/nct_officials.rdata")
} else {
# create new data frames for the XML data
# one main data frame will contain one row per trial
nct_trials <- data.frame()
# the following data frames can contain more than one row per trial
nct_locations <- data.frame()
nct_sponsors <- data.frame()
nct_citations <- data.frame()
nct_officials <- data.frame()
# the following function has side effects! It looks up the XML data for a
# given NCT ID by reading the file "input/...NCTID.xml" and adding it to the
# data frames above so they can be joined later
extract_info_from_file <- function(filename) {
# get xml file
xml <- read_xml(filename)
# store id in variable as it's used multiple times
nct_id <- xml_text(xml_find_first(xml, ".//id_info/nct_id"))
# append to main data frame
nct_trials <<- nct_trials %>% bind_rows(
data.frame(
"id" = nct_id,
"secondary_id" = xml_text(
xml_find_first(xml, ".//id_info/secondary_id")
),
"source" = xml_text(
xml_find_first(xml, ".//source")
),
"size" = xml_text(
xml_find_first(xml, ".//enrollment")
),
"study_type" = xml_text(
xml_find_first(xml, ".//study_type")
),
"overall_status" = xml_text(
xml_find_first(xml, ".//overall_status")
),
"clinical_results_present" = length( # if node is present
xml_find_all(xml, ".//clinical_results")
) > 0,
"start_date" = xml_text(
xml_find_first(xml, ".//start_date")
),
"completion_date" = xml_text(
xml_find_first(xml, ".//completion_date")
),
"primary_completion_date" = xml_text(
xml_find_first(xml, ".//primary_completion_date")
),
"study_first_submitted" = xml_text(
xml_find_first(xml, ".//study_first_submitted")
),
"study_first_submitted_qc" = xml_text(
xml_find_first(xml, ".//study_first_submitted_qc")
),
"study_first_posted" = xml_text(
xml_find_first(xml, ".//study_first_posted")
),
"results_first_submitted" = xml_text(
xml_find_first(xml, ".//results_first_submitted")
),
"results_first_submitted_qc" = xml_text(
xml_find_first(xml, ".//results_first_submitted_qc")
),
"results_first_posted" = xml_text(
xml_find_first(xml, ".//results_first_posted")
),
"submitted_pending_results" = xml_text(
xml_find_first(xml, ".//pending_results/submitted")
),
"why_stopped" = xml_text(
xml_find_first(xml, ".//why_stopped")
),
# the following subchildren can only be accessed if the parent
# node overall_official is present, otherwise they cause errors
"overall_official" = ifelse(
length(xml_find_all(xml, ".//overall_official")) > 0,
xml_text(xml_find_first(xml, ".//overall_official/last_name")),
NA # if not present
),
"official_role" = ifelse(
length(xml_find_all(xml, ".//overall_official")) > 0,
xml_text(xml_find_first(xml, ".//overall_official/role")),
NA # if not present
),
"official_affiliation" = ifelse(
length(xml_find_all(xml, ".//overall_official")) > 0,
xml_text(xml_find_first(xml, ".//overall_official/affiliation")),
NA # if not present
),
stringsAsFactors = FALSE
)
)
# get locations
xml_find_all(xml, ".//location") %>%
# keep only those with country Switzerland
keep(~ xml_text(xml_find_first(., ".//country")) == "Switzerland") %>%
walk(function(node) {
# append to data frame
nct_locations <<- nct_locations %>% bind_rows(
data.frame(
"id" = nct_id,
"name" = xml_text(xml_find_first(node, ".//name")),
"city" = xml_text(xml_find_first(node, ".//city")),
stringsAsFactors = FALSE
)
)
})
# get sponsors
xml_find_all(xml, ".//lead_sponsor|.//collaborator") %>%
walk(function(node) {
# append to data frame
nct_sponsors <<- nct_sponsors %>% bind_rows(
data.frame(
"id" = nct_id,
"sponsor" = xml_text(xml_find_first(node, ".//agency")),
"type" = xml_name(node),
"class" = xml_text(xml_find_first(node, ".//agency_class")),
stringsAsFactors = FALSE
)
)
})
# get citations
xml_find_all(xml, ".//results_reference") %>%
walk(function(node) {
# append to data frame
nct_citations <<- nct_citations %>% bind_rows(
data.frame(
"id" = nct_id,
"citation" = xml_text(xml_find_first(node, ".//citation")),
"pm_id" = xml_text(xml_find_first(node, ".//PMID")),
stringsAsFactors = FALSE
)
)
})
# get officials
xml_find_all(xml, ".//overall_official") %>%
walk(function(node) {
# append to data frame
nct_officials <<- nct_officials %>% bind_rows(
data.frame(
"id" = nct_id,
"name" = xml_text(xml_find_first(node, ".//last_name")),
"role" = xml_text(xml_find_first(node, ".//role")),
"affiliation" = xml_text(xml_find_first(node, ".//affiliation")),
stringsAsFactors = FALSE
)
)
})
}
print("Now extracting data from XML files (ClinicalTrials.gov data)…")
print("Takes about 2.5 minutes")
# go over all entries
system.time(
list.files("input/ignore/clinicaltrials", full.names = TRUE) %>%
walk(extract_info_from_file)
)
# save to rdata
save(nct_trials, file = "rdata/nct_trials.rdata")
save(nct_locations, file = "rdata/nct_locations.rdata")
save(nct_sponsors, file = "rdata/nct_sponsors.rdata")
save(nct_citations, file = "rdata/nct_citations.rdata")
save(nct_officials, file = "rdata/nct_officials.rdata")
# clean up
rm(extract_info_from_file)
}
## [1] "Now extracting data from XML files (ClinicalTrials.gov data)…"
## [1] "Takes about 2.5 minutes"
Some specifications about dates:
`study_first_submitted`: The date on which the study sponsor or investigator first submitted a study record to ClinicalTrials.gov. There is typically a delay of a few days between the first submitted date and the record’s availability on ClinicalTrials.gov (the first posted date).
`study_first_submitted_qc`: The date on which the study sponsor or investigator first submits a study record that is consistent with National Library of Medicine (NLM) quality control (QC) review criteria. The sponsor or investigator may need to revise and submit a study record one or more times before NLM’s QC review criteria are met. It is the responsibility of the sponsor or investigator to ensure that the study record is consistent with the NLM QC review criteria.
`primary_completion_date`: The date on which the last participant in a clinical study was examined or received an intervention to collect final data for the primary outcome measure. Whether the clinical study ended according to the protocol or was terminated does not affect this date. For clinical studies with more than one primary outcome measure with different completion dates, this term refers to the date on which data collection is completed for all the primary outcome measures. The “estimated” primary completion date is the date that the researchers think will be the primary completion date for the study.
Conclusion: This report uses a trial’s primary completion date as the key criterion to determine whether or not a trial is due to post results.
We followed the argumentation of the report “Clinical Trial Reporting: University Policies and Performance in the UK” (Bristol and London, 06 November 2018): “This approach is likely to slightly overcount due trials. For example, if a trial’s expected primary completion date is extended during the trial due to slower than expected patient recruitment, and university staff fails to update the registry entry accordingly, the expected primary completion date listed in the registry will be further in the past than the actual or currently expected primary completion date. On balance, the approach used here has two significant advantages: In terms of accuracy, the number of trials falsely identified as overdue using this approach is assumed to be substantially lower than the number of trials falsely identified as not yet due when using the conventional approach. In terms of faithfully depicting a university’s registry management performance, this approach is preferable because it will never falsely identify trials as overdue if a university keeps its registry entries up to date. Thus, the approach used here incentivises universities to keep their registry entries up to date. In contrast, the conventional approach creates perverse incentives for trial sponsors to postpone or neglect updating a trial’s status to ‘completed’.”
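To make the criterion concrete, here is a minimal sketch of flagging a trial as due, assuming a hypothetical 12-month reporting window after the primary completion date (the actual filter used in this report is defined further below):
library(lubridate)
# a sketch, not the report's filter: a trial counts as "due" if its
# registry-listed primary completion date plus an assumed 12-month
# reporting window lies in the past
is_due <- function(primary_completion_date,
                   reporting_window = months(12),
                   as_of = Sys.Date()) {
  !is.na(primary_completion_date) &
    (primary_completion_date %m+% reporting_window) < as_of
}
is_due(as.Date("2016-06-30")) # TRUE as of early 2019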
# load data from rdata file if it is present
if (file.exists("rdata/ictrp_trials.rdata")) {
load(file = "rdata/ictrp_trials.rdata")
} else {
# Read file. Caution! This file is 55 MB and contains 7602 trials
ictrp_xml <- read_xml("input/ictrp/ICTRP-Results.xml")
ictrp_nodes <- xml_find_all(ictrp_xml, "//Trial")
print("Now extracting data from XML file (ICTRP data)")
# In this file there are no n:1 relations, so we can read it in one pass:
# split into batches of 1000 nodes and calculate in parallel
batch_size <- 1000
split_factor <- as.factor(
rep_len(
1:ceiling(length(ictrp_nodes) / batch_size),
length.out = length(ictrp_nodes)
)
)
system.time(
ictrp_trials <- foreach(
node = split(ictrp_nodes, split_factor),
.inorder = FALSE,
.combine = rbind
) %dopar%
data.frame(
"id" =
xml_text(xml_find_first(node, ".//TrialID")),
"secondary" =
xml_text(xml_find_first(node, ".//Secondary_ID")),
"size" =
xml_text(xml_find_first(node, ".//Target_size")),
"study_type" =
xml_text(xml_find_first(node, ".//Study_type")),
"registration_date" =
xml_text(xml_find_first(node, ".//Date_registration")),
"enrollment_date" =
xml_text(xml_find_first(node, ".//Date_enrollement")),
"primary_sponsor" =
xml_text(xml_find_first(node, ".//Primary_sponsor")),
"secondary_sponsor" =
xml_text(xml_find_first(node, ".//Secondary_Sponsor")),
"source" =
xml_text(xml_find_first(node, ".//Source_Register")),
"type" =
xml_text(xml_find_first(node, ".//Study_type")),
"countries" =
xml_text(xml_find_first(node, ".//Countries")),
"contact_name" = paste(
xml_text(xml_find_first(node, ".//Contact_Firstname")),
xml_text(xml_find_first(node, ".//Contact_Lastname"))
),
"contact_address" =
xml_text(xml_find_first(node, "./Contact_Address")),
"affiliation" =
xml_text(xml_find_first(node, ".//Contact_Affiliation")),
"support" =
xml_text(xml_find_first(node, ".//Source_Support")),
stringsAsFactors = FALSE
)
)
# save to rdata
save(ictrp_trials, file = "rdata/ictrp_trials.rdata")
# clean up
rm(ictrp_xml, ictrp_nodes, batch_size, split_factor)
}
## [1] "Now extracting data from XML file (ICTRP data)"
Unfortunately the ICTRP registry is missing some key information. There is no structured information in the download about the end date of a trial. We have to add another source of data for that information.
The EBM DataLab hosts a website tracking the trials and publications in the EU Clinical Trials Register (EUCTR). You can read more about their methodology here. The tracker also includes a flag for whether results are published in the registry and the status of the trial. This data source is said to be updated once a month.
EBM DataLab also collects the dates of the trials in the different countries (see example here). The earliest date is in the column `min_end_date` and the latest date is in `max_end_date`. We use `max_end_date` to evaluate whether the trial has been completed or not.
The EU does not collect information about observational studies. On the website you can read: “The European Union Clinical Trials Register allows you to search for protocol and results information on: […]”
We do have information about observational trials from the other two registers though.
# this chunk is changing the original data frames, execute once only
if ("max_end_date" %in% colnames(ictrp_trials)) {
print("EMBD data already present, this chunk will be skipped")
} else {
# filter and keep only european registry
ictrp_trials %<>%
filter(str_detect(id, "\\d{4}-\\d{6}-\\d{2}")) %>%
# filter(str_detect(countries, "Switzerland")) %>%
mutate(simple_id = str_extract(id, "\\d{4}-\\d{6}-\\d{2}"))
# download file from github
download.file(
"https://raw.githubusercontent.com/ebmdatalab/euctr-tracker-data/master/trials.csv",
"input/ignore/embd/trials.csv"
)
# read in
embd_euctr_data <- read_csv("input/ignore/embd/trials.csv", col_types = cols())
# status codes
embd_status_codes <- data.frame(
"0" = "Ongoing",
"1" = "Completed, but no date",
"2" = "Ongoing",
"3" = "Suspended, withdrawn, not authorised or prohibited",
"4" = "No trial status on register",
stringsAsFactors = FALSE
) %>%
gather("code", "status_description") %>%
mutate(code = as.numeric(str_sub(code, 2)))
# join min/max end date and the indicator for results
# this is a range of the end dates of all countries involved. A trial can be
# completed in one country in 2015 but in another it lasted until 2016
ictrp_trials %<>%
left_join(
embd_euctr_data %>%
select(
simple_id = trial_id,
min_end_date,
max_end_date,
trial_status,
has_results
),
by = "simple_id"
) %>%
# in the embd data all ids occur twice and get joined twice; remove duplicates
distinct(simple_id, .keep_all = TRUE) %>%
# replace status codes with words from data frame defined above
left_join(
embd_status_codes,
by = c("trial_status" = "code")
) %>%
select(-trial_status) %>%
rename(trial_status = status_description)
# clean up
rm(embd_euctr_data, embd_status_codes)
}
100% of all EU trials in ICTRP were also found in the EBM DataLab data from GitHub. For 61.1% of the trials there is now an end date available.
## Warning: Missing column names filled in: 'X322' [322]
## Warning: Duplicated column names deduplicated: 'url' => 'url_1' [320]
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 2)
## Warning: 1 parsing failure.
## row # A tibble: 1 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 209 <NA> 322 columns 336 columns 'input/drks/trials.csv' file # A tibble: 1 x 5
Let’s check which of the trial registries contains the most trials and how big the overlap between the different registries is.
We count the following numbers of unique trial IDs:
knitr::kable(data.frame(
"NCT" = nrow(nct_trials %>% distinct(id)),
"EUCTR" = nrow(ictrp_trials %>% distinct(id)),
"DRKS" = nrow(drks_trials %>% distinct(id))
))
NCT | EUCTR | DRKS |
---|---|---|
6574 | 1101 | 520 |
The clinicaltrials.gov register has by far the highest number of IDs, so we’ll start from there and try to identify duplicates in the other registers. To mark duplicates, we’ll introduce a new custom unique ID created with the `digest` package and the `xxhash32` algorithm.
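For illustration: hashing is deterministic, so re-running the pipeline assigns the same `unique_id` to the same primary ID (the NCT number below is an arbitrary example):
library(digest)
# the same input always yields the same short hex digest
digest("NCT00000102", algo = "xxhash32") ==
  digest("NCT00000102", algo = "xxhash32")
## [1] TRUE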
# load data from rdata files if they are present
if (
file.exists("rdata/ictrp_unique_ids.rdata") &
file.exists("rdata/drks_unique_ids.rdata") &
file.exists("rdata/nct_unique_ids.rdata")
) {
load(file = "rdata/ictrp_unique_ids.rdata")
load(file = "rdata/drks_unique_ids.rdata")
load(file = "rdata/nct_unique_ids.rdata")
} else {
# assign unique ids to nct_trials
nct_unique_ids <- nct_trials %>%
rowwise() %>%
select(primary = id, secondary = secondary_id) %>%
mutate(
unique_id = digest::digest(primary, algo = "xxhash32")
)
join_nct_or_generate_new_id <- function(df) {
# create regex search for primary OR secondary
search <- glue("{df$id}|{df$secondary}", .na = "") %>%
str_replace("\\|$", "") # remove ORs if no secondary present
if (!is.na(search)) {
nct_match <- nct_unique_ids %>%
filter(
str_detect(primary, search) |
str_detect(secondary, search)
)
# if entry was found, return the corresponding unique id
if (nrow(nct_match) == 1) {
return(data.frame(
"id" = df$id,
"found" = TRUE,
"unique_id" = nct_match$unique_id,
stringsAsFactors = FALSE
))
# if not found: generate a new one
} else if (nrow(nct_match) == 0) {
return(data.frame(
"id" = df$id,
"found" = FALSE,
"unique_id" = digest::digest(search, algo = "xxhash32"),
stringsAsFactors = FALSE
))
# sadly, the ictrp rows are not necessarily unique
# in that case return NA in the column found and all unique_ids
} else {
return(data.frame(
"id" = df$id,
"found" = NA,
"unique_id" = paste(nct_match$unique_id, collapse = " OR "),
stringsAsFactors = FALSE
))
} # else (not 0 and not 1 entries found)
} # if id is present
} # function
# convert to list of one-rowed data frames with selection of columns
drks_ids <- split(
drks_trials %>%
# keep only secondary ids that are NCT ids
mutate(
secondary = str_extract(
glue("{secondary} {tertiary} {quaternary} {quinary} {senary}"),
"NCT0\\d{7}"
)
) %>%
select(id, secondary),
# by sequencing over all row numbers we split the data frame into
# a list of tibbles - one for each row, that we can then pass to foreach
seq(nrow(drks_trials))
)
print(glue("Searching for DRKS IDs in the clinicaltrials data set"))
duration <- system.time(
drks_unique_ids <- foreach(
ids = drks_ids,
.inorder = FALSE,
.combine = rbind
) %dopar%
join_nct_or_generate_new_id(ids)
)
print(glue(
"The matching took {round(duration[3])} seconds. \\
Between DRKS in NCT IDs {duplicates_found} \\
duplicates were found, that's {percent(share)}.",
duplicates_found = drks_unique_ids %>%
filter(found) %>%
nrow(),
share = drks_unique_ids %>%
group_by(found) %>%
summarise(n = n()) %>%
mutate(share = n / sum(n)) %>%
filter(found) %>%
select(share) %>%
simplify()
))
# repeat for ictrp ids
# convert to list of one-rowed data frames with selection of columns
ictrp_ids <- split(
ictrp_trials %>%
# search for the id without EUCTR and country code to broaden the search
select(id = simple_id, secondary) %>%
distinct(id, .keep_all = TRUE) %>%
# keep only secondary ids that are NCT ids
mutate(secondary = str_extract(secondary, "NCT0\\d{7}")),
seq(nrow(ictrp_trials))
)
print(glue("Searching for ICTRP IDs in the clinicaltrials data set"))
duration <- system.time(
ictrp_unique_ids <- foreach(
ids = ictrp_ids,
.inorder = FALSE,
.combine = rbind
) %dopar%
join_nct_or_generate_new_id(ids)
)
print(glue(
"The matching took {round(duration[3])} seconds. \\
Between ICTRP in NCT IDs {duplicates_found} \\
duplicates were found, that's {percent(share)}",
duplicates_found = ictrp_unique_ids %>%
filter(found) %>%
nrow(),
share = ictrp_unique_ids %>%
group_by(found) %>%
summarise(n = n()) %>%
mutate(share = n / sum(n)) %>%
filter(found) %>%
select(share) %>%
simplify()
))
not_unique <- ictrp_unique_ids %>%
filter(is.na(found))
# some ICTRP entries were found more than once, show which
print(glue(
"The following {nrow(not_unique)} EUCTR IDs were found in more than one \\
clinicaltrials.gov entry:"
))
print(knitr::kable(
not_unique
))
print(glue("We separate them and treat them as different trials"))
ictrp_unique_ids %<>%
filter(!is.na(found)) %>%
bind_rows(
# put each of the non-unique trials into its own row
# and bind back to original data frame
not_unique %>%
mutate(unique_id = str_split(unique_id, " OR ")) %>%
unnest(unique_id) %>%
mutate(found = TRUE)
)
# save to rdata
save(ictrp_unique_ids, file = "rdata/ictrp_unique_ids.rdata")
save(drks_unique_ids, file = "rdata/drks_unique_ids.rdata")
save(nct_unique_ids, file = "rdata/nct_unique_ids.rdata")
# clean up
rm(
drks_ids,
ictrp_ids,
join_nct_or_generate_new_id,
duration,
not_unique
)
}
## Searching for DRKS IDs in the clinicaltrials data set
## The matching took 39 seconds. Between DRKS and NCT IDs, 190 duplicates were found, that's 36.5%.
## Searching for ICTRP IDs in the clinicaltrials data set
## The matching took 83 seconds. Between ICTRP and NCT IDs, 495 duplicates were found, that's 45.0%
## The following 4 EUCTR IDs were found in more than one clinicaltrials.gov entry:
##
##
## id found unique_id
## --------------- ------ ---------------------------------
## 2009-012476-28 NA b489857b OR e09f835a OR f9792432
## 2006-004495-13 NA 68f841e5 OR 94e9e897
## 2011-001729-25 NA 4bc81970 OR 75628864
## 2007-003013-14 NA 34923059 OR 02733b10
## We separate them and treat them as different trials
# join NCT unique ids
if (!"unique_id" %in% colnames(nct_trials)) {
nct_trials %<>%
left_join(
nct_unique_ids %>%
select(-secondary),
by = c("id" = "primary")
)
}
# join DRKS unique ids
if (!"unique_id" %in% colnames(drks_trials)) {
drks_trials %<>%
left_join(
drks_unique_ids %>%
select(-found),
by = "id"
)
}
# join ICTRP unique ids
if (!"unique_id" %in% colnames(ictrp_trials)) {
ictrp_trials %<>%
left_join(
ictrp_unique_ids %>%
select(-found),
by = c("simple_id" = "id")
)
}
# clean up
rm(drks_unique_ids, nct_unique_ids, ictrp_unique_ids)
list_coverage <- function(df) {
df %>%
mutate_if(is.character, funs(factor(replace(., . == "[---]*", NA)))) %>%
summarize_all(funs(sum(!is.na(.)) / length(.))) %>%
gather("column", "coverage") %>%
arrange(desc(coverage)) %>%
mutate(coverage = percent(coverage))
}
knitr::kable(
ictrp_trials %>%
list_coverage()
)
column | coverage |
---|---|
id | 100.0% |
secondary | 100.0% |
size | 100.0% |
study_type | 100.0% |
registration_date | 100.0% |
enrollment_date | 100.0% |
primary_sponsor | 100.0% |
secondary_sponsor | 100.0% |
source | 100.0% |
type | 100.0% |
countries | 100.0% |
contact_name | 100.0% |
contact_address | 100.0% |
affiliation | 100.0% |
support | 100.0% |
simple_id | 100.0% |
has_results | 100.0% |
trial_status | 100.0% |
unique_id | 100.0% |
min_end_date | 61.3% |
max_end_date | 61.3% |
knitr::kable(
drks_trials %>%
list_coverage()
)
column | coverage |
---|---|
id | 100.0% |
registration_date | 100.0% |
investorInitiated | 100.0% |
study_type | 100.0% |
national | 100.0% |
address.type0 | 100.0% |
address.affiliation0 | 100.0% |
unique_id | 100.0% |
last-update | 99.8% |
start_date | 99.6% |
target_size | 99.2% |
address.type1 | 98.8% |
phase | 98.7% |
address.lastname1 | 98.3% |
address.affiliation1 | 97.9% |
plannedActual | 67.5% |
address.land0 | 67.1% |
address.zip0 | 66.9% |
address.city0 | 66.9% |
address.streetAndNo1 | 66.9% |
address.zip1 | 66.9% |
address.city1 | 66.9% |
address.land1 | 66.9% |
address.firstname1 | 66.7% |
address.streetAndNo0 | 63.8% |
secondary | 53.8% |
address.firstname0 | 50.4% |
address.lastname0 | 50.4% |
firstPartnerPublishDate | 41.3% |
tertiary | 39.4% |
address.form1 | 34.0% |
publication.category0 | 32.1% |
address.title1 | 31.5% |
publication.value0 | 31.2% |
study_end | 26.9% |
address.form0 | 26.3% |
publication.type0 | 25.8% |
address.title0 | 24.6% |
quaternary | 22.7% |
quinary | 15.8% |
publication.category1 | 15.4% |
publication.value1 | 14.8% |
publication.type1 | 11.0% |
publication.key0 | 8.1% |
publication.category2 | 7.7% |
publication.value2 | 7.7% |
publication.category3 | 5.2% |
publication.value3 | 5.0% |
publication.type2 | 4.0% |
publication.key1 | 3.7% |
senary | 3.1% |
publication.category4 | 2.7% |
publication.value4 | 2.7% |
publication.type3 | 2.5% |
publication.key2 | 1.5% |
publication.key3 | 0.8% |
publication.type4 | 0.8% |
publication.key4 | 0.6% |
knitr::kable(
nct_trials %>%
list_coverage()
)
column | coverage |
---|---|
id | 100.0% |
source | 100.0% |
study_type | 100.0% |
overall_status | 100.0% |
clinical_results_present | 100.0% |
study_first_submitted | 100.0% |
study_first_submitted_qc | 100.0% |
study_first_posted | 100.0% |
unique_id | 100.0% |
start_date | 99.4% |
size | 98.7% |
primary_completion_date | 95.1% |
completion_date | 93.9% |
overall_official | 87.6% |
official_role | 87.6% |
official_affiliation | 87.4% |
secondary_id | 28.5% |
results_first_submitted | 12.1% |
results_first_submitted_qc | 12.1% |
results_first_posted | 12.1% |
why_stopped | 6.4% |
submitted_pending_results | 1.8% |
For ClinicalTrials.gov (NCT), a trial is labeled as `results_submitted` if one of the following conditions is met:
a `clinical_results` node is present OR
`results_first_posted` has a value OR
`submitted_pending_results` has a value OR
`why_stopped` has a value
It is labeled as `publication_found` if:
the registry entry contains at least one `results_reference` citation
As the `end_date`, the field `primary_completion_date` is used.
The study type “expanded access” is treated as “interventional”, so we only have two different types: interventional and observational.
For DRKS, a trial is labeled as `results_submitted` if the following condition is met:
one of the `publications` columns in the downloaded xls contains Veröffentlichung (publication) OR Abstract zur Studie (abstract on the study)
As the `end_date`, the field `study_end` is used.
For EUCTR, the labelling of `results_submitted` and the `end_date` uses the values that the EU Trials Tracker lists as `has_results` and `max_end_date`.
# set to english month names etc.
locale_for_later_restore <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
## [1] "C"
english_months <- paste0(month.name, collapse = "|")
permissive_date_converter <- function(string) {
case_when(
# check for dates with a day in it (e.g. "May 8, 2004")
str_detect(string, glue("({english_months}) \\d{{1,2}}, \\d{{4}}")) ~
as.Date(string, format = "%B %e, %Y"),
# else check for dates without a day (e.g. "May 2004")
str_detect(string, glue("({english_months}) \\d{{4}}")) ~
as.Date(glue("01 {string}"), format = "%d %B %Y"),
# else check for dates in standard format (e.g. 30/01/2018)
str_detect(string, "\\d{1,2}/\\d{2}/\\d{4}") ~
as.Date(string, format = "%d/%m/%Y"),
# else check for dates in standard format (e.g. 30-01-2018)
# by converting to date, if it fails, the month and the day are
# in exchanged order (see next case)
# of course we can never be really sure what an author meant when
# entering e.g. 06-06-2018
str_detect(string, "\\d{1,2}-\\d{2}-\\d{4}") &
!is.na(as.Date(string, format = "%m-%d-%Y")) ~
as.Date(string, format = "%m-%d-%Y"),
# else check for dates where the day comes first (e.g. 30-01-2018)
str_detect(string, "\\d{2}-\\d{2}-\\d{4}") ~
as.Date(string, format = "%d-%m-%Y"),
# else check for dates in ISO order (e.g. 2014-03-17)
str_detect(string, "\\d{4}-\\d{2}-\\d{2}") ~
as.Date(string, format = "%Y-%m-%d"),
# check if is month/year (e.g. 11/2015)
str_detect(string, "\\d{2}/\\d{4}") ~
as.Date(glue("01/{string}"), format = "%d/%m/%Y")
)
}
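# quick spot checks of the converter (illustrative only; results assume the
# "C" time locale set above)
permissive_date_converter("May 8, 2004") # -> as.Date("2004-05-08")
permissive_date_converter("May 2004")    # -> as.Date("2004-05-01")
permissive_date_converter("11/2015")     # -> as.Date("2015-11-01")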
all_trials <- bind_rows(
# NCT
nct_trials %>%
mutate(
# generate boolean based on multiple possible conditions
results_submitted = (
clinical_results_present |
!is.na(results_first_posted) |
!is.na(submitted_pending_results) |
!is.na(why_stopped)
),
publication_found = id %in% nct_citations$id,
# convert to correct data type
size = as.numeric(size),
study_first_submitted_qc = as.Date(
study_first_submitted_qc,
format = "%B %e, %Y"
),
study_type = ifelse(
str_detect(
study_type,
regex("^(intervent|expanded)", ignore_case = TRUE)
),
"interventional",
"observational"
),
source = "NCT"
) %>%
rowwise() %>%
# permissive date converter is not able to process whole vectors/columns
mutate(
start_date =
permissive_date_converter(start_date),
primary_completion_date =
permissive_date_converter(primary_completion_date)
) %>%
# join lead_sponsor as primary_sponsor, there's one for each trial
left_join(
nct_sponsors %>%
filter(type == "lead_sponsor") %>%
select(id, primary_sponsor = sponsor),
by = "id"
) %>%
select(
unique_id,
primary = id,
size,
study_type,
status = overall_status,
primary_sponsor,
leading_institution = official_affiliation,
registration_date = study_first_submitted_qc,
start_date,
end_date = primary_completion_date,
results_submitted,
publication_found,
source
),
# DRKS
drks_trials %>%
mutate(
# add column for publication if one was uploaded
has_publication =
grepl("Veröffentlichung|Abstract zur Studie|", publication.category0) |
grepl("Veröffentlichung|Abstract zur Studie|", publication.category1) |
grepl("Veröffentlichung|Abstract zur Studie|", publication.category2) |
grepl("Veröffentlichung|Abstract zur Studie|", publication.category3),
# convert to correct data type, faulty end dates [---]* will be NA
target_size = as.numeric(target_size),
registration_date = as.Date(registration_date, format = "%d.%m.%Y"),
start_date = as.Date(start_date, format = "%d.%m.%Y"),
study_end = as.Date(study_end, format = "%d.%m.%Y"),
study_type = ifelse(
str_detect(
study_type,
regex("^intervent", ignore_case = TRUE)
),
"interventional",
"observational"
),
source = "DRKS"
) %>%
select(
unique_id,
primary = id,
size = target_size,
study_type,
primary_sponsor = address.affiliation0,
leading_institution = address.affiliation1,
registration_date,
start_date,
end_date = study_end,
results_submitted = has_publication,
source
),
# ICTRP
ictrp_trials %>%
mutate(
# convert to correct data type
size = as.numeric(size),
registration_date = as.Date(registration_date, format = "%d/%m/%Y"),
has_results = has_results == 1,
# take only first entry of semicolon-separated list
affiliation = sub("(.*?);.*", "\\1", affiliation),
study_type = ifelse(
str_detect(
study_type,
regex("^intervent", ignore_case = TRUE)
),
"interventional",
"observational"
),
source = "ICTRP"
) %>%
rowwise() %>%
# permissive date converter is not able to process whole vectors/columns
mutate(
enrollment_date = permissive_date_converter(enrollment_date)
) %>%
select(
unique_id,
primary = id,
size,
study_type,
status = trial_status,
primary_sponsor,
leading_institution = affiliation,
registration_date,
start_date = enrollment_date,
end_date = max_end_date,
results_submitted = has_results,
source
)
) %>%
ungroup() %>%
# convert to factor
mutate(study_type = factor(study_type))
## Warning in evalq(as.numeric(target_size), <environment>): NAs introduced
## by coercion
# restore locale
Sys.setlocale("LC_TIME", locale_for_later_restore)
## [1] "de_CH.UTF-8"
# clean up
rm(
locale_for_later_restore,
english_months,
permissive_date_converter,
nct_citations,
nct_locations,
nct_officials,
nct_sponsors,
nct_trials,
drks_trials,
ictrp_trials
)
knitr::kable(
all_trials %>%
list_coverage()
)
column | coverage |
---|---|
unique_id | 100.0% |
primary | 100.0% |
study_type | 100.0% |
primary_sponsor | 100.0% |
registration_date | 100.0% |
results_submitted | 100.0% |
source | 100.0% |
start_date | 99.1% |
size | 98.8% |
status | 93.7% |
leading_institution | 89.7% |
end_date | 86.2% |
publication_found | 80.2% |
# clean up
rm(list_coverage)
At this point we still have several different wordings for the status of a trial:
knitr::kable(
all_trials %>%
group_by(source, status) %>%
tally() %>%
arrange(desc(n))
)
source | status | n |
---|---|---|
NCT | Completed | 3576 |
NCT | Recruiting | 1256 |
ICTRP | Ongoing | 840 |
DRKS | NA | 520 |
NCT | Active, not recruiting | 497 |
NCT | Unknown status | 483 |
NCT | Terminated | 425 |
ICTRP | Completed, but no date | 203 |
NCT | Not yet recruiting | 153 |
NCT | Enrolling by invitation | 77 |
NCT | Withdrawn | 73 |
ICTRP | Suspended, withdrawn, not authorised or prohibited | 60 |
NCT | Suspended | 15 |
NCT | No longer available | 7 |
NCT | Available | 6 |
NCT | Approved for marketing | 5 |
ICTRP | No trial status on register | 3 |
NCT | Temporarily not available | 1 |
We simplify these into the following groups:
simplified_status <- c(
"ICTRP: Completed, but no date" = "Completed",
"ICTRP: No trial status on register" = "Unknown",
"ICTRP: Ongoing" = "Ongoing",
"ICTRP: Suspended, withdrawn, not authorised or prohibited" = "Unknown",
"NCT: Active, not recruiting" = "Ongoing",
"NCT: Completed" = "Completed",
"NCT: Enrolling by invitation" = "Ongoing",
"NCT: Not yet recruiting" = "Ongoing",
"NCT: Recruiting" = "Ongoing",
"NCT: Suspended" = "Ongoing",
"NCT: Terminated" = "Terminated",
"NCT: Unknown status" = "Unknown",
"NCT: Withdrawn" = "Terminated",
"NCT: Available" = "Ongoing",
"NCT: Temporarily not available" = "Ongoing",
"NCT: No longer available" = "Completed",
"NCT: Approved for marketing" = "Completed",
"DRKS: NA" = "Unknown"
)
knitr::kable(
simplified_status
)
original status | simplified status |
---|---|
ICTRP: Completed, but no date | Completed |
ICTRP: No trial status on register | Unknown |
ICTRP: Ongoing | Ongoing |
ICTRP: Suspended, withdrawn, not authorised or prohibited | Unknown |
NCT: Active, not recruiting | Ongoing |
NCT: Completed | Completed |
NCT: Enrolling by invitation | Ongoing |
NCT: Not yet recruiting | Ongoing |
NCT: Recruiting | Ongoing |
NCT: Suspended | Ongoing |
NCT: Terminated | Terminated |
NCT: Unknown status | Unknown |
NCT: Withdrawn | Terminated |
NCT: Available | Ongoing |
NCT: Temporarily not available | Ongoing |
NCT: No longer available | Completed |
NCT: Approved for marketing | Completed |
DRKS: NA | Unknown |
# simplify status
if (nrow(all_trials %>% filter(status == "Recruiting")) > 0) {
all_trials %<>% mutate(
status = glue("{source}: {status}"),
status = str_replace_all(status, simplified_status),
status = factor(status)
)
}
rm(simplified_status)
That we transform “Suspended” to “Ongoing” is not very precise, but we are mainly interested in trials that are completed.
ggplot(
data = all_trials %>%
distinct(unique_id, .keep_all = TRUE),
aes(
x = start_date,
color = source
)
) +
geom_freqpoly(bins = 50) +
# we have some weird dates back in 1900-01-01 (obviously errors)
xlim(as.Date("1990-01-01"), as.Date("2020-01-01")) +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
labs(
title = "When were the trials started?",
x = "Start date",
y = "Number of studies"
)
## Warning: Removed 62 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing missing values (geom_path).
ggplot(
all_trials %>%
group_by(source, status) %>%
tally(),
aes(
x = source,
fill = status,
y = n
)
) +
geom_bar(stat = "identity") +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(
x = "Registry",
y = "Number of trials",
fill = "Trial Status",
title = "How many trials have we per status and registry?"
)
Read in the txt files exported by the Cochrane website and keep all entries with a value in the fields publication type `PT`, Embase ID `EM`, PubMed ID `PM` or Digital Object Identifier `DOI`. Then apply a regex search to look for NCT, EUCTR or DRKS IDs. If an ID is found, the corresponding trial is treated as `publication_found`.
The txt files are not included in this git repo. Read the chapter data sources and download your own version of the files if you wish to reproduce the report, or work with the provided `rdata` file.
# define regex patterns for different register ids
ids_regex_patterns <- c(
"NCT" = "NCT\\d{8}",
"EUCTR" = "\\d{4}\\s?\\-\\s?\\d{6}\\s?\\-\\s?\\d{2}",
"DRKS" = "DRKS\\d{8}"
)
# load data from rdata file if it is present
if (file.exists("rdata/ids_in_cochrane.rdata")) {
load(file = "rdata/ids_in_cochrane.rdata")
} else {
# Recalculate.
# Define files that need to be parsed
cochrane_files <- list.files("input/ignore/cochrane", full.names = TRUE)
# paste them together with three newlines so all citations get split correctly
cochrane_concatenated <- cochrane_files %>%
map(~ read_file(.)) %>%
reduce(paste0, collapse = "\n\n\n")
# then split again by three newlines
cochrane_chunks <- str_split(cochrane_concatenated, "\n\n\n")
# by now the variable is a list with one child, a char vector
# with as.list/unlist we can untangle that and get a regular list
cochrane_chunks <- as.list(unlist(cochrane_chunks))
# map_over_chunks_in_batch gets called with a batch,
# but what we need to analyse is what I call one chunk,
# meaning one citation consisting of a Record iterator and some lines
map_over_chunks_in_batch <- function(batch) {
# that's why we split it up a second time and parallelize it
sub_batch_size <- 2000
sub_split_factor <- as.factor(
rep_len(
1:ceiling(length(batch) / sub_batch_size),
length.out = length(batch)
)
)
foreach(
chunk = split(batch, sub_split_factor),
.inorder = FALSE,
.combine = rbind
) %dopar%
get_dataframe_from_chunk(chunk)
}
get_dataframe_from_chunk <- function(chunks) {
# return data frame of id, key, value triples
chunks %>%
map_df(function(chunk) {
# extract id from top of chunk
id_total <- str_match(chunk, "^Record #(\\d*) of (\\d*)")
id <- id_total[1, 2]
total <- id_total[1, 3]
unlist(str_split(chunk, "\n")) %>%
map_df(function(line) {
if (str_detect(line, "^(\\w{2,3}): ")) {
key_value <- str_match(line, "^(\\w{2,3}): (.*)")
key <- key_value[1, 2]
value <- key_value[1, 3]
# append to a super tidy (narrow) data frame
data.frame(
# make new id by concatenating total rows in file and iterator
"id" = paste0(id, "-", total),
"key" = key,
"value" = value,
stringsAsFactors = FALSE
)
}
})
})
}
# map over all lines of the textfile and bind results as data frame
batch_size <- 20000
split_factor <- as.factor(
rep_len(
1:ceiling(length(cochrane_chunks) / batch_size),
length.out = length(cochrane_chunks)
)
)
glue("Now extracting ids from Cochrane Library Export")
duration <- system.time(
cochrane_publications <- foreach(
batch = split(cochrane_chunks, split_factor),
.inorder = FALSE,
.combine = rbind
) %dopar%
map_over_chunks_in_batch(batch)
)
glue("Extraction took {round(duration[3] / 60, 1)} minutes.")
# clean up
rm(
cochrane_files,
cochrane_chunks,
cochrane_concatenated,
batch_size,
split_factor,
duration,
get_dataframe_from_chunk,
map_over_chunks_in_batch
)
# the resulting df has about 2.3 million rows and 3 columns
# extract publication identifiers for later joining
publication_info <- cochrane_publications %>%
filter(str_detect(key, "PT|DOI|PM|XR")) %>%
distinct(id, key, .keep_all = TRUE) %>%
spread(key, value) %>%
rename(export_id = id)
# make new column with all ids found in Abstracts (AB)
ids_in_cochrane <- cochrane_publications %>%
filter(key == "AB") %>%
mutate(
# look for all ids by concatenating regex patterns above
ids = str_extract_all(
value,
regex(
ids_regex_patterns %>%
paste(collapse = "|"),
ignore_case = TRUE
)
)
) %>%
select(-value) %>%
unnest(ids) %>%
rename(export_id = id) %>%
select(export_id, id = ids) %>%
# remove unnecessary stuff
mutate(
id = toupper(id),
id = str_replace_all(id, "NUMBER:? ?|NO:? ?", ""),
id = str_replace_all(id, "\\s", ""),
# get simple EUCTR ID out of the id if present
id = ifelse(
str_detect(id, ids_regex_patterns[["EUCTR"]]),
str_extract(id, ids_regex_patterns[["EUCTR"]]),
id
)
) %>%
left_join(
publication_info, by = "export_id"
) %>%
# convert pubmed/embase ids to ints for better compatibility later
mutate(
PM = as.numeric(
str_replace_all(PM, "PUBMED ", "")
),
XR = as.numeric(
str_replace_all(XR, "EMBASE ", "")
)
)
# clean up
rm(publication_info)
# save to rdata file
save(ids_in_cochrane, file = "rdata/ids_in_cochrane.rdata")
# clean up
rm(cochrane_publications)
}
With this method we find 27279 different trial IDs. This number is so high because the export is not limited to trials conducted in Switzerland.
The PubMed trial downloads are the only files used in this report that cannot be found in the git repository, because the XML file generated by the PubMed website is over 1 GB in size. Read the chapter data sources and download your own version of the file if you wish to reproduce the report, or work with the provided `rdata` file.
In PubMed, we are interested in the fields Publisher Item Identifier (`PII`) and Digital Object Identifier (`DOI`). We look in the XML node `DataBank` for linked IDs and search the `abstract` with a regex, looking for IDs. If we find a trial ID together with a PII or DOI, or inside an abstract, we treat the trial as `publication_found`.
# load data from rdata file if it is present
if (file.exists("rdata/ids_in_pubmed.rdata")) {
load(file = "rdata/ids_in_pubmed.rdata")
} else {
pubmed_articles_linked <- read_xml(
"input/ignore/pubmed/pubmed_result_nct_linked.xml"
) %>%
xml_find_all("//PubmedArticle")
pubmed_articles_others <- read_xml(
"input/ignore/pubmed/pubmed_result_all_others.xml"
) %>%
xml_find_all("//PubmedArticle")
# read in parallel
batch_size <- 18000
split_factor <- as.factor(
rep_len(
1:ceiling(length(pubmed_articles_linked) / batch_size),
length.out = length(pubmed_articles_linked)
)
)
glue("Now extracting ids from Pubmed Linked Export")
duration <- system.time(
ids_in_pubmed_linked <- foreach(
node = split(pubmed_articles_linked, split_factor),
.inorder = FALSE,
.combine = rbind
) %dopar%
data.frame(
"PM" =
xml_integer(xml_find_first(node, ".//PMID")),
"PII" =
xml_text(xml_find_first(node, ".//ELocationID[@EIdType='pii']")),
"DOI" =
xml_text(xml_find_first(node, ".//ELocationID[@EIdType='doi']")),
"databank" =
xml_text(xml_find_first(node, ".//DataBank")),
"abstract" =
xml_text(xml_find_first( node, ".//Abstract")),
stringsAsFactors = FALSE
)
)
glue("Extraction took {round(duration[3] / 60, 1)} minutes.")
# repeat for the others (unfortunately two xml files cannot be joined easily,
# so we process the second export separately and bind the rows afterwards)
batch_size <- 2500
split_factor <- as.factor(
rep_len(
1:ceiling(length(pubmed_articles_others) / batch_size),
length.out = length(pubmed_articles_others)
)
)
glue("Now extracting ids from Pubmed Searched Export")
duration <- system.time(
ids_in_pubmed_others <- foreach(
node = split(pubmed_articles_others, split_factor),
.inorder = FALSE,
.combine = rbind
) %dopar%
data.frame(
"PM" =
xml_integer(xml_find_first(node, ".//PMID")),
"PII" =
xml_text(xml_find_first(node, ".//ELocationID[@EIdType='pii']")),
"DOI" =
xml_text(xml_find_first(node, ".//ELocationID[@EIdType='doi']")),
"databank" =
xml_text(xml_find_first(node, ".//DataBank")),
"abstract" =
xml_text(xml_find_first( node, ".//Abstract")),
stringsAsFactors = FALSE
)
)
glue("Extraction took {duration[3]} minutes.")
# bind together
ids_in_pubmed <- bind_rows(
ids_in_pubmed_linked,
ids_in_pubmed_others
)
ids_in_pubmed %<>%
# look for all ids by concatenating regex patterns above
mutate(
abstract = str_extract_all(
abstract,
regex(
ids_regex_patterns %>%
paste(collapse = "|"),
ignore_case = TRUE
)
)
) %>%
# extract all major ids from databank node
mutate(nct_id = str_extract_all(
databank,
ids_regex_patterns[["NCT"]]
)) %>%
mutate(euctr_id = str_extract_all(
databank,
ids_regex_patterns[["EUCTR"]]
)) %>%
mutate(drks_id = str_extract_all(
databank,
ids_regex_patterns[["DRKS"]]
)) %>%
select(-databank) %>%
# split into rows
gather("register", "ids", abstract:drks_id) %>%
# unnest those with multiple ids in one row
unnest(ids) %>%
rename(id = ids) %>%
# filter out empty ones
filter(!is.na(id)) %>%
# filter out those without any publication id
filter(!is.na(PII) | !is.na(DOI))
# clean up
rm(
pubmed_articles_linked,
pubmed_articles_others,
ids_in_pubmed_linked,
ids_in_pubmed_others,
batch_size,
split_factor,
duration
)
# save to rdata file
save(ids_in_pubmed, file = "rdata/ids_in_pubmed.rdata")
}
With this method we find 49009 different trial IDs. This number is so high because the export is not limited to trials conducted in Switzerland.
all_publications <- bind_rows(
ids_in_cochrane %>%
# keep only one row per trial id
distinct(id, .keep_all = TRUE) %>%
mutate(
source = "Cochrane"
) %>%
select(
trial_id = id,
DOI,
pubmed_id = PM,
embase_id = XR,
type = PT,
source
),
ids_in_pubmed %>%
# keep only one row per trial id
distinct(id, .keep_all = TRUE) %>%
mutate(
source = "Pubmed"
) %>%
select(
trial_id = id,
DOI,
pubmed_id = PM,
PII,
source
)
)
# clean up
rm(
ids_in_cochrane,
ids_in_pubmed
)
overlap <- all_publications %>%
group_by(trial_id) %>%
summarise(n = n()) %>%
filter(n > 1)
If we combine the two publication data sources, we have a total of 53303 primary IDs for which we now know results were published. There’s a pretty big overlap: 46.9% of the trial IDs we find in PubMed are also in Cochrane, which makes sense, as abstracts from PubMed can also be found in the Cochrane Library.
# clean up
rm(overlap)
Now we go through all trials from above and check whether the ID was found in the two publication sources.
trials_with_publications_found <- all_trials %>%
left_join(
all_publications %>%
select(trial_id, publication = source),
by = c("primary" = "trial_id")
) %>%
group_by(unique_id) %>%
# if, per unique id, at least one of the rows has a source, the trial has
# a corresponding publication
summarise(publication_found = sum(!is.na(publication)) > 0) %>%
# now we join the linked publications from the registers
left_join(
all_trials %>%
select(unique_id, results_submitted),
by = "unique_id"
) %>%
# add booleans for those that already had a publication linked
# and one for those that we were able to find in the publications
# and one for both to see how big the overlap is
mutate(
both = (results_submitted & publication_found),
either = results_submitted | publication_found,
none = (!results_submitted & !publication_found)
)
knitr::kable(
results <- trials_with_publications_found %>%
select(-unique_id) %>%
summarize_all(funs(sum(.) / length(.))) %>%
mutate_all(percent)
)
publication_found | results_submitted | both | either | none |
---|---|---|---|---|
30.0% | 24.1% | 11.0% | 43.1% | 56.9% |
Looks like we found publications for another 30.0% of our trials! So far we can say that 43.1% of the 7506 Swiss trials have led to some form of publication; for the remaining 56.9%, no publication was found.
all_trials %<>%
# replace NAs with FALSE
mutate(publication_found = ifelse(
is.na(publication_found),
FALSE,
publication_found
)) %>%
left_join(
trials_with_publications_found %>%
select(unique_id, publication_found_2 = publication_found),
by = "unique_id"
) %>%
# join newly calculated field publication_found as _2 because in nct
# a publication_found was already present. So we combine the two with an
# OR condition and delete the second one
mutate(
publication_found = publication_found | publication_found_2
) %>%
select(-publication_found_2)
# clean up
rm(results, trials_with_publications_found)
Until now we did not distinguish between finished and unfinished trials. Also, we did not add any additional time period after the finish dates during which sponsors are able to write a publication. Let’s deal with this now.
We define a time span of ten years to focus on: the years 2007 to 2016. It is now February 2019, which means the sponsors had over two years to upload the results of a trial after they labeled it as finished.
The provided end date of a trial must be between 1 January 2007 and 31 December 2016.
relevant_trials <- all_trials %>%
filter(
end_date >= as.Date("2007-01-01") &
end_date <= as.Date("2016-12-31")
) %>%
distinct(unique_id, .keep_all = TRUE)
Use the following chunk if you wish to analyze only interventional studies.
# currently this chunk is NOT evaluated:
relevant_trials %<>% filter(study_type == "interventional")
Of the originally 7506 trials, 3612 remain that meet our criteria.
knitr::kable(
all_trials %>%
distinct(unique_id, .keep_all = TRUE) %>%
group_by(source) %>%
tally() %>%
rename(before = n) %>%
left_join(
relevant_trials %>%
group_by(source) %>%
tally() %>%
rename(after = n),
by = "source"
)
)
source | before | after |
---|---|---|
DRKS | 330 | 59 |
ICTRP | 602 | 237 |
NCT | 6574 | 3316 |
We realized that there are too many trial publications that do not contain the trial ID in their abstract and are thus not found via the methodology above. For that reason we additionally scrape Google Scholar search results to get a more complete picture.
Google Scholar only lets us scrape about 30 primary IDs; after that, the IP gets blocked. For this reason the next code chunk is normally not executed (if the `google_scholar.rdata` file is present). You need to be on a computer that regularly gets a new IP to scrape the results again. We did this once on the 15th of January 2019 and saved the results into the `google_scholar.rdata` file.
if (file.exists("rdata/google_scholar.rdata")) {
# if rdata file is present, do not scrape again!
load("rdata/google_scholar.rdata")
} else if (!exists("google_scholar")) {
# initialize new empty data frame to fill
google_scholar <- data.frame()
# create safe walk function that does not break on errors
safely_walk <- safely(walk)
# redirect prints to file
sink("scrape.log", append = TRUE)
# repeat as long as there are IDs not yet scraped
while (relevant_trials %>%
distinct(primary) %>%
filter(!primary %in% google_scholar$primary) %>%
nrow() > 0) {
# safely walk over relevant trials
relevant_trials %>%
distinct(primary) %>%
# but only iterate over those that haven't been scraped
filter(!primary %in% google_scholar$primary) %>%
t() %>%
as.list() %>%
safely_walk(function(primary_id) {
# logging
print(paste(Sys.time(), "try to scrape", primary_id))
error <- FALSE
# for euctr ids, leave out the country code and the prefix
search_query <- ifelse(
str_detect(primary_id, "\\d{4}-\\d{6}-\\d{2}"),
str_extract(primary_id, "\\d{4}-\\d{6}-\\d{2}"),
primary_id # leave others as they are
)
# search for primary ID on Google Scholar and save server response
page <- tryCatch({
# somehow read_html does sometimes get stuck, it does not throw an
# error but neither does it continue, so we need to cancel it after
# 15 seconds and throw an error
withTimeout({
read_html(
paste0("https://scholar.google.ch/scholar?q=", search_query),
verbose = T)
},
timeout = 15,
onTimeout = "error"
)
},
warning = function(warning_message) {
print(paste(Sys.time(), warning_message))
},
error = function(error_message) {
print(paste(
Sys.time(),
"error scraping at",
primary_id,
", error:", error_message
))
# flag the error so the failed response is not processed below
error <<- TRUE
})
# if return value is valid, add it to the data frame
if (error == FALSE) {
print(paste(Sys.time(), "successfully scraped", primary_id))
# extract nodes for search results
search_results <- page %>%
html_nodes(".gs_r.gs_scl")
# extract relevant information of each node and return as data frame
result <- search_results %>%
map_df(function(node) {
data.frame(
"primary" = primary_id,
"scraped_at" = Sys.time(),
"title" =
node %>% html_node("h3.gs_rt") %>% html_text(),
"url" =
node %>% html_node("h3.gs_rt a") %>% html_attr("href"),
"source" =
node %>% html_node(".gs_ri .gs_a") %>% html_text(),
"text" =
node %>% html_node(".gs_ri .gs_rs") %>% html_text(),
"additional_info" =
node %>% html_node(".gs_ri .gs_fl") %>% html_text(),
stringsAsFactors = FALSE
)
})
# append number of pages found and return
result %<>%
mutate(
pages = page %>%
html_nodes(".gs_ico_nav_page") %>%
length() + 1
)
# for no results: append a placeholder row (pages = 0, rest NA)
# so the ID is not scraped again
if (nrow(result) == 0) {
result <- data.frame(
"primary" = primary_id,
"scraped_at" = Sys.time(),
"pages" = 0,
stringsAsFactors = FALSE
)
}
# append to global variable defined above
google_scholar <<- google_scholar %>%
bind_rows(result)
# export as RData to make sure nothing gets lost
save(google_scholar, file = "rdata/google_scholar.rdata")
}
# wait for a random number of seconds (between 5 and 10)
Sys.sleep(runif(1, 5.0, 10.0))
})
}
# finished scraping
print(paste(Sys.time(), "scraping finished"))
# end output redirection
sink(NULL)
}
Now we join in the data scraped from Google Scholar. If any results were found, we label the trial with has_g_scholar_results
. Since we group by three Boolean flags (scholar results, publication found, results submitted), we get the following eight (2^3 = 8) groups:
# add column indicating whether any google scholar results were found
relevant_trials %<>%
left_join(
google_scholar %>%
filter(!is.na(pages)) %>%
distinct(primary, .keep_all = TRUE) %>%
mutate(has_g_scholar_results = !(pages == 0 | is.na(pages))) %>%
select(primary, has_g_scholar_results),
by = "primary"
) %>%
# temporarily change NAs to FALSE but scrape these IDs again later!
mutate(has_g_scholar_results = ifelse(
is.na(has_g_scholar_results),
FALSE,
has_g_scholar_results
))
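As a minimal sketch (illustrative only, assuming google_scholar is still in memory), the IDs that would still need re-scraping can be listed like this:
# IDs in relevant_trials without a valid scrape result yet
relevant_trials %>%
distinct(primary) %>%
filter(!primary %in% filter(google_scholar, !is.na(pages))$primary)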
# filter out some groups to get a better understanding of how well the google
# scholar search worked
knitr::kable(
relevant_trials %>%
group_by(has_g_scholar_results, publication_found, results_submitted) %>%
tally() %>%
arrange(has_g_scholar_results, publication_found, results_submitted) %>%
bind_cols(
data.frame(description = c(
"No results at all",
"No scholar search results, but results found in pubmed/cochrane",
"No scholar search results, but results uploaded to registry",
paste("No scholar search results, but results uploaded to registry",
"plus found in pubmed/cochrane"),
paste("Has scholar search results, but neither results uploaded to",
"registry nor found in pubmed/cochrane"),
paste("Has scholar search results and results uploaded to registry,",
"but was not found in pubmed/cochrane"),
paste("Has scholar search results and was found in pubmed/cochrane,",
"but no results were uploaded to registry"),
"All conditions met!"
))
) %>%
select(count = n, description)
)
## Adding missing grouping variables: `has_g_scholar_results`, `publication_found`
has_g_scholar_results | publication_found | count | description |
---|---|---|---|
FALSE | FALSE | 728 | No results at all |
FALSE | FALSE | 188 | No scholar search results, but results found in pubmed/cochrane |
FALSE | TRUE | 119 | No scholar search results, but results uploaded to registry |
FALSE | TRUE | 66 | No scholar search results, but results uploaded to registry plus found in pubmed/cochrane |
TRUE | FALSE | 668 | Has scholar search results, but neither results uploaded to registry nor found in pubmed/cochrane |
TRUE | FALSE | 456 | Has scholar search results and results uploaded to registry, but was not found in pubmed/cochrane |
TRUE | TRUE | 857 | Has scholar search results and was found in pubmed/cochrane, but no results were uploaded to registry |
TRUE | TRUE | 530 | All conditions met! |
We identify important institutions by searching for the company or institution name both in the leading-institution column and in the primary-sponsor column. Later we only work with the primary sponsor to calculate publication rates, but in this step we also want a unified version of the names of the leading institutions (at least those that are based in Switzerland).
relevant_trials %<>%
mutate(
# we have some ugly html line breaks in the affiliations, let's remove them
primary_sponsor =
str_replace_all(primary_sponsor, "<br ?\\/?>", " "),
leading_institution =
str_replace_all(leading_institution, "<br ?\\/?>", " ")
)
# combine the two columns containing affiliation information (without the NAs)
all_affiliations <- data.frame(affiliation = c(
relevant_trials$primary_sponsor,
relevant_trials$leading_institution
), stringsAsFactors = FALSE) %>%
# remove NAs
filter(!is.na(affiliation))
# create short helper function to reduce repetition (e.g. ignore_case)
my_detect <- function(string, pattern) {
str_detect(string, regex(pattern, ignore_case = TRUE))
}
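# quick illustration (not part of the pipeline): matching is case-insensitive,
# e.g. my_detect("NOVARTIS Pharma Schweiz AG", "novartis") returns TRUE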
# nolint start
all_affiliations %<>%
mutate(
affiliation = as.character(affiliation),
simplified_name = case_when(
# private companies
my_detect(affiliation, "Abbott") ~ "Abbott",
my_detect(affiliation, "AbbVie") ~ "AbbVie",
my_detect(affiliation, "Ablynx") ~ "Ablynx",
my_detect(affiliation, "Actelion") ~ "Actelion",
my_detect(affiliation, "Allergan") ~ "Allergan",
my_detect(affiliation, "Amgen") ~ "Amgen",
my_detect(affiliation, "Ardea Biosciences, Inc.") ~ "Ardea Biosciences",
my_detect(affiliation, "Array BioPharma") ~ "Array BioPharma",
my_detect(affiliation, "Astellas") ~ "Astellas",
my_detect(affiliation, "Astra\\s*Zeneca") ~ "AstraZeneca",
my_detect(affiliation, "Bayer") ~ "Bayer",
my_detect(affiliation, "Biogen") ~ "Biogen",
my_detect(affiliation, "Biotronik") ~ "Biotronik",
my_detect(affiliation, "Bioverativ") ~ "Bioverativ",
my_detect(affiliation, "Boehringer[\\- ]*Ingelheim") ~ "Boehringer Ingelheim",
my_detect(affiliation, "Bristol[\\- ]*Myers") ~ "Bristol-Myers Squibb",
my_detect(affiliation, "Celgene") ~ "Celgene",
my_detect(affiliation, "CSL Behring") ~ "CSL Behring",
my_detect(affiliation, "Cytos Biotechnology") ~ "Cytos Biotechnology",
my_detect(affiliation, "Dr. Falk Pharma") ~ "Dr. Falk Pharma",
my_detect(affiliation, "Daiichi Sank[iy]o") ~ "Daiichi Sankyo",
my_detect(affiliation, "GlaxoSmithKline") ~ "GlaxoSmithKline",
my_detect(affiliation, "Geistlich Pharma") ~ "Geistlich Pharma",
my_detect(affiliation, "Genentech") ~ "Genentech",
my_detect(affiliation, "Gilead Sciences") ~ "Gilead Sciences",
my_detect(affiliation, "IBSA") ~ "IBSA",
my_detect(affiliation, "Incyte") ~ "Incyte",
my_detect(affiliation, "Janssen") ~ "Janssen",
my_detect(affiliation, "Johnson") ~ "Johnson & Johnson",
my_detect(affiliation, "Lilly") ~ "Lilly",
my_detect(affiliation, "Medtronic") ~ "Medtronic",
my_detect(affiliation, "Merck|MSD|Serono") ~ "Merck",
my_detect(affiliation, "Novartis") ~ "Novartis",
my_detect(affiliation, "Novo[\\- ]*Nordisk") ~ "Novo Nordisk",
my_detect(affiliation, "Nestl(é|e)") ~ "Nestlé",
my_detect(affiliation, "Pfizer") ~ "Pfizer",
my_detect(affiliation, "PIQUR Therapeutics") ~ "PIQUR Therapeutics",
my_detect(affiliation, "Roche(?!ster)") ~ "Hoffmann-La Roche",
my_detect(affiliation, "Sanofi") ~ "Sanofi",
my_detect(affiliation, "Laboratorios Servier") ~ "Servier",
my_detect(affiliation, "Institut de Recherches Internationales Servier") ~ "Servier",
my_detect(affiliation, "Shire") ~ "Shire",
my_detect(affiliation, "Sonova") ~ "Sonova",
my_detect(affiliation, "Spirig Pharma") ~ "Spirig Pharma",
my_detect(affiliation, "Takeda|Millennium") ~ "Takeda",
my_detect(affiliation, "\\+1 877 822 9493") ~ "UCB Pharma",
my_detect(affiliation, "UCB Pharma") ~ "UCB Pharma",
my_detect(affiliation, "Vertex Pharmaceuticals") ~ "Vertex Pharmaceuticals",
my_detect(affiliation, "ViiV Healthcare") ~ "ViiV Healthcare",
# university hospitals
my_detect(affiliation, "Universit.*spital.*Z(u|ue|ü)rich") ~ "Unispital Zürich",
my_detect(affiliation, "Z(u|ue|ü)rich.*Universit.*spital") ~ "Unispital Zürich",
my_detect(affiliation, "Univer?s?i?t?.*spital.*Basel") ~ "Unispital Basel",
my_detect(affiliation, "Basel.*Universit?.*spital") ~ "Unispital Basel",
my_detect(affiliation, "Basel Women.*spital|Frauenspital, Basel") ~ "Unispital Basel",
my_detect(affiliation, "Universit.*spital.*(Gen[eè]v|Genf)") ~ "Unispital Genf",
my_detect(affiliation, "[Hôos]*pit.*univers.*(Gen[eè]v|Genf)") ~ "Unispital Genf",
my_detect(affiliation, "(Gen[eè]v|Genf).*Universit.*spital") ~ "Unispital Genf",
my_detect(affiliation, "Lausanne University Hospital") ~ "Unispital Lausanne",
my_detect(affiliation, "(CTU|University Hospital).*Lausanne") ~ "Unispital Lausanne",
my_detect(affiliation, "CHUV|spital.*Universit.*Vaudois|Lausanne Hospitals?") ~ "Unispital Lausanne",
my_detect(affiliation, "Insel[ ho]*spital") ~ "Inselspital Bern",
my_detect(affiliation, "Bern.*Universiti?y Hosp") ~ "Inselspital Bern",
my_detect(affiliation, "University Hosp.*Bern") ~ "Inselspital Bern",
my_detect(affiliation, "Balgrist") ~ "Unispital Balgrist",
# universities
my_detect(affiliation, "Universit[äty of]*Z(u|ue|ü)rich") ~ "Universität Zürich",
my_detect(affiliation, "Z(u|ue|ü)ri?ch.*Universit[äty]") ~ "Universität Zürich",
my_detect(affiliation, "Universit[äty of]*(Geneva|Genf)") ~ "Universität Genf",
my_detect(affiliation, "Universit[äty of]*Berne?") ~ "Universität Bern",
my_detect(affiliation, "Universit?[äty of]*Basel") ~ "Universität Basel",
my_detect(affiliation, "Universit[äty of]*Lausanne") ~ "Universität Lausanne",
my_detect(affiliation, "Universit[äty of]*St\\. ?Gallen?") ~ "Universität St.Gallen",
my_detect(affiliation, "ETH|Eidgenössische Technische Hochschule") ~ "ETH Zürich",
my_detect(affiliation, "Swiss Federal Institute of Technology") ~ "ETH Zürich",
my_detect(affiliation, "EPFL|Ecole Polytechnique Fédérale") ~ "EPF Lausanne",
my_detect(affiliation, "ZHAW|Zurich.*Applied Sciences?") ~ "ZHAW",
my_detect(affiliation, "Zürcher Hochschule für Angewandte Wissenschaften") ~ "ZHAW",
my_detect(affiliation, "(FHS|Fachhochschule) S.*Gallen") ~ "FHS St.Gallen",
my_detect(affiliation, "Gallen University of Applied Sciences?") ~ "FHS St.Gallen",
my_detect(affiliation, "Berner Fachhochschule|Bern University of Applied Sciences?") ~ "BFH Bern",
my_detect(affiliation, "(HES|University of Applied Sciences).*(Valais|Wallis)") ~ "HEVS Hochschule Wallis",
my_detect(affiliation, "University of Applied Sciences of Western Switzerland") ~ "Fachhochschule Westschweiz",
# other hospitals
my_detect(affiliation, "(C|K)anton.*spital.*Frauenfeld") ~ "Kantonsspital Frauenfeld",
my_detect(affiliation, "(C|K)anton.*spital.*S.*Gallen") ~ "Kantonsspital St.Gallen",
my_detect(affiliation, "(Kantonsspital|LUKS) Luzern") ~ "Kantonsspital Luzern",
my_detect(affiliation, "Luzerner Kantonsspital") ~ "Kantonsspital Luzern",
my_detect(affiliation, "(C|K)anton.*spital.*Aarau") ~ "Kantonsspital Aarau",
my_detect(affiliation, "(C|K)anton.*spi?tal.*(Baselland|Liestal)") ~ "Kantonsspital Baselland",
my_detect(affiliation, "Kantonsspital Winterthur") ~ "Kantonsspital Winterthur",
my_detect(affiliation, "Kantonsspital Baden") ~ "Kantonsspital Baden",
my_detect(affiliation, "Kantonsspital Graub.*") ~ "Kantonsspital Graubünden",
my_detect(affiliation, "Osped.*Bellinzona") ~ "Kantonsspital Bellinzona",
my_detect(affiliation, "Osped.*Locarno") ~ "Regionalspital Locarno",
my_detect(affiliation, "[Hôos]*pital.*(Valais|Wallis)") ~ "Regionalspital Locarno",
my_detect(affiliation, "Paraplegi(c|ker) ?(C|Z)ent[erum].*") ~ "Paraplegikerzentrum Nottwil",
my_detect(affiliation, "Schulthess Klinik") ~ "Schulthess Klinik",
my_detect(affiliation, "Claraspital") ~ "St.Claraspital",
my_detect(affiliation, "Hirslanden") ~ "Hirslanden",
my_detect(affiliation, "St.? ?Anna.*Lu[cz]ern") ~ "Hirslanden",
my_detect(affiliation, "Triemli") ~ "Triemli Spital",
# others
my_detect(affiliation, "Swiss Group for Clinical Cancer Research") ~ "SAKK",
my_detect(affiliation, "SNSF|SNF|Swiss National Science Foundation") ~ "Schweizerischer Nationalfonds",
my_detect(affiliation, "Children's Oncology Group") ~ "Children's Oncology Group",
my_detect(affiliation, "European Organisation for Research and Treatment of Cancer") ~ "EORTC",
my_detect(affiliation, "Lymphoma Study Association") ~ "LYSA",
my_detect(affiliation, "International Extranodal Lymphoma Study Group") ~ "IELSG",
TRUE ~ ""
)
) %>%
# change empty strings that were introduced above because of case_when
# constraints to NA
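# (an equivalent alternative would be TRUE ~ NA_character_ directly in the
# case_when above, which would make this extra step unnecessary)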
mutate(
simplified_name = ifelse(
nzchar(simplified_name),
simplified_name,
NA
)
)
# nolint end
# see how many are missing
missings <- all_affiliations %>%
filter(is.na(simplified_name) & nzchar(as.character(affiliation))) %>%
group_by(affiliation) %>%
tally() %>%
arrange(desc(n))
knitr::kable(
missings %>%
top_n(50)
)
# clean up
rm(my_detect, missings)
We were able to assign 65.8% of the affiliations. That does not sound like a whole lot, but when we inspect the biggest missing ones, we can see that they are not based in Switzerland.
We differentiate between private and public institutions. We did the classification by hand:
# read in manual classification
public_or_private <- read_csv("input/public_private.csv", col_types = cols())
knitr::kable(
public_or_private
)
simple_name | group |
---|---|
Abbott | private |
AbbVie | private |
Ablynx | private |
Actelion | private |
Allergan | private |
Amgen | private |
Ardea Biosciences | private |
Array BioPharma | private |
Astellas | private |
AstraZeneca | private |
Bayer | private |
Biogen | private |
Biotronik | private |
Bioverativ | private |
Boehringer Ingelheim | private |
Bristol-Myers Squibb | private |
Celgene | private |
CSL Behring | private |
Cytos Biotechnology | private |
Daiichi Sankyo | private |
Dr. Falk Pharma | private |
Geistlich Pharma | private |
Genentech | private |
Gilead Sciences | private |
GlaxoSmithKline | private |
Hoffmann-La Roche | private |
IBSA | private |
Incyte | private |
Janssen | private |
Johnson & Johnson | private |
Lilly | private |
Medtronic | private |
Merck | private |
Nestlé | private |
Novartis | private |
Novo Nordisk | private |
Pfizer | private |
PIQUR Therapeutics | private |
Sanofi | private |
Servier | private |
Shire | private |
Sonova | private |
Spirig Pharma | private |
Takeda | private |
UCB Pharma | private |
Vertex Pharmaceuticals | private |
ViiV Healthcare | private |
Paraplegikerzentrum Nottwil | private |
Schulthess Klinik | private |
St.Claraspital | private |
Hirslanden | private |
Unispital Zürich | public |
Unispital Basel | public |
Unispital Genf | public |
Unispital Lausanne | public |
Inselspital Bern | public |
Unispital Balgrist | public |
Universität Zürich | public |
Universität Genf | public |
Universität Bern | public |
Universität Basel | public |
Universität Lausanne | public |
Universität St.Gallen | public |
ETH Zürich | public |
EPF Lausanne | public |
ZHAW | public |
FHS St.Gallen | public |
BFH Bern | public |
HEVS Hochschule Wallis | public |
Fachhochschule Westschweiz | public |
Kantonsspital Frauenfeld | public |
Kantonsspital St.Gallen | public |
Kantonsspital Luzern | public |
Kantonsspital Aarau | public |
Kantonsspital Baselland | public |
Kantonsspital Winterthur | public |
Kantonsspital Baden | public |
Kantonsspital Graubünden | public |
Kantonsspital Bellinzona | public |
Regionalspital Locarno | public |
Triemli Spital | public |
Schweizerischer Nationalfonds | public |
SAKK | public |
Children’s Oncology Group | consortium |
EORTC | consortium |
LYSA | consortium |
IELSG | consortium |
# deduplicate for joining
all_affiliations %<>%
distinct(affiliation, .keep_all = TRUE)
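# note: the guard below makes this chunk idempotent, i.e. re-running it
# will not join (and duplicate) the *_simple columns a second time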
if (!"primary_sponsor_simple" %in% colnames(relevant_trials)) {
# take trials and join simplified names
relevant_trials %<>%
# for primary sponsors
left_join(
all_affiliations %>%
rename(primary_sponsor_simple = simplified_name),
by = c("primary_sponsor" = "affiliation")
) %>%
# and for leading institutions
left_join(
all_affiliations %>%
rename(leading_institution_simple = simplified_name),
by = c("leading_institution" = "affiliation")
)
}
Now we calculate publication rates for different groups and companies. We introduce the new field has_publication
, which is true when any of the conditions (linked, found, g_scholar) is met. Then we visualize the different conditions: once for all known affiliations (a bit chaotic) and once for only the major institutions:
# create function that takes the trials and groups them by affiliation
# and returns a summarized and sorted data frame
get_publication_rates <- function(data) {
# as the primary sponsor is responsible for uploading / publishing the results
# of a trial, we only look at the sponsors, not the leading institutions
result <- bind_rows(
data %>%
select(
affiliation = primary_sponsor_simple,
results_submitted,
publication_found,
has_g_scholar_results,
unique_id
)
) %>%
filter(!is.na(affiliation)) %>%
mutate(
submitted_or_found = results_submitted |
publication_found,
# add column: set to TRUE when any of the three conditions is met
has_publication = submitted_or_found |
has_g_scholar_results,
found_or_g_scholar = publication_found |
has_g_scholar_results
) %>%
group_by(affiliation) %>%
summarise(
trials_in_total = n(),
results_submitted = sum(results_submitted, na.rm = TRUE) /
trials_in_total,
publication_found = sum(publication_found, na.rm = TRUE) /
trials_in_total,
submitted_or_found = sum(submitted_or_found, na.rm = TRUE) /
trials_in_total,
has_g_scholar_results = sum(has_g_scholar_results, na.rm = TRUE) /
trials_in_total,
found_or_g_scholar = sum(found_or_g_scholar, na.rm = TRUE) /
trials_in_total,
has_publication = sum(has_publication, na.rm = TRUE) /
trials_in_total
) %>%
ungroup() %>%
arrange(desc(trials_in_total))
# convert affiliations to factors and sort by totals
result %>%
mutate(
affiliation = factor(
affiliation,
levels = result$affiliation
)
)
}
# perform calculations
affiliation_publication_rates <- relevant_trials %>%
filter(study_type == "interventional") %>%
get_publication_rates() %>%
# join definition of public institution vs private organisation
left_join(
public_or_private,
by = c("affiliation" = "simple_name")
)
## Warning: Column `affiliation`/`simple_name` joining factor and character
## vector, coercing into character vector
# create lookup strings for better readability in plot
labels <- c(
"results_submitted" =
"Results were uploaded to the clinical trial registry",
"publication_found" =
"The Trial ID was found in pubmed or cochrane",
"submitted_or_found" =
"Either of the two conditions above did meet",
"has_g_scholar_results" =
"A Google Scholar Search for the Trial ID had 1+ results",
"found_or_g_scholar" =
"It was found in pubmed or cochrane or via Google Scholar",
"has_publication" =
"Any of the three conditions did meet"
)
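How the lookup works: given a named vector, str_replace_all() treats the names as patterns and the values as replacements, for example:
str_replace_all("has_publication", labels)
## [1] "Any of the three conditions was met"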
# plot all, then plot only those with more than 50 trials
list("all", "top_10") %>%
walk(function(all_or_top_10) {
# save plot in variable to generate two different facets
g <- ggplot(
affiliation_publication_rates %>%
# filter to most important sponsors in second iteration
filter(all_or_top_10 == "all" | trials_in_total > 50) %>%
# convert columns to rows so we can display them in different facets
gather("key", "value", results_submitted:has_publication) %>%
mutate(
key = str_replace_all(key, labels),
# sort facets
key = factor(key, levels = labels)
),
aes(
x = trials_in_total,
y = value,
color = value,
label = affiliation
)
) +
geom_text_repel(force = 2) +
geom_point() +
scale_x_log10(
limits = c(5, 500),
breaks = c(5, 10, 20, 50, 100, 200, 500)
) +
scale_y_continuous(
limits = c(0, 1),
labels = percent
) +
scale_colour_gradient(low = "#B2182B", high = "#2166AC") +
theme_minimal() +
theme(legend.position = "none") +
labs(
title = "Relationship between number of publications and rates",
subtitle = "For all trials finished between 2007 and 2016",
x = "Total number of registered trials",
y = "Publication rate"
)
# output
print(g + facet_wrap(~ key, ncol = 1))
# export
ggsave(
paste0("output/publication_rates_", all_or_top_10, ".pdf"),
g + facet_wrap(~ key, ncol = 2),
device = "pdf",
units = "mm",
width = 297,
height = 420
)
rm(g)
})
## Warning: Removed 102 rows containing missing values (geom_text_repel).
## Warning: Removed 102 rows containing missing values (geom_point).
## Warning: Removed 102 rows containing missing values (geom_text_repel).
## Warning: Removed 102 rows containing missing values (geom_point).
calculateWeightedMeanAndTotal <- function(df, grouping_variable) {
df %>%
group_by_at(vars(grouping_variable)) %>%
summarise_at(
vars(results_submitted:has_publication),
funs(weighted.mean(., trials_in_total))
) %>%
# now we lost the trials in total as its own column, join it in again
left_join(
df %>%
group_by_at(vars(grouping_variable)) %>%
summarise(trials_in_total = sum(trials_in_total)),
by = grouping_variable
)
}
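To illustrate the weighting: a sponsor with a 50% publication rate over 10 trials and one with an 80% rate over 40 trials average to (0.5 * 10 + 0.8 * 40) / 50 = 74%, not to the unweighted 65%:
weighted.mean(c(0.5, 0.8), c(10, 40))
## [1] 0.74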
# calculate share of each value by group: private or public
totals_by_group <- affiliation_publication_rates %>%
calculateWeightedMeanAndTotal(grouping_variable = "group")
# calculate totals overall
totals <- totals_by_group %>%
ungroup() %>%
summarise_at(
vars(results_submitted:has_publication),
funs(weighted.mean(., trials_in_total))
) %>%
cbind(
trials_in_total = sum(totals_by_group$trials_in_total), group = "Total"
)
knitr::kable(
totals_by_group %>%
bind_rows(totals) %>%
mutate_at(vars(results_submitted:has_publication), percent)
)
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
group | results_submitted | publication_found | submitted_or_found | has_g_scholar_results | found_or_g_scholar | has_publication | trials_in_total |
---|---|---|---|---|---|---|---|
consortium | 50.7% | 59.2% | 88.7% | 94.4% | 95.8% | 98.6% | 71 |
private | 68.8% | 54.9% | 83.8% | 85.0% | 90.8% | 94.5% | 1049 |
public | 14.1% | 46.7% | 55.3% | 64.5% | 70.4% | 75.5% | 891 |
Total | 44.0% | 51.4% | 71.4% | 76.3% | 81.9% | 86.2% | 2011 |
top_10_publication_rates <- affiliation_publication_rates %>%
calculateWeightedMeanAndTotal(grouping_variable = "affiliation") %>%
filter(trials_in_total > 50) %>%
arrange(desc(has_publication))
knitr::kable(
top_10_publication_rates %>%
mutate_at(vars(results_submitted:has_publication), percent)
)
affiliation | results_submitted | publication_found | submitted_or_found | has_g_scholar_results | found_or_g_scholar | has_publication | trials_in_total |
---|---|---|---|---|---|---|---|
Pfizer | 91.3% | 60.9% | 95.7% | 88.4% | 92.8% | 98.6% | 69 |
Bayer | 62.7% | 59.3% | 83.1% | 79.7% | 89.8% | 98.3% | 59 |
Hoffmann-La Roche | 82.1% | 64.3% | 90.5% | 89.3% | 95.2% | 97.6% | 84 |
Sanofi | 50.0% | 60.0% | 81.7% | 95.0% | 96.7% | 96.7% | 60 |
Novartis | 60.2% | 54.4% | 81.0% | 84.5% | 92.9% | 96.0% | 226 |
Unispital Lausanne | 24.6% | 53.8% | 67.7% | 72.3% | 76.9% | 83.1% | 65 |
Unispital Genf | 14.2% | 42.5% | 49.1% | 67.9% | 74.5% | 77.4% | 106 |
Universität Zürich | 9.5% | 40.7% | 48.9% | 60.6% | 65.6% | 72.9% | 221 |
Unispital Basel | 12.2% | 41.7% | 51.3% | 58.3% | 65.4% | 70.5% | 156 |
Inselspital Bern | 16.5% | 49.5% | 58.3% | 59.2% | 64.1% | 69.9% | 103 |
# export
write_csv(
relevant_trials,
"output/trials.csv"
)
The code in this RMarkdown is linted with the lintr package, which is based on the tidyverse style guide.
lintr::lint("main.Rmd")
## main.Rmd:599:1: style: lines should not be more than 80 characters.
## "https://raw.githubusercontent.com/ebmdatalab/euctr-tracker-data/master/trials.csv",
## ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## main.Rmd:604:1: style: lines should not be more than 80 characters.
## embd_euctr_data <- read_csv("input/ignore/embd/trials.csv", col_types = cols())
## ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~