Notes

This report was generated on 2019-03-05 14:22:20. R version: 3.5.0 on x86_64-apple-darwin15.6.0. For this report, CRAN packages as of 2018-09-01 were used.

R-Script & data

The preprocessing and analysis of the data were conducted in the R project for statistical computing. The RMarkdown script used to generate this document and all the resulting data can be downloaded under this link. Executing main.Rmd reproduces the process described here and regenerates this document. In the course of this, data from the folder input are processed and results are written to output. The folder rdata contains multiple RData files; they are emitted in the chunks of main.Rmd and saved to speed up the process after the first execution. The download stored in input/ignore/pubmed could not be saved in this git repository because of its size (1.4 GB). Find out how to get your own version of the xml file below.

GitHub

The code for the herein described process can also be freely downloaded from https://github.com/srfdata/2019-02-clinical-trials.

Data description of output files

trials.csv

Attribute Type Description
unique_id String Unique identifier assigned by SRF
primary String Trial identification number
size Numeric Number of people participating in the study (total, worldwide)
study_type Factor Type of study, either Interventional or Observational
status Factor Status of the trial, either Completed, Ongoing, Terminated or Unknown
primary_sponsor String The primary sponsor of the trial (responsible for updating register information)
leading_institution String The (scientifically) leading institution of the trial
registration_date Date Date the trial was registered in the registry
start_date Date Date on which the enrollment process started
end_date Date Date on which the enrollment process ended
results_submitted Boolean Equals true if a result was submitted to the trial registry
publication_found Boolean Equals true if the trial ID was found in Cochrane/PubMed
source Factor Shows in which registry this trial was found, either NCT, EUCTR or DRKS
has_g_scholar_results Boolean Equals true if Google Scholar returned any results when queried with the trial ID
primary_sponsor_simple String Simplified name of the primary sponsor (mainly done for Swiss institutions)
leading_institution_simple String Simplified name of the leading institution (mainly done for Swiss institutions)
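
To load the export in R, a minimal readr sketch (an illustration only, assuming the file is written to output/trials.csv and the column names match the table above):

library(readr)
library(dplyr)

trials <- read_csv(
  "output/trials.csv",
  col_types = cols(
    size = col_double(),
    registration_date = col_date(),
    start_date = col_date(),
    end_date = col_date(),
    results_submitted = col_logical(),
    publication_found = col_logical(),
    has_g_scholar_results = col_logical(),
    # everything else (ids, sponsors, institutions) stays character
    .default = col_character()
  )
) %>%
  # convert the documented factor columns
  mutate_at(vars(study_type, status, source), factor)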

Procedure

In the flow chart below, we describe how we tried to find all trials related to Switzerland and how we evaluated whether a publication for each trial exists.

Data sources

Trials

We searched the following registers by selecting Switzerland OR Schweiz as the country of recruitment:

WHO (ICTRP)

http://apps.who.int/trialsearch/AdvSearch.aspx

Downloaded 01/14/2019 at 12:35 CET by setting the recruitment status to ALL and choosing Export all trials to XML.

ClinicalTrials.gov

https://clinicaltrials.gov/ct2/results?cond=&term=Switzerland&cntry=&state=&city=&dist=

Downloaded 01/14/2019 at 12:31 CET via Download > For Advanced Users: Full Study Record XML Download.

Deutsches Register Klinischer Studien (DRKS)

https://www.drks.de/drks_web/navigate.do?navigationId=search&reset=true

Downloaded 01/24/2019 at 13:20 CET as csv.

Publications

Pubmed

You can query PubMed for all publications with a linked NCT ClinicalTrials.gov ID:

https://www.ncbi.nlm.nih.gov/pubmed/?cmd=Search&term=clinicaltrials.gov%5bsi%5d

We also searched for the following terms and saved the results as xml by choosing Send to > File > XML:

  • EUDRACT*
  • EUCTR*
  • NCT0*
  • DRKS*

https://www.ncbi.nlm.nih.gov/pubmed?term=(((EUDRACT*)%20OR%20EUCTR*)%20OR%20NCT0*)%20OR%20DRKS*

Downloaded 01/14/2019 at 10:56 CET.
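
The same queries can also be run from R; a sketch using the rentrez package (not used in this report, shown only as a scripted alternative to the manual download):

library(rentrez)

# count PubMed records that carry a ClinicalTrials.gov ID in the
# secondary source ID field [si]
linked <- entrez_search(db = "pubmed", term = "clinicaltrials.gov[si]", retmax = 0)
linked$count

# the wildcard search for registry IDs used above
searched <- entrez_search(
  db = "pubmed",
  term = "(((EUDRACT*) OR EUCTR*) OR NCT0*) OR DRKS*",
  retmax = 0
)
searched$count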

You can read more about finding results of studies on ClinicalTrials.gov.

Cochrane

http://cochranelibrary-wiley.com/cochranelibrary/search?searchRow.searchOptions.searchProducts=clinicalTrialsDoi

Downloaded 01/14/2019 at 10:25 CET by searching ID wildcards e.g. *NCT0* in ti, ab, kw (Title, Abstract, Keywords).

We limited the search years so that the total number of trials per query stayed below the export limit of 20,000.

Search terms:

  • NCT0*
  • EUCTR*
  • EUDRACT*
  • Switzerland

Read more about the Cochrane CENTRAL database here.

Kofam/BASEC

The Swiss registry for trials is only a secondary register. All the trials in it must also be registered in one of the primary registers (the European, German or American registers). It does not contain any information about whether a trial was started or completed.

Preparations

## [1] "package package:rmarkdown detached"

Define packages

# from https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# if you don't need a package, remove it from here (commenting is probably not sufficient)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(rstudioapi)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(glue) # interpreted string literals
library(ggrepel) # repelling text labels for ggplot
library(xml2) # read xml files
library(readxl) # read xls(x) files
library(rvest) # used for scraping
library(magrittr) # pipes
library(foreach) # parallelization
library(doParallel) # parallelization
library(stringr) # string manipulation
library(scales) # scales for ggplot2
library(jsonlite) # json
library(lubridate) # date handling
library(forcats) # easier factor handling
library(lintr) # code linting, do not remove under any circumstances ;-)
library(styler) # code formatting
library(googlesheets) # googlesheets (replace with tidyverse/googlesheets4 asap)
library(rmarkdown) # needed for automated knitting
library(R.utils) # needed for withTimeout",
  file = "manifest.R"
)

Install packages

# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
  if (!require(devtools)) {
    install.packages("devtools", repos = "http://cran.us.r-project.org")
    require(devtools)
  }
  devtools::install_github("RevolutionAnalytics/checkpoint",
                           ref = "v0.3.2", # could be adapted later;
                           # as of now (beginning of July 2017)
                           # this is the current release on CRAN
                           repos = "http://cran.us.r-project.org")
  require(checkpoint)
}
# nolint start
if (!dir.exists("~/.checkpoint")) {
  dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
           project = path_to_wd,
           verbose = TRUE,
           scanForPackages = TRUE,
           use.knitr = FALSE,
           R.version = R_version)
rm(package_date)

Load packages

source("manifest.R")
unlink("manifest.R")
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_CH.UTF-8/de_CH.UTF-8/de_CH.UTF-8/C/de_CH.UTF-8/de_CH.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] R.utils_2.7.0      R.oo_1.22.0        R.methodsS3_1.7.1 
##  [4] rmarkdown_1.10     googlesheets_0.3.0 styler_1.0.2      
##  [7] lintr_1.0.2        lubridate_1.7.4    jsonlite_1.5      
## [10] scales_1.0.0       doParallel_1.0.11  iterators_1.0.10  
## [13] foreach_1.4.4      magrittr_1.5       rvest_0.3.2       
## [16] readxl_1.1.0       xml2_1.2.0         ggrepel_0.8.0     
## [19] glue_1.3.0         forcats_0.3.0      stringr_1.3.1     
## [22] dplyr_0.7.6        purrr_0.2.5        readr_1.1.1       
## [25] tidyr_0.8.1        tibble_1.4.2       ggplot2_3.0.0     
## [28] tidyverse_1.2.1    rstudioapi_0.7     checkpoint_0.4.0  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.4 haven_1.1.2      lattice_0.20-35  colorspace_1.3-2
##  [5] htmltools_0.3.6  yaml_2.2.0       rlang_0.2.2      pillar_1.3.0    
##  [9] withr_2.1.2      modelr_0.1.2     bindrcpp_0.2.2   bindr_0.1.1     
## [13] plyr_1.8.4       munsell_0.5.0    gtable_0.2.0     cellranger_1.1.0
## [17] codetools_0.2-15 evaluate_0.11    knitr_1.20       rex_1.1.2       
## [21] broom_0.5.0      Rcpp_0.12.18     backports_1.1.2  hms_0.4.2       
## [25] digest_0.6.16    stringi_1.2.4    grid_3.5.0       rprojroot_1.3-2 
## [29] cli_1.0.0        tools_3.5.0      lazyeval_0.2.1   crayon_1.3.4    
## [33] pkgconfig_2.0.2  assertthat_0.2.0 httr_1.3.1       R6_2.2.2        
## [37] nlme_3.1-137     compiler_3.5.0

Read in data

Trials

ClinicalTrials.gov

We read the data from the files (one xml file per study) and collect it in multiple “lookup tables”, because one trial can have multiple locations and sponsors, i.e. 1:n relationships. You can see which nodes of the xml file we use in the code below.

The xml files are not included in this git repo. Read the chapter data sources and download your own version of the files if you wish to reproduce the report, or work with the provided rdata files.
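
To illustrate the 1:n structure (hypothetical IDs and values; the real data frames are built in the chunk below), joining a lookup table back onto the main frame duplicates the trial row once per related record:

# illustrative only: one trial with two Swiss locations
demo_trials <- data.frame(
  id = "NCT00000000",
  study_type = "Interventional",
  stringsAsFactors = FALSE
)
demo_locations <- data.frame(
  id = rep("NCT00000000", 2),
  city = c("Zurich", "Basel"),
  stringsAsFactors = FALSE
)
# yields two rows, one per location
dplyr::left_join(demo_trials, demo_locations, by = "id")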

# load data from rdata files if present
if (
  file.exists("rdata/nct_trials.rdata") &
  file.exists("rdata/nct_locations.rdata") &
  file.exists("rdata/nct_sponsors.rdata") &
  file.exists("rdata/nct_citations.rdata") &
  file.exists("rdata/nct_officials.rdata")
) {
  load(file = "rdata/nct_trials.rdata")
  load(file = "rdata/nct_locations.rdata")
  load(file = "rdata/nct_sponsors.rdata")
  load(file = "rdata/nct_citations.rdata")
  load(file = "rdata/nct_officials.rdata")
} else {
  # create new data frame for xmls data
  # one main data frame will contain one row per trial
  nct_trials <- data.frame()
  # the following data frames can contain more than one row per trial
  nct_locations <- data.frame()
  nct_sponsors <- data.frame()
  nct_citations <- data.frame()
  nct_officials <- data.frame()

  # the following function has side effects! It looks up the XML data for a
  # given NCT ID by reading the file "input/...NCTID.xml" and adding it to the
  # data frames above so they can be joined later
  extract_info_from_file <- function(filename) {
    # get xml file
    xml <- read_xml(filename)
    # store id in variable as it's used multiple times
    nct_id <- xml_text(xml_find_first(xml, ".//id_info/nct_id"))
    # append to main data frame
    nct_trials <<- nct_trials %>% bind_rows(
      data.frame(
        "id" = nct_id,
        "secondary_id" = xml_text(
          xml_find_first(xml, ".//id_info/secondary_id")
        ),
        "source" = xml_text(
          xml_find_first(xml, ".//source")
        ),
        "size" = xml_text(
          xml_find_first(xml, ".//enrollment")
        ),
        "study_type" = xml_text(
          xml_find_first(xml, ".//study_type")
        ),
        "overall_status" = xml_text(
          xml_find_first(xml, ".//overall_status")
        ),
        "clinical_results_present" = length( # if node is present
          xml_find_all(xml, ".//clinical_results")
        ) > 0,
        "start_date" = xml_text(
          xml_find_first(xml, ".//start_date")
        ),
        "completion_date" = xml_text(
          xml_find_first(xml, ".//completion_date")
        ),
        "primary_completion_date" = xml_text(
          xml_find_first(xml, ".//primary_completion_date")
        ),
        "study_first_submitted" = xml_text(
          xml_find_first(xml, ".//study_first_submitted")
        ),
        "study_first_submitted_qc" = xml_text(
          xml_find_first(xml, ".//study_first_submitted_qc")
        ),
        "study_first_posted" = xml_text(
          xml_find_first(xml, ".//study_first_posted")
        ),
        "results_first_submitted" = xml_text(
          xml_find_first(xml, ".//results_first_submitted")
        ),
        "results_first_submitted_qc" = xml_text(
          xml_find_first(xml, ".//results_first_submitted_qc")
        ),
        "results_first_posted" = xml_text(
          xml_find_first(xml, ".//results_first_posted")
        ),
        "submitted_pending_results" = xml_text(
          xml_find_first(xml, ".//pending_results/submitted")
        ),
        "why_stopped" = xml_text(
          xml_find_first(xml, ".//why_stopped")
        ),
        # the following subchildren can only be accessed if the parent
        # node overall_official is present, otherwise they cause errors
        "overall_official" = ifelse(
          length(xml_find_all(xml, ".//overall_official")) > 0,
          xml_text(xml_find_first(xml, ".//overall_official/last_name")),
          NA # if not present
        ),
        "official_role" = ifelse(
          length(xml_find_all(xml, ".//overall_official")) > 0,
          xml_text(xml_find_first(xml, ".//overall_official/role")),
          NA # if not present
        ),
        "official_affiliation" = ifelse(
          length(xml_find_all(xml, ".//overall_official")) > 0,
          xml_text(xml_find_first(xml, ".//overall_official/affiliation")),
          NA # if not present
        ),
        stringsAsFactors = FALSE
      )
    )

    # get locations
    xml_find_all(xml, ".//location") %>%
      # keep only those with country Switzerland
      keep(~ xml_text(xml_find_first(., ".//country")) == "Switzerland") %>%
      walk(function(node) {
        # append to data frame
        nct_locations <<- nct_locations %>% bind_rows(
          data.frame(
            "id" = nct_id,
            "name" = xml_text(xml_find_first(node, ".//name")),
            "city" = xml_text(xml_find_first(node, ".//city")),
            stringsAsFactors = FALSE
          )
        )
      })

    # get sponsors
    xml_find_all(xml, ".//lead_sponsor|.//collaborator") %>%
      walk(function(node) {
        # append to data frame
        nct_sponsors <<- nct_sponsors %>% bind_rows(
          data.frame(
            "id" = nct_id,
            "sponsor" = xml_text(xml_find_first(node, ".//agency")),
            "type" = xml_name(node),
            "class" = xml_text(xml_find_first(node, ".//agency_class")),
            stringsAsFactors = FALSE
          )
        )
      })

    # get citations
    xml_find_all(xml, ".//results_reference") %>%
      walk(function(node) {
        # append to data frame
        nct_citations <<- nct_citations %>% bind_rows(
          data.frame(
            "id" = nct_id,
            "citation" = xml_text(xml_find_first(node, ".//citation")),
            "pm_id" = xml_text(xml_find_first(node, ".//PMID")),
            stringsAsFactors = FALSE
          )
        )
      })

    # get officials
    xml_find_all(xml, ".//overall_official") %>%
      walk(function(node) {
        # append to data frame
        nct_officials <<- nct_officials %>% bind_rows(
          data.frame(
            "id" = nct_id,
            "name" = xml_text(xml_find_first(node, ".//last_name")),
            "role" = xml_text(xml_find_first(node, ".//role")),
            "affiliation" = xml_text(xml_find_first(node, ".//affiliation")),
            stringsAsFactors = FALSE
          )
        )
      })
  }

  print("Now extracting data from XML files (ClinicalTrials.gov data)…")
  print("Takes about 2.5 minutes")

  # go over all entries
  system.time(
    list.files("input/ignore/clinicaltrials", full.names = TRUE) %>%
      walk(extract_info_from_file)
  )

  # save to rdata
  save(nct_trials, file = "rdata/nct_trials.rdata")
  save(nct_locations, file = "rdata/nct_locations.rdata")
  save(nct_sponsors, file = "rdata/nct_sponsors.rdata")
  save(nct_citations, file = "rdata/nct_citations.rdata")
  save(nct_officials, file = "rdata/nct_officials.rdata")

  # clean up
  rm(extract_info_from_file)

}
## [1] "Now extracting data from XML files (ClinicalTrials.gov data)…"
## [1] "Takes about 2.5 minutes"

Some specifications about dates:

  • study_first_submitted: The date on which the study sponsor or investigator first submitted a study record to ClinicalTrials.gov. There is typically a delay of a few days between the first submitted date and the record’s availability on ClinicalTrials.gov (the first posted date).

  • study_first_submitted_qc: The date on which the study sponsor or investigator first submits a study record that is consistent with National Library of Medicine (NLM) quality control (QC) review criteria. The sponsor or investigator may need to revise and submit a study record one or more times before NLM’s QC review criteria are met. It is the responsibility of the sponsor or investigator to ensure that the study record is consistent with the NLM QC review criteria.

  • primary_completion_date: The date on which the last participant in a clinical study was examined or received an intervention to collect final data for the primary outcome measure. Whether the clinical study ended according to the protocol or was terminated does not affect this date. For clinical studies with more than one primary outcome measure with different completion dates, this term refers to the date on which data collection is completed for all the primary outcome measures. The “estimated” primary completion date is the date that the researchers think will be the primary completion date for the study.

Conclusion: This report uses a trial’s primary completion date as the key criterion to determine whether or not a trial is due to post results.

We followed the reasoning of the report “Clinical Trial Reporting: University Policies and Performance in the UK” (Bristol and London, 6 November 2018): “This approach is likely to slightly overcount due trials. For example, if a trial’s expected primary completion date is extended during the trial due to slower than expected patient recruitment, and university staff fails to update the registry entry accordingly, the expected primary completion date listed in the registry will be further in the past than the actual or currently expected primary completion date. On balance, the approach used here has two significant advantages: In terms of accuracy, the number of trials falsely identified as overdue using this approach is assumed to be substantially lower than the number of trials falsely identified as not yet due when using the conventional approach. In terms of faithfully depicting a university’s registry management performance, this approach is preferable because it will never falsely identify trials as overdue if a university keeps its registry entries up to date. Thus, the approach used here incentivises universities to keep their registry entries up to date. In contrast, the conventional approach creates perverse incentives for trial sponsors to postpone or neglect updating a trial’s status to ‘completed’.”
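
In code, this criterion boils down to comparing the registered primary completion date with a cutoff date. A minimal sketch, assuming the column names used later in this report; flag_due is a hypothetical helper and the cutoff a hypothetical parameter, not values used in the published analysis:

library(dplyr)

# hypothetical cutoff: primary completion date plus a grace period
due_cutoff <- as.Date("2017-01-01")

flag_due <- function(trials, cutoff = due_cutoff) {
  trials %>%
    # a trial is due if its (primary) end date lies before the cutoff;
    # trials without an end date cannot be evaluated and stay NA
    mutate(due = end_date < cutoff)
}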

ICTRP
# load data from rdata file if present
if (file.exists("rdata/ictrp_trials.rdata")) {
  load(file = "rdata/ictrp_trials.rdata")
} else {
  # Read file. Caution! This file is 55 MB and contains 7602 trials
  ictrp_xml <- read_xml("input/ictrp/ICTRP-Results.xml")
  ictrp_nodes <- xml_find_all(ictrp_xml, "//Trial")

  print("Now extracting data from XML file (ICTRP data)")

  # In this file there are no n:1 relations, so we can read it in one pass.
  # We split it into batches of 1000 nodes and process them in parallel
  batch_size <- 1000
  split_factor <- as.factor(
    rep_len(
      1:ceiling(length(ictrp_nodes) / batch_size),
      length.out = length(ictrp_nodes)
    )
  )
  system.time(
    ictrp_trials <- foreach(
      node = split(ictrp_nodes, split_factor),
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      data.frame(
        "id" =
          xml_text(xml_find_first(node, ".//TrialID")),
        "secondary" =
          xml_text(xml_find_first(node, ".//Secondary_ID")),
        "size" =
          xml_text(xml_find_first(node, ".//Target_size")),
        "study_type" =
          xml_text(xml_find_first(node, ".//Study_type")),
        "registration_date" =
          xml_text(xml_find_first(node, ".//Date_registration")),
        "enrollment_date" =
          xml_text(xml_find_first(node, ".//Date_enrollement")),
        "primary_sponsor" =
          xml_text(xml_find_first(node, ".//Primary_sponsor")),
        "secondary_sponsor" =
          xml_text(xml_find_first(node, ".//Secondary_Sponsor")),
        "source" =
          xml_text(xml_find_first(node, ".//Source_Register")),
        "type" =
          xml_text(xml_find_first(node, ".//Study_type")),
        "countries" =
          xml_text(xml_find_first(node, ".//Countries")),
        "contact_name" = paste(
          xml_text(xml_find_first(node, ".//Contact_Firstname")),
          xml_text(xml_find_first(node, ".//Contact_Lastname"))
        ),
        "contact_address" =
          xml_text(xml_find_first(node, "./Contact_Address")),
        "affiliation" =
          xml_text(xml_find_first(node, ".//Contact_Affiliation")),
        "support" =
          xml_text(xml_find_first(node, ".//Source_Support")),
        stringsAsFactors = FALSE
      )
  )

  # save to rdata
  save(ictrp_trials, file = "rdata/ictrp_trials.rdata")
  # clean up
  rm(ictrp_xml, ictrp_nodes, batch_size, split_factor)
}
## [1] "Now extracting data from XML file (ICTRP data)"

Unfortunately the ICTRP registry is missing some key information. There is no structured information in the download about the end date of a trial. We have to add another source of data for that information.

EU Trials Tracker

The EBM DataLab hosts a website tracking the trials and publications in the EU Clinical Trials Register EUCTR. You can read more about their methodology here. The tracker also includes a flag for whether results were published in the registry and information about the status of the trial. This data source is said to be updated once a month.

EBM DataLab also collects the end dates of the trials in the different countries (see example here). The earliest date is in the column min_end_date and the latest date in max_end_date. We use max_end_date to evaluate whether the trial has been completed or not.

The EU does not collect information about observational studies. On the website you can read: “The European Union Clinical Trials Register allows you to search for protocol and results information on:

  • interventional clinical trials that are conducted in the European Union (EU) and the European Economic Area (EEA);
  • clinical trials conducted outside the EU / EEA that are linked to European paediatric-medicine development."

We do have information about observational trials from the other two registers though.

# this chunk changes the original data frames, execute it once only
if ("max_end_date" %in% colnames(ictrp_trials)) {
  print("EMBD data already present, this chunk will be skipped")
} else {
  # filter and keep only european registry
  ictrp_trials %<>%
    filter(str_detect(id, "\\d{4}-\\d{6}-\\d{2}")) %>%
    # filter(str_detect(countries, "Switzerland")) %>%
    mutate(simple_id = str_extract(id, "\\d{4}-\\d{6}-\\d{2}"))

  # download file from github
  download.file(
    "https://raw.githubusercontent.com/ebmdatalab/euctr-tracker-data/master/trials.csv",
    "input/ignore/embd/trials.csv"
  )

  # read in
  embd_euctr_data <- read_csv("input/ignore/embd/trials.csv", col_types = cols())

  # status codes
  embd_status_codes <- data.frame(
    "0" = "Ongoing",
    "1" = "Completed, but no date",
    "2" = "Ongoing",
    "3" = "Suspended, withdrawn, not authorised or prohibited",
    "4" = "No trial status on register",
    stringsAsFactors = FALSE
  ) %>%
    gather("code", "status_description") %>%
    mutate(code = as.numeric(str_sub(code, 2)))

  # join min/max end date and the indicator for results
  # this is the range of the end dates over all countries involved. A trial
  # can be completed in one country in 2015 but last until 2016 in another
  ictrp_trials %<>%
    left_join(
      embd_euctr_data %>%
        select(
          simple_id = trial_id,
          min_end_date,
          max_end_date,
          trial_status,
          has_results
        ),
      by = "simple_id"
    ) %>%
    # in the embd data all ids occur twice and get joined twice, remove them
    distinct(simple_id, .keep_all = TRUE) %>%
    # replace status codes with words from data frame defined above
    left_join(
      embd_status_codes,
      by = c("trial_status" = "code")
    ) %>%
    select(-trial_status) %>%
    rename(trial_status = status_description)

  # clean up
  rm(embd_euctr_data, embd_status_codes)
}

100% of all EU trials in ICTRP were also found in the EBM DataLab data from GitHub. For 61.1% of the trials an end date is now available.

Deutsches Register Klinischer Studien (DRKS)
## Warning: Missing column names filled in: 'X322' [322]
## Warning: Duplicated column names deduplicated: 'url' => 'url_1' [320]
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 2)
## Warning: 1 parsing failure.
## # A tibble: 1 x 5
##     row col   expected    actual      file
##   <int> <chr> <chr>       <chr>       <chr>
## 1   209 <NA>  322 columns 336 columns 'input/drks/trials.csv'

Check overlap

Let’s check which of the trial registries contains the most trials and how big the overlap between the different registries is.

We count the following number of unique trial ids:

knitr::kable(data.frame(
  "NCT" = nrow(nct_trials %>% distinct(id)),
  "EUCTR" = nrow(ictrp_trials %>% distinct(id)),
  "DRKS" = nrow(drks_trials %>% distinct(id))
))
NCT EUCTR DRKS
6574 1101 520

The clinicaltrials.gov register has by far the most IDs, so we’ll start from there and try to identify duplicates in the other registers. To mark duplicates, we introduce a custom unique ID created with the digest package and the xxhash32 algorithm.
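
For illustration, the hash is deterministic: the same primary ID always maps to the same 8-character hex string, so it can serve as a stable identifier across registers (hypothetical ID):

digest::digest("NCT00000000", algo = "xxhash32")
# calling it again returns the identical hash
digest::digest("NCT00000000", algo = "xxhash32")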

# load data from rdata files if present
if (
  file.exists("rdata/ictrp_unique_ids.rdata") &
    file.exists("rdata/drks_unique_ids.rdata") &
    file.exists("rdata/nct_unique_ids.rdata")
) {
  load(file = "rdata/ictrp_unique_ids.rdata")
  load(file = "rdata/drks_unique_ids.rdata")
  load(file = "rdata/nct_unique_ids.rdata")
} else {
  # assign unique ids to nct_trials
  nct_unique_ids <- nct_trials %>%
    rowwise() %>%
    select(primary = id, secondary = secondary_id) %>%
    mutate(
      unique_id = digest::digest(primary, algo = "xxhash32")
    )

  join_nct_or_generate_new_id <- function(df) {
    # create regex search for primary OR secondary
    search <- glue("{df$id}|{df$secondary}", .na = "") %>%
      str_replace("\\|$", "") # remove ORs if no secondary present
    if (!is.na(search)) {
      nct_match <- nct_unique_ids %>%
        filter(
          str_detect(primary, search) |
            str_detect(secondary, search)
        )
      # if entry was found, return the corresponding unique id
      if (nrow(nct_match) == 1) {
        return(data.frame(
          "id" = df$id,
          "found" = TRUE,
          "unique_id" = nct_match$unique_id,
          stringsAsFactors = FALSE
        ))
        # if not found: generate a new one
      } else if (nrow(nct_match) == 0) {
        return(data.frame(
          "id" = df$id,
          "found" = FALSE,
          "unique_id" = digest::digest(search, algo = "xxhash32"),
          stringsAsFactors = FALSE
        ))
        # sadly, the ictrp rows are not necessarily unique
        # in that case return NA in the column found and all unique_ids
      } else {
        return(data.frame(
          "id" = df$id,
          "found" = NA,
          "unique_id" = paste(nct_match$unique_id, collapse = " OR "),
          stringsAsFactors = FALSE
        ))
      } # else (not 0 and not 1 entries found)
    } # if id is present
  } # function

  # convert to list of one-rowed data frames with selection of columns
  drks_ids <- split(
    drks_trials %>%
      # keep only secondary ids that are NCT ids
      mutate(
        secondary = str_extract(
          glue("{secondary} {tertiary} {quaternary} {quinary} {senary}"),
          "NCT0\\d{7}"
        )
      ) %>%
      select(id, secondary),
    # by sequencing over all row numbers we split the data frame into
    # a list of tibbles - one for each row, that we can then pass to foreach
    seq(nrow(drks_trials))
  )

  print(glue("Searching for DRKS IDs in the clinicaltrials data set"))
  duration <- system.time(
    drks_unique_ids <- foreach(
      ids = drks_ids,
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      join_nct_or_generate_new_id(ids)
  )

  print(glue(
    "The matching took {round(duration[3])} seconds. \\
    Between DRKS and NCT IDs {duplicates_found} \\
    duplicates were found, that's {percent(share)}.",
    duplicates_found = drks_unique_ids %>%
      filter(found) %>%
      nrow(),
    share = drks_unique_ids %>%
      group_by(found) %>%
      summarise(n = n()) %>%
      mutate(share = n / sum(n)) %>%
      filter(found) %>%
      select(share) %>%
      simplify()
  ))

  # repeat for ictrp ids
  # convert to list of one-rowed data frames with selection of columns
  ictrp_ids <- split(
    ictrp_trials %>%
      # search for the id without EUCTR and country code to broaden the search
      select(id = simple_id, secondary) %>%
      distinct(id, .keep_all = TRUE) %>%
      # keep only secondary ids that are NCT ids
      mutate(secondary = str_extract(secondary, "NCT0\\d{7}")),
    seq(nrow(ictrp_trials))
  )

  print(glue("Searching for ICTRP IDs in the clinicaltrials data set"))
  duration <- system.time(
    ictrp_unique_ids <- foreach(
      ids = ictrp_ids,
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      join_nct_or_generate_new_id(ids)
  )

  print(glue(
    "The matching took {round(duration[3])} seconds. \\
    Between ICTRP and NCT IDs {duplicates_found} \\
    duplicates were found, that's {percent(share)}",
    duplicates_found = ictrp_unique_ids %>%
      filter(found) %>%
      nrow(),
    share = ictrp_unique_ids %>%
      group_by(found) %>%
      summarise(n = n()) %>%
      mutate(share = n / sum(n)) %>%
      filter(found) %>%
      select(share) %>%
      simplify()
  ))

  not_unique <- ictrp_unique_ids %>%
    filter(is.na(found))

  # some ICTRP entries were found more than once, show which
  print(glue(
    "The following {nrow(not_unique)} EUCTR IDs were found in more than one \\
    clinicaltrials.gov entry:"
  ))

  print(knitr::kable(
    not_unique
  ))

  print(glue("We separate them and treat them as different trials"))

  ictrp_unique_ids %<>%
    filter(!is.na(found)) %>%
    bind_rows(
      # put each of the non-unique trials into its own row
      # and bind back to original data frame
      not_unique %>%
        mutate(unique_id = str_split(unique_id, " OR ")) %>%
        unnest(unique_id) %>%
        mutate(found = TRUE)
    )

  # save to rdata
  save(ictrp_unique_ids, file = "rdata/ictrp_unique_ids.rdata")
  save(drks_unique_ids, file = "rdata/drks_unique_ids.rdata")
  save(nct_unique_ids, file = "rdata/nct_unique_ids.rdata")

  # clean up
  rm(
    drks_ids,
    ictrp_ids,
    join_nct_or_generate_new_id,
    duration,
    not_unique
  )
}
## Searching for DRKS IDs in the clinicaltrials data set
## The matching took 39 seconds. Between DRKS and NCT IDs 190 duplicates were found, that's 36.5%.
## Searching for ICTRP IDs in the clinicaltrials data set
## The matching took 83 seconds. Between ICTRP and NCT IDs 495 duplicates were found, that's 45.0%
## The following 4 EUCTR IDs were found in more than one clinicaltrials.gov entry:
## 
## 
## id               found   unique_id                        
## ---------------  ------  ---------------------------------
## 2009-012476-28   NA      b489857b OR e09f835a OR f9792432 
## 2006-004495-13   NA      68f841e5 OR 94e9e897             
## 2011-001729-25   NA      4bc81970 OR 75628864             
## 2007-003013-14   NA      34923059 OR 02733b10             
## We separate them and treat them as different trials
# join NCT unique ids
if (!"unique_id" %in% colnames(nct_trials)) {
  nct_trials %<>%
    left_join(
      nct_unique_ids %>%
        select(-secondary),
      by = c("id" = "primary")
    )
}

# join DRKS unique ids
if (!"unique_id" %in% colnames(drks_trials)) {
  drks_trials %<>%
    left_join(
      drks_unique_ids %>%
        select(-found),
      by = "id"
    )
}

# join ICTRP unique ids
if (!"unique_id" %in% colnames(ictrp_trials)) {
  ictrp_trials %<>%
    left_join(
      ictrp_unique_ids %>%
        select(-found),
      by = c("simple_id" = "id")
    )
}

# clean up
rm(drks_unique_ids, nct_unique_ids, ictrp_unique_ids)

How complete are the data sources?

ICTRP
list_coverage <- function(df) {
  df %>%
    mutate_if(is.character, funs(factor(replace(., . == "[---]*", NA)))) %>%
    summarize_all(funs(sum(!is.na(.)) / length(.))) %>%
    gather("column", "coverage") %>%
    arrange(desc(coverage)) %>%
    mutate(coverage = percent(coverage))
}

knitr::kable(
  ictrp_trials %>%
    list_coverage()
)
column coverage
id 100.0%
secondary 100.0%
size 100.0%
study_type 100.0%
registration_date 100.0%
enrollment_date 100.0%
primary_sponsor 100.0%
secondary_sponsor 100.0%
source 100.0%
type 100.0%
countries 100.0%
contact_name 100.0%
contact_address 100.0%
affiliation 100.0%
support 100.0%
simple_id 100.0%
has_results 100.0%
trial_status 100.0%
unique_id 100.0%
min_end_date 61.3%
max_end_date 61.3%
DRKS
knitr::kable(
  drks_trials %>%
    list_coverage()
)
column coverage
id 100.0%
registration_date 100.0%
investorInitiated 100.0%
study_type 100.0%
national 100.0%
address.type0 100.0%
address.affiliation0 100.0%
unique_id 100.0%
last-update 99.8%
start_date 99.6%
target_size 99.2%
address.type1 98.8%
phase 98.7%
address.lastname1 98.3%
address.affiliation1 97.9%
plannedActual 67.5%
address.land0 67.1%
address.zip0 66.9%
address.city0 66.9%
address.streetAndNo1 66.9%
address.zip1 66.9%
address.city1 66.9%
address.land1 66.9%
address.firstname1 66.7%
address.streetAndNo0 63.8%
secondary 53.8%
address.firstname0 50.4%
address.lastname0 50.4%
firstPartnerPublishDate 41.3%
tertiary 39.4%
address.form1 34.0%
publication.category0 32.1%
address.title1 31.5%
publication.value0 31.2%
study_end 26.9%
address.form0 26.3%
publication.type0 25.8%
address.title0 24.6%
quaternary 22.7%
quinary 15.8%
publication.category1 15.4%
publication.value1 14.8%
publication.type1 11.0%
publication.key0 8.1%
publication.category2 7.7%
publication.value2 7.7%
publication.category3 5.2%
publication.value3 5.0%
publication.type2 4.0%
publication.key1 3.7%
senary 3.1%
publication.category4 2.7%
publication.value4 2.7%
publication.type3 2.5%
publication.key2 1.5%
publication.key3 0.8%
publication.type4 0.8%
publication.key4 0.6%
NCT
knitr::kable(
  nct_trials %>%
    list_coverage()
)
column coverage
id 100.0%
source 100.0%
study_type 100.0%
overall_status 100.0%
clinical_results_present 100.0%
study_first_submitted 100.0%
study_first_submitted_qc 100.0%
study_first_posted 100.0%
unique_id 100.0%
start_date 99.4%
size 98.7%
primary_completion_date 95.1%
completion_date 93.9%
overall_official 87.6%
official_role 87.6%
official_affiliation 87.4%
secondary_id 28.5%
results_first_submitted 12.1%
results_first_submitted_qc 12.1%
results_first_posted 12.1%
why_stopped 6.4%
submitted_pending_results 1.8%

Combine all Trials

NCT

A trial is labeled as results_submitted if one of the following conditions is met:

  • The field results_first_posted has a value OR
  • The field submitted_pending_results has a value OR
  • The field why_stopped has a value

It is labeled as publication_found if:

  • The section “Clinical Results” on clinicaltrials.gov has at least one entry

The field primary_completion_date is used as the end_date.

The study type “expanded access” is treated as “interventional”, so we only have two different types: interventional and observational.

DRKS

A trial is labeled as results_submitted if the following condition is met:

  • One of the publication columns in the downloaded csv contains one of the following terms: Veröffentlichung (publication) OR Abstract zur Studie (abstract about the study)

The field study_end is used as the end_date.

ICTRP

For results_submitted and end_date we use the values the EU Trials Tracker lists in has_results and max_end_date.

# set locale to English month names etc.
locale_for_later_restore <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
## [1] "C"
english_months <- paste0(month.name, collapse = "|")

permissive_date_converter <- function(string) {
  case_when(
    # check for dates with a day in them (e.g. "May 8, 2004")
    str_detect(string, glue("({english_months}) \\d{{1,2}}, \\d{{4}}")) ~
      as.Date(string, format = "%B %e, %Y"),
    # else check for dates without a day (e.g. "May 2004")
    str_detect(string, glue("({english_months}) \\d{{4}}")) ~
      as.Date(glue("01 {string}"), format = "%d %B %Y"),
    # else check for dates in standard format (e.g. 30/01/2018)
    str_detect(string, "\\d{1,2}/\\d{2}/\\d{4}") ~
      as.Date(string, format = "%d/%m/%Y"),
    # else check for dashed dates, trying US order first (e.g. 01-30-2018);
    # if the conversion fails, day and month must be in the other
    # order (see next case)
    # of course we can never be really sure what an author meant
    # when entering an ambiguous date like 06-06-2018
    str_detect(string, "\\d{1,2}-\\d{2}-\\d{4}") &
      !is.na(as.Date(string, format = "%m-%d-%Y")) ~
      as.Date(string, format = "%m-%d-%Y"),
    # else check for dashed dates in day-first order (e.g. 30-01-2018)
    str_detect(string, "\\d{2}-\\d{2}-\\d{4}") ~
      as.Date(string, format = "%d-%m-%Y"),
    # else check for ISO dates (e.g. 2014-03-17)
    str_detect(string, "\\d{4}-\\d{2}-\\d{2}") ~
      as.Date(string, format = "%Y-%m-%d"),
    # check if is month/year (e.g. 11/2015)
    str_detect(string, "\\d{2}/\\d{4}") ~
      as.Date(glue("01/{string}"), format = "%d/%m/%Y")
  )
}
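# quick sanity checks for the converter (illustrative only; with
# LC_TIME set to "C" as above, each call should yield the date shown):
# permissive_date_converter("May 8, 2004") # -> as.Date("2004-05-08")
# permissive_date_converter("May 2004")    # -> as.Date("2004-05-01")
# permissive_date_converter("11/2015")     # -> as.Date("2015-11-01")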

all_trials <- bind_rows(
  # NCT
  nct_trials %>%
    mutate(
      # generate boolean based on multiple possible conditions
      results_submitted = (
        clinical_results_present |
          !is.na(results_first_posted) |
          !is.na(submitted_pending_results) |
          !is.na(why_stopped)
      ),
      publication_found = id %in% nct_citations$id,
      # convert to correct data type
      size = as.numeric(size),
      study_first_submitted_qc = as.Date(
        study_first_submitted_qc,
        format = "%B %e, %Y"
      ),
      study_type = ifelse(
        str_detect(
          study_type,
          regex("^(intervent|expanded)", ignore_case = TRUE)
        ),
        "interventional",
        "observational"
      ),
      source = "NCT"
    ) %>%
    rowwise() %>%
    # permissive date converter is not able to process whole vectors/columns
    mutate(
      start_date =
        permissive_date_converter(start_date),
      primary_completion_date =
        permissive_date_converter(primary_completion_date)
    ) %>%
    # join lead_sponsor as primary_sponsor, there's one for each trial
    left_join(
      nct_sponsors %>%
        filter(type == "lead_sponsor") %>%
        select(id, primary_sponsor = sponsor),
      by = "id"
    ) %>%
    select(
      unique_id,
      primary = id,
      size,
      study_type,
      status = overall_status,
      primary_sponsor,
      leading_institution = official_affiliation,
      registration_date = study_first_submitted_qc,
      start_date,
      end_date = primary_completion_date,
      results_submitted,
      publication_found,
      source
    ),
  # DRKS
  drks_trials %>%
    mutate(
      # add column for publication if one was uploaded
      has_publication =
        grepl("Veröffentlichung|Abstract zur Studie|", publication.category0) |
        grepl("Veröffentlichung|Abstract zur Studie|", publication.category1) |
        grepl("Veröffentlichung|Abstract zur Studie|", publication.category2) |
        grepl("Veröffentlichung|Abstract zur Studie|", publication.category3),
      # convert to correct data type, faulty end dates [---]* will be NA
      target_size = as.numeric(target_size),
      registration_date = as.Date(registration_date, format = "%d.%m.%Y"),
      start_date = as.Date(start_date, format = "%d.%m.%Y"),
      study_end = as.Date(study_end, format = "%d.%m.%Y"),
      study_type = ifelse(
        str_detect(
          study_type,
          regex("^intervent", ignore_case = TRUE)
        ),
        "interventional",
        "observational"
      ),
      source = "DRKS"
    ) %>%
    select(
      unique_id,
      primary = id,
      size = target_size,
      study_type,
      primary_sponsor = address.affiliation0,
      leading_institution = address.affiliation1,
      registration_date,
      start_date,
      end_date = study_end,
      results_submitted = has_publication,
      source
    ),
  # ICTRP
  ictrp_trials %>%
    mutate(
      # convert to correct data type
      size = as.numeric(size),
      registration_date = as.Date(registration_date, format = "%d/%m/%Y"),
      has_results = has_results == 1,
      # take only first entry of semicolon-separated list
      affiliation = sub("(.*?);.*", "\\1", affiliation),
      study_type = ifelse(
        str_detect(
          study_type,
          regex("^intervent", ignore_case = TRUE)
        ),
        "interventional",
        "observational"
      ),
      source = "ICTRP"
    ) %>%
    rowwise() %>%
    # permissive date converter is not able to process whole vectors/columns
    mutate(
      enrollment_date = permissive_date_converter(enrollment_date)
    ) %>%
    select(
      unique_id,
      primary = id,
      size,
      study_type,
      status = trial_status,
      primary_sponsor,
      leading_institution = affiliation,
      registration_date,
      start_date = enrollment_date,
      end_date = max_end_date,
      results_submitted = has_results,
      source
    )
) %>%
  ungroup() %>%
  # convert to factor
  mutate(study_type = factor(study_type))
## Warning in evalq(as.numeric(target_size), <environment>): NAs introduced
## by coercion
# restore locale
Sys.setlocale("LC_TIME", locale_for_later_restore)
## [1] "de_CH.UTF-8"
# clean up
rm(
  locale_for_later_restore,
  english_months,
  permissive_date_converter,
  nct_citations,
  nct_locations,
  nct_officials,
  nct_sponsors,
  nct_trials,
  drks_trials,
  ictrp_trials
)

Check integrity

knitr::kable(
  all_trials %>%
    list_coverage()
)
column coverage
unique_id 100.0%
primary 100.0%
study_type 100.0%
primary_sponsor 100.0%
registration_date 100.0%
results_submitted 100.0%
source 100.0%
start_date 99.1%
size 98.8%
status 93.7%
leading_institution 89.7%
end_date 86.2%
publication_found 80.2%
# clean up
rm(list_coverage)

At this point we still have several different wordings for the status of a trial:

knitr::kable(
  all_trials %>%
    group_by(source, status) %>%
    tally() %>%
    arrange(desc(n))
)
source status n
NCT Completed 3576
NCT Recruiting 1256
ICTRP Ongoing 840
DRKS NA 520
NCT Active, not recruiting 497
NCT Unknown status 483
NCT Terminated 425
ICTRP Completed, but no date 203
NCT Not yet recruiting 153
NCT Enrolling by invitation 77
NCT Withdrawn 73
ICTRP Suspended, withdrawn, not authorised or prohibited 60
NCT Suspended 15
NCT No longer available 7
NCT Available 6
NCT Approved for marketing 5
ICTRP No trial status on register 3
NCT Temporarily not available 1

We simplify these into the following groups:

simplified_status <- c(
  "ICTRP: Completed, but no date" = "Completed",
  "ICTRP: No trial status on register" = "Unknown",
  "ICTRP: Ongoing" = "Ongoing",
  "ICTRP: Suspended, withdrawn, not authorised or prohibited" = "Unknown",
  "NCT: Active, not recruiting" = "Ongoing",
  "NCT: Completed" = "Completed",
  "NCT: Enrolling by invitation" = "Ongoing",
  "NCT: Not yet recruiting" = "Ongoing",
  "NCT: Recruiting" = "Ongoing",
  "NCT: Suspended" = "Ongoing",
  "NCT: Terminated" = "Terminated",
  "NCT: Unknown status" = "Unknown",
  "NCT: Withdrawn" = "Terminated",
  "NCT: Available" = "Ongoing",
  "NCT: Temporarily not available" = "Ongoing",
  "NCT: No longer available" = "Completed",
  "NCT: Approved for marketing" = "Completed",
  "DRKS: NA" = "Unknown"
)

knitr::kable(
  simplified_status
)
x
ICTRP: Completed, but no date Completed
ICTRP: No trial status on register Unknown
ICTRP: Ongoing Ongoing
ICTRP: Suspended, withdrawn, not authorised or prohibited Unknown
NCT: Active, not recruiting Ongoing
NCT: Completed Completed
NCT: Enrolling by invitation Ongoing
NCT: Not yet recruiting Ongoing
NCT: Recruiting Ongoing
NCT: Suspended Ongoing
NCT: Terminated Terminated
NCT: Unknown status Unknown
NCT: Withdrawn Terminated
NCT: Available Ongoing
NCT: Temporarily not available Ongoing
NCT: No longer available Completed
NCT: Approved for marketing Completed
DRKS: NA Unknown
# simplify status
if (nrow(all_trials %>% filter(status == "Recruiting")) > 0) {
  all_trials %<>% mutate(
    status = glue("{source}: {status}"),
    status = str_replace_all(status, simplified_status),
    status = factor(status)
  )
}

rm(simplified_status)

Transforming “suspended” into “ongoing” is not very precise, but we are mainly interested in trials that are completed.

Quick visual analysis

ggplot(
  data = all_trials %>%
    distinct(unique_id, .keep_all = TRUE),
  aes(
    x = start_date,
    color = source
  )
) +
  geom_freqpoly(bins = 50) +
  # we have some weird dates back in 1900-01-01 (obvious errors)
  xlim(as.Date("1990-01-01"), as.Date("2020-01-01")) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "When were the trials started?",
    x = "Start date",
    y = "Number of studies"
  )
## Warning: Removed 62 rows containing non-finite values (stat_bin).
## Warning: Removed 9 rows containing missing values (geom_path).

ggplot(
  all_trials %>%
    group_by(source, status) %>%
    tally(),
  aes(
    x = source,
    fill = status,
    y = n
  )
) +
  geom_bar(stat = "identity") +
  scale_fill_brewer(palette = "Set1") +
  theme_minimal() +
  labs(
    x = "Registry",
    y = "Number of trials",
    fill = "Trial Status",
    title = "How many trials have we per status and registry?"
  )

Publications

Cochrane

Read in the txt files exported by the Cochrane website and keep all entries with a value in the fields publication type PT, Embase ID XR, PubMed ID PM or Digital Object Identifier DOI. Then apply a regex search to look for NCT, EUCTR or DRKS IDs. If an ID is found, the corresponding trial is treated as publication_found.

The txt files are not included in this git repo. Read the chapter data sources and download your own version of the files if you wish to reproduce the report, or work with the provided rdata file.
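
As a quick illustration of the ID extraction (a toy abstract string; the actual patterns are defined in the chunk below):

library(stringr)

demo_abstract <- "Trial registration: NCT01234567; EudraCT 2010-123456-42."
str_extract_all(
  demo_abstract,
  regex(
    "NCT\\d{8}|\\d{4}\\s?\\-\\s?\\d{6}\\s?\\-\\s?\\d{2}|DRKS\\d{8}",
    ignore_case = TRUE
  )
)
# -> "NCT01234567" "2010-123456-42"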

# define regex patterns for different register ids
ids_regex_patterns <- c(
  "NCT" = "NCT\\d{8}",
  "EUCTR" = "\\d{4}\\s?\\-\\s?\\d{6}\\s?\\-\\s?\\d{2}",
  "DRKS" = "DRKS\\d{8}"
)

# load data from rdata file if present
if (file.exists("rdata/ids_in_cochrane.rdata")) {
  load(file = "rdata/ids_in_cochrane.rdata")
} else {
  # Recalculate.
  # Define files that need to be parsed
  cochrane_files <- list.files("input/ignore/cochrane", full.names = TRUE)

  # paste them together with three newlines so all citations get split correctly
  cochrane_concatenated <- cochrane_files %>%
    map(~ read_file(.)) %>%
    reduce(paste0, collapse = "\n\n\n")

  # then split again by three newlines
  cochrane_chunks <- str_split(cochrane_concatenated, "\n\n\n")

  # by now the variable is a list with one child, a char vector
  # with as.list/unlist we can untangle that and get a regular list
  cochrane_chunks <- as.list(unlist(cochrane_chunks))

  # map_over_chunks_in_batch gets called with a batch,
  # but what we need to analyse is one chunk,
  # meaning one citation consisting of a Record iterator and some lines
  map_over_chunks_in_batch <- function(batch) {
    # that's why we split it up a second time and parallelize it
    sub_batch_size <- 2000
    sub_split_factor <- as.factor(
      rep_len(
        1:ceiling(length(batch) / sub_batch_size),
        length.out = length(batch)
      )
    )
    foreach(
      chunk = split(batch, sub_split_factor),
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      get_dataframe_from_chunk(chunk)
  }

  get_dataframe_from_chunk <- function(chunks) {
    # return data frame of id, key, value triples
    chunks %>%
      map_df(function(chunk) {
        # extract id from top of chunk
        id_total <- str_match(chunk, "^Record #(\\d*) of (\\d*)")
        id <- id_total[1, 2]
        total <- id_total[1, 3]
        unlist(str_split(chunk, "\n")) %>%
          map_df(function(line) {
            if (str_detect(line, "^(\\w{2,3}): ")) {
              key_value <- str_match(line, "^(\\w{2,3}): (.*)")
              key <- key_value[1, 2]
              value <- key_value[1, 3]
              # append to a tidy (narrow) data frame of id, key, value
              data.frame(
                # make a new id by concatenating iterator and total rows
                "id" = paste0(id, "-", total),
                "key" = key,
                "value" = value,
                stringsAsFactors = FALSE
              )
            }
          })
      })
  }

  # map over all lines of the textfile and bind results as data frame
  batch_size <- 20000
  split_factor <- as.factor(
    rep_len(
      1:ceiling(length(cochrane_chunks) / batch_size),
      length.out = length(cochrane_chunks)
    )
  )

  glue("Now extracting ids from Cochrane Library Export")
  duration <- system.time(
    cochrane_publications <- foreach(
      batch = split(cochrane_chunks, split_factor),
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      map_over_chunks_in_batch(batch)
  )
  glue("Extraction took {round(duration[3] / 60, 1)} minutes.")

  # clean up
  rm(
    cochrane_files,
    cochrane_chunks,
    cochrane_concatenated,
    batch_size,
    split_factor,
    duration,
    get_dataframe_from_chunk,
    map_over_chunks_in_batch
  )
  # the resulting df has about 2.3 million rows and 3 columns
  # extract publication identifiers for later joining
  publication_info <- cochrane_publications %>%
    filter(str_detect(key, "PT|DOI|PM|XR")) %>%
    distinct(id, key, .keep_all = TRUE) %>%
    spread(key, value) %>%
    rename(export_id = id)

  # make new column with all ids found in Abstracts (AB)
  ids_in_cochrane <- cochrane_publications %>%
    filter(key == "AB") %>%
    mutate(
      # look for all ids by concatenating regex patterns above
      ids = str_extract_all(
        value,
        regex(
          ids_regex_patterns %>%
            paste(collapse = "|"),
          ignore_case = TRUE
        )
      )
    ) %>%
    select(-value) %>%
    unnest(ids) %>%
    rename(export_id = id) %>%
    select(export_id, id = ids) %>%
    # remove unnecessary stuff
    mutate(
      id = toupper(id),
      id = str_replace_all(id, "NUMBER:? ?|NO:? ?", ""),
      id = str_replace_all(id, "\\s", ""),
      # get simple EUCTR ID out of the id if present
      id = ifelse(
        str_detect(id, ids_regex_patterns[["EUCTR"]]),
        str_extract(id, ids_regex_patterns[["EUCTR"]]),
        id
      )
    ) %>%
    left_join(
      publication_info, by = "export_id"
    ) %>%
    # convert pubmed/embase ids to ints for better compatibility later
    mutate(
      PM = as.numeric(
        str_replace_all(PM, "PUBMED ", "")
      ),
      XR = as.numeric(
        str_replace_all(XR, "EMBASE ", "")
      )
    )

  # clean up
  rm(publication_info)

  # save to rdata file
  save(ids_in_cochrane, file = "rdata/ids_in_cochrane.rdata")

  # clean up
  rm(cochrane_publications)
}

With this method we find 27279 different trial IDs. This number is so high because the export is not limited to trials conducted in Switzerland.

Pubmed

The pubmed downloads are the only input files used in this report that cannot be found in the git repository, because the xml file generated by the pubmed website is over 1 GB in size. Read the chapter data sources and download your own version of the file if you wish to reproduce the report, or work with the provided rdata file.

In pubmed, we are interested in the fields Publisher Item Identifier PII and Digital Object Identifier DOI. We look in the xml node DataBank for linked IDs and search the abstract with a regex, looking for trial IDs. If we find a trial ID (in the DataBank node or in the abstract) in an article that has a PII or DOI, we treat the trial as publication_found.

# load data from rdata file if present
if (file.exists("rdata/ids_in_pubmed.rdata")) {
  load(file = "rdata/ids_in_pubmed.rdata")
} else {
  pubmed_articles_linked <- read_xml(
    "input/ignore/pubmed/pubmed_result_nct_linked.xml"
  ) %>%
    xml_find_all("//PubmedArticle")

  pubmed_articles_others <- read_xml(
    "input/ignore/pubmed/pubmed_result_all_others.xml"
  ) %>%
    xml_find_all("//PubmedArticle")

  # read in parallel
  batch_size <- 18000
  split_factor <- as.factor(
    rep_len(
      1:ceiling(length(pubmed_articles_linked) / batch_size),
      length.out = length(pubmed_articles_linked)
    )
  )
  glue("Now extracting ids from Pubmed Linked Export")
  duration <- system.time(
    ids_in_pubmed_linked <- foreach(
      node = split(pubmed_articles_linked, split_factor),
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      data.frame(
        "PM" =
          xml_integer(xml_find_first(node, ".//PMID")),
        "PII" =
          xml_text(xml_find_first(node, ".//ELocationID[@EIdType='pii']")),
        "DOI" =
          xml_text(xml_find_first(node, ".//ELocationID[@EIdType='doi']")),
        "databank" =
          xml_text(xml_find_first(node, ".//DataBank")),
        "abstract" =
          xml_text(xml_find_first(node, ".//Abstract")),
        stringsAsFactors = FALSE
      )
  )
  glue("Extraction took {round(duration[3] / 60, 1)} minutes.")

  # repeat for others (unfortunately two xml files cannot be joined easily)
  # so we copy the children of the root node from all files into one
  batch_size <- 2500
  split_factor <- as.factor(
    rep_len(
      1:ceiling(length(pubmed_articles_others) / batch_size),
      length.out = length(pubmed_articles_others)
    )
  )
  glue("Now extracting ids from Pubmed Searched Export")
  duration <- system.time(
    ids_in_pubmed_others <- foreach(
      node = split(pubmed_articles_others, split_factor),
      .inorder = FALSE,
      .combine = rbind
    ) %dopar%
      data.frame(
        "PM" =
          xml_integer(xml_find_first(node, ".//PMID")),
        "PII" =
          xml_text(xml_find_first(node, ".//ELocationID[@EIdType='pii']")),
        "DOI" =
          xml_text(xml_find_first(node, ".//ELocationID[@EIdType='doi']")),
        "databank" =
          xml_text(xml_find_first(node, ".//DataBank")),
        "abstract" =
          xml_text(xml_find_first(node, ".//Abstract")),
        stringsAsFactors = FALSE
      )
  )
  glue("Extraction took {duration[3]} minutes.")

  # bind together
  ids_in_pubmed <- bind_rows(
    ids_in_pubmed_linked,
    ids_in_pubmed_others
  )

  ids_in_pubmed %<>%
    # look for all ids by concatenating regex patterns above
    mutate(
      abstract = str_extract_all(
        abstract,
        regex(
          ids_regex_patterns %>%
            paste(collapse = "|"),
          ignore_case = TRUE
        )
      )
    ) %>%
    # extract all major ids from databank node
    mutate(nct_id = str_extract_all(
      databank,
      ids_regex_patterns[["NCT"]]
    )) %>%
    mutate(euctr_id = str_extract_all(
      databank,
      ids_regex_patterns[["EUCTR"]]
    )) %>%
    mutate(drks_id = str_extract_all(
      databank,
      ids_regex_patterns[["DRKS"]]
    )) %>%
    select(-databank) %>%
    # split into rows
    gather("register", "ids", abstract:drks_id) %>%
    # unnest those with multiple ids in one row
    unnest(ids) %>%
    rename(id = ids) %>%
    # filter out empty ones
    filter(!is.na(id)) %>%
    # filter out those without any publication id
    filter(!is.na(PII) | !is.na(DOI))

  # clean up
  rm(
    pubmed_articles_linked,
    pubmed_articles_others,
    ids_in_pubmed_linked,
    ids_in_pubmed_others,
    batch_size,
    split_factor,
    duration
  )

  # save to rdata file
  save(ids_in_pubmed, file = "rdata/ids_in_pubmed.rdata")
}

With this method we find 49009 different trial IDs. This number is so high because the export is not limited to trials conducted in Switzerland.

Combine

all_publications <- bind_rows(
  ids_in_cochrane %>%
    # keep only one row per trial id
    distinct(id, .keep_all = TRUE) %>%
    mutate(
      source = "Cochrane"
    ) %>%
    select(
      trial_id = id,
      DOI,
      pubmed_id = PM,
      embase_id = XR,
      type = PT,
      source
    ),
  ids_in_pubmed %>%
    # keep only one row per trial id
    distinct(id, .keep_all = TRUE) %>%
    mutate(
      source = "Pubmed"
    ) %>%
    select(
      trial_id = id,
      DOI,
      pubmed_id = PM,
      PII,
      source
    )
)

# clean up
rm(
  ids_in_cochrane,
  ids_in_pubmed
)

overlap <- all_publications %>%
  group_by(trial_id) %>%
  summarise(n = n()) %>%
  filter(n > 1)

If we combine the two publication data sources, we have a total of 53303 primary IDs for which we know results were published. There is a pretty big overlap: 46.9% of the trial IDs we find in Pubmed are also in Cochrane. That makes sense, as the Cochrane Library also indexes abstracts from Pubmed.
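
As a hedged sketch (not part of the original pipeline), the overlap share can be derived from all_publications:

# fraction of trial ids found in Pubmed that also appear in Cochrane
all_publications %>%
  group_by(trial_id) %>%
  summarise(
    in_pubmed = any(source == "Pubmed"),
    in_cochrane = any(source == "Cochrane")
  ) %>%
  filter(in_pubmed) %>%
  summarise(share_also_in_cochrane = mean(in_cochrane))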

# clean up
rm(overlap)

Matching

Match trial IDs in publications

Now we go through all trials from above and check whether the ID was found in the two publication sources.

trials_with_publications_found <- all_trials %>%
  left_join(
    all_publications %>%
      select(trial_id, publication = source),
    by = c("primary" = "trial_id")
  ) %>%
  group_by(unique_id) %>%
  # if at least one of the rows per unique id has a source, the trial has
  # a corresponding publication
  summarise(publication_found = sum(!is.na(publication)) > 0) %>%
  # now we join the linked publications from the registers
  left_join(
    all_trials %>%
      select(unique_id, results_submitted),
    by = "unique_id"
  ) %>%
  # add booleans for those that already had a publication linked
  # and one for those that we were able to find in the publications
  # and one for both to see how big the overlap is
  mutate(
    both = (results_submitted & publication_found),
    either = results_submitted | publication_found,
    none = (!results_submitted & !publication_found)
  )

knitr::kable(
  results <- trials_with_publications_found %>%
    select(-unique_id) %>%
    summarize_all(funs(sum(.) / length(.))) %>%
    mutate_all(percent)
)
publication_found results_submitted both either none
30.0% 24.1% 11.0% 43.1% 56.9%

Looks like we found publications for another 30.0% of our trials! So far we can say that 43.1% of the 7506 Swiss trials have led to some form of publication, while for the remaining 56.9% we found no evidence of one.

all_trials %<>%
  # replace NAs with FALSE
  mutate(publication_found = ifelse(
      is.na(publication_found),
      FALSE,
      publication_found
  )) %>%
  left_join(
    trials_with_publications_found %>%
      select(unique_id, publication_found_2 = publication_found),
    by = "unique_id"
  ) %>%
  # join the newly calculated field publication_found as _2, because a
  # publication_found column was already present in the nct data. We combine
  # the two with an OR condition and delete the second one
  mutate(
    publication_found = publication_found | publication_found_2
  ) %>%
  select(-publication_found_2)

# clean up
rm(results, trials_with_publications_found)

Select relevant trials

Until now we did not distinguish between finished and unfinished trials. Nor did we allow for any additional time period after the finish dates during which sponsors are able to write a publication. Let's address both now.

Select Timespan

We define a time span of ten years to focus on: the years 2007 to 2016. As it is now February 2019, sponsors had at least two years to upload the results of a trial after they labeled it as finished.

The provided end date of a trial must lie between January 1st, 2007 and December 31st, 2016.

relevant_trials <- all_trials %>%
  filter(
    end_date >= as.Date("2007-01-01") &
    end_date <= as.Date("2016-12-31")
  ) %>%
  distinct(unique_id, .keep_all = TRUE)

Select Type

Use the following chunk if you wish to analyze interventional studies only.

# note: this chunk is currently NOT evaluated
relevant_trials %<>% filter(study_type == "interventional")

Of the originally 7506 trials, we are left with 3612 that meet our criteria.

knitr::kable(
  all_trials %>%
    distinct(unique_id, .keep_all = TRUE) %>%
    group_by(source) %>%
    tally() %>%
    rename(before = n) %>%
    left_join(
      relevant_trials %>%
        group_by(source) %>%
        tally() %>%
        rename(after = n),
      by = "source"
    )
)
source before after
DRKS 330 59
ICTRP 602 237
NCT 6574 3316

Google Scholar

We realized that there are too many trial publications that do not contain the trial ID in their abstracts and are thus not found via the methodology above. For that reason we additionally scrape Google Scholar search results to get a more complete picture.

Scraper

Google Scholar only lets us scrape about 30 primary IDs before the IP gets blocked. For this reason the next code chunk is normally not executed (as long as the google_scholar.rdata file is present). To scrape the results again, you need to be on a computer that regularly gets a new IP. We did this once on January 15th, 2019 and saved the results to the google_scholar.rdata file.

if (file.exists("rdata/google_scholar.rdata")) {
  # if rdata file is present, do not scrape again!
  load("rdata/google_scholar.rdata")
} else if (!exists("google_scholar")) {
  # initialize new empty data frame to fill
  google_scholar <- data.frame()

  # create safe walk function that does not break on errors
  safely_walk <- safely(walk)

  # redirect prints to file
  sink("scrape.log", append = TRUE)

  # repeat as long as there are IDs not yet scraped
  while (relevant_trials %>%
      distinct(primary) %>%
      filter(!primary %in% google_scholar$primary) %>%
      nrow() > 0) {
    # safely walk over relevant trials
    relevant_trials %>%
      distinct(primary) %>%
      # but only iterate over those that haven't been scraped
      filter(!primary %in% google_scholar$primary) %>%
      t() %>%
      as.list() %>%
      safely_walk(function(primary_id) {
        # logging
        print(paste(Sys.time(), "try to scrape", primary_id))
        error <- FALSE

        # for euctr ids, drop the prefix and the country code
        # (str_detect returns the logical that ifelse needs; str_match
        # would return a character matrix)
        search_query <- ifelse(
          str_detect(primary_id, "\\d{4}-\\d{6}-\\d{2}"),
          str_extract(primary_id, "\\d{4}-\\d{6}-\\d{2}"),
          primary_id # leave the others as they are
        )

        # search for primary ID on Google Scholar and save server response
        page <- tryCatch({
          # read_html sometimes gets stuck: it does not throw an error but
          # neither does it continue, so we cancel it after 15 seconds and
          # throw an error ourselves
          withTimeout({
            read_html(
              paste0("https://scholar.google.ch/scholar?q=", search_query),
              verbose = TRUE)
          },
            timeout = 15,
            onTimeout = "error"
          )
        },
        warning = function(warning_message) {
          print(paste(Sys.time(), warning_message))
        },
        error = function(error_message) {
          print(paste(
            Sys.time(),
            "error scraping at",
            primary_id,
            ", error:", error_message
          ))
          # flag the error so the response is not processed below
          error <<- TRUE
        })

        # if return value is valid, add it to the data frame
        if (error == FALSE) {
          print(paste(Sys.time(), "successfully scraped", primary_id))
          # extract nodes for search results
          search_results <- page %>%
            html_nodes(".gs_r.gs_scl")

          # extract relevant information of each node and return as data frame
          result <- search_results %>%
            map_df(function(node) {
              data.frame(
                "primary" = primary_id,
                "scraped_at" = Sys.time(),
                "title" =
                  node %>% html_node("h3.gs_rt") %>% html_text(),
                "url" =
                  node %>% html_node("h3.gs_rt a") %>% html_attr("href"),
                "source" =
                  node %>% html_node(".gs_ri .gs_a") %>% html_text(),
                "text" =
                  node %>% html_node(".gs_ri .gs_rs") %>% html_text(),
                "additional_info" =
                  node %>% html_node(".gs_ri .gs_fl") %>% html_text(),
                stringsAsFactors = FALSE
              )
            })

          # append number of pages found and return
          result %<>%
            mutate(
              pages = page %>%
                html_nodes(".gs_ico_nav_page") %>%
                length() + 1
            )

          # for no results: append NA to be sure the ID is not scraped again
          if (nrow(result) == 0) {
            result <- data.frame(
              "primary" = primary_id,
              "scraped_at" = Sys.time(),
              "pages" = 0,
              stringsAsFactors = FALSE
            )
          }

          # append to global variable defined above
          google_scholar <<- google_scholar %>%
            bind_rows(result)

          # export as rdata after every ID to make sure nothing gets lost
          # (file name must match the one checked with file.exists above)
          save(google_scholar, file = "rdata/google_scholar.rdata")
        }

        # wait for a random number of seconds (between 5 and 10)
        Sys.sleep(runif(1, 5.0, 10.0))
    })
  }

  # finished scraping
  print(paste(Sys.time(), "scraping finished"))

  # end output redirection
  sink(NULL)
}
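
As a quick illustration of the EUCTR normalization inside the scraper (the ID below is made up): the registers store EUCTR IDs with a prefix and a country code, while publications usually cite only the core number, which is what we search for:

str_extract("EUCTR2004-123456-12/CH", "\\d{4}-\\d{6}-\\d{2}")
## [1] "2004-123456-12"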

Now we join in the data scraped from Google Scholar. If any results were found, we label the trial with has_g_scholar_results. When we do this, we get the following eight groups:

# add column to indicate number of pages found on google scholar
relevant_trials %<>%
  left_join(
    google_scholar %>%
      filter(!is.na(pages)) %>%
      distinct(primary, .keep_all = TRUE) %>%
      mutate(has_g_scholar_results = !(pages == 0 | is.na(pages))) %>%
      select(primary, has_g_scholar_results),
    by = "primary"
  ) %>%
  # temporarily change NAs to FALSE but scrape these IDs again later!
  mutate(has_g_scholar_results = ifelse(
    is.na(has_g_scholar_results),
    FALSE,
    has_g_scholar_results
  ))

# filter out some groups to get a better understanding of how well the google
# scholar search worked
knitr::kable(
  relevant_trials %>%
    group_by(has_g_scholar_results, publication_found, results_submitted) %>%
    tally() %>%
    arrange(has_g_scholar_results, publication_found, results_submitted) %>%
    bind_cols(
      # descriptions must follow the arrange() order above, i.e. FALSE
      # before TRUE for each of the three grouping variables
      data.frame(description = c(
        "No results at all",
        "No scholar search results, but results uploaded to registry",
        "No scholar search results, but results found in pubmed/cochrane",
        "No scholar search results, but results uploaded to registry \\
        plus found in pubmed/cochrane",
        "Has scholar search results, but neither results uploaded to \\
        registry nor found in pubmed/cochrane",
        "Has scholar search results and results uploaded to registry, \\
        but was not found in pubmed/cochrane",
        "Has scholar search results and was found in pubmed/cochrane, \\
        but no results were uploaded to registry",
        "All conditions met!"
      ))
    ) %>%
    select(count = n, description)
)
## Adding missing grouping variables: `has_g_scholar_results`, `publication_found`
has_g_scholar_results publication_found count description
FALSE FALSE 728 No results at all
FALSE FALSE 188 No scholar search results, but results uploaded to registry
FALSE TRUE 119 No scholar search results, but results found in pubmed/cochrane
FALSE TRUE 66 No scholar search results, but results uploaded to registry plus found in pubmed/cochrane
TRUE FALSE 668 Has scholar search results, but neither results uploaded to registry nor found in pubmed/cochrane
TRUE FALSE 456 Has scholar search results and results uploaded to registry, but was not found in pubmed/cochrane
TRUE TRUE 857 Has scholar search results and was found in pubmed/cochrane, but no results were uploaded to registry
TRUE TRUE 530 All conditions met!

Name recognition

We identify important institutions by searching for the company name or institution name in both the column of the leading institution as well as the column of the primary sponsor. Later we only work with the primary sponsor to calculate publication rates, but in this step we also want to have a unified version of the names of the leading institutions (at least those that are based in Switzerland).

relevant_trials %<>%
  mutate(
    # we have some ugly html line breaks in the affiliations, let's remove them
    primary_sponsor =
      str_replace_all(primary_sponsor, "<br ?\\/?>", " "),
    leading_institution =
      str_replace_all(leading_institution, "<br ?\\/?>", " ")
  )

# combine the two columns containing affiliation information (without the NAs)
all_affiliations <- data.frame(affiliation = c(
  relevant_trials$primary_sponsor,
  relevant_trials$leading_institution
), stringsAsFactors = FALSE) %>%
  # remove NAs
  filter(!is.na(affiliation))
# create short helper function to reduce repetition (e.g. ignore_case)
my_detect <- function(string, pattern) {
  str_detect(string, regex(pattern, ignore_case = TRUE))
}

# nolint start
all_affiliations %<>%
  mutate(
    affiliation = as.character(affiliation),
    simplified_name = case_when(
      # private companies
      my_detect(affiliation, "Abbott") ~ "Abbott",
      my_detect(affiliation, "AbbVie") ~ "AbbVie",
      my_detect(affiliation, "Ablynx") ~ "Ablynx",
      my_detect(affiliation, "Actelion") ~ "Actelion",
      my_detect(affiliation, "Allergan") ~ "Allergan",
      my_detect(affiliation, "Amgen") ~ "Amgen",
      my_detect(affiliation, "Ardea Biosciences, Inc.") ~ "Ardea Biosciences",
      my_detect(affiliation, "Array BioPharma") ~ "Array BioPharma",
      my_detect(affiliation, "Astellas") ~ "Astellas",
      my_detect(affiliation, "Astra\\s*Zeneca") ~ "AstraZeneca",
      my_detect(affiliation, "Bayer") ~ "Bayer",
      my_detect(affiliation, "Biogen") ~ "Biogen",
      my_detect(affiliation, "Biotronik") ~ "Biotronik",
      my_detect(affiliation, "Bioverativ") ~ "Bioverativ",
      my_detect(affiliation, "Boehringer[\\- ]*Ingelheim") ~ "Boehringer Ingelheim",
      my_detect(affiliation, "Bristol[\\- ]*Myers") ~ "Bristol-Myers Squibb",
      my_detect(affiliation, "Celgene") ~ "Celgene",
      my_detect(affiliation, "CSL Behring") ~ "CSL Behring",
      my_detect(affiliation, "Cytos Biotechnology") ~ "Cytos Biotechnology",
      my_detect(affiliation, "Dr. Falk Pharma") ~ "Dr. Falk Pharma",
      my_detect(affiliation, "Daiichi Sank[iy]o") ~ "Daiichi Sankyo",
      my_detect(affiliation, "GlaxoSmithKline") ~ "GlaxoSmithKline",
      my_detect(affiliation, "Geistlich Pharma") ~ "Geistlich Pharma",
      my_detect(affiliation, "Genentech") ~ "Genentech",
      my_detect(affiliation, "Gilead Sciences") ~ "Gilead Sciences",
      my_detect(affiliation, "IBSA") ~ "IBSA",
      my_detect(affiliation, "Incyte") ~ "Incyte",
      my_detect(affiliation, "Janssen") ~ "Janssen",
      my_detect(affiliation, "Johnson") ~ "Johnson & Johnson",
      my_detect(affiliation, "Lilly") ~ "Lilly",
      my_detect(affiliation, "Medtronic") ~ "Medtronic",
      my_detect(affiliation, "Merck|MSD|Serono") ~ "Merck",
      my_detect(affiliation, "Novartis") ~ "Novartis",
      my_detect(affiliation, "Novo[\\- ]*Nordisk") ~ "Novo Nordisk",
      my_detect(affiliation, "Nestl(é|e)") ~ "Nestlé",
      my_detect(affiliation, "Pfizer") ~ "Pfizer",
      my_detect(affiliation, "PIQUR Therapeutics") ~ "PIQUR Therapeutics",
      my_detect(affiliation, "Roche(?!ster)") ~ "Hoffmann-La Roche",
      my_detect(affiliation, "Sanofi") ~ "Sanofi",
      my_detect(affiliation, "Laboratorios Servier") ~ "Servier",
      my_detect(affiliation, "Institut de Recherches Internationales Servier") ~ "Servier",
      my_detect(affiliation, "Shire") ~ "Shire",
      my_detect(affiliation, "Sonova") ~ "Sonova",
      my_detect(affiliation, "Spirig Pharma") ~ "Spirig Pharma",
      my_detect(affiliation, "Takeda|Millennium") ~ "Takeda",
      my_detect(affiliation, "\\+1 877 822 9493") ~ "UCB Pharma",
      my_detect(affiliation, "UCB Pharma") ~ "UCB Pharma",
      my_detect(affiliation, "Vertex Pharmaceuticals") ~ "Vertex Pharmaceuticals",
      my_detect(affiliation, "ViiV Healthcare") ~ "ViiV Healthcare",
      # university hospitals
      my_detect(affiliation, "Universit.*spital.*Z(u|ue|ü)rich") ~ "Unispital Zürich",
      my_detect(affiliation, "Z(u|ue|ü)rich.*Universit.*spital") ~ "Unispital Zürich",
      my_detect(affiliation, "Univer?s?i?t?.*spital.*Basel") ~ "Unispital Basel",
      my_detect(affiliation, "Basel.*Universit?.*spital") ~ "Unispital Basel",
      my_detect(affiliation, "Basel Women.*spital|Frauenspital, Basel") ~ "Unispital Basel",
      my_detect(affiliation, "Universit.*spital.*(Gen[eè]v|Genf)") ~ "Unispital Genf",
      my_detect(affiliation, "[Hôos]*pit.*univers.*(Gen[eè]v|Genf)") ~ "Unispital Genf",
      my_detect(affiliation, "(Gen[eè]v|Genf).*Universit.*spital") ~ "Unispital Genf",
      my_detect(affiliation, "Lausanne University Hospital") ~ "Unispital Lausanne",
      my_detect(affiliation, "(CTU|University Hospital).*Lausanne") ~ "Unispital Lausanne",
      my_detect(affiliation, "CHUV|spital.*Universit.*Vaudois|Lausanne Hospitals?") ~ "Unispital Lausanne",
      my_detect(affiliation, "Insel[ ho]*spital") ~ "Inselspital Bern",
      my_detect(affiliation, "Bern.*Universiti?y Hosp") ~ "Inselspital Bern",
      my_detect(affiliation, "University Hosp.*Bern") ~ "Inselspital Bern",
      my_detect(affiliation, "Balgrist") ~ "Unispital Balgrist",
      # universities
      my_detect(affiliation, "Universit[äty of]*Z(u|ue|ü)rich") ~ "Universität Zürich",
      my_detect(affiliation, "Z(u|ue|ü)ri?ch.*Universit[äty]") ~ "Universität Zürich",
      my_detect(affiliation, "Universit[äty of]*(Geneva|Genf)") ~ "Universität Genf",
      my_detect(affiliation, "Universit[äty of]*Berne?") ~ "Universität Bern",
      my_detect(affiliation, "Universit?[äty of]*Basel") ~ "Universität Basel",
      my_detect(affiliation, "Universit[äty of]*Lausanne") ~ "Universität Lausanne",
      my_detect(affiliation, "Universit[äty of]*St\\. ?Gallen?") ~ "Universität St.Gallen",
      my_detect(affiliation, "ETH|Eidgenössische Technische Hochschule") ~ "ETH Zürich",
      my_detect(affiliation, "Swiss Federal Institute of Technology") ~ "ETH Zürich",
      my_detect(affiliation, "EPFL|Ecole Polytechnique Fédérale") ~ "EPF Lausanne",
      my_detect(affiliation, "ZHAW|Zurich.*Applied Sciences?") ~ "ZHAW",
      my_detect(affiliation, "Zürcher Hochschule für Angewandte Wissenschaften") ~ "ZHAW",
      my_detect(affiliation, "(FHS|Fachhochschule) S.*Gallen") ~ "FHS St.Gallen",
      my_detect(affiliation, "Gallen University of Applied Sciences?") ~ "FHS St.Gallen",
      my_detect(affiliation, "Berner Fachhochschule|Bern University of Applied Sciences?") ~ "BFH Bern",
      my_detect(affiliation, "(HES|University of Applied Sciences).*(Valais|Wallis)") ~ "HEVS Hochschule Wallis",
      my_detect(affiliation, "University of Applied Sciences of Western Switzerland") ~ "Fachhochschule Westschweiz",
      # other hospitals
      my_detect(affiliation, "(C|K)anton.*spital.*Frauenfeld") ~ "Kantonsspital Frauenfeld",
      my_detect(affiliation, "(C|K)anton.*spital.*S.*Gallen") ~ "Kantonsspital St.Gallen",
      my_detect(affiliation, "(Kantonsspital|LUKS) Luzern") ~ "Kantonsspital Luzern",
      my_detect(affiliation, "Luzerner Kantonsspital") ~ "Kantonsspital Luzern",
      my_detect(affiliation, "(C|K)anton.*spital.*Aarau") ~ "Kantonsspital Aarau",
      my_detect(affiliation, "(C|K)anton.*spi?tal.*(Baselland|Liestal)") ~ "Kantonsspital Baselland",
      my_detect(affiliation, "Kantonsspital Winterthur") ~ "Kantonsspital Winterthur",
      my_detect(affiliation, "Kantonsspital Baden") ~ "Kantonsspital Baden",
      my_detect(affiliation, "Kantonsspital Graub.*") ~ "Kantonsspital Graubünden",
      my_detect(affiliation, "Osped.*Bellinzona") ~ "Kantonsspital Bellinzona",
      my_detect(affiliation, "Osped.*Locarno") ~ "Regionalspital Locarno",
      my_detect(affiliation, "[Hôos]*pital.*(Valais|Wallis)") ~ "Regionalspital Locarno",
      my_detect(affiliation, "Paraplegi(c|ker) ?(C|Z)ent[erum].*") ~ "Paraplegikerzentrum Nottwil",
      my_detect(affiliation, "Schulthess Klinik") ~ "Schulthess Klinik",
      my_detect(affiliation, "Claraspital") ~ "St.Claraspital",
      my_detect(affiliation, "Hirslanden") ~ "Hirslanden",
      my_detect(affiliation, "St.? ?Anna.*Lu[cz]ern") ~ "Hirslanden",
      my_detect(affiliation, "Triemli") ~ "Triemli Spital",
      # others
      my_detect(affiliation, "Swiss Group for Clinical Cancer Research") ~ "SAKK",
      my_detect(affiliation, "SNSF|SNF|Swiss National Science Foundation") ~ "Schweizerischer Nationalfonds",
      my_detect(affiliation, "Children's Oncology Group") ~ "Children's Oncology Group",
      my_detect(affiliation, "European Organisation for Research and Treatment of Cancer") ~ "EORTC",
      my_detect(affiliation, "Lymphoma Study Association") ~ "LYSA",
      my_detect(affiliation, "International Extranodal Lymphoma Study Group") ~ "IELSG",
      TRUE ~ ""
    )
  ) %>%
  # change the empty strings that were introduced above (because of
  # case_when type constraints) to NA
  mutate(
    simplified_name = ifelse(
      nzchar(simplified_name),
      simplified_name,
      NA
    )
  )
# nolint end
# see how many are missing
missings <- all_affiliations %>%
  filter(is.na(simplified_name) & nzchar(as.character(affiliation))) %>%
  group_by(affiliation) %>%
  tally() %>%
  arrange(desc(n))

knitr::kable(
  missings %>%
    top_n(50)
)

# clean up
rm(my_detect, missings)

We were able to assign 65.8% of the affiliations. That does not sound like a whole lot, but when we inspect the biggest missing ones, we can see that they are not based in Switzerland.
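
For reference, a minimal sketch of how this share can be recomputed (taken over all affiliation rows, before deduplication):

# share of affiliation strings that received a simplified name
all_affiliations %>%
  summarise(share_assigned = mean(!is.na(simplified_name)))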

Visual Analysis

Publication Rates

We differentiate between private and public institutions. We did the classification by hand:

# read in manual classification
public_or_private <- read_csv("input/public_private.csv", col_types = cols())

knitr::kable(
  public_or_private
)
simple_name group
Abbott private
AbbVie private
Ablynx private
Actelion private
Allergan private
Amgen private
Ardea Biosciences private
Array BioPharma private
Astellas private
AstraZeneca private
Bayer private
Biogen private
Biotronik private
Bioverativ private
Boehringer Ingelheim private
Bristol-Myers Squibb private
Celgene private
CSL Behring private
Cytos Biotechnology private
Daiichi Sankyo private
Dr. Falk Pharma private
Geistlich Pharma private
Genentech private
Gilead Sciences private
GlaxoSmithKline private
Hoffmann-La Roche private
IBSA private
Incyte private
Janssen private
Johnson & Johnson private
Lilly private
Medtronic private
Merck private
Nestlé private
Novartis private
Novo Nordisk private
Pfizer private
PIQUR Therapeutics private
Sanofi private
Servier private
Shire private
Sonova private
Spirig Pharma private
Takeda private
UCB Pharma private
Vertex Pharmaceuticals private
ViiV Healthcare private
Paraplegikerzentrum Nottwil private
Schulthess Klinik private
St.Claraspital private
Hirslanden private
Unispital Zürich public
Unispital Basel public
Unispital Genf public
Unispital Lausanne public
Inselspital Bern public
Unispital Balgrist public
Universität Zürich public
Universität Genf public
Universität Bern public
Universität Basel public
Universität Lausanne public
Universität St.Gallen public
ETH Zürich public
EPF Lausanne public
ZHAW public
FHS St.Gallen public
BFH Bern public
HEVS Hochschule Wallis public
Fachhochschule Westschweiz public
Kantonsspital Frauenfeld public
Kantonsspital St.Gallen public
Kantonsspital Luzern public
Kantonsspital Aarau public
Kantonsspital Baselland public
Kantonsspital Winterthur public
Kantonsspital Baden public
Kantonsspital Graubünden public
Kantonsspital Bellinzona public
Regionalspital Locarno public
Triemli Spital public
Schweizerischer Nationalfonds public
SAKK public
Children’s Oncology Group consortium
EORTC consortium
LYSA consortium
IELSG consortium
# deduplicate for joining
all_affiliations %<>%
  distinct(affiliation, .keep_all = TRUE)

if (!"primary_sponsor_simple" %in% colnames(relevant_trials)) {
  # take trials and join simplified names
  relevant_trials %<>%
    # for primary sponsors
    left_join(
      all_affiliations %>%
        rename(primary_sponsor_simple = simplified_name),
      by = c("primary_sponsor" = "affiliation")
    ) %>%
    # and for leading institutions
    left_join(
      all_affiliations %>%
        rename(leading_institution_simple = simplified_name),
      by = c("leading_institution" = "affiliation")
    )
}

Now we calculate publication rates for different groups and companies. We introduce the new field has_publication, which is true when any of the three conditions (linked, found, g_scholar) is met. Then we visualize the different conditions: once for all the known affiliations (a bit chaotic) and once for only the major institutions:

# create function that takes the trials and groups them by affiliation
# and returns a summarized and sorted data frame
get_publication_rates <- function(data) {
  # as the primary sponsor is responsible for uploading / publishing the results
  # of a trial, we only look at the sponsors, not the leading institutions
  result <- bind_rows(
    data %>%
      select(
        affiliation = primary_sponsor_simple,
        results_submitted,
        publication_found,
        has_g_scholar_results,
        unique_id
      )
  ) %>%
    filter(!is.na(affiliation)) %>%
    mutate(
      submitted_or_found = results_submitted |
        publication_found,
      # add column: set to TRUE when any of the three conditions meet
      has_publication = submitted_or_found |
        has_g_scholar_results,
      found_or_g_scholar = publication_found |
        has_g_scholar_results
    ) %>%
    group_by(affiliation) %>%
    summarise(
      trials_in_total = n(),
      results_submitted = sum(results_submitted, na.rm = TRUE) /
        trials_in_total,
      publication_found = sum(publication_found, na.rm = TRUE) /
        trials_in_total,
      submitted_or_found = sum(submitted_or_found, na.rm = TRUE) /
        trials_in_total,
      has_g_scholar_results = sum(has_g_scholar_results, na.rm = TRUE) /
        trials_in_total,
      found_or_g_scholar = sum(found_or_g_scholar, na.rm = TRUE) /
        trials_in_total,
      has_publication = sum(has_publication, na.rm = TRUE) /
        trials_in_total
    ) %>%
    ungroup() %>%
    arrange(desc(trials_in_total))

  # convert affiliations to factors and sort by totals
  result %>%
    mutate(
      affiliation = factor(
        affiliation,
        levels = result$affiliation
      )
    )
}

# perform calculations
affiliation_publication_rates <- relevant_trials %>%
  filter(study_type == "interventional") %>%
  get_publication_rates() %>%
  # join definition of public institution vs private organisation
  left_join(
    public_or_private,
    by = c("affiliation" = "simple_name")
  )
## Warning: Column `affiliation`/`simple_name` joining factor and character
## vector, coercing into character vector
# create lookup strings for better readability in plot
labels <- c(
  "results_submitted" =
    "Results were uploaded to the clinical trial registry",
  "publication_found" =
    "The Trial ID was found in pubmed or cochrane",
  "submitted_or_found" =
    "Either of the two conditions above did meet",
  "has_g_scholar_results" =
    "A Google Scholar Search for the Trial ID had 1+ results",
  "found_or_g_scholar" =
    "It was found in pubmed or cochrane or via Google Scholar",
  "has_publication" =
    "Any of the three conditions did meet"
)

# plot all, then plot only those with more than 50 trials
list("all", "top_10") %>%
  walk(function(all_or_top_10) {
    # save plot in variable to generate two different facets
    g <- ggplot(
      affiliation_publication_rates %>%
        # filter to most important sponsors in second iteration
        filter(all_or_top_10 == "all" | trials_in_total > 50) %>%
        # convert columns to rows so we can display them in different facets
        gather("key", "value", results_submitted:has_publication) %>%
        mutate(
          key = str_replace_all(key, labels),
          # sort facets
          key = factor(key, levels = labels)
        ),
      aes(
        x = trials_in_total,
        y = value,
        color = value,
        label = affiliation
      )
    ) +
      geom_text_repel(force = 2) +
      geom_point() +
      scale_x_log10(
        limits = c(5, 500),
        breaks = c(5, 10, 20, 50, 100, 200, 500)
      ) +
      scale_y_continuous(
        limits = c(0, 1),
        labels = percent
      ) +
      scale_colour_gradient(low = "#B2182B", high = "#2166AC") +
      theme_minimal() +
      theme(legend.position = "none") +
      labs(
        title = "Relationship between number of publications and rates",
        subtitle = "For all trials finished between 2007 and 2016",
        x = "Total number of registered trials",
        y = "Publication rate"
      )

    # output
    print(g + facet_wrap(~ key, ncol = 1))

    # export
    ggsave(
      paste0("output/publication_rates_", all_or_top_10, ".pdf"),
      g + facet_wrap(~ key, ncol = 2),
      device = "pdf",
      units = "mm",
      width = 297,
      height = 420
    )

    rm(g)
  })
## Warning: Removed 102 rows containing missing values (geom_text_repel).
## Warning: Removed 102 rows containing missing values (geom_point).
## Warning: Removed 102 rows containing missing values (geom_text_repel).
## Warning: Removed 102 rows containing missing values (geom_point).

Numbers by group public / private

calculate_weighted_mean_and_total <- function(df, grouping_variable) {
  df %>%
    group_by_at(vars(grouping_variable)) %>%
    summarise_at(
      vars(results_submitted:has_publication),
      funs(weighted.mean(., trials_in_total))
    ) %>%
    # we lost trials_in_total as its own column, so join it back in
    left_join(
      df %>%
        group_by_at(vars(grouping_variable)) %>%
        summarise(trials_in_total = sum(trials_in_total)),
      by = grouping_variable
    )
}

# calculate share of each value by group: private or public
totals_by_group <- affiliation_publication_rates %>%
  calculateWeightedMeanAndTotal(grouping_variable = "group")

# calculate totals overall
totals <- totals_by_group %>%
  ungroup() %>%
  summarise_at(
    vars(results_submitted:has_publication),
    funs(weighted.mean(., trials_in_total))
  ) %>%
  cbind(
    trials_in_total = sum(totals_by_group$trials_in_total), group = "Total"
  )

knitr::kable(
  totals_by_group %>%
    bind_rows(totals) %>%
    mutate_at(vars(results_submitted:has_publication), percent)
)
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
group results_submitted publication_found submitted_or_found has_g_scholar_results found_or_g_scholar has_publication trials_in_total
consortium 50.7% 59.2% 88.7% 94.4% 95.8% 98.6% 71
private 68.8% 54.9% 83.8% 85.0% 90.8% 94.5% 1049
public 14.1% 46.7% 55.3% 64.5% 70.4% 75.5% 891
Total 44.0% 51.4% 71.4% 76.3% 81.9% 86.2% 2011
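
The weighting matters here: a plain mean would treat a sponsor with 5 trials the same as one with 200. A quick illustration with made-up numbers:

# two affiliations with rates 0.8 (100 trials) and 0.2 (50 trials)
weighted.mean(c(0.8, 0.2), c(100, 50))
## [1] 0.6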

Biggest institutions

top_10_publication_rates <- affiliation_publication_rates %>%
  calculateWeightedMeanAndTotal(grouping_variable = "affiliation") %>%
  filter(trials_in_total > 50) %>%
  arrange(desc(has_publication))

knitr::kable(
  top_10_publication_rates %>%
    mutate_at(vars(results_submitted:has_publication), percent)
)
affiliation results_submitted publication_found submitted_or_found has_g_scholar_results found_or_g_scholar has_publication trials_in_total
Pfizer 91.3% 60.9% 95.7% 88.4% 92.8% 98.6% 69
Bayer 62.7% 59.3% 83.1% 79.7% 89.8% 98.3% 59
Hoffmann-La Roche 82.1% 64.3% 90.5% 89.3% 95.2% 97.6% 84
Sanofi 50.0% 60.0% 81.7% 95.0% 96.7% 96.7% 60
Novartis 60.2% 54.4% 81.0% 84.5% 92.9% 96.0% 226
Unispital Lausanne 24.6% 53.8% 67.7% 72.3% 76.9% 83.1% 65
Unispital Genf 14.2% 42.5% 49.1% 67.9% 74.5% 77.4% 106
Universität Zürich 9.5% 40.7% 48.9% 60.6% 65.6% 72.9% 221
Unispital Basel 12.2% 41.7% 51.3% 58.3% 65.4% 70.5% 156
Inselspital Bern 16.5% 49.5% 58.3% 59.2% 64.1% 69.9% 103
# export
write_csv(
  relevant_trials,
  "output/trials.csv"
)
Linting

The code in this RMarkdown is linted with the lintr package, which is based on the tidyverse style guide.

lintr::lint("main.Rmd")
## main.Rmd:599:1: style: lines should not be more than 80 characters.
##     "https://raw.githubusercontent.com/ebmdatalab/euctr-tracker-data/master/trials.csv",
## ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
## main.Rmd:604:1: style: lines should not be more than 80 characters.
##   embd_euctr_data <- read_csv("input/ignore/embd/trials.csv", col_types = cols())
## ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~