This document illustrates the analysis of the article 20 years, 20 titles on srf.ch.
SRF Data attaches great importance to transparent and reproducible data preprocessing and -analysis. SRF Data believes in the principles of open data but also open and reproducible methods. Third parties should be empowered to build on the work of SRF Data and to generate new analyses and applications.
The analysis of the data was conducted in the R project for statistical computing. The RMarkdown script used to generate this document and all the resulting data can be downloaded under this link. Through executing analysis.Rmd
, the herein described process can be reproduced and this document can be generated. In the course of this, data from the folder ìnput
will be processed and results will be written into output
.
R version 3.4.4 is used. If the code does not work, it is very likely that an older R version is installed. If an error occures it sometimes helps to execute the script several times. Particularly in the case of package installation problems it could be helpful to restart the R session and execute the code over again. If necessary, third party libraries like libgdal-dev
have to be installed.
This report was generated on 2018-12-07 10:16:20. R version: 3.4.4 on x86_64-pc-linux-gnu. For this report, CRAN packages as of 2017-09-01 were used.
The code for the herein described process can also be freely downloaded from https://github.com/srfdata/2018-01-roger-federer. Criticism in the form of GitHub issues and pull requests are very welcome!
2018-01-roger-federer by SRF Data is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All code & data from SRF Data are available under https://srfdata.github.io.
The published information has been collated carefully, but no guarantee is offered of its completeness, correctness or up-to-date nature. No liability is accepted for damage or loss incurred from the use of this script or the information drawn from it. This exclusion of liability also applies to third-party content that is accessible via this offer.
player_match_stats
Column | Data Type | Description |
---|---|---|
match_id | num | unique id of the match |
tournament_event_id | int | unique id of the tournament (e.g. Roland Garros 2011) |
tournament_id | int | unique id of the sort of tournament (e.g. Roland Garros) |
season | int | year in 4 digit form (e.g. 2011) |
date | Date | date of the start of the tournament (e.g. 2011-10-20) |
level | chr | one of: ATP 500 (A), ATP 250 (B), Davis Cup (D), Tour Finals (F), Grand Slam (G), Masters (M), Alternative Finals (L), Olympia (O), World Team Cup (T) |
surface | chr | one of: Clay (C), Grass (G), Hard (H), Carpet (P) |
round | chr | one of: R128, R64, R32, R16, Quarter Finals (QF), Semi Finals (SF), Final (F), Robin Round (RR), BR |
best_of | int | play mode. One of: 3 or 5 sets |
player_id | int | unique id of player (e.g. 1234) |
player_rank | int | rank of player before match (e.g. 10) |
player_elo_rating | int | elo rating of player before match (e.g. 1000) |
player_age | num | age of player in years (e.g 21.2) |
player_height | int | height of player in centimeters (e.g. 173) |
opponent_id | int | unique id of opponent (e.g. 1234) |
opponent_rank | int | rank of opponent before match (e.g. 10) |
opponent_elo_rating | int | elo rating of opponent before match (e.g. 1000) |
opponent_seed | int | not used, many NAs – seed to start in a tournament |
opponent_entry | chr | not used, many NAs - one of LL, Q, WC, PR, SE |
opponent_country_id | chr | 3 character country origin of opponent (e.g. AUS) |
opponent_age | num | age of opponent in years (e.g 21.2) |
opponent_height | int | height of opponent in centimeters (e.g. 173) |
outcome | chr | “RET” if one of the players retired from the game early (else: NA) |
p_matches | int | 1 if player won the match, 0 if the opponent won |
o_matches | int | 0 if player won the match, 1 if the opponent won |
p_sets | int | number of sets the player won (e.g. 2) |
o_sets | int | number of sets the opponent won (e.g. 1) |
p_games | int | number of games the player won (e.g. 30) |
o_games | int | number of games the opponent won (e.g. 26) |
p_ace | int | number of aces the player made (e.g. 10) |
p_df | int | number of double faults the player made (e.g. 5) |
p_sv_pt | int | number of serves the player played (e.g. 120) |
p_1st_in | int | number of the players first serves that hit the field (e.g. 80) |
p_1st_won | int | number of first serves that the player won (e.g. 60) |
p_2nd_won | int | number of second serves that the player won (e.g. 40) |
p_sv_gms | int | number of games the player won with his serve (e.g. 10) |
p_bp_sv | int | number of break points the player saved (e.g. 3) |
p_bp_fc | int | number of break points the player faced (e.g. 5) |
o_ace | int | number of aces the opponent made (e.g. 10) |
o_df | int | number of double faults the opponent made (e.g. 5) |
o_sv_pt | int | number of serves the opponent played (e.g. 120) |
o_1st_in | int | number of the opponents first serves that hit the field (e.g. 80) |
o_1st_won | int | number of first serves that the opponent won (e.g. 60) |
o_2nd_won | int | number of second serves that the opponent won (e.g. 40) |
o_sv_gms | int | number of games the opponent won with his serve (e.g. 10) |
o_bp_sv | int | number of break points the opponent saved (e.g. 2) |
o_bp_fc | int | number of break points the opponent faced (e.g. 4) |
minutes | int | number the duration of a game in minutes (e.g 120) |
p_points | int | number of points the player made in the match (e.g. 90) |
o_points | int | number of points the opponent made in the match (e.g. 94) |
ace_rate | num | percentage of aces the player made as decimal (e.g. 0.15) |
double_faults | num | percentage of double faults the player made as decimal (e.g. 0.15) |
first_serve_in | num | percentage of the players first serve that hit the field as decimal (e.g. 0.8) |
first_serve_won | num | percentage of points the player made with his first serve as decimal (e.g. 0.75) |
second_serve_needed | num | percentage of the players first serve that did not hit the field as decimal (e.g. 0.25) |
second_serve_won | num | percentage of points the player made with his second serve as decimal (e.g. 0.4) |
serves_won | num | percentage of points the player made with his serve total as decimal (e.g. 0.9) |
breakpoints_saved | num | percentage of break points the player saved as decimal (e.g. 0.9) |
o_ace_rate | num | percentage of aces the opponent made as decimal (e.g. 0.1) |
first_return_won | num | percentage of points the opponent made during the first serve as decimal (e.g. 0.3) |
second_return_won | num | percentage of points the opponent made during the second serve as decimal (e.g. 0.4) |
returns_won | num | percentage of points the opponent made during the serve total as decimal (e.g. 0.5) |
breakpoints_won | num | percentage of break points the player won as decimal (e.g. 0.9) |
Set1 | chr | result of the 1st set, player left, opponent right (e.g. 7-6(3)) |
Set2 | chr | result of the 2nd set, player left, opponent right (e.g. 2-6) |
Set3 | chr | result of the 3rd set, player left, opponent right (e.g. 6-4 or NA) |
Set4 | chr | result of the 4th set, player left, opponent right (e.g. 6-7(2) or NA) |
Set5 | chr | result of the 5th set, player left, opponent right (e.g. 3-6 or NA) |
RET | chr | “RET” if one of the players retired from the game early (else: NA) |
score | chr | score of the whole match, player left, opponent right (e.g. 7-6(3) 2-6 6-4 6-7(2) 3-6) |
p_tie_breaks_won | num | percentage of tie breaks the player won as decimal (e.g. 0.5) |
o_tie_breaks_won | num | percentage of tie breaks the opponent won as decimal (e.g. 0.5) |
p_deciding_sets_won | num | percentage of deciding sets the player won as decimal (e.g. 0) |
o_deciding_sets_won | num | percentage of deciding sets the opponent won as decimal (e.g. 1) |
The database was set up and initialised by Mileta Cekovic which runs Ultimate Tennis Statistics. This project’s data is based on information found on the ATP Website, which does not offer an official data API.
The point by point data used in chapter 9 was collected by Jeff Sackmann and a team of volunteers. Their project is called the Map Charting Project. The data is not imported directly but via the deuce package by Stephanie Kovalchik who offers a lot of Jeff Sackmann’s data in a handy R package.
## [1] "package package:rmarkdown detached"
# von https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# alle Packages, die nicht gebraucht werden, können hier entfernt werden (auskommentieren reicht nicht!)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(lubridate) # date handling
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(ggrepel) # repelling text labels for ggplot
library(RColorBrewer) # better colors for ggplot
library(magrittr) # pipes
library(stringr) # string manipulation
library(scales) # scales for ggplot2
library(jsonlite) # json
library(lintr) # code linting, auf keinen Fall entfernen ;-)
library(rmarkdown) # muss für automatisches knitting
# in deploy.sh eingebunden werden",
file = "manifest.R")
# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
if (!require(devtools)) {
install.packages("devtools", repos = "http://cran.us.r-project.org")
require(devtools)
}
devtools::install_github("RevolutionAnalytics/checkpoint",
ref = "v0.3.2", # could be adapted later,
# as of now (beginning of July 2017
# this is the current release on CRAN)
repos = "http://cran.us.r-project.org")
require(checkpoint)
}
## Loading required package: checkpoint
##
## checkpoint: Part of the Reproducible R Toolkit from Microsoft
## https://mran.microsoft.com/documents/rro/reproducibility/
# nolint start
if (!dir.exists("~/.checkpoint")) {
dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
project = path_to_wd,
verbose = T,
scanForPackages = T,
use.knitr = F,
R.version = R_version)
## Scanning for packages used in this project
## rmarkdown files found and will not be parsed. Set use.knitr = TRUE
## - Discovered 12 packages
## All detected packages already installed
## checkpoint process complete
## ---
rm(package_date)
# install deuce package for point-by-point data
# https://github.com/skoval/deuce
if (!require(deuce)) {
if (!require(devtools)) {
install.packages("devtools", repos = "http://cran.us.r-project.org")
require(devtools)
}
devtools::install_github("skoval/deuce")
library(deuce)
}
## Loading required package: deuce
## Loading required package: jsonlite
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: stringr
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Loading required package: rvest
## Loading required package: xml2
source("manifest.R")
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Conflicts with tidy packages ----------------------------------------------
## as.difftime(): lubridate, base
## date(): lubridate, base
## filter(): dplyr, stats
## intersect(): lubridate, base
## lag(): dplyr, stats
## setdiff(): lubridate, base
## union(): lubridate, base
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
unlink("manifest.R")
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.1 LTS
##
## Matrix products: default
## BLAS: /opt/R/R-3.4.4/lib64/R/lib/libRblas.so
## LAPACK: /opt/R/R-3.4.4/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_1.6 lintr_1.0.1 scales_0.5.0
## [4] magrittr_1.5 RColorBrewer_1.1-2 ggrepel_0.6.5
## [7] purrr_0.2.3 readr_1.1.1 tidyr_0.7.0
## [10] tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1
## [13] deuce_1.0 rvest_0.3.2 xml2_1.1.1
## [16] lubridate_1.6.0 stringr_1.2.0 dplyr_0.7.2
## [19] jsonlite_1.5 checkpoint_0.4.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 cellranger_1.1.0 compiler_3.4.4 plyr_1.8.4
## [5] bindr_0.1 forcats_0.2.0 tools_3.4.4 digest_0.6.12
## [9] gtable_0.2.0 evaluate_0.10.1 nlme_3.1-131.1 lattice_0.20-35
## [13] pkgconfig_2.0.1 rlang_0.1.2 rex_1.1.1 psych_1.7.5
## [17] yaml_2.1.14 parallel_3.4.4 haven_1.1.0 bindrcpp_0.2
## [21] httr_1.3.1 knitr_1.17 hms_0.3 rprojroot_1.2
## [25] grid_3.4.4 glue_1.1.1 R6_2.2.2 readxl_1.0.0
## [29] foreign_0.8-69 modelr_0.1.1 reshape2_1.4.2 backports_1.1.0
## [33] htmltools_0.3.6 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
## [37] stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2
# manual
# create lookup tables for level names
levels <- data.frame(t(c(
"B" = "ATP 250",
"A" = "ATP 500",
"M" = "Masters",
"F" = "Tour Finals",
"G" = "Grand Slam",
"D" = "Davis Cup",
"L" = "Alternative Finals",
"O" = "Olympia",
"T" = "World Team Cup"
)), stringsAsFactors = FALSE) %>%
gather(level, name)
# a (subjective) player set of tennis champions to compare Federer against
champions <- data.frame(t(c(
"3819" = "Roger Federer",
"284" = "Jimmy Connors",
"437" = "Bjorn Borg",
"581" = "John McEnroe",
"656" = "Ivan Lendl",
"1222" = "Stefan Edberg",
"1948" = "Pete Sampras",
"1736" = "Andre Agassi"
)), stringsAsFactors = FALSE) %>%
gather(id, name) %>%
mutate(id = as.numeric(gsub("X(\\d*)(\\.\\d)?", "\\1", id)))
# The big 4: Murray, Djoko, Nadal, Federer
big_4 <- data.frame(t(c(
"3819" = "Roger Federer",
"4742" = "Rafael Nadal",
"4920" = "Novak Djokovic",
"4913" = "Andy Murray"
)), stringsAsFactors = FALSE) %>%
gather(id, name) %>%
mutate(id = as.numeric(gsub("X(\\d*)(\\.\\d)?", "\\1", id)))
# RData
load(file = "input/rdata/tournaments.RData")
load(file = "input/rdata/player.RData")
load(file = "input/rdata/player_names.RData")
load(file = "input/rdata/player_ranking.RData")
load(file = "input/rdata/player_match_stats.RData")
load(file = "input/rdata/player_stats.RData")
# select all of Federers matches
federer_match_stats <- player_match_stats %>%
filter(player_id == 3819)
# read in ranking point breakdown provided on ATP website
# src: www.atpworldtour.com/en/rankings/rankings-faq
ranking_point_breakdown <- read.csv(
"input/ranking_points.csv",
stringsAsFactors = FALSE
)
# generate empty frame for each day of the time frame
interpolate_dates <- data.frame(
rank_date = seq(
from = as.Date("1998-07-06"),
to = Sys.Date(),
by = "day"
)
)
# as a navigation I want to use Federers ATP ranks
federer_ranking <- player_ranking %>%
filter(player_id == 3819) %>%
collect() %>%
filter(rank_date >= as.Date("1998-07-06")) %>%
arrange(desc(rank_date)) %>%
select(-player_id) %>%
right_join(
interpolate_dates,
by = "rank_date"
) %>%
fill(rank, rank_points) %>%
rename(date = rank_date)
# test plot
ggplot(
data = federer_ranking,
aes(
x = date,
y = rank,
group = 1
)
) +
coord_flip() +
geom_line() +
theme_minimal() +
scale_x_date() +
scale_y_continuous(trans = trans_new(
"nlog10_trans",
transform = function(y) {
-log10(y)
},
inverse = function(y) {
10 ** y
})
)
# get order of rounds from ranking point breakdown table
ranking_point_levels <- names(ranking_point_breakdown)
# select only Grand Slam matches and keeop only latest match per tournament
federer_grand_slams <- federer_match_stats %>%
filter(level == "G") %>%
group_by(tournament_id, season) %>%
mutate(round = factor(round, levels = ranking_point_levels)) %>%
arrange(date, round) %>%
filter(as.integer(round) == min(as.integer(round))) %>%
mutate(round = as.character(round)) %>%
mutate(round = ifelse(round == "F" & p_matches == 1, "W", round)) %>%
ungroup()
# add slugs of tournament name
grand_slam_slugs <- data.frame(t(c(
"1" = "ao", # Australian Open
"19" = "fo", # French Open
"17" = "w", # Wimbledon
"6" = "uso" # US Open
))) %>%
gather(tournament_id, slug) %>%
mutate(
tournament_id = as.numeric(gsub("X(\\d*)(\\.\\d)?", "\\1", tournament_id))
)
## Warning: attributes are not identical across measure variables;
## they will be dropped
export_federer_grand_slams <- federer_grand_slams %>%
left_join(
grand_slam_slugs,
by = "tournament_id"
) %>%
select(slug, season, round) %>%
mutate(slug = factor(slug, levels = grand_slam_slugs$slug)) %>%
arrange(season, slug)
# cut off all the data points that occur after player reached his highest rank
champions_rise <- player_ranking %>%
filter(rank_date >= as.Date("1973-01-01")) %>%
filter(
player_id %in% champions$id |
player_id %in% big_4$id
) %>%
left_join(
player_names,
by = "player_id"
) %>%
left_join(
player %>%
select(player_id, dob),
by = "player_id"
) %>%
mutate(
age = lubridate::as.duration(rank_date - dob) %>%
as.numeric("years")
) %>%
group_by(player_id) %>%
mutate(
highest_rank = min(rank, na.rm = TRUE),
age_rank_reached = min(age[rank == highest_rank], na.rm = TRUE)
) %>%
filter(age <= age_rank_reached)
# rise
ggplot(
data = champions_rise,
aes(x = age, y = rank)
) +
geom_step(
data = champions_rise,
aes(colour = name),
size = 0.4,
alpha = 0.9
) +
geom_step(
data = champions_rise %>%
filter(player_id == 3819),
size = 1,
alpha = 1
) +
xlim(c(15, 30)) +
scale_y_continuous(trans = trans_new(
"nlog10_trans",
transform = function(y) {
-log10(y)
},
inverse = function(y) {
10 ** y
}),
name = "ATP Rank (logarithmic)",
limits = c(1000, 0.5)
) +
geom_text_repel(
data = champions_rise %>%
filter(
player_id %in% champions$id |
player_id %in% big_4$id
) %>%
group_by(name) %>%
filter(rank == min(rank, na.rm = TRUE)),
aes(label = name),
size = 3,
segment.color = NA,
angle = 90
) +
scale_colour_manual(values = colorRampPalette(brewer.pal(7, "Set1"))(11)) +
theme_minimal() +
labs(
title = "ATP Ranking vs. Age",
subtitle = "All players having reached the top 20 in grey.",
x = "Age"
)
# get match stats for big 4
big_4_match_stats <- player_match_stats %>%
filter(player_id %in% big_4$id) %>%
left_join(player_names, by = "player_id")
# calculate means per season
big_4_serve_stat_means <- big_4_match_stats %>%
select(season, name, ace_rate:breakpoints_saved, player_id) %>%
group_by(season, name, player_id) %>%
summarise_all(mean, na.rm = TRUE) %>%
gather(category, value, ace_rate:breakpoints_saved)
# extract serves won
big_4_serves_won <- big_4_serve_stat_means %>%
filter(category == "serves_won")
ggplot(
data = big_4_serves_won,
aes(
x = season,
y = value,
color = name,
size = player_id == 3819
)
) +
geom_line() +
theme_minimal() +
scale_y_continuous(labels = scales::percent) +
labs(
title = "Serves won in percent",
x = "Season",
y = "Share"
) +
scale_size_manual(values = c(0.5, 1.5)) +
scale_colour_manual(values = colorRampPalette(brewer.pal(7, "Set1"))(10)) +
xlim(min(big_4_serves_won$season), max(big_4_serves_won$season) + 5) +
geom_text_repel(
data = big_4_serves_won %>%
group_by(player_id) %>%
filter(season == max(season)),
aes(label = name),
size = 3,
nudge_x = 1,
segment.color = NA
) +
theme(legend.position = "none")
# generate empty frame for each day of the time frame
interpolate_dates <- data.frame(
rank_date = seq(
from = as.Date("1974-07-29"),
to = Sys.Date(),
by = "week"
)
)
# create data frame with only number ones for simplified plot
no_1s <- player_ranking %>%
filter(rank_date > as.Date("1973-01-01")) %>%
filter(rank == 1) %>%
# reduce ranking data to mondays
mutate(
rank_date = rank_date - as.numeric(format(rank_date, "%u")) + 1
) %>%
left_join(
player_names %>%
select(player_id, name),
by = "player_id"
) %>%
right_join(
interpolate_dates,
by = "rank_date"
) %>%
select(-rank_points, -rank) %>%
fill(player_id, name)
# calculate records for most consecutive weeks
longest_periods <- no_1s %>%
mutate(
consecutive = 0, # needed as starting point
consecutive = ifelse(
player_id != lag(player_id, default = 0), # restart if player changed
row_number(),
NA
)
) %>%
tidyr::fill(consecutive) %>%
group_by(name, consecutive) %>%
tally() %>%
ungroup() %>%
top_n(10, wt = n) %>%
arrange(desc(n)) %>%
select(name, consecutive_weeks = n)
knitr::kable(longest_periods)
name | consecutive_weeks |
---|---|
Roger Federer | 237 |
Jimmy Connors | 160 |
Ivan Lendl | 157 |
Novak Djokovic | 122 |
Pete Sampras | 102 |
Jimmy Connors | 84 |
Pete Sampras | 82 |
Ivan Lendl | 80 |
Lleyton Hewitt | 75 |
Rafael Nadal | 68 |
# define order of players manually
n1_levels <- c(
"Jimmy Connors",
"Bjorn Borg",
"John Mcenroe",
"Ivan Lendl",
"Stefan Edberg",
"Pete Sampras",
"Andre Agassi",
"Roger Federer",
"Rafael Nadal",
"Novak Djokovic",
"Andy Murray"
)
ggplot(
data = no_1s %>%
filter(name %in% n1_levels),
aes(
x = rank_date,
y = factor(name, levels = rev(n1_levels))
)
) +
geom_point(size = 0.01) +
theme_minimal() +
theme(legend.position = "none") +
labs(
x = "Year",
y = "Player",
title = "When where the players in the N°1 of the ATP Rankings"
)
# select only relevant matches and mangle so federer has -1 for losses
# and +1 for wins
federer_vs_big_4_matches <- player_match_stats %>%
filter(player_id == 3819 & opponent_id %in% big_4$id) %>%
left_join(big_4, by = c("opponent_id" = "id")) %>%
arrange(date) %>%
mutate(o_matches = o_matches * -1) %>%
mutate(o_sets = o_sets * -1) %>%
mutate(o_games = o_games * -1) %>%
mutate(o_points = o_points * -1) %>%
select(
date,
match_id,
opponent_id,
name,
surface,
p_matches,
o_matches,
p_sets,
o_sets,
p_games,
o_games,
p_points,
o_points
) %>%
mutate(fed_won = p_matches == 1) %>%
gather(key, value, p_matches:o_points)
ggplot(
data = federer_vs_big_4_matches,
aes(
x = factor(date),
y = value,
fill = surface
)
) +
geom_bar(
data = federer_vs_big_4_matches %>%
filter(key == "p_matches" | key == "o_matches"),
stat = "identity"
) +
facet_wrap(
~ name,
nrow = 3,
scales = "free_x"
) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
geom_hline(yintercept = 0) +
labs(
x = "All matches (chronological)",
y = "Opponent | Fed",
title = "Federer vs. Big 4: Matches won over time",
fill = "Surface"
)
under_pressure_years <- big_4_match_stats %>%
group_by(season, name, player_id) %>%
summarise(
tie_breaks_won = sum(p_tie_breaks_won, na.rm = TRUE) /
(sum(p_tie_breaks_won, na.rm = TRUE) +
sum(o_tie_breaks_won, na.rm = TRUE)),
deciding_sets_won = sum(p_deciding_sets_won, na.rm = TRUE) /
(sum(p_deciding_sets_won, na.rm = TRUE) +
sum(o_deciding_sets_won, na.rm = TRUE)),
breakpoints_saved = sum(p_bp_sv, na.rm = TRUE) /
sum(p_bp_fc, na.rm = TRUE),
breakpoints_converted = (sum(o_bp_fc, na.rm = TRUE) -
sum(o_bp_sv, na.rm = TRUE)) /
sum(o_bp_fc, na.rm = TRUE)
) %>%
ungroup %>%
mutate(
under_pressure_index =
breakpoints_converted +
breakpoints_saved +
tie_breaks_won +
deciding_sets_won
) %>%
gather(index, value, tie_breaks_won:under_pressure_index)
ggplot(
data = under_pressure_years %>%
filter(
name %in% big_4$name,
index == "under_pressure_index"
),
aes(
x = season,
y = value,
color = name,
size = name == "Roger Federer"
)
) +
geom_line() +
theme_minimal() +
scale_size_manual(values = c(0.5, 1.5)) +
scale_colour_manual(values = colorRampPalette(brewer.pal(7, "Set1"))(10)) +
theme(legend.position = "none") +
geom_text_repel(
data = under_pressure_years %>%
filter(
name %in% big_4$name,
index == "under_pressure_index"
) %>%
group_by(name) %>%
filter(season == max(season)),
aes(label = name),
size = 3,
nudge_x = 1,
segment.color = NA
) +
labs(
title = "Under Pressure Index (ATP)",
x = "Years",
y = "Under Pressure Performance"
)
## Warning: Removed 2 rows containing missing values (geom_path).
# read from materialized veiw player_stats
wins_defeats <- player_stats %>%
select(player_id, wins = p_matches, defeats = o_matches) %>%
collect()
# calculate total and percentage values
wins_defeats %<>%
mutate(total = wins + defeats) %>%
mutate(win_percentage = wins / total)
# filter out irrelevant players with less than 100 matches played
wins_defeats %<>%
filter(total > 100, win_percentage > 0.6)
# join player names
wins_defeats %<>%
left_join(
player_names,
by = "player_id"
)
# define players of interest (top 10 plus historic champions)
players_of_interest <- wins_defeats %>%
filter(
player_id %in% big_4$id |
player_id %in% champions$id
)
ggplot(
data = wins_defeats,
aes(
x = win_percentage,
y = total,
alpha = win_percentage,
color = active
)
) +
guides(alpha = FALSE) +
geom_point() +
geom_text_repel(data = players_of_interest, aes(label = name)) +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
scale_x_continuous(labels = scales::percent) +
labs(
title = "Numbers of Games played vs. Games won in percent",
x = "Share in percent",
y = "Total matches played",
color = "Active players"
)
# read data frame from deuce package
data(mcp_points)
# filter out women's matches and separate match_id into columns
mcp_points %<>%
filter(grepl("\\d{8}-M-", match_id)) %>%
separate(match_id,
sep = "-",
into = c(
"date",
"sex",
"place",
"level",
"player_1",
"player_2"
)
) %>%
select(-sex) %>%
mutate(
date = as.POSIXct(date, format = "%Y%m%d"),
player_1 = trimws(gsub("_", " ", player_1)),
player_2 = trimws(gsub("_", " ", player_2))
) %>%
mutate_at(vars(isAce:isDouble), as.logical) %>%
left_join(
player_names %>%
select(player_1 = name, player_1_id = player_id),
by = "player_1"
) %>%
left_join(
player_names %>%
select(player_2 = name, player_2_id = player_id),
by = "player_2"
)
## Warning in grepl("\\d{8}-M-", match_id): input string 97663 is invalid in
## this locale
## Warning in grepl("\\d{8}-M-", match_id): input string 97664 is invalid in
## this locale
## Warning in grepl("\\d{8}-M-", match_id): input string 97665 is invalid in
## this locale
## Warning in grepl("\\d{8}-M-", match_id): input string 97666 is invalid in
## this locale
## Warning in grepl("\\d{8}-M-", match_id): input string 97667 is invalid in
## this locale
# summarise to matches
mcp_matches <- mcp_points %>%
# filter out one match that has NAs in important columns
filter(place != "Aptos_CH" | level != "SF") %>%
gather(indicator, name, player_1:player_2) %>%
mutate(indicator = ifelse(indicator == "player_1", 1, 2)) %>%
mutate(
player_id = ifelse(indicator == 1, player_1_id, player_2_id),
opponent_id = ifelse(indicator == 1, player_2_id, player_1_id)
) %>%
group_by(date, place, level, player_id, opponent_id) %>%
summarise(
rallyLength = mean(as.numeric(rallyCount), na.rm = TRUE),
number_of_rallies = n()
)
## Warning in mean(as.numeric(rallyCount), na.rm = TRUE): NAs introduced by
## coercion
## Warning in mean(as.numeric(rallyCount), na.rm = TRUE): NAs introduced by
## coercion
# get values for big 4
big_4_mcp_values <- mcp_matches %>%
filter(player_id %in% big_4$id) %>%
# join player names
left_join(player_names, by = "player_id")
# calculate means for each season
big_4_rally_lengths <- big_4_mcp_values %>%
group_by(season = format(date, "%Y"), player_id, name) %>%
# calculate a weighted average by multiplying the average rally length of
# each match with the number of rallies in the match
summarise(
rallyLength = sum(rallyLength * number_of_rallies) / sum(number_of_rallies)
) %>%
ungroup()
ggplot(
data = big_4_rally_lengths,
aes(
x = season,
y = rallyLength,
color = name,
group = name
)
) +
geom_line() +
theme_minimal() +
ylim(0, NA) +
labs(
title = "Average length of a rally by the big 4 players",
y = "Number of times the ball was hit",
x = "Time"
)
average_match_lengths <- big_4_match_stats %>%
group_by(name, player_id, best_of, p_matches) %>%
summarise(
total_matches = n(),
coverage = sum(ifelse(is.na(minutes), 0, 1)) / n(),
minutes = mean(minutes, na.rm = TRUE)
) %>%
ungroup() %>%
mutate(
best_of = factor(best_of),
p_matches = ifelse(p_matches == 1, "Wins", "Defeats")
)
# calculate averages
average_of_all_players <- player_match_stats %>%
filter(!is.na(minutes)) %>%
group_by(p_matches, best_of, player_id) %>%
# calculate average per player
summarise(minutes = mean(minutes, na.rm = TRUE)) %>%
# calculate everage of all players
summarise(minutes = mean(minutes, na.rm = TRUE)) %>%
ungroup() %>%
mutate(
best_of = factor(best_of),
p_matches = ifelse(p_matches == 1, "Wins", "Defeats")
)
ggplot(
data = average_match_lengths,
aes(
x = best_of,
y = minutes,
label = name,
color = name == "Roger Federer"
)
) +
theme_minimal() +
theme(legend.position = "none") +
facet_wrap(~ p_matches) +
geom_text(
data = average_match_lengths %>%
filter(player_id %in% big_4$id),
size = 3
) +
geom_point(
data = average_of_all_players,
inherit.aes = FALSE,
aes(
x = best_of,
y = minutes
)
) +
scale_color_manual(values = c("black", "red")) +
ylim(50, 200) +
labs(
title = "Average match length of Top 10 players",
subtitle = "The black dot is the average over all players",
x = "Number of Sets (Best of …)",
y = "Average Minutes"
)
# gather for better joinability
ranking_point_breakdown %<>%
gather(round, points, W:RR)
# generate new table with one row per tournament and player
# with cumulative ranking points added per tournament
tournament_ranking_points <- player_match_stats %>%
filter(
player_id %in% champions$id |
player_id %in% big_4$id
) %>%
select(
player_id,
tournament_event_id,
date,
age = player_age,
level,
round,
season,
p_matches
) %>%
group_by(player_id, tournament_event_id) %>%
mutate(round = factor(round, levels = ranking_point_levels)) %>%
arrange(player_id, tournament_event_id, round) %>%
# keep only most important match (multiple if RR) of tournament
filter(as.integer(round) == min(as.integer(round))) %>%
mutate(round = as.character(round)) %>%
mutate(round = ifelse(round == "F" & p_matches == 1, "W", round)) %>%
left_join(
tournaments %>%
select(tournament_event_id, draw_size) %>%
# replace two faulty values in the tournaments table
mutate(draw_size = ifelse(
tournament_event_id == 3843 | tournament_event_id == 3913,
56,
draw_size
)),
by = "tournament_event_id"
) %>%
# calculate min_draw_size to match data from atp website
mutate(
min_draw_size = ifelse(
draw_size > 56 & level == "M",
56,
0
)
) %>%
mutate(
min_draw_size = ifelse(
draw_size > 32 & level %in% c("A", "B"),
32,
min_draw_size
)
) %>%
left_join(
ranking_point_breakdown,
by = c("level", "round", "min_draw_size")
) %>%
mutate(points = ifelse(is.na(points), 0, points)) %>%
mutate(points = ifelse(round == "RR" & p_matches == 0, 0, points)) %>%
# it's really strange, but there are some negative values in the age column,
# but only for old players
filter(age > 14) %>%
group_by(player_id) %>%
arrange(age) %>%
mutate(cumsum = cumsum(points))
# plot
ggplot(
data = tournament_ranking_points,
aes(
x = age,
y = cumsum,
group = player_id
)
) +
theme_minimal() +
theme(legend.position = "none") +
geom_line(
colour = "darkgrey",
alpha = 0.4,
size = 0.2
) +
geom_line(
data = tournament_ranking_points %>%
filter(
player_id %in% champions$id |
player_id %in% big_4$id
),
aes(color = factor(player_id)),
alpha = 1,
size = 0.2
) +
geom_text_repel(
data = tournament_ranking_points %>%
filter(
player_id %in% champions$id |
player_id %in% big_4$id
) %>%
filter(age == max(age)) %>%
left_join(player_names, by = "player_id"),
aes(label = name),
size = 3,
nudge_x = 1,
segment.color = NA
) +
geom_line(
data = tournament_ranking_points %>%
filter(player_id == 3819),
aes(group = 3819),
alpha = 1,
size = 1
) +
labs(
x = "Age",
y = "Cumulative ranking points",
title = "Cumulative Ranking Points won over career"
)
lintr::lint("main.Rmd", linters =
lintr::with_defaults(
commented_code_linter = NULL,
trailing_whitespace_linter = NULL
)
)
## main.Rmd:889:18: style: Variable and function names should be all lowercase.
## mutate_at(vars(isAce:isDouble), as.logical) %>%
## ^~~~~
## main.Rmd:889:24: style: Variable and function names should be all lowercase.
## mutate_at(vars(isAce:isDouble), as.logical) %>%
## ^~~~~~~~
## main.Rmd:913:35: style: Variable and function names should be all lowercase.
## rallyLength = mean(as.numeric(rallyCount), na.rm = TRUE),
## ^~~~~~~~~~
## main.Rmd:929:23: style: Variable and function names should be all lowercase.
## rallyLength = sum(rallyLength * number_of_rallies) / sum(number_of_rallies)
## ^~~~~~~~~~~
## main.Rmd:937:9: style: Variable and function names should be all lowercase.
## y = rallyLength,
## ^~~~~~~~~~~