This document describes the pre-processing and exploratory analysis of the data set that is the basis of the article Die grössten Hürden für Frauen in der Schweizer Politik published on srf.ch.
SRF Data attaches importance to the fact that the data pre-processing and analysis can be reproduced and checked. SRF Data believes in the principle of open data, but also open and comprehensible methods. On the other hand, it should be possible for third parties to build on this preparatory work and thus generate further evaluations or applications.
The preprocessing and analysis of the data was conducted in the R project for statistical computing. The RMarkdown script used to generate this document and all the resulting data can be downloaded under this link. Through executing main.Rmd
, the herein described process can be reproduced and this document can be generated. In the course of this, data from the folder input
will be processed and results will be written to output
.
SRF Data uses Timo Grossenbacher’s rddj-template as the basis for its R scripts. If you have problems executing this script, it may help to study the instructions from the rddj-template.
This report was generated on 2019-09-12 17:07:41. R version: 3.5.3 on x86_64-apple-darwin15.6.0. For this report, CRAN packages as of 2019-03-01 were used.
The code for the herein described process can also be freely downloaded from https://github.com/srfdata/2019-09-women-in-politics.
2019-09-women-in-politics by SRF Data is licensed under a Creative Commons Namensnennung - Attribution ShareAlike 4.0 International License.
Code and data by SRF Data are available on https://srfdata.github.io.
The published information has been carefully compiled, but does not claim to be up-to-date, complete or correct. No liability is assumed for damages arising from the use of this script or the information drawn from it. This also applies to contents of third parties which are accessible via this offer.
candidates_per_gender_per_year.csv
Attribute | Type | Description |
---|---|---|
year | Number | Year of measurement |
category | Enum | Either “running” or “elected” |
share | Number | Share of women as decimal |
candidates_per_gender_per_party_and_year.csv
Attribute | Type | Description |
---|---|---|
year | Number | Year of measurement |
category | Enum | Either “running” or “elected” |
party | Number | Party in question |
n | Number | Number of candidates in that canton, party, year and category |
share | Number | Share of women as decimal |
candidates_detailed.csv
Attribute | Type | Description |
---|---|---|
party | Number | Party in question |
canton | Number | Canton of measurement |
year | Number | Year of measurement |
n | Number | Number of candidates in that canton, party and year |
share | Number | Share of women as decimal |
-> input/candidates_2019.csv
Information about the candidates running this year was acquired by smartvote. As we cannot distribute any proprietary material from them, we have reduced the columns to those necessary for this analysis: ID, name, gender, district (canton) and party. The export was done via their API on the 6th of September 2019 with 4596 unique candidates.
Partiese were simplified as follows:
-> input/su-d-17.02.02.05.02.06.xlsx
Candidates running provided by the Federal Statistical Office.
Footnotes were removed: ¹ Stille Wahl ² Inkl. Vigilance/Genf
-> input/je-d-17.02.02.02.01.02.xlsx
Candidates elected provided by the Federal Statistical Office.
Footnotes were removed: 1) 2009: Fusion von FDP und LPS auf nationaler Ebene unter der Bezeichnung “FDP.Die Liberalen”. Fusion von FDP und LP im Kanton Waadt im Jahr 2012. Im Kanton Basel-Stadt haben FDP und LP nicht fusioniert. Da die LP-BS Mitglied der „FDP.Die Liberalen Schweiz“ ist, wird das Mandat der LP-BS auf gesamtschweizerischer Ebene der FDP zugeteilt. 2) CSP-OW: Der Gewählte bleibt in der CVP-Fraktion.
as well as
## [1] "package package:rmarkdown detached"
## Loading required package: knitr
## Loading required package: rstudioapi
# from https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# if you don't need a package, remove it from here (commenting is probably not sufficient)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(rstudioapi)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(glue) # use glue directly without namespacing
library(magrittr) # pipes
library(readxl) # excel
library(scales) # scales for ggplot2
library(lintr) # code linting
library(rmarkdown)",
file = "manifest.R")
# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
if (!require(devtools)) {
install.packages("devtools", repos = "http://cran.us.r-project.org")
require(devtools)
}
devtools::install_github("RevolutionAnalytics/checkpoint",
ref = "v0.3.2", # could be adapted later,
# as of now (beginning of July 2017
# this is the current release on CRAN)
repos = "http://cran.us.r-project.org")
require(checkpoint)
}
# nolint start
if (!dir.exists("~/.checkpoint")) {
dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
project = path_to_wd,
verbose = T,
scanForPackages = T,
use.knitr = F,
R.version = R_version)
rm(package_date)
source("manifest.R")
unlink("manifest.R")
sessionInfo()
## R version 3.5.3 Patched (2019-03-11 r76221)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_CH.UTF-8/de_CH.UTF-8/de_CH.UTF-8/C/de_CH.UTF-8/de_CH.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_1.11 lintr_1.0.3 scales_1.0.0 readxl_1.3.0
## [5] magrittr_1.5 glue_1.3.0 forcats_0.4.0 stringr_1.4.0
## [9] dplyr_0.8.0.1 purrr_0.3.0 readr_1.3.1 tidyr_0.8.2
## [13] tibble_2.0.1 ggplot2_3.1.0 tidyverse_1.2.1 checkpoint_0.4.0
## [17] rstudioapi_0.9.0 knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 cellranger_1.1.0 plyr_1.8.4 pillar_1.3.1
## [5] compiler_3.5.3 tools_3.5.3 digest_0.6.18 lubridate_1.7.4
## [9] jsonlite_1.6 evaluate_0.13 nlme_3.1-137 gtable_0.2.0
## [13] lattice_0.20-38 pkgconfig_2.0.2 rlang_0.3.1 rex_1.1.2
## [17] cli_1.0.1 yaml_2.2.0 haven_2.1.0 xfun_0.5
## [21] withr_2.1.2 xml2_1.2.0 httr_1.4.0 hms_0.4.2
## [25] generics_0.0.2 grid_3.5.3 tidyselect_0.2.5 R6_2.4.0
## [29] modelr_0.1.4 backports_1.1.3 htmltools_0.3.6 rvest_0.3.2
## [33] assertthat_0.2.0 colorspace_1.4-0 stringi_1.3.1 lazyeval_0.2.1
## [37] munsell_0.5.0 broom_0.5.1 crayon_1.3.4
# map over election years and read data
# do each year twice, once for the candidates
# and once for the elected parliamentarians
suppressMessages(
candidates_per_gender <- bind_rows(
as.tibble(seq(1971, 2015, 4)) %>%
mutate(
category = "elected",
path = "input/je-d-17.02.02.02.01.02.xlsx",
skip = 3,
n_max = 33
),
as.tibble(seq(1971, 2015, 4)) %>%
mutate(
category = "running",
path = "input/su-d-17.02.02.05.02.06.xlsx",
skip = 4,
n_max = 27
)
) %>%
# name first column
rename(year = 1) %>%
pmap_dfr(function(...) {
current <- tibble(...)
# the worksheets have the election year as its name
data <- read_excel(
path = current$path,
sheet = as.character(current$year),
skip = current$skip,
n_max = current$n_max
)
# as each party has two columns, but only the first is correctly named,
# we'll need to fill up the second one with the value of the first
filled_colnames <- colnames(data) %>%
as.tibble() %>%
mutate(value = if_else(
str_detect(value, "^\\.\\."),
glue("{lag(value)}_M"), # add _M to second
glue("{value}_F") # add _F to first
)) %>%
pull(value)
# assign filled colnames
colnames(data) <- filled_colnames
# filter out empty rows
data %>%
rename(canton = 1) %>%
filter(!is.na(canton)) %>%
# remove columns without party data
select(-starts_with("..")) %>%
# make tidy
gather(party, n, -canton) %>%
# split party and gender into two
separate(party, into = c("party", "gender"), sep = "_") %>%
# convert to numeric (stars will be converted to NA)
mutate(n = as.numeric(n)) %>%
mutate(
# add flag for running / elected
category = current$category,
# add year
year = current$year
)
}) %>%
# remove footnotes
mutate(
canton = str_replace(canton, " [¹²]$", ""),
canton = str_replace(canton, " \\d\\)$", ""),
party = str_replace(party, " [¹²]$", ""),
party = str_replace(party, " \\d\\)$", "")
) %>%
# fix typo (space missing between A. and Rh.)
mutate(canton = str_replace(canton, "A.Rh.", "A. Rh.")) %>%
# convert to factors
mutate_at(vars(canton, party, gender, category), factor)
)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
## Warning: NAs durch Umwandlung erzeugt
# integrity check: remove total from data and check whether values match
# save into separate variable
totals_per_canton <- candidates_per_gender %>%
filter(canton == "Total") %>%
select(-canton)
totals_per_party <- candidates_per_gender %>%
filter(party == "Total") %>%
select(-party)
# remove from original data
candidates_per_gender %<>%
filter(canton != "Total" & party != "Total")
Integrity check: The following table should be empty as it compares the totals calculated by summing all values and the totals that the BFS provides in its tables.
# compare
knitr::kable(
candidates_per_gender %>%
group_by(gender, party, year, category) %>%
summarise(n_calc = sum(n, na.rm = TRUE)) %>%
left_join(
totals_per_canton,
by = c("gender", "party", "year", "category")
) %>%
# show only entries where the calculated total and the other total differ
filter(n != n_calc)
)
gender | party | year | category | n_calc | n |
---|---|---|---|---|---|
F | FDP | 2011 | running | 101 | 109 |
F | FDP | 2015 | running | 138 | 142 |
M | FDP | 2011 | elected | 22 | 23 |
M | FDP | 2011 | running | 316 | 336 |
M | FDP | 2015 | elected | 25 | 26 |
M | FDP | 2015 | running | 310 | 321 |
# clean up
rm(totals_per_canton)
It is currently not empty because in the years 2011 and 2015 the LPS is listed separately. The BFS makes the mistake to nevertheless add the LPS candidates to the total of the FDP. This means that our values are correct and the totals of the BFS are not. We throw away those totals and continue with our analysis. (Later we might group the LPS and the FDP.)
# compare
knitr::kable(
candidates_per_gender %>%
group_by(gender, canton, year, category) %>%
summarise(n_calc = sum(n, na.rm = TRUE)) %>%
left_join(
totals_per_party,
by = c("gender", "canton", "year", "category")
) %>%
# show only entries where the calculated total and the other total differ
filter(n != n_calc)
)
gender canton year category n_calc n ——- ——- —– ——— ——- —
# clean up
rm(totals_per_party)
The second table is empty, everything was read in correctly.
Let’s now add the LPS to the FDP.
candidates_per_gender %<>%
mutate(
party = as.character(party),
party = replace(party, party == "LPS", "FDP"),
party = factor(party)
) %>%
group_by(canton, party, gender, category, year) %>%
summarise_at(
vars(n),
# we do not want everything to be zero
function(n) {
# if all entries are NA we want to leave it at that
if (length(n) == length(n[is.na(n)])) {
return(NA)
}
return(sum(n, na.rm = TRUE))
}
)
Read latest smartvote data to calculate numbers for 2019.
candidates_2019 <- read_csv("input/candidates_2019.csv")
## Parsed with column specification:
## cols(
## ID_Candidate = col_double(),
## firstname = col_character(),
## lastname = col_character(),
## gender = col_character(),
## city = col_character(),
## district = col_character(),
## group = col_character()
## )
candidates_per_gender_2019 <- candidates_2019 %>%
# reduce to important columns and unify with other data frame
select(gender, party = group, canton = district) %>%
# make cantons identical as well
mutate(
canton = case_when(
canton == "Appenzell Ausserrhoden" ~ "Appenzell A. Rh.",
canton == "Appenzell Innerrhoden" ~ "Appenzell I. Rh.",
TRUE ~ canton
),
# unify party names
party = case_when(
party == "BDP/PBD" ~ "BDP",
party == "CVP/PDC" ~ "CVP",
party == "FDP/PLR" ~ "FDP",
party == "GLP/PVL" ~ "GLP",
party == "GPS/PES" ~ "GPS",
party == "SP/PS" ~ "SP",
party == "SVP/UDC" ~ "SVP",
TRUE ~ "Übrige"
)
) %>%
group_by(canton, party, gender) %>%
summarise(n = n()) %>%
ungroup() %>%
# add missing columns to match data before 2019
mutate(
gender = str_to_upper(gender),
category = "running",
year = 2019
)
candidates_per_gender_2019 %>%
distinct(canton) %>%
filter(!canton %in% levels(candidates_per_gender$canton))
## # A tibble: 0 x 1
## # … with 1 variable: canton <chr>
candidates_per_gender_2019 %<>%
filter(party %in% levels(candidates_per_gender$party)) %>%
# make factors joinable by using same levels
mutate(
canton = factor(canton, levels = levels(candidates_per_gender$canton)),
party = factor(party, levels = levels(candidates_per_gender$party)),
gender = factor(gender, levels = levels(candidates_per_gender$gender)),
category = factor(category, levels = levels(candidates_per_gender$category))
)
# join
candidates_per_gender %<>%
bind_rows(candidates_per_gender_2019)
write_csv(
candidates_per_gender_per_year,
"output/candidates_per_gender_per_year.csv"
)
candidates_per_party_export <- candidates_per_gender_per_party_and_year %>%
filter(party %in% party_order) %>%
mutate(n = men + women) %>%
select(year, category, party, n, share)
write_csv(
candidates_per_party_export,
"output/candidates_per_gender_per_party_and_year.csv"
)
# clean up
rm(candidates_per_party_export)
candidates_detailed_export <- candidates_per_gender %>%
filter(party %in% party_order & !is.na(share_f) & category == "running") %>%
mutate(n = men + women) %>%
# replace canton names with abbrs
left_join(
read_csv("input/canton_names.csv") %>%
select(canton_abbr = abbr, name),
by = c("canton" = "name")
) %>%
ungroup() %>%
select(-canton, canton = canton_abbr) %>%
select(party, canton, year, n, share = share_f)
## Parsed with column specification:
## cols(
## id = col_double(),
## abbr = col_character(),
## name = col_character(),
## regi = col_character()
## )
## Warning: Column `canton`/`name` joining factor and character vector,
## coercing into character vector
write_csv(
candidates_detailed_export,
"output/candidates_detailed.csv"
)
# clean up
rm(candidates_detailed_export)
Der Code in diesem RMarkdown wird mit lintr automatisch auf den Wickham’schen tidyverse style guide überprüft.
lintr::lint(
"main.Rmd",
linters = lintr::with_defaults(commented_code_linter = NULL)
)