Elections 19: Project 6

Preliminary Remarks

This document describes the pre-processing and exploratory analysis of the data set that is the basis of the article Die grössten Hürden für Frauen in der Schweizer Politik published on srf.ch.

SRF Data attaches importance to the fact that the data pre-processing and analysis can be reproduced and checked. SRF Data believes in the principle of open data, but also open and comprehensible methods. On the other hand, it should be possible for third parties to build on this preparatory work and thus generate further evaluations or applications.

R-Script & Data

The preprocessing and analysis of the data was conducted in the R project for statistical computing. The RMarkdown script used to generate this document and all the resulting data can be downloaded under this link. Through executing main.Rmd, the herein described process can be reproduced and this document can be generated. In the course of this, data from the folder input will be processed and results will be written to output.

SRF Data uses Timo Grossenbacher’s rddj-template as the basis for its R scripts. If you have problems executing this script, it may help to study the instructions from the rddj-template.

This report was generated on 2019-09-12 17:07:41. R version: 3.5.3 on x86_64-apple-darwin15.6.0. For this report, CRAN packages as of 2019-03-01 were used.

GitHub

The code for the herein described process can also be freely downloaded from https://github.com/srfdata/2019-09-women-in-politics.

License

2019-09-women-in-politics by SRF Data is licensed under a Creative Commons Namensnennung - Attribution ShareAlike 4.0 International License.

Other projects

Code and data by SRF Data are available on https://srfdata.github.io.

Disclaimer

The published information has been carefully compiled, but does not claim to be up-to-date, complete or correct. No liability is assumed for damages arising from the use of this script or the information drawn from it. This also applies to contents of third parties which are accessible via this offer.

Data description of output files

`candidates_per_gender_per_year.csv`

Attribute	Type	Description
year	Number	Year of measurement
category	Enum	Either “running” or “elected”
share	Number	Share of women as decimal

`candidates_per_gender_per_party_and_year.csv`

Attribute	Type	Description
year	Number	Year of measurement
category	Enum	Either “running” or “elected”
party	Number	Party in question
n	Number	Number of candidates in that canton, party, year and category
share	Number	Share of women as decimal

`candidates_detailed.csv`

Attribute	Type	Description
party	Number	Party in question
canton	Number	Canton of measurement
year	Number	Year of measurement
n	Number	Number of candidates in that canton, party and year
share	Number	Share of women as decimal

Original source

Data for 2019

-> input/candidates_2019.csv

Information about the candidates running this year was acquired by smartvote. As we cannot distribute any proprietary material from them, we have reduced the columns to those necessary for this analysis: ID, name, gender, district (canton) and party. The export was done via their API on the 6th of September 2019 with 4596 unique candidates.

Partiese were simplified as follows:

SP = SP + JUSO + JSP
GPS = GPS + JG + JGBNW + ALG + BastA
GLP = GLP + JGLP
BDP = BDP + JBDP
CVP = CVP + JCVP + CSV
FDP = FDP + JFS + LDP + JLDP
SVP = SVP + JSVP

Gender Ratios by Party and Canton and Year

-> input/su-d-17.02.02.05.02.06.xlsx

Candidates running provided by the Federal Statistical Office.

Footnotes were removed: ¹ Stille Wahl ² Inkl. Vigilance/Genf

-> input/je-d-17.02.02.02.01.02.xlsx

Candidates elected provided by the Federal Statistical Office.

Footnotes were removed: 1) 2009: Fusion von FDP und LPS auf nationaler Ebene unter der Bezeichnung “FDP.Die Liberalen”. Fusion von FDP und LP im Kanton Waadt im Jahr 2012. Im Kanton Basel-Stadt haben FDP und LP nicht fusioniert. Da die LP-BS Mitglied der „FDP.Die Liberalen Schweiz“ ist, wird das Mandat der LP-BS auf gesamtschweizerischer Ebene der FDP zugeteilt. 2) CSP-OW: Der Gewählte bleibt in der CVP-Fraktion.

as well as

Stille Wahl
BE: Die Vertreterin des Grünen Bündnis (FGA) wurde auf einer gemeinsamen Liste GPS/FGA gewählt.

Vorbereitungen

## [1] "package package:rmarkdown detached"

## Loading required package: knitr

## Loading required package: rstudioapi

Packages definieren

# from https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# if you don't need a package, remove it from here (commenting is probably not sufficient)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(rstudioapi)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(glue) # use glue directly without namespacing
library(magrittr) # pipes
library(readxl) # excel
library(scales) # scales for ggplot2
library(lintr) # code linting
library(rmarkdown)",
file = "manifest.R")

Packages installieren

# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
  if (!require(devtools)) {
    install.packages("devtools", repos = "http://cran.us.r-project.org")
    require(devtools)
  }
  devtools::install_github("RevolutionAnalytics/checkpoint",
                           ref = "v0.3.2", # could be adapted later,
                           # as of now (beginning of July 2017
                           # this is the current release on CRAN)
                           repos = "http://cran.us.r-project.org")
  require(checkpoint)
}
# nolint start
if (!dir.exists("~/.checkpoint")) {
  dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
           project = path_to_wd,
           verbose = T,
           scanForPackages = T,
           use.knitr = F,
           R.version = R_version)
rm(package_date)

Packages laden

source("manifest.R")
unlink("manifest.R")
sessionInfo()

## R version 3.5.3 Patched (2019-03-11 r76221)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_CH.UTF-8/de_CH.UTF-8/de_CH.UTF-8/C/de_CH.UTF-8/de_CH.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rmarkdown_1.11   lintr_1.0.3      scales_1.0.0     readxl_1.3.0    
##  [5] magrittr_1.5     glue_1.3.0       forcats_0.4.0    stringr_1.4.0   
##  [9] dplyr_0.8.0.1    purrr_0.3.0      readr_1.3.1      tidyr_0.8.2     
## [13] tibble_2.0.1     ggplot2_3.1.0    tidyverse_1.2.1  checkpoint_0.4.0
## [17] rstudioapi_0.9.0 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       cellranger_1.1.0 plyr_1.8.4       pillar_1.3.1    
##  [5] compiler_3.5.3   tools_3.5.3      digest_0.6.18    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.13    nlme_3.1-137     gtable_0.2.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.3.1      rex_1.1.2       
## [17] cli_1.0.1        yaml_2.2.0       haven_2.1.0      xfun_0.5        
## [21] withr_2.1.2      xml2_1.2.0       httr_1.4.0       hms_0.4.2       
## [25] generics_0.0.2   grid_3.5.3       tidyselect_0.2.5 R6_2.4.0        
## [29] modelr_0.1.4     backports_1.1.3  htmltools_0.3.6  rvest_0.3.2     
## [33] assertthat_0.2.0 colorspace_1.4-0 stringi_1.3.1    lazyeval_0.2.1  
## [37] munsell_0.5.0    broom_0.5.1      crayon_1.3.4

Read data

# map over election years and read data
# do each year twice, once for the candidates
# and once for the elected parliamentarians
suppressMessages(
  candidates_per_gender <- bind_rows(
    as.tibble(seq(1971, 2015, 4)) %>%
      mutate(
        category = "elected",
        path = "input/je-d-17.02.02.02.01.02.xlsx",
        skip = 3,
        n_max = 33
      ),
    as.tibble(seq(1971, 2015, 4)) %>%
      mutate(
        category = "running",
        path = "input/su-d-17.02.02.05.02.06.xlsx",
        skip = 4,
        n_max = 27
      )
  ) %>%
    # name first column
    rename(year = 1) %>%
    pmap_dfr(function(...) {
      current <- tibble(...)
      # the worksheets have the election year as its name
      data <- read_excel(
        path = current$path,
        sheet = as.character(current$year),
        skip = current$skip,
        n_max = current$n_max
      )

      # as each party has two columns, but only the first is correctly named,
      # we'll need to fill up the second one with the value of the first
      filled_colnames <- colnames(data) %>%
        as.tibble() %>%
        mutate(value = if_else(
          str_detect(value, "^\\.\\."),
          glue("{lag(value)}_M"), # add _M to second
          glue("{value}_F") # add _F to first
        )) %>%
        pull(value)

      # assign filled colnames
      colnames(data) <- filled_colnames

      # filter out empty rows
      data %>%
        rename(canton = 1) %>%
        filter(!is.na(canton)) %>%
        # remove columns without party data
        select(-starts_with("..")) %>%
        # make tidy
        gather(party, n, -canton) %>%
        # split party and gender into two
        separate(party, into = c("party", "gender"), sep = "_") %>%
        # convert to numeric (stars will be converted to NA)
        mutate(n = as.numeric(n)) %>%
        mutate(
          # add flag for running / elected
          category = current$category,
          # add year
          year = current$year
        )
    }) %>%
    # remove footnotes
    mutate(
      canton = str_replace(canton, " [¹²]$", ""),
      canton = str_replace(canton, " \\d\\)$", ""),
      party = str_replace(party, " [¹²]$", ""),
      party = str_replace(party, " \\d\\)$", "")
    ) %>%
    # fix typo (space missing between A. and Rh.)
    mutate(canton = str_replace(canton, "A.Rh.", "A. Rh.")) %>%
    # convert to factors
    mutate_at(vars(canton, party, gender, category), factor)
)

## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

## Warning: NAs durch Umwandlung erzeugt

# integrity check: remove total from data and check whether values match
# save into separate variable
totals_per_canton <- candidates_per_gender %>%
  filter(canton == "Total") %>%
  select(-canton)

totals_per_party <- candidates_per_gender %>%
  filter(party == "Total") %>%
  select(-party)

# remove from original data
candidates_per_gender %<>%
  filter(canton != "Total" & party != "Total")

Integrity check: The following table should be empty as it compares the totals calculated by summing all values and the totals that the BFS provides in its tables.

# compare
knitr::kable(
  candidates_per_gender %>%
    group_by(gender, party, year, category) %>%
    summarise(n_calc = sum(n, na.rm = TRUE)) %>%
    left_join(
      totals_per_canton,
      by = c("gender", "party", "year", "category")
    ) %>%
    # show only entries where the calculated total and the other total differ
    filter(n != n_calc)
)

gender	party	year	category	n_calc	n
F	FDP	2011	running	101	109
F	FDP	2015	running	138	142
M	FDP	2011	elected	22	23
M	FDP	2011	running	316	336
M	FDP	2015	elected	25	26
M	FDP	2015	running	310	321

# clean up
rm(totals_per_canton)

It is currently not empty because in the years 2011 and 2015 the LPS is listed separately. The BFS makes the mistake to nevertheless add the LPS candidates to the total of the FDP. This means that our values are correct and the totals of the BFS are not. We throw away those totals and continue with our analysis. (Later we might group the LPS and the FDP.)

# compare
knitr::kable(
  candidates_per_gender %>%
    group_by(gender, canton, year, category) %>%
    summarise(n_calc = sum(n, na.rm = TRUE)) %>%
    left_join(
      totals_per_party,
      by = c("gender", "canton", "year", "category")
    ) %>%
    # show only entries where the calculated total and the other total differ
    filter(n != n_calc)
)

gender canton year category n_calc n ——- ——- —– ——— ——- —

# clean up
rm(totals_per_party)

The second table is empty, everything was read in correctly.

Let’s now add the LPS to the FDP.

candidates_per_gender %<>%
  mutate(
    party = as.character(party),
    party = replace(party, party == "LPS", "FDP"),
    party = factor(party)
  ) %>%
  group_by(canton, party, gender, category, year) %>%
  summarise_at(
    vars(n),
    # we do not want everything to be zero
    function(n) {
      # if all entries are NA we want to leave it at that
      if (length(n) == length(n[is.na(n)])) {
        return(NA)
      }
      return(sum(n, na.rm = TRUE))
    }
  )

2019

Read latest smartvote data to calculate numbers for 2019.

candidates_2019 <- read_csv("input/candidates_2019.csv")

## Parsed with column specification:
## cols(
##   ID_Candidate = col_double(),
##   firstname = col_character(),
##   lastname = col_character(),
##   gender = col_character(),
##   city = col_character(),
##   district = col_character(),
##   group = col_character()
## )

Add 2019 to data

candidates_per_gender_2019 <- candidates_2019 %>%
  # reduce to important columns and unify with other data frame
  select(gender, party = group, canton = district) %>%
  # make cantons identical as well
  mutate(
    canton = case_when(
      canton == "Appenzell Ausserrhoden" ~ "Appenzell A. Rh.",
      canton == "Appenzell Innerrhoden" ~ "Appenzell I. Rh.",
      TRUE ~ canton
    ),
    # unify party names
    party = case_when(
      party == "BDP/PBD" ~ "BDP",
      party == "CVP/PDC" ~ "CVP",
      party == "FDP/PLR" ~ "FDP",
      party == "GLP/PVL" ~ "GLP",
      party == "GPS/PES" ~ "GPS",
      party == "SP/PS" ~ "SP",
      party == "SVP/UDC" ~ "SVP",
      TRUE ~ "Übrige"
    )
  ) %>%
  group_by(canton, party, gender) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  # add missing columns to match data before 2019
  mutate(
    gender = str_to_upper(gender),
    category = "running",
    year = 2019
  )

candidates_per_gender_2019 %>%
  distinct(canton) %>%
  filter(!canton %in% levels(candidates_per_gender$canton))

## # A tibble: 0 x 1
## # … with 1 variable: canton <chr>

candidates_per_gender_2019 %<>%
  filter(party %in% levels(candidates_per_gender$party)) %>%
  # make factors joinable by using same levels
  mutate(
    canton = factor(canton, levels = levels(candidates_per_gender$canton)),
    party = factor(party, levels = levels(candidates_per_gender$party)),
    gender = factor(gender, levels = levels(candidates_per_gender$gender)),
    category = factor(category, levels = levels(candidates_per_gender$category))
  )

# join
candidates_per_gender %<>%
  bind_rows(candidates_per_gender_2019)

Female Shares: Candidates running / elected

# make less tidy by putting number of men and women into columns
candidates_per_gender %<>%
  spread(gender, n) %>%
  rename(women = `F`, men = `M`) %>%
  # replace NAs with zero where one of the values is present (men or women)
  mutate(
    women = ifelse(is.na(women) & !is.na(men), 0, women),
    men = ifelse(is.na(men) & !is.na(women), 0, men)
  ) %>%
  # calculate shares
  mutate(
    share_f = women / (women + men),
    share_m = men / (women + men)
  )

# calculate values over all parties
candidates_per_gender_per_year <- candidates_per_gender %>%
  group_by(year, category) %>%
  summarise(
    share = sum(women, na.rm = TRUE) /
      (sum(women, na.rm = TRUE) + sum(men, na.rm = TRUE))
  ) %>%
  ungroup() %>%
  # remove elected 2019
  filter(year != max(year) | category != "elected")

ggplot(
  candidates_per_gender_per_year,
  aes(
    x = year,
    y = share,
    group = category,
    color = category
  )
) +
  geom_step(direction = "vh") +
  scale_color_manual(values = c("#000000", "#999999")) +
  scale_y_continuous(labels = percent) +
  theme_minimal() +
  labs(
    title = "Over all parties: Share of female candidates running vs. elected",
    x = NULL,
    y = "Share of women in candidates",
    color = NULL
  )

By party

party_colors <- c(
  "SP" = "#F03E35",
  "GPS" = "#7BB535",
  "GLP" = "#C4C429",
  "BDP" = "#E6C409",
  "CVP" = "#D67C15",
  "FDP" = "#2669B5",
  "SVP" = "#3F8A30"
)

# make copy with different names (add running)
party_colors_alt <- c(party_colors)
names(party_colors_alt) <- c(
  "SP_running",
  "GPS_running",
  "GLP_running",
  "BDP_running",
  "CVP_running",
  "FDP_running",
  "SVP_running"
)
# plus add black as a value for all parties for the elected candidates
party_colors_alt <- c(party_colors_alt, c(
  "SP_elected" = "#000000",
  "GPS_elected" = "#000000",
  "GLP_elected" = "#000000",
  "BDP_elected" = "#000000",
  "CVP_elected" = "#000000",
  "FDP_elected" = "#000000",
  "SVP_elected" = "#000000"
))

# save order of parties for factoring / sorting facets
party_order <- str_replace(names(party_colors_alt), "_.*$", "")[1:7]

# calculate values over all parties
candidates_per_gender_per_party_and_year <- candidates_per_gender %>%
  group_by(year, category, party) %>%
  summarise(
    women = sum(women, na.rm = TRUE),
    men = sum(men, na.rm = TRUE),
    share = women / (women + men)
  ) %>%
  ungroup() %>%
  # filter out elected 2019
  filter(year != max(year) | category != "elected")

# plot big 7
ggplot(
  candidates_per_gender_per_party_and_year %>%
    filter(party %in% party_order),
  aes(
    x = year,
    y = share,
    group = category,
    color = glue("{party}_{category}")
  )
) +
  geom_step(direction = "vh") +
  facet_wrap(~ factor(party, levels = party_order), ncol = 2) +
  scale_y_continuous(labels = percent) +
  scale_color_manual(values = party_colors_alt) +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(
    title = "Share of female candidates running vs. elected",
    x = NULL,
    y = "Share of women in candidates"
  )

## Warning: Removed 2 rows containing missing values (geom_path).

Export

Candidates per gender/year

write_csv(
  candidates_per_gender_per_year,
  "output/candidates_per_gender_per_year.csv"
)

Candidates per party/gender/year

candidates_per_party_export <- candidates_per_gender_per_party_and_year %>%
  filter(party %in% party_order) %>%
  mutate(n = men + women) %>%
  select(year, category, party, n, share)

write_csv(
  candidates_per_party_export,
  "output/candidates_per_gender_per_party_and_year.csv"
)

# clean up
rm(candidates_per_party_export)

Candidates full details

candidates_detailed_export <- candidates_per_gender %>%
  filter(party %in% party_order & !is.na(share_f) & category == "running") %>%
  mutate(n = men + women) %>%
  # replace canton names with abbrs
  left_join(
    read_csv("input/canton_names.csv") %>%
      select(canton_abbr = abbr, name),
    by = c("canton" = "name")
  ) %>%
  ungroup() %>%
  select(-canton, canton = canton_abbr) %>%
  select(party, canton, year, n, share = share_f)

## Parsed with column specification:
## cols(
##   id = col_double(),
##   abbr = col_character(),
##   name = col_character(),
##   regi = col_character()
## )

## Warning: Column `canton`/`name` joining factor and character vector,
## coercing into character vector

write_csv(
  candidates_detailed_export,
  "output/candidates_detailed.csv"
)

# clean up
rm(candidates_detailed_export)

Linting

Der Code in diesem RMarkdown wird mit lintr automatisch auf den Wickham’schen tidyverse style guide überprüft.

lintr::lint(
  "main.Rmd",
  linters = lintr::with_defaults(commented_code_linter = NULL)
)