2019-08-representation

Vorbemerkungen

Dieses Dokument beschreibt die Vorprozessierung und explorative Analyse des Datensatzes, der Grundlage des auf srf.ch veröffentlichten Artikels Wie das Parlament die Wähler abbildet – und wie nicht ist.

SRF Data legt Wert darauf, dass die Datenvorprozessierung und -Analyse nachvollzogen und überprüft werden kann. SRF Data glaubt an das Prinzip offener Daten, aber auch offener und nachvollziehbarer Methoden. Zum anderen soll es Dritten ermöglicht werden, auf dieser Vorarbeit aufzubauen und damit weitere Auswertungen oder Applikationen zu generieren.

Das Endprodukt des vorliegenden Scripts, neben der explorativen Analyse, sind (Datenbeschreibung siehe unten):

councillors.json: Datensatz aller Parlamentarier mit Geschlecht, Altersgruppe, Stadt-Land-Kategorisierung, Zivilstand, Bildung und Religionszugehörigkeit als JSON
representation.json: Datensatz mit Ist- und Soll-Werten für die untersuchten Kategorien (Geschlecht, Altersgruppe etc.) als JSON

R-Script & Daten

Die Vorprozessierung und Analyse wurde im Statistikprogramm R vorgenommen. Das zugrunde liegende Script sowie die prozessierten Daten können unter diesem Link heruntergeladen werden. Durch Ausführen von main.Rmd kann der hier beschriebene Prozess nachvollzogen und der für den Artikel verwendete Datensatz generiert werden. Dabei werden Daten aus dem Ordner input eingelesen und Ergebnisse in den Ordner output geschrieben.

SRF Data verwendet das rddj-template von Timo Grossenbacher als Grundlage für seine R-Scripts. Entstehen bei der Ausführung dieses Scripts Probleme, kann es helfen, die Anleitung von rddj-template zu studieren.

Debug-Informationen: This report was generated on 2019-08-27 08:44:02. R version: 3.5.3 on x86_64-apple-darwin15.6.0. For this report, CRAN packages as of 2018-09-01 were used.

GitHub

Der Code für die vorliegende Datenprozessierung ist auf https://github.com/srfdata/2019-08-representation zur freien Verwendung verfügbar.

Lizenz

2019-08-representation von SRF Data ist lizenziert unter einer Creative Commons Namensnennung - Weitergabe unter gleichen Bedingungen 4.0 International Lizenz.

Weitere Projekte

Code & Daten von SRF Data sind unter http://srfdata.github.io verfügbar.

Haftungsausschluss

Die veröffentlichten Informationen sind sorgfältig zusammengestellt, erheben aber keinen Anspruch auf Aktualität, Vollständigkeit oder Richtigkeit. Es wird keine Haftung übernommen für Schäden, die durch die Verwendung dieses Scripts oder der daraus gezogenen Informationen entstehen. Dies gilt ebenfalls für Inhalte Dritter, die über dieses Angebot zugänglich sind.

Datenbeschreibung

`councillors_supplemented.csv`

Attribut	Typ	Beschreibung
id	Numeric	Offizielle Id der Parlamentarischen Dienste
fistName	String	Vorname der Parlamentarier
lastName	String	Nachname der Parlamentarier
ageGroup	String	Altersgruppe der Parlamentarier (sieben Kategorien)
gender	String	Geschlecht der Parlamentarier (m, f)
urbanRural	String	Stadt-Land-Einteilung der Parlamentarier gemäss Raumgliederung des BfS, vier Kategorien (big_city, urban, periurban, rural)
maritalStatus	String	Zivilstand der Parlamentarier
education	String	Höchste Ausbildung der Parlamentarier
religion	String	Religionszugehörigkeit der Parlamentarier
party	String	Partei der Parlamentarier

`representation.csv`

Attribut	Typ	Beschreibung
category	String	Kategoriebezeichnung in Englisch für die jeweilige demografische Angabe
group	String	Unterschiedliche Einteilungen der jeweiligen Kategorie (Geschlecht, Altersgruppen etc.)
is	Numeric	Ist-Wert für eine Gruppe: So viele Parlamentarier unter 30 Jahren sitzen im Parlament
should	Numeric	Soll-Wert für eine Gruppe: So viel Parlamentarier unter 30 Jahren müssten im Vergleich zur ständigen Wohnbevölkerung über 15 Jahren im Parlament sitzen

Originalquelle

Schweizer Parlamentarier (National- und Ständerat), Stand 28. Mai 2019

→ input/councillorsCouncillor-export_de_ergaenzt_mit_religion.csv

Die Daten über die Parlamentarier stammen von den Parlamentarischen Diensten und wurden von Smartvote durch die eigene Datenbank und weitere Recherchen ergänzt. Im Datensatz befinden sich neben den verwendeten Kategorien auch noch weitere Angaben (Beruf, Sprache etc.) enthalten.

Ergänzung zu den Daten der Parlamentarier

→ input/Missing Data Parlamentarians - Sheet1.csv

Ergänzung des obigen Datensatzes durch eigene Recherche. 99 Parlamentarier wurden per Mail (Code für den automatisierten Mailversand weiter unten) angefragt, fehlende Daten zu ergänzen. Davon haben 64 Parlamentarier geantwortet.

Ständige Wohnbevölkerung nach Alter

→ input/je-d-01.02.03.02.xlsx

Die Daten zur ständigen Wohnbevölkerung nach Alter haben wir von der Webseite des Bundesamts für Statistik (BfS) heruntergeladen.

Ständige Wohnbevölkerung nach Religionszugehörigkeit

→ input/je-d-01.08.02.01.xlsx

Die Daten zur ständigen Wohnbevölkerung ab 15 Jahren nach Religionszugehörigkeit haben wir von der Webseite des BfS heruntergeladen.

Ständige Wohnbevölkerung nach Zivilstand

→ input/su-d-01.02.03.03.xlsx

Die Daten zur ständigen Wohnbevölkerung nach Zivilstand haben wir von der Webseite des BfS heruntergeladen.

Ständige Wohnbevölkerung nach Bildung

→ input/su-d-40.02.15.08.01-2017.xlsx

Die Daten zur ständigen Wohnbevölkerung ab 15 Jahren nach Bildung haben wir von der Webseite des BfS heruntergeladen.

Ständige Wohnbevölkerung nach Alter und Gemeinde (Stand 31.12.2017)

→ input/su-d-01.02.03.06.xlsx

Die Daten zur ständigen Wohnbevölkerung nach Alter und Gemeinde haben wir von der Webseite des BfS heruntergeladen.

Raumgliederung

→ input/Raumgliederungen.xlsx

Die Einteilung der Wohnorte der Parlamentarier in urban, periurban und rural haben wir der Raumgliederung des BfS entnommen (Gemeindetypologie 2012, 25 Typen, 9 Kategorien und Stadt/Land-Typologie). Die Tabelle haben wir aus der Applikation der Schweizer Gemeinden des BfS exportiert. Wir haben den offiziellen Gemeindestand vom 02.04.2017 verwendet mit 2240 Einträgen.

Vorbereitungen

## [1] "package package:rmarkdown detached"

## Loading required package: knitr

## Loading required package: rstudioapi

Packages definieren

# from https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# if you don't need a package, remove it from here (commenting is probably not sufficient)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(rstudioapi)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(lubridate) # date calculation
library(glue) # cooler string templating
library(magrittr) # pipes
library(readxl) # excel
library(scales) # scales for ggplot2
library(jsonlite) # json
library(lintr) # code linting
library(sf) # spatial data handling
library(rmarkdown)
# library(mailR)",
file = "manifest.R")

Packages installieren

# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
  if (!require(devtools)) {
    install.packages("devtools", repos = "http://cran.us.r-project.org")
    require(devtools)
  }
  devtools::install_github("RevolutionAnalytics/checkpoint",
                           ref = "v0.3.2", # could be adapted later,
                           # as of now (beginning of July 2017
                           # this is the current release on CRAN)
                           repos = "http://cran.us.r-project.org")
  require(checkpoint)
}

## 
##   There is a binary version available but the source version is
##   later:
##          binary source needs_compilation
## pkgbuild  1.0.4  1.0.5             FALSE
## 
## 
## The downloaded binary packages are in
##  /var/folders/z1/kysjxzjd13ngsnb450pxrpdjw6bv67/T//RtmpORUN2J/downloaded_packages
##   
   checking for file ‘/private/var/folders/z1/kysjxzjd13ngsnb450pxrpdjw6bv67/T/RtmpORUN2J/remotesf16c3120f0d4/RevolutionAnalytics-checkpoint-024b91d/DESCRIPTION’ ...
  
✔  checking for file ‘/private/var/folders/z1/kysjxzjd13ngsnb450pxrpdjw6bv67/T/RtmpORUN2J/remotesf16c3120f0d4/RevolutionAnalytics-checkpoint-024b91d/DESCRIPTION’ (344ms)
## 
  
─  preparing ‘checkpoint’:
##    checking DESCRIPTION meta-information ...
  
✔  checking DESCRIPTION meta-information
## 
  
─  checking for LF line-endings in source and make files and shell scripts
## 
  
─  checking for empty or unneeded directories
## 
  
─  building ‘checkpoint_0.4.0.tar.gz’
## 
  
   Warnung: invalid uid value replaced by that for user 'nobody'
## 
  
   Warnung: invalid gid value replaced by that for user 'nobody'
## 
  
   
##

# nolint start
if (!dir.exists("~/.checkpoint")) {
  dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
           project = path_to_wd,
           verbose = T,
           scanForPackages = T,
           use.knitr = F,
           R.version = R_version)
rm(package_date)

Packages laden

source("manifest.R")
unlink("manifest.R")
sessionInfo()

## R version 3.5.3 Patched (2019-03-11 r76221)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] de_CH.UTF-8/de_CH.UTF-8/de_CH.UTF-8/C/de_CH.UTF-8/de_CH.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] rmarkdown_1.10   sf_0.6-3         lintr_1.0.2      jsonlite_1.5    
##  [5] scales_1.0.0     readxl_1.1.0     magrittr_1.5     glue_1.3.0      
##  [9] lubridate_1.7.4  forcats_0.3.0    stringr_1.3.1    dplyr_0.7.6     
## [13] purrr_0.2.5      readr_1.1.1      tidyr_0.8.1      tibble_1.4.2    
## [17] ggplot2_3.0.0    tidyverse_1.2.1  checkpoint_0.4.0 devtools_2.1.0  
## [21] usethis_1.5.1    rstudioapi_0.7   knitr_1.20      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18      lattice_0.20-38   class_7.3-15     
##  [4] prettyunits_1.0.2 ps_1.3.0          assertthat_0.2.0 
##  [7] rprojroot_1.3-2   digest_0.6.16     R6_2.2.2         
## [10] cellranger_1.1.0  plyr_1.8.4        backports_1.1.2  
## [13] e1071_1.7-0       evaluate_0.11     httr_1.3.1       
## [16] pillar_1.3.0      rlang_0.4.0       lazyeval_0.2.1   
## [19] curl_3.2          callr_3.3.1       desc_1.2.0       
## [22] munsell_0.5.0     broom_0.5.0       compiler_3.5.3   
## [25] modelr_0.1.2      pkgconfig_2.0.2   pkgbuild_1.0.5   
## [28] htmltools_0.3.6   tidyselect_0.2.4  crayon_1.3.4     
## [31] withr_2.1.2       grid_3.5.3        spData_0.2.9.3   
## [34] DBI_1.0.0         nlme_3.1-137      gtable_0.2.0     
## [37] units_0.6-0       cli_1.1.0         stringi_1.2.4    
## [40] fs_1.3.1          remotes_2.1.0     rex_1.1.2        
## [43] bindrcpp_0.2.2    testthat_2.2.1    xml2_1.2.0       
## [46] tools_3.5.3       hms_0.4.2         processx_3.4.1   
## [49] pkgload_1.0.2     yaml_2.2.0        colorspace_1.3-2 
## [52] sessioninfo_1.1.1 classInt_0.2-3    rvest_0.3.2      
## [55] memoise_1.1.0     bindr_0.1.1       haven_1.1.2

Datenvorbereitung

# list of municipality names to replace councillors cities in order to match BfS data
municpality_names <- c(
  "^Coira$" = "Chur",
  "^Birmenstorf$" = "Birmenstorf (AG)",
  "^Emmenbrücke$" = "Emmen",
  "^Geneve$" = "Genève",
  "^Azmoos$" = "Wartau",
  "^Avry-sur-Matran$" = "Avry",
  "^Rüegsauschachen$" = "Rüegsau",
  "^Carouge$" = "Carouge (GE)",
  "^Wil$" = "Wil (SG)",
  "^Gstaad$" = "Saanen",
  "^Nàfels$" = "Glarus Nord",
  "^Oberriet$" = "Oberriet (SG)",
  "^Kaltacker$" = "Heimiswil",
  "^Illnau$" = "Illnau-Effretikon",
  "^Edlibach$" = "Menzingen",
  "^Bürglen$" = "Lungern",
  "^Hergiswil$" = "Hergiswil (NW)",
  "^St-George$" = "Saint-George",
  "^Cheyres$" = "Cheyres-Châbles",
  "^Aesch$" = "Aesch (ZH)",
  "^Küssnacht$" = "Küssnacht (SZ)",
  "^Grand-Lancy$" = "Lancy",
  "^Rüi b\\. Büren$" = "Rüti bei Büren",
  "^Ruvigliana$" = "Lugano",
  "^Küsnacht$" = "Küsnacht (ZH)",
  "^Mälchi$" = "Fraubrunnen",
  "^Watt-Regensdorf$" = "Regensdorf",
  "^Gossau$" = "Gossau (ZH)",
  "^Rüthi \\(Rheintal\\)$" = "Rüthi (SG)",
  "^Auvernier$" = "Milvignes",
  "^Rapperswil$" = "Rapperswil-Jona",
  "^Morgins$" = "Troistorrents",
  "^Glion$" = "Montreux",
  "^Cinuos-chel$" = "S-chanf",
  "^Wohlen$" = "Wohlen (AG)",
  "^Affoltern am Ablis$" = "Affoltern am Albis",
  "^Wallenwil$" = "Eschlikon",
  "^Bironico$" = "Monteceneri",
  "^Oberflachs$" = "Schinznach",
  "^Nussbaumen$" = "Obersiggenthal",
  "^Opfertshofen$" = "Thayngen",
  "^Biel$" = "Biel/Bienne",
  "^Schwanden$" = "Glarus Süd",
  "^Haslen$" = "Glarus Süd",
  "^Le Sentier$" = "Le Chenit"
)

# read councillor data (246 councillors)
councillors_initial_data <- read_delim(
  "input/councillorsCouncillor-export_de_ergaenzt_mit_religion.csv",
  delim = ";"
  ) %>%
  # remove all c_ from column names to make further processing easier
  rename_at(
    vars(
      starts_with("c_")),
      ~ str_replace(., "^c_", "")
    ) %>%
  # convert birth date to date format
  mutate(birth_date = as.Date(birth_date, format = "%d.%m.%Y")) %>%
  # replace domicile names in order to match BfS names
  mutate(domicile_city_final = str_replace_all(
    domicile_city_final,
    municpality_names
  ))

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   c_id = col_integer(),
##   c_number = col_integer(),
##   c_domicile_zip = col_integer(),
##   c_postal_zip = col_integer(),
##   c_domicile_zip_final = col_integer()
## )

## See spec(...) for full column specifications.

## Warning: `lang()` is deprecated as of rlang 0.2.0.
## Please use `call2()` instead.
## This warning is displayed once per session.

## Warning: `new_overscope()` is deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead.
## This warning is displayed once per session.

## Warning: `is_lang()` is deprecated as of rlang 0.2.0.
## Please use `is_call()` instead.
## This warning is displayed once per session.

## Warning: The `printer` argument is deprecated as of rlang 0.3.0.
## This warning is displayed once per session.

# read data that was completed by email request
missing_data_completion <- read_csv(
  "input/Missing Data Parlamentarians - Sheet1.csv"
)

## Parsed with column specification:
## cols(
##   id = col_integer(),
##   last_name = col_character(),
##   birth_date = col_date(format = ""),
##   gender = col_character(),
##   religion = col_character(),
##   education = col_character(),
##   marital_status = col_character(),
##   domicile_city_final = col_character(),
##   email = col_character(),
##   answer = col_character()
## )

# exclude missing data from initial data set
councillors_no_missing <- councillors_initial_data %>%
  anti_join(missing_data_completion, by = "id")

# join completed data to original data set
councillors_missing_data_joined <- councillors_initial_data %>%
  select(
    -last_name,
    -birth_date,
    -gender,
    -religion,
    -education,
    -marital_status,
    -domicile_city_final
  ) %>%
  inner_join(missing_data_completion, by = "id")

## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.

## Warning: `chr_along()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.

# concat missing and non missing data
councillors <- councillors_no_missing %>%
  bind_rows(councillors_missing_data_joined)

Funktionen für Barcharts

create_barchart <- function(data, title) {
  ggplot(
    data = data %>%
      mutate(missing = is - should),
    aes(
      x = group,
      y = missing,
      label = missing,
      fill = category
    )
  ) +
    geom_bar(stat = "identity") +
    geom_text(
      position = position_stack(vjust = 0.5),
      color = "white",
      size = 3
    ) +
    scale_fill_brewer(palette = "Set1") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(
      title = title,
      y = "Abweichung",
      x = ""
    )
}

create_barchart_with_factors <- function(data, title, factors) {
  ggplot(
    data = data %>%
      mutate(missing = is - should) %>%
      mutate(group = factor(group, levels = factors)),
    aes(
      x = group,
      y = missing,
      label = missing,
      fill = category
    )
  ) +
    geom_bar(stat = "identity") +
    geom_text(
      position = position_stack(vjust = 0.5),
      color = "white",
      size = 3
    ) +
    scale_fill_brewer(palette = "Set1") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(
      title = title,
      y = "Abweichung",
      x = ""
    )
}

Alter

# function to calculate age based on birth date
calc_age <- function(birth_date, ref_date = Sys.Date()) {
  period <- as.period(
    interval(birth_date, ref_date),
    unit = "year"
  )
  # return
  period$year
}

# function that puts people into 7 groups based on their age
get_group_for_age <- function(age) {
  case_when(
    age < 30 ~ "below 29",
    age < 40 ~ "30-39",
    age < 50 ~ "40-49",
    age < 60 ~ "50-59",
    age < 70 ~ "60-69",
    TRUE ~ "above 70"
  )
}

# create age groups for councillors
councillors_age <- councillors %>%
  # calculate age
  mutate(age = calc_age(birth_date)) %>%
  # use age_group for json export
  mutate(age_group = get_group_for_age(age)) %>%
  # reduce data
  select(age_group, id)

# age groups and percentage for councillors
age_groups_percentage_councillors <- councillors_age %>%
  group_by(age_group) %>%
  # count councillors, use id
  summarise(is = n_distinct(id)) %>%
  mutate(share = is / sum(is))

# read data for age of general population, data from end of 2017
general_population_age <- read_excel(
    "input/je-d-01.02.03.02.xlsx",
    sheet = "2017",
    skip = 5
  ) %>%
  rename(
    age = 1,
    total = 2,
    m = 3,
    f = 4,
    swiss = 5,
    foreigners = 6
  ) %>%
  # exclude footer rows
  filter(!is.na(total)) %>%
  # replace 105 und mehr  to 105 to only have numbers
  mutate(age = replace(age, age == "105 und mehr", 105)) %>%
  mutate(age = as.numeric(age)) %>%
  # select population older than 15 years
  filter(age >= 15) %>%
  # here use group instead of age_group for json export
  mutate(group = get_group_for_age(age))
  
# representation of age groups
age_groups_representation <- general_population_age %>%
  select(group, total) %>%
  group_by(group) %>%
  # sum up all values
  summarise(total = sum(total)) %>%
  # calculate share based on should values
  mutate(share = total / sum(total)) %>%
  mutate(should = round(share * 246, 0)) %>%
  # join councillors data (is values)
  left_join(
    age_groups_percentage_councillors %>%
      select(age_group, is),
    by = c("group" = "age_group")
  ) %>%
  mutate(category = "age_group") %>%
  select(category, group, is, should)

create_barchart_with_factors(
  age_groups_representation,
  "Repräsentation Alter",
  c(
    "below 29",
    "30-39",
    "40-49",
    "50-59",
    "60-69",
    "above 70"
  )
)

Geschlecht

# percentage gender data councillors
gender_percentage_councillors <- councillors %>%
  group_by(gender) %>%
  # count councillors, use id
  summarise(is = n_distinct(id)) %>%
  mutate(share = is / sum(is))

# read gender data for general population
general_population_gender <- general_population_age %>%
  select(m, f) %>%
  summarise_all(sum) %>%
  gather(group, total)

## Warning: `mut_node_car()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.

# representation of gender
gender_representation <- general_population_gender %>%
  select(group, total) %>%
  group_by(group) %>%
  # sum up all values
  summarise(total = sum(total)) %>%
  # calculate share based on should values
  mutate(share = total / sum(total)) %>%
  mutate(should = round(share * 246, 0)) %>%
  # join councillors data (is values)
  left_join(
  gender_percentage_councillors %>%
    select(gender, is),
    by = c("group" = "gender")
  ) %>%
  mutate(category = "gender") %>%
  select(category, group, is, should)

create_barchart(gender_representation, "Repräsentation Geschlecht")

Stadt / Land

# top 10 cities in Switzerland
big_10_names <- c(
  "Zürich",
  "Genève",
  "Bern",
  "Basel",
  "Lausanne",
  "Winterthur",
  "Luzern",
  "St. Gallen",
  "Lugano",
  "Biel/Bienne"
)

# read data for urban rural distinction
urban_rural <- read_excel(
  "input/Raumgliederungen.xlsx",
  sheet = "Daten",
  range = cell_cols("A:F")
) %>%
  filter(row_number() > 2) %>%
  rename(
    bfs_id = 1,
    municipality = 2,
    canton = 3,
    district_number = 4,
    district_name = 5,
    urban_rural = 6
  ) %>%
  mutate(
    urban_rural = case_when(
      urban_rural == 1 ~ "urban",
      urban_rural == 2 ~ "periurban",
      urban_rural == 3 ~ "rural",
      TRUE ~ urban_rural
    )
  ) %>%
  mutate(
    urban_rural = ifelse(
      municipality %in% big_10_names,
      "big_city",
      urban_rural
    )
  )

# read population data for every municipality, municpalities (2240 entries)
municipality_population <- read_excel(
  "input/su-d-01.02.03.06.xlsx",
  sheet = "2017",
  skip = 2
) %>%
  filter(row_number() > 3) %>%
  rename(
    geography = 1,
    total = 2
  )  %>%
  mutate(total = as.numeric(total)) %>%
  # exclude footer rows
  filter(!is.na(total)) %>%
  # exclude canton and district data
  filter(!str_detect(geography, "^-") & !str_detect(geography, "^>>")) %>%
  separate(geography, into = c("bfs_id", "municipality"), sep = 11) %>%
  mutate(bfs_id = str_extract(bfs_id, "\\d\\d\\d\\d")) %>%
  mutate(bfs_id = as.numeric(bfs_id)) %>%
  select(-total) %>%
  gather(age, population, -municipality, -bfs_id) %>%
  # replace 105 und mehr to 105 to only have numbers
  mutate(age = replace(age, age == "105 und mehr", 105)) %>%
  mutate(age = as.numeric(age)) %>%
  # select population older than 15 years
  filter(age >= 15) %>%
  group_by(municipality) %>%
  # sum up all ages to calcuate total population
  summarise(total = sum(population))

## Warning in evalq(as.numeric(age), <environment>): NAs durch Umwandlung
## erzeugt

# join population data to urban rural distinction
general_population_urban_rural <- municipality_population %>%
  left_join(
    urban_rural %>%
      select(municipality, urban_rural),
    by = "municipality"
  ) %>%
  # here use group instead of urban_rural for json export
  rename(group = urban_rural)

# join councillors to urban rural data set
councillors_urban_rural <- councillors %>%
  mutate(
    domicile_city_final = str_replace(domicile_city_final, "\\s\\d*$", "")
  ) %>%
  left_join(
    urban_rural %>%
      select(municipality, urban_rural),
    by = c("domicile_city_final" = "municipality")
  ) %>%
  # reduce data
  select(urban_rural, id)

# calculate percentage of urban rural councillors
councillors_urban_rural_percentage <- councillors_urban_rural %>%
  group_by(urban_rural) %>%
  # count councillors, use id
  summarise(is = n_distinct(id)) %>%
  mutate(share = is / sum(is))

# representation of urban rural population
urban_rural_representation <- general_population_urban_rural %>%
  select(group, total) %>%
  group_by(group) %>%
  # sum up all values
  summarise(total = sum(total)) %>%
  # calculate share based on should values
  mutate(share = total / sum(total)) %>%
  mutate(should = round(share * 246, 0)) %>%
  # join councillors data (is values)
  left_join(
    councillors_urban_rural_percentage %>%
      select(urban_rural, is),
    by = c("group" = "urban_rural")
  ) %>%
  mutate(category = "urban_rural") %>%
  select(category, group, is, should)

create_barchart(urban_rural_representation, "Repräsentation Stadt Land")

Bildung

#education renaming to match BfS categories
# nolint start
educatoion_renaming <- c(
  "^Handelsschule oder Handelsdiplom$" = "secondary_profession",
  "^Berufslehre oder Berufsschule$" = "secondary_profession",
  "^Diplommittellschule oder allgemeinbildende Schuke$" = "secondary_general_education",
  "^Diplommittellschule oder allgemeinbildende Schule$" = "secondary_general_education",
  "^Maturitätsschule, Gymnasium oder Seminar$" = "secondary_general_education",
  "^Höhere Berufsausbildung \\(Meisterdiplom, höhere Fachausweise etc\\.\\)$" = "tertiary_higher_profession",
  "^Höhere Berufsausbildung*$" = "tertiary_higher_profession",
  "^Höhere Fachschule$" = "tertiary_higher_profession",
  "^Höhere Berufsausbildung$" = "tertiary_higher_profession",
  "^Höhere Fachschule \\(Krankenpflegeschule, Schule für Sozialarbeit etc\\.\\)$" = "tertiary_higher_profession",
  "^Fachhochsule oder Technikum$" = "tertiary_university",
  "^Fachhochschule oder Technikum$" = "tertiary_university",
  "^Universität oder ETH$" = "tertiary_university",
  "^Anderer Bildungsabschluss$" = "unknown",
  "\\?" = "unknown"
)

# education renaming for short English version
educatoion_renaming_short <- c(
  "^Handelsschule oder Handelsdiplom$" = "secondary",
  "^Berufslehre oder Berufsschule$" = "secondary",
  "^Diplommittellschule oder allgemeinbildende Schuke$" = "secondary",
  "^Diplommittellschule oder allgemeinbildende Schule$" = "secondary",
  "^Maturitätsschule, Gymnasium oder Seminar$" = "secondary",
  "^Höhere Berufsausbildung \\(Meisterdiplom, höhere Fachausweise etc\\.\\)$" = "tertiary",
  "^Höhere Berufsausbildung*$" = "tertiary",
  "^Höhere Fachschule$" = "tertiary",
  "^Höhere Berufsausbildung$" = "tertiary",
  "^Höhere Fachschule \\(Krankenpflegeschule, Schule für Sozialarbeit etc\\.\\)$" = "tertiary",
  "^Fachhochsule oder Technikum$" = "tertiary",
  "^Fachhochschule oder Technikum$" = "tertiary",
  "^Universität oder ETH$" = "tertiary",
  "^Anderer Bildungsabschluss$" = "unknown",
  "\\?" = "unknown",
  "keine Angabe" = "unknown"
)

# nolint end

councillors_education <- councillors %>%
  mutate(
    education = str_replace_all(
      education,
      educatoion_renaming_short
      )
    ) %>%
  # reduce data
  select(education, id)

# calculate count of education type by councillors
councillors_education_percentage <- councillors_education %>%
  group_by(education) %>%
  # count councillors, use id
  summarise(is = n_distinct(id)) %>%
  # add share
  mutate(share = is / sum(is))

#read data for 2017, general population from age 15 on
education_general_population <- read_excel(
  "input/su-d-40.02.15.08.01-2017.xlsx",
  sheet = "Schweiz",
  skip = 3
  ) %>%
  filter(row_number() < 2) %>%
  rename(
    obligatory = 5,
    secondary_profession = 7,
    secondary_general_education = 9,
    tertiary_higher_profession = 11,
    tertiary_university = 13
  ) %>%
  mutate(
    secondary =
      as.numeric(secondary_profession) +
      as.numeric(secondary_general_education)
    ) %>%
  mutate(
    tertiary =
      as.numeric(tertiary_higher_profession) +
      as.numeric(tertiary_higher_profession)
    ) %>%
  # make data tidy to have key and total value on one line
  select(
    obligatory,
    secondary,
    tertiary
  ) %>%
  gather(
    group,
    total
  ) %>%
  mutate(total = as.numeric(total))

# representation of education
education_representation <- education_general_population %>%
  select(group, total) %>%
  group_by(group) %>%
  # calculate total, needed for share calculation
  summarise(total = sum(total)) %>%
  mutate(share = total / sum(total)) %>%
  # calculate councillors that should be in parliament
  mutate(should = round(share * 246, 0)) %>%
  # join councillors data (is values)
  full_join(
    councillors_education_percentage %>%
      select(education, is),
    by = c("group" = "education")
  ) %>%
  # add category for export
  mutate(category = "education") %>%
  select(category, group, is, should) %>%
  mutate(is = ifelse(is.na(is), 0, is))

create_barchart(education_representation, "Repräsentation Bildung")

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 1 rows containing missing values (position_stack).

Familienstatus

# define groups to extract data from BfS file
marital_status_filter <- c(
  "Ledig",
  "Verheiratet",
  "Geschieden",
  "In eingetragener Partnerschaft",
  "Verwitwet",
  "Unverheiratet",
  "Aufgelöste Partnerschaft"
)

general_population_marital_status <- read_excel(
  "input/su-d-01.02.03.03.xlsx",
  sheet = "2017",
  skip = 3
) %>%
  rename(
    group = 1,
    total = 2
  ) %>%
  select(group, total) %>%
  filter(group %in% marital_status_filter) %>%
  # subtract amout of people younger than 15 years (1269033)
  mutate(total = ifelse(group == "Ledig", total - 1269033, total))
  
# rename councillors data to match BfS categories
marital_status_renaming <- c(
  "^geschieden$" = "Geschieden",
  "^getrennt$" = "Geschieden",
  "^in eingetragener Partnerschaft$" = "In eingetragener Partnerschaft",
  "^ledig$" = "Ledig",
  "^leidig$" = "Ledig",
  "^verheiratet$" = "Verheiratet",
  "^verwitwet$" = "Verwitwet"
)

councillors_marital_status <- councillors %>%
  mutate(
    marital_status = str_replace_all(
      marital_status,
      marital_status_renaming
      )
    ) %>%
  # reduce data
  select(marital_status, id)

# calculate percentage of marital status councillors
councillors_marital_status_percentage <- councillors_marital_status %>%
  group_by(marital_status) %>%
  # count councillors, use id
  summarise(is = n_distinct(id)) %>%
  mutate(share = is / sum(is))

# rename variables for export
marital_status_export_renaming <- c(
  "Verheiratet" = "married",
  "In eingetragener Partnerschaft" = "partnership",
  "Ledig" = "single",
  "\\?" = "unknown",
  "Geschieden" = "divorced",
  "Verwitwet" = "widowed",
  "keine Angabe" = "unknown"
)

# representation of marital status
marital_status_representation <- general_population_marital_status %>%
  select(group, total) %>%
  group_by(group) %>%
  # calculate total, needed for share calculation
  summarise(total = sum(total)) %>%
  mutate(share = total / sum(total)) %>%
  # calculate councillors that should be in parliament
  mutate(should = round(share * 246, 0)) %>%
  # join councillors data (is values)
  full_join(
    councillors_marital_status_percentage %>%
      select(marital_status, is),
    by = c("group" = "marital_status")
  ) %>%
  # add category for export
  mutate(category = "marital_status") %>%
  select(category, group, is, should) %>%
  filter(!is.na(is)) %>%
  # replace by English variable names
  mutate(group = str_replace_all(
    group,
    marital_status_export_renaming
  ))

create_barchart(marital_status_representation, "Repräsentation Zivilstand")

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 1 rows containing missing values (position_stack).

Religion

religion_renaming <- c(
  "^Andere christliche Gemeinschaften$" = "other_religion",
  "^andere christliche Gemeinschaften$" = "other_religion",
  "^Christ-katholisch$" = "catholic",
  "^keine Angabe$" = "unknown",
  "\\?" = "unknown",
  "^Islamische Gemeinschaften$" = "muslim",
  "^Protestantisch$" = "protestant",
  "^Römisch-katholisch$" = "catholic",
  "^Konfessionslos$" = "undenominational"
)

religion_export_renaming <- c(
  "^jewish$" = "other_religion",
  "^other_christian$" = "other_religion"
)

general_population_religion <- read_excel(
  "input/je-d-01.08.02.01.xlsx",
  sheet = "1910-2017",
  range = "A22:R22",
  col_names = FALSE
) %>%
  select(
    protestant = 3,
    catholic = 5,
    other_christian = 7,
    jewish = 9,
    muslim = 11,
    other_religion = 13,
    undenominational = 15
  ) %>%
  gather(group, share) %>%
  # mark jewish and other_christian as other religion
  mutate(group = str_replace_all(
    group,
    religion_export_renaming
  )) %>%
  # group by religion type
  group_by(group) %>%
  # sum up all religions
  summarise(share = sum(share)) %>%
  # make share a decimal value
  mutate(share = share / 100)

councillors_religion <- councillors %>%
  mutate(
    religion = str_replace_all(
      religion,
      religion_renaming
    )
  ) %>%
  # reduce data
  select(religion, id)

# calculate percentage of religion, councillors
councillors_religion_percentage <- councillors_religion %>%
  group_by(religion) %>%
  # count councillors, use id
  summarise(is = n_distinct(id)) %>%
  mutate(share = is / sum(is))

# representation of religion
religion_representation <- general_population_religion %>%
  select(group, share) %>%
  # calculate councillors that should be in parliament
  mutate(should = round(share * 246, 0)) %>%
  # join councillors data (is values)
  full_join(
    councillors_religion_percentage %>%
      select(religion, is),
    by = c("group" = "religion")
  ) %>%
  # add category for export
  mutate(category = "religion") %>%
  select(category, group, is, should) %>%
  # set na values of is to 0
  mutate(is = ifelse(is.na(is), 0, is))

create_barchart(religion_representation, "Repräsentation Religion")

## Warning: Removed 1 rows containing missing values (position_stack).

## Warning: Removed 1 rows containing missing values (position_stack).

Export

councillors_export <- councillors %>%
   select (
    id,
    first_name,
    last_name,
    gender,
    party,
    canton
  ) %>%
  left_join(councillors_age, by = "id") %>%
  left_join(councillors_urban_rural, by = "id") %>%
  left_join(councillors_marital_status, by = "id") %>%
  left_join(councillors_education, by = "id") %>%
  left_join(councillors_religion, by = "id") %>%
  select (
    id,
    first_name,
    last_name,
    age_group,
    gender,
    urban_rural,
    marital_status,
    education,
    religion,
    party,
    canton
  ) %>%
  mutate(marital_status = str_replace_all(
    marital_status,
    marital_status_export_renaming
  ))

# unfortunately R and Javascript work with different variable naming systems
# in R snake_case is predominant, in Javascript it's camelCase, so we'll have
# to convert the column names in the export to match the javascript world
convert_column_names_to_camel_case <- function(df) {
  df %>%
    rename_at(
      vars(matches("_[a-z]")),
      # unfortunately the stringr package does not offer a possibility
      # to convert to uppercase in replace, so we use sub
      ~ sub("_([a-z])", "\\U\\1", ., perl = TRUE)
    )
}

write_csv(councillors_export, "output/councillors_supplemented.csv")

representation_export <- bind_rows(
  age_groups_representation,
  gender_representation,
  urban_rural_representation,
  marital_status_representation,
  education_representation,
  religion_representation
)

write_csv(representation_export, "output/representation.csv")

Linting

Der Code in diesem RMarkdown wird mit lintr automatisch auf den Wickham’schen tidyverse style guide überprüft.

lintr::lint(
  "main.Rmd",
  linters = lintr::with_defaults(
    object_length_linter(45)
  )
)