Dieses Dokument beschreibt die Vorprozessierung und explorative Analyse des Datensatzes, der Grundlage des auf srf.ch veröffentlichten Artikels Wie das Parlament die Wähler abbildet – und wie nicht ist.
SRF Data legt Wert darauf, dass die Datenvorprozessierung und -Analyse nachvollzogen und überprüft werden kann. SRF Data glaubt an das Prinzip offener Daten, aber auch offener und nachvollziehbarer Methoden. Zum anderen soll es Dritten ermöglicht werden, auf dieser Vorarbeit aufzubauen und damit weitere Auswertungen oder Applikationen zu generieren.
Das Endprodukt des vorliegenden Scripts, neben der explorativen Analyse, sind (Datenbeschreibung siehe unten):
councillors.json
: Datensatz aller Parlamentarier mit Geschlecht, Altersgruppe, Stadt-Land-Kategorisierung, Zivilstand, Bildung und Religionszugehörigkeit als JSONrepresentation.json
: Datensatz mit Ist- und Soll-Werten für die untersuchten Kategorien (Geschlecht, Altersgruppe etc.) als JSONDie Vorprozessierung und Analyse wurde im Statistikprogramm R vorgenommen. Das zugrunde liegende Script sowie die prozessierten Daten können unter diesem Link heruntergeladen werden. Durch Ausführen von main.Rmd
kann der hier beschriebene Prozess nachvollzogen und der für den Artikel verwendete Datensatz generiert werden. Dabei werden Daten aus dem Ordner input
eingelesen und Ergebnisse in den Ordner output
geschrieben.
SRF Data verwendet das rddj-template von Timo Grossenbacher als Grundlage für seine R-Scripts. Entstehen bei der Ausführung dieses Scripts Probleme, kann es helfen, die Anleitung von rddj-template zu studieren.
Debug-Informationen: This report was generated on 2019-08-27 08:44:02. R version: 3.5.3 on x86_64-apple-darwin15.6.0. For this report, CRAN packages as of 2018-09-01 were used.
Der Code für die vorliegende Datenprozessierung ist auf https://github.com/srfdata/2019-08-representation zur freien Verwendung verfügbar.
2019-08-representation von SRF Data ist lizenziert unter einer Creative Commons Namensnennung - Weitergabe unter gleichen Bedingungen 4.0 International Lizenz.
Code & Daten von SRF Data sind unter http://srfdata.github.io verfügbar.
Die veröffentlichten Informationen sind sorgfältig zusammengestellt, erheben aber keinen Anspruch auf Aktualität, Vollständigkeit oder Richtigkeit. Es wird keine Haftung übernommen für Schäden, die durch die Verwendung dieses Scripts oder der daraus gezogenen Informationen entstehen. Dies gilt ebenfalls für Inhalte Dritter, die über dieses Angebot zugänglich sind.
councillors_supplemented.csv
Attribut | Typ | Beschreibung |
---|---|---|
id | Numeric | Offizielle Id der Parlamentarischen Dienste |
fistName | String | Vorname der Parlamentarier |
lastName | String | Nachname der Parlamentarier |
ageGroup | String | Altersgruppe der Parlamentarier (sieben Kategorien) |
gender | String | Geschlecht der Parlamentarier (m, f) |
urbanRural | String | Stadt-Land-Einteilung der Parlamentarier gemäss Raumgliederung des BfS, vier Kategorien (big_city, urban, periurban, rural) |
maritalStatus | String | Zivilstand der Parlamentarier |
education | String | Höchste Ausbildung der Parlamentarier |
religion | String | Religionszugehörigkeit der Parlamentarier |
party | String | Partei der Parlamentarier |
representation.csv
Attribut | Typ | Beschreibung |
---|---|---|
category | String | Kategoriebezeichnung in Englisch für die jeweilige demografische Angabe |
group | String | Unterschiedliche Einteilungen der jeweiligen Kategorie (Geschlecht, Altersgruppen etc.) |
is | Numeric | Ist-Wert für eine Gruppe: So viele Parlamentarier unter 30 Jahren sitzen im Parlament |
should | Numeric | Soll-Wert für eine Gruppe: So viel Parlamentarier unter 30 Jahren müssten im Vergleich zur ständigen Wohnbevölkerung über 15 Jahren im Parlament sitzen |
→ input/councillorsCouncillor-export_de_ergaenzt_mit_religion.csv
Die Daten über die Parlamentarier stammen von den Parlamentarischen Diensten und wurden von Smartvote durch die eigene Datenbank und weitere Recherchen ergänzt. Im Datensatz befinden sich neben den verwendeten Kategorien auch noch weitere Angaben (Beruf, Sprache etc.) enthalten.
→ input/Missing Data Parlamentarians - Sheet1.csv
Ergänzung des obigen Datensatzes durch eigene Recherche. 99 Parlamentarier wurden per Mail (Code für den automatisierten Mailversand weiter unten) angefragt, fehlende Daten zu ergänzen. Davon haben 64 Parlamentarier geantwortet.
→ input/je-d-01.02.03.02.xlsx
Die Daten zur ständigen Wohnbevölkerung nach Alter haben wir von der Webseite des Bundesamts für Statistik (BfS) heruntergeladen.
→ input/je-d-01.08.02.01.xlsx
Die Daten zur ständigen Wohnbevölkerung ab 15 Jahren nach Religionszugehörigkeit haben wir von der Webseite des BfS heruntergeladen.
→ input/su-d-01.02.03.03.xlsx
Die Daten zur ständigen Wohnbevölkerung nach Zivilstand haben wir von der Webseite des BfS heruntergeladen.
→ input/su-d-40.02.15.08.01-2017.xlsx
Die Daten zur ständigen Wohnbevölkerung ab 15 Jahren nach Bildung haben wir von der Webseite des BfS heruntergeladen.
→ input/su-d-01.02.03.06.xlsx
Die Daten zur ständigen Wohnbevölkerung nach Alter und Gemeinde haben wir von der Webseite des BfS heruntergeladen.
→ input/Raumgliederungen.xlsx
Die Einteilung der Wohnorte der Parlamentarier in urban, periurban und rural haben wir der Raumgliederung des BfS entnommen (Gemeindetypologie 2012, 25 Typen, 9 Kategorien und Stadt/Land-Typologie). Die Tabelle haben wir aus der Applikation der Schweizer Gemeinden des BfS exportiert. Wir haben den offiziellen Gemeindestand vom 02.04.2017 verwendet mit 2240 Einträgen.
## [1] "package package:rmarkdown detached"
## Loading required package: knitr
## Loading required package: rstudioapi
# from https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
# if you don't need a package, remove it from here (commenting is probably not sufficient)
# tidyverse: see https://blog.rstudio.org/2016/09/15/tidyverse-1-0-0/
cat("
library(rstudioapi)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, purrr, tibble
library(lubridate) # date calculation
library(glue) # cooler string templating
library(magrittr) # pipes
library(readxl) # excel
library(scales) # scales for ggplot2
library(jsonlite) # json
library(lintr) # code linting
library(sf) # spatial data handling
library(rmarkdown)
# library(mailR)",
file = "manifest.R")
# if checkpoint is not yet installed, install it (for people using this
# system for the first time)
if (!require(checkpoint)) {
if (!require(devtools)) {
install.packages("devtools", repos = "http://cran.us.r-project.org")
require(devtools)
}
devtools::install_github("RevolutionAnalytics/checkpoint",
ref = "v0.3.2", # could be adapted later,
# as of now (beginning of July 2017
# this is the current release on CRAN)
repos = "http://cran.us.r-project.org")
require(checkpoint)
}
##
## There is a binary version available but the source version is
## later:
## binary source needs_compilation
## pkgbuild 1.0.4 1.0.5 FALSE
##
##
## The downloaded binary packages are in
## /var/folders/z1/kysjxzjd13ngsnb450pxrpdjw6bv67/T//RtmpORUN2J/downloaded_packages
##
checking for file ‘/private/var/folders/z1/kysjxzjd13ngsnb450pxrpdjw6bv67/T/RtmpORUN2J/remotesf16c3120f0d4/RevolutionAnalytics-checkpoint-024b91d/DESCRIPTION’ ...
✔ checking for file ‘/private/var/folders/z1/kysjxzjd13ngsnb450pxrpdjw6bv67/T/RtmpORUN2J/remotesf16c3120f0d4/RevolutionAnalytics-checkpoint-024b91d/DESCRIPTION’ (344ms)
##
─ preparing ‘checkpoint’:
## checking DESCRIPTION meta-information ...
✔ checking DESCRIPTION meta-information
##
─ checking for LF line-endings in source and make files and shell scripts
##
─ checking for empty or unneeded directories
##
─ building ‘checkpoint_0.4.0.tar.gz’
##
Warnung: invalid uid value replaced by that for user 'nobody'
##
Warnung: invalid gid value replaced by that for user 'nobody'
##
##
# nolint start
if (!dir.exists("~/.checkpoint")) {
dir.create("~/.checkpoint")
}
# nolint end
# install packages for the specified CRAN snapshot date
checkpoint(snapshotDate = package_date,
project = path_to_wd,
verbose = T,
scanForPackages = T,
use.knitr = F,
R.version = R_version)
rm(package_date)
source("manifest.R")
unlink("manifest.R")
sessionInfo()
## R version 3.5.3 Patched (2019-03-11 r76221)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] de_CH.UTF-8/de_CH.UTF-8/de_CH.UTF-8/C/de_CH.UTF-8/de_CH.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_1.10 sf_0.6-3 lintr_1.0.2 jsonlite_1.5
## [5] scales_1.0.0 readxl_1.1.0 magrittr_1.5 glue_1.3.0
## [9] lubridate_1.7.4 forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6
## [13] purrr_0.2.5 readr_1.1.1 tidyr_0.8.1 tibble_1.4.2
## [17] ggplot2_3.0.0 tidyverse_1.2.1 checkpoint_0.4.0 devtools_2.1.0
## [21] usethis_1.5.1 rstudioapi_0.7 knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 lattice_0.20-38 class_7.3-15
## [4] prettyunits_1.0.2 ps_1.3.0 assertthat_0.2.0
## [7] rprojroot_1.3-2 digest_0.6.16 R6_2.2.2
## [10] cellranger_1.1.0 plyr_1.8.4 backports_1.1.2
## [13] e1071_1.7-0 evaluate_0.11 httr_1.3.1
## [16] pillar_1.3.0 rlang_0.4.0 lazyeval_0.2.1
## [19] curl_3.2 callr_3.3.1 desc_1.2.0
## [22] munsell_0.5.0 broom_0.5.0 compiler_3.5.3
## [25] modelr_0.1.2 pkgconfig_2.0.2 pkgbuild_1.0.5
## [28] htmltools_0.3.6 tidyselect_0.2.4 crayon_1.3.4
## [31] withr_2.1.2 grid_3.5.3 spData_0.2.9.3
## [34] DBI_1.0.0 nlme_3.1-137 gtable_0.2.0
## [37] units_0.6-0 cli_1.1.0 stringi_1.2.4
## [40] fs_1.3.1 remotes_2.1.0 rex_1.1.2
## [43] bindrcpp_0.2.2 testthat_2.2.1 xml2_1.2.0
## [46] tools_3.5.3 hms_0.4.2 processx_3.4.1
## [49] pkgload_1.0.2 yaml_2.2.0 colorspace_1.3-2
## [52] sessioninfo_1.1.1 classInt_0.2-3 rvest_0.3.2
## [55] memoise_1.1.0 bindr_0.1.1 haven_1.1.2
# list of municipality names to replace councillors cities in order to match BfS data
municpality_names <- c(
"^Coira$" = "Chur",
"^Birmenstorf$" = "Birmenstorf (AG)",
"^Emmenbrücke$" = "Emmen",
"^Geneve$" = "Genève",
"^Azmoos$" = "Wartau",
"^Avry-sur-Matran$" = "Avry",
"^Rüegsauschachen$" = "Rüegsau",
"^Carouge$" = "Carouge (GE)",
"^Wil$" = "Wil (SG)",
"^Gstaad$" = "Saanen",
"^Nàfels$" = "Glarus Nord",
"^Oberriet$" = "Oberriet (SG)",
"^Kaltacker$" = "Heimiswil",
"^Illnau$" = "Illnau-Effretikon",
"^Edlibach$" = "Menzingen",
"^Bürglen$" = "Lungern",
"^Hergiswil$" = "Hergiswil (NW)",
"^St-George$" = "Saint-George",
"^Cheyres$" = "Cheyres-Châbles",
"^Aesch$" = "Aesch (ZH)",
"^Küssnacht$" = "Küssnacht (SZ)",
"^Grand-Lancy$" = "Lancy",
"^Rüi b\\. Büren$" = "Rüti bei Büren",
"^Ruvigliana$" = "Lugano",
"^Küsnacht$" = "Küsnacht (ZH)",
"^Mälchi$" = "Fraubrunnen",
"^Watt-Regensdorf$" = "Regensdorf",
"^Gossau$" = "Gossau (ZH)",
"^Rüthi \\(Rheintal\\)$" = "Rüthi (SG)",
"^Auvernier$" = "Milvignes",
"^Rapperswil$" = "Rapperswil-Jona",
"^Morgins$" = "Troistorrents",
"^Glion$" = "Montreux",
"^Cinuos-chel$" = "S-chanf",
"^Wohlen$" = "Wohlen (AG)",
"^Affoltern am Ablis$" = "Affoltern am Albis",
"^Wallenwil$" = "Eschlikon",
"^Bironico$" = "Monteceneri",
"^Oberflachs$" = "Schinznach",
"^Nussbaumen$" = "Obersiggenthal",
"^Opfertshofen$" = "Thayngen",
"^Biel$" = "Biel/Bienne",
"^Schwanden$" = "Glarus Süd",
"^Haslen$" = "Glarus Süd",
"^Le Sentier$" = "Le Chenit"
)
# read councillor data (246 councillors)
councillors_initial_data <- read_delim(
"input/councillorsCouncillor-export_de_ergaenzt_mit_religion.csv",
delim = ";"
) %>%
# remove all c_ from column names to make further processing easier
rename_at(
vars(
starts_with("c_")),
~ str_replace(., "^c_", "")
) %>%
# convert birth date to date format
mutate(birth_date = as.Date(birth_date, format = "%d.%m.%Y")) %>%
# replace domicile names in order to match BfS names
mutate(domicile_city_final = str_replace_all(
domicile_city_final,
municpality_names
))
## Parsed with column specification:
## cols(
## .default = col_character(),
## c_id = col_integer(),
## c_number = col_integer(),
## c_domicile_zip = col_integer(),
## c_postal_zip = col_integer(),
## c_domicile_zip_final = col_integer()
## )
## See spec(...) for full column specifications.
## Warning: `lang()` is deprecated as of rlang 0.2.0.
## Please use `call2()` instead.
## This warning is displayed once per session.
## Warning: `new_overscope()` is deprecated as of rlang 0.2.0.
## Please use `new_data_mask()` instead.
## This warning is displayed once per session.
## Warning: `is_lang()` is deprecated as of rlang 0.2.0.
## Please use `is_call()` instead.
## This warning is displayed once per session.
## Warning: The `printer` argument is deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
# read data that was completed by email request
missing_data_completion <- read_csv(
"input/Missing Data Parlamentarians - Sheet1.csv"
)
## Parsed with column specification:
## cols(
## id = col_integer(),
## last_name = col_character(),
## birth_date = col_date(format = ""),
## gender = col_character(),
## religion = col_character(),
## education = col_character(),
## marital_status = col_character(),
## domicile_city_final = col_character(),
## email = col_character(),
## answer = col_character()
## )
# exclude missing data from initial data set
councillors_no_missing <- councillors_initial_data %>%
anti_join(missing_data_completion, by = "id")
# join completed data to original data set
councillors_missing_data_joined <- councillors_initial_data %>%
select(
-last_name,
-birth_date,
-gender,
-religion,
-education,
-marital_status,
-domicile_city_final
) %>%
inner_join(missing_data_completion, by = "id")
## Warning: `overscope_eval_next()` is deprecated as of rlang 0.2.0.
## Please use `eval_tidy()` with a data mask instead.
## This warning is displayed once per session.
## Warning: `chr_along()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
# concat missing and non missing data
councillors <- councillors_no_missing %>%
bind_rows(councillors_missing_data_joined)
create_barchart <- function(data, title) {
ggplot(
data = data %>%
mutate(missing = is - should),
aes(
x = group,
y = missing,
label = missing,
fill = category
)
) +
geom_bar(stat = "identity") +
geom_text(
position = position_stack(vjust = 0.5),
color = "white",
size = 3
) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(
title = title,
y = "Abweichung",
x = ""
)
}
create_barchart_with_factors <- function(data, title, factors) {
ggplot(
data = data %>%
mutate(missing = is - should) %>%
mutate(group = factor(group, levels = factors)),
aes(
x = group,
y = missing,
label = missing,
fill = category
)
) +
geom_bar(stat = "identity") +
geom_text(
position = position_stack(vjust = 0.5),
color = "white",
size = 3
) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(
title = title,
y = "Abweichung",
x = ""
)
}
# function to calculate age based on birth date
calc_age <- function(birth_date, ref_date = Sys.Date()) {
period <- as.period(
interval(birth_date, ref_date),
unit = "year"
)
# return
period$year
}
# function that puts people into 7 groups based on their age
get_group_for_age <- function(age) {
case_when(
age < 30 ~ "below 29",
age < 40 ~ "30-39",
age < 50 ~ "40-49",
age < 60 ~ "50-59",
age < 70 ~ "60-69",
TRUE ~ "above 70"
)
}
# create age groups for councillors
councillors_age <- councillors %>%
# calculate age
mutate(age = calc_age(birth_date)) %>%
# use age_group for json export
mutate(age_group = get_group_for_age(age)) %>%
# reduce data
select(age_group, id)
# age groups and percentage for councillors
age_groups_percentage_councillors <- councillors_age %>%
group_by(age_group) %>%
# count councillors, use id
summarise(is = n_distinct(id)) %>%
mutate(share = is / sum(is))
# read data for age of general population, data from end of 2017
general_population_age <- read_excel(
"input/je-d-01.02.03.02.xlsx",
sheet = "2017",
skip = 5
) %>%
rename(
age = 1,
total = 2,
m = 3,
f = 4,
swiss = 5,
foreigners = 6
) %>%
# exclude footer rows
filter(!is.na(total)) %>%
# replace 105 und mehr to 105 to only have numbers
mutate(age = replace(age, age == "105 und mehr", 105)) %>%
mutate(age = as.numeric(age)) %>%
# select population older than 15 years
filter(age >= 15) %>%
# here use group instead of age_group for json export
mutate(group = get_group_for_age(age))
# representation of age groups
age_groups_representation <- general_population_age %>%
select(group, total) %>%
group_by(group) %>%
# sum up all values
summarise(total = sum(total)) %>%
# calculate share based on should values
mutate(share = total / sum(total)) %>%
mutate(should = round(share * 246, 0)) %>%
# join councillors data (is values)
left_join(
age_groups_percentage_councillors %>%
select(age_group, is),
by = c("group" = "age_group")
) %>%
mutate(category = "age_group") %>%
select(category, group, is, should)
create_barchart_with_factors(
age_groups_representation,
"Repräsentation Alter",
c(
"below 29",
"30-39",
"40-49",
"50-59",
"60-69",
"above 70"
)
)
# percentage gender data councillors
gender_percentage_councillors <- councillors %>%
group_by(gender) %>%
# count councillors, use id
summarise(is = n_distinct(id)) %>%
mutate(share = is / sum(is))
# read gender data for general population
general_population_gender <- general_population_age %>%
select(m, f) %>%
summarise_all(sum) %>%
gather(group, total)
## Warning: `mut_node_car()` is deprecated as of rlang 0.2.0.
## This warning is displayed once per session.
# representation of gender
gender_representation <- general_population_gender %>%
select(group, total) %>%
group_by(group) %>%
# sum up all values
summarise(total = sum(total)) %>%
# calculate share based on should values
mutate(share = total / sum(total)) %>%
mutate(should = round(share * 246, 0)) %>%
# join councillors data (is values)
left_join(
gender_percentage_councillors %>%
select(gender, is),
by = c("group" = "gender")
) %>%
mutate(category = "gender") %>%
select(category, group, is, should)
create_barchart(gender_representation, "Repräsentation Geschlecht")
# top 10 cities in Switzerland
big_10_names <- c(
"Zürich",
"Genève",
"Bern",
"Basel",
"Lausanne",
"Winterthur",
"Luzern",
"St. Gallen",
"Lugano",
"Biel/Bienne"
)
# read data for urban rural distinction
urban_rural <- read_excel(
"input/Raumgliederungen.xlsx",
sheet = "Daten",
range = cell_cols("A:F")
) %>%
filter(row_number() > 2) %>%
rename(
bfs_id = 1,
municipality = 2,
canton = 3,
district_number = 4,
district_name = 5,
urban_rural = 6
) %>%
mutate(
urban_rural = case_when(
urban_rural == 1 ~ "urban",
urban_rural == 2 ~ "periurban",
urban_rural == 3 ~ "rural",
TRUE ~ urban_rural
)
) %>%
mutate(
urban_rural = ifelse(
municipality %in% big_10_names,
"big_city",
urban_rural
)
)
# read population data for every municipality, municpalities (2240 entries)
municipality_population <- read_excel(
"input/su-d-01.02.03.06.xlsx",
sheet = "2017",
skip = 2
) %>%
filter(row_number() > 3) %>%
rename(
geography = 1,
total = 2
) %>%
mutate(total = as.numeric(total)) %>%
# exclude footer rows
filter(!is.na(total)) %>%
# exclude canton and district data
filter(!str_detect(geography, "^-") & !str_detect(geography, "^>>")) %>%
separate(geography, into = c("bfs_id", "municipality"), sep = 11) %>%
mutate(bfs_id = str_extract(bfs_id, "\\d\\d\\d\\d")) %>%
mutate(bfs_id = as.numeric(bfs_id)) %>%
select(-total) %>%
gather(age, population, -municipality, -bfs_id) %>%
# replace 105 und mehr to 105 to only have numbers
mutate(age = replace(age, age == "105 und mehr", 105)) %>%
mutate(age = as.numeric(age)) %>%
# select population older than 15 years
filter(age >= 15) %>%
group_by(municipality) %>%
# sum up all ages to calcuate total population
summarise(total = sum(population))
## Warning in evalq(as.numeric(age), <environment>): NAs durch Umwandlung
## erzeugt
# join population data to urban rural distinction
general_population_urban_rural <- municipality_population %>%
left_join(
urban_rural %>%
select(municipality, urban_rural),
by = "municipality"
) %>%
# here use group instead of urban_rural for json export
rename(group = urban_rural)
# join councillors to urban rural data set
councillors_urban_rural <- councillors %>%
mutate(
domicile_city_final = str_replace(domicile_city_final, "\\s\\d*$", "")
) %>%
left_join(
urban_rural %>%
select(municipality, urban_rural),
by = c("domicile_city_final" = "municipality")
) %>%
# reduce data
select(urban_rural, id)
# calculate percentage of urban rural councillors
councillors_urban_rural_percentage <- councillors_urban_rural %>%
group_by(urban_rural) %>%
# count councillors, use id
summarise(is = n_distinct(id)) %>%
mutate(share = is / sum(is))
# representation of urban rural population
urban_rural_representation <- general_population_urban_rural %>%
select(group, total) %>%
group_by(group) %>%
# sum up all values
summarise(total = sum(total)) %>%
# calculate share based on should values
mutate(share = total / sum(total)) %>%
mutate(should = round(share * 246, 0)) %>%
# join councillors data (is values)
left_join(
councillors_urban_rural_percentage %>%
select(urban_rural, is),
by = c("group" = "urban_rural")
) %>%
mutate(category = "urban_rural") %>%
select(category, group, is, should)
create_barchart(urban_rural_representation, "Repräsentation Stadt Land")
#education renaming to match BfS categories
# nolint start
educatoion_renaming <- c(
"^Handelsschule oder Handelsdiplom$" = "secondary_profession",
"^Berufslehre oder Berufsschule$" = "secondary_profession",
"^Diplommittellschule oder allgemeinbildende Schuke$" = "secondary_general_education",
"^Diplommittellschule oder allgemeinbildende Schule$" = "secondary_general_education",
"^Maturitätsschule, Gymnasium oder Seminar$" = "secondary_general_education",
"^Höhere Berufsausbildung \\(Meisterdiplom, höhere Fachausweise etc\\.\\)$" = "tertiary_higher_profession",
"^Höhere Berufsausbildung*$" = "tertiary_higher_profession",
"^Höhere Fachschule$" = "tertiary_higher_profession",
"^Höhere Berufsausbildung$" = "tertiary_higher_profession",
"^Höhere Fachschule \\(Krankenpflegeschule, Schule für Sozialarbeit etc\\.\\)$" = "tertiary_higher_profession",
"^Fachhochsule oder Technikum$" = "tertiary_university",
"^Fachhochschule oder Technikum$" = "tertiary_university",
"^Universität oder ETH$" = "tertiary_university",
"^Anderer Bildungsabschluss$" = "unknown",
"\\?" = "unknown"
)
# education renaming for short English version
educatoion_renaming_short <- c(
"^Handelsschule oder Handelsdiplom$" = "secondary",
"^Berufslehre oder Berufsschule$" = "secondary",
"^Diplommittellschule oder allgemeinbildende Schuke$" = "secondary",
"^Diplommittellschule oder allgemeinbildende Schule$" = "secondary",
"^Maturitätsschule, Gymnasium oder Seminar$" = "secondary",
"^Höhere Berufsausbildung \\(Meisterdiplom, höhere Fachausweise etc\\.\\)$" = "tertiary",
"^Höhere Berufsausbildung*$" = "tertiary",
"^Höhere Fachschule$" = "tertiary",
"^Höhere Berufsausbildung$" = "tertiary",
"^Höhere Fachschule \\(Krankenpflegeschule, Schule für Sozialarbeit etc\\.\\)$" = "tertiary",
"^Fachhochsule oder Technikum$" = "tertiary",
"^Fachhochschule oder Technikum$" = "tertiary",
"^Universität oder ETH$" = "tertiary",
"^Anderer Bildungsabschluss$" = "unknown",
"\\?" = "unknown",
"keine Angabe" = "unknown"
)
# nolint end
councillors_education <- councillors %>%
mutate(
education = str_replace_all(
education,
educatoion_renaming_short
)
) %>%
# reduce data
select(education, id)
# calculate count of education type by councillors
councillors_education_percentage <- councillors_education %>%
group_by(education) %>%
# count councillors, use id
summarise(is = n_distinct(id)) %>%
# add share
mutate(share = is / sum(is))
#read data for 2017, general population from age 15 on
education_general_population <- read_excel(
"input/su-d-40.02.15.08.01-2017.xlsx",
sheet = "Schweiz",
skip = 3
) %>%
filter(row_number() < 2) %>%
rename(
obligatory = 5,
secondary_profession = 7,
secondary_general_education = 9,
tertiary_higher_profession = 11,
tertiary_university = 13
) %>%
mutate(
secondary =
as.numeric(secondary_profession) +
as.numeric(secondary_general_education)
) %>%
mutate(
tertiary =
as.numeric(tertiary_higher_profession) +
as.numeric(tertiary_higher_profession)
) %>%
# make data tidy to have key and total value on one line
select(
obligatory,
secondary,
tertiary
) %>%
gather(
group,
total
) %>%
mutate(total = as.numeric(total))
# representation of education
education_representation <- education_general_population %>%
select(group, total) %>%
group_by(group) %>%
# calculate total, needed for share calculation
summarise(total = sum(total)) %>%
mutate(share = total / sum(total)) %>%
# calculate councillors that should be in parliament
mutate(should = round(share * 246, 0)) %>%
# join councillors data (is values)
full_join(
councillors_education_percentage %>%
select(education, is),
by = c("group" = "education")
) %>%
# add category for export
mutate(category = "education") %>%
select(category, group, is, should) %>%
mutate(is = ifelse(is.na(is), 0, is))
create_barchart(education_representation, "Repräsentation Bildung")
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (position_stack).
# define groups to extract data from BfS file
marital_status_filter <- c(
"Ledig",
"Verheiratet",
"Geschieden",
"In eingetragener Partnerschaft",
"Verwitwet",
"Unverheiratet",
"Aufgelöste Partnerschaft"
)
general_population_marital_status <- read_excel(
"input/su-d-01.02.03.03.xlsx",
sheet = "2017",
skip = 3
) %>%
rename(
group = 1,
total = 2
) %>%
select(group, total) %>%
filter(group %in% marital_status_filter) %>%
# subtract amout of people younger than 15 years (1269033)
mutate(total = ifelse(group == "Ledig", total - 1269033, total))
# rename councillors data to match BfS categories
marital_status_renaming <- c(
"^geschieden$" = "Geschieden",
"^getrennt$" = "Geschieden",
"^in eingetragener Partnerschaft$" = "In eingetragener Partnerschaft",
"^ledig$" = "Ledig",
"^leidig$" = "Ledig",
"^verheiratet$" = "Verheiratet",
"^verwitwet$" = "Verwitwet"
)
councillors_marital_status <- councillors %>%
mutate(
marital_status = str_replace_all(
marital_status,
marital_status_renaming
)
) %>%
# reduce data
select(marital_status, id)
# calculate percentage of marital status councillors
councillors_marital_status_percentage <- councillors_marital_status %>%
group_by(marital_status) %>%
# count councillors, use id
summarise(is = n_distinct(id)) %>%
mutate(share = is / sum(is))
# rename variables for export
marital_status_export_renaming <- c(
"Verheiratet" = "married",
"In eingetragener Partnerschaft" = "partnership",
"Ledig" = "single",
"\\?" = "unknown",
"Geschieden" = "divorced",
"Verwitwet" = "widowed",
"keine Angabe" = "unknown"
)
# representation of marital status
marital_status_representation <- general_population_marital_status %>%
select(group, total) %>%
group_by(group) %>%
# calculate total, needed for share calculation
summarise(total = sum(total)) %>%
mutate(share = total / sum(total)) %>%
# calculate councillors that should be in parliament
mutate(should = round(share * 246, 0)) %>%
# join councillors data (is values)
full_join(
councillors_marital_status_percentage %>%
select(marital_status, is),
by = c("group" = "marital_status")
) %>%
# add category for export
mutate(category = "marital_status") %>%
select(category, group, is, should) %>%
filter(!is.na(is)) %>%
# replace by English variable names
mutate(group = str_replace_all(
group,
marital_status_export_renaming
))
create_barchart(marital_status_representation, "Repräsentation Zivilstand")
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (position_stack).
religion_renaming <- c(
"^Andere christliche Gemeinschaften$" = "other_religion",
"^andere christliche Gemeinschaften$" = "other_religion",
"^Christ-katholisch$" = "catholic",
"^keine Angabe$" = "unknown",
"\\?" = "unknown",
"^Islamische Gemeinschaften$" = "muslim",
"^Protestantisch$" = "protestant",
"^Römisch-katholisch$" = "catholic",
"^Konfessionslos$" = "undenominational"
)
religion_export_renaming <- c(
"^jewish$" = "other_religion",
"^other_christian$" = "other_religion"
)
general_population_religion <- read_excel(
"input/je-d-01.08.02.01.xlsx",
sheet = "1910-2017",
range = "A22:R22",
col_names = FALSE
) %>%
select(
protestant = 3,
catholic = 5,
other_christian = 7,
jewish = 9,
muslim = 11,
other_religion = 13,
undenominational = 15
) %>%
gather(group, share) %>%
# mark jewish and other_christian as other religion
mutate(group = str_replace_all(
group,
religion_export_renaming
)) %>%
# group by religion type
group_by(group) %>%
# sum up all religions
summarise(share = sum(share)) %>%
# make share a decimal value
mutate(share = share / 100)
councillors_religion <- councillors %>%
mutate(
religion = str_replace_all(
religion,
religion_renaming
)
) %>%
# reduce data
select(religion, id)
# calculate percentage of religion, councillors
councillors_religion_percentage <- councillors_religion %>%
group_by(religion) %>%
# count councillors, use id
summarise(is = n_distinct(id)) %>%
mutate(share = is / sum(is))
# representation of religion
religion_representation <- general_population_religion %>%
select(group, share) %>%
# calculate councillors that should be in parliament
mutate(should = round(share * 246, 0)) %>%
# join councillors data (is values)
full_join(
councillors_religion_percentage %>%
select(religion, is),
by = c("group" = "religion")
) %>%
# add category for export
mutate(category = "religion") %>%
select(category, group, is, should) %>%
# set na values of is to 0
mutate(is = ifelse(is.na(is), 0, is))
create_barchart(religion_representation, "Repräsentation Religion")
## Warning: Removed 1 rows containing missing values (position_stack).
## Warning: Removed 1 rows containing missing values (position_stack).
councillors_export <- councillors %>%
select (
id,
first_name,
last_name,
gender,
party,
canton
) %>%
left_join(councillors_age, by = "id") %>%
left_join(councillors_urban_rural, by = "id") %>%
left_join(councillors_marital_status, by = "id") %>%
left_join(councillors_education, by = "id") %>%
left_join(councillors_religion, by = "id") %>%
select (
id,
first_name,
last_name,
age_group,
gender,
urban_rural,
marital_status,
education,
religion,
party,
canton
) %>%
mutate(marital_status = str_replace_all(
marital_status,
marital_status_export_renaming
))
# unfortunately R and Javascript work with different variable naming systems
# in R snake_case is predominant, in Javascript it's camelCase, so we'll have
# to convert the column names in the export to match the javascript world
convert_column_names_to_camel_case <- function(df) {
df %>%
rename_at(
vars(matches("_[a-z]")),
# unfortunately the stringr package does not offer a possibility
# to convert to uppercase in replace, so we use sub
~ sub("_([a-z])", "\\U\\1", ., perl = TRUE)
)
}
write_csv(councillors_export, "output/councillors_supplemented.csv")
representation_export <- bind_rows(
age_groups_representation,
gender_representation,
urban_rural_representation,
marital_status_representation,
education_representation,
religion_representation
)
write_csv(representation_export, "output/representation.csv")
Der Code in diesem RMarkdown wird mit lintr automatisch auf den Wickham’schen tidyverse style guide überprüft.
lintr::lint(
"main.Rmd",
linters = lintr::with_defaults(
object_length_linter(45)
)
)