This document illustrates the preprocessing of the dataset visualized in this article on srf.ch.
SRF Data attaches great importance to transparent and reproducible data preprocessing and -analysis. SRF Data believes in the principles of open data but also open and reproducible methods. Third parties should be empowered to build on the work of SRF Data and to generate new analyses and applications.
The preprocessing and analysis of the data was conducted in the R project for statistical computing. The RMarkdown script used to generate this document and all the resulting data can be downloaded under this link. Through executing main.Rmd
, the herein described process can be reproduced and this document can be generated. In the course of this, data from the folder ìnput
will be processed and results will be written to output
.
Attention: Please set your working directory in the first code chunk!
The code for the herein described process can also be freely downloaded from https://github.com/srfdata/2015-06-elections-partystrengths. Criticism in the form of GitHub issues and pull requests are very welcome!
2015-06-elections-partystrengths by SRF Data is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The published information has been collated carefully, but no guarantee is offered of its completeness, correctness or up-to-date nature. No liability is accepted for damage or loss incurred from the use of this script or the information drawn from it. This exclusion of liability also applies to third-party content that is accessible via this offer.
All code & data from SRF Data is available under http://srfdata.github.io.
input/parteienstaerke.xlsx
- Communal party strengths, received by the Federal Statistical Office (BFS) (not publicly available as far as we know)input/parteienstaerke_mod_2.xlsx
- Slightly and manually structurally adjusted parteienstaerke.xlsx
in order to make reading into R easier - note: file seems to have errors when opened via Excel/OpenOffice/etc. but still contains valid, R-readable datainput/parteienstaerke_national.xlsx
- National party strengths, without formatting, as publicly available from the BFSinput/gdenamen_translation.xlsx
- Common translations for Swiss municipalities, without formatting, as publicly available from the BFS (sheet “Geb. Übers. - Trad. usuelles”)input/geodata/g2g14.*
- Official, unaltered, generalized municipality borders (2014) as available from the BFSThe following sections describe the results of the data preprocessing as stored in the output
folder.
output/municipalities.csv
Contains metadata for all 2’345 municipalities as of May 1st, 2014, derived from input/parteienstaerke_mod_2.xlsx
and input/gdenamen_translation.xlsx
.
Attribute | Type | Description |
---|---|---|
ID | Integer | Official BFS-number |
Name_Official | String | Official name of the municipality |
Name_D | String | Contains the “common” German translation if it exists |
Name_F | String | Contains the “common” French translation if it exists |
Name_I | String | Contains the “common” Italian translation if it exists |
output/parties.csv
Contains party classifications made by SRF Data with the help of political scientists, used throughout all projects related to elections.
Attribute | Type | Description |
---|---|---|
ID | Integer | Unique identifier |
Abbr_* | String | Abbreviation in German (D), French (F), English (E), Romansh (R), Italian (I), respectively |
Legend_* | String | Abbreviation, but with slightly more information, used for frontend purposes |
Name_* | String | Full name |
Sortorder | Integer | Used for frontend purposes solely |
OLD_ID | String | “Official” ID as given in input/parteienstaerke_mod_2.xlsx (sheet “Parteien”), used for combining party strengths for party groupings |
output/partystrengths.csv
Party strengths 1971-2011 in all 2’345 municipalities, as derived from input/parteienstaerke_mod_2.xlsx
. Note: The municipalities have already been harmonized to May 1st 2014 by the BFS.
Note: Only those municipality-year-party combinations are contained where the party actually had candidates.
Attribute | Type | Description |
---|---|---|
municipality | Integer | Unique identifier, referencing ID in output/municipalities.csv |
party | Integer | Party or party grouping, referencing ID in output/parties.csv |
year | Integer | Election year |
support | Double | Party strength in fraction of 1 (e.g. 0.2 = 20%) |
output/by_municipality/municipality_{id}.csv
Party strengths 1971-2011 in municipality with BFS-number id.
Note: Only those year-party combinations are contained where the party actually had candidates.
Attribute | Type | Description |
---|---|---|
party | Integer | Party or party grouping, referencing ID in output/parties.csv |
year | Integer | Election year |
support | Double | Party strength in fraction of 1 (e.g. 0.2 = 20%) |
output/by_municipality/national.csv
Party strengths 1971-2011 on the national level.
Note: Only those year-party combinations are contained where the party actually had candidates.
Attribute | Type | Description |
---|---|---|
party | Integer | Party or party grouping, referencing ID in output/parties.csv |
year | Integer | Election year |
support | Double | Party strength in fraction of 1 (e.g. 0.2 = 20%) |
output/by_party/party_{id}.csv
Party strengths 1971-2011 in all municipalities for party with id as defined in output/parties.csv
.
Note: Only those year-municipality combinations are contained where the party actually had candidates.
Attribute | Type | Description |
---|---|---|
municipality | Integer | Unique identifier, referencing ID in output/municipalities.csv |
year | Integer | Election year |
support | Double | Party strength in fraction of 1 (e.g. 0.2 = 20%) |
output/by_party/dominant_party.csv
Party strengths 1971-2011 with only the “dominant” party per year-municipality combination (i.e. the one with the most support)
Attribute | Type | Description |
---|---|---|
municipality | Integer | Unique identifier, referencing ID in output/municipalities.csv |
party | Integer | Party or party grouping, referencing ID in output/parties.csv |
year | Integer | Election year |
support | Double | Party strength in fraction of 1 (e.g. 0.2 = 20%) |
# von https://mran.revolutionanalytics.com/web/packages/checkpoint/vignettes/using-checkpoint-with-knitr.html
cat("library(magrittr)
library(tidyr)
library(dplyr)
library(readxl)
library(ggplot2)",
file = "manifest.R")
package_date <- "2015-08-27"
if(!require(checkpoint)) {
if(!require(devtools)){
install.packages("devtools", repos = "http://cran.us.r-project.org")
require(devtools)
}
devtools::install_github("checkpoint", username = "RevolutionAnalytics", ref = "v0.3.2", repos = "http://cran.us.r-project.org")
require(checkpoint)
}
if(!dir.exists("~/.checkpoint")){
dir.create("~/.checkpoint")
}
checkpoint(snapshotDate = package_date, project = path_to_wd, verbose = T, scanForPackages = T, use.knitr = F)
rm(package_date)
source("manifest.R")
unlink("manifest.R")
knitr::read_chunk('scripts/preprocessing.R')
source('scripts//preprocessing.R')
# load data
# read in input file (xlsx) - need to load a slightly modified input file as the original xlsx makes it really complicated
raw_data <- read_excel("input/parteienstaerke_mod_2.xlsx", sheet = 1, skip = 12, col_names = T)[1:2345,] # replace this subsetting with bugfix in readxl
parties <- read_excel("input/parteienstaerke_mod_2.xlsx", sheet = 4)[1:24,]
# all years
years <- seq(1971, 2011, 4)
length(years) # we have 11 different time stamps
## [1] 11
# first problem: not all years have the same amount of parties, thus we need to fill up by year
# first split the columns into sets for each year.
# doublecheck that "1" is contained as many times as we have years
table(names(raw_data) == "1")[2] == length(years)
## TRUE
## TRUE
# this is true, so we can split the columns by this column name
indices <- which(names(raw_data) == "1")
yearlyDataframes <- list()
for(i in 1:length(indices)){
start <- indices[i]
stop <- if(is.na(indices[i + 1])) ncol(raw_data) else indices[i + 1] - 1
yearlyDataframes[[i]] <- prepareYearlyDataForVis(raw_data[,start:stop], years[i])
}
# combine
filled_data <- cbind(raw_data[, 1:indices[1] - 1], yearlyDataframes)
# doublecheck: column numbers should be 2 + 11 * 24
dim(filled_data)[2] == 2 + 11 * 24
## [1] TRUE
# aggregate parties (cf. to column OLD_ID in parties.csv)
for(year in years){
# FDP
filled_data <- combineAndDelete(1, c(1, 5), year)
# Kleine Rechtsparteien
filled_data <- combineAndDelete(16, c(16, 15, 14, 17), year)
# Kleine Linksparteien
filled_data <- combineAndDelete(9, c(9, 27, 10, 11, 12), year)
# Andere
filled_data <- combineAndDelete(8, c(8, 6, 26, 35, 33), year)
}
# gather into long form
long_data <- filled_data %>%
gather(ParteiJahr, Staerke,-c(BFSNr, GdeName))
# add new columns Jahr and Partei from splitted ParteiJahr
long_data %<>% separate(col = ParteiJahr, into = c("Partei","Jahr")) %>%
mutate(Jahr = as.Date(Jahr, "%Y")) %>%
tbl_df()
# divide strength values through 100
long_data %<>% mutate(Staerke = Staerke / 100)
# only retain BFSNr
long_data %<>% select(BFSNr, Partei, Jahr, Staerke)
# country
# read in input file (xlsx) - need to load a slightly modified input file as the original xlsx makes it really complicated
raw_data_country <- read_excel("input/parteienstaerke_national.xlsx", sheet = 1, skip = 0, col_names = T)[1:24,] # replace this subsetting with bugfix in readxl
long_data_country <- raw_data_country %>%
gather(Jahr, Staerke, -Partei) %>%
mutate(Jahr = as.Date(Jahr, "%Y")) %>%
select(Partei, Jahr, Staerke) %>%
rename(party = Partei, year = Jahr, support = Staerke) %>%
mutate(support = support / 100, year = as.numeric(format.Date(year, "%Y")))
# aggregate parties
for(year in years){
# FDP
long_data_country <- combineAndDeleteCountry(1, c(1, 5), year)
# Kleine Rechtsparteien
long_data_country <- combineAndDeleteCountry(16, c(16, 15, 14, 17), year)
# Kleine Linksparteien
long_data_country <- combineAndDeleteCountry(9, c(9, 27, 10, 11, 12), year)
# Andere
long_data_country <- combineAndDeleteCountry(8, c(8, 6, 26, 35, 33), year)
}
dim(long_data_country)[1] == 11 * 24
## [1] TRUE
# delete incomplete cases
# long_data_country[complete.cases(long_data_country),]
parties <- read.csv("input/parties.csv") # same here
write.csv(parties, file = "output/parties.csv", row.names = F, na = "")
# change variable names
long_data %<>% rename(year = Jahr, municipality = BFSNr, party = Partei, support = Staerke) %>%
mutate(year = as.numeric(format.Date(year, "%Y")), party = as.numeric(party))
totalSupport <- long_data %>%
group_by(year, municipality) %>%
summarize(total = sum(support, na.rm = T)) %>%
filter(!(total > 0.999999999 & total < 1.000000001))
totalSupport %>%
filter(total > 0.0)
## Source: local data frame [0 x 3]
## Groups: year
##
## Variables not shown: year (dbl), municipality (dbl), total (dbl)
mainDir <- getwd()
subDir <- "output/by_party"
dir.create(file.path(mainDir, subDir), showWarnings = FALSE)
# For the map, we need 13 files (one for each of the 12 parties or groups of parties, plus one for 'all parties'). Each file represents support for a single party, and contains data for all municipalities in all years. In the case of 'all parties', it's the support for the strongest party in that municipality in that year.
# split by party and save
for(i in parties$ID){
i <- as.character(i)
subset_data <- subset(long_data, party == i)
subset_data <- subset_data[complete.cases(subset_data),]
write.csv(subset_data, file = paste("output/by_party/party_", i, ".csv", sep = ""), row.names = F)
}
# save value for dominant party in each year/municipality
dominance <- long_data %>%
group_by(municipality, year) %>%
arrange(desc(support)) %>%
# only take the most dominant party in each year
slice(1) %>%
ungroup()
# only retain those where support is not NA
subset_data <- dominance[complete.cases(dominance),]
write.csv(subset_data, file = "output/by_party/dominant_party.csv", row.names = F)
# For the area chart, we need many more files, one for each municipality plus one for all of Switzerland (we estimate that's about 2600 files). Each of these files should contain data for all parties in all years.
mainDir <- getwd()
subDir <- "output/by_municipality"
dir.create(file.path(mainDir, subDir), showWarnings = FALSE)
# split by municipality and save
for(i in unique(long_data$municipality)){
subset_data <- subset(long_data, municipality == i) %>%
select(party, year, support)
subset_data <- subset_data[complete.cases(subset_data),]
# check if support adds up to 100
totalSupport <- subset_data %>%
group_by(year) %>%
summarize(total = sum(support)) %>%
filter(!(total > 0.999999999 & total < 1.000000001))
# if it doesn't only retain those cases where it is not 0
totalSupport %<>%
filter(!(total < 0.000000001))
if(dim(totalSupport)[1] != 0){
warning("Party strengths do not sum up to 1 (or are not 0) in municipality ", i, " and year ", totalSupport$year)
}
write.csv(subset_data, file = paste("output/by_municipality/municipality_", i, ".csv", sep = ""), row.names = F)
}
# save a file for whole switzerland
subset_data <- long_data_country[complete.cases(long_data_country),]
# check if support adds up to 100
totalSupport <- subset_data %>%
group_by(year) %>%
summarize(total = sum(support)) %>%
filter(!(total > 0.999999999 & total < 1.000000001))
if(dim(totalSupport)[1] != 0){
warning("Party strengths do not sum up to 1 in Switzerland and year ", totalSupport$year)
}
write.csv(subset_data, file = "output/by_municipality/national.csv", row.names = F)
alldata <- long_data %>% select(municipality, party, year, support)
# only keep complete cases (e.g. remove NA)
write.csv(alldata[complete.cases(alldata),], "output/partystrengths.csv", row.names = F)
# save municipality names
# first, load municipality name translations (src http://www.bfs.admin.ch/bfs/portal/de/index/infothek/nomenklaturen/blank/blank/gem_liste/03.Document.90142.xls, slightly edited to make it easier to read in)
translated_names <- read_excel("input/gdenamen_translation.xlsx", sheet = 1)[1:75,]
municipalities <- filled_data %>%
select(BFSNr, GdeName) %>%
rename(ID = BFSNr, Name_Official = GdeName)
municipalities %<>%
left_join(translated_names, by = "Name_Official")
write.csv(municipalities, file = "output/municipalities.csv", row.names = F, na = "")