Elections in Vienna are today and while glancing through the electoral lists I couldn’t help but paying attention to candidates’ birth years. Maybe that’s an age thing… This got me thinking that I haven’t seen any more systematic analysis of parties’/candidates’ age profile. So as a modest contribution to this end, here are my two cents. Again, I’ll focus mainly on the pertaining steps in R and related number crunching. Due to a lack of time and not being an expert on Vienna’s electoral system, I’ll be brief when it comes to substantive matters. But the presented results hopefully provide sufficient material to dig into.
As always, if you see any glaring error or have any constructive comment, feel free to let me now (best via twitter DM).
3 Data
Again, as so often, the trickiest part is to get the data ‘liberated’ from the format it is provided in. The entire list of candidates is published in this pdf. Note that there are three lists: One for the city council (‘Gemeinderat’; composed on the basis of the results in 18 multi-member electoral districts), one for the 23 district councils (‘Bezirksrat’; one in each district), and the ‘city election proposal’ (‘Stadtwahlvorschlag’; admittedly a somewhat clumsy translation). The latter doesn’t constitute a body in itself, but serves to allocate mandates which remained unassigned after counting the votes for the city council (‘zweites Ermittlungsverfahren’/similar to the d’Hondt procedure).
When it comes to extracting the data from the linked pdf, a difficulty may arise due to the two-column format of the document. Hence, simple row-wise extraction doesn’t help much since it would put candidates together which could be from different parties. Similarly, simply isolating the two columns and extracting candidates would also not do the trick since breaks betwee parties, districts etc run over two, and not one column. To illustrate this, I resorted to cutting edge technology and drew the two arrows below:
Luckily, the tabulizer package is not only very powerful when it comes to extracting text/data from a pdf, it is also sophisticated enough to take into consideration the text flow highlighted above. I am not familiar with the underlying heuristic, but I assume it is contingent on consistently formatted section headings. Hence, empowered with this tool, retrieving the text becomes rather effortless. The subsequent steps are a battery of regular expressions to extract the specific data we are interested in. To see the code, unfold the snippets below.
Code: Extract text from pdf
Code
df_raw <- tabulizer::extract_text(file=here::here("_blog_data", "vienna_elections_2020","amtsblatt2020.pdf"),pages=c(1:115),encoding="UTF-8") %>%#capital letters of UTF!enframe(name=NULL,value="text_raw")
After these few steps we have a searchable/sortable table of all candidates (or better candidatures since one person can be candidate on multiple districts/lists). There have been 8,983 candidatures by 5,038 individuals.
The table essentially provides all necessary data for the subsequent analysis. While the pdf does not include the exact birth date of each candidate, it provides us with their birth years which we can take as a proxy for age. Note that I also extracted candidates’ residence zip code to see how often place of residence and candidature actually overlap (see below).
Source:
data: https://www.wien.gv.at/politik/wahlen/grbv/2020/ analysis: Roland Schmidt | @zoowalk | https://werk.statt.codes
4 Analysis
4.1 Oldest and youngest candidates
Let’s now look at the overall youngest and oldest candidates. We could retrieve this information already from the main table provided above (sort column birth year). Here, however, let’s nest candidates’ different candidatures for the sake of clarity.
The oldest candidate is Waschiczek Wolfgang, who was born in 1928. Not bad.
4.2 Avgerage birth year per election and party
Let’s now look at the average year of birth of parties’ candidates on each of the different electoral levels. The table below provides the median, mean and standard deviation for each party. The thin white line in the density plots on the right indicates the median.