In mid-June, the so-called ‘Ibiza Commission of Inquiry’ heard its last witness after almost one year of work. The commission had been established by the Austrian Parliament to investigate the prevalence of corruption during the coalition government of the ÖVP and FPÖ (Dec 2017 - May 2019). This post digs into its published transcripts. The emphasis of the post is, first, on demonstrating how to extract the text from the hearings’ transcripts in R and, second, on crunching a few numbers to get some substantive insights. The compiled dataset, covering all statements (incl. questions, answers etc.), is available for download.
In mid-June, the Austrian Parliament’s ‘Commission of Inquiry concerning the alleged corruptibility of the turquoise-blue Federal Government’1 heard its last respondent (‘Auskunftsperson’). More informally, the commission is simply called the ‘Ibiza inquiry’, named after the location of a secretly taped video which showed high-ranking members of the extreme right FPÖ party speaking freely - liberated by alcohol - with a fake niece of a Russian oligarch about actual or intended corruption in Austria’s political system. As a consequence of the video’s release, the then ruling coalition government of ÖVP (‘turquoise’) and FPÖ (‘blue’) collapsed and the commission was set up. In short, a pretty wild and bewildering story.
The commission, empowered with a fairly broad mandate and armed with access to plenty of WhatsApp and other messages, offered an unprecedented view into the inner workings of Chancellor Kurz’s first government and the mindset of some of its protagonists. While the initial impetus for setting up the commission was first and foremost the Ibiza video featuring the FPÖ leadership, the inquiry’s focus gradually shifted (not least due to the opposition’s efforts) to the wheeling and dealing of Chancellor Sebastian Kurz’s ÖVP and its affiliates.
Having said this, the purpose of this post is not to give a recap of the inquiry or its results; it is first and foremost procedural, in the sense that it details the necessary steps to extract statements from the inquiry’s transcripts and subsequently obtain some exemplary insights with R. As always, if you spot any error, feel free to contact me, ideally via Twitter DM. And if you use any of the work provided here, I’d be grateful if you acknowledged my input.
If you are not interested in any of the coding steps generating the data, jump directly to the Analysis section, which still contains plenty of code but doesn’t burden you with how to obtain the data in the first place.
2 Getting the data
In this section I will lay out the necessary steps to obtain the relevant data. I’ll first detail each step for one single, sample session of the inquiry commission. Subsequently, I’ll apply the demonstrated steps to all sessions by means of one general function.
But before that, let’s load the packages we’ll need along the way and define some templates.
Code
# libraries -----------------------------------------------------------------
# load the required libraries
library(tidyverse)
library(rvest)
library(xml2)
library(fuzzyjoin)
library(reactable)
library(reactablefmtr)
library(hrbrthemes)

# define party colors
vec_party_col <- c(
  "ÖVP"   = "#5DC2CC",
  "SPÖ"   = "#FC0204",
  "Grüne" = "#A3C630",
  "FPÖ"   = "#005DA8",
  "NEOS"  = "#EA5290"
)
2.1 Getting links to transcripts
In a first step, let’s get the links leading to the transcripts of the sessions. This link takes us to the commission’s overview page, which lists all published documents, including the sessions’ transcripts (Protokolle).
The code below extracts links related to the latter. Explanatory comments are inserted directly into the code chunk.
Code
# link to overview page
site_link <- "https://www.parlament.gv.at/PAKT/VHG/XXVII/A-USA/A-USA_00002_00906/index.shtml#tab-VeroeffentlichungenBerichte"

# get links to pages where links to protocols are located;
# a link has to include the word 'Protokolls' in its text
df_links_to_subpages <- site_link %>%
  rvest::read_html() %>%
  # define a filter to get only the links related to transcripts (protocols);
  # filters links based on the text/name of the links
  rvest::html_elements(xpath = "//a[contains(text(), 'Protokolls')]") %>%
  html_attr("href") %>% # extracts links
  enframe(name = NULL, value = "link") %>%
  # links of interest include "KOMM"
  filter(str_detect(link, regex("KOMM"))) %>%
  # complete the link
  mutate(link_to_subpages = paste0("https://www.parlament.gv.at/", link)) %>%
  select(link_to_subpages)
Here are the first ten links:
Each of these links leads to a subpage which provides details on the record in question as well as the link to the actual file containing the transcribed text. Below is one such subpage.
Note the link leading to the HTML version of the transcript. To access the transcript we need the link’s target address. The function below extracts the link leading to the actual text. Subsequently, the function is applied to all subpage links obtained in the previous step.
Code
# function to extract link to protocol from details page
fn_get_link_to_record <- function(link_to_subpage) {
  link_to_subpage %>%
    rvest::read_html() %>%
    rvest::html_elements("a") %>%
    html_attr("href") %>%
    enframe(name = NULL, value = "link") %>%
    # link to transcript contains "fnameorig"
    filter(str_detect(link, regex("fnameorig"))) %>%
    # complete link
    mutate(link_to_record = paste0("https://www.parlament.gv.at/", link)) %>%
    select(link_to_record)
}

library(furrr)
plan(multisession, workers = 2)

# apply function to all links
df_links_to_records <- df_links_to_subpages %>%
  pull(link_to_subpages) %>%
  purrr::set_names() %>%
  future_map_dfr(., fn_get_link_to_record, .id = "link_to_subpages")
What we obtain is a dataframe with the links leading to all transcripts of the inquiry (only the first 5 are shown).
2.2 Extracting text
Now with the links to the transcripts available, let’s have a look at one such text, e.g. here.
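For the walk-through below, a single sample link is needed. As a minimal sketch (the toy dataframe stands in for the df_links_to_records obtained above; the URLs are placeholders), `link_to_record` could simply be taken from the first row:

```r
library(dplyr)

# toy stand-in for df_links_to_records from the previous step
df_links_to_records_demo <- tibble::tibble(
  link_to_record = c(
    "https://www.parlament.gv.at/record-1",
    "https://www.parlament.gv.at/record-2"
  )
)

# pick one sample transcript link for the walk-through below
link_to_record <- df_links_to_records_demo %>%
  slice(1) %>%
  pull(link_to_record)
```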
Importantly, notice that statements given before the inquiry commission are always introduced with the speaker’s name (and position) in bold and underlined letters. This (almost) consistently applied formatting will eventually allow us to distinguish between a statement and its speaker, and to detect the start/end of different statements. I’ll first extract these names and subsequently assign them to their respective statements.
2.2.1 Extract speakers
To extract the speakers from the text I’ll once again use the powerful rvest package. Its html_elements function is used to identify those text parts which are bold and underlined. As for the use of xml_contents, I am grateful for the answer to this Stackoverflow question.
Code
# get those elements which are bold and underlined
df_speakers <- link_to_record %>%
  rvest::read_html() %>%
  # extract elements which are bold and underlined;
  # note that the sequence has to be that of the html tags
  rvest::html_elements("b u") %>%
  map(., xml_contents) %>%
  map(., html_text) %>%
  enframe(value = "speaker") %>%
  mutate(speaker = str_trim(speaker, side = c("both"))) %>%
  mutate(speaker = str_squish(speaker)) %>%
  # keep only those elements which end with a colon
  # filter(str_detect(speaker, regex("\\:$"))) %>%
  # remove colon at end; needed to unify names of speakers where some
  # instances end/do not end with a colon
  mutate(speaker = str_remove(speaker, regex("\\:$"))) %>%
  filter(str_count(speaker, regex("\\S+")) > 1) %>%
  # removes heading of transcript which is also bold and underlined
  filter(!str_detect(speaker, regex("^Befragung der"))) %>%
  distinct(speaker)
Here’s the result for our sample session.
# A tibble: 12 × 1
speaker
<chr>
1 Verfahrensrichter Dr. Wolfgang Pöschl
2 Vorsitzender Mag. Wolfgang Sobotka
3 Sebastian Kurz
4 Abgeordneter Mag. Klaus Fürlinger (ÖVP)
5 Mag. Klaus Fürlinger (ÖVP)
6 Abgeordneter Kai Jan Krainer (SPÖ)
7 Abgeordneter Dr. Christian Stocker (ÖVP)
8 Abgeordneter Mag. Andreas Hanger (ÖVP)
9 Abgeordnete Mag. Nina Tomaselli (Grüne)
10 Abgeordneter Christian Hafenecker, MA (FPÖ)
11 Vorsitzender Mag. Wolfgang Sobotk
12 Abgeordneter David Stögmüller (Grüne)
As you can see, the approach worked rather well, but not 100 % perfectly. A few rows contain incomplete speaker names or related fragments (e.g. Sobotk, missing the final a). These unwanted results are - as far as I can tell - due to some inconsistent formatting of speakers’ names (even if their appearance in the document is identical) or editorial errors (e.g. the position of a speaker is mentioned in one instance but not in another: Mag. Klaus Fürlinger appears both with and without the prefix Abgeordneter). These glitches have to be corrected ‘manually’. The code chunk below does this for our specific sample case.
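A minimal sketch of such a manual clean-up, using a toy dataframe with the two problematic entries from the sample output (the result name df_speakers_clean is my own), could look like this:

```r
library(dplyr)
library(stringr)

# toy stand-in for the raw df_speakers of the sample session
df_speakers_demo <- tibble::tibble(speaker = c(
  "Vorsitzender Mag. Wolfgang Sobotka",
  "Vorsitzender Mag. Wolfgang Sobotk",          # truncated fragment
  "Abgeordneter Mag. Klaus Fürlinger (ÖVP)",
  "Mag. Klaus Fürlinger (ÖVP)"                  # variant without position
))

df_speakers_clean <- df_speakers_demo %>%
  # drop the truncated fragment ('Sobotk' instead of 'Sobotka')
  filter(speaker != "Vorsitzender Mag. Wolfgang Sobotk") %>%
  # unify the variant lacking the 'Abgeordneter' prefix
  mutate(speaker = str_replace(
    speaker,
    "^Mag\\. Klaus Fürlinger \\(ÖVP\\)$",
    "Abgeordneter Mag. Klaus Fürlinger (ÖVP)"
  )) %>%
  distinct(speaker)
```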
After these modifications we get a clean dataframe of those who made statements during the session in question (our sample link).
# A tibble: 10 × 1
speaker
<chr>
1 Verfahrensrichter Dr. Wolfgang Pöschl
2 Vorsitzender Mag. Wolfgang Sobotka
3 Sebastian Kurz
4 Abgeordneter Mag. Klaus Fürlinger (ÖVP)
5 Abgeordneter Kai Jan Krainer (SPÖ)
6 Abgeordneter Dr. Christian Stocker (ÖVP)
7 Abgeordneter Mag. Andreas Hanger (ÖVP)
8 Abgeordnete Mag. Nina Tomaselli (Grüne)
9 Abgeordneter Christian Hafenecker, MA (FPÖ)
10 Abgeordneter David Stögmüller (Grüne)
In a later step, we will search the entire transcript of the session for the presence of these speakers’ names (incl. position and title) to identify the start of a statement. Since this pattern matching will require regular expressions (regex), the names have to be modified accordingly (e.g. a dot . has to be escaped and becomes \\., for further info see here).
Code
# create regex patterns from speaker names
df_speakers <- df_speakers %>%
  mutate(speaker_pattern = str_replace_all(speaker, "\\.", "\\\\.") %>%
           str_replace_all(., "\\:", "\\\\:") %>%
           # str_replace_all(., "\\,", "\\\\,") %>%
           str_replace_all(., "\\)", "\\\\)") %>%
           str_replace_all(., "\\(", "\\\\(") %>%
           # match has to be at start of string; avoids mismatches where a
           # name appears in the middle of the text
           paste0("^", .))
2.2.2 Identify and extract statements
Now, let’s get the entire text of a transcript. Again, I use the rvest package; this time, however, the HTML tag <p> is targeted.
Code
## extract entire text from transcript
df_text <- link_to_record %>%
  rvest::read_html() %>%
  rvest::html_elements("p") %>%
  rvest::html_text() %>%
  enframe(., value = "text", name = "row") %>%
  mutate(text = str_squish(text) %>% str_trim(., side = c("both")))
The result is a dataframe with one row for each line in the transcript. The challenge is now to identify those rows where a speaker’s statement starts and where it ends, i.e. where the next speaker’s statement begins. As already mentioned above, I’ll do this by pattern matching the names of the retrieved speakers against the content of each row. In simpler terms: does the row start with the name of a speaker which we previously identified by their bold and underlined format? To do this I’ll make use of fuzzyjoin::regex_left_join.
Code
# get speaker name from text
df_text_2 <- df_text %>%
  # search for match only in opening section of line
  # mutate(text_start = stringr::str_sub(text, start = 1, end = 40)) %>%
  fuzzyjoin::regex_left_join(., df_speakers,
    by = c("text" = "speaker_pattern")
  )
Have a look at the result below. There is now a new column indicating the start of a statement with the speaker’s name (since transcripts start with some introductory text, speakers’ names appear only on later pages of the table).
In the next step, I 1) dplyr::fill all empty speaker rows after a statement’s start with the speaker’s name (details here), and 2) create a numerical grouping variable which increases each time the speaker changes (= start of a new statement). Remaining rows without a speaker are removed since they contain only additional text, not statements.
Code
df_text_2 <- df_text_2 %>%
  # fill rows with speaker and pattern as basis for grouping
  fill(speaker, .direction = "down") %>%
  fill(speaker_pattern, .direction = "down") %>%
  filter(!is.na(speaker)) %>%
  # create grouping id; later needed to collapse rows
  mutate(grouping_id = ifelse(
    speaker == dplyr::lag(speaker, default = "start"), 0, 1
  )) %>%
  mutate(grouping_id_cum = cumsum(grouping_id)) %>%
  # remove speaker info from actual spoken text
  mutate(text = str_remove(text, speaker_pattern)) %>%
  # remove colon; this approach keeps annotations which are included between
  # speaker name and colon, e.g. "(zur Geschäftsbehandlung)"
  mutate(text = str_remove(text, regex("\\:"))) %>%
  mutate(text = str_trim(text, side = c("both")))
With each row now attributed to a speaker, and each statement assigned a distinct indicator, we can collapse a statement’s multiple lines into one single row with its speaker.
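The collapse itself can be sketched as follows (a minimal example on toy data; the demo names and the paste-based concatenation are my assumptions, the grouping column grouping_id_cum is the one created above):

```r
library(dplyr)

# toy stand-in for df_text_2: two statements, the first spread over two rows
df_text_2_demo <- tibble::tibble(
  speaker = c("A", "A", "B"),
  grouping_id_cum = c(0, 0, 1),
  text = c("first line", "second line", "another statement")
)

# collapse all rows belonging to one statement into a single row
df_text_3_demo <- df_text_2_demo %>%
  group_by(grouping_id_cum, speaker) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")
```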
The resulting table already contains most of the data required for the later analysis (for one specific sample respondent). I complement it with additional data on an MP’s party affiliation (if applicable), a speaker’s position (if stated), and the session’s date, duration and number.
Code
df_text_3 <- df_text_3 %>%
  # extract party of MP; pattern matches only if speaker starts with 'Abgeordnete'
  mutate(speaker_party = str_extract(
    speaker,
    regex("(?<=^Abgeordnete[^\\(]{1,40})\\(.*\\)$")
  )) %>%
  mutate(speaker_party = str_extract(speaker_party, regex("[:alpha:]+"))) %>%
  # extract position of speaker
  mutate(speaker_position = case_when(
    str_detect(speaker, regex("^Abgeord")) ~ "Abgeordneter",
    str_detect(speaker, regex("^Verfahrensanwalt")) ~ "Verfahrensanwalt",
    str_detect(speaker, regex("^Vorsitzender-Stellvertreter")) ~ "Vorsitzender-Stellvertreter",
    str_detect(speaker, regex("^Vorsitzende")) ~ "Vorsitzende/r",
    str_detect(speaker, regex("^Verfahrensrichter-Stellvertreter")) ~ "Verfahrensrichter-Stellvertreter",
    str_detect(speaker, regex("^Verfahrensricht")) ~ "Verfahrensrichter",
    str_detect(speaker, regex("^Vertrauensperson")) ~ "Vertrauensperson",
    TRUE ~ as.character("Auskunftsperson")
  ))
Code
# add session details

# extract name of respondent
vec_respondent <- df_text %>%
  filter(str_detect(text, regex("^Befragung der Auskunftsperson"))) %>%
  pull(text) %>%
  str_remove(., "^Befragung der Auskunftsperson ")

# extract duration of ENTIRE session (not only of one respondent)
vec_duration <- df_text %>%
  filter(str_detect(text, regex("^Gesamtdauer der"))) %>%
  pull(text) %>%
  str_extract(., regex("(?<=Sitzung).*"))

# extract session number
vec_session_no <- df_text %>%
  filter(str_detect(text, regex("^Gesamtdauer der"))) %>%
  pull(text) %>%
  str_extract(., regex("\\d*(?=\\. Sitzung)"))

# extract date of session
vec_date <- df_text %>%
  filter(str_detect(text, regex("^Montag|^Dienstag|^Mittwoch|^Donnerstag|^Freitag|^Samstag|^Sonntag"))) %>%
  pull(text)

df_text_4 <- df_text_3 %>%
  mutate(
    session_date = vec_date,
    session_no = vec_session_no,
    session_duration = vec_duration,
    respondent = vec_respondent
  ) %>%
  select(session_no, session_date, session_duration, respondent,
         speaker, speaker_position, speaker_party, text, -grouping_id_cum)
2.2.3 Party affiliation of MP asking question(s)
A further detail we can extract is the party affiliation of the MP who asks a question of the respondent. In other words, each row containing a respondent’s statement obtains a column indicating the party of the MP who asked the preceding question. Doing so, however, implies the assumption that any row/statement by an MP which precedes a row/statement by the respondent is actually a question to the respondent. This assumption may not always be warranted: in some cases, consecutive statements by an MP and a respondent may both refer to an earlier question by e.g. the commission’s chair and hence not be an interaction between the two.
Bearing this qualification in mind, the approach seems promising enough to identify interactions between an MP and a respondent and to attribute the former’s party affiliation.
Code
# identify party of the MP asking the question
df_text_4 <- df_text_4 %>%
  mutate(respondent_questioner_party = case_when(
    str_detect(lag(speaker_position), regex("^Abge")) &
      speaker_position == "Auskunftsperson" ~ lag(speaker_party),
    TRUE ~ NA_character_
  )) %>%
  # calculate answer length without annotations included in transcripts
  mutate(text_length_old = stringi::stri_count_words(text, locale = "de")) %>%
  mutate(text_length = str_remove_all(text, regex("\\([^\\(\\)]*\\)")) %>%
           stringi::stri_count_words(., locale = "de"))