A look at the transcripts of the ‘Ibiza Inquiry’

In mid-June, the so-called ‘Ibiza Commission of Inquiry’ heard its last witness after almost one year at work. The commission had been established by the Austrian Parliament to investigate the prevalence of corruption during the coalition government of the ÖVP and FPÖ (December 2017 to May 2019). This post digs into its published transcripts. The emphasis is, first, on demonstrating how to extract the text from the hearings’ transcripts in R and, second, on crunching a few numbers to get some substantive insights. The compiled dataset, covering all statements (incl. questions, answers etc.), is available for download.
web scraping

7 Oct 2021

1 Context

In mid-June, the Austrian Parliament’s ‘Commission of Inquiry concerning the alleged corruptibility of the turquoise-blue Federal Government’1 heard its last respondent (‘Auskunftsperson’). More informally, the commission is simply called the ‘Ibiza inquiry’, named after the location of a secretly taped video which showed high-ranking members of the far-right FPÖ party speaking freely - loosened by alcohol - with a fake niece of a Russian oligarch about actual or intended corruption in Austria’s political system. As a consequence of the video’s release, the then ruling coalition government of ÖVP (‘turquoise’) and FPÖ (‘blue’) collapsed and the commission was set up. In short, a pretty wild and bewildering story.

The commission, empowered with a fairly broad mandate and armed with access to plenty of WhatsApp and other messages, offered an unprecedented view into the inner workings of Chancellor Kurz’s first government and the mindset of some of its protagonists. While the initial impetus for setting up the commission was first and foremost the Ibiza video featuring the FPÖ leadership, the inquiry’s focus gradually shifted (not least due to the opposition’s efforts) to the wheeling and dealing of Chancellor Sebastian Kurz’s ÖVP and its affiliates.

Having said this, the purpose of this post is not to recap the inquiry or its results; it is first and foremost procedural, detailing the steps needed to extract statements from the inquiry’s transcripts and to subsequently obtain some exemplary insights with R. As always, if you spot any error, feel free to contact me, best via Twitter DM. And if you use any of the work provided here, I’d be grateful if you acknowledged my input.

If you are not interested in the coding steps generating the data, jump directly to the Analysis section, which still contains plenty of code but doesn’t burden you with how to obtain the data in the first place.

2 Getting the data

In this section I will lay out the steps needed to obtain the relevant data. I’ll first detail each step for one single sample session of the inquiry commission. Subsequently, I’ll apply the demonstrated steps to all sessions by means of one general function.

But before that, let’s load the packages we’ll need along the way and define some templates.

# libraries ---------------------------------------------------------------
# load the required libraries
library(tidyverse)
library(rvest)
library(fuzzyjoin)

# define party colors (hex values are approximations of the parties' colors)
vec_party_col <- c(
  "ÖVP"   = "#5DC2CC",
  "SPÖ"   = "#FF0000",
  "FPÖ"   = "#0056A2",
  "Grüne" = "#88B626",
  "NEOS"  = "#E84188"
)

2.2 Extracting text

Now with the links to the transcripts available, let’s have a look at one such text, e.g. here.

Importantly, notice that statements given before the inquiry commission are always introduced with the speaker’s name (and position) in bold and underlined letters. This (almost) consistently applied formatting will eventually allow us to distinguish between the actual statement and its speaker, and the start/end of different statements. I’ll first extract these names and subsequently assign these names to their respective statements.

2.2.1 Extract speakers

To extract the speakers from the text I’ll once again use the powerful rvest package. Its html_elements function is used to identify those text parts which are in bold and underlined. For the use of xml_contents I am grateful for the answer to this Stack Overflow question.
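To see how the selector behaves, here is a minimal sketch on a made-up HTML snippet (the markup imitates the transcripts’ bold + underlined convention; the content is invented):

```r
library(rvest)

# toy snippet imitating the transcripts' speaker formatting
html <- minimal_html(
  '<p><b><u>Vorsitzender Mag. Wolfgang Sobotka:</u></b> Ich eröffne die Sitzung.</p>'
)

html %>%
  html_elements("b u") %>%
  html_text()
#> [1] "Vorsitzender Mag. Wolfgang Sobotka:"
```

With that behaviour confirmed, the same selector can be applied to the actual transcript.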

# get those elements which are bold and underlined
df_speakers <- link_to_record %>%
    rvest::read_html() %>%
    # extract elements which are bold and underlined; the selector sequence has to match that of the html tags
    rvest::html_elements("b u") %>%
    purrr::map(xml2::xml_contents) %>%
    purrr::map(rvest::html_text) %>%
    unlist() %>%
    enframe(value = "speaker") %>%
    mutate(speaker = str_trim(speaker, side = "both")) %>%
    mutate(speaker = str_squish(speaker)) %>%
    # keep only those elements which end with a colon;
    # filter(str_detect(speaker, regex("\\:$"))) %>%
    # remove colon at end; needed to unify names of speakers where some instances end/do not end with a colon
    mutate(speaker = str_remove(speaker, regex("\\:$"))) %>%
    # drop one-word fragments
    filter(str_count(speaker, regex("\\S+")) > 1) %>%
    # remove heading of transcript which is also bold and underlined
    filter(!str_detect(speaker, regex("^Befragung der"))) %>%
    # keep each speaker only once
    distinct(speaker)

Here’s the result for our sample session.

# A tibble: 12 × 1
   speaker
   <chr>
 1 Verfahrensrichter Dr. Wolfgang Pöschl      
 2 Vorsitzender Mag. Wolfgang Sobotka         
 3 Sebastian Kurz                             
 4 Abgeordneter Mag. Klaus Fürlinger (ÖVP)    
 5 Mag. Klaus Fürlinger (ÖVP)                 
 6 Abgeordneter Kai Jan Krainer (SPÖ)         
 7 Abgeordneter Dr. Christian Stocker (ÖVP)   
 8 Abgeordneter Mag. Andreas Hanger (ÖVP)     
 9 Abgeordnete Mag. Nina Tomaselli (Grüne)    
10 Abgeordneter Christian Hafenecker, MA (FPÖ)
11 Vorsitzender Mag. Wolfgang Sobotk          
12 Abgeordneter David Stögmüller (Grüne)      

As you can see, the approach worked rather well, but not 100 % perfectly. A few rows contain incomplete names of speakers or related fragments (e.g. Sobotk, missing the final a). These unwanted results are - as far as I can tell - due to some inconsistent formatting of speakers’ names (even if their appearance in the document is identical) or editorial inconsistencies (e.g. the position of a speaker is mentioned in one instance but not in another, as with Abgeordneter Mag. Klaus Fürlinger vs Mag. Klaus Fürlinger). These glitches have to be corrected ‘manually’. The code chunk below does this for our specific sample case.
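Here is a sketch of such manual fixes, run on a toy reproduction of the glitches above (the names df_glitched/df_cleaned are mine; the actual corrections are session-specific):

```r
library(dplyr)
library(stringr)

# toy reproduction of the two glitches seen in the table above
df_glitched <- tibble::tibble(speaker = c(
  "Vorsitzender Mag. Wolfgang Sobotka",
  "Vorsitzender Mag. Wolfgang Sobotk",
  "Abgeordneter Mag. Klaus Fürlinger (ÖVP)",
  "Mag. Klaus Fürlinger (ÖVP)"
))

df_cleaned <- df_glitched %>%
  # repair the truncated surname
  mutate(speaker = str_replace(speaker, "Sobotk$", "Sobotka")) %>%
  # add the position where it was dropped
  mutate(speaker = if_else(speaker == "Mag. Klaus Fürlinger (ÖVP)",
                           "Abgeordneter Mag. Klaus Fürlinger (ÖVP)",
                           speaker)) %>%
  # collapse the now-identical duplicates
  distinct(speaker)
```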

After these modifications we get a clean dataframe of those who made statements during the session in question (our sample link).

# A tibble: 10 × 1
   speaker
   <chr>
 1 Verfahrensrichter Dr. Wolfgang Pöschl      
 2 Vorsitzender Mag. Wolfgang Sobotka         
 3 Sebastian Kurz                             
 4 Abgeordneter Mag. Klaus Fürlinger (ÖVP)    
 5 Abgeordneter Kai Jan Krainer (SPÖ)         
 6 Abgeordneter Dr. Christian Stocker (ÖVP)   
 7 Abgeordneter Mag. Andreas Hanger (ÖVP)     
 8 Abgeordnete Mag. Nina Tomaselli (Grüne)    
 9 Abgeordneter Christian Hafenecker, MA (FPÖ)
10 Abgeordneter David Stögmüller (Grüne)      

In a later step, we will search the entire transcript of the session for the presence of these speakers’ names (incl. position and title) to identify the start of a statement. Since this pattern matching will require regular expressions (regex), the names have to be modified accordingly (e.g. a dot . has to be escaped and becomes \\., for further info see here).
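To see what this escaping does, here is a minimal base-R sketch on one speaker name (gsub stands in for the stringr calls used in the pipeline):

```r
speaker <- "Abgeordneter Mag. Klaus Fürlinger (ÖVP)"

# escape the metacharacters . : ( ) and anchor the pattern at the line start
pattern <- paste0("^", gsub("([.:()])", "\\\\\\1", speaker))
pattern
#> [1] "^Abgeordneter Mag\\. Klaus Fürlinger \\(ÖVP\\)"

# the escaped pattern now matches the literal name at the start of a line
grepl(pattern, "Abgeordneter Mag. Klaus Fürlinger (ÖVP): Vielen Dank.")
#> [1] TRUE
```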

# create regex
  df_speakers <- df_speakers %>%
    mutate(
      speaker_pattern =
        str_replace_all(speaker, "\\.", "\\\\.") %>%
          str_replace_all("\\:", "\\\\:") %>%
          # str_replace_all("\\,", "\\\\,") %>%
          str_replace_all("\\)", "\\\\)") %>%
          str_replace_all("\\(", "\\\\(") %>%
          # match has to be at start of string; avoids mismatches where name appears in middle of text
          paste0("^", .)
    )

2.2.2 Identify and extract statements

Now, let’s get the entire text of a transcript. Again, I use the rvest package; this time, however, the HTML tag <p> is targeted.

## Extract entire text from transcript

df_text <- link_to_record %>%
    rvest::read_html() %>%
    rvest::html_elements("p") %>%
    rvest::html_text() %>%
    enframe(
      value = "text",
      name = "row") %>%
    mutate(text = str_squish(text) %>%
      str_trim(side = "both"))

The result is a dataframe with one row for each line in the transcript. The challenge now is to identify those rows where a speaker’s statement starts and ends/the next speaker’s statement begins. I’ll do this by - as already mentioned above - pattern matching the names of the retrieved speakers against the content of each row. In simpler terms: does the row start with the name of a speaker which we previously identified by its bold and underlined format? To do this I’ll make use of fuzzyjoin::regex_left_join.
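Here is a toy example of `fuzzyjoin::regex_left_join` (rows and object names are invented) to show the matching logic: every line of the left table is kept, and the speaker columns are filled only where a pattern matches.

```r
library(tibble)
library(fuzzyjoin)

# invented transcript lines
df_lines <- tibble(text = c(
  "Einleitende Bemerkungen",
  "Vorsitzender Sobotka: Ich eröffne die Sitzung.",
  "Sebastian Kurz: Vielen Dank."
))

# invented speaker table with anchored regex patterns
df_names <- tibble(
  speaker         = c("Vorsitzender Sobotka", "Sebastian Kurz"),
  speaker_pattern = c("^Vorsitzender Sobotka", "^Sebastian Kurz")
)

# left join on a regex match instead of on equality
regex_left_join(df_lines, df_names, by = c("text" = "speaker_pattern"))
```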

# get speaker name from text
  df_text_2 <- df_text %>%
    # search for match only in opening section of line
    # mutate(text_start = stringr::str_sub(text, start = 1, end = 40)) %>%
    fuzzyjoin::regex_left_join(
      df_speakers,
      by = c("text" = "speaker_pattern")
    )

Have a look at the result below. There’s now a new column indicating the start of a statement with the speaker’s name (since transcripts start with some introductory text, speakers’ names appear only on later pages of the table).

In the next step, I 1) dplyr::fill all empty speaker rows after a statement’s start with the speaker’s name (details here), and 2) create a numerical grouping variable which increases each time the speaker changes (= start of a new statement). Remaining rows without a speaker are removed since they contain only additional text and no statements.

df_text_2 <- df_text_2 %>%
    # fill rows with speaker and pattern as basis for grouping
    fill(speaker, .direction = "down") %>%
    fill(speaker_pattern, .direction = "down") %>%
    filter(!is.na(speaker)) %>%
    # create grouping id; later needed to collapse rows
    mutate(grouping_id = ifelse(speaker == dplyr::lag(speaker, default = "start"),
      0,
      1
    )) %>%
    mutate(grouping_id_cum = cumsum(grouping_id)) %>%
    # remove speaker info from actual spoken text
    mutate(text = str_remove(text, speaker_pattern)) %>%
    # remove colon; this approach keeps annotations which are included between speaker name and colon, e.g. "(zur Geschäftsbehandlung)"
    mutate(text = str_remove(text, regex("\\:"))) %>%
    mutate(text = str_trim(text, side = "both"))
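The grouping logic can be checked on a small toy vector: the indicator flips to 1 whenever the speaker changes (the first row counts as a change), and its cumulative sum yields one id per statement.

```r
speaker <- c("Sobotka", "Sobotka", "Kurz", "Kurz", "Krainer")

# 1 wherever the speaker differs from the previous row
change <- as.integer(speaker != c("start", head(speaker, -1)))

# one id per uninterrupted block of the same speaker
cumsum(change)
#> [1] 1 1 2 2 3
```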

With each row now attributed to a speaker, and each statement assigned with a distinct indicator, we can collapse a statement’s multiple lines into one single row with its single speaker.

df_text_3 <- df_text_2 %>%
    # collapse rows
    group_by(grouping_id_cum) %>%
    summarise(
      text = paste(text, collapse = " "),
      speaker = unique(speaker)
    ) %>%
    relocate(speaker, .before = "text") %>%
    mutate(text = str_trim(text, side = "both"))

The resulting table already represents most of the required data for the later analysis (for one specific sample respondent). I complement it with additional data on an MP’s party affiliation (if applicable); a speaker’s position (if stated); and the session’s date, duration and number.

df_text_3 <- df_text_3 %>%
    # extract party of MP; pattern matches only if speaker starts with Abgeordnete(r)
    mutate(speaker_party = str_extract(speaker, regex("(?<=^Abgeordnete[^\\(]{1,40})\\(.*\\)$"))) %>%
    mutate(speaker_party = str_extract(speaker_party, regex("[:alpha:]+"))) %>%
    # extract position of speaker
    mutate(speaker_position = case_when(
      str_detect(speaker, regex("^Abgeord")) ~ "Abgeordneter",
      str_detect(speaker, regex("^Verfahrensanwalt")) ~ "Verfahrensanwalt",
      str_detect(speaker, regex("^Vorsitzender-Stellvertreter")) ~ "Vorsitzender-Stellvertreter",
      str_detect(speaker, regex("^Vorsitzende")) ~ "Vorsitzende/r",
      str_detect(speaker, regex("^Verfahrensrichter-Stellvertreter")) ~ "Verfahrensrichter-Stellvertreter",
      str_detect(speaker, regex("^Verfahrensricht")) ~ "Verfahrensrichter",
      str_detect(speaker, regex("^Vertrauensperson")) ~ "Vertrauensperson",
      TRUE ~ "Auskunftsperson"
    ))

# add session details
  # extract name of respondent
  vec_respondent <- df_text %>%
    filter(str_detect(text, regex("^Befragung der Auskunftsperson"))) %>%
    pull(text) %>%
    str_remove("^Befragung der Auskunftsperson ")

  # extract duration of ENTIRE session (not only respondent)
  vec_duration <- df_text %>%
    filter(str_detect(text, regex("^Gesamtdauer der"))) %>%
    pull(text) %>%
    str_extract(., regex("(?<=Sitzung).*"))

  # extract session number
  vec_session_no <- df_text %>%
    filter(str_detect(text, regex("^Gesamtdauer der"))) %>%
    pull(text) %>%
    str_extract(., regex("\\d*(?=\\. Sitzung)"))

  # extract date of session
  vec_date <- df_text %>%
    filter(str_detect(text, regex("^Montag|^Dienstag|^Mittwoch|^Donnerstag|^Freitag|^Samstag|^Sonntag"))) %>%
    pull(text)

df_text_4 <- df_text_3 %>%
    mutate(
      session_date = vec_date,
      session_no = vec_session_no,
      session_duration = vec_duration,
      respondent = vec_respondent
    ) %>%
    select(session_no, session_date, session_duration, respondent, speaker, speaker_position, speaker_party, text)

2.2.3 Party affiliation of MP asking question(s)

A further detail which we can extract is the party affiliation of the MP who asks a question of the respondent. In other words, each row from a respondent obtains a column indicating the party of the MP who asked the question. Doing so, however, implies the assumption that any row/statement by an MP which precedes a row/statement by the respondent is actually a question to the respondent. This assumption may not always be warranted. In some cases statements/rows by both an MP and a respondent may actually refer to an earlier question by the commission’s chair and hence not be an interaction between the two.

Bearing this qualification in mind, the approach seems promising enough to identify interactions between an MP and a respondent and to record the former’s party affiliation.
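The lag-based attribution can be illustrated with a small invented sequence in base R: a respondent’s row inherits the party of the MP speaking directly before it, and nothing else.

```r
position <- c("Abgeordneter", "Auskunftsperson", "Vorsitzende/r", "Auskunftsperson")
party    <- c("SPÖ", NA, NA, NA)

# base-R equivalents of dplyr::lag
prev_position <- c(NA, head(position, -1))
prev_party    <- c(NA, head(party, -1))

# a respondent's row inherits the party of the MP speaking directly before
questioner_party <- ifelse(position == "Auskunftsperson" &
                             prev_position == "Abgeordneter",
                           prev_party, NA)
questioner_party
# second element is "SPÖ"; all others are NA (no MP spoke directly before them)
```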

# identify party of MP asking the question
  df_text_4 <- df_text_4 %>%
    mutate(respondent_questioner_party = case_when(
      str_detect(lag(speaker_position), regex("^Abge")) &
        speaker_position == "Auskunftsperson" ~ lag(speaker_party),
      TRUE ~ NA_character_)) %>%
    # calculate answer length without annotations included in transcripts
    mutate(text_length_old = stringi::stri_count_words(text, locale = "de")) %>%
    mutate(text_length = str_remove_all(text, regex("\\([^\\(\\)]*\\)")) %>%
             stringi::stri_count_words(., locale = "de"))
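To see what the annotation-stripping regex does before the words are counted, a quick example on an invented line:

```r
library(stringr)

s <- "Das ist richtig. (Zwischenruf des Abg. Krainer.) Ich bleibe dabei."

# drop parenthesised stage directions, then tidy the leftover whitespace
str_squish(str_remove_all(s, regex("\\([^\\(\\)]*\\)")))
#> [1] "Das ist richtig. Ich bleibe dabei."
```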