Extracting the transcripts of the ‘Ibiza Inquiry’

Austrian Parliament web scraping

In mid-June, the so-called ‘Ibiza Commission of Inquiry’ heard its last witness after almost one year at work. The commission had been set up by the Austrian Parliament to investigate the prevalence of corruption during the coalition government of the ÖVP and FPÖ (Dec 2017 - May 2019). This post digs into its published transcripts. The emphasis is, first, on demonstrating how to extract the text from the hearings’ transcripts in R and, second, on crunching a few numbers to get some substantive insights. The compiled dataset, covering all statements (incl. questions, answers etc.), is available for download.

10-07-2021

Context

In mid-June, the Austrian Parliament’s ‘Commission of Inquiry concerning the alleged corruptibility of the turquoise-blue Federal Government’1 heard its last respondent (‘Auskunftsperson’). More informally, the commission is simply called the ‘Ibiza inquiry’, named after the location of a secretly taped video which showed high-ranking members of the extreme-right FPÖ party speaking freely - their tongues loosened by alcohol - with the fake niece of a Russian oligarch about actual or intended corruption in Austria’s political system. As a consequence of the video’s release, the then ruling coalition government of ÖVP (‘turquoise’) and FPÖ (‘blue’) collapsed and the commission was set up. In short, a pretty wild and bewildering story.

The commission, empowered with a fairly broad mandate and armed with access to plenty of WhatsApp and other messages, offered an unprecedented view into the inner workings of Chancellor Kurz’s first government and the mindset of some of its protagonists. While the initial impetus for setting up the commission was first and foremost the Ibiza video featuring the FPÖ leadership, the inquiry’s focus gradually shifted (not least due to the opposition’s efforts) to the wheeling and dealing of Chancellor Sebastian Kurz’s ÖVP and its affiliates.

Having said this, the purpose of this post is not to give a recap of the inquiry or its results; it is first and foremost procedural in the sense that it details the steps necessary to extract statements from the inquiry’s transcripts and subsequently obtain some exemplary insights with R. As always, if you spot any error, feel free to contact me, ideally via Twitter DM. And if you use any of the work provided here, I’d be grateful if you acknowledged my input.

If you are not interested in the coding steps that generate the data, jump directly to the Analysis section, which still contains plenty of code but doesn’t burden you with how the data is obtained in the first place.

Getting the data

In this section I will lay out the steps needed to obtain the relevant data. I’ll first detail each step using a single sample session of the inquiry commission. Subsequently, I’ll apply the demonstrated steps to all sessions by means of one general function.

But before that, let’s load the packages we’ll need along the way and define some templates.

# libraries ---------------------------------------------------------------
#load the required libraries
library(tidyverse)
library(rvest)
library(xml2)
library(fuzzyjoin)
library(reactable)
library(reactablefmtr)
library(hrbrthemes)


#define party colors
vec_party_col <- c(
  "ÖVP"="#5DC2CC",
  "SPÖ"="#FC0204", 
  "Grüne"="#A3C630",
  "FPÖ"="#005DA8", 
  "NEOS"="#EA5290"
  ) 
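The vector maps each parliamentary party to its customary color and is meant for the plots later on. A minimal, purely illustrative use with ggplot2’s manual scale (df_plot and its columns party and n are hypothetical stand-ins, not part of the actual analysis):

# hypothetical example: df_plot is assumed to hold one count (n) per party
df_plot %>%
  ggplot(aes(x = party, y = n, fill = party)) +
  geom_col() +
  scale_fill_manual(values = vec_party_col)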

In a first step, let’s get the links leading to the transcripts of the sessions. This link leads to the commission’s overview page, which lists all published records, among them the sessions’ transcripts (Protokolle).

The code below extracts links related to the latter. Explanatory comments are inserted directly into the code chunk.

#link to overview page
site_link <- "https://www.parlament.gv.at/PAKT/VHG/XXVII/A-USA/A-USA_00002_00906/index.shtml#tab-VeroeffentlichungenBerichte"

# get links to the pages where the links to the protocols are located
# the link text has to include the word 'Protokolls'
df_links_to_subpages <- site_link %>%
  rvest::read_html() %>%
  #define a filter to get only the links related to transcripts (protocols)
  # filters links based on text/name of links
  rvest::html_elements(xpath = "//a[contains(text(), 'Protokolls')]") %>%
  html_attr("href") %>%
  # extracts links
  enframe(name = NULL, value = "link") %>%
  #links of interest include "KOMM"
  filter(str_detect(link, regex("KOMM"))) %>%
  #complete the link
  mutate(link_to_subpages = paste0("https://www.parlament.gv.at/", link)) %>%
  select(link_to_subpages)

Here are the first ten links (a simple slice of the dataframe prints them):
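# display the first ten scraped links
df_links_to_subpages %>%
  slice_head(n = 10)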

Each of these links leads to a subpage which provides details on the record in question and the link to the actual file containing the transcribed text. Below is one such subpage.

Note the link leading to the HTML version of the transcript. To access the transcript, we need the link’s target address. The function below extracts the link leading to the actual text. Subsequently, the function is applied to all subpage links obtained in the previous step.

# function to extract link to protocol from details page

fn_get_link_to_record <- function(link_to_subpage) {
  link_to_subpage %>%
    rvest::read_html() %>%
    rvest::html_elements("a") %>%
    html_attr("href") %>%
    enframe(
      name = NULL,
      value = "link"
    ) %>%
    #link to transcript contains "fnameorig"
    filter(str_detect(link, regex("fnameorig"))) %>%
    #complete link
    mutate(link_to_record = paste0("https://www.parlament.gv.at/", link)) %>%
    select(link_to_record)
}

# scrape the subpages in parallel
library(furrr)
plan(multisession, workers = 2)


# apply function to all links
df_links_to_records <- df_links_to_subpages %>%
  pull(link_to_subpages) %>%
  purrr::set_names() %>%
  future_map_dfr(., fn_get_link_to_record, .id = "link_to_subpages")

What we obtain is a dataframe with the links leading to all transcripts of the inquiry (only the first five are shown).
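For a quick overview of its structure, glimpse does the job:

# inspect the structure of the scraped links
df_links_to_records %>%
  glimpse()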

Extracting text

Now with the links to the transcripts available, let’s have a look at one such text, e.g. here.

Importantly, notice that statements given before the inquiry commission are always introduced with the speaker’s name (and position) in bold and underlined letters. This (almost) consistently applied formatting will eventually allow us to distinguish between a statement and its speaker, and to detect the start/end of different statements. I’ll first extract these names and subsequently assign them to their respective statements.

Extract speakers

To extract the speakers from the text I’ll once again use the powerful rvest package. To identify those text parts which are bold and underlined, its html_elements function is used. As for the use of xml_contents, I am grateful for the answer to this Stackoverflow question.

# pick one sample transcript link, e.g. the first record obtained above
link_to_record <- df_links_to_records$link_to_record[1]

# get those elements which are bold and underlined
df_speakers <- link_to_record %>%
    rvest::read_html() %>%
    #extract elements which are bold and underlined; note that the sequence has to be that of the html tags 
    rvest::html_elements("b u") %>%
    map(., xml_contents) %>%
    map(., html_text) %>%
    enframe(value = "speaker") %>%
    mutate(speaker = str_trim(speaker, side = c("both"))) %>%
    mutate(speaker = str_squish(speaker)) %>%
    # keep only those elements which end with colon;
    # filter(str_detect(speaker, regex("\\:$"))) %>%
    #remove colon at end; needed to unify names of speakers where some instances end/do not end with colon
    mutate(speaker=str_remove(speaker, regex("\\:$"))) %>% 
    #keep only entries comprising more than one word (drops stray fragments)
    filter(str_count(speaker, regex("\\S+")) > 1) %>%
    #removes heading of transcript which is also bold and underlined
    filter(!str_detect(speaker, regex("^Befragung der"))) %>% 
    distinct(speaker)

Here’s the result for our sample session.

# A tibble: 12 x 1
   speaker                                    
   <chr>                                      
 1 Verfahrensrichter Dr. Wolfgang Pöschl      
 2 Vorsitzender Mag. Wolfgang Sobotka         
 3 Sebastian Kurz                             
 4 Abgeordneter Mag. Klaus Fürlinger (ÖVP)    
 5 Mag. Klaus Fürlinger (ÖVP)                 
 6 Abgeordneter Kai Jan Krainer (SPÖ)         
 7 Abgeordneter Dr. Christian Stocker (ÖVP)   
 8 Abgeordneter Mag. Andreas Hanger (ÖVP)     
 9 Abgeordnete Mag. Nina Tomaselli (Grüne)    
10 Abgeordneter Christian Hafenecker, MA (FPÖ)
11 Vorsitzender Mag. Wolfgang Sobotk          
12 Abgeordneter David Stögmüller (Grüne)      

As you can see, the approach worked rather well, but not 100% perfectly. A few rows contain incomplete names of speakers or some related fragments (e.g. Sobotka missing its final a). These unwanted results are - as far as I can tell - due to some inconsistent formatting of speakers’ names (even if their appearance in the document is identical) or editorial errors (e.g. the position of a speaker is mentioned in one instance but not in another, as with Abgeordneter Mag. Klaus Fürlinger). These glitches have to be corrected ‘manually’. The code chunk below does this for our specific sample case.
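A minimal sketch of these corrections for the sample at hand, simply dropping the two spurious rows visible in the output above:

# remove the truncated name and the duplicate entry lacking the speaker's position
df_speakers <- df_speakers %>%
  filter(!speaker %in% c(
    "Vorsitzender Mag. Wolfgang Sobotk",
    "Mag. Klaus Fürlinger (ÖVP)"
  ))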

After these modifications we get a clean dataframe of those who made statements during the session in question (our sample link).

# A tibble: 10 x 1
   speaker                                    
   <chr>                                      
 1 Verfahrensrichter Dr. Wolfgang Pöschl      
 2 Vorsitzender Mag. Wolfgang Sobotka         
 3 Sebastian Kurz                             
 4 Abgeordneter Mag. Klaus Fürlinger (ÖVP)    
 5 Abgeordneter Kai Jan Krainer (SPÖ)         
 6 Abgeordneter Dr. Christian Stocker (ÖVP)   
 7 Abgeordneter Mag. Andreas Hanger (ÖVP)     
 8 Abgeordnete Mag. Nina Tomaselli (Grüne)    
 9 Abgeordneter Christian Hafenecker, MA (FPÖ)
10 Abgeordneter David Stögmüller (Grüne)      

In a later step, we will search the entire transcript of the session for the speakers’ names (incl. position and title) to identify the start of each statement. Since this pattern matching will rely on regular expressions (regex), the names have to be modified accordingly (e.g. a literal dot . has to be escaped and becomes \\., for further info see here).

# create regex patterns from speakers' names by escaping special characters
df_speakers <- df_speakers %>%
  mutate(
    speaker_pattern =
      str_replace_all(speaker, "\\.", "\\\\.") %>%
        str_replace_all(., "\\:", "\\\\:") %>%
        # str_replace_all(., "\\,", "\\\\,") %>%
        str_replace_all(., "\\)", "\\\\)") %>%
        str_replace_all(., "\\(", "\\\\(") %>%
        # anchor match at start of string; avoids mismatches where a name appears in the middle of the text
        paste0("^", .)
  )

Identify and extract statements

Now, let’s get the entire text of a transcript. Again, I use the rvest package; this time, however, the HTML tag <p> is targeted.

## Extract entire text from transcript

df_text <- link_to_record %>%
    rvest::read_html() %>%
    rvest::html_elements("p") %>%
    rvest::html_text() %>%
    enframe(., 
            value = "text",
            name="row") %>%
    mutate(text = str_squish(text) %>%
      str_trim(., side = c("both")))

The result is a dataframe with one row for each line in the transcript. The challenge now is to identify the rows where a speaker’s statement starts and where it ends, i.e. where the next speaker’s statement begins. As already mentioned above, I’ll do this by pattern matching the names of the retrieved speakers against the content of each row. In simpler terms: does a row start with the name of a speaker whom we previously identified by the bold and underlined format? To do this I’ll make use of fuzzyjoin::regex_left_join.

# get speaker name from text
df_text_2 <- df_text %>%
    #search for match only in opening section of line
  #  mutate(text_start=stringr::str_sub(text, start=1, end=40)) %>% 
    fuzzyjoin::regex_left_join(.,
      df_speakers,
      by = c("text" = "speaker_pattern")
    ) # speaker_pattern

Have a look at the result below. There’s now a new column indicating the start of a statement with the speaker’s name (since transcripts start with some introductory text, speakers’ names only appear on later pages of the table).

In the next step, I 1) dplyr::fill all empty speaker rows following a statement’s start with the speaker’s name (details here), and 2) create a numerical grouping variable which increases each time the speaker changes (= start of a new statement). Remaining rows without a speaker are removed since they contain only additional text, not statements.

df_text_2 <- df_text_2 %>%
    # fill rows with speaker and pattern as basis for grouping
    fill(speaker, .direction = "down") %>%
    fill(speaker_pattern, .direction = "down") %>%
    filter(!is.na(speaker)) %>%
    # create grouping id; later needed to collapse rows
    mutate(grouping_id = ifelse(speaker == dplyr::lag(speaker, default = "start"),
      0,
      1
    )) %>%
    mutate(grouping_id_cum = cumsum(grouping_id)) %>%
    # remove speaker info from actual spoken text
    mutate(text = str_remove(text, speaker_pattern)) %>%
    # remove colon; this approach keeps annotations which are included between speaker name and colon; e.g "(zur Geschäftsbehandlung)";
    mutate(text = str_remove(text, regex("\\:"))) %>%
    mutate(text = str_trim(text, side = c("both")))

With each row now attributed to a speaker, and each statement assigned with a distinct indicator, we can collapse a statement’s multiple lines into one single row with its single speaker.

df_text_3 <- df_text_2 %>%
    # collapse rows
    group_by(grouping_id_cum) %>%
    summarise(
      text = paste(text, collapse = " "),
      speaker = unique(speaker)
    ) %>%
    relocate(speaker, .before = "text") %>%
    mutate(text = str_trim(text, side = c("both")))
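As a quick sanity check, we can peek at the first few collapsed statements (text truncated for display):

# inspect the first statements; truncate the text column for readability
df_text_3 %>%
  slice_head(n = 3) %>%
  mutate(text = str_trunc(text, width = 80))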