How to extract speeches held at Austria’s parliament

Austria text analysis web scraping regex

The website of the Austrian parliament provides transcripts of its sessions. This post details how to extract the statements given by MPs, members of government and other speakers.

11-22-2021

Context

This post is actually a spin-off of another post, which got too long and convoluted (see here). The context is that I was recently interested in transcripts of sessions of Austria's parliament and noticed that those of more recent legislative periods are not included in an already compiled dataset.1 Hence the interest and need to dig into the transcripts provided on the parliament's website.

This post will lay out the necessary steps in R to get the transcripts of multiple sessions from multiple legislative periods, and to subsequently retrieve the statements by individual speakers. The result, a file comprising all statements of the 26th (XXVI) and 27th (XXVII) legislative periods (as of 3 Nov '21), is available for download here. If you use it, I would be grateful if you acknowledge this blog post. If you have any questions or spot an error, feel free to contact me via Twitter DM.

Get the links of all sessions of multiple legislative periods

The parliament's website provides an overview of all sessions held during a specific legislative period here. Below is a screenshot of the site for the current legislative period:

We can use this overview page to extract the links leading to each session's details page, which in turn includes links to the transcripts. However, instead of scraping the links to the details pages from the table, I used the data provided via the site's RSS feed. The XML format provided there is IMHO considerably more convenient to work with than fiddling with the table itself.

To get the link leading to the XML file, click on the RSS symbol. In the above example the address is

[1] "https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVII&R_SISTEI=SI&listeId=1070&FBEZ=FP_00"

Since we might also be interested in sessions from other legislative periods, let's have a look at the above link. As you can see, the query in the link contains the parameter 'GP=XXVII', i.e. the XXVII legislative period. If we are interested in sessions of e.g. the XXVI legislative period as well, we need to modify the link accordingly. This can be done relatively conveniently with the glue function:

legis_period <- c("XXVI","XXVII")
links_rss_sessions <- glue::glue("https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP={legis_period}&R_SISTEI=SI&listeId=1070&FBEZ=FP_007")
links_rss_sessions
https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVI&R_SISTEI=SI&listeId=1070&FBEZ=FP_007
https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVII&R_SISTEI=SI&listeId=1070&FBEZ=FP_007
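
Before feeding these links into the full extraction function, it can help to peek at what a feed actually contains. A minimal, optional check (assuming the feed follows the standard RSS structure with item/title nodes):

#optional sanity check: read the first feed and print a few session titles
xml2::read_xml(links_rss_sessions[1]) %>% 
  xml2::xml_find_all("//item/title") %>% 
  xml2::xml_text() %>% 
  head()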

This vector, containing the links to the two XML files which in turn contain the links leading to our session pages, now has to be fed into a function that actually extracts the links we are interested in. The function below does this; comments are inserted in the chunk.

fn_get_session_links <- function(rss_session) {

#extract the legislative period from the RSS-feed address  
legis_period <- str_extract(rss_session, regex("(?<=GP\\=)[^\\&]*(?=\\&)"))

#read the xml file;
df_rss_session <- xml2::read_xml(rss_session)
rss_data <- xml_child(df_rss_session, 1)

#create df with session name, id, and link to session's details page
df_rss_session_name <- rss_data %>%
  xml2::xml_find_all("//title") %>%
  html_text() %>%
  #create a dataframe
  enframe(.,
          name = "id",
          value = "session_name"
  ) %>% 
  #keep only those results which contain the value "Sitzung" (session)
  filter(str_detect(session_name, "Sitzung")) %>% 
  #add a session id but ensure that id has same length
  mutate(session_id=str_extract(session_name, regex("[:digit:]+"))) %>% 
  #str_pad! adds leading zeros; takes length of string into account
  mutate(session_id_pad=stringr::str_pad(session_id, width = 5, pad = 0)) %>%
  #compose the link leading to the session's details page by inserting the legislative period and the session number (padded) into the link, and add the tab destination
  mutate(link_records=glue::glue("https://www.parlament.gv.at/PAKT/VHG/{legis_period}/NRSITZ/NRSITZ_{session_id_pad}/index.shtml#tab-Sten.Protokoll")) %>% 
  mutate(session_name=str_trim(session_name))

#create df with date of session
df_rss_session_date <- rss_data %>%
  xml2::xml_find_all("//pubDate") %>%
  html_text() %>%
  enframe(.,
          name = "id",
          value = "date_session"
  ) %>%
  #important to adjust for time zone
  mutate(date_session = lubridate::dmy_hms(date_session, tz="Europe/Vienna"))

#combine both dataframes
df_sessions <- bind_cols(
  df_rss_session_date,
  df_rss_session_name
  ) %>%
  select(-contains("id")) %>% 
  mutate(legis_period=legis_period)

df_sessions
}

Now let’s apply this function to the vector.

library(xml2)
df_sessions <- links_rss_sessions %>% 
  map_dfr(., possibly(fn_get_session_links, 
                      #if a feed cannot be read, return NULL, which map_dfr silently drops
                      otherwise = NULL)) 

As a result we obtain a dataframe with 230 rows (links to sessions’ details pages) in total.

If you have a look at the screenshot above, you'll see that we did indeed get all 139 sessions of the current legislative period as of the time of writing.
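A quick way to verify this is to count the retrieved sessions per legislative period (a small sanity check, not part of the original workflow):

#number of retrieved sessions per legislative period
df_sessions %>% 
  count(legis_period)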

Extract links leading to transcripts

As you could already see in the function fn_get_session_links above, link_records not only comprises the link to the session's details page, but is also complemented by the suffix #tab-Sten.Protokoll. The reason for this addition is that the actual link leading to the session's transcript is located on a distinct tab of the session's details page. Below is a screenshot of an example:

In the next step, we have to retrieve the link that finally leads us to the transcript. If we hover over the link leading to the HTML version of the 'Stenographisches Protokoll' (stenographic transcript), we can see that the address, e.g. for the transcript of the 74th session, is

[1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00074/fnameorig_946652.htm"

However, since we are not only interested in this particular case, but also in the links pertaining to other sessions, we need to find a way to retrieve all the links in question by means of a general query. The code below does this.

We first extract all (!) links contained on the transcript tab with the rvest package, and then keep only the relevant link by means of the regular expression "\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$".

fn_get_link_to_records <- function(link_to_transcript_tab) {

  res <- link_to_transcript_tab %>% 
    rvest::read_html() %>% 
    rvest::html_elements("a") %>% 
    rvest::html_attr("href") %>% 
    enframe(name = NULL,
            value = "link_to_text") %>% 
    filter(str_detect(link_to_text, regex("\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$"))) %>% 
    mutate(link_to_text=glue::glue("https://www.parlament.gv.at/{link_to_text}")) %>% 
    pull()
  
  #if not exactly one link is identified, return NA_character_
  ifelse(
    length(res)==1,
    res,
    NA_character_
  )
  
}

In the next step, let's apply this function to all links leading to the sessions' details pages/the tab for transcripts. Note that I use the furrr package, which enables us to apply the function in parallel rather than sequentially and hence speeds things up a bit.

library(furrr)
plan(multisession, workers=3)

tbl_missing <- tibble(link_to_text=NA_character_)

df_link_text <- df_sessions %>% 
  mutate(link_to_text=future_map_chr(link_records, 
                              possibly(fn_get_link_to_records,
                                       otherwise=NA_character_),
                              .progress = T))

What we obtain is a dataframe with the links to all transcripts.

Note that there are some sessions where no link to a transcript could be retrieved. A look at these sessions’ dates reveals that the missing links pertain to the most recent sessions. The finalized transcripts are only available after some delay. We remove these missing observations.
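A minimal sketch of this removal, assuming the missing links are simply NA values in the link_to_text column:

#inspect the sessions without a transcript link, then drop them
df_link_text %>% 
  filter(is.na(link_to_text)) %>% 
  select(date_session, session_name)

df_link_text <- df_link_text %>% 
  filter(!is.na(link_to_text))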

Account for multi-day sessions

There is one further thing we have to control for: some sessions last several days. While we have a single observation (row) for each day, the transcript for each day covers the entire session and not only the statements from the day in question. If we did not account for this, the statements of e.g. a session spanning three days would be included three times in the dataset. Below are those sessions which lasted multiple days.

df_multi_day_sessions <- df_link_text %>% 
  group_by(link_to_text) %>% 
  arrange(date_session, .by_group = T) %>% 
  summarise(date_collapse=paste(date_session, collapse=", "),
            session_name=paste(unique(session_name), collapse=", "),
            date_n=n()) %>% 
  filter(date_n>1)

To control for this, I collapse duplicate links.

df_link_text <- df_link_text %>% 
  group_by(legis_period, link_to_text, link_records) %>% 
  arrange(date_session, .by_group = T) %>% 
  summarise(date_session=paste(date_session, collapse=", "),
            session_name=paste(unique(session_name), collapse=", "),
            date_n=n()) %>% 
  ungroup() %>% 
  #take the first date if a session spans multiple days; needed later for sorting etc.
  mutate(date_first=str_extract(date_session, regex("^[^,]*"))) %>% 
  mutate(date_first=lubridate::ymd(date_first))
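A quick, optional check that the collapsing worked and every transcript link now appears exactly once:

#should return zero rows if every transcript link is unique
df_link_text %>% 
  count(link_to_text) %>% 
  filter(n > 1)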

Extract text from transcripts

Now, with the links to the actual texts available, we have to feed them into a function which retrieves those texts. The function below does this. Again, the rvest package is our tool of choice to extract the content of the HTML file.

The somewhat tricky part here is to identify the relevant CSS selector that enables us to retrieve the parts we are interested in. Navigate to a sample page, open the browser's inspection tools (F12), and select the item of interest.

In the screen recording above, we see that the statement by MP Drozda can be selected via the CSS class WordSection27. Other statements have e.g. WordSection28, WordSection60 etc. In other words, every statement has its own distinct CSS class. At first glance, this looks like trouble ahead. 'Luckily' though, the html_nodes syntax accepts CSS attribute selectors, which allow something akin to a prefix match: [class^=WordSection], i.e. take only those elements whose class starts with WordSection. With this approach, we are able to select all our statements even if each of their CSS classes is unique (ends with a distinct number). Sweet, no?2
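To see this prefix match in action, here is a tiny self-contained illustration with made-up HTML (the class names merely mimic those on the transcript pages; the content is invented):

#made-up HTML to illustrate the [class^=WordSection] attribute selector
demo_html <- rvest::minimal_html('
  <div class="WordSection27">Statement A</div>
  <div class="WordSection28">Statement B</div>
  <div class="Sitenav">navigation, should not match</div>')

demo_html %>% 
  rvest::html_elements("[class^=WordSection]") %>% 
  rvest::html_text()
#> [1] "Statement A" "Statement B"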

Let’s define the function accordingly:

fn_get_record_text <- function(link_to_text) {
  
link_to_text %>% 
    read_html(., encoding = "latin1") %>%
    html_nodes('[class^=WordSection]') %>%
    html_text2() %>% 
    enframe(name = NULL,
            value="text_raw") %>% 
    mutate(text_raw=text_raw %>% str_squish %>% str_trim(., "both")) 
}

tbl_missing_wo_id <- tibble(text_raw=NA_character_)

And then apply it:

#using the furrr package to speed things up a bit
df_data <- df_link_text %>% 
  mutate(text=future_map(link_to_text, 
                      possibly(fn_get_record_text,
                               otherwise=tbl_missing_wo_id),
                      .progress = T))

The first five rows of the resulting dataframe are below:
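One way to produce such a preview (the column selection here is merely illustrative):

#first five rows with a subset of columns
df_data %>% 
  select(legis_period, date_first, session_name, link_to_text) %>% 
  head(5)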