How to extract speeches held at Austria’s parliament

The website of the Austrian parliament provides transcripts of its sessions. This post details how to extract the statements given by MPs, members of government and other speakers.
Tags: Austria, text analysis, web scraping, regex

Author: Roland Schmidt

Published: 22 Nov 2021

1 Context

This post is a spin-off of another post, which had grown too long and convoluted (see here). I was recently interested in transcripts of sessions of Austria’s parliament and noticed that those of the more recent legislative periods are not included in an already compiled dataset.1 Hence the need to dig into the transcripts provided on the parliament’s website.

This post lays out the steps needed in R to get the transcripts of multiple sessions from multiple legislative periods, and subsequently to retrieve the statements of individual speakers. The result, a file comprising all statements for the 26th and 27th legislative periods (as of 3 Nov’21), is available for download here. If you use it, I would be grateful if you acknowledged this blog post. If you have any questions or spot an error, feel free to contact me via Twitter DM.

4 Extract text from transcripts

Now that the links to the actual texts are available, we have to feed them into a function which retrieves the texts themselves. The function below does this. Again, the rvest package is our tool of choice to extract the content of the html file.

The somewhat tricky part is identifying the CSS selector that lets us retrieve the parts we are interested in. Navigate to a sample page, open the browser’s inspect tools (F12), and select the item of interest.

In the screen recording above we see that the statement by MP Drozda can be selected via the CSS class WordSection27. Other statements have e.g. WordSection28, WordSection60 etc. In other words, every statement has its own distinct CSS class. At first glance, this looks like trouble ahead. ‘Luckily’ though, the selector syntax accepted by html_nodes allows something akin to a regex pattern: the attribute selector [class^=WordSection] takes only those elements whose class attribute starts with WordSection. With this approach, we can select all statements even though each of their classes is unique (it ends with a distinct number). Sweet, no?2
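To see the ‘starts with’ selector at work without touching the parliament’s website, here is a minimal sketch on a made-up html snippet. The snippet and its contents are assumptions for illustration only; the real transcript pages are far more verbose.

```r
library(rvest)

# Hypothetical miniature of a transcript page: each statement sits in its own
# <div> whose class is "WordSection" followed by a running number.
sample_page <- read_html('
  <div class="WordSection27"><p>First statement.</p></div>
  <div class="WordSection28"><p>Second statement.</p></div>
  <div class="footer"><p>Not a statement.</p></div>
')

# [class^=WordSection] keeps only elements whose class attribute starts with
# "WordSection", whatever number follows.
statements <- html_nodes(sample_page, "[class^=WordSection]")

length(statements)                 # 2: the footer div is not matched
html_attr(statements, "class")     # "WordSection27" "WordSection28"
html_text(statements, trim = TRUE)
```

Note that `^=` tests the beginning of the whole class attribute, so an element carrying several classes would only match if WordSection… came first.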

Let’s define the function accordingly:
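A minimal sketch of what such a function could look like. The function name `fn_get_statements` and the two output columns are assumptions made for illustration; only rvest and the `[class^=WordSection]` selector are taken from the text above.

```r
library(rvest)

# Hypothetical helper (name and columns are assumptions): takes one transcript,
# given as a URL, a file path or a literal html string, and returns one row
# per WordSection block.
fn_get_statements <- function(transcript) {
  page <- read_html(transcript)

  # every statement lives in an element whose class starts with "WordSection"
  nodes <- html_nodes(page, "[class^=WordSection]")

  data.frame(
    section = html_attr(nodes, "class"),    # e.g. "WordSection27"
    text = html_text(nodes, trim = TRUE),   # raw text of the block
    stringsAsFactors = FALSE
  )
}

# quick check on a made-up snippet instead of a real transcript link
fn_get_statements('<div class="WordSection27"><p>A statement.</p></div>')
```

Keeping the section class next to the text makes it easy to trace a row back to its place in the html file later on.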

And then apply it:
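Applying the function to many transcripts boils down to looping over the links and stacking the results. The sketch below assumes a compact stand-in for the extraction function and uses two made-up html snippets in place of real transcript links; all names are hypothetical.

```r
library(rvest)

# Stand-in for the extraction function; name and column are assumptions.
fn_get_statements <- function(transcript) {
  nodes <- html_nodes(read_html(transcript), "[class^=WordSection]")
  data.frame(text = html_text(nodes, trim = TRUE), stringsAsFactors = FALSE)
}

# Hypothetical inputs: literal html snippets instead of real transcript links
transcripts <- c(
  session_a = '<div class="WordSection1"><p>Statement one.</p></div>',
  session_b = paste0(
    '<div class="WordSection1"><p>Statement two.</p></div>',
    '<div class="WordSection2"><p>Statement three.</p></div>'
  )
)

# run the function on each transcript and tag the rows with their source
results <- lapply(names(transcripts), function(id) {
  out <- fn_get_statements(transcripts[[id]])
  out$transcript <- rep(id, nrow(out))
  out
})

# stack everything into one dataframe
df_statements <- do.call(rbind, results)
df_statements
```

With real links, it may be worth adding a short pause between requests (e.g. `Sys.sleep()`) so as not to strain the server.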

The first five rows of the resulting dataframe are below: