[1] "https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVII&R_SISTEI=SI&listeId=1070&FBEZ=FP_00"
1 Context
This post is actually a spin-off of another post, which got too long and convoluted (see here). The context is that I was recently interested in transcripts of sessions of Austria’s parliament and noticed that those of more recent legislative periods are not included in an already compiled dataset.1 Hence the interest and need to dig into the transcripts provided on the parliament’s website.
This post lays out the necessary steps in R to get transcripts of multiple sessions from multiple legislative periods, and subsequently retrieve statements by individual speakers. The result, a file comprising all statements of the 26th and 27th legislative periods (as of 3 Nov ’21), is available for download here. If you use it, I would be grateful if you acknowledged this blog post. If you have any questions or spot an error, feel free to contact me via Twitter DM.
2 Get the links of all sessions of multiple legislative periods
The parliament’s website provides an overview of all sessions held during a specific legislative period here. Below is a screenshot of the site for the current legislative period:
We can use this overview page to extract the links leading to each session’s details page, which includes links to the transcripts. However, instead of scraping the links to the details pages from the table, I used the data provided via the site’s RSS feed. The provided XML format is IMHO considerably more convenient to work with than fiddling with the table itself.
To get the link leading to the XML file, click on the RSS symbol. In the above example the address is

[1] "https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVII&R_SISTEI=SI&listeId=1070&FBEZ=FP_007"
Since we might also be interested in sessions from other legislative periods, let’s have a look at the above link. As you can see, the query in the link contains the argument ‘GP=XXVII’, i.e. the XXVII legislative period. If we are interested in sessions of e.g. the XXVI legislative period as well, we need to modify the link accordingly. This can be done relatively conveniently with the glue function:
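A minimal sketch of this step; the object name links_rss is my own choice:

```r
library(glue)

# legislative periods of interest
periods <- c("XXVI", "XXVII")

# glue() inserts each period into the GP= argument of the RSS link
links_rss <- glue("https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP={periods}&R_SISTEI=SI&listeId=1070&FBEZ=FP_007")

links_rss
```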
https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVI&R_SISTEI=SI&listeId=1070&FBEZ=FP_007
https://www.parlament.gv.at/PAKT/PLENAR/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&MODUS=PLENAR&NRBRBV=NR&GP=XXVII&R_SISTEI=SI&listeId=1070&FBEZ=FP_007
This vector, containing the links to both XML files (which in turn contain the links leading to our session pages), now has to be fed into a function that actually extracts the links we are interested in. The function below does this. Comments are inserted in the chunk.
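The original chunk is not reproduced here, so below is a minimal sketch of what fn_get_session_links could look like. The column names (legis_period, session_name, date_session, link_records) are inferred from the later chunks; the feed’s exact XML structure and date format are assumptions.

```r
library(xml2)
library(dplyr)
library(stringr)

fn_get_session_links <- function(link_rss) {

  # read the RSS feed listing all sessions of one legislative period
  xml_feed <- read_xml(link_rss)

  # every <item> node represents one session
  items <- xml_find_all(xml_feed, ".//item")

  tibble(
    # legislative period, taken from the GP= argument of the feed link
    legis_period = str_extract(link_rss, "(?<=GP=)[XVI]+"),
    session_name = xml_text(xml_find_first(items, ".//title")),
    # pubDate is assumed to follow RFC 822 ('Wed, 03 Nov 2021 ...');
    # keep only the date part and parse it
    date_session = lubridate::dmy(
      str_extract(xml_text(xml_find_first(items, ".//pubDate")),
                  "\\d{1,2} \\w+ \\d{4}")),
    # link to the session’s details page, complemented by the anchor of
    # the tab containing the transcript (see next section)
    link_records = paste0(xml_text(xml_find_first(items, ".//link")),
                          "#tab-Sten.Protokoll")
  )
}
```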
Now let’s apply this function to the vector.
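Assuming purrr and the objects from above (df_sessions is again my own name):

```r
library(purrr)

# apply the function to each RSS link and row-bind the results
df_sessions <- map_dfr(links_rss, fn_get_session_links)
```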
As a result we obtain a dataframe with 92 rows (links to sessions’ details pages) in total.
If you have a look at the screenshot above, you’ll see that we indeed got all sessions of the current legislative period as of the time of writing.
3 Extract links leading to transcripts
As you could already see in the function fn_get_session_links above, link_records not only comprises the link to the session’s details page, but is complemented by the expression #tab-Sten.Protokoll at the end. The reason for this addition is that the actual link leading to the session’s transcript is located on a distinct tab of the session’s details page. Below is a screenshot of an example:
In the next step, we have to retrieve the link that finally leads us to the transcript. If we hover over the link leading to the HTML version of the ‘Stenographisches Protokoll’ (stenographic transcript), we can see that the address, e.g. for the transcript of the 74th session, is
[1] "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00074/fnameorig_946652.htm"
However, since we are not only interested in this particular case but also in the links pertaining to other sessions, we need to find a way to retrieve all the links in question by means of a general query. The code below does this.
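What follows is a sketch of such a query; the function name fn_get_transcript_link is my own choice.

```r
library(rvest)
library(stringr)

fn_get_transcript_link <- function(link_records) {

  page <- read_html(link_records)

  # extract ALL links contained on the page ...
  links_all <- page %>%
    html_nodes("a[href]") %>%
    html_attr("href")

  # ... and keep only the one leading to the HTML transcript
  link_to_text <- links_all %>%
    str_subset("\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$")

  # some (recent) sessions have no finalized transcript yet
  if (length(link_to_text) == 0) return(NA_character_)

  # turn the relative link into an absolute one
  paste0("https://www.parlament.gv.at", unique(link_to_text)[1])
}
```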
We first extract all (!) links contained on the transcript tab with the rvest package, and then filter out the relevant link with the regular expression "\\/NRSITZ_\\d+\\/fnameorig_\\d+\\.html$".
In the next step, let’s apply this function to all links leading to the sessions’ details pages/the tab for transcripts. Note that I used the furrr package, enabling us to apply the function in parallel rather than sequentially and hence accelerate things a bit.
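A sketch of this application step, assuming the objects defined above:

```r
library(furrr)
library(future)

# set up a parallel backend; the number of workers is arbitrary
plan(multisession, workers = 4)

df_link_text <- df_sessions %>%
  mutate(link_to_text = future_map_chr(link_records, fn_get_transcript_link))
```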
What we obtain is a dataframe with the links to all transcripts.
Note that there are some sessions where no link to a transcript could be retrieved. A look at these sessions’ dates reveals that the missing links pertain to the most recent sessions. The finalized transcripts are only available after some delay. We remove these missing observations.
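For instance along these lines:

```r
# drop sessions whose transcript is not (yet) available
df_link_text <- df_link_text %>%
  filter(!is.na(link_to_text))
```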
3.1 Account for multi-day sessions
There is one further thing we have to control for: some sessions last several days. While we have a single observation (row) for each day, the transcript for each day covers the entire session and not only the statements from the day in question. If we did not account for this, statements of e.g. a session spanning three days would be included three times in the dataset. Below are those sessions which lasted multiple days.
To control for this, I collapse duplicate links.
Code
df_link_text <- df_link_text %>%
  group_by(legis_period, link_to_text, link_records) %>%
  arrange(date_session, .by_group = TRUE) %>%
  summarise(date_session = paste(date_session, collapse = ", "),
            session_name = paste(unique(session_name), collapse = ", "),
            date_n = n()) %>%
  ungroup() %>%
  # take the first date if a session spans multiple days; needed later for sorting etc.
  mutate(date_first = str_extract(date_session, regex("^[^,]*"))) %>%
  mutate(date_first = lubridate::ymd(date_first))
4 Extract text from transcripts
Now, with the links to the actual texts available, we have to feed them into a function which actually retrieves the text. The function below does this. Again, the rvest package is our tool of choice to extract the content of the html file.
The somewhat tricky part here is to identify the relevant css-selector enabling us to retrieve the parts we are interested in. Navigate to one sample page, open the inspect tools (F12), and select the item of interest.
In the screen recording above, we see that the statement by MP Drozda can be selected with the css-selector WordSection27. Other statements have e.g. WordSection28, WordSection60, etc. In other words, every statement has its own distinct selector/css class. At first glance, this looks like trouble ahead. ‘Luckily’ though, the html_nodes syntax allows us to specify something like a regex pattern: [class^=WordSection], i.e. take only those classes which start with WordSection. With this approach, we are able to select all our statements even if each statement’s css-selector is unique (ends with a distinct number). Sweet, no?2
Let’s define the function accordingly:
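The original chunk is again not reproduced here; below is a minimal sketch based on the selector logic described above (the function name fn_get_text and the output columns are my own choices):

```r
library(rvest)
library(dplyr)

fn_get_text <- function(link_to_text) {

  page <- read_html(link_to_text)

  # select every element whose class starts with 'WordSection';
  # each such section contains one statement
  statements <- page %>%
    html_nodes("[class^=WordSection]") %>%
    html_text2()

  tibble(
    link_to_text = link_to_text,
    text = statements
  )
}
```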
And then apply it:
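Again sketched with furrr, joining the session metadata back in via the transcript link:

```r
# retrieve the statements of all transcripts in parallel
df_text <- future_map_dfr(df_link_text$link_to_text, fn_get_text) %>%
  left_join(df_link_text, by = "link_to_text")
```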
The first five rows of the resulting dataframe are below: