Text Analysis of 2020 Presidential Debates

dominik gulácsy
Dec 9, 2020 · 7 min read

Table of Contents

  1. Introduction
  2. Get debate recordings and use Amazon Transcribe
  3. Amazon Comprehend and Sentiment Analysis
  4. Conclusions
1st presidential debate of 2020 between Donald Trump and Joe Biden. (Photo by Doug Mills/The New York Times)

Introduction

This is my first article on Medium, so I welcome any kind of feedback and constructive critique. Although the analysis was carried out as part of a university assignment, my topic of choice also reflects my personal interests. I have always been fascinated by the tradition of presidential debates in US politics. This series of debates is an integral part of the election process and actually dates back to 1960.

This was the year when the Kennedy–Nixon debates took place. The debate on September 26, 1960 was fundamentally different from anything before it: with the advent of television, the audience grew significantly larger. Over the years, the debates’ importance has only increased, as they have become the last major series of events offering candidates a chance to win voters’ trust.

I was excited to watch the face-off between Donald Trump and Joe Biden this year, and I thought it would be quite interesting to put the content of these debates into numbers and compare them to each other, along with debates from previous election cycles.

Let’s look at how I managed to do that using R and AWS, and what kind of conclusions I could draw from it. If you want to learn more about how this analysis was carried out, you can find the code on my GitHub account.

Get debate recordings and use Amazon Transcribe

To get started with my analysis, I needed the recordings of the presidential debates. Eventually, I decided to include the debates from 2012 (Romney–Obama), 2016 (Trump–Clinton) and 2020 (Trump–Biden) in my work. I started the data collection on YouTube by looking for full-length versions of the debates. I was usually able to find ones that included only the debate itself, without any commentary, but it took some time (roughly 30 minutes).

The next step was to convert these YouTube videos to .mp3. I used MP3FY’s online service to convert the collected video files. Unfortunately, I ran into a minor difficulty here. Although most YouTube videos can be converted straight to .mp3, these debates were recorded by the country’s biggest news networks and therefore have higher audio quality, which means the audio channel is likely to be in the .m4a format. So after converting the videos I ended up with .m4a files and needed to carry out another file format conversion using this online audio converter. Finally, I had all the debate recordings in a suitable format and was happy to leave behind the manual part of all of this. (In hindsight, this step could probably be scripted; see the sketch below.) Let’s see what I can squeeze out of these audio files!
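A minimal sketch of that scripted alternative, assuming ffmpeg is installed and on the PATH (this is not what I actually used, just an option):

# Batch-convert .m4a files in the working directory to .mp3 using
# ffmpeg, as an alternative to the online converters used above.
for (f in list.files(pattern = "\\.m4a$")) {
  system2("ffmpeg", c("-i", shQuote(f), shQuote(sub("\\.m4a$", ".mp3", f))))
}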

Audio is great: it can convey a large amount of information about the speakers. However, it isn’t the easiest thing to analyze. I needed something higher-level to work with, such as the plain-text equivalent of the recording. So I faced yet another conversion task, but this time I decided to make use of Amazon Transcribe’s capabilities.

IAM Console of AWS

The first thing you need in order to access this machine learning service of Amazon is an AWS account. It takes about 10 minutes to set up, and once your account is ready you can generate an access key in the IAM (Identity and Access Management) console to establish an API connection to services such as Amazon Transcribe and many more.

To set up my environment I loaded the AWS credentials the following way:
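paws picks the credentials up from environment variables, so the setup boils down to something like this (the key values and region below are placeholders, not my real credentials):

# Expose the AWS access key to paws via environment variables.
# All values below are placeholders.
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "YOUR_ACCESS_KEY_ID",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_ACCESS_KEY",
  AWS_DEFAULT_REGION    = "us-east-1"
)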

I used the Amazon Transcribe service through R’s paws package, which provides extensive support for all kinds of AWS-related functionality in R. First I configured the service object I wanted to use, then I carried out actions with that object. To transcribe the audio files, I needed to upload them to S3, Amazon’s storage solution: first I created a bucket, then I uploaded the files.
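A sketch of those two steps with paws (the bucket name is a placeholder):

library(paws)

# Create an S3 client and a bucket to hold the audio files.
s3 <- s3()
s3$create_bucket(Bucket = "debate-recordings")

# Upload every .mp3 recording in the working directory.
for (f in list.files(pattern = "\\.mp3$")) {
  s3$put_object(
    Body   = readBin(f, "raw", n = file.size(f)),
    Bucket = "debate-recordings",
    Key    = f
  )
}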

After this, I configured the transcription service and started transcription jobs to process each audio file. As the jobs finished, I accessed their output and saved it as JSON documents. Now I had the debates in text format and could perform a sentiment analysis on them. One thing to keep in mind is that the text is not separated by speaker, so it is a mash-up of the conversation between the moderator and the two candidates.
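For reference, a single transcription job looks roughly like this in paws (the job name and S3 URI are placeholders):

# Start a transcription job for one uploaded recording.
transcribe <- transcribeservice()
transcribe$start_transcription_job(
  TranscriptionJobName = "debate-2020-1",
  LanguageCode = "en-US",
  MediaFormat  = "mp3",
  Media = list(MediaFileUri = "s3://debate-recordings/debate-2020-1.mp3")
)

# Poll until the job finishes.
repeat {
  job <- transcribe$get_transcription_job(TranscriptionJobName = "debate-2020-1")
  if (job$TranscriptionJob$TranscriptionJobStatus %in% c("COMPLETED", "FAILED")) break
  Sys.sleep(30)
}

# The transcript is a JSON document behind a pre-signed URL.
result <- jsonlite::fromJSON(job$TranscriptionJob$Transcript$TranscriptFileUri)
text   <- result$results$transcripts$transcript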

Amazon Comprehend and Sentiment Analysis

To carry out the sentiment analysis, I used Amazon Comprehend, which is another AWS offering. Its sentiment detection functionality returns 4 types of sentiment scores: positive, negative, neutral and mixed. My setup for the service was the same as for Amazon Transcribe, since I accessed it via paws as well. One notable difficulty is that there is a size limit on the text you can send to the service: a single request cannot be larger than 5,000 bytes. To comply with this limitation, I wrote a function to chop up the text body of each debate and send the data in chunks of at most 5,000 bytes.
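A sketch of that chunking logic plus a single Comprehend call (the function and variable names are my own, and the byte counting is simplified by splitting on spaces):

# Split a text into chunks of at most max_bytes bytes, breaking on
# spaces so words stay intact. The default stays a little under the
# 5,000-byte service limit for safety; a single word longer than
# max_bytes would still overflow, but that doesn't occur here.
chunk_text <- function(text, max_bytes = 4500) {
  words   <- strsplit(text, " ")[[1]]
  chunks  <- character(0)
  current <- ""
  for (w in words) {
    candidate <- if (current == "") w else paste(current, w)
    if (nchar(candidate, type = "bytes") > max_bytes) {
      chunks  <- c(chunks, current)
      current <- w
    } else {
      current <- candidate
    }
  }
  c(chunks, current)
}

# Score one chunk with Amazon Comprehend.
comprehend <- comprehend()
chunks <- chunk_text(text)
res <- comprehend$detect_sentiment(Text = chunks[1], LanguageCode = "en")
res$SentimentScore$Negative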

I iterated through all the chunks of each debate’s text and got the sentiment scores for every chunk. At the end of this process, I took a weighted average of the sentiment scores, weighting by chunk length so that the shorter last chunk wouldn’t distort the results. I also added some general text statistics, like the number of words and the speaking tempo, which I calculated by dividing the number of words by the length of the debate.
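In code, that aggregation is just a length-weighted mean (negative_scores and debate_length_minutes are illustrative variable names, not from my actual script):

# Weight each chunk's score by its length so the short final chunk
# doesn't skew the debate-level average; negative_scores holds one
# score per chunk.
weights      <- nchar(chunks)
avg_negative <- sum(negative_scores * weights) / sum(weights)

# Speaking tempo: words per minute over the whole debate.
n_words <- length(strsplit(text, "\\s+")[[1]])
tempo   <- n_words / debate_length_minutes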

Finally, I ended up with the following results:

Now, let’s look at what this tells us. First, I checked how the number of words changed over time. Obviously, with only 3 election years in the data it is hard to make any substantiated statements, but we can still see that the number of words spoken in the debates has increased. This may signal that candidates try to squeeze more and more points into their speeches, or it may simply be that the debates are getting longer and they have more time to speak.

To better understand this, I also took a look at the speaking tempo. I presented this data at the debate level to get some insight into differences between individual debates. Usually, 3 presidential debates take place in an election year: two conventional debates (the 1st and the 3rd) that include the two candidates and a moderator, and a town-hall debate (the 2nd) that includes multiple participants. As can be seen on the charts, in 2020 the second debate was cancelled due to COVID-19. Knowing this, the graph makes much more sense. Speaking tempo is lower in the 2nd debates, since there is more transition time and the candidates speak for a smaller share of the time than in the other two debates. One interesting thing is that while in 2012 and 2016 the speaking tempo was roughly the same in the 1st and the 3rd debate, in 2020 the quasi 3rd debate (the second one actually held) had a considerably higher tempo, in fact the highest in the data. This backs up the impression that the final debate of 2020 was more hectic and heated than the first one, or than any debate in 2012 and 2016.

So the 2020 debates were probably more hectic and heated, but what about sentiment? Do debates generally have a positive, negative or perhaps neutral tone? Based on the sentiment data from Amazon Comprehend, it looks like debates have a mainly negative tone. It can also be seen that the tone is getting more and more negative: while in 2012 the debates were almost as neutral as they were negative, by 2020 the sentiment is clearly negative.

Fine. It looks like debates generally have a negative sentiment. But do some debates tend to be more negative than others? Based on this data, we cannot really say. This is not surprising, as only 3 election years are included in the data and the second debate could not be observed in 2020.

Conclusions

In this article, I went through a workflow example of how to extract textual data from video files and how to conduct a basic text analysis, using R, on the data retrieved from AWS services like Amazon Transcribe and Amazon Comprehend. Considering the limited number of input observations, this analysis still provided quite interesting insights into the characteristics of presidential debates and how these characteristics might have changed over time. I would also like to highlight that the presented flow is not niche: it can be used as a framework for similar projects, for example analyzing public announcements made by influential government officials, or news interviews with residents.
