Analyzing freeCodeCamp.org Video Transcripts and Metadata

dominik gulácsy · Published in Nerd For Tech · May 29, 2021

Visual representation of the text analysis pipeline (data source -> unstructured text data -> analysis)

Table of Contents

  1. Intro
  2. Download Video Transcriptions and Metadata from YouTube
  3. Overview of Video Courses and Trends (EDA)
  4. Preprocessing for Text Analysis
  5. Looking at Term Frequencies
  6. Finding Distinctive Terms (TF-IDF)
  7. Co-occurring Terms
  8. Sentiment Analysis (AFINN)
  9. Topic Modeling (LDA)
  10. Summary

GitHub Repo of Analysis

#1 Intro

In this article, I’m going to use data from freeCodeCamp.org, one of the most popular free education resource platforms for programming and software development. By data, I refer to the 1,000+ videos that are available on the organization’s YouTube channel. It is an amazing selection of video courses that covers almost every aspect of coding from machine learning to front-end development.

Just think about the vast amount of information lying in these 1,151 hours of educational material, full of hands-on practical tutorials and best-practice gems. Going through all of the videos would take about a year and a month at a watch rate of 20 hours per week. I suppose most people would be happy to get hold of that kind of knowledge; however, I have to heartbrokenly admit that reading this article “surprisingly” will not make you a programming genius in 10 minutes. Nevertheless, I hope you may bump into some new terms or relationships that you were not really aware of before. For example, I learned about some authentication concepts along the way.

Before going into the details, I would like to mention that this analysis was inspired by Eduardo Ariño de la Rubia and his book recommendation, Text Mining with R by Julia Silge and David Robinson.

#2 Download Video Transcriptions and Metadata from YouTube

Getting the raw data was easier than I initially thought. This pleasant breeze of ease is all thanks to a very neat and well-documented command-line tool called youtube-dl. It is mainly used for YouTube but also works with videos on Udemy or Vimeo. Basically, you only need to provide the URL of the channel or playlist and it automatically starts downloading the assets associated with the videos (video file, info.json, captions). For most videos on YouTube, verified subtitles are not available; in contrast, auto-generated captions are available for the majority of videos. With the following one-liner I could download both the captions in VTT (Video Text Tracks) format and the metadata JSON files for every video uploaded to the freeCodeCamp channel:
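
The exact command is not reproduced here, but a hypothetical reconstruction using real youtube-dl flags could look like the following; the output template and the playlist URL placeholder are my assumptions, not the author’s original one-liner.

```sh
# Hypothetical reconstruction, not the author's exact command: fetch auto-generated
# English VTT captions plus the info.json for every video in the playlist,
# without downloading the video files themselves.
youtube-dl --skip-download --write-info-json --write-auto-sub \
  --sub-lang en --sub-format vtt -o "%(id)s.%(ext)s" \
  "https://www.youtube.com/playlist?list=<uploads-playlist-id>"
```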

Note that I used a URL pointing to a playlist rather than to the channel itself. You can find a playlist containing all uploaded videos by clicking “Video” on the channel’s page, choosing “Uploads“ in the drop-down menu, and clicking “Play All” next to it.

After the downloads finished, I cross-checked the JSON files with the caption files and removed those that did not have a corresponding captions file. Since I could not find an appropriate package for basic VTT-to-dataframe parsing, I wrote my own parser function that extracts text segments together with their start and end timestamps. If you are interested, you can find it here. I don’t actually make use of this lower level of data in this analysis, although it could be used in further developments, for example to build a keyword-based search feature that recommends specific parts of videos to watch. With all of this done, I eventually arrive at a one-row-per-video dataframe that contains the whole caption of each video.
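
The author’s parser is in the linked repo; purely for illustration, a minimal sketch of a VTT-to-dataframe parser in R could look like this (the cue-settings handling and the column names are my assumptions, not the original implementation):

```r
# Minimal illustrative sketch, not the author's parser from the repo.
library(stringr)
library(dplyr)

parse_vtt <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Cue timing lines look like "00:00:01.000 --> 00:00:04.000 ..."
  ts_idx <- which(str_detect(lines, "-->"))
  segments <- lapply(ts_idx, function(i) {
    ts <- str_split(lines[i], " --> ", simplify = TRUE)
    # The caption text runs from the line after the timing line until the next blank line
    j <- i + 1
    txt <- character(0)
    while (j <= length(lines) && lines[j] != "") {
      txt <- c(txt, lines[j])
      j <- j + 1
    }
    tibble(
      start = ts[1],
      end   = word(ts[2], 1),   # drop trailing cue settings such as "align:start"
      text  = paste(txt, collapse = " ")
    )
  })
  bind_rows(segments)
}
```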

#3 Overview of Video Courses and Trends (EDA)

The main feature, and the central element of my analysis, is the video transcripts. Nevertheless, I also briefly cover some general video characteristics, for example trends, to give context to the text analysis. First, I look at the distributions of the most important variables. Based on these, we can see that the dislike count, like count and view count are extremely skewed to the right, the average rating for almost all of the videos is between 4 and 5, and the average video is about 1 hour long, though some are longer than 3 hours.

Distribution of Video Characteristics

The average rating comes from the rescaled ratio of like counts and dislike counts. As we can see, the average rating does not really give us a clear picture of which videos were more popular, as it is jammed into a very tiny interquartile range. The mean video rating is about 4.84, which raises concerns about whether this variable can be used as a reliable popularity metric, since we cannot really distinguish between the videos. Based on this, we might pick an absolute variable instead of a relative one, such as the like count. The potential problem with this is tied to user behavior: most users don’t click the like button even if they actually like the content, so the like count does not really measure what we might assume it measures. In light of this uncertainty, I decided to go with the view count as a proxy variable for popularity and assigned videos to 4 categories using the view count quartiles as cut points. (Learn more about how YouTube counts views here.)
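
As a rough sketch, the quartile-based bucketing could be done like this (the videos dataframe, its view_count column, and the category labels are assumptions on my part):

```r
# Hedged sketch: bucket videos into four popularity categories using view count quartiles.
library(dplyr)

videos <- videos %>%
  mutate(view_category = cut(
    view_count,
    breaks = quantile(view_count, probs = seq(0, 1, 0.25), na.rm = TRUE),
    labels = c("low", "lower-mid", "upper-mid", "high"),
    include.lowest = TRUE
  ))
```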

Trends of Video Characteristics. Note: 2021 only contains videos published until 2021/05/21.

Based on the charts, we can see that most of the videos were published in 2017 and 2018, and from 2019 onward the number of videos per year seems to stagnate at around 100. The length of the average video increased significantly from 2015 to 2021, with a clear shift in the baseline length from 2019 onward. I suppose this is because the channel started focusing more on in-depth tutorials. Besides this, the speech tempo also shows an increasing trend until 2019, which supports this idea of a format switch. As far as popularity is concerned, we can see a peak around 2019, where some videos hit extremely high view counts of over 1 million. If you’re interested, you can look at the top 5 most viewed videos:

1. Learn JavaScript - Full Course for Beginners | 6.1 M
2. C Programming Tutorial for Beginners | 4.2 M
3. HTML Full Course - Build a Website Tutorial | 3.5 M
4. Learn HTML5 and CSS3 From Scratch - Full Course | 2.4 M
5. Python Django Web Framework - Full Course for Beginners | 2.3 M

#4 Preprocessing for Text Analysis

In order to go on with the text analysis, I had to preprocess the textual data in my dataframe. I applied the “one-token-per-row” principle of tidy text introduced in the Text Mining with R book, which is analogous to Hadley Wickham’s Tidy Data concept. Basically, the idea is that the data should be structured so that each variable is a column and each video-token pair is a row.

We can arrive at this format by taking advantage of the unnest_tokens function from the tidytext library. Unfortunately, in my case, I encountered some problems with the default word tokenization. This was because I also wanted to tokenize the names of programming languages (e.g. Python, JavaScript). The problem occurs when symbols like +#$! are removed during tokenization, so expressions like C# and C++ are reduced to the plain letter c. The other issue is that after tokenization, words that appear in the stop-words table are usually removed; as this table also includes single letters, some of the programming languages get removed completely. To overcome this, I modified the stop words accordingly, and I also added some words that I came across in the analysis but would consider uninformative in this context.
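
To make the idea concrete, here is one possible way to do the tokenization and stop-word handling with tidytext; the captions_df name, the symbol substitutions, and the exact stop-word tweaks are my assumptions rather than the author’s code:

```r
# Hedged sketch of the tokenization step, not the author's exact preprocessing.
library(dplyr)
library(stringr)
library(tidytext)

# Protect symbol-bearing language names before tokenization strips the symbols
protected <- c("c\\+\\+" = "cplusplus", "c#" = "csharp")

tidy_captions <- captions_df %>%              # assumed: one row per video with a `text` column
  mutate(text = str_replace_all(str_to_lower(text), protected)) %>%
  unnest_tokens(word, text)

# Keep single letters that are language names (e.g. "c", "r") out of the stop-word list
my_stop_words <- stop_words %>%
  filter(!word %in% c("c", "r"))

tidy_captions <- tidy_captions %>%
  anti_join(my_stop_words, by = "word")
```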

#5 Looking at Term Frequencies

As a first step, I look at simple word frequencies. I start off with unnesting the video titles and looking at words that appear the most often.
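
A sketch of that step, assuming the metadata dataframe is called videos and has a title column:

```r
# Hedged sketch: most frequent words in the video titles.
title_words <- videos %>%
  unnest_tokens(word, title) %>%
  anti_join(my_stop_words, by = "word") %>%
  count(word, sort = TRUE)

head(title_words, 20)
```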

Most Frequently Used Words in the Titles

The top three words are actually closely related, as there are more than 230 “Live Coding with Jesse” sessions on the channel. Jesse and Beau are both names of presenters/tutors. Besides this, we can see that the main profile of the channel is producing tutorials on topics like JavaScript and React. Looking similarly at the tags associated with the videos, we get more or less the same picture: a lot of tutorial videos on CSS and JavaScript and on web development in general.

Most Frequently Used Video Tags

Eventually, I applied the same method to the video captions to find out which words were used most heavily. This time, after tidying up my dataframe, I also added an extra column containing the stemmed version of each tokenized word. Stemming gets rid of the different forms of particular words; for instance, “creating”, “create”, “creates” and “created” all become the stem “creat”, which lets us better detect the linguistic macro patterns in the text.
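
One common way to add that column is the SnowballC stemmer; this is a sketch under that assumption, not necessarily the package the author used:

```r
# Hedged sketch: add a stemmed column and count the most frequent stems.
library(SnowballC)

tidy_captions <- tidy_captions %>%
  mutate(word_stem = wordStem(word, language = "english"))

tidy_captions %>%
  count(word_stem, sort = TRUE)
```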

My results show that the presenters/tutors use words like time, data, set, add and create the most. This set of words tells us a bit about the kind of language coding tutors typically use during their sessions. We can also see that they talk a lot about adding and creating elements, presumably things like files or functions.

Most Frequently Used Words and Wordstems

After playing a bit with this tokenized caption dataframe, I realized I could also quickly get an idea of which programming languages are covered the most on the channel and how this has changed over the years. So I grouped my dataframe by year and calculated term frequencies per year. The results show that JavaScript has always been in the top 2, while CSS was mentioned relatively often until a significant decline in 2020. We can also see that the tutors started talking about Python more frequently in 2018, and it was the most covered language in 2020.
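
A sketch of that grouping; the year column and the exact language list are assumptions:

```r
# Hedged sketch: how often selected languages are mentioned in captions, by publish year.
languages <- c("javascript", "python", "css", "c", "java")

tidy_captions %>%
  filter(word %in% languages) %>%
  count(year, word) %>%
  group_by(year) %>%
  mutate(share = n / sum(n)) %>%
  arrange(year, desc(share))
```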

Most Frequently Mentioned Programming Languages by Year

#6 Finding Distinctive Terms (TF-IDF)

Now that we looked at frequent terms let’s try to find those expressions that can be considered distinctive with regard to a particular set of videos. We are going to do this by calculating the TF-IDF ratio which combines two important measurements:

  1. TF = Term Frequency; just like before, it tells us how many times a given expression appears in a document.
  2. IDF = Inverse Document Frequency; this shows us how unique a particular term is to the corresponding document (the combined formula is sketched right after this list).
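
Putting the two together, the standard tidytext-style weighting (not spelled out in the article itself) is:

```latex
\mathrm{idf}(t) = \ln\!\left(\frac{n_{\text{documents}}}{n_{\text{documents containing } t}}\right),
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)
```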

To calculate the TF-IDF of each term in a document, we first need to define what a document is. Naturally, we might say that one document ought to be one video. Then we would get a list of words that can be considered unique to that particular video. We could compare these words to the tags associated with the video to evaluate the uniqueness of the tagging. This application is great, but we can also group videos by some variable(s) we are interested in and check which keywords really differentiate the groups from each other.

For this analysis, I decided to use the view count categories (created previously using quartiles as cut points) and years to group the videos. To get the TF-IDF for each term in the “document”, I count the word stems by the grouping variable and calculate the term and inverse document frequencies with the help of the bind_tf_idf function. I will also do this for tags to see how similar the results are.
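
A sketch of that calculation; view_category comes from the earlier bucketing sketch, and the column names are assumptions:

```r
# Hedged sketch: TF-IDF of word stems, treating each view-count category as one "document".
library(tidytext)

tfidf_views <- tidy_captions %>%
  count(view_category, word_stem) %>%
  bind_tf_idf(word_stem, view_category, n) %>%
  arrange(desc(tf_idf))

# The same recipe works with `year` (or the tag table) as the grouping variable.
```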

On the chart below, we can see the top 5 words that are unique to videos with few views, a moderate number of views, or a high number of views. Words from the captions data do not really paint a clear picture of what kind of video content is popular or unpopular. We get words that are hard to associate with anything, such as “sebastian”, “hanson” and “cuisin”. Furthermore, we can also see some tech-related terms such as “changelog”, “quasar”, “stencil” or “fft”, and terms like “asteroid”, “tetromino” or “pacman” that are quite likely associated with game development. However, we could not identify the general aspects that popularity is associated with.

Top 5 Wordstems & Tags with highest TF-IDF by View Count Categories

One potential cause of these unclear sets of words may be factors that affect the generation of captions. For instance, the tutor may have a heavy accent or be speaking in a noisy environment, so transcription accuracy drops and misidentified terms end up having a high TF-IDF because they occur in a pattern. Fortunately, we also have tag data, which is much cleaner and filtered down. Of course, we should keep in mind that we are not truly aware of how these tags were generated and, subsequently, what kinds of biases they include. Looking at the top words, we can see that popular videos are more likely to be focused on Computer Science than others. We can also see that David Malan’s Harvard CS50 videos are very popular. In the upper-middle view count category, we can see phrases associated with Open Source, Git, and GitHub. The lower-middle category shows us that videos about visuals are generally not that popular. Finally, the lowest view count category is a bit unclear again, but we can see that, for example, vlog- and streaming-related videos are rather unpopular.

Although the interpretation of the results was not really straightforward and is undeniably debatable, it was very interesting to me, so I was excited to repeat the process with years as the grouping variable. This way, we can get a sense of what kind of content was special for a given year. We can see, for example, that early tutorials are associated with “bonfire”, some kind of learning platform. We can also see that in 2019, the Harvard CS50 video course and videos covering Docker and neural networks were the distinctive content.

Top 5 Wordstems & Tags with Highest TF-IDF by Year

#7 Co-occurring Terms

Another thing we might be interested in, besides frequent terms and distinctive keywords, is co-occurring terms. We could define co-occurrence in different ways, but here I cover bigrams (neighboring word pairs) and word pairs co-occurring within the same video. To look at neighboring words, I visualize the co-occurrence counts as a network after creating a new text dataframe with bigram tokenization.
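
A sketch of the bigram tokenization and the network plot, in the spirit of the Text Mining with R recipe; the count cutoff and object names (captions_df, my_stop_words) are assumptions:

```r
# Hedged sketch: bigram counts and a co-occurrence network.
library(tidyr)
library(igraph)
library(ggraph)

bigrams_separated <- captions_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

bigram_counts <- bigrams_separated %>%
  filter(!word1 %in% my_stop_words$word, !word2 %in% my_stop_words$word) %>%
  count(word1, word2, sort = TRUE)

bigram_graph <- bigram_counts %>%
  filter(n > 100) %>%                 # arbitrary cutoff to keep the graph readable
  graph_from_data_frame()

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n)) +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE)
```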

Plotting the network shows us related terms that commonly appear together. An interesting observation is that there are clusters of nodes in the data and connections between these clusters. For example, we can see a number of clusters concerning web development and connections between words used in back-end development, creating files, and writing functions. We can also see a language-related cluster connecting words like “pretty”, “easy”, “simple” and “cool”.

Network Representation of Pairwise Correlations between Bigram Words

I get to the broader definition of co-occurrence by calculating the pairwise correlation between the words that appear within a given video. In contrast to co-occurrence counts, this is a relative measurement: it indicates how often word pairs appear together relative to how often they appear separately.

To make it more interesting, I also filtered the pairwise correlations down to word pairs where one of the terms is one of the most covered languages on the channel (Python, JavaScript, C, CSS). This gives us an idea of what is associated with the different languages. We can see, for example, that CSS is highly correlated with a bunch of words like “Bootstrap”, “template”, “margin”, “div”, “scroll”, “height” and “flex”, which makes total sense if you know a little bit of CSS. An interesting detail is that we can also see the nodes that connect JavaScript with CSS, such as “jquery”, “html”, “page” and “dev”. Python seems to be most related to words like “language”, “pip”, “tupl” (for tuple), and “numpi” (for numpy).
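
A sketch of these two steps with the widyr package; the minimum-frequency filter and the video_id column are assumptions:

```r
# Hedged sketch: pairwise correlation of words appearing within the same video.
library(widyr)

word_cors <- tidy_captions %>%
  group_by(word) %>%
  filter(n() >= 50) %>%               # keep reasonably common words only
  ungroup() %>%
  pairwise_cor(word, video_id, sort = TRUE)

# Correlations involving the most covered languages
word_cors %>%
  filter(item1 %in% c("python", "javascript", "css", "c"))
```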

Network Representation of Pairwise Correlations between Well-covered Programming Languages and Other Wordstems

#8 Sentiment Analysis (Lexicon-based)

In this section, I do sentiment analysis on the video captions. My aspiration with this sentiment analysis is to learn more about which videos have the most positive and negative sentiment, and to demonstrate the importance of domain-specific lexicons and disambiguation. I use the AFINN sentiment lexicon to calculate the sentiment scores of videos. Similarly to before, I tokenize the video captions into a one-token-per-row dataframe, inner join it with the AFINN sentiment lexicon, and sum the tokens’ sentiment values by video. Now I can list the 4 videos with the most positive and the most negative aggregate sentiment.
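
A sketch of the aggregation; get_sentiments("afinn") returns a word/value table, and the video_id and title columns are assumptions:

```r
# Hedged sketch: aggregate AFINN sentiment per video.
library(tidytext)

afinn <- get_sentiments("afinn")   # downloads the lexicon via the textdata package on first use

video_sentiment <- tidy_captions %>%
  inner_join(afinn, by = "word") %>%
  group_by(video_id, title) %>%
  summarise(sentiment = sum(value), .groups = "drop") %>%
  arrange(desc(sentiment))
```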

It somehow looks like game development is associated with rather negative sentiment, while 2 of the top 4 most positive videos were about some kind of social event. Although I can see why the latter may be the case, I was not really convinced that game development must be something terrible, so I looked into which words drive the aggregated sentiment scores the most for the particular videos. I measure sentiment contribution by taking the sentiment score of a word and multiplying it by its number of occurrences.
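
That contribution measure, as a sketch:

```r
# Hedged sketch: contribution = AFINN value × number of occurrences, per video and word.
contributions <- tidy_captions %>%
  inner_join(afinn, by = "word") %>%
  count(video_id, word, value) %>%
  mutate(contribution = n * value) %>%
  arrange(desc(abs(contribution)))
```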

Top 3 Words with Highest Positive and Negative Sentiment Contribution per Video

Obviously, there are some serious red flags here. We can see, for example, that for game development videos the negative contributor words are all part of the domain language used in game development. So words like “attack”, “enemy” and “kill” should be assigned a neutral sentiment score, as they are just integral elements of the game development process. This is also the case with the error handling video, where sentiment is mostly determined by the term error which, since it is the central element of the video’s content, should be considered neutral. We can also see that we assign a large positive value to the word “like”, which is again problematic, as the word “like” can have dozens of meanings, many of which are not positive at all. In cases like this, we need to resolve the ambiguity, and that is why context matters so much. This is also why text analysis in general, and sentiment analysis especially, is much more complex than joining tables and counting words: we may get fooled pretty easily.

Words with the Highest Sentiment Contributions

To further examine the usability of the AFINN lexicon on this video caption data, I also look at words that have a high sentiment contribution but are preceded by negation. As we can see, most of the words with a high sentiment contribution imply the opposite sentiment when negated; see “no good”, “not good” and “no errors” for example. However, it may also happen that the negated form becomes neutral or even stays consistent with the original meaning. For instance, “no matter” is still positive, while “not like” can be either negative or completely neutral based on the context.
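
A sketch of that check, reusing the bigram dataframe from the co-occurrence section; the list of negation words is my assumption:

```r
# Hedged sketch: AFINN contribution of words that directly follow a negation term.
negations <- c("not", "no", "never", "without")

negated_words <- bigrams_separated %>%
  filter(word1 %in% negations) %>%
  inner_join(afinn, by = c("word2" = "word")) %>%
  count(word1, word2, value, sort = TRUE) %>%
  mutate(contribution = n * value)
```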

Sentiment Contributions of Negated Words

#9 Topic Modeling (LDA)

In this final section of my analysis, I touch upon the use of topic modeling for this video caption data. One of the ways topic modeling can be done is LDA (Latent Dirichlet Allocation). This method assumes that each document (here video) is a mixture of topics, while each topic is effectively a combination of words. With this approach, we can have overlaps between topics and calculate the probabilities of each word appearing in one of the k topics.

I run the LDA on the tokenized video caption data by converting it into a document-term matrix, fitting the LDA model on it, and then tidying the result back into a tidytext table for visualization. Using the beta values, we can discover which words are most associated with each of the predefined number of topics. I chose to have six topics, but this choice was entirely ad hoc. Based on the chart, the topics do not really represent a technical classification at first sight; each is rather just a collection of words that relate to each other.
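
A sketch of the pipeline with the topicmodels package; the seed and the number of top terms shown are arbitrary choices of mine:

```r
# Hedged sketch: cast to a document-term matrix, fit LDA with k = 6, tidy the betas.
library(topicmodels)
library(tidytext)

caption_dtm <- tidy_captions %>%
  count(video_id, word_stem) %>%
  cast_dtm(video_id, word_stem, n)

caption_lda <- LDA(caption_dtm, k = 6, control = list(seed = 1234))

top_terms <- tidy(caption_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
```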

Words Associated Most with the 6 Topics Created by LDA

However, after classifying the videos for which we are most certain about the topic assignment, we can see that the videos within a topic are indeed pretty similar based on their titles. For example, the first topic was assigned to a bunch of videos covering web development (CSS, HTML, UI design). The second topic seems to encapsulate videos that are more about data structures and related mathematical or programming concepts. Moreover, the third topic mostly includes Git and “Live Coding with Jesse” videos. The fourth topic clearly represents videos about career development and social events, while the fifth topic seems to cover building web applications. Finally, the sixth topic is apparently a group of videos teaching elementary Computer Science.
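
The per-video classification can be sketched from the gamma matrix; the 0.9 confidence cutoff is an arbitrary assumption:

```r
# Hedged sketch: assign each video to its most probable topic and keep confident cases only.
video_topics <- tidy(caption_lda, matrix = "gamma") %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup() %>%
  filter(gamma > 0.9)
```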

#10 Summary

We covered a lot in this analysis. First, we looked at how to download video captions and metadata for YouTube channels with the easy-to-use youtube-dl tool. Then we looked at the general trends and characteristics of the videos so that we had context in mind during the text analysis. After that, we discussed text preprocessing and the nuances specific to this video data. Then we jumped into analyzing term frequencies, trying to understand the profile of the channel and which programming languages are covered the most. From there, we moved on to TF-IDF values, which showed us which keywords and phrases are most distinctive for different groups of videos. After TF-IDF, we also concerned ourselves with co-occurrence: we visualized networks representing the bigrams and the pairwise correlations between terms, and explored which terms are most related to some of the programming languages like Python, JavaScript, or CSS. Following that, we touched upon sentiment analysis, where we realized that we should be really careful with the interpretation of results and acknowledge the great deal of complexity that comes with text analysis, which we ignore when applying a plain lexicon-based approach. Finally, we used LDA topic modeling to group the videos into topics.

One thing that I would definitely like to highlight as an ending note is that when looking at text analytics results like the ones presented here, we should always keep in mind that our interpretation does not always correspond to what the data actually shows. The human ability to make generalizations and simplifications subconsciously in a split second is sometimes a blessing and sometimes a huge threat to the integrity of the analysis.
