Introduction
A few months ago, I wrote a post describing all of the movies I watched in 2020 through a series of summary statistics. Since then, I’ve continued my little hobby of updating my database as I watch movies. I’ve started to think about what this movie database actually represents: a collection of all the movies that interested me enough to devote two hours of my day to staring at a screen. I gave some of them high ratings, others low ratings. Surely there must be some commonalities among the movies I like, and others among the movies I dislike?
Everything has descriptive data associated with it, and movies are no exception. Runtime, budget, genre, director, actors… All of these factors are metadata: data that describes the movie itself, but is not what we would consider “the movie”. I figured there must be some commonalities among the metadata of the movies that I like. For example, I tend to prefer movies that are around two hours long. I also imagine I have an implicit bias toward rating big-budget movies higher than movies made on a smaller budget. And there are certainly a few genres I’m particularly partial to (science fiction and westerns, for example).
Other metadata, I imagine, have less of an impact on how I rate movies. For example, in my previous post I showed that I don’t preferentially rate any given decade higher than another. And while I like certain actors, I don’t think any of them would boost my rating of a movie on their own.
There are statistical tools that allow us to explore some of these relationships. One tool that has been piquing my fancy lately is correlation analysis. Performing a correlation analysis on my movie database can give me some insight into how each variable in a movie’s metadata influences my overall rating, as well as how the variables might correlate with each other.
Media platforms use this type of analysis, among many others, to recommend content that will keep you glued to the screen. Recommendation engines are essentially correlation analyses paired with regression models, souped up on terabytes of data. This analysis is a smaller, less robust version of what companies such as Netflix are already doing with our data. Those companies have access to many other factors that I didn’t have access to, or didn’t bother recording, such as the time of day a movie was watched, what my friends thought of the movie, and so on. Still, hopefully I can find some interesting correlations in my small dataset.
The Analysis
I used R as my coding language. For my correlation analysis, I started by importing my dataset.
## Setting the working directory and importing my dataset
setwd("C:/Users/alex/filepath")
m = read.csv("movie_database.csv")
I used a package called corrplot, which has a number of handy tools built into it for this type of analysis.
## Installing (first time only) and loading the corrplot package
install.packages("corrplot")
library(corrplot)
My database has a number of columns that I’m not interested in using for this analysis. For that reason, I created a vector of just the columns I wanted to look at and used it to subset the data frame. Any rows with missing information needed to be removed, and then I needed to make sure the data was being recognized by R as numbers rather than characters.
## Creating a list of the columns I want to keep
corr_want = c("personal_rating", "r.direction", "r.cinematography", "r.score",
              "r.acting", "r.screenplay", "year", "rewatch", "budget", "runtime", "nudity")
corr_frame = m[corr_want]                     # keep only those columns
corr_frame = na.omit(corr_frame)              # drop rows with missing values
corr_frame = sapply(corr_frame, as.numeric)   # coerce every column to numeric
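A quick sanity check helps confirm the conversion worked. (Note that sapply() actually returns a numeric matrix here rather than a data frame, which is fine, since cor() happily accepts a matrix.)

## Sanity check: every column should now be numeric
str(corr_frame)       # structure and types of the cleaned data
summary(corr_frame)   # quick look at the ranges, and anything odd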
At this point, I was pretty much ready to go! I could now perform the correlation analysis. The cor() function calculated the pairwise correlations between the columns of the dataset. Once that matrix was generated, the corrplot() function plotted the correlations in an easy-to-read manner.
## Correlation analysis then plotting
correlations = cor(corr_frame)
corrplot(correlations, method = "circle", order = "FPC")   # "FPC" orders variables by the first principal component
Let’s break down this plot. The first thing you’ll likely notice is that I have a “personal rating” column, as well as a number of “r.________” columns, all of which are highly correlated with one another. In my personal rating column, I record how I felt about the movie in general, regardless of how good or bad I thought it was objectively (I like campy horror, for example, so I might give a “bad” horror movie a high personal rating).
Each of the “r.________” columns corresponds to how I rate that particular aspect of the movie. For example, the movie O Brother, Where Art Thou? was given a 10/10 in the r.score column of my database. I try to make these ratings as objective as possible.
After r.score, you can see the metadata I included in this analysis: whether or not I was rewatching the movie, whether it contained nudity, its runtime, its year of release and its budget. There were a few other aspects I wanted to include, but the matrix started to become difficult to read on a small screen. I may delve into additional aspects in the future.
Any given square is the correlation between the two variables that intersect at that square. There is a diagonal line running from the top-left box to the bottom-right box, which shows that there is a perfect correlation when you compare a variable to itself (which hopefully makes sense). The darker blue the circle, the stronger the positive correlation; the darker red, the stronger the negative correlation. Circles that are very light indicate little to no correlation.
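If a particular circle catches your eye, the exact number behind it can be pulled straight out of the correlation matrix, since cor() keeps the column names. The pair below is just an example:

## Looking up the exact correlation behind any square in the plot
correlations["runtime", "budget"]    # a single pair of variables
correlations["personal_rating", ]    # everything versus my personal rating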
This is all great, but these correlations are basically just observations. With this matrix alone, I couldn’t be sure whether the correlations I discovered were statistically significant. Luckily, the corrplot package has some built-in tools to help with that. The cor.mtest() function returned a list of matrices containing the relevant p-values and confidence intervals for every pair of variables. I then added the p-values to the previous plot.
## Finding the p-values
testRes = cor.mtest(corr_frame, conf.level = 0.95)
corrplot(correlations, p.mat = testRes$p, insig = 'p-value', sig.level = -1)   # sig.level = -1 prints every p-value on the plot
In statistics, we generally consider something statistically significant when it has a p-value < 0.05. In this case, any “0”s on the grid represent p-values less than 0.01 that display as “0” due to rounding. This means that the squares showing a “0” p-value are highly significantly correlated.
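If you want the unrounded numbers behind those “0”s, the p-value matrix is sitting inside testRes and can be inspected directly. (A quick sketch; cor.mtest() may not carry the variable names over, so I copy them from the correlation matrix first.)

## Inspecting the raw p-values rather than the rounded labels on the plot
p = testRes$p
dimnames(p) = dimnames(correlations)      # reattach the variable names
round(p, 4)                               # the whole matrix, to 4 decimal places
p["personal_rating", "rewatch"]           # a single pair at full precision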
You may notice that there seem to be clusters of variables that are correlated. Again, the corrplot package contains tools that let me visualize those clusters.
## Visualizing the clusters of correlated variables
corrplot(correlations, p.mat = testRes$p, sig.level = 0.10, order = 'hclust', addrect = 2)   # addrect = 2 boxes the two hierarchical clusters
In this plot, you can see the clustered variables outlined with a dark box. It seems that there are two clusters: my ratings, and the metadata for the movie. Boxes with an “X” through them were not statistically significant at the 0.10 significance level (p > 0.10).
Even though I could see the strength of each association and its statistical significance, I still didn’t know the range of values the true association could plausibly take. To visualize this, I looked at the confidence intervals for the strength of each correlation. The 95% confidence interval gives the range of values within which we can be 95% confident the true correlation lies. We can plot this using the corrplot() function again.
## Visualizing the confidence intervals
corrplot(correlations, p.mat = testRes$p, lowCI.mat = testRes$lowCI, uppCI.mat = testRes$uppCI,
         addrect = 3, rect.col = 'navy', plotCI = 'rect', cl.pos = 'n', type = "lower")
Again, an “X” through a box indicates that the correlation is not statistically significant.
In this case, I kept just the lower half of the graph (type = "lower"), since the full matrix repeats every correlation twice and the redundancy can be overwhelming. You can see that all of my rating categories are highly correlated, and with high confidence. To me, this means there is likely bias in how I “objectively” rate each movie in each of my “r.________” rating categories.
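To put a single number on how tightly the rating columns move together, one can average the pairwise correlations among just those columns. A quick sketch, using the column names from my subset:

## Average pairwise correlation among my rating columns
rating_cols = c("personal_rating", "r.direction", "r.cinematography",
                "r.score", "r.acting", "r.screenplay")
rating_block = correlations[rating_cols, rating_cols]
mean(rating_block[upper.tri(rating_block)])   # mean of the off-diagonal entries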
You can also see that the negative correlation between budget and how much I enjoy a movie could plausibly be very close to zero. I’ll be interested to see how this correlation changes as more data is added to my dataset. Budget also had a statistically significant positive association with runtime, which makes sense to me, given that each minute added to a movie requires more editing, more acting, and so on.
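For instance, the interval behind that budget circle can be read off directly (again copying the variable names over, in case cor.mtest() didn’t keep them):

## The 95% confidence interval for the budget vs. personal rating correlation
low = testRes$lowCI
upp = testRes$uppCI
dimnames(low) = dimnames(correlations)
dimnames(upp) = dimnames(correlations)
c(lower = low["budget", "personal_rating"], upper = upp["budget", "personal_rating"])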
I thought the association between rewatched movies and my ratings was interesting. It makes sense to me: I am more likely to rewatch movies that I enjoy.
Interestingly, there is a statistically significant positive correlation between nudity and runtime. I would imagine there are some variables not included in this analysis that are biasing the results (perhaps R-rated movies are longer on average?). I’ll have to explore that another time.
Conclusion
I found this to be a fun exercise in analyzing a fairly messy dataset that was generated by my own preferences. It would be fascinating to see how Netflix, Hulu and other streaming platforms do this sort of analysis as well, though I imagine that’s mostly proprietary information.
Let me know if you have any ideas for other things I could do with this data in the future! My dataset is only gaining more entries as time goes on and I watch more movies. My next step with this project is to create a linear regression model to predict how I will rate unwatched movies, and to see what factors are most important in determining how I rate them.
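As a preview, here is a minimal sketch of what that first model might look like, using the same cleaned data. The choice of metadata predictors is just a starting assumption, not a final model:

## A first-pass linear model: predict my personal rating from the metadata
movie_df = as.data.frame(corr_frame)   # lm() wants a data frame, not a matrix
fit = lm(personal_rating ~ year + rewatch + budget + runtime + nudity, data = movie_df)
summary(fit)   # coefficient estimates and overall fit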