Netflix Data Final Analysis: Duration, Release Year, and Growth Trends
Introduction
The purpose of this project is to
examine patterns in Netflix content using a dataset of 8,807 titles. This
analysis focuses on three main questions: whether movies and TV shows are
released in different average years, whether movie duration varies across
rating categories such as G, PG-13, and TV-MA, and whether the number of titles
added to Netflix has increased over time. Together, these questions help
explore how Netflix’s content has evolved, how rating categories relate to the
length of films, and the overall growth of Netflix’s library in recent years.
This project uses the same
inferential techniques practiced in class, including two-sample t-tests to
compare group means, ANOVA to test for differences across multiple categories,
and linear regression to examine trends over time. Together, these statistical
methods allow us to draw evidence-based conclusions from the data rather than
relying solely on descriptive summaries.
Dataset Preparation
The original dataset
(netflix_titles.csv) contains 8,807 entries. To prepare the data for analysis,
movie durations were cleaned by extracting the numeric values from the
“duration” column, converting entries such as “90 min” into a usable numeric
format. The date_added variable was also converted into a proper date format,
allowing the year_added field to be extracted for trend analysis. After
cleaning, separate datasets were created for movies and for examining
year-added patterns to ensure accurate and focused statistical testing.
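The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the project: the helper names parse_minutes and parse_year_added are hypothetical, the sample rows are made up, and the date_added values are assumed to look like "September 25, 2021".

```python
import re
from datetime import datetime

def parse_minutes(duration):
    """Return the leading integer from strings like '90 min', else None."""
    match = re.match(r"\s*(\d+)\s*min", duration or "")
    return int(match.group(1)) if match else None

def parse_year_added(date_added):
    """Return the year from strings like 'September 25, 2021', else None.

    Assumes an English 'Month day, year' format; blank or missing
    values yield None instead of raising.
    """
    try:
        return datetime.strptime(date_added.strip(), "%B %d, %Y").year
    except (AttributeError, ValueError):
        return None

# Made-up rows standing in for netflix_titles.csv records.
rows = [
    {"type": "Movie", "duration": "90 min", "date_added": "September 25, 2021"},
    {"type": "TV Show", "duration": "2 Seasons", "date_added": " August 4, 2017"},
    {"type": "Movie", "duration": "104 min", "date_added": ""},
]

# Movie subset with numeric durations, used for the duration analysis.
movie_minutes = [parse_minutes(r["duration"]) for r in rows if r["type"] == "Movie"]

# Year-added subset, dropping rows whose date could not be parsed.
years_added = [y for y in (parse_year_added(r["date_added"]) for r in rows) if y]
```

Returning None for unparseable values keeps TV show durations such as "2 Seasons" and blank dates out of the numeric analyses without stopping the run.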
T-Test: Release Year for Movies vs. TV Shows
The hypotheses for this part of the
analysis examine whether the average release year differs between movies and TV
shows. The null hypothesis (H₀) states that movies and TV shows share the same
mean release year, while the alternative hypothesis (H₁) proposes that their
average release years are different.
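To make the test concrete, the sketch below computes a two-sample t statistic by hand. It assumes the unequal-variance (Welch) form, which may or may not match the exact test run for this project, and it uses made-up release years rather than the Netflix data. The function name welch_t is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error contributed by each group.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Made-up release years, for illustration only.
movie_years = [2010, 2012, 2015, 2016, 2018]
show_years = [2016, 2017, 2018, 2019, 2020]

t_stat, dof = welch_t(movie_years, show_years)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A negative t here means the first group (movies) has the earlier mean release year, which is the same sign convention as the result reported below.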
Results:
The t-test showed a clear
difference in release years between movies and TV shows. The test
statistic was t = −20.976 with a p-value below 2.2e-16, which is strong
evidence against the null hypothesis. The 95% confidence interval for the
difference in means (movies minus TV shows) ranged from −3.81 to −3.16 years
and excludes zero, so movies are estimated to be roughly 3.2 to 3.8 years
older on average. In the dataset, the mean release year was 2013.12 for
movies and 2016.61 for TV shows.
Interpretation:
The results show a highly
significant difference in release year between movies and TV shows (p <
0.001). TV shows on Netflix tend to be more recent releases (mean = 2016.6)
than movies (mean = 2013.1). This suggests that Netflix’s TV catalog skews
toward newer releases, while its movie catalog includes more older titles.
ANOVA: Movie Duration Across Ratings
The hypotheses for this part of the
analysis focus on whether movie duration varies across different rating
categories. The null hypothesis (H₀) states that the mean movie duration is the
same for all rating categories, while the alternative hypothesis (H₁) proposes
that at least one rating category differs in its average duration.
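The F-statistic behind a one-way ANOVA compares the variation between group means with the variation within groups. The sketch below shows that calculation on made-up durations for three ratings. It is an illustration only: the function name one_way_anova and the sample values are hypothetical, and the real analysis involves many more rating categories.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of group name -> values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up movie durations in minutes, for illustration only.
durations = {
    "TV-Y": [45, 60, 52],
    "PG-13": [105, 110, 98],
    "TV-MA": [95, 120, 101],
}

f_stat, df_between, df_within = one_way_anova(durations)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The degrees of freedom follow directly from the data: the first is the number of groups minus one, and the second is the number of observations minus the number of groups.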
Results:
The ANOVA results indicated a
significant difference in movie duration across the various rating categories.
The test produced an F-statistic of F(13, 6112) = 98.11 with a p-value
less than 2e-16, providing strong evidence that not all rating categories share
the same mean duration. In other words, average movie length differs by
rating, although the test by itself does not identify which categories differ.
Interpretation:
The ANOVA test indicates a
statistically significant difference in movie duration across rating categories
(p < 0.001). This means movie length varies depending on the
film’s rating. The boxplot supports this by showing categories such as NC-17
and TV-MA tending to have longer films, while TV-Y and TV-G movies are much
shorter on average.
Regression: Number of Titles Added per Year
The regression model used to analyze the trend in Netflix content
additions over time was specified as: count_of_titles_added predicted by year_added,
expressed as count_of_titles_added ~ year_added.
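As a rough illustration of this model, the sketch below counts titles per year and fits a least-squares line with one observation per year. The function name simple_ols and the year_added values are hypothetical stand-ins, not the project's data or code.

```python
from collections import Counter
from statistics import mean

def simple_ols(x, y):
    """Return (slope, intercept, r_squared) for a least-squares line y ~ x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # R^2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Made-up year_added values, one entry per title, for illustration only.
year_added = [2016, 2016, 2017, 2017, 2017, 2018, 2018, 2018, 2018, 2018]

# Aggregate to one row per year: the number of titles added that year.
counts = Counter(year_added)
years = sorted(counts)
titles_added = [counts[y] for y in years]

slope, intercept, r_squared = simple_ols(years, titles_added)
print(f"slope = {slope:.2f}, R^2 = {r_squared:.4f}")
```

The slope is the estimated change in the annual count for each one-year step, and R² is the share of the variation in annual counts that the fitted line accounts for.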
Results:
The regression analysis revealed a
strong upward trend in the number of titles added to Netflix over time. The
slope estimate was 169.05, indicating that the number of titles added per year
grew by approximately 169 from one year to the next, on average. This
relationship was statistically significant, with a p-value of 5.76e-05. The
model also produced an R² value of 0.7531, meaning that about 75% of the
variation in the number of titles added per year is explained by a linear
trend in year.
Interpretation:
There is a strong and statistically
significant upward trend in the number of titles added to Netflix each year (p <
0.001). The slope estimate (169.05) indicates that each year’s additions
exceeded the previous year’s by about 169 titles on average. The R² of 0.7531
shows that about 75% of the variation in annual title additions is explained
by a linear trend in year, reflecting a major expansion in Netflix’s library.
Final Summary
This project analyzed 8,807 Netflix
titles to examine how content characteristics differ across type, rating, and
time. A t-test showed a significant difference in release years between movies
and TV shows, with TV shows being more recent on average (p <
0.001). An ANOVA test revealed strong differences in movie duration across
rating categories (p < 0.001), indicating that content length is
related to audience rating. A linear regression model demonstrated a
significant upward trend in the number of titles added to Netflix each year (p <
0.001, R² = 0.75). These findings suggest that Netflix’s catalog has expanded
rapidly in recent years, with notable differences in content type and rating
patterns.
References
Shivam B. (n.d.). Netflix Shows Dataset. Kaggle. Retrieved from https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download