Netflix Data Final Analysis: Duration, Release Year, and Growth Trends

Introduction

           The purpose of this project is to examine patterns in Netflix content using a dataset of 8,807 titles. The analysis focuses on three questions: whether movies and TV shows differ in average release year, whether movie duration varies across rating categories such as G, PG-13, and TV-MA, and whether the number of titles added to Netflix has increased over time. Together, these questions explore how Netflix’s content has evolved, how rating categories relate to film length, and how the library has grown in recent years.

          This project uses the inferential techniques practiced in class: a two-sample t-test to compare two group means, ANOVA to test for differences across multiple categories, and linear regression to examine a trend over time. These methods allow us to draw evidence-based conclusions from the data rather than relying solely on descriptive summaries.

Dataset Preparation

         The original dataset (netflix_titles.csv) contains 8,807 entries. To prepare the data for analysis, movie durations were cleaned by extracting the numeric values from the “duration” column, converting entries such as “90 min” into a usable numeric format. The date_added variable was also converted into a proper date format, allowing the year_added field to be extracted for trend analysis. After cleaning, separate datasets were created for movies and for examining year-added patterns to ensure accurate and focused statistical testing.
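The two cleaning steps described above can be sketched in Python using only the standard library. This is a minimal illustration, not the code used for the analysis: the helper names `parse_minutes` and `parse_year_added` are hypothetical, the sample rows are made up, and the date layout ("September 25, 2021") is an assumption about how date_added is written.

```python
import re
from datetime import datetime


def parse_minutes(duration):
    """Return the number of minutes from a string like '90 min', else None."""
    match = re.fullmatch(r"\s*(\d+)\s*min\s*", duration or "")
    return int(match.group(1)) if match else None


def parse_year_added(date_added):
    """Return the year from a string like 'September 25, 2021', else None."""
    try:
        # Assumed layout: full month name, day, comma, four-digit year.
        return datetime.strptime((date_added or "").strip(), "%B %d, %Y").year
    except ValueError:
        return None


# Hypothetical sample rows, not taken from the dataset.
rows = [
    {"type": "Movie", "duration": "90 min", "date_added": "September 25, 2021"},
    {"type": "TV Show", "duration": "2 Seasons", "date_added": " August 4, 2017"},
    {"type": "Movie", "duration": "", "date_added": ""},
]

# Durations are only meaningful in minutes for movies, so filter by type first.
movie_minutes = [parse_minutes(r["duration"]) for r in rows if r["type"] == "Movie"]
years_added = [parse_year_added(r["date_added"]) for r in rows]

print(movie_minutes)  # [90, None]
print(years_added)    # [2021, 2017, None]
```

Returning None for unparseable entries keeps missing values visible, so they can be dropped explicitly before any test is run.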

T-test: Release Year for Movies vs. TV Shows

          The hypotheses for this part of the analysis examine whether the average release year differs between movies and TV shows. The null hypothesis (H₀) states that movies and TV shows share the same mean release year, while the alternative hypothesis (H₁) proposes that their average release years are different.
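To make the test statistic concrete, here is a small Python sketch of a two-sample t statistic. It assumes the unequal-variance (Welch) form, which the write-up does not state explicitly, and it runs on made-up release years rather than the Netflix data. The function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance uses n - 1).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Made-up release years for illustration only.
movie_years = [2008, 2012, 2015, 2016, 2019]
tv_years = [2015, 2017, 2018, 2019, 2020]

t_stat, df = welch_t(movie_years, tv_years)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

A negative t here means the first group (movies) has the lower mean, which is the same sign convention as a difference computed as movies minus TV shows.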

[Screenshot: code and output for the two-sample t-test]

Results:

              The t-test showed a clear difference in release years between movies and TV shows. The test statistic was t = −20.976 with a p-value below 2.2e-16, strong evidence against the null hypothesis. On average, movies in the dataset were released in 2013.12, while TV shows had a more recent mean release year of 2016.61, a difference of about 3.5 years. The 95% confidence interval for the difference in means (movies minus TV shows) ranged from −3.81 to −3.16; because it excludes zero, it is consistent with the significant test result.
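As a quick consistency check, the reported numbers can be tied together with a few lines of arithmetic. The sketch below assumes the interval is for movies minus TV shows and uses the normal critical value 1.96 as an approximation to the t critical value, which is reasonable only because the samples are large.

```python
# Values reported in the results above.
mean_movies = 2013.12
mean_tv = 2016.61
t_stat = -20.976

diff = mean_movies - mean_tv   # difference in means (movies minus TV shows)
std_error = diff / t_stat      # because t = diff / standard error
z_crit = 1.96                  # assumed: normal approximation to the 95% critical value

ci_low = diff - z_crit * std_error
ci_high = diff + z_crit * std_error
print(f"diff = {diff:.2f}, SE = {std_error:.3f}, CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The reconstructed interval agrees with the reported one to within rounding of the published means, so the test statistic, the means, and the interval are mutually consistent.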

Interpretation:

           The difference in release year between movies and TV shows is highly significant (p < 0.001). TV shows on Netflix tend to have been released more recently (mean = 2016.6) than movies (mean = 2013.1). This suggests that Netflix’s TV catalog skews toward newer releases than its movie catalog. Because release year is not the date a title was added, the result does not by itself show how often Netflix adds new titles of either type.

ANOVA: Movie Duration Across Ratings

         The hypotheses for this part of the analysis focus on whether movie duration varies across different rating categories. The null hypothesis (H₀) states that the mean movie duration is the same for all rating categories, while the alternative hypothesis (H₁) proposes that at least one rating category differs in its average duration.
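The F statistic behind a one-way ANOVA compares variation between group means with variation inside the groups. The Python sketch below shows that calculation on made-up durations. It is an illustration under stated assumptions, not the code used for the analysis, and the function name `one_way_anova` is hypothetical.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` maps a category label to a list of numeric observations.
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    group_means = {label: mean(values) for label, values in groups.items()}
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(
        len(values) * (group_means[label] - grand_mean) ** 2
        for label, values in groups.items()
    )
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum(
        (x - group_means[label]) ** 2
        for label, values in groups.items()
        for x in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up durations in minutes, for illustration only.
durations = {
    "PG-13": [100, 110, 120],
    "TV-MA": [90, 100, 110],
    "TV-Y": [40, 50, 60],
}

f_stat, df_between, df_within = one_way_anova(durations)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")  # F(2, 6) = 31.00
```

The two degrees of freedom are the number of categories minus one and the number of observations minus the number of categories, which is how a result such as F(13, 6112) should be read.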

[Screenshot: code and output for the ANOVA]

Results:

        The ANOVA indicated a significant difference in movie duration across rating categories. The test produced F(13, 6112) = 98.11 with a p-value below 2e-16, strong evidence that not all rating categories share the same mean duration. The degrees of freedom imply 14 rating categories and 6,126 movies with usable values. The test itself does not identify which categories differ.

[Figure: boxplot of movie duration by rating category]

Interpretation: 

             The ANOVA test indicates a statistically significant difference in movie duration across rating categories (p < 0.001), meaning that at least one rating category differs from the others in mean length. The boxplot suggests where the differences lie: categories such as NC-17 and TV-MA tend to have longer films, while TV-Y and TV-G movies are much shorter on average.

Regression: Number of Titles Added per Year

             The regression model used to analyze the trend in Netflix content additions over time was specified as: count_of_titles_added predicted by year_added, expressed as count_of_titles_added ~ year_added.
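The model above needs one row per year, so the titles first have to be counted by year_added. The Python sketch below shows that aggregation followed by an ordinary least squares fit, using made-up values. The function name `fit_line` is hypothetical, and this is not the code used for the analysis.

```python
from collections import Counter


def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    syy = sum((y - y_bar) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # In simple linear regression, R^2 is the squared correlation of x and y.
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up year_added values, one entry per title, for illustration only.
year_added = [2016] * 2 + [2017] * 3 + [2018] * 5 + [2019] * 6

# Aggregate to one count per year, then fit count ~ year.
counts = Counter(year_added)
years = sorted(counts)
titles_per_year = [counts[y] for y in years]

slope, intercept, r_squared = fit_line(years, titles_per_year)
print(f"slope = {slope:.2f}, R^2 = {r_squared:.2f}")  # slope = 1.40, R^2 = 0.98
```

The slope is read as the change in the fitted yearly count for each one-year step, which is the interpretation used for the estimate reported in this section.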

[Screenshot: code and output for the regression model]

Results:

           The regression analysis revealed a strong upward trend in the number of titles added to Netflix over time. The slope estimate was 169.05, meaning the fitted number of titles added rises by about 169 with each successive year. This relationship was statistically significant, with a p-value of 5.76e-05. The model’s R² of 0.7531 means that about 75% of the variation in yearly title counts is explained by a linear trend in year.

[Figure: plot of titles added per year]

Interpretation:

          There is a strong and statistically significant upward trend in the number of titles added to Netflix each year (p < 0.001). The slope estimate (169.05) indicates that, on average, each year’s additions exceeded the previous year’s by about 169 titles. The R² of 0.7531 shows that a linear trend in year accounts for about 75% of the variation in yearly additions, consistent with a major expansion of Netflix’s library.

Final Summary

            This project analyzed 8,807 Netflix titles to examine how content characteristics differ across type, rating, and time. A t-test showed a significant difference in release years between movies and TV shows, with TV shows being more recent on average (p < 0.001). An ANOVA test revealed strong differences in movie duration across rating categories (p < 0.001), indicating that content length is related to audience rating. A linear regression model demonstrated a significant upward trend in the number of titles added to Netflix each year (p < 0.001, R² = 0.75). These findings suggest that Netflix’s catalog has expanded rapidly in recent years, with notable differences in content type and rating patterns.

References

Shivam B. (n.d.). Netflix Shows Dataset. Kaggle. Retrieved from https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download
