๐ŸŽฌ Week Lesson: Hollywood Film Dataset Analysis

Programmers Picnic AI-ML Classes by Champak Roy

learnwithchampak.live | aiml.learnwithchampak.live

๐ŸŽฏ Lesson Objective

In this lesson, students will learn how to use a Hollywood-style movie dataset to analyze film success using Python, Pandas, visualization, and basic machine learning.

Python Pandas Data Analysis Machine Learning Film Success Prediction

๐Ÿ“ฅ Dataset Download

Click below to download the sample Hollywood movie dataset as a CSV file. Students can open it in Excel, Google Sheets, or Python.

Sample Dataset Preview

title genre budget_million revenue_million rating year
Sky Warriors Action 120 520 8.1 2019
Love in Paris Romance 35 150 7.2 2020
Dark Planet Sci-Fi 180 760 8.5 2021
Silent Tears Drama 20 65 7.8 2018

๐ŸŒ Real Hollywood Datasets (Download Links)

Use these real-world datasets to practice advanced analysis and machine learning. These datasets are widely used in data science projects.

๐ŸŽฌ TMDb 5000 Movie Dataset

Contains 5000 movies with metadata like budget, revenue, cast, and genres.

โฌ‡ Download from Kaggle

โญ IMDb Movie Dataset

Large dataset including ratings, votes, titles, and crew information.

โฌ‡ Official IMDb Dataset

๐Ÿ’ฐ Box Office Mojo Dataset

Focused on box office revenue, domestic and international earnings.

๐ŸŒ Visit Website

๐Ÿ“Š MovieLens Dataset

Useful for recommendation systems (user ratings and preferences).

โฌ‡ Download MovieLens

๐Ÿ’ก Pro Tip

Start with the small sample dataset, then move to these real datasets for:

  • More features (cast, crew, keywords)
  • Bigger data (better ML models)
  • Real-world complexity

๐Ÿง  What is a Hollywood Dataset?

A Hollywood film dataset contains information about movies such as title, genre, budget, revenue, rating, release year, director, and cast.

Budget: Money spent to make the film.
Revenue: Money earned by the film.
Profit: Revenue minus budget.
Rating: Audience or critic score.

๐Ÿ›  Step 1: Load the Dataset

import pandas as pd

df = pd.read_csv("hollywood_movies_sample.csv")

print(df.head())
print(df.info())

๐Ÿ” Step 2: Explore the Dataset

# Basic statistics
print(df.describe())

# Check missing values
print(df.isnull().sum())

# List all genres
print(df["genre"].unique())

๐Ÿ“Š Step 3: Add Profit and Success Columns

We create a new column called profit_million. Then we create a success label:

  • 1 = Successful film
  • 0 = Not successful film
df["profit_million"] = df["revenue_million"] - df["budget_million"]

df["success"] = df["profit_million"].apply(
    lambda x: 1 if x > 0 else 0
)

print(df[["title", "profit_million", "success"]])

๐Ÿ“ˆ Step 4: Visualize Budget vs Revenue

import matplotlib.pyplot as plt

plt.scatter(df["budget_million"], df["revenue_million"])
plt.xlabel("Budget in Million Dollars")
plt.ylabel("Revenue in Million Dollars")
plt.title("Budget vs Revenue")
plt.show()

๐Ÿ’ป Practice Python Online

Use the embedded Python editor below to run the Hollywood dataset code directly.

๐Ÿš€ Open Editor in New Tab

๐ŸŽญ Step 5: Genre-Based Analysis

genre_profit = df.groupby("genre")["profit_million"].mean()

print(genre_profit)

genre_profit.plot(kind="bar")
plt.xlabel("Genre")
plt.ylabel("Average Profit")
plt.title("Average Profit by Genre")
plt.show()

๐Ÿค– Step 6: Build a Simple ML Model

Goal: Predict whether a film will be successful using budget and rating.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X = df[["budget_million", "rating"]]
y = df["success"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

print("Accuracy:", model.score(X_test, y_test))

๐Ÿงช Student Assignments

Beginner

  • Find the top 5 highest revenue films.
  • Find the average rating of all films.
  • Find the total number of genres.

Intermediate

  • Create a profit column.
  • Find the most profitable genre.
  • Plot budget vs revenue.

Advanced

  • Build a hit/flop prediction model.
  • Try using genre as an input feature.
  • Create a Streamlit app for prediction.

โ“ MCQs

1. Profit is calculated as:

a) Budget + Revenue
b) Revenue - Budget โœ…
c) Rating - Budget
d) Year + Revenue

2. Which library is used for data analysis?

a) Pandas โœ…
b) Flask
c) HTML
d) CSS

3. What does success = 1 mean?

a) Film failed
b) Film has no rating
c) Film is profitable โœ…
d) Film has no budget

๐Ÿ Final Project

Create a Hollywood Hit Movie Predictor.

  • Input: Budget, rating, genre
  • Output: Hit or Flop
  • Tools: Python, Pandas, Scikit-learn, Streamlit