🎬 Week Lesson: Hollywood Film Dataset Analysis

🎯 Lesson Objective

In this lesson, students will learn how to use a Hollywood-style movie dataset to analyze film success using Python, Pandas, visualization, and basic machine learning.

Python · Pandas · Data Analysis · Machine Learning · Film Success Prediction

📥 Dataset Download

Download the sample Hollywood movie dataset as a CSV file. Students can open it in Excel, Google Sheets, or Python.

Sample Dataset Preview

title	genre	budget_million	revenue_million	rating	year
Sky Warriors	Action	120	520	8.1	2019
Love in Paris	Romance	35	150	7.2	2020
Dark Planet	Sci-Fi	180	760	8.5	2021
Silent Tears	Drama	20	65	7.8	2018
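
If the download is not at hand, the four preview rows above can be typed into a DataFrame and saved as a CSV. This is a minimal sketch: it contains only the preview rows (the full sample file may hold more films), and it overwrites any existing file named hollywood_movies_sample.csv in the working directory.

```python
import pandas as pd

# The four rows shown in the preview table above.
preview = pd.DataFrame(
    {
        "title": ["Sky Warriors", "Love in Paris", "Dark Planet", "Silent Tears"],
        "genre": ["Action", "Romance", "Sci-Fi", "Drama"],
        "budget_million": [120, 35, 180, 20],
        "revenue_million": [520, 150, 760, 65],
        "rating": [8.1, 7.2, 8.5, 7.8],
        "year": [2019, 2020, 2021, 2018],
    }
)

# Save without the index so the CSV has exactly the six preview columns.
preview.to_csv("hollywood_movies_sample.csv", index=False)

# Read it back to confirm the round trip.
df = pd.read_csv("hollywood_movies_sample.csv")
print(df)
```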

🌍 Real Hollywood Datasets (Download Links)

Use these real-world datasets to practice advanced analysis and machine learning. These datasets are widely used in data science projects.

🎬 TMDb 5000 Movie Dataset

Contains about 5,000 movies with metadata such as budget, revenue, cast, and genres.

⬇ Download from Kaggle

⭐ IMDb Movie Dataset

Large dataset including ratings, votes, titles, and crew information.

⬇ Official IMDb Dataset

💰 Box Office Mojo Dataset

Focuses on box office revenue, including domestic and international earnings.

🌐 Visit Website

📊 MovieLens Dataset

Useful for recommendation systems (user ratings and preferences).

⬇ Download MovieLens

💡 Pro Tip

Start with the small sample dataset, then move to these real datasets for:

More features (cast, crew, keywords)
More data (which usually helps ML models generalize)
Real-world complexity
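
Real datasets rarely share the sample file's column names or units, so a small conversion step is usually needed first. The sketch below reshapes a TMDb-style table into the lesson's columns. It assumes the Kaggle file's column names (`title`, `budget`, `revenue`, `vote_average`, `release_date`, and a `genres` column holding a JSON list) and that amounts are in dollars; check these against the file you actually download. The two rows are made-up placeholders, not real films, and `first_genre` and `to_lesson_schema` are hypothetical helper names.

```python
import json

import pandas as pd

# Made-up placeholder rows shaped like the assumed TMDb columns.
raw = pd.DataFrame(
    {
        "title": ["Example Film A", "Example Film B"],
        "budget": [50_000_000, 0],
        "revenue": [175_000_000, 12_000_000],
        "vote_average": [7.4, 6.1],
        "release_date": ["2015-06-12", "2012-03-02"],
        "genres": [
            '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
            '[{"id": 18, "name": "Drama"}]',
        ],
    }
)


def first_genre(genres_json):
    """Return the first genre name from a JSON list, or None if it is empty."""
    genres = json.loads(genres_json)
    return genres[0]["name"] if genres else None


def to_lesson_schema(raw_df):
    """Hypothetical helper: map TMDb-style columns to the lesson's columns."""
    out = pd.DataFrame(
        {
            "title": raw_df["title"],
            "genre": raw_df["genres"].apply(first_genre),
            "budget_million": raw_df["budget"] / 1_000_000,
            "revenue_million": raw_df["revenue"] / 1_000_000,
            "rating": raw_df["vote_average"],
            "year": pd.to_datetime(raw_df["release_date"]).dt.year,
        }
    )
    # Assumption: a budget of 0 means "unknown", so such rows are dropped.
    return out[out["budget_million"] > 0].reset_index(drop=True)


movies = to_lesson_schema(raw)
print(movies)
```

Keeping only the first listed genre is a simplification so that the result matches the single `genre` column of the sample dataset.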

🧠 What is a Hollywood Dataset?

A Hollywood film dataset contains information about movies such as title, genre, budget, revenue, rating, release year, director, and cast.

Budget: Money spent to make the film.

Revenue: Money earned by the film.

Profit: Revenue minus budget.

Rating: Audience or critic score.
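
To make the definitions concrete, here is the profit calculation for one row of the sample preview, Sky Warriors (budget 120, revenue 520, both in millions of dollars). The variable names are chosen only for illustration.

```python
# Values for "Sky Warriors" from the sample preview, in millions of dollars.
budget_million = 120
revenue_million = 520

# Profit is revenue minus budget.
profit_million = revenue_million - budget_million

print(f"Profit: {profit_million} million dollars")  # Profit: 400 million dollars
```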

🛠 Step 1: Load the Dataset

import pandas as pd

df = pd.read_csv("hollywood_movies_sample.csv")

print(df.head())
df.info()  # info() prints its own summary and returns None, so no print() is needed

🔍 Step 2: Explore the Dataset

# Basic statistics
print(df.describe())

# Check missing values
print(df.isnull().sum())

# List all genres
print(df["genre"].unique())
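
If the missing-value check reports non-zero counts, those rows need attention before profit can be computed. One simple option, sketched below, is to drop rows that lack a budget or a revenue. The small table is made up for illustration, with `None` standing in for a missing entry; whether dropping or filling is the better choice depends on your data.

```python
import pandas as pd

# Made-up rows for illustration; None marks a missing value.
films = pd.DataFrame(
    {
        "title": ["Film A", "Film B", "Film C"],
        "budget_million": [50.0, None, 20.0],
        "revenue_million": [90.0, 40.0, None],
    }
)

# Count missing values per column, as in the check above.
print(films.isnull().sum())

# Keep only rows where both budget and revenue are present.
clean = films.dropna(subset=["budget_million", "revenue_million"])
print(clean)
```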

📊 Step 3: Add Profit and Success Columns

We create a new column called profit_million. Then we create a success label:

1 = Successful film
0 = Not successful film

df["profit_million"] = df["revenue_million"] - df["budget_million"]

# 1 if the film made a profit, 0 otherwise
df["success"] = (df["profit_million"] > 0).astype(int)

print(df[["title", "profit_million", "success"]])
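
Before using `success` as a prediction target, it helps to check how many films fall into each class. The sketch below does this for the four preview rows, typed in by hand; the full CSV may give different counts.

```python
import pandas as pd

# The four preview rows (budget and revenue in millions of dollars).
df = pd.DataFrame(
    {
        "title": ["Sky Warriors", "Love in Paris", "Dark Planet", "Silent Tears"],
        "budget_million": [120, 35, 180, 20],
        "revenue_million": [520, 150, 760, 65],
    }
)

df["profit_million"] = df["revenue_million"] - df["budget_million"]
df["success"] = (df["profit_million"] > 0).astype(int)

# Number of films per class: 1 = successful, 0 = not successful.
counts = df["success"].value_counts()
print(counts)
```

All four preview rows are profitable, so only class 1 appears. A classifier cannot learn from a single class, so the data used in Step 6 needs some unprofitable films as well.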

📈 Step 4: Visualize Budget vs Revenue

import matplotlib.pyplot as plt

plt.scatter(df["budget_million"], df["revenue_million"])
plt.xlabel("Budget in Million Dollars")
plt.ylabel("Revenue in Million Dollars")
plt.title("Budget vs Revenue")
plt.show()
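
A scatter plot shows the relationship visually; a correlation coefficient summarizes it in one number. The sketch below computes the Pearson correlation between budget and revenue for the four preview rows. With so few points the value is only illustrative.

```python
import pandas as pd

# Budget and revenue of the four preview rows, in millions of dollars.
df = pd.DataFrame(
    {
        "budget_million": [120, 35, 180, 20],
        "revenue_million": [520, 150, 760, 65],
    }
)

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relationship.
corr = df["budget_million"].corr(df["revenue_million"])
print(f"Budget vs revenue correlation: {corr:.2f}")
```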

🎭 Step 5: Genre-Based Analysis

genre_profit = df.groupby("genre")["profit_million"].mean()

print(genre_profit)

genre_profit.plot(kind="bar")
plt.xlabel("Genre")
plt.ylabel("Average Profit in Million Dollars")
plt.title("Average Profit by Genre")
plt.show()
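
To read the result as a ranking, the grouped averages can be sorted, and `idxmax()` returns the genre with the highest average profit. The sketch uses the four preview rows, where each genre has a single film, so each average is simply that film's profit.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "genre": ["Action", "Romance", "Sci-Fi", "Drama"],
        "budget_million": [120, 35, 180, 20],
        "revenue_million": [520, 150, 760, 65],
    }
)
df["profit_million"] = df["revenue_million"] - df["budget_million"]

# Average profit per genre, highest first.
genre_profit = (
    df.groupby("genre")["profit_million"].mean().sort_values(ascending=False)
)
print(genre_profit)

# Genre with the highest average profit.
top_genre = genre_profit.idxmax()
print("Most profitable genre:", top_genre)
```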

🤖 Step 6: Build a Simple ML Model

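Goal: Predict whether a film will be successful using budget and rating. The data must include both successful and unsuccessful films: the four preview rows above are all profitable, and a classifier cannot be fitted on a single class. With a small dataset the test set is only a few films, so treat the accuracy as a rough indication.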

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

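# Features (inputs) and target (label to predict)
X = df[["budget_million", "rating"]]
y = df["success"]

# Hold out 20% of the films for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a logistic regression classifier on the training films
model = LogisticRegression()
model.fit(X_train, y_train)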

print("Accuracy:", model.score(X_test, y_test))
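
The following sketch runs the same steps end to end on data that contains both classes, and adds two safeguards: a class-count check before fitting, and `stratify=y` so the train and test sets keep similar class proportions. The data is synthetic, generated from an assumed toy rule purely for illustration; it says nothing about real films.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; these are not real films.
rng = np.random.default_rng(42)
n = 200
budget = rng.uniform(5, 200, size=n)
rating = rng.uniform(4, 9, size=n)
# Assumed toy rule: higher-rated films earn a larger multiple of their budget.
revenue = budget * (rating / 6.5) * rng.uniform(0.6, 1.6, size=n)

df = pd.DataFrame(
    {"budget_million": budget, "rating": rating, "revenue_million": revenue}
)
df["success"] = (df["revenue_million"] > df["budget_million"]).astype(int)

# Both classes must be present before fitting a classifier.
class_counts = df["success"].value_counts()
print(class_counts)

X = df[["budget_million", "rating"]]
y = df["success"]

# stratify=y keeps the class proportions similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
# Baseline: accuracy obtained by always predicting the most common class.
baseline = y_test.value_counts(normalize=True).max()
print(f"Accuracy: {accuracy:.2f} (majority-class baseline: {baseline:.2f})")
```

Comparing the accuracy with the majority-class baseline shows whether the model has learned anything beyond guessing the most common outcome.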

🧪 Student Assignments

Beginner

Find the top 5 highest revenue films.
Find the average rating of all films.
Find the total number of genres.

Intermediate

Create a profit column.
Find the most profitable genre.
Plot budget vs revenue.

Advanced

Build a hit/flop prediction model.
Try using genre as an input feature.
Create a Streamlit app for prediction.
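
A hint for the genre task: a model needs numeric inputs, so the text genre has to be encoded first. One common option is one-hot encoding with `pd.get_dummies`, sketched here on the four preview rows.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "genre": ["Action", "Romance", "Sci-Fi", "Drama"],
        "budget_million": [120, 35, 180, 20],
        "rating": [8.1, 7.2, 8.5, 7.8],
    }
)

# One column per genre; each row has a 1 in the column for its own genre.
X = pd.get_dummies(df, columns=["genre"], dtype=int)
print(X.columns.tolist())
print(X)
```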

❓ MCQs

1. Profit is calculated as:

a) Budget + Revenue
b) Revenue - Budget ✅
c) Rating - Budget
d) Year + Revenue

2. Which library is used for data analysis?

a) Pandas ✅
b) Flask
c) HTML
d) CSS

3. What does success = 1 mean?

a) Film failed
b) Film has no rating
c) Film is profitable ✅
d) Film has no budget

🏁 Final Project

Create a Hollywood Hit Movie Predictor.

Input: Budget, rating, genre
Output: Hit or Flop
Tools: Python, Pandas, Scikit-learn, Streamlit
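
One possible way to structure the predictor's core, before adding a Streamlit front end, is a scikit-learn pipeline that encodes genre and feeds all three inputs to a classifier. The sketch below is a starting point rather than a finished solution. The first four training rows are the preview rows; the last four are made-up unprofitable films added so that both classes appear. `predict_hit` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustration only: four preview rows plus four made-up flops.
train = pd.DataFrame(
    {
        "budget_million": [120, 35, 180, 20, 90, 60, 150, 10],
        "rating": [8.1, 7.2, 8.5, 7.8, 4.9, 5.3, 5.0, 4.5],
        "genre": ["Action", "Romance", "Sci-Fi", "Drama",
                  "Action", "Romance", "Sci-Fi", "Drama"],
        "success": [1, 1, 1, 1, 0, 0, 0, 0],
    }
)

preprocess = ColumnTransformer(
    [
        # Put the numeric inputs on a comparable scale.
        ("num", StandardScaler(), ["budget_million", "rating"]),
        # One-hot encode genre; genres unseen in training become all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
    ]
)

model = Pipeline(
    [("preprocess", preprocess), ("classifier", LogisticRegression())]
)
model.fit(train[["budget_million", "rating", "genre"]], train["success"])


def predict_hit(budget_million, rating, genre):
    """Return "Hit" or "Flop" for a single film (hypothetical helper)."""
    row = pd.DataFrame(
        {"budget_million": [budget_million], "rating": [rating], "genre": [genre]}
    )
    return "Hit" if model.predict(row)[0] == 1 else "Flop"


print(predict_hit(100, 8.0, "Action"))
```

Keeping the preprocessing inside the pipeline means a single `predict_hit` call applies the same scaling and encoding that were used during training, which is convenient when the function is later called from an app.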