Week Lesson: Hollywood Film Dataset Analysis
Programmers Picnic AI-ML Classes by Champak Roy
learnwithchampak.live | aiml.learnwithchampak.live
Lesson Objective
In this lesson, students will learn how to use a Hollywood-style movie dataset to analyze film success using Python, Pandas, visualization, and basic machine learning.
Python, Pandas, Data Analysis, Machine Learning, Film Success Prediction
Dataset Download
Click below to download the sample Hollywood movie dataset as a CSV file. Students can open it in Excel, Google Sheets, or Python.
Sample Dataset Preview
| title | genre | budget_million | revenue_million | rating | year |
|---|---|---|---|---|---|
| Sky Warriors | Action | 120 | 520 | 8.1 | 2019 |
| Love in Paris | Romance | 35 | 150 | 7.2 | 2020 |
| Dark Planet | Sci-Fi | 180 | 760 | 8.5 | 2021 |
| Silent Tears | Drama | 20 | 65 | 7.8 | 2018 |
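If the CSV download is not at hand, the four preview rows can be typed straight into a DataFrame. This is a minimal sketch that assumes only the columns shown in the preview; the variable name `sample_df` is a placeholder chosen here, not part of the lesson files.

```python
import pandas as pd

# The four preview rows from the table above, entered by hand.
sample_df = pd.DataFrame(
    {
        "title": ["Sky Warriors", "Love in Paris", "Dark Planet", "Silent Tears"],
        "genre": ["Action", "Romance", "Sci-Fi", "Drama"],
        "budget_million": [120, 35, 180, 20],
        "revenue_million": [520, 150, 760, 65],
        "rating": [8.1, 7.2, 8.5, 7.8],
        "year": [2019, 2020, 2021, 2018],
    }
)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # one dtype per column
```

Every later step that works on `df` can be tried on `sample_df` in the same way, although four rows are too few for the machine learning step.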
Real Hollywood Datasets (Download Links)
Use these real-world datasets to practice advanced analysis and machine learning. These datasets are widely used in data science projects.
TMDb 5000 Movie Dataset
Contains about 5,000 movies with metadata such as budget, revenue, cast, and genres.
Download from Kaggle
IMDb Movie Dataset
Large dataset including ratings, votes, titles, and crew information.
Official IMDb Dataset
Box Office Mojo Dataset
Focused on box office revenue, domestic and international earnings.
Visit Website
MovieLens Dataset
Useful for recommendation systems (user ratings and preferences).
Download MovieLens
Pro Tip
Start with the small sample dataset, then move to these real datasets for:
- More features (cast, crew, keywords)
- More data (which usually helps ML models generalize)
- Real-world complexity
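Real datasets rarely match the sample file's column names or units. The sketch below shows one way to reshape a TMDb-style table into the lesson's schema. It assumes the raw file has `title`, `budget` and `revenue` (both in US dollars), `vote_average`, and `release_date` columns; check the actual header of the file you download before relying on these names. The helper name `to_lesson_schema` and the two demo rows are hypothetical.

```python
import pandas as pd

def to_lesson_schema(raw: pd.DataFrame) -> pd.DataFrame:
    """Map TMDb-style columns (assumed names) onto the lesson's column names."""
    out = pd.DataFrame(
        {
            "title": raw["title"],
            # Dollars -> millions of dollars, to match the sample file.
            "budget_million": raw["budget"] / 1_000_000,
            "revenue_million": raw["revenue"] / 1_000_000,
            "rating": raw["vote_average"],
            "year": pd.to_datetime(raw["release_date"], errors="coerce").dt.year,
        }
    )
    # Assumption: a budget or revenue of 0 means "unknown", not "free",
    # so those rows are dropped instead of being treated as real values.
    keep = (out["budget_million"] > 0) & (out["revenue_million"] > 0)
    return out[keep].reset_index(drop=True)

# Two made-up rows, only to show the function running.
demo_raw = pd.DataFrame(
    {
        "title": ["Example A", "Example B"],
        "budget": [50_000_000, 0],
        "revenue": [125_000_000, 10_000_000],
        "vote_average": [6.9, 5.4],
        "release_date": ["2015-06-12", "2016-01-01"],
    }
)
demo = to_lesson_schema(demo_raw)
print(demo)
```

Genre is left out on purpose: real files often store it as a nested or encoded list that needs its own parsing step.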
What is a Hollywood Dataset?
A Hollywood film dataset contains information about movies such as title, genre, budget, revenue, rating, release year, director, and cast.
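Step 1: Load the Dataset
import pandas as pd
# Load the sample CSV (expected in the current working directory)
df = pd.read_csv("hollywood_movies_sample.csv")
# Preview the first five rows
print(df.head())
# Column types and non-null counts; info() prints its own report
df.info()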
Step 2: Explore the Dataset
# Basic statistics
print(df.describe())
# Check missing values
print(df.isnull().sum())
# List all genres
print(df["genre"].unique())
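Once `isnull().sum()` reports gaps, the next step is deciding what to do with them. A minimal sketch, using a tiny made-up frame with one missing budget and one missing rating; dropping rows that lack a money column and filling ratings with the median are illustrative choices, not the lesson's prescribed method.

```python
import numpy as np
import pandas as pd

# Invented rows with deliberate gaps (NaN) for demonstration.
movies = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "budget_million": [10.0, np.nan, 30.0, 40.0],
        "revenue_million": [25.0, 50.0, 20.0, 90.0],
        "rating": [7.0, 6.5, np.nan, 8.0],
    }
)

# Count missing values per column.
print(movies.isnull().sum())

# Profit cannot be computed without both money columns, so drop those rows.
cleaned = movies.dropna(subset=["budget_million", "revenue_million"]).copy()

# Fill missing ratings with the median rating of the remaining rows.
cleaned["rating"] = cleaned["rating"].fillna(cleaned["rating"].median())
print(cleaned)
```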
Step 3: Add Profit and Success Columns
We create a new column called profit_million. Then we create a success label:
- 1 = Successful film
- 0 = Not successful film
df["profit_million"] = df["revenue_million"] - df["budget_million"]
# 1 if the film earned more than it cost, otherwise 0
df["success"] = (df["profit_million"] > 0).astype(int)
print(df[["title", "profit_million", "success"]])
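As a worked check of the arithmetic, the preview rows give 520 - 120 = 400 for Sky Warriors and 65 - 20 = 45 for Silent Tears. The sketch below recomputes profit and the success label for all four preview rows; the frame is typed in by hand so the block runs without the CSV, and the name `preview` is a placeholder.

```python
import pandas as pd

# Budget and revenue copied from the sample dataset preview.
preview = pd.DataFrame(
    {
        "title": ["Sky Warriors", "Love in Paris", "Dark Planet", "Silent Tears"],
        "budget_million": [120, 35, 180, 20],
        "revenue_million": [520, 150, 760, 65],
    }
)

# Profit = revenue - budget, both in millions of dollars.
preview["profit_million"] = preview["revenue_million"] - preview["budget_million"]

# Label: 1 when profit is positive, 0 otherwise.
preview["success"] = (preview["profit_million"] > 0).astype(int)

print(preview[["title", "profit_million", "success"]])
```

All four preview rows come out profitable, so `success` is 1 throughout. The model in Step 6 needs films from both classes, which means the full dataset must also contain loss-making films.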
Step 4: Visualize Budget vs Revenue
import matplotlib.pyplot as plt
plt.scatter(df["budget_million"], df["revenue_million"])
plt.xlabel("Budget in Million Dollars")
plt.ylabel("Revenue in Million Dollars")
plt.title("Budget vs Revenue")
plt.show()
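A break-even reference line makes the scatter plot easier to read: points above the line earned more than they cost. This is an optional sketch, not part of the lesson code. It uses the four preview rows, saves to a file instead of opening a window, and the output filename is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script also runs without a display
import matplotlib.pyplot as plt

# Budget and revenue (in millions) from the sample dataset preview.
budget = [120, 35, 180, 20]
revenue = [520, 150, 760, 65]

fig, ax = plt.subplots()
ax.scatter(budget, revenue, label="Films")

# Break-even line: revenue equal to budget.
limit = max(budget)
ax.plot([0, limit], [0, limit], linestyle="--", label="Revenue = Budget")

ax.set_xlabel("Budget in Million Dollars")
ax.set_ylabel("Revenue in Million Dollars")
ax.set_title("Budget vs Revenue")
ax.legend()
fig.savefig("budget_vs_revenue.png")
```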
Practice Python Online
Use the embedded Python editor below to run the Hollywood dataset code directly.
Step 5: Genre-Based Analysis
genre_profit = df.groupby("genre")["profit_million"].mean()
print(genre_profit)
genre_profit.plot(kind="bar")
plt.xlabel("Genre")
plt.ylabel("Average Profit")
plt.title("Average Profit by Genre")
plt.show()
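An average can mislead when a genre has only one or two films. A minimal sketch that reports the film count next to the mean profit, using a small made-up frame (the numbers are illustrative only, and the column names `n_films` and `mean_profit` are chosen here):

```python
import pandas as pd

# Invented profit figures, in millions of dollars.
films = pd.DataFrame(
    {
        "genre": ["Action", "Action", "Drama", "Drama", "Drama", "Sci-Fi"],
        "profit_million": [400, 100, 45, -5, 20, 580],
    }
)

# Named aggregation: one output column per (name=function) pair.
summary = (
    films.groupby("genre")["profit_million"]
    .agg(n_films="count", mean_profit="mean")
    .sort_values("mean_profit", ascending=False)
)
print(summary)
```

Here the top genre rests on a single film, which is exactly the kind of result the count column helps to flag.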
Step 6: Build a Simple ML Model
Goal: Predict whether a film will be successful using budget and rating.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X = df[["budget_million", "rating"]]
y = df["success"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression()
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
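Two practical points are easy to miss in this step: `LogisticRegression` can only be fitted if `y` contains both classes, and it tends to converge more reliably when the features are on a similar scale. The sketch below illustrates both with synthetic data generated on the spot (random numbers and an invented revenue rule, not real films). The `StandardScaler` and the stratified split are suggestions added here, not part of the lesson code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic films: random budgets and ratings.
synthetic = pd.DataFrame(
    {
        "budget_million": rng.uniform(5, 200, n),
        "rating": rng.uniform(3, 9, n),
    }
)

# Invented rule: higher-rated films earn more per budget dollar, plus noise.
revenue = synthetic["budget_million"] * (synthetic["rating"] / 6) + rng.normal(0, 10, n)
synthetic["success"] = (revenue > synthetic["budget_million"]).astype(int)

X = synthetic[["budget_million", "rating"]]
y = synthetic["success"]

# Fail early with a clear message if only one class is present.
if y.nunique() < 2:
    raise ValueError("Need both successful and unsuccessful films to train.")

# stratify=y keeps the hit/flop ratio similar in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
```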
Student Assignments
Beginner
- Find the top 5 highest revenue films.
- Find the average rating of all films.
- Find the total number of genres.
Intermediate
- Create a profit column.
- Find the most profitable genre.
- Plot budget vs revenue.
Advanced
- Build a hit/flop prediction model.
- Try using genre as an input feature.
- Create a Streamlit app for prediction.
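For the advanced task of using genre as an input feature, the text labels must first be turned into numbers. One possible starting point is one-hot encoding with `pd.get_dummies`, sketched here on the four preview rows; it is a hint, not a full solution.

```python
import pandas as pd

# Genre, budget and rating copied from the sample dataset preview.
films = pd.DataFrame(
    {
        "genre": ["Action", "Romance", "Sci-Fi", "Drama"],
        "budget_million": [120, 35, 180, 20],
        "rating": [8.1, 7.2, 8.5, 7.8],
    }
)

# One 0/1 column per genre; the prefix shows where each column came from.
X = pd.get_dummies(films, columns=["genre"], prefix="genre", dtype=int)
print(X.columns.tolist())
print(X)
```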
MCQs
1. Profit is calculated as:
a) Budget + Revenue
b) Revenue - Budget ✓
c) Rating - Budget
d) Year + Revenue
2. Which library is used for data analysis?
a) Pandas ✓
b) Flask
c) HTML
d) CSS
3. What does success = 1 mean?
a) Film failed
b) Film has no rating
c) Film is profitable ✓
d) Film has no budget
Final Project
Create a Hollywood Hit Movie Predictor.
- Input: Budget, rating, genre
- Output: Hit or Flop
- Tools: Python, Pandas, Scikit-learn, Streamlit
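The core of such a predictor, without the Streamlit front end, might look like the sketch below. It is one possible design under stated assumptions: the first four training rows mirror the preview table, the four flops are invented so that both classes are present, and the names `predictor` and `predict_hit` are hypothetical. Eight rows are far too few for a meaningful model, so treat this as a skeleton only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny training set: four preview rows (hits) plus four invented flops.
train = pd.DataFrame(
    {
        "budget_million": [120, 35, 180, 20, 90, 60, 150, 25],
        "rating": [8.1, 7.2, 8.5, 7.8, 4.9, 5.3, 4.2, 5.0],
        "genre": ["Action", "Romance", "Sci-Fi", "Drama",
                  "Action", "Romance", "Sci-Fi", "Drama"],
        "success": [1, 1, 1, 1, 0, 0, 0, 0],
    }
)
features = ["budget_million", "rating", "genre"]

# Scale the numeric inputs and one-hot encode the genre.
# handle_unknown="ignore" lets the model accept genres it never saw.
preprocess = ColumnTransformer(
    [
        ("numeric", StandardScaler(), ["budget_million", "rating"]),
        ("genre", OneHotEncoder(handle_unknown="ignore"), ["genre"]),
    ]
)
predictor = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
predictor.fit(train[features], train["success"])

def predict_hit(budget_million: float, rating: float, genre: str) -> str:
    """Return "Hit" or "Flop" for a single film."""
    row = pd.DataFrame(
        [{"budget_million": budget_million, "rating": rating, "genre": genre}]
    )
    return "Hit" if predictor.predict(row)[0] == 1 else "Flop"

print(predict_hit(100, 8.0, "Action"))
```

In a Streamlit app, the three arguments would come from input widgets and the returned string would be shown to the user.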