This time we'll try to analyze the famous TMDB 5000 movies dataset available on Kaggle. This dataset is bit disordered now, don't know why...nonetheless, I have cleaned it usinng Excel and put it here
This contains few of the columns of original dataset...but it works
Data is downloaded from : Kaggle TMDB
mv = read.csv("C:/Users/chirag rankja/Desktop/Dataset_mov/movies.csv",
stringsAsFactors = F,strip.white = T,na.strings = "NA")
mv = subset(mv,mv$production!="")
We will try to answer following Questions !
require(dplyr)
require(lubridate)
require(tidyr)
mv$release_date = as.Date.character(x = mv$release_date,format = "%d-%m-%y")
mv$year = lubridate::year(mv$release_date)
head(mv)
options(scipen = 999,digits = 2)
cat("Avg budget is:",mean(mv$budget)/1000000,"millions")
cat("Avg revenue is:",mean(mv$revenue)/1000000,"millions")
cat("Avg ROI is:",(mean(mv$revenue)/mean(mv$budget)))
mv$month <- month(mv$release_date,label = T)
mv$day <- weekdays(mv$release_date,abbreviate = T)
table(mv$day)
table(mv$month)
table(mv$day,mv$month)
mv_earnings<-mv %>% select(genre,budget,revenue) %>%
group_by(genre) %>%
summarise(freq=n(),
avg_invst=mean(budget)/1000000,
avg_roi=(mean(revenue)/mean(budget))) %>%
data.frame() %>% arrange(-freq)
mv_earnings %>% filter (freq>20) %>% arrange(-avg_invst)
Let's find out top five performers across each genre by revenue
mv %>% select(title,genre,budget,revenue,year) %>%
group_by(genre) %>%
filter(genre!="",revenue>900000000) %>% top_n(n = 5,wt=revenue) %>%
mutate(roi=revenue/budget) %>%
arrange(-roi)
mv_top_100<-mv %>% select(budget,revenue,title,genre,production,day,month,year) %>%
arrange(-revenue) %>%
head(n=100) %>%
mutate(roi=revenue/budget)
head(mv_top_100)
table(mv_top_100$genre)
mv_top_earners<-inner_join(mv_top_100,mv_earnings,"genre") %>%
select(title,genre,roi,avg_roi,avg_invst,budget,month,production,day,year) %>%
mutate(diff_roi=roi/avg_roi,invst=budget/10000000) %>% select(-budget) %>%
mutate(diff_invst=invst/avg_invst) %>%
select(title,genre,invst,avg_invst,diff_invst,roi,avg_roi,diff_roi,production,month,day,year) %>%
arrange(-diff_roi)
head(mv_top_earners)
table(mv_top_earners$day)
table(mv_top_earners$month)
mv_top_studios <- data.frame(prod=as.character(na.omit(mv$production)),
bud=as.numeric(mv$budget),
rev=as.numeric(mv$revenue),
month=as.character(mv$month),
day=as.character(mv$day)) %>%
group_by(prod) %>%
summarise(tot_bud=sum(bud),tot_rev=sum(rev),roi=tot_rev/tot_bud,n=n()) %>%
top_n(n = 20,wt=n) %>%
mutate(avg_bud=tot_bud/n,avg_rev=tot_rev/n) %>%
arrange(-avg_rev)
mv_top_studios
table(mv$production=="Walt Disney Pictures",mv$genre)
table(mv$production=="Village Roadshow Pictures",mv$genre)
table(mv$production=="Paramount Pictures",mv$genre)
table(mv$production=="Columbia Pictures",mv$genre)
table(mv$production=="Universal Pictures",mv$genre)
mv %>% select(year,month,day,budget,revenue) %>% group_by(month) %>%
summarise(n=n(),roi=mean(revenue)/mean(budget),ev=n*roi) %>% arrange(-ev)
mv %>% select(year,month,day,budget,revenue) %>% group_by(day) %>%
summarise(n=n(),roi=mean(revenue)/mean(budget),ev=n*roi) %>% arrange(-ev)
Well, the findings are ...
Avg budget to make a Hollywood film is 31 millions Avg revenue one would get is 89 millions This means that, each dollar one would invest, could fetch 2.8 dollars
Most of the films are released on Friday, second highest rank goes to Thursday and third place is held by Wednesday Expected value for earnings are higher for Wed, Thur and Friday, comared to other days of week; same goes true for Dec, Sept and June month
The most profitable genre is Documentary, each dollar invested could fetch 4.5 dollars
Top genres by numbers of films released are ... Drama, Comedy, Action, Adventure and Horror
Eventhough the ROI for these genres would be 2.5 to 3.8, there are more numbers of films being made for them; this could mean that, these films are easy to make in terms of getting the starcast ready to act, getting a suitable story to getting producers on board
Animation, Adventure and Science fiction movies are costlier to make; reasons could be the technology invovlved to create the theme required to execute the story
Top profitable films are : Minions, Despicable me 2, The lord of the ring and Avatar; each of them have earned 10 dollar for each dollar invested
If we compare top 100 films by revenue, more than 50% are from genres Action and Advanture If we compare the top 100 films by revenue and find out which films have broken the record of avg ROI of their respective genres, the list would contain : E.T. , Starwars, The Sixth sense and Deadpool. Each of them have made more than double per dollar compared what their fellow films made in thier respective genre
In the list of top 100 films, most of them were released on Wednesday or Tuesday and most of them were released on May and June
Films made by United Artists, Fox Searchlight and Lionsgate have earned more per dollar than others
Paramount, Universal and Columbia Pictures have made most number of films
Walt disney & Village roadshow pump more money by avg in each films as compared to others
So if you're an agressive investor, invest in films made by Walt Disney or by Village Roadshow; and if you're a passive investor, invest in films made by UA, FSL or LG
Walt disney usually makes Adventure, Animation or Drama films Village roadshow usually makes Action, Comedy and Drama films Paramount,Columbia and Universal studios usually make Action, Comedy and Drama films