This time we'll try to analyze the famous TMDB 5000 movies dataset available on Kaggle. This dataset is bit disordered now, don't know why...nonetheless, I have cleaned it usinng Excel and put it here

This contains few of the columns of original dataset...but it works

Data is downloaded from : Kaggle TMDB

In [ ]:
mv = read.csv("C:/Users/chirag rankja/Desktop/Dataset_mov/movies.csv",
              stringsAsFactors = F,strip.white = T,na.strings = "NA")
In [2]:
mv = subset(mv,mv$production!="")

We will try to answer following Questions !

  1. Top Genre by Number of Movies
  2. Top Genre by Avg ROI and total Revenue
  3. Top Movies by ROI
  4. Top Production by avg ROI
  5. Top Movies released on what Day and which month
  6. Most number of movies released on which day
In [3]:
require(dplyr)
require(lubridate)
require(tidyr)
Loading required package: dplyr

Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Loading required package: lubridate

Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date

Loading required package: tidyr
Warning message:
"package 'tidyr' was built under R version 3.3.3"
In [4]:
mv$release_date = as.Date.character(x = mv$release_date,format = "%d-%m-%y")
In [5]:
mv$year = lubridate::year(mv$release_date)
In [6]:
head(mv)
budgetgenrekeywordoriginal_languageproductionrelease_datecountryruntimerevenuestatustitlevote_averagevote_countyear
237000000 Action culture clash en Ingenious Film Partners 2009-12-10 US 162 2787965087 Released Avatar 7.2 11800 2009
300000000 Adventure ocean en Walt Disney Pictures 2007-05-19 US 169 961000000 Released Pirates of the Caribbean: At World's End6.9 4500 2007
245000000 Action spy en Columbia Pictures 2015-10-26 GB 148 880674609 Released Spectre 6.3 4466 2015
250000000 Action dc comics en Legendary Pictures 2012-07-16 US 165 1084939099 Released The Dark Knight Rises 7.6 9106 2012
260000000 Action based on novel en Walt Disney Pictures 2012-03-07 US 132 284139100 Released John Carter 6.1 2124 2012
258000000 Fantasy dual identity en Columbia Pictures 2007-05-01 US 139 890871626 Released Spider-Man 3 5.9 3576 2007
In [7]:
options(scipen = 999,digits = 2)
cat("Avg budget is:",mean(mv$budget)/1000000,"millions")
Avg budget is: 31 millions
In [8]:
cat("Avg revenue is:",mean(mv$revenue)/1000000,"millions")
Avg revenue is: 89 millions
In [9]:
cat("Avg ROI is:",(mean(mv$revenue)/mean(mv$budget)))
Avg ROI is: 2.8
In [10]:
mv$month <- month(mv$release_date,label = T)
mv$day <- weekdays(mv$release_date,abbreviate = T)
table(mv$day)
table(mv$month)
table(mv$day,mv$month)
 Fri  Mon  Sat  Sun  Thu  Tue  Wed 
1904  211  209  191  861  331  745 
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
314 304 345 318 340 360 343 390 549 437 311 441 
     
      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
  Fri 124 155 152 139 128 153 144 193 231 211 112 162
  Mon  15  14  15   9  16  15  17  18  37  21  14  20
  Sat  24   8   8  15  15  13  12   9  52  20  11  22
  Sun  16  12  16  14  14  14   9  15  35  16  11  19
  Thu  63  61  79  69  58  72  71  66  91  86  52  93
  Tue  35  23  26  12  29  34  20  20  37  27  32  36
  Wed  37  31  49  60  80  59  70  69  66  56  79  89
In [11]:
mv_earnings<-mv %>% select(genre,budget,revenue) %>%
group_by(genre) %>% 
summarise(freq=n(),
            avg_invst=mean(budget)/1000000,
            avg_roi=(mean(revenue)/mean(budget))) %>%
  data.frame() %>% arrange(-freq)
In [12]:
mv_earnings %>% filter (freq>20) %>% arrange(-avg_invst)
genrefreqavg_invstavg_roi
Animation 121 69 3.6
Adventure 334 66 3.2
Fantasy 115 53 2.8
Science Fiction 95 50 3.4
Action 727 49 2.6
Family 52 46 3.7
War 22 40 1.8
Thriller 180 27 2.4
History 24 25 3.0
Western 27 23 2.1
Mystery 40 23 3.1
Crime 188 22 2.2
Romance 100 22 3.2
Comedy 958 20 2.8
Drama 1095 20 2.5
Music 34 16 2.0
Horror 279 12 3.8
Documentary 54 3 4.5

Let's find out top five performers across each genre by revenue

In [13]:
mv %>% select(title,genre,budget,revenue,year) %>%
group_by(genre) %>% 
filter(genre!="",revenue>900000000) %>% top_n(n = 5,wt=revenue) %>%
mutate(roi=revenue/budget) %>%
arrange(-roi)
titlegenrebudgetrevenueyearroi
Minions Family 74000000 1156730962 2015 15.6
Despicable Me 2 Animation 76000000 970761885 2013 12.8
The Lord of the Rings: The Return of the KingAdventure 94000000 1118888979 2003 11.9
Avatar Action 237000000 2787965087 2009 11.8
Jurassic World Action 150000000 1513528810 2015 10.1
Finding Nemo Animation 94000000 940335536 2003 10.0
Titanic Drama 200000000 1845034188 1997 9.2
Frozen Animation 150000000 1274219009 2013 8.5
Furious 7 Action 190000000 1506249360 2015 7.9
The Avengers Science Fiction 220000000 1519557910 2012 6.9
Iron Man 3 Action 200000000 1215439994 2013 6.1
The Jungle Book Family 175000000 966550600 2016 5.5
The Dark Knight Drama 185000000 1004558444 2008 5.4
Toy Story 3 Animation 200000000 1066969703 2010 5.3
Pirates of the Caribbean: Dead Man's Chest Adventure 200000000 1065659812 2006 5.3
Transformers: Age of Extinction Science Fiction 210000000 1091405097 2014 5.2
Alice in Wonderland Family 200000000 1025491110 2010 5.1
Avengers: Age of Ultron Action 280000000 1405403694 2015 5.0
Captain America: Civil War Adventure 250000000 1153304495 2016 4.6
The Hobbit: An Unexpected Journey Adventure 250000000 1021103568 2012 4.1
Pirates of the Caribbean: On Stranger Tides Adventure 380000000 1045713802 2011 2.8
In [14]:
mv_top_100<-mv %>% select(budget,revenue,title,genre,production,day,month,year) %>% 
  arrange(-revenue) %>% 
  head(n=100) %>% 
  mutate(roi=revenue/budget)
head(mv_top_100)
table(mv_top_100$genre)
budgetrevenuetitlegenreproductiondaymonthyearroi
237000000 2787965087 Avatar Action Ingenious Film PartnersThu Dec 2009 11.8
200000000 1845034188 Titanic Drama Paramount Pictures Tue Nov 1997 9.2
220000000 1519557910 The Avengers Science Fiction Paramount Pictures Wed Apr 2012 6.9
150000000 1513528810 Jurassic World Action Universal Studios Tue Jun 2015 10.1
190000000 1506249360 Furious 7 Action Universal Pictures Wed Apr 2015 7.9
280000000 1405403694 Avengers: Age of UltronAction Marvel Studios Wed Apr 2015 5.0
         Action       Adventure       Animation          Comedy           Drama 
             30              32              12               2               4 
         Family         Fantasy         Mystery Science Fiction        Thriller 
              4               5               1               9               1 
In [15]:
mv_top_earners<-inner_join(mv_top_100,mv_earnings,"genre") %>% 
  select(title,genre,roi,avg_roi,avg_invst,budget,month,production,day,year) %>%
  mutate(diff_roi=roi/avg_roi,invst=budget/10000000) %>% select(-budget) %>%
  mutate(diff_invst=invst/avg_invst) %>%
  select(title,genre,invst,avg_invst,diff_invst,roi,avg_roi,diff_roi,production,month,day,year) %>% 
arrange(-diff_roi)
head(mv_top_earners)
titlegenreinvstavg_invstdiff_invstroiavg_roidiff_roiproductionmonthdayyear
E.T. the Extra-Terrestrial Science Fiction 1.0 50 0.021 76 3.4 22.1 Universal Pictures Apr Sat 1982
Star Wars Adventure 1.1 66 0.017 70 3.2 21.9 Lucasfilm May Wed 1977
The Sixth Sense Mystery 4.0 23 0.174 17 3.1 5.4 Spyglass Entertainment Aug Fri 1999
Deadpool Action 5.8 49 0.118 14 2.6 5.3 Twentieth Century Fox Film CorporationFeb Tue 2016
The Lion King Family 4.5 46 0.098 18 3.7 4.7 Walt Disney Pictures Jun Thu 1994
Avatar Action 23.7 49 0.481 12 2.6 4.6 Ingenious Film Partners Dec Thu 2009
In [16]:
table(mv_top_earners$day)
table(mv_top_earners$month)
Fri Mon Sat Sun Thu Tue Wed 
 10   7   6   2  14  18  43 
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
  0   1   6   9  21  25   9   2   2   5  12   8 
In [17]:
mv_top_studios <- data.frame(prod=as.character(na.omit(mv$production)),
                         bud=as.numeric(mv$budget),
                         rev=as.numeric(mv$revenue),
                         month=as.character(mv$month),
                         day=as.character(mv$day)) %>%
    group_by(prod) %>% 
    summarise(tot_bud=sum(bud),tot_rev=sum(rev),roi=tot_rev/tot_bud,n=n()) %>%
    top_n(n = 20,wt=n) %>%
mutate(avg_bud=tot_bud/n,avg_rev=tot_rev/n) %>%
arrange(-avg_rev)

mv_top_studios
prodtot_budtot_revroinavg_budavg_rev
Walt Disney Pictures 9583280000 28683256048 3.0 114 84063860 251607509
Village Roadshow Pictures 4887800003 11824816158 2.4 71 68842254 166546706
DreamWorks SKG 3392500000 10522249688 3.1 64 53007813 164410151
Columbia Pictures 10979274736 28227694704 2.6 200 54896374 141138474
Paramount Pictures 13030613007 38894789355 3.0 281 46372288 138415620
Summit Entertainment 1894260000 5996550097 3.2 44 43051364 136285229
Universal Pictures 10808928652 34958577758 3.2 260 41572803 134456068
Twentieth Century Fox Film Corporation 7168317800 23457494077 3.3 177 40498971 132528215
Warner Bros. 2361793946 7638861582 3.2 65 36335291 117520947
Lionsgate 847600000 3624870834 4.3 32 26487500 113277214
Columbia Pictures Corporation 2381775000 5771267094 2.4 65 36642692 88788725
Regency Enterprises 1260000000 3095816392 2.5 35 36000000 88451897
New Line Cinema 5034485000 13575797705 2.7 157 32066783 86470049
TriStar Pictures 1358600000 3159987293 2.3 40 33965000 78999682
Touchstone Pictures 936400000 2536759423 2.7 38 24642105 66756827
The Weinstein Company 724860000 2015360702 2.8 34 21319412 59275315
United Artists 923230012 4009229329 4.3 72 12822639 55683741
Miramax Films 1651082538 4699628827 2.8 87 18977960 54018722
Fox Searchlight Pictures 630600159 2651812448 4.2 63 10009526 42092261
Metro-Goldwyn-Mayer (MGM) 638470368 827463643 1.3 44 14510690 18805992
In [18]:
table(mv$production=="Walt Disney Pictures",mv$genre)
       
             Action Adventure Animation Comedy Crime Documentary Drama Family
  FALSE    3    718       308        95    937   188          54  1084     46
  TRUE     0      9        26        26     21     0           0    11      6
       
        Fantasy Foreign History Horror Music Mystery Romance Science Fiction
  FALSE     103       1      24    279    34      40      98              95
  TRUE       12       0       0      0     0       0       2               0
       
        Thriller TV Movie  War Western
  FALSE      179        3   22      27
  TRUE         1        0    0       0
In [19]:
table(mv$production=="Village Roadshow Pictures",mv$genre)
       
             Action Adventure Animation Comedy Crime Documentary Drama Family
  FALSE    3    706       329       119    947   186          54  1083     51
  TRUE     0     21         5         2     11     2           0    12      1
       
        Fantasy Foreign History Horror Music Mystery Romance Science Fiction
  FALSE     115       1      24    275    34      39      98              93
  TRUE        0       0       0      4     0       1       2               2
       
        Thriller TV Movie  War Western
  FALSE      173        3   21      27
  TRUE         7        0    1       0
In [20]:
table(mv$production=="Paramount Pictures",mv$genre)
       
             Action Adventure Animation Comedy Crime Documentary Drama Family
  FALSE    3    672       310       113    897   175          52  1043     51
  TRUE     0     55        24         8     61    13           2    52      1
       
        Fantasy Foreign History Horror Music Mystery Romance Science Fiction
  FALSE     108       1      23    261    31      38      97              78
  TRUE        7       0       1     18     3       2       3              17
       
        Thriller TV Movie  War Western
  FALSE      171        3   19      25
  TRUE         9        0    3       2
In [21]:
table(mv$production=="Columbia Pictures",mv$genre)
table(mv$production=="Universal Pictures",mv$genre)
       
             Action Adventure Animation Comedy Crime Documentary Drama Family
  FALSE    3    679       324       113    912   178          53  1052     48
  TRUE     0     48        10         8     46    10           1    43      4
       
        Fantasy Foreign History Horror Music Mystery Romance Science Fiction
  FALSE     108       1      23    276    33      39      95              94
  TRUE        7       0       1      3     1       1       5               1
       
        Thriller TV Movie  War Western
  FALSE      171        3   20      27
  TRUE         9        0    2       0
       
             Action Adventure Animation Comedy Crime Documentary Drama Family
  FALSE    3    679       307       117    891   179          54  1048     51
  TRUE     0     48        27         4     67     9           0    47      1
       
        Fantasy Foreign History Horror Music Mystery Romance Science Fiction
  FALSE     104       1      23    260    33      37      93              88
  TRUE       11       0       1     19     1       3       7               7
       
        Thriller TV Movie  War Western
  FALSE      174        3   21      26
  TRUE         6        0    1       1
In [22]:
mv %>% select(year,month,day,budget,revenue) %>% group_by(month) %>%
summarise(n=n(),roi=mean(revenue)/mean(budget),ev=n*roi) %>% arrange(-ev)
monthnroiev
Dec 441 3.1 1354
Sep 549 2.3 1280
Jun 360 3.5 1252
May 340 3.4 1142
Oct 437 2.5 1082
Nov 311 3.2 989
Jul 343 2.9 982
Aug 390 2.4 919
Apr 318 2.8 876
Mar 345 2.5 860
Feb 304 2.3 688
Jan 314 2.0 631
In [23]:
mv %>% select(year,month,day,budget,revenue) %>% group_by(day) %>%
summarise(n=n(),roi=mean(revenue)/mean(budget),ev=n*roi) %>% arrange(-ev)
daynroiev
Fri 19042.3 4402
Thu 8612.9 2480
Wed 7453.2 2389
Tue 3313.3 1095
Mon 2113.5 747
Sat 2093.0 634
Sun 1912.8 533

Well, the findings are ...

Budget and release dates

Avg budget to make a Hollywood film is 31 millions Avg revenue one would get is 89 millions This means that, each dollar one would invest, could fetch 2.8 dollars

Most of the films are released on Friday, second highest rank goes to Thursday and third place is held by Wednesday Expected value for earnings are higher for Wed, Thur and Friday, comared to other days of week; same goes true for Dec, Sept and June month

Top 100 Films by Revenue

The most profitable genre is Documentary, each dollar invested could fetch 4.5 dollars

Top genres by numbers of films released are ... Drama, Comedy, Action, Adventure and Horror

Eventhough the ROI for these genres would be 2.5 to 3.8, there are more numbers of films being made for them; this could mean that, these films are easy to make in terms of getting the starcast ready to act, getting a suitable story to getting producers on board

Animation, Adventure and Science fiction movies are costlier to make; reasons could be the technology invovlved to create the theme required to execute the story

Top profitable films are : Minions, Despicable me 2, The lord of the ring and Avatar; each of them have earned 10 dollar for each dollar invested

If we compare top 100 films by revenue, more than 50% are from genres Action and Advanture If we compare the top 100 films by revenue and find out which films have broken the record of avg ROI of their respective genres, the list would contain : E.T. , Starwars, The Sixth sense and Deadpool. Each of them have made more than double per dollar compared what their fellow films made in thier respective genre

In the list of top 100 films, most of them were released on Wednesday or Tuesday and most of them were released on May and June

Top studios

Films made by United Artists, Fox Searchlight and Lionsgate have earned more per dollar than others

Paramount, Universal and Columbia Pictures have made most number of films

Walt disney & Village roadshow pump more money by avg in each films as compared to others

So if you're an agressive investor, invest in films made by Walt Disney or by Village Roadshow; and if you're a passive investor, invest in films made by UA, FSL or LG

Walt disney usually makes Adventure, Animation or Drama films Village roadshow usually makes Action, Comedy and Drama films Paramount,Columbia and Universal studios usually make Action, Comedy and Drama films