Coding Session: Investigating guest stars in The Office series

Created

Jan 4, 2023 6:13 AM

Published

November 10, 2022

Author

Mohammad Reza Nabizadeh


The Office is my favorite series, and I can’t resist anything related to it, least of all a chance to do some data analysis on it. I looked into the guest stars who appeared in the series and how their appearances relate to the viewership of individual episodes. This project is part of the DataCamp skill track for Data Science. Let’s look at the project statement first.

Investigating Netflix Movies

The Office! What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.

In this notebook, we will look at a dataset of The Office episodes and try to understand how the popularity and quality of the series varied over time. To do so, we will use the following dataset: datasets/office_episodes.csv, which was downloaded from Kaggle.

This dataset contains information on a variety of characteristics of each episode. In detail, these are:

datasets/office_episodes.csv

episode_number: Canonical episode number.
season: Season in which the episode appeared.
episode_title: Title of the episode.
description: Description of the episode.
ratings: Average IMDB rating.
votes: Number of votes.
viewership_mil: Number of US viewers in millions.
duration: Duration in number of minutes.
release_date: Airdate.
guest_stars: Guest stars in the episode (if any).
director: Director of the episode.
writers: Writers of the episode.
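One thing to watch for: the column names listed in the project statement (ratings, viewership_mil, guest_stars) differ from the ones the solution code reads (Ratings, Viewership, GuestStars). If your copy of the file uses the other naming style, a rename along these lines may help. This is only a sketch on a made-up one-row frame; the mapping is an assumption covering the three columns the plot needs.

```python
import pandas as pd

# Hypothetical one-row frame using the column names that the solution code reads.
kaggle_style = pd.DataFrame({
    "Ratings": [8.0],
    "Viewership": [10.0],
    "GuestStars": ["Guest A"],
})

# Assumed mapping between the two naming styles; only the three columns
# used by the plot are covered here.
column_map = {
    "Ratings": "ratings",
    "Viewership": "viewership_mil",
    "GuestStars": "guest_stars",
}

datacamp_style = kaggle_style.rename(columns=column_map)
print(list(datacamp_style.columns))
```

Swapping the keys and values of column_map would convert in the other direction.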

Data visualization is often a great way to explore your data and uncover insights. In this notebook, you will initiate this process by creating an informative plot of the episode data provided to you. In doing so, you're going to work on several different variables, including the episode number, the viewership, the fan rating, and guest appearances. Here are the requirements needed to pass this project:

Create a matplotlib scatter plot of the data that contains the following attributes:

Each episode's episode number is plotted along the x-axis
Each episode's viewership (in millions) plotted along the y-axis
A color scheme reflecting the scaled ratings (not the regular ratings) of each episode, such that:

Ratings < 0.25 are colored "red"
Ratings >= 0.25 and < 0.50 are colored "orange"
Ratings >= 0.50 and < 0.75 are colored "lightgreen"
Ratings >= 0.75 are colored "darkgreen"

A sizing system, such that episodes with guest appearances have a marker size of 250 and episodes without are sized 25
A title, reading "Popularity, Quality, and Guest Appearances on the Office"
An x-axis label reading "Episode Number"
A y-axis label reading "Viewership (Millions)"
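To make the color and size rules above concrete, here is a minimal sketch on toy data (the ratings and guest names are made up, not taken from the dataset). It uses min-max scaling for the scaled rating, then pd.cut and np.where instead of explicit loops; this is one possible approach, not the one the project prescribes.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real episodes; values are made up for illustration.
episodes = pd.DataFrame({
    "ratings": [7.0, 7.5, 8.2, 8.8, 9.0],
    "guest_stars": [None, "Guest A", None, "Guest B, Guest C", None],
})

# Min-max scaling maps the lowest rating to 0 and the highest to 1.
r = episodes["ratings"]
episodes["scaled_ratings"] = (r - r.min()) / (r.max() - r.min())

# right=False makes each bin closed on the left: [0, 0.25), [0.25, 0.5), ...
# The last edge is infinity so that a scaled rating of exactly 1.0 is included.
episodes["color"] = pd.cut(
    episodes["scaled_ratings"],
    bins=[0, 0.25, 0.5, 0.75, np.inf],
    labels=["red", "orange", "lightgreen", "darkgreen"],
    right=False,
).astype(str)

# 250 for episodes with a guest appearance, 25 otherwise.
episodes["size"] = np.where(episodes["guest_stars"].notna(), 250, 25)

print(episodes[["scaled_ratings", "color", "size"]])
```

The color and size columns can then be passed to the c and s arguments of plt.scatter.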

Provide the name of one of the guest stars (hint, there were multiple!) who was in the most watched Office episode. Save it as a string in the variable top_star (e.g. top_star = "Will Ferrell").
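Finding the guest star comes down to locating the row with the highest viewership and picking one name out of its guest list. A small sketch on toy data follows; the viewership numbers and names are placeholders, and it assumes that several guest stars are stored as a single comma-separated string.

```python
import pandas as pd

# Toy data; viewership numbers and names are placeholders, not real values.
episodes = pd.DataFrame({
    "viewership_mil": [7.1, 22.9, 9.4],
    "guest_stars": [None, "Guest A, Guest B, Guest C", "Guest D"],
})

# idxmax returns the row label of the largest viewership value.
most_watched = episodes["viewership_mil"].idxmax()

# Assumption: several guest stars are stored as one comma-separated string.
stars = episodes.loc[most_watched, "guest_stars"]
top_star = stars.split(",")[0].strip()
print(top_star)
```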

Solution

import pandas as pd
import matplotlib.pyplot as plt

#import the file and define the scaled rating
theoffice_raw=pd.read_csv("~/the_office_series.csv")
#missing values (e.g. episodes without guest stars) become 0
theoffice_db=theoffice_raw.fillna(0)
max_score=theoffice_db['Ratings'].max()
min_score=theoffice_db['Ratings'].min()
#min-max scaling: the lowest rating maps to 0, the highest to 1
theoffice_db['ScaledRatings']=((theoffice_db['Ratings']-min_score)/
                               (max_score-min_score))
print(theoffice_db.head())

#define the marker colors
colors = []
for lab,row in theoffice_db.iterrows():
    if row['ScaledRatings']<0.25:
        colors.append("red")
    elif row['ScaledRatings']>=0.25 and row['ScaledRatings']<0.5:
        colors.append("orange")
    elif row['ScaledRatings']>=0.5 and row['ScaledRatings']<0.75:
        colors.append("lightgreen")
    else:
        colors.append("darkgreen")

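#define the marker size
size = []
for lab,row in theoffice_db.iterrows():
    #GuestStars is 0 where the original value was missing
    if row['GuestStars']==0:
        size.append(25)
    else:
        size.append(250)

#guest stars of the most watched episode
top_episode_stars=theoffice_db.loc[theoffice_db['Viewership'].idxmax(),
                                   'GuestStars']
#keep a single name, assuming the names are comma-separated
top_star=str(top_episode_stars).split(',')[0].strip()
print(top_star)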

#create the plot (a single figure with the desired size)
fig = plt.figure(figsize=(12,8))

plt.scatter(
    x = theoffice_db.iloc[:,0],
    y = theoffice_db['Viewership'], 
    s=size,
    c=colors,  
    alpha=0.9, 
    edgecolors="white", 
    linewidth=1
            )

plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.title("Popularity, Quality, and Guest Appearances on the Office")

#display the figure (needed when running as a script)
plt.show()

Summary

This was a basic project that I did as part of a data science course on DataCamp, and I wanted to share my solution with those who are just starting to learn the basics of Python.
