Introduction to Business Problem

In this project we will analyze the issue of high cancellation rates in City & Resort Hotels, aiming to provide valuable insights and recommendations to address this challenge. The focus is on optimizing revenue generation and improving operational efficiency by reducing cancellation rates.

Objective

The primary objective of this analysis is to understand the factors contributing to high cancellation rates in both City Hotel and Resort Hotel. By identifying these factors, we can develop targeted strategies to minimize cancellations, increase revenue, and optimize hotel room utilization.

Dataset Description

hotel = The datasets contains the booking information of two hotel. One of the hotels is a resort hotel and the other is a city hotel.

is_canceled = Value indicating if the booking was canceled (1) or not (0).

lead_time = Number of days that elapsed between the entering date of the booking into the PMS and the arrival date.

arrival_date_year = Year of arrival date

arrival_date_month = Month of arrival date with 12 categories: "January" to "December"

arrival_date_week_number = Week number of the arrival date

arrival_date_day_of_month = Day of the month of the arrival date

stays_in_weekend_nights = Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel

stays_in_week_nights = Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel BO and BL/Calculated by counting the number of week nights

adults = Number of adults

children = Number of children

babies = Number of babies

meal =

  • BB — Bed & Breakfast (Breakfast is included)
  • FB — Full Board (Bed, breakfast, lunch & evening meals are included, but no drinks included)
  • HB — Half Board (Bed, breakfast & evening meals are included but no drinks included)
  • SC — Self Catering (No meals are included; however, your accommodation will be provided with catering facilities for you to cook light meals).

country = Country of origin.

market_segment = Market segment designation. In categories, the term "TA" means "Travel Agents" and "TO" means "Tour Operators"

distribution_channel = Booking distribution channel.

is_repeated_guest = Value indicating if the booking name was from a repeated guest (1) or not (0)

previous_cancellations = Number of previous bookings that were cancelled by the customer prior to the current booking

previous_bookings_not_canceled = Number of previous bookings not cancelled by the customer prior to the current booking

reserved_room_type = Code of room type reserved. Code is presented instead of designation for anonymity reasons

assigned_room_type = Code for the type of room assigned to the booking. Sometimes the assigned room type differs from the reserved room type due to hotel operation reasons (e.g. overbooking) or by customer request. Code is presented instead of designation for anonymity reasons

booking_changes = Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation

deposit_type = No Deposit – no deposit was made; Non Refund – a deposit was made in the value of the total stay cost; Refundable – a deposit was made with a value under the total cost of stay.

agent = ID of the travel agency that made the booking

company = ID of the company/entity that made the booking or responsible for paying the booking. ID is presented instead of designation for anonymity reasons

days_in_waiting_list = Number of days the booking was in the waiting list before it was confirmed to the customer

customer_type = Group – when the booking is associated to a group; Transient – when the booking is not part of a group or contract, and is not associated to other transient booking; Transient-party – when the booking is transient, but is associated to at least other transient booking

adr = Average Daily Rate (Measures the average rental revenue earned for an occupied room per day. The operating performance of a hotel or other lodging business can be determined by using the ADR.)

required_car_parking_spaces = Number of car parking spaces required by the customer

total_of_special_requests = Number of special requests made by the customer (e.g. twin bed or high floor)

reservation_status = Check - Out – customer has checked in but already departed; No - Show – customer did not check-in and did inform the hotel of the reason why

reservation_status_date = Date at which the last status was set. This variable can be used in conjunction with the ReservationStatus to understand when was the booking canceled or when did the customer checked-out of the hotel

name = Name of the Guest

email = Email

phone-number = Phone number

credit_card = Credit Card Number

Importing Libraries

In [1]:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns

Loading the Dataset

In [2]:
df = pd.read_csv("hotel_booking.csv")
df.head()
Out[2]:
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date name email phone-number credit_card
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... Transient 0.0 0 0 Check-Out 2015-07-01 Ernest Barnes Ernest.Barnes31@outlook.com 669-792-1661 ************4322
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... Transient 0.0 0 0 Check-Out 2015-07-01 Andrea Baker Andrea_Baker94@aol.com 858-637-6955 ************9157
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... Transient 75.0 0 0 Check-Out 2015-07-02 Rebecca Parker Rebecca_Parker@comcast.net 652-885-2745 ************3734
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... Transient 75.0 0 0 Check-Out 2015-07-02 Laura Murray Laura_M@gmail.com 364-656-8427 ************5677
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... Transient 98.0 0 1 Check-Out 2015-07-03 Linda Hines LHines@verizon.com 713-226-5883 ************5498

5 rows × 36 columns

In [3]:
# Shape of the Dataset

df.shape
Out[3]:
(119390, 36)
In [4]:
# Checking Count & Data Types

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 36 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
 32  name                            119390 non-null  object 
 33  email                           119390 non-null  object 
 34  phone-number                    119390 non-null  object 
 35  credit_card                     119390 non-null  object 
dtypes: float64(4), int64(16), object(16)
memory usage: 32.8+ MB
In [5]:
# Checking for null values

df.isnull().sum()
Out[5]:
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
name                                   0
email                                  0
phone-number                           0
credit_card                            0
dtype: int64

Data Formatting and Cleaning

Dropping "agent" & "company" columns because they have very high null values. Also dropping columns which are not relevent to our analysis.

Dropping rows with null values in "children" & "country" columns as they have very less null values compared to the overall dataset.

In [6]:
df.drop(columns = ["agent", "company", "arrival_date_week_number", "email", "phone-number", "credit_card", "name"], inplace = True)
df.dropna(inplace = True)
In [7]:
# Converting "reservation_status_date" to datetime type as it was in object type.

df["reservation_status_date"] = pd.to_datetime(df["reservation_status_date"]).dt.date

# Extracting year from "reservation_status_date"
df["reservation_status_year"] = pd.to_datetime(df["reservation_status_date"]).dt.year
In [8]:
# Dropping rows where "reservation_status_year" is 2014

df = df.drop(df[df["reservation_status_year"] == 2014].index)
In [9]:
# Changing object "arrival_date_month" & "children" columns to integer type.

df["arrival_date_month"] = pd.to_datetime(df["arrival_date_month"], format = "%B").dt.month

df["children"] = df["children"].astype(int)
In [10]:
# Assigning categories name to categorical integer columns.

df["is_canceled"] = df["is_canceled"].replace({0 : "Not Canceled", 1 : "Canceled"})
df["is_repeated_guest"] = df["is_repeated_guest"].replace({0 : "Not Repeated", 1 : "Repeated"})

df["is_canceled_binary"] = df["is_canceled"].replace({"Not Canceled" : 0, "Canceled" : 1})
In [11]:
# Define a mapping of month numbers to month names.

month_mapping = { 1 : "January", 2 : "February", 3 : "March", 4 : "April", 5 : "May", 6 : "June", 7 : "July", 8 : "August",
                  9 : "September", 10 : "October", 11 : "November", 12 : "December" }

# Convert the integer month column to month names using the mapping.

exit_month = pd.to_datetime(df["reservation_status_date"]).dt.month
df["exit_month_name"] = exit_month.map(month_mapping)
In [12]:
# Creating a new column adding values for stays during week and weekend nights

df["total_stays_nights"] = df["stays_in_weekend_nights"] + df["stays_in_week_nights"]
In [13]:
# Describing object columns.

df.describe(include = "object")
Out[13]:
hotel is_canceled meal country market_segment distribution_channel is_repeated_guest reserved_room_type assigned_room_type deposit_type customer_type reservation_status reservation_status_date exit_month_name
count 118717 118717 118717 118717 118717 118717 118717 118717 118717 118717 118717 118717 118717 118717
unique 2 2 5 177 7 5 2 10 12 3 4 3 924 12
top City Hotel Not Canceled BB PRT Online TA TA/TO Not Repeated A A No Deposit Transient Check-Out 2015-10-21 July
freq 79122 74745 91682 48405 56402 97549 115091 85420 73682 103982 89173 74745 1461 12074
In [14]:
# Describing integer and float columns.

df.describe()
Out[14]:
lead_time arrival_date_year arrival_date_month arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies previous_cancellations previous_bookings_not_canceled booking_changes days_in_waiting_list adr required_car_parking_spaces total_of_special_requests reservation_status_year is_canceled_binary total_stays_nights
count 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000 118717.000000
mean 104.029617 2016.159421 6.551345 15.800854 0.930305 2.502902 1.858175 0.104366 0.007960 0.085582 0.131835 0.221518 2.334308 102.063542 0.061979 0.572555 2016.097728 0.370393 3.433207
std 106.737483 0.706551 3.089032 8.779217 0.996318 1.901514 0.578990 0.399456 0.097454 0.843575 1.485794 0.653225 17.643652 50.500364 0.244346 0.792967 0.711241 0.482912 2.546266
min 0.000000 2015.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -6.380000 0.000000 0.000000 2015.000000 0.000000 0.000000
25% 18.000000 2016.000000 4.000000 8.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 70.000000 0.000000 0.000000 2016.000000 0.000000 2.000000
50% 69.000000 2016.000000 7.000000 16.000000 1.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 95.000000 0.000000 0.000000 2016.000000 0.000000 3.000000
75% 160.000000 2017.000000 9.000000 23.000000 2.000000 3.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 126.000000 0.000000 1.000000 2017.000000 1.000000 4.000000
max 737.000000 2017.000000 12.000000 31.000000 16.000000 41.000000 55.000000 10.000000 10.000000 26.000000 72.000000 21.000000 391.000000 5400.000000 8.000000 5.000000 2017.000000 1.000000 57.000000

There seems to be presence of some outliers in "adr" column. Lets remove it as price can play a crucial role in our analysis and we don't want the outlier to influence our analysis.

In [16]:
# Removing the outlier from our dataset

df = df[df["adr"] < 5000]
df.shape
Out[16]:
(118716, 33)

Finally, seeing the cleaned dataset after all the cleansing and formatting.

In [17]:
df.head().transpose()
Out[17]:
0 1 2 3 4
hotel Resort Hotel Resort Hotel Resort Hotel Resort Hotel Resort Hotel
is_canceled Not Canceled Not Canceled Not Canceled Not Canceled Not Canceled
lead_time 342 737 7 13 14
arrival_date_year 2015 2015 2015 2015 2015
arrival_date_month 7 7 7 7 7
arrival_date_day_of_month 1 1 1 1 1
stays_in_weekend_nights 0 0 0 0 0
stays_in_week_nights 0 0 1 1 2
adults 2 2 1 1 2
children 0 0 0 0 0
babies 0 0 0 0 0
meal BB BB BB BB BB
country PRT PRT GBR GBR GBR
market_segment Direct Direct Direct Corporate Online TA
distribution_channel Direct Direct Direct Corporate TA/TO
is_repeated_guest Not Repeated Not Repeated Not Repeated Not Repeated Not Repeated
previous_cancellations 0 0 0 0 0
previous_bookings_not_canceled 0 0 0 0 0
reserved_room_type C C A A A
assigned_room_type C C C A A
booking_changes 3 4 0 0 0
deposit_type No Deposit No Deposit No Deposit No Deposit No Deposit
days_in_waiting_list 0 0 0 0 0
customer_type Transient Transient Transient Transient Transient
adr 0.0 0.0 75.0 75.0 98.0
required_car_parking_spaces 0 0 0 0 0
total_of_special_requests 0 0 0 0 1
reservation_status Check-Out Check-Out Check-Out Check-Out Check-Out
reservation_status_date 2015-07-01 2015-07-01 2015-07-02 2015-07-02 2015-07-03
reservation_status_year 2015 2015 2015 2015 2015
is_canceled_binary 0 0 0 0 0
exit_month_name July July July July July
total_stays_nights 0 0 1 1 2

Data Analysis and Visualization

In [18]:
# Calculating percentage for types of hotels.

df["hotel"].value_counts(normalize = True) * 100
Out[18]:
City Hotel      66.647293
Resort Hotel    33.352707
Name: hotel, dtype: float64
In [19]:
# Visualization of Hotel types

plt.figure(figsize = (8, 4))

ax = sns.countplot(data = df, x = "hotel", edgecolor = "k", palette = "flare")

# Creating annotation which will label each bar with its corresponding count.
for index, value in enumerate(df["hotel"].value_counts(ascending = True)):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Reservation Status by Hotel Types", size = 15, weight = "bold", color = "#76448A")
plt.xlabel("Hotel Type")
plt.ylabel("Number of Reservations")
plt.ylim([0, 85000])

plt.show()

The graph shows there is a significant number of bookings for City Hotels, almost double the number of bookings for Resort Hotels.

Understanding the distribution of hotel types in the dataset is valuable for further analysis and can provide insights into the preferences and popularity of different hotel types among customers.

In [20]:
# Calculating ratio of Cancellation of Bookings.

df["is_canceled"].value_counts(normalize = True) * 100
Out[20]:
Not Canceled    62.961185
Canceled        37.038815
Name: is_canceled, dtype: float64
In [21]:
# Visualization of Cancellation Status

plt.figure(figsize = (8, 4))

ax = sns.countplot(data = df, x = "is_canceled", edgecolor = "k", palette = "flare")

for index, value in enumerate(df["is_canceled"].value_counts()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Cancellation Status", size = 15, weight = "bold", color = "#76448A")
ax.set(xlabel = None)
plt.ylabel("Number of Cancellation")
plt.ylim([0, 80000])

plt.show()

This analysis shows that around 37.13% of the bookings in the dataset were canceled, while the remaining 62.87% were not canceled.

Further investigation into the reasons behind cancellations can help in developing strategies to reduce the cancellation rate and improve overall customer satisfaction.

In [22]:
# Visualizing Cancellation status for both the hotel types

plt.figure(figsize = (8, 4))

sns.countplot(data = df, x = "hotel", hue = "is_canceled", edgecolor = "k", palette = "flare")

plt.title("Reservation Status in different Hotels by Cancellations", size = 14, weight = "bold", color = "#76448A")
plt.xlabel("Hotel Type")
plt.ylabel("Number of Reservations")
plt.legend(title = "Cancellation Type", fontsize = 7)

plt.show()

City Hotels have a higher cancellation rate compared to Resort Hotels. This suggests that customers booking City Hotels are more likely to cancel their reservations.

The cancellation rates for both types of hotels are significant, and it's essential to analyze the factors contributing to these cancellations.

In [23]:
# Generating a bar plot to display the top 5 countries with the highest number of reservations.

# Selecting country name and reservation count using "value_counts()".
country_name = df["country"].value_counts().head(5).index
country_count = df["country"].value_counts().head(5).values

plt.figure(figsize = (8, 4))

ax = sns.barplot(data = df, x = country_name, y = country_count, edgecolor = "k", palette = "flare_r")

for index, value in enumerate(df["country"].value_counts().head()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Countries with Highest Number of Reservations", size = 13, weight = "bold", color = "#76448A")
plt.xlabel("Countries")
plt.ylabel("Number of Reservations")
plt.ylim([0, 55000])

plt.show()

Portugal (PRT) stands out with significantly more reservations than other countries in the top 5. It's the most popular destination among the bookings in our dataset.

The United Kingdom (GBR), France (FRA), Spain (ESP), and Germany (DEU) have much lower reservations than Portugal (PRT), but are in a similar range among the top 5 popular travel destinations.

In [24]:
# Generating a list of top 5 countries with their respective reservation cancellation counts.

canceled_data = df[df["is_canceled"] == "Canceled"]

can_country_name = canceled_data["country"].value_counts().head().index
can_country_count = canceled_data["country"].value_counts().head().values

print("Top 5 countries with their reservation cancellation counts:")
print(canceled_data["country"].value_counts().head())
Top 5 countries with their reservation cancellation counts:
PRT    27333
GBR     2453
ESP     2177
FRA     1934
ITA     1333
Name: country, dtype: int64
In [25]:
# Creating a donut chart to visualize the distribution of reservation cancellations among the top 5 countries.

plt.figure(figsize = (6, 5))

palette_color = sns.color_palette("flare_r")
explode = (0.05, 0.05, 0.05, 0.05, 0.05)

plt.pie(can_country_count, labels = can_country_name, colors = palette_color, 
        autopct= "%.1f%%", startangle = 90, explode = explode, pctdistance = 0.8)

# Drawing a white circle at the center to create the hole in the donut chart
centre_circle = plt.Circle((0, 0), 0.65, fc = "white")
fig = plt.gcf()
 
# Adding Circle in Pie chart
fig.gca().add_artist(centre_circle)

plt.axis("equal")
plt.title("Countries with Highest Number of Cancellations", size = 11, weight = "bold", color = "#76448A")

plt.show()

Portugal (PRT) is the most popular country among the top 5 for both reservations and cancellations. This indicates that while it is a popular destination, it also experiences a relatively high cancellation rate.

The United Kingdom (GBR) has a substantial number of reservations, but it's cancellation count is considerably lower than Portugal's. This suggests that bookings from the UK are more reliable than those from Portugal. Same can be said for other countries in the top 5.

In [26]:
# Generating a correlation heatmap to visualize the relationships between numeric columns in our dataset.

# Selecting only the numeric columns
numeric_columns = df.select_dtypes(include = ["int64", "float64"])

# Calculating the correlation matrix
correlation_matrix = numeric_columns.corr()

# Creating the heatmap using Seaborn
plt.figure(figsize = (10, 8))

sns.heatmap(correlation_matrix, annot = True, cmap = "flare", fmt = ".2f", linewidth = 0.7)

plt.title("Correlation Heatmap", weight = "bold", color = "#76448A")

plt.show()

From the above correlation heatmap, we can identify which features have a higher correlation with cancellations.

This will give us an indication of which factors might be influencing cancellations.

In [27]:
# Creating subplots of violin and bar plot to examine the correlation between "Lead Time" and "Cancellation Status".

plt.figure(figsize = (13, 10))

plt.subplot(2, 1, 1)

ax1 = sns.violinplot(x = "is_canceled", y = "lead_time", order=["Canceled", "Not Canceled"], data = df, palette = "flare")

ax1.set(xlabel = None)
plt.ylabel("Lead Time")

plt.subplot(2, 1, 2)

groupby_can = df.groupby("is_canceled")["lead_time"].mean()

ax2 = sns.barplot(data = df, x = "is_canceled", y = "lead_time", order=["Canceled", "Not Canceled"], 
                  edgecolor = "k", palette = "flare")

for index, value in enumerate(groupby_can):
    ax2.text(index, value - 50, f"{value:.2f}", ha = "center", va = "bottom", fontsize = 10, color = "white")
    
plt.xlabel("Cancellation Status")
plt.ylabel("Lead Time")

plt.suptitle("Correlation between Lead Time & Cancellation Status",  size = 17, weight = "bold", color = "#76448A")
plt.tight_layout()

plt.show()

The wider distribution for canceled bookings in the violin plot indicates that there is more variability in lead times (i.e., number of days between booking and arrival) for these reservations, potentially indicating that customers who book further in advance might be more likely to cancel their reservations.

On the other hand, the narrower distribution for not-canceled bookings suggests that bookings with shorter lead times tend to have a higher chance of not being canceled.

This assumption is supported by the bar plot below it, in which lead time is much higher for canceled bookings than the bookings that were not canceled.

Note :

Thus, bookings with longer lead times may have a higher chance of being canceled compared to bookings with shorter lead times.

In [28]:
# Using subplots of box and count plot to explore the correlation between "Number of Special Requests" & "Cancellation Status".

plt.figure(figsize = (13, 10))

plt.subplot(2, 1, 1)

sns.boxplot(x = "is_canceled", y = "total_of_special_requests", data = df, palette = "flare_r")

plt.xlabel("Cancellation Status")
plt.ylabel("Number of Special Requests")

plt.subplot(2, 1, 2)

sns.countplot(data = df, x = "total_of_special_requests", hue = "is_canceled", edgecolor = "k", palette = "flare")

plt.xlabel("Number of Special Requests")
plt.ylabel("Count")
plt.legend(title = "Cancellation Status")

plt.suptitle("Correlation between Special Requests & Cancellation Status", weight = "bold", color = "#76448A", size = 17)
plt.tight_layout()

plt.show()

Majority of not-canceled bookings have a small number of special requests (upto 2), with the highest concentration of requests falling between 0 and 1.

In contrast, canceled bookings tend to have fewer special requests, with the majority having none.

Again, this assumption is supported by the count plot, count for 0 special request is very high for bookings that were not canceled and almost entirely for canceled bookings. While 1 special request also have a significant number for bookings that were not canceled, which decreases as increase in request numbers.

Note :

This suggests that the number of special requests have an affect on the cancellation status of bookings. When a booking has 0 special requests, it has equal chances of being not canceled and canceled. But as the special requests increase, chance for a booking to not get canceled also increases.

In [29]:
# Analyzing Reservation Status Distribution Year-wise

df["reservation_status_year"].value_counts()
Out[29]:
2016    57528
2017    36395
2015    24793
Name: reservation_status_year, dtype: int64
In [30]:
# Using a side-by-side bar plot to compare the Year on Year "Reservations in Hotels" and "Cancellations".
# The visualization provides insights into trends and patterns over time.

plt.figure(figsize = (15, 5))

plt.subplot(1, 2, 1)

sns.countplot(data = df, x = "reservation_status_year", hue = "hotel", edgecolor = "k", palette = "flare")

plt.title("Year on Year Reservations in Hotels", size = 15)
plt.xlabel("Year")
plt.ylabel("Number of Reservations")
plt.legend(title = "Hotel Type", fontsize = 13)

plt.subplot(1, 2, 2)

sns.countplot(data = df, x = "reservation_status_year", hue = "is_canceled", edgecolor = "k", palette = "flare")

plt.title("Year on Year Cancellations", size = 15)
plt.xlabel("Year")
plt.ylabel("Number of Cancellations")
plt.legend(title = "Cancellation Type", fontsize = 13)

plt.suptitle("Year on Year Change", size = 17, color = "#76448A", weight = "bold")
plt.tight_layout()  ## To prevent overlapping of the subplots

plt.show()

The side-by-side bar plot provides insights into the trends and patterns of reservations and cancellations for the available data for the years 2015, 2016, and 2017.

Both types of hotels experienced fluctuations in reservations over the years, with the City Hotel generally having much higher booking numbers.

Cancellation rates showed fluctuations as well, with the year 2016 witnessing a significant increase in cancellations and the year 2017 showing a decline in cancellations compared to 2016 and finally, worst ratio between cancellations in year 2015.

In [31]:
# Visualizing "Month-wise Bookings" in a descending order, displaying the number of reservations for each month.

plt.figure(figsize = (13, 5))

ax = sns.countplot(data = df, x = "exit_month_name", edgecolor = "k",
                   palette = "flare_r", order = df["exit_month_name"].value_counts().index)

for index, value in enumerate(df["exit_month_name"].value_counts()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Month-wise Bookings", size = 15, color = "#76448A", weight = "bold")
plt.xlabel("Months")
plt.ylabel("Number of Bookings")

plt.show()

From the graph it is evident that, July & August have the highest bookings, while December & November have the lowest bookings.

Bookings seems to rise during summer months and decrease towards year-end, suggesting a seasonal impact on the bookings.

Note :

Further analysis of canceled and non-canceled bookings by month is needed for a clearer picture of booking trends.

In [32]:
# Visualizing "Cancellation Rates Month-wise" for canceled reservations in a descending order,
# displaying the number of cancellations for each month.

plt.figure(figsize = (13, 5))

ax = sns.countplot(data = canceled_data, x = "exit_month_name", edgecolor = "k",
                   palette = "flare_r", order = canceled_data["exit_month_name"].value_counts().index)

for index, value in enumerate(canceled_data["exit_month_name"].value_counts()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Month-wise Cancellations", size = 15, color = "#76448A", weight = "bold")
plt.xlabel("Months")
plt.ylabel("Number of Cancellation")
plt.ylim([0, 6500])

plt.show()
In [33]:
# Visualizing "Non Cancellation Rates Month-wise" for not canceled reservations in a descending order,
# displaying the number of non cancellations for each month.

plt.figure(figsize = (13, 5))

non_canceled_data = df[df["is_canceled"] == "Not Canceled"]

ax = sns.countplot(data = non_canceled_data, x = "exit_month_name", edgecolor = "k",
                   palette = "flare_r", order = non_canceled_data["exit_month_name"].value_counts().index)

for index, value in enumerate(non_canceled_data["exit_month_name"].value_counts()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Month-wise Non Cancellations", size = 15, color = "#76448A", weight = "bold")
plt.xlabel("Months")
plt.ylabel("Number of Non Cancellation")
plt.ylim([0, 9000])

plt.show()
In [34]:
# Using count plot to visualize the month-wise cancellation status,
# displaying the number of reservations for each month, categorized by cancellation type.

plt.figure(figsize = (13, 5))

order = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

sns.countplot(data = df, x = "exit_month_name", hue = "is_canceled", order = order, edgecolor = "k", palette = "flare")

plt.title("Month-wise Cancellation Status at Check-Out", size = 15, color = "#76448A", weight = "bold")
plt.xlabel("Months")
plt.ylabel("Number of Reservations")
plt.legend(title='Cancellation Type', fontsize = 7)

plt.show()

Interesting Insights:

August and September stand out with having one of the highest number of non-cancellations and least number of cancellations, suggesting that customers are more committed to their travel plans during these months.

Conversely, January and December exhibit the lowest number of successful reservations and the highest number of cancellations. This indicates that bookings made for January and December are more likely to be canceled.

The pricing strategies employed by hotels during these months might also impact customer behavior, or holiday plans and changing weather conditions.

In [35]:
# Visualizing month-wise ADR for canceled bookings using a bar plot, including annotations for better understanding.

plt.figure(figsize = (13, 5))

df["exit_month_name_ordered"] = pd.Categorical(df["exit_month_name"], categories = order, ordered = True)

grouped_data = df[df["is_canceled"] == "Canceled"].groupby("exit_month_name_ordered")[["adr"]].sum().reset_index()

ax = sns.barplot(x = "exit_month_name_ordered", y = "adr", data = grouped_data, palette = "flare_r", edgecolor = "k")

for index, value in enumerate(grouped_data["adr"]):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 9)

plt.title("Month-wise ADR for Canceled Bookings", size = 15, weight = "bold", color = "#76448A")
plt.xlabel("Months")
plt.ylabel("ADR")
plt.ylim([0,650000])

plt.show()

There is an overlap between months with the highest cancellations and months with the highest ADR for canceled bookings. Both January and July appear at the top of both lists.

The overlap between months with high cancellations and high ADR for canceled bookings may suggest that customers are canceling reservations with higher rates, possibly due to the increased cost or changing travel plans.

August stands out as a month with both high non-cancellation rates and a relatively low ADR for canceled bookings.

This indicates that customers may be more committed to reservations made during August, even if the cancellations for those bookings have lower ADR.

Note :

Thus, we can say that price plays a crucial role in determining whether customers cancel or not-cancel their reservations.

In [36]:
# Calculating and sorting the mean ADR for "Resort Hotel" and "City Hotel" reservations on different reservation status dates.

resort_hotel_adr = df[df["hotel"] == "Resort Hotel"].groupby("reservation_status_date")["adr"].mean().reset_index()
resort_hotel_adr.sort_values("reservation_status_date", inplace = True)

city_hotel_adr = df[df["hotel"] == "City Hotel"].groupby("reservation_status_date")["adr"].mean().reset_index()
city_hotel_adr.sort_values("reservation_status_date", inplace = True)
In [37]:
# Filtering the "resort_hotel_adr" and "city_hotel_adr" DataFrames to include only the ADR data between "07-2015" and "10-2017".
# As the data was not consistent beyond these ranges, skipping these dates was important to get a clear visualization of the trends.

start_date = datetime.strptime("2015-07", "%Y-%m").date()
end_date = datetime.strptime("2017-10", "%Y-%m").date()

resort_hotel_adr = resort_hotel_adr[(resort_hotel_adr["reservation_status_date"] > start_date) &
                                    (resort_hotel_adr["reservation_status_date"] < end_date)]

city_hotel_adr = city_hotel_adr[(city_hotel_adr["reservation_status_date"] > start_date) &
                                (city_hotel_adr["reservation_status_date"] < end_date)]
In [38]:
# Visualizing the "Average Daily Rate by Hotels" using line plots for "Resort" and "City" reservations.
# The plot shows the trends in the average daily rates for both hotel types over the selected date range.

plt.figure(figsize = (20, 6))

colors = sns.color_palette("flare_r")

plt.plot(resort_hotel_adr["reservation_status_date"], resort_hotel_adr["adr"], label = "Resort Hotel", color = colors[0])
plt.plot(city_hotel_adr["reservation_status_date"], city_hotel_adr["adr"], label = "City Hotel", color = colors[4])

plt.title("Average Daily Rate by Hotels", size = 19, weight = "bold", color = "#76448A")
plt.legend(fontsize = 17, loc = "lower right")

plt.show()

We can observe a considerable fluctuation in the ADR data for Resorts compared to City Hotels, where the latter maintains a consistent ADR. Specifically, there is a significant increase in Resort's ADR during the discussed months.

This fluctuation in ADR could potentially be a contributing factor to the lower number of reservations in Resorts in comparison to City Hotels.

In [39]:
# Calculating and sorting the mean ADR for "Canceled" and "Not Canceled" reservations based on different reservation status dates.

canceled_adr = df[df["is_canceled"] == "Canceled"].groupby("reservation_status_date")["adr"].mean().reset_index()
canceled_adr.sort_values("reservation_status_date", inplace = True)

not_canceled_adr = df[df["is_canceled"] == "Not Canceled"].groupby("reservation_status_date")["adr"].mean().reset_index()
not_canceled_adr.sort_values("reservation_status_date", inplace = True)
In [40]:
# Filtering the "canceled_adr" and "not_canceled_adr" DataFrames to include only the ADR data between "07-2015" and "10-2017".
canceled_adr = canceled_adr[(canceled_adr["reservation_status_date"] > start_date) &
                            (canceled_adr["reservation_status_date"] < end_date)]

not_canceled_adr = not_canceled_adr[(not_canceled_adr["reservation_status_date"] > start_date) &
                                    (not_canceled_adr["reservation_status_date"] < end_date)]
In [41]:
# Plotting the "ADR by Cancellations" using line plots for "Canceled" and "Not Canceled" reservations.
# The visualization shows the trends in the average daily rates for both reservation types within the selected date range.

plt.figure(figsize = (20, 6))

colors = sns.color_palette("flare_r")

plt.plot(canceled_adr["reservation_status_date"], canceled_adr["adr"], label = "Canceled", color = colors[0])
plt.plot(not_canceled_adr["reservation_status_date"], not_canceled_adr["adr"], label = "Not Canceled", color = colors[4])

plt.title("Average Daily Rate by Cancellations", size = 19, weight = "bold", color = "#76448A")
plt.legend(fontsize = 17, loc = "lower right")

plt.show()

Looking at this line plot further reinforces our assumption that as ADR increases, there is also a rise in hotel booking cancellations.

Therefore, hotel authorities should focus on improving their pricing strategies.

Implementing a social media campaign and offering coupons and discounts to consumers could potentially lead to lower cancellations, especially during the months with historically high cancellation rates.

In [42]:
# Extracting the top 5 values and their corresponding counts for "stays_in_weekend_nights" and "stays_in_week_nights".

weekend_nights_num = df["stays_in_weekend_nights"].value_counts().head(10).index
weekend_nights_count = df["stays_in_weekend_nights"].value_counts().head(10).values

week_nights_num = df["stays_in_week_nights"].value_counts().head(10).index
week_nights_count = df["stays_in_week_nights"].value_counts().head(10).values
In [43]:
# Visualizing "Week & Weekend Nights Bookings" using side-by-side bar plots, 
# displaying count of bookings for different numbers of nights.

plt.figure(figsize = (12, 4))

plt.suptitle("Top 10 Week vs Weekend Bookings", size = 19, weight = "bold", color = "#76448A")

plt.subplot(1, 2, 1)

ax1 = sns.barplot(data = df, x = week_nights_num, y = week_nights_count, 
                  edgecolor = "k", palette = "flare_r", order = week_nights_num)

for index, value in enumerate(df["stays_in_week_nights"].value_counts().head(10)):
    ax1.text(index, value, int(str(value)), ha = "center", va = "bottom", fontsize = 9)

plt.title("Count for Week Nights Bookings", size = 15)
plt.xlabel("Number of Nights")
plt.ylabel("Count of Bookings")
plt.ylim([0, 55000])

plt.subplot(1, 2, 2)

ax2 = sns.barplot(data = df, x = weekend_nights_num, y = weekend_nights_count, 
                  edgecolor = "k", palette = "flare_r", order = weekend_nights_num)

for index, value in enumerate(df["stays_in_weekend_nights"].value_counts().head(10)):
    ax2.text(index, value, int(str(value)), ha = "center", va = "bottom", fontsize = 9)

plt.title("Count for Weekend Nights Bookings", size = 15)
plt.xlabel("Number of Nights")
plt.ylabel("Count of Bookings")
plt.ylim([0, 55000])

plt.tight_layout()

plt.show()

Week Nights

Most common duration for weeknight stays is 2 nights, followed closely by 1 night and 3 nights. There is a gradual decline in bookings for stays longer than these afterwards.

Weekend Nights

A substantial number of weekend stays are either day trips (0 nights or same-day check-ins/check-outs) or 2 nights, with 1 night being a close third. There is a significant drop in bookings for stays lasting more after that.

Summary :

The graph highlights the importance of offering attractive packages and deals for 2-night and 1-night stays, as they are the most popular choices for both weeknight and weekend stays and 0-night for weekend stays only too. Understanding the preference for shorter durations can help hotels optimize their offerings and pricing strategies to cater to guest preferences effectively and efficiently.

In [44]:
# Visualizing "Bookings for Total Stay" using bar plots, displaying count of bookings for different numbers of nights.

total_stays_num = df["total_stays_nights"].value_counts().head(10).index
total_stays_count = df["total_stays_nights"].value_counts().head(10).values

plt.figure(figsize = (8, 4))

ax = sns.barplot(data = df, x = total_stays_num, y = total_stays_count, order = total_stays_num, 
            edgecolor = "k", palette = "flare_r")

for index, value in enumerate(df["total_stays_nights"].value_counts().head(10)):
    ax.text(index, value, int(str(value)), ha = "center", va = "bottom", fontsize = 9)

plt.title("Count of Bookings for Total Stay", size = 15, weight = "bold", color = "#76448A")
plt.xlabel("Number of Nights")
plt.ylabel("Count of Bookings")
plt.ylim([0, 30000])

plt.show()

2 and 3-night stays are highly popular among customers, with the highest demand. Shorter trips of 1-4 days are more common, indicating a preference for brief getaways.

Bookings drop significantly for stays longer than 4 nights, suggesting customers are less inclined to book longer durations. Longer stays are less common, likely due to factors like cost or work commitments.

In [45]:
# Visualizing "Top 5 Room Types" using bar plots, displaying count of bookings for different types of rooms.

room_type = df["reserved_room_type"].value_counts().head().index
room_type_count = df["reserved_room_type"].value_counts().head().values

plt.figure(figsize = (8, 4))

ax = sns.barplot(data = df, x = room_type, y = room_type_count, order = room_type, 
            edgecolor = "k", palette = "flare_r")

for index, value in enumerate(df["reserved_room_type"].value_counts().head()):
    ax.text(index, value, int(str(value)), ha = "center", va = "bottom", fontsize = 9)

plt.title("Top 5 Room Types Booked by Customers", size = 15, weight = "bold", color = "#76448A")
plt.xlabel("Room Type")
plt.ylabel("Count of Bookings")
plt.ylim([0, 90000])

plt.show()
In [46]:
# Calculating the percentage of canceled reservations when assigned room type is different from booked room type.
# It will help us to see if getting a different room than assigned initially has any impact on cancellations or not.

diff_room = df[df["reserved_room_type"] != df["assigned_room_type"]]
diff_room["is_canceled"].value_counts(normalize = True) * 100
Out[46]:
Not Canceled    94.605075
Canceled         5.394925
Name: is_canceled, dtype: float64

This suggests that in most cases, receiving a different room type than initially booked does not significantly impact the decision of customers to cancel their reservations.

The high percentage of not canceled reservations indicates that many guests are flexible and willing to accept the assigned room type without canceling their bookings.

In [47]:
# Calculating the percentage distribution of different meal types.

df["meal"].value_counts(normalize = True) * 100
Out[47]:
BB           77.227164
HB           12.158429
SC            8.960881
Undefined     0.981334
FB            0.672192
Name: meal, dtype: float64
In [48]:
# Generating a count plot to display the distribution of different meal types.

plt.figure(figsize = (8, 4))

ax = sns.countplot(data = df, x = "meal", order = df["meal"].value_counts().index, edgecolor = "k", palette = "flare_r")

for index, value in enumerate(df["meal"].value_counts()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Types of Meals Opted by Customers", size = 13, weight = "bold", color = "#76448A")
plt.xlabel("Meal")
plt.ylabel("Number of Reservations")
plt.ylim([0, 100000])
    
plt.show()

Bed & Breakfast is the most preferred option, followed by Half Board and Self-Catering.

Hotels can use this information to tailor their offerings to the customer preferences, whether it's focusing on breakfast-inclusive packages, promoting Half Board options for those seeking added convenience, or providing attractive alternatives for travelers who prefer Self-Catering arrangements.

In [49]:
# Calculating the percentage distribution of different market segments.

df["market_segment"].value_counts(normalize = True) * 100
Out[49]:
Online TA        47.510024
Offline TA/TO    20.350248
Groups           16.531891
Direct           10.484686
Corporate         4.305233
Complementary     0.618282
Aviation          0.199636
Name: market_segment, dtype: float64
In [50]:
# Creating a donut chart to visualize the distribution of top 5 market segments.

plt.figure(figsize = (6, 5))

palette_color = sns.color_palette("flare_r")
explode = (0.05, 0.05, 0.05, 0.05, 0.05)

plt.pie(df["market_segment"].value_counts().head().values, labels = df["market_segment"].value_counts().head().index, 
        colors = palette_color, autopct= "%.2f%%", startangle = 90, explode = explode, pctdistance = 0.8)

centre_circle = plt.Circle((0, 0), 0.65, fc = "white")
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.axis("equal")
plt.title("Top 5 Market Segments by Share", size = 11, weight = "bold", color = "#76448A")

plt.show()

The fact that Online Travel Agencies (Online TA) represent nearly 50% of the market share indicates a strong preference for online bookings among customers nowadays.

Percentage of Offline TA/To indicates that while online channels dominate, there is still a considerable share of the market that prefers traditional offline booking methods.

The market share for Group bookings suggests a notable demand for group travel arrangements, such as business conferences, events, or leisure group trips.

Market share for Direct bookings demonstrate that a substantial number of customers prefer booking directly through the hotel or service provider, underlining the importance of offering seamless and attractive direct booking options.

In conclusion, the data suggests that the hotel's online presence and association with online travel agencies play a significant role in attracting customers. Group bookings also seem to be a notable source of revenue, followed by direct bookings and corporate bookings. It's essential for the hotel to continue focusing on their online presence, catering to group travelers, and maintaining relationships with corporate clients to optimize their revenue streams. Additionally, they may want to explore ways to increase direct bookings and foster loyalty among customers to improve overall profitability.

In [51]:
# Calculating the percentage distribution of Repeated & Non-Repeated Guests.

df["is_repeated_guest"].value_counts(normalize = True) * 100
Out[51]:
Not Repeated    96.945652
Repeated         3.054348
Name: is_repeated_guest, dtype: float64

The high percentage of "Not Repeated" customers implies that the hotel attracts a significant number of new customers, potentially indicating continuous efforts in attracting and engaging with new customers.

On the other hand, the presence of lower number of "Repeated" customers highlights the hotel's inability to build customer loyalty, as only these customers have chosen to return and continue their engagement with the Hotels. Nurturing and retaining repeat customers can lead to long-term business success and positive word-of-mouth marketing.

In [52]:
# Calculating the percentage distribution of the number of "Previous Cancellations" for the top 5 values.

df["previous_cancellations"].value_counts(normalize = True).head() * 100
Out[52]:
0     94.721015
1      4.918461
2      0.094343
3      0.054753
24     0.040433
Name: previous_cancellations, dtype: float64

The majority of reservations (approximately 94.58%) have no previous cancellations, suggesting that most customers either make their first booking or have a history of reliable bookings without cancellations.

However, a small portion of customers (around 5.06%) have experienced at least one previous cancellation, indicating the possibility of certain customers facing changing plans or unpredictable circumstances that lead to booking alterations.

In [53]:
# Calculating the percentage distribution of the number of "Required Car Parking Spaces".

df["required_car_parking_spaces"].value_counts(normalize = True) * 100
Out[53]:
0    93.842448
1     6.129755
2     0.023586
3     0.002527
8     0.001685
Name: required_car_parking_spaces, dtype: float64

The vast majority of reservations do not require any car parking spaces, indicating that customers might be arriving by means other than personal vehicles or are utilizing alternative transportation options.

The presence of a small percentage of reservations that require one car parking space or more could suggest that few customers prefer the convenience of on-site parking or are likely to arrive by car.

In [54]:
# Calculating the percentage distribution of different "Customer Types".

df["customer_type"].value_counts(normalize = True) * 100
Out[54]:
Transient          75.113717
Transient-Party    20.972742
Contract            3.433404
Group               0.480137
Name: customer_type, dtype: float64
In [55]:
# Generating a count plot to display the distribution of different customer types.

plt.figure(figsize = (8, 4))

ax = sns.countplot(data = df, x = "customer_type", order = df["customer_type"].value_counts().index, 
                   edgecolor = "k", palette = "flare_r")

for index, value in enumerate(df["customer_type"].value_counts()):
    ax.text(index, value, str(int(value)), ha = "center", va = "bottom", fontsize = 10)

plt.title("Types of Customers by Bookings Count", size = 13, weight = "bold", color = "#76448A")
plt.xlabel("Customer Categories")
plt.ylabel("Number of Reservations")
plt.ylim([0, 95000])
    
plt.show()

The majority of customers fall under the "Transient" category, suggesting that the hotel likely serves a significant number of individual or one-time customers.

The presence of "Transient-Party" and "Contract" customers highlights the diversity in the customer base, potentially involving business travelers or leisure bookings.

However, "Group" customers comprise a relatively small percentage, indicating that the hotel should explore marketing strategies for attracting more by hosting group events.

In [56]:
# Calculating the percentage distribution of "Customer Types" for canceled reservations.

canceled_data["customer_type"].value_counts(normalize = True) * 100
Out[56]:
Transient          82.913739
Transient-Party    14.084283
Contract            2.870073
Group               0.131905
Name: customer_type, dtype: float64
In [57]:
# Generating a count plot to display the distribution of different customer types with their cancellation status.

plt.figure(figsize = (8, 4))

ax = sns.countplot(data = df, x = "customer_type", order = df["customer_type"].value_counts().index, hue = "is_canceled",
                   edgecolor = "k", palette = "flare_r")

plt.title("Types of Customers with their Cancellation Status", size = 13, weight = "bold", color = "#76448A")
plt.xlabel("Customer Categories")
plt.ylabel("Number of Reservations")
plt.legend(title = "Cancellation Status")
    
plt.show()

The high proportion of "Transient" customers among canceled reservations suggests that individual or one-time guests may be more prone to canceling their bookings.

On the other hand, the relatively lower cancellation rates for "Transient-Party", "Contract" and "Group" customers might indicate that these bookings are more likely to be firm and less subject to last-minute changes.

In [58]:
# Calculating the percentage distribution of different "Deposit Types".

df["deposit_type"].value_counts(normalize = True) * 100
Out[58]:
No Deposit    87.588868
Non Refund    12.274672
Refundable     0.136460
Name: deposit_type, dtype: float64
In [59]:
# Creating a donut chart to visualize the distribution of the types of deposits made by customers.

plt.figure(figsize = (6, 5))

palette_color = sns.color_palette("flare_r")
explode = (0.05, 0.05, 0.05)

plt.pie(df["deposit_type"].value_counts().head().values, labels = df["deposit_type"].value_counts().head().index, 
        colors = palette_color, autopct= "%.1f%%", startangle = 90, explode = explode, pctdistance = 0.8)

centre_circle = plt.Circle((0, 0), 0.65, fc = "white")
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.axis("equal")
plt.title("Types of Deposits made by Customers", size = 11, weight = "bold", color = "#76448A")

plt.show()

Among the different deposit types, "No Deposit" is the most common, accounting for approximately 87.6% of reservations.

"Non Refund" is the second most prevalent deposit type, constituting around 12.3% of reservations.

"Refundable" is the least common deposit type, with only 0.1% of reservations opting for this type.

In [60]:
plt.figure(figsize = (8, 4))

ax = sns.countplot(data = df, x = "deposit_type", order = df["deposit_type"].value_counts().index, hue = "is_canceled",
                   edgecolor = "k", palette = "flare_r")

plt.title("Types of Deposits made by Customers with their Cancellation Status", size = 11, weight = "bold", color = "#76448A")
plt.xlabel("Deposit Types")
plt.ylabel("Number of Reservations")
plt.legend(title = "Cancellation Status")
    
plt.show()

The relatively low cancellation rate of 28.3% and high non-cancellation of 71.7% for reservations with "No Deposit" suggests that customers might be more committed to these reservations, possibly due to the absence of any upfront payment requirement.

This flexibility might lead to a higher likelihood of honoring the reservation, resulting in a lower cancellation rate.

The high cancellation rate of 99.4% and merely a 0.6% non-cancellation rate for reservations with a "Non Refund" deposit type, suggests that these reservations may involve strict cancellation policies, making them less flexible for customers.

As a result, customers might be more hesitant to finalize such reservations, leading to a higher likelihood of cancellations.