Poisson Regression Examples

Author

Madeline Sands

Published

April 18, 2024

Blueprinty Case Study

Introduction

Blueprinty is a small firm that makes software for developing blueprints specifically for submitting patent applications to the US patent office. Their marketing team would like to make the claim that patent applicants using Blueprinty’s software are more successful in getting their patent applications approved compared to traditional methods. Ideal data to study such an effect and prove some type of causal relationship might include the success rate of patent applications before using Blueprinty’s software and after using it if the use of Blueprinty was randomly assigned. Unfortunately, this data is not available.

However, Blueprinty has collected data on 1,500 mature (non-startup) engineering firms. The data includes each firm’s number of patents awarded over the last 5 years, regional location, age since incorporation, and whether or not the firm uses Blueprinty’s software. Blueprinty’s marketing team would like to use this data to make the claim that firms using Blueprinty’s software are more successful in getting their patent applications approved.

Data

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 

blueprinty_df = pd.read_csv("data/blueprinty.csv")

blueprinty_df.info()

print(blueprinty_df.head())

dummy_blueprinty_df = pd.get_dummies(blueprinty_df['region'], prefix='dummy_')

# Concatenate the original DataFrame with the dummy variables DataFrame
blueprinty = pd.concat([blueprinty_df, dummy_blueprinty_df], axis=1)
print(blueprinty.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1500 non-null   int64  
 1   patents     1500 non-null   int64  
 2   region      1500 non-null   object 
 3   age         1500 non-null   float64
 4   iscustomer  1500 non-null   int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 58.7+ KB
   Unnamed: 0  patents     region   age  iscustomer
0           1        0    Midwest  32.5           0
1         786        3  Southwest  37.5           0
2         348        4  Northwest  27.0           1
3         927        3  Northeast  24.5           0
4         830        3  Southwest  37.0           0
   Unnamed: 0  patents     region   age  iscustomer  dummy__Midwest  \
0           1        0    Midwest  32.5           0            True   
1         786        3  Southwest  37.5           0           False   
2         348        4  Northwest  27.0           1           False   
3         927        3  Northeast  24.5           0           False   
4         830        3  Southwest  37.0           0           False   

   dummy__Northeast  dummy__Northwest  dummy__South  dummy__Southwest  
0             False             False         False             False  
1             False             False         False              True  
2             False              True         False             False  
3              True             False         False             False  
4             False             False         False              True  
grouped = blueprinty_df.groupby('iscustomer')['patents']
means = grouped.mean()

# Plot histograms for each group
plt.figure(figsize=(10, 6))
for status, group_data in grouped:
    label = 'Blueprinty customer' if status == 1 else 'Non-customer'
    plt.hist(group_data, alpha=0.5, label=label, bins=20)

# Overlay mean lines
for status, mean in means.items():
    label = 'Customer' if status == 1 else 'Non-customer'
    plt.axvline(mean, color='k', linestyle='dashed', linewidth=1)
    plt.text(mean + 0.1, 50, f'{label} mean: {mean:.2f}', rotation=90)

plt.xlabel('Number of Patents')
plt.ylabel('Frequency')
plt.title('Histogram of Number of Patents by Customer Status')
plt.legend()
plt.grid(True)
plt.show()

The histogram above shows the distribution of the number of patents by customer status, with dashed lines marking the two group means. In this data set, iscustomer = 1 indicates that an engineering firm is a Blueprinty customer and 0 indicates that it is not. Blueprinty's customers do have a slightly higher average number of patents: 4.09 patents compared to 3.62 for non-customers. There are two important caveats, however. First, the non-customer group is not only larger but also more concentrated in the 0-2 patent range, which pulls its average down. Second, Blueprinty customers are not selected at random, so systematic differences in the age and location of the engineering firms could also be contributing to the difference in mean patent counts.

It may be important to account for systematic differences in the age and regional location of customers versus non-customers. We therefore examine the distribution of Blueprinty's customers by region and age.

Customers by Region

grouped = blueprinty_df.groupby('iscustomer')['region']

# Plot histograms for each group
plt.figure(figsize=(10, 6))
for status, group_data in grouped:
    label = 'Blueprinty customer' if status == 1 else 'Non-customer'
    plt.hist(group_data, alpha=0.5, label=label, bins=10)

plt.xlabel('Region')
plt.ylabel('Number of Firms')
plt.title('Distribution of Firms by Region and Customer Status')
plt.legend()
plt.grid(True)
plt.show()

Based on region, Blueprinty has the highest number of customers in the Northeast. However, the Northeast also has the highest number of engineering firms of any region, so raw counts alone are misleading. If we instead compute the share of Blueprinty customers per region, Blueprinty accounts for approximately 10% of the engineering firms across the various regions.
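As a quick check on this claim, we can tabulate the customer share within each region (this check is not part of the original output):

# Share of firms in each region that are Blueprinty customers
share_by_region = pd.crosstab(blueprinty_df['region'], blueprinty_df['iscustomer'], normalize='index')
print(share_by_region)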

Customers by Age

grouped_age = blueprinty_df.groupby('iscustomer')['age']
means_age = grouped_age.mean()
# Plot histograms for each group
plt.figure(figsize=(10, 6))
for status, group_data in grouped_age:
    label = 'Blueprinty customer' if status == 1 else 'Non-customer'
    plt.hist(group_data, alpha=0.5, label=label, bins=20)

for status, mean in means_age.items():
    label = 'Customer' if status == 1 else 'Non-customer'
    plt.axvline(mean, color='k', linestyle='dashed', linewidth=1)
    plt.text(mean + 0.5, 75, f'{label} mean: {mean:.2f} years', rotation=90)

plt.xlabel('Firm Age (years)')
plt.ylabel('Frequency')
plt.title('Histogram of Firm Age by Customer Status')
plt.legend()
plt.grid(True)
plt.show()

The histogram above shows the distribution of firm age for Blueprinty customers and non-customers. On average, Blueprinty's customers are about 2.54 years younger than non-customers. The customer age distribution is also slightly right-skewed, with most of its mass at younger firm ages, which contributes to the lower average age. It appears that Blueprinty has been attracting younger businesses, which makes sense as part of its business plan: a younger firm with little experience submitting patents may see a greater return from using Blueprinty's software than an older firm with in-house patent expertise.

Estimation of Simple Poisson Model

Since our outcome variable is a count of events over a fixed window of time, we will use a Poisson density to model the number of patents awarded to each engineering firm over the last 5 years. The Poisson is a distribution over non-negative integer counts within a pre-determined time frame, which is why we use it rather than a continuous distribution.

We start by estimating a simple Poisson model via Maximum Likelihood.

First, the likelihood function represents the probability of observing the data given a specific value of \(\lambda\). To mathematically write down the likelihood for \(Y \sim \text{Poisson}(\lambda)\), we first need to write down the density of the dependent variable for one observation. For a Poisson distribution, the density function is \(f(Y|\lambda) = e^{-\lambda}\lambda^Y/Y!\).

Using this density function, we obtain the likelihood of the sample by multiplying the densities of the individual observations together, where N is the number of observations.
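Written out, the joint likelihood is the product of the N Poisson densities:

\[
L(\lambda) = \prod_{i=1}^{N} f(Y_i \mid \lambda) = \prod_{i=1}^{N} \frac{e^{-\lambda}\lambda^{Y_i}}{Y_i!}
\]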

Next, we take the natural log of the density and form the joint log-likelihood by summing the N log-densities (the log of a product is the sum of the logs). Taking the natural log of our likelihood function gives \(\ln L(\lambda) = \sum_{i=1}^{N} \left(-\lambda + Y_i \ln(\lambda) - \ln(Y_i!)\right)\).

Next we want to maximize our likelihood function. To find the maximum likelihood estimate of λ, we need to maximize the logarithm of the likelihood function with respect to λ. This can be done by taking the derivative of the log-likelihood function with respect to λ, setting it equal to zero, and solving for λ: \(\frac{d \ln L(\lambda)}{d\lambda} = 0\)

Once we solve this function for λ, we have the parameter value that best fits the observed data according to the Poisson distribution.
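Carrying out this step explicitly, the first-order condition has a closed-form solution: the MLE of \(\lambda\) is simply the sample mean of the counts.

\[
\frac{d \ln L(\lambda)}{d\lambda} = \sum_{i=1}^{N}\left(-1 + \frac{Y_i}{\lambda}\right) = -N + \frac{1}{\lambda}\sum_{i=1}^{N} Y_i = 0
\quad\Longrightarrow\quad
\hat{\lambda} = \frac{1}{N}\sum_{i=1}^{N} Y_i = \bar{Y}
\]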

from scipy.special import factorial

def poisson_loglikelihood(lambda_value, Y):
    # Sum of Poisson log-densities: -lambda + Y*ln(lambda) - ln(Y!)
    log_likelihood = np.sum(-lambda_value + Y * np.log(lambda_value) - np.log(factorial(Y)))
    return log_likelihood

In Python, this function takes a candidate value of \(\lambda\) and the observed counts Y and returns the log-likelihood.
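For example, we can evaluate it at a single candidate value (\(\lambda = 3\) here is purely illustrative):

# Evaluate the log-likelihood at an arbitrary candidate lambda
print(poisson_loglikelihood(3.0, blueprinty_df["patents"]))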

# number of patents
Y = blueprinty_df["patents"]

#range of lambda values
lambda_range = np.linspace(0.1, 20, 100)

# log-likelihood for each lambda
log_likelihoods = [poisson_loglikelihood(lambda_val, Y) for lambda_val in lambda_range]

plt.plot(lambda_range, log_likelihoods)
plt.xlabel('Lambda')
plt.ylabel('Log-Likelihood')
plt.title('Log-Likelihood of Poisson Distribution')
plt.grid(True)
plt.show()

When we take the first derivative of the log-likelihood and set it equal to 0, we can solve for the value that maximizes the function: the first derivative gives the slope, and at the maximum the slope is zero. From the plot above, the log-likelihood peaks somewhere between \(\lambda = 2.5\) and \(\lambda = 5\), at roughly 3.5 to 4.

from scipy.optimize import fsolve

def poisson_loglikelihood_derivative(lambda_value, Y):
    # First derivative of the log-likelihood: sum over observations of (-1 + Y_i / lambda)
    return np.sum(-1 + Y / lambda_value)

# Observed number of patents
Y = blueprinty_df["patents"]

# Solve for lambda using fsolve
lambda_mle = fsolve(poisson_loglikelihood_derivative, x0=1, args=(Y,))[0]

print(f"Maximum Likelihood Estimate (MLE) for lambda: {lambda_mle:.3f}")
Maximum Likelihood Estimate (MLE) for lambda: 3.685

We do not need to derive the first-order condition by hand and solve it each time. Instead, we can use fsolve from scipy.optimize to find the root of the derivative, i.e. the value of \(\lambda\) at which the slope of the log-likelihood is zero. In this case, the MLE for \(\lambda\) is 3.685, which matches the closed-form result \(\hat{\lambda} = \bar{Y}\): the sample mean of the patent counts.
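As a sanity check, we can compare the numerical MLE to the sample mean directly:

# For a simple Poisson model the MLE of lambda equals the sample mean of Y
print(f"Sample mean of patents: {Y.mean():.3f}")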

Estimation of Poisson Regression Model

Next, we extend our simple Poisson model to a Poisson regression model such that \(Y_i \sim \text{Poisson}(\lambda_i)\) where \(\lambda_i = \exp(X_i'\beta)\). The interpretation is that the rate of patent awards is not a single constant \(\lambda\) shared by all firms, but rather a function of firm characteristics \(X_i\). Specifically, we will use the covariates age, age squared, region, and whether the firm is a customer of Blueprinty.
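Substituting \(\lambda_i = e^{X_i'\beta}\) into the Poisson log-likelihood gives the objective we will maximize over \(\beta\):

\[
\ln L(\beta) = \sum_{i=1}^{N}\left(-e^{X_i'\beta} + Y_i\, X_i'\beta - \ln(Y_i!)\right)
\]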

We first update the log-likelihood function so that it takes a covariate matrix X as an additional argument and is parameterized by the vector \(\beta\) rather than a single \(\lambda\). Because \(\lambda\) must be positive, we choose the inverse link function g() to be exp(), so that \(\lambda_i = e^{X_i'\beta}\):

def poisson_regression_loglikelihood(beta, X, Y):
    # lambda_i = exp(X_i' beta) keeps each Poisson rate positive
    linear_predictor = np.dot(X, beta)
    lambda_value = np.exp(linear_predictor)
    log_likelihood = np.sum(-lambda_value + Y * np.log(lambda_value) - np.log(factorial(Y)))
    return log_likelihood

Next, we use this function together with scipy.optimize.minimize to find the MLE vector and the Hessian of the Poisson model with covariates. The first column of X is all 1's to allow for a constant term; the remaining columns are age, age squared, binary indicators for all but one of the regions (Midwest serves as the baseline), and the binary customer variable. Because scipy minimizes rather than maximizes, we work with the negative log-likelihood, and we use the inverse of its Hessian at the MLE to obtain standard errors for the \(\beta\) estimates, presented as a table of coefficients and standard errors.

from scipy.optimize import minimize

def poisson_regression_negloglikelihood(beta, X, Y):
    # scipy minimizes, so we use the NEGATIVE of the log-likelihood above
    linear_predictor = np.dot(X, beta)
    lambda_value = np.exp(linear_predictor)
    return np.sum(lambda_value - Y * linear_predictor + np.log(factorial(Y)))

def poisson_regression_gradient(beta, X, Y):
    # Gradient of the negative log-likelihood: X'(lambda - Y)
    lambda_value = np.exp(np.dot(X, beta))
    return np.dot(X.T, lambda_value - Y)

def poisson_regression_hessian(beta, X, Y):
    # Hessian of the negative log-likelihood: X' diag(lambda) X
    lambda_value = np.exp(np.dot(X, beta))
    return np.dot(X.T, lambda_value[:, np.newaxis] * X)

# Design matrix: constant, age, age squared, region dummies (Midwest is the baseline), customer flag
X_df = blueprinty[['age', 'dummy__Northeast', 'dummy__Northwest', 'dummy__South',
                   'dummy__Southwest', 'iscustomer']].astype(float).copy()
X_df.insert(0, 'constant', 1.0)
X_df.insert(2, 'age_squared', X_df['age'] ** 2)
Y = blueprinty['patents'].values
X = X_df.values

# Initial guess for the beta vector
initial_beta = np.zeros(X.shape[1])

# Minimize the negative log-likelihood using the analytic gradient and Hessian
result = minimize(poisson_regression_negloglikelihood, initial_beta, args=(X, Y),
                  jac=poisson_regression_gradient, hess=poisson_regression_hessian,
                  method='trust-constr')

# MLE estimates
beta_mle = result.x

# Standard errors from the inverse of the Hessian evaluated at the MLE
covariance_matrix = np.linalg.inv(poisson_regression_hessian(beta_mle, X, Y))
standard_errors = np.sqrt(np.diag(covariance_matrix))

# Table of coefficients and standard errors
coefficients_table = pd.DataFrame(
    {'Coefficient': beta_mle, 'Std. Error': standard_errors},
    index=X_df.columns)
print(coefficients_table)
As a cross-check on the Hessian-based standard errors, we can also run the same optimization with BFGS and use the approximate inverse Hessian that the optimizer returns. Because this is only an approximation, these standard errors may differ slightly from the analytic ones above.

# Alternative: BFGS with the optimizer's approximate inverse Hessian
result_bfgs = minimize(poisson_regression_negloglikelihood, initial_beta, args=(X, Y),
                       jac=poisson_regression_gradient, method='BFGS')

mle_vector = result_bfgs.x
se_beta = np.sqrt(np.diag(result_bfgs.hess_inv))

# Present a table of coefficients and standard errors
coefficients_table_bfgs = pd.DataFrame(
    {'Coefficient': mle_vector, 'Std. Error': se_beta},
    index=X_df.columns)
print(coefficients_table_bfgs)

Finally, we can check these results against Python's sm.GLM() function, which fits the same Poisson regression via iteratively reweighted least squares:

import statsmodels.api as sm

# Reuse the same design matrix (it already includes the constant column)
model = sm.GLM(blueprinty['patents'], X_df, family=sm.families.Poisson()).fit()
print(model.summary())
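Because the model uses a log link, the customer coefficient is easiest to read after exponentiating: \(e^{\beta_{\text{iscustomer}}}\) is the multiplicative effect of being a Blueprinty customer on the expected number of patents, holding the other covariates fixed. A quick sketch of that calculation (values not reported here):

# Multiplicative effect of Blueprinty customership on expected patent counts
print(np.exp(model.params['iscustomer']))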

AirBnB Case Study

Introduction

AirBnB is a popular platform for booking short-term rentals. In March 2017, students Annika Awad, Evan Lebo, and Anna Linden scraped data on roughly 40,000 Airbnb listings in New York City. The data include the following variables:

- `id` = unique ID number for each unit
- `last_scraped` = date when information scraped
- `host_since` = date when host first listed the unit on Airbnb
- `days` = `last_scraped` - `host_since` = number of days the unit has been listed
- `room_type` = Entire home/apt., Private room, or Shared room
- `bathrooms` = number of bathrooms
- `bedrooms` = number of bedrooms
- `price` = price per night (dollars)
- `number_of_reviews` = number of reviews for the unit on Airbnb
- `review_scores_cleanliness` = a cleanliness score from reviews (1-10)
- `review_scores_location` = a "quality of location" score from reviews (1-10)
- `review_scores_value` = a "quality of value" score from reviews (1-10)
- `instant_bookable` = "t" if instantly bookable, "f" if not

We assume that the number of reviews is a good proxy for the number of bookings. Our goal is to model the number of bookings, as proxied by the number of reviews a listing has received, as a function of the listing characteristics provided. To do so, we perform some exploratory data analysis to get a feel for the data, handle or drop observations with missing values on the relevant variables, build a Poisson regression model for the number of reviews, and interpret the model coefficients.

Data

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

airbnb = pd.read_csv("data/airbnb.csv")
airbnb.head()

print(airbnb.isnull().sum())
airbnb = airbnb[["id", "days", "last_scraped", "host_since", "room_type", "bathrooms", "bedrooms", "price", "number_of_reviews", "review_scores_cleanliness", "review_scores_location", "review_scores_value", "instant_bookable"]]

airbnb.info()

airbnb.dropna(inplace = True)
Unnamed: 0                       0
id                               0
days                             0
last_scraped                     0
host_since                      35
room_type                        0
bathrooms                      160
bedrooms                        76
price                            0
number_of_reviews                0
review_scores_cleanliness    10195
review_scores_location       10254
review_scores_value          10256
instant_bookable                 0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40628 entries, 0 to 40627
Data columns (total 13 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         40628 non-null  int64  
 1   days                       40628 non-null  int64  
 2   last_scraped               40628 non-null  object 
 3   host_since                 40593 non-null  object 
 4   room_type                  40628 non-null  object 
 5   bathrooms                  40468 non-null  float64
 6   bedrooms                   40552 non-null  float64
 7   price                      40628 non-null  int64  
 8   number_of_reviews          40628 non-null  int64  
 9   review_scores_cleanliness  30433 non-null  float64
 10  review_scores_location     30374 non-null  float64
 11  review_scores_value        30372 non-null  float64
 12  instant_bookable           40628 non-null  object 
dtypes: float64(5), int64(4), object(4)
memory usage: 4.0+ MB

We first conduct some exploratory data analysis to get a feel for the data. From airbnb.info() we see that the selected dataframe has 13 variables and 40,628 rows. There are missing values in several columns, including host_since, bathrooms, bedrooms, review_scores_cleanliness, review_scores_location, and review_scores_value.

First, we examine how listings are distributed across the room_type categories.

# Count of listings by room type
airbnb['room_type'].value_counts().plot(kind='bar', alpha=0.5)
plt.xlabel('Room Type')
plt.ylabel('Number of Listings')
plt.title('Distribution of Listings by Room Type')
plt.show()

As we can see, listings are overwhelmingly either private rooms or entire homes/apartments; shared rooms are comparatively rare on the platform.

airbnb_price = airbnb[['price']]
x_range = (0, 2200)
plt.hist(airbnb_price, range =x_range, bins=100, alpha = 0.5)

plt.xlabel('Price of Airbnb Listing')
plt.ylabel('Frequency')
plt.title('Histogram of Price')
plt.show()

The price of an Airbnb listing ranges from close to $0 up to about $2,000 per night, with the most common price around $100 per night. Note that price itself has no missing values; the missing data are concentrated in the review-score columns. Because we dropped rows with missing values above, roughly 30,000 listings remain for the Poisson regression below.

sns.pairplot(airbnb[['bathrooms', 'bedrooms', 'price', 'number_of_reviews']])
plt.show()

# Recode instant_bookable from "t"/"f" to 1/0
mapping = {"t": 1, "f": 0}
airbnb["instant_bookable"] = airbnb["instant_bookable"].map(mapping)

# want to build a Poisson regression model
X = airbnb[['bathrooms', 'bedrooms', 'price', "review_scores_cleanliness", "instant_bookable", "review_scores_value", "review_scores_location"]]
y = airbnb['number_of_reviews']

# Add constant for intercept
X = sm.add_constant(X)

# Fit Poisson regression model
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# Print model summary
print(model.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:      number_of_reviews   No. Observations:                30140
Model:                            GLM   Df Residuals:                    30132
Model Family:                 Poisson   Df Model:                            7
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:            -5.2905e+05
Date:                Wed, 29 May 2024   Deviance:                   9.3672e+05
Time:                        23:55:49   Pearson chi2:                 1.41e+06
No. Iterations:                     6   Pseudo R-squ. (CS):             0.5518
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                         3.5414      0.016    224.237      0.000       3.510       3.572
bathrooms                    -0.1286      0.004    -34.537      0.000      -0.136      -0.121
bedrooms                      0.0785      0.002     39.944      0.000       0.075       0.082
price                      1.059e-05   7.43e-06      1.425      0.154   -3.97e-06    2.51e-05
review_scores_cleanliness     0.1136      0.001     76.229      0.000       0.111       0.117
instant_bookable              0.3321      0.003    115.247      0.000       0.326       0.338
review_scores_value          -0.0916      0.002    -51.094      0.000      -0.095      -0.088
review_scores_location       -0.0754      0.002    -47.098      0.000      -0.079      -0.072
=============================================================================================
The output above is from a Poisson regression model that predicts the number of reviews, our proxy for the number of bookings, from the listing characteristics. Because the model uses a log link, each coefficient is the change in the log of the expected number of reviews for a one-unit change in that predictor, holding the other variables constant; exponentiating a coefficient gives its multiplicative effect. With that in mind:
  • Intercept (const): The intercept of 3.5414 is the log of the expected number of reviews when all predictors are zero, i.e. exp(3.5414) ≈ 34.5 reviews. This baseline is not practically meaningful on its own, but it anchors the other effects.
  • bathrooms: The coefficient is -0.1286, so each additional bathroom multiplies the expected number of reviews by exp(-0.1286) ≈ 0.88, roughly 12% fewer reviews, holding other variables constant. Listings with more bathrooms may be larger, pricier units that turn over less often.
  • bedrooms: The coefficient is 0.0785, so each additional bedroom multiplies the expected number of reviews by exp(0.0785) ≈ 1.08, roughly 8% more reviews, perhaps because larger listings accommodate more groups and families.
  • price: The coefficient is 1.059e-05 with a p-value of 0.154, above the usual 0.05 threshold. The effect of price on the number of reviews is not statistically significant in this model.
  • review_scores_cleanliness: The coefficient is 0.1136, so each one-point increase in the cleanliness score multiplies the expected number of reviews by exp(0.1136) ≈ 1.12, about 12% more reviews. Listings with higher cleanliness scores tend to receive more reviews.
  • instant_bookable: The coefficient is 0.3321, so instantly bookable listings are expected to receive exp(0.3321) ≈ 1.39 times as many reviews, roughly 39% more than listings that are not instantly bookable.
  • review_scores_value: The coefficient is -0.0916, so each one-point increase in the value score is associated with exp(-0.0916) ≈ 0.91 times as many reviews, about 9% fewer, holding other variables constant.
  • review_scores_location: The coefficient is -0.0754, so each one-point increase in the location score is associated with exp(-0.0754) ≈ 0.93 times as many reviews, about 7% fewer. Listings in better locations may be more expensive or booked for longer stays, leaving fewer separate bookings to review.
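To read all of these multiplicative effects at once, we can exponentiate the fitted coefficients into incidence-rate ratios; a minimal sketch using the fitted model object:

# Incidence-rate ratios: exponentiated Poisson coefficients
irr = np.exp(model.params)
print(irr.round(3))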

Overall, this model suggests that the number of bathrooms and bedrooms, the cleanliness, value, and location scores, and instant bookability are all significantly associated with the number of reviews a listing receives, while price is not statistically significant at the 5% level.