1 Airbnb Project - Athens

1.1 Introduction

Our group project focuses on the analysis of the Airbnb data from Athens AirBnB dataset. In our project we will perform Data Cleaning, Exploratory Data Analysis, Data visualizations and finally a Regression Analysis. The final output of all the analysis and the final regression model is a multivariate regression model that predicts the nightly cost per stay based on distance from center, number of accomodates and number of bathrooms necessary. We built an interactive dashboard, which allows you to adjust the settings should your preferences change.

1.2 Summary of key findings

The dataset from Airbnb contained a vast amount of information which allows the development of an extensive range of insights into the operations of Airbnb in Athens. We eliminated ~65 of the columns in the dataset (the ones which added no value to the analysis or were not related the regression model we were asked to set up). We also had to convert a number of the columns into workable data as many numeric values were stored as strings. Furthermore, we analyzed the data even further and adjusted unnecessary NA values and extreme outliers, which depressed the overall quality of the remaining dataset.

After building a workable dataset, the data was explored and tested many hypothesis which we discussed at the start of the project. In order to develop a better understand about relationship and the outcomes of our analysis we created appealing visualisations. We observed that there is a relationship between the average price of a night and the distance from the center. The property information was overlaid onto a map of Athens to provide a visual aid to understand the distribution of AirBnb properties and their prices throughout the city. Moreover, we experienced that there is no significant price difference between the room types, if you adjust the price with the number of accomodates it can carry. Finally, we tried to identify the major factors of the general rating of the Airbnbs. Interestingly, neither the price nor the response time of the host played a major role in this, however, we discovered a relationship with the “Superhost” status. Therefore, superhosts seem to have really strict requirements and the status reflects the overall good quality and experience of the stay.

In order to narrow down the variables to include in the regression models, we built a correlation heatmap (in which we included ~10 different variables). We found out that there were many variables highly correlated with the such as the distance from the center, the number of accomodates, and the number of bathrooms. We were quite surprised that the superhost status (although the status definitely impacts the rating), has nearly no relationship with the price, which can be achieved.

The regression model was developed through iteration and the final model’s independent variables were Distance (corrected with neighbourhoods), number of accomodates, Roomtype, Bathrooms and Review Score (as proxy for demand) which has a R^2 value of 0.59 and and mean squared prediction error of 0.167. Although some of these variables are intercorrelated with each other, we argue that it anyways is significant for the total price to include more of the factors, as it e.g. plays a crucial role how the structure of the apartments is. Therefore we included both the #accomodates and the #bathrooms.

Our final regression model was used to predict the price for 2 people to rent a property with within 1,500m in the center. As we are only two person, we didn’t care about the number of bathrooms and bedrooms. The predicted price was €42.24 with a 95% confidence interval from €39.50 to €45.27.

2 Initial data analysis

2.1 Cleaning the data

Based on our initial data analysis we identified 4 major types of variables in the underlying data set:

  1. Character values: 47
  2. Date values: 5
  3. Logical values: 15
  4. Numeric values: 39

We also have seen that we have 11,314 observations (apartments) & a total of 106 data points per apartment.

2.1.1 Reducing the dataset

We identified many variables that have a characteristic which make them either not interesting to analyze (e.g. only one/ very few distinct values or text strings) or that we think we will not use in the analysis later on. –> So we excluded these columns/ data points in order to make the data easier & faster to handle.

athens_data_red <- athens_data %>% 
    #Select the relevant variables
  select(
         id,
         neighbourhood,
         zipcode,
         latitude,
         longitude,
         property_type,
         room_type,
         accommodates,
         bathrooms,
         bedrooms,
         beds,
         price,
         weekly_price,
         monthly_price,
         security_deposit,
         cleaning_fee,
         guests_included,
         extra_people,
         minimum_nights,
         maximum_nights,
         availability_365,
         number_of_reviews_ltm,
         review_scores_rating,
         review_scores_checkin,
         review_scores_cleanliness,
         review_scores_accuracy,
         review_scores_communication,
         review_scores_location,
         review_scores_value,
         cancellation_policy,
         reviews_per_month,
         host = host_id, 
         host_response_time,
         host_response_rate,
         host_acceptance_rate,
         host_is_superhost,
         host_listings_count,
         host_total_listings_count,
         host_identity_verified,number_of_reviews,
         host_instant_booking =  instant_bookable
  )

We now only have 41 columns left, which make the data set easier to handle.

2.1.2 Adjust data values

In a next step we will adjust the type of some variables so that we can actually can work with the data more easily. * We transform the price, weekly price, monthly price, security deposit, cleaning fee, extra people, host response rate and host acceptance rate from character variables to numeric ones * We create factor variables for Property type, room types, cancellation policy and host response time

Transforming character values to numeric values

# Transform character values to numeric values
athens_data_clean <- athens_data_red %>% 
   mutate(
     price = as.numeric(str_remove_all(price, "[$ ,]")),
     weekly_price = as.numeric(str_remove_all(weekly_price, "[$ , ]")),
     monthly_price = as.numeric(str_remove_all(monthly_price, "[$ ,]")),
     cleaning_fee = as.numeric(str_remove_all(cleaning_fee, "[$ ,]")),
     security_deposit = as.numeric(str_remove_all(security_deposit, "[$ ,]")),
     extra_people = as.numeric(str_remove_all(extra_people, "[$ ,]")),
     host_response_rate = as.numeric(str_remove_all(cleaning_fee, "[% ,]")),
     host_acceptance_rate = as.numeric(str_remove_all(cleaning_fee, "[% ,]"))
     )

Transforming character values to factor values

# Create factor variables for room types 
room_types <- unique(athens_data_clean$room_type)
athens_data_clean$room_type <- factor(athens_data_clean$room_type, labels = room_types)

# Create factor variables for cancellation policies 
cancellation_policies <- unique(athens_data_clean$cancellation_policy)
athens_data_clean$cancellation_policy <- factor(athens_data_clean$cancellation_policy, labels = cancellation_policies)

# Create factor variables for host response time 
athens_data_clean <- athens_data_clean %>% 
  mutate(host_response_time = fct_relevel(host_response_time,
                                            "within an hour", 
                                            "within a few hours",
                                            "within a day",
                                            "a few days or more"
                                            ))

We identfied that there exists an issue with the creation of a factor variable for property types. There are too much categories in order to generate reasonable factors. So, we will analyze how much the share each category has. Best case would be that the majority of the property type share is done with a small number (20% of factors account 80% of the total share). If that is the case we can just summarize the rest in a new category calles “other”.

# Identify the amount of each property type
most_com_properties <- athens_data_clean %>%
    count(property_type) %>%
    mutate(percentage = n/sum(n)*100)%>%
    arrange(desc(n))

most_com_properties
## # A tibble: 26 x 3
##    property_type          n percentage
##    <chr>              <int>      <dbl>
##  1 Apartment           9677     85.5  
##  2 House                386      3.41 
##  3 Condominium          261      2.31 
##  4 Serviced apartment   187      1.65 
##  5 Loft                 180      1.59 
##  6 Aparthotel           139      1.23 
##  7 Hotel                135      1.19 
##  8 Boutique hotel       120      1.06 
##  9 Bed and breakfast     49      0.433
## 10 Hostel                38      0.336
## # … with 16 more rows

As the 5 most common property types account for ~95% of the total share we can just focus on them and summarize the rest in “Others”

# First we need to summarize the other values in the Category "Others"
athens_data_clean <- athens_data_clean %>% 
  mutate(
    property_type = case_when(
      property_type %in% c("Apartment","House", "Condominium","Serviced Apartment", "Loft") 
      ~ property_type, 
      TRUE ~ "Other"))
    

# In a next step we can make a factor out of the 6 pre-defined categories    
athens_data_clean <- athens_data_clean %>% 
  mutate(
     property_type = fct_relevel(property_type,
                                        "Apartment",
                                        "House",
                                        "Condominium",
                                        "Serviced Apartment",
                                        "Loft",
                                        "Other"))

We now have transformed the data types of most variables in order to make the data set easier to work with in the analysis part below. We have deleted unnecessary values, adjusted wrong variable types and now we will further inspect the quality of our data.

2.1.3 Readjust NA values

In a this step we will further manipulate the data set. In specific we will correct the NA values in cases in which we can estimate the value. * If no weekly price -> no discount -> we will insert the daily price multiplied by 7 * If no monthly price -> no discount -> we will insert the daily price multiplied by 30 * If no security deposit/ cleaning fee -> no fee -> we will insert 0

# We will replace the NAs in the weekly prices and assume there is no discount if NA
 athens_data_clean$weekly_price[is.na(athens_data_clean$weekly_price)] <- 
  athens_data_clean$price *7


# We will replace the NAs in the monthly prices and assume there is no discount if NA
 athens_data_clean$monthly_price[is.na(athens_data_clean$monthly_price)] <- 
  athens_data_clean$price * 30


# We will replace the NAs in the security deposit & cleaning fee and assume 0 if NA
 athens_data_clean$cleaning_fee[is.na(athens_data_clean$cleaning_fee)] <- 0
 athens_data_clean$security_deposit[is.na(athens_data_clean$security_deposit)] <- 0

We now also eliminated unnecessary NA values. The only thing we haven’t yet adjusted are potential outliers which will be captured in the next section.

2.1.4 Readjust outliers

We will screen the most important variable price, which we need in our analysis later on, for potential outliers. We will exclude the extreme values, which make no sense economically (way too high prices). Reasons which could explain these extremly high prices are unwillingness to list at the moment, fake listings or extremly luxurious apartments.

We will start with a quick plot to see if we have outliers in the price data

# Quick plot to see outliers
athens_data_clean %>% 
  ggplot(aes(x = price)) +
  geom_histogram() +
  labs(title= "Distribution of prices in our original data")

The distribution looks very scewed, we cannot identifiy anything. Probably a log-normal distribution is better suited -> We use log -> normal

# Looks very scewed, probably a log-normal distribution, use log -> normal
athens_data_clean %>% 
  ggplot(aes(x = log(price))) +
  geom_histogram()

Here seem to be a few outliers. We will remove them using the IQR method, becauses we belive that keeping those values would skew our analysis

# Removing outliers with IQR method
IQR.outliers <- function(x) {
  Q3 <- quantile(x,0.95)
  Q1 <- quantile(x,0.05)
  IQR <- (Q3-Q1)
  left <- (Q1-(1.5*IQR))
  right <- (Q3+(1.5*IQR))
  print(c(left, right))
  c(x[x <left],x[x>right])
}

# Print outliers
IQR.outliers(athens_data_clean$price)
##   5%  95% 
## -180  352
##  [1]  354  600  459  400  400  385 1000  515  410  640  412  502  650  400  400
## [16]  600  450 1000 1000 1000  495 1000  400  500  500  500  500  500  500  525
## [31]  404  800  700  500  402  450  450  540  360  810  487 7000 7000 7000 7000
## [46]  390  460  400  600  400  353  357  400  426  500 1500  900 1200  450  450
## [61]  800  400  990  600 1000  500 1000 5000  400  800  360 1000  390  500  500
## [76]  400  400  400  600 1290  999 1000  700  720  700  700 1000

We see that we will not exclude any of the low prices, but every price which is above 352

athens_data_clean %>% 
  filter(!(price %in% IQR.outliers(athens_data_clean$price))) %>% 
  ggplot(aes(x = log(price))) +
  geom_histogram()
##   5%  95% 
## -180  352

We can see that the graph now looks way more normally distributed than before. We believe that the dataset is now more representative.

#Defining our final data set, which has no more outliers
athens_data_final <- athens_data_clean %>% 
  filter(!(price %in% IQR.outliers(athens_data_clean$price)))
##   5%  95% 
## -180  352

We finally derived our final data set with which we can start with the analysis part of the project. * We reduced the relevant columns to 41 * We reduced the relevant data points (without outliers) to 11,227 * We readjusted many data types * We removed unnecessary NAs and increased the quality of the dataset * We analyzed outliers and removed them

2.2 Exploratory Data Analysis (EDA) & data visualisation

As we now have finally derived with a data set, which has only the relevant values, right variable types, adjusted NA values and is corrected for outliers, we can finally start with the analysis of the data.

2.2.1 Analysis on location

How important is the location for the price? Are central locations more expensive?

# First we start with a simple plot, showing our Airbnbs
qmplot(longitude, latitude, data = athens_data_final, color = price)

# Syntagma coordinates
syntagma <- c(37.975344, 23.73472)
names(syntagma) <- c("longitude", "latitude")

# Athene map
athens_map = get_map(location=c(23.68,
                                37.945,
                                23.8,
                                38.035), maptype="terrain-background")

athens_map <- ggmap(athens_map)

# We dont want to see the axis when we are ploting maps
map_theme <-  theme(axis.title.x=element_blank(),
                    axis.text.x=element_blank(),
                    axis.ticks.x=element_blank(),
                    axis.title.y=element_blank(),
                    axis.text.y=element_blank(),
                    axis.ticks.y=element_blank())

# Plot the map and Syntagma, is there a connection between prices and the centre?
athens_map +
  geom_point(data=athens_data_final, aes(x = longitude, y = latitude, color = price)) +
  geom_point(aes(x = syntagma['latitude'], syntagma['longitude']), 
             color = 'red', size = 5) +
  map_theme +
  labs(title="Airbnbs around the centre seem to be more expensive", 
       subtitle = "Centre - Syntagma Square")

According to the graph there seems to be a connection between the price and the distance to the center.

Next we are going to explore if based on location there are differences in room types. We expect to have center locations to have on average smaller offerings (e.g. shared rooms)

# Calculate the distance
athens_data_final<- athens_data_final %>% 
  rowwise() %>% 
  mutate(
    cent_dist = distm(c(latitude, longitude), c(37.975344, 23.73472), 
                      fun = distHaversine)[1,1]
  )

# First we calculate the average distance 
avg_dist <- athens_data_final %>% 
  group_by(neighbourhood) %>% 
  summarise(
    avg_dist = mean(cent_dist)
  ) %>% 
  arrange(-avg_dist)

# Now we create a graph showing how average distance impacts room type
athens_data_final %>% 
  filter(!is.na(neighbourhood)) %>% 
  select(neighbourhood,
         room_type) %>% 
  group_by(neighbourhood,
           room_type) %>% 
  summarise(n = n()) %>% 
  mutate(perc = n/sum(n)) %>% 
  ggplot(aes(fill=room_type, x=perc, y=factor(neighbourhood,levels = avg_dist$neighbourhood))) + 
    geom_bar(position="fill", stat="identity") +
  labs(title="Distance from the centre does not impact room type",
       subtitle = "Average distance in decreasing order") +
  ylab("") +
  xlab("") +
  guides(fill=guide_legend(title="Room types"))

We identified that there is no significance patterns visible. Our hypothesis that more central locations have a higher amount of shared rooms, private rooms than locations further away must therefore be invalid.

2.2.2 Analysis of rating

How are ratings general distributed. Is there a skew in the data, or are they nearly normal?

#First we want to see how the ratings are distributed in general
athens_data_final %>% 
  ggplot(aes(x=review_scores_rating)) +
  geom_histogram() +
  # Due to the high skew in distribution, a log y scale makes it easier to read
  scale_y_log10() +
  xlab("Review scores rating") +
  ylab("Quantity") +
  labs(title = "Most hosts seem to convince the tentants of their apartment", subtitle = "High negative skew in distribution")

We identified that the rating has a strong negative skew in the distribution. We therefore imply that most tenants have a overall positive experience of their stay and are more likely to give feedback if they have made positive experiences than negative ones.

Next we want to see the influence of the response time on the overall rating. We think that the response rate is a proxy for the general commitment of the host, which is quite important in our opinion.

# We want to see if the response time has an influence on the general rating of the apartment
# create a bar chart to see the review scores based on response time
athens_data_final %>% 
  filter(host_response_time != "N/A" & !is.na(host_response_time)) %>% 
  group_by(host_response_time) %>% 
  ggplot(aes(y=host_response_time, x=review_scores_rating)) +
  geom_boxplot() +
  xlim(85,100) +
  ylab("Host response time") +
  xlab("Review Scores rating") +
  labs(title = "Fast response time not valued enough to have impact on rating", subtitle = "The longer the response time the higher the median rating")

We identified that the response time has not a huge impact on the general rating of the Airbnb. We will now try to identify more significant factors. Let’s try to test if the price per bed influences the rating.

# We want to see if the price is a significant factor for the rating
# In order to reduce the bias in the data we will use the price per bed 

athens_data_final %>% 
  summarize(
    price_per_bed = price/beds,
    review_scores_rating
  ) %>% ggplot(aes(x=price_per_bed, y=review_scores_rating)) +
  geom_point() +
  scale_x_log10() +
  ylim(60,100) +
  xlab("Price per bed") +
  ylab("Review Score Rating") +
  labs(title = "No correlation between price per bed and review score", 
       subtitle = "Distribution of review scores and price per bed")

Once again we cannot identify a clear trend in the data. They seems to be no correlation between the price per bed and the average rating. We will give the analysis one last try and explore if the rating is influenced by the fact if the host is a superhost (which has many responsibilites compared to a normal host) or not.

# Analysis if superhost status has a positive impact on the rating
athens_data_final %>% 
  filter(!is.na(host_is_superhost)) %>% 
  ggplot(aes(x=host_is_superhost, y=review_scores_rating)) +
  geom_boxplot() +
  ylim(60,100) +
  xlab("Host is superhost?") +
  ylab("Review Score Rating") +
  labs(title = "Superhosts seem to make people happier during their stay", 
       subtitle = "Rating distribution based on Superhost criterion") 

Finally we found a relationship. In our eyes this makes completely sense - in order to receive a superhost rating you need to fulfill a lot of requirements (e.g. you are not allowed to cancel as soon as you have accepted hosts & you need to have specific response times etc.). Therefore, the superhost variable includes a lot of positive attributes, which kind of explains that people feel that stays in their apartments worked out particulariy well. Many of them also do this professionally and therefore value reputation a lot.

We were quite surprised that neither the price per bed nor the response time of the host (which we have seen as an indicator of the commitment from host side) played a major role in the overall rating. We came up with possible explainations. We think that the price has no impact as people book apartments based on their individual price preferences and then rate the stay according to their experiences. Therefore the price criterion is outweighted by other factors. Regarding the host response time, we concluded that this variable probably doesn’t reflect the commitment of the host in an ideal way. There are many more factors, which are not included - therefore the general impact of the response time is too low to see any impact.

2.2.3 Analysis of room type

We haven’t yet analyzed the room type. However, we have the hypothesis that the room type will impact the price which can be achieved with an apartment.

Is there a difference in price among room types?

# create a plot to show the density and distribution for the price grouped by each room type
athens_data_final %>% 
  ggplot(aes(x=price, y=room_type, fill=room_type)) +
  geom_violin( ) +
  # make differences more visible in relevant interval
  xlim(0,250) +
  # In order to make differences more visible
  scale_x_log10() +
  xlab("Price") +
  ylab("Density") +
  stat_summary(fun.y=median, geom="point", size=3, color="black") +
  labs(title = "Private rooms with highest median prices, closely followed by whole apartments",
       subtitle = "Distribution of price per room type") +
  theme(strip.text.x = element_text(size = 10), legend.position = "none")

First we were quite confused that private rooms are on average more expensive than the apartments. However after having a look of the quanitity of the room types we identified that apartments are way more common than shared rooms. As the overall data quantity is so little compared to apartments, it’s likely that outliers adjust the price upwards. It makes sense that shared rooms are really cheap, in the rante between 10 and 30 Euro per night.

We will now conduct the same analysis but adjust (like above) the price by the amount of persons the apartment can carry. We expect the results to be more equally distributed.

# create a  plot to show the density and distribution for the price per person grouped by each room type
athens_data_final %>% 
  ggplot(aes(x=price/accommodates, y=room_type, fill=room_type)) +
  geom_violin() +
  # make differences more visible in relevant interval
  xlim(0,250) +
  # In order to make differences more visible
  scale_x_log10() +
  xlab("Price") +
  ylab("Density") +
  stat_summary(fun.y=median, geom="point", size=3, color="black") +
  labs(title = "Differences in prices per person smaller between apartment types",
       subtitle = "Distribution of price per person per room type") +
  theme(strip.text.x = element_text(size = 10), legend.position = "none")

We saw that although the total room price for apartments is higher than the one for shared rooms & hotel rooms in total, the price difference is smaller if you account for the number of accomodates which can be fit in one apartment. Now the median price per person is nearly identical among these 3 categories. Private rooms are still an outlier, but we think it is due to the same reasoning as above.

2.3 Building our model

2.3.1 Correlation analyis

We will try to find the best fitting model to predict per night prices. Therefore the first step is to analyze potential regressions with the price to deduct the key drivers of this variable. We will start with a simple line chart showing the absolute correlation with the price variable.

athens_data_final %>% 
  na.omit() %>% 
  select_if(is.numeric) %>% 
  cor() %>% 
  as.data.frame() %>% 
  select(price) %>% 
  add_rownames(var = "variable") %>%
  arrange(price) %>% 
  ggplot(aes(x = price, y = reorder(variable, price))) +
  geom_col() +
  ylab("") +
  xlab("Correlation") +
  labs(title = "Distance from central is the most negative correlation",
       subtitle = "Correlations with price")

Next we also want to explore potential intercorrelations for the most promising variables. We therefore will cut the datapoints with a low/ no correlation and create a correlation heatmap for the other ones.

athens_data_final %>% 
  # reducing the dataset in order to make it more readable
  select(cent_dist, price, accommodates, bedrooms, bathrooms, host_is_superhost, beds, cleaning_fee) %>% 
  na.omit() %>% 
  cor() %>% 
  round(2) %>% 
  melt() %>% 
  mutate(
    # Renaming in order to make the graph more readable
    Var1 = case_when(
      Var1 == "cent_dist" ~ "Distance from centre",
      Var1 == "price" ~ "Price",
      Var1 == "accommodates" ~ "Accommodates",
      Var1 == "bedrooms" ~ "Number of bedrooms",
      Var1 == "bathrooms" ~ "Number of bathrooms",
      Var1 == "host_is_superhost" ~ "Superhost",
      Var1 == "beds" ~ "Number of beds",
      Var1 == "cleaning_fee" ~ "Cleaning fee"),
    Var2 = case_when(
      Var2 == "cent_dist" ~ "Distance from centre",
      Var2 == "price" ~ "Price",
      Var2 == "accommodates" ~ "Accommodates",
      Var2 == "bedrooms" ~ "Number of bedrooms",
      Var2 == "bathrooms" ~ "Number of bathrooms",
      Var2 == "host_is_superhost" ~ "Superhost",
      Var2 == "beds" ~ "Number of beds",
      Var2 == "cleaning_fee" ~ "Cleaning fee")
    )%>% 
  ggplot(aes(Var2, Var1, fill = value))+
  geom_tile(color = "white")+
  # adjust the colors 
  scale_fill_gradient2(low = lbs_blue, high = lbs_pink, mid = "white", 
                       midpoint = 0, limit = c(-1,1), space = "Lab", 
                       name="Correlation") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  coord_fixed() +
  labs(title="Bathrooms, bedrooms, accomedation \nand price all positive correlation",
       subtitle="Correlation accross variables") +
  xlab("") +
  ylab("")

As we can see there are many variables with have a high correlation with each other. Although some of these variables are intercorrelated, we argue that it anyways is significant for the total price to include more of the factors, as it e.g. plays a crucial role how the structure of the apartments is. It is nice if it can fit 10 accomodates, however if everybody needs to sleep in the same room, the price will probably be negatively adjusted. Especially if you travel with friends these factors play a role individually, why we shouldn’t drop them in the analysis.

2.3.2 Possible models

First we will split our data into a training and testing set

# Set seed so we will get the same results
set.seed(202019)

# Split our data 25% - 75% to train and test
size <- floor(0.75 * nrow(athens_data_final))
train_ind <- sample(seq_len(nrow(athens_data_final)), size = size)

train <- athens_data_final[train_ind, ]
test <- athens_data_final[-train_ind, ]

Then we run our first regression. To choose a model we will use Akaike’s information criterion, which tells us about the significance of our finding and lets us choose among different number of variables.

library(stats)

# Univariate regression

# Model 1

model1 <- lm(log(price) ~ as.factor(accommodates), 
             data=na.omit(train)) 
# Are airbnbs that accomodate 8 people necessarily 2 times as expensive? We do not think so, therefore we use factors instead.

summary(model1) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates), data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2346 -0.3189 -0.0343  0.2689  2.1660 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.9060     0.0594   48.89   <2e-16 ***
## as.factor(accommodates)2    0.6318     0.0608   10.40   <2e-16 ***
## as.factor(accommodates)3    0.7659     0.0618   12.40   <2e-16 ***
## as.factor(accommodates)4    0.9791     0.0605   16.20   <2e-16 ***
## as.factor(accommodates)5    1.0913     0.0629   17.34   <2e-16 ***
## as.factor(accommodates)6    1.2606     0.0623   20.22   <2e-16 ***
## as.factor(accommodates)7    1.3722     0.0737   18.61   <2e-16 ***
## as.factor(accommodates)8    1.5456     0.0708   21.82   <2e-16 ***
## as.factor(accommodates)9    1.7469     0.0995   17.56   <2e-16 ***
## as.factor(accommodates)10   1.6034     0.0969   16.54   <2e-16 ***
## as.factor(accommodates)11   1.4644     0.1681    8.71   <2e-16 ***
## as.factor(accommodates)12   1.8935     0.1168   16.21   <2e-16 ***
## as.factor(accommodates)13   1.1504     0.1880    6.12    1e-09 ***
## as.factor(accommodates)14   2.0423     0.1681   12.15   <2e-16 ***
## as.factor(accommodates)15   2.1075     0.2192    9.61   <2e-16 ***
## as.factor(accommodates)16   2.3186     0.1189   19.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.472 on 5645 degrees of freedom
## Multiple R-squared:  0.28,   Adjusted R-squared:  0.278 
## F-statistic:  147 on 15 and 5645 DF,  p-value: <2e-16
summary(model1)$r.squared # R2 0.247
## [1] 0.28
model1 %>% AIC() # 13241
## [1] 7579

Judging by the correlations we can predict which variables might have a bigger impact, now we will use how many people the airbnb accomodates and how many bedrooms there are

# Multivariate Regression
# Model 2

model2 <- lm(log(price) ~ as.factor(accommodates) + bedrooms, 
             data=na.omit(train))

summary(model2) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates) + bedrooms, 
##     data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9781 -0.3034 -0.0383  0.2529  2.1500 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.8133     0.0602   46.77  < 2e-16 ***
## as.factor(accommodates)2    0.6415     0.0604   10.62  < 2e-16 ***
## as.factor(accommodates)3    0.7636     0.0614   12.44  < 2e-16 ***
## as.factor(accommodates)4    0.9464     0.0602   15.71  < 2e-16 ***
## as.factor(accommodates)5    0.9970     0.0636   15.68  < 2e-16 ***
## as.factor(accommodates)6    1.1434     0.0636   17.99  < 2e-16 ***
## as.factor(accommodates)7    1.2280     0.0753   16.30  < 2e-16 ***
## as.factor(accommodates)8    1.3598     0.0739   18.39  < 2e-16 ***
## as.factor(accommodates)9    1.5171     0.1027   14.77  < 2e-16 ***
## as.factor(accommodates)10   1.3470     0.1012   13.31  < 2e-16 ***
## as.factor(accommodates)11   1.2490     0.1692    7.38  1.8e-13 ***
## as.factor(accommodates)12   1.6307     0.1204   13.54  < 2e-16 ***
## as.factor(accommodates)13   0.9178     0.1890    4.86  1.2e-06 ***
## as.factor(accommodates)14   1.7060     0.1720    9.92  < 2e-16 ***
## as.factor(accommodates)15   1.6854     0.2238    7.53  5.9e-14 ***
## as.factor(accommodates)16   1.9493     0.1264   15.43  < 2e-16 ***
## bedrooms                    0.0990     0.0120    8.26  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.469 on 5644 degrees of freedom
## Multiple R-squared:  0.289,  Adjusted R-squared:  0.287 
## F-statistic:  143 on 16 and 5644 DF,  p-value: <2e-16
summary(model2)$r.squared # R2 0.254
## [1] 0.289
model2 %>% AIC() # 13151
## [1] 7513

Both the r2 and the AIC is smaller with this model, which means that the this one would be preferred

model3 <- lm(log(price) ~ as.factor(accommodates) + cent_dist, data=na.omit(train))

summary(model3) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates) + cent_dist, 
##     data = na.omit(train))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.173 -0.293 -0.024  0.251  2.076 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.27e+00   5.75e-02   56.93  < 2e-16 ***
## as.factor(accommodates)2   5.79e-01   5.72e-02   10.13  < 2e-16 ***
## as.factor(accommodates)3   7.25e-01   5.81e-02   12.49  < 2e-16 ***
## as.factor(accommodates)4   9.27e-01   5.69e-02   16.30  < 2e-16 ***
## as.factor(accommodates)5   1.03e+00   5.92e-02   17.47  < 2e-16 ***
## as.factor(accommodates)6   1.19e+00   5.87e-02   20.22  < 2e-16 ***
## as.factor(accommodates)7   1.27e+00   6.94e-02   18.30  < 2e-16 ***
## as.factor(accommodates)8   1.47e+00   6.66e-02   22.10  < 2e-16 ***
## as.factor(accommodates)9   1.59e+00   9.37e-02   16.97  < 2e-16 ***
## as.factor(accommodates)10  1.53e+00   9.11e-02   16.78  < 2e-16 ***
## as.factor(accommodates)11  1.27e+00   1.58e-01    8.01  1.4e-15 ***
## as.factor(accommodates)12  1.85e+00   1.10e-01   16.87  < 2e-16 ***
## as.factor(accommodates)13  1.07e+00   1.77e-01    6.04  1.6e-09 ***
## as.factor(accommodates)14  1.96e+00   1.58e-01   12.40  < 2e-16 ***
## as.factor(accommodates)15  2.02e+00   2.06e-01    9.82  < 2e-16 ***
## as.factor(accommodates)16  2.23e+00   1.12e-01   19.96  < 2e-16 ***
## cent_dist                 -1.75e-04   6.43e-06  -27.30  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.444 on 5644 degrees of freedom
## Multiple R-squared:  0.364,  Adjusted R-squared:  0.362 
## F-statistic:  202 on 16 and 5644 DF,  p-value: <2e-16
summary(model3)$r.squared # R2 0.331
## [1] 0.364
model3 %>% AIC() # 12240
## [1] 6879

Our R2 is much better now, and our Akaike criterion also droped by quite a big margin. This is likely due to the fact, that the distance from the center is a big factor when people price airbnbs

# Model 4
model4 <- lm(log(price) ~ as.factor(accommodates) + cent_dist + room_type,
             data=na.omit(train))

summary(model4) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates) + cent_dist + 
##     room_type, data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9693 -0.2869 -0.0342  0.2397  2.0081 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.65e+00   6.02e-02   60.51  < 2e-16 ***
## as.factor(accommodates)2   2.68e-01   5.88e-02    4.57  5.0e-06 ***
## as.factor(accommodates)3   3.55e-01   6.07e-02    5.85  5.3e-09 ***
## as.factor(accommodates)4   5.52e-01   5.96e-02    9.26  < 2e-16 ***
## as.factor(accommodates)5   6.58e-01   6.18e-02   10.64  < 2e-16 ***
## as.factor(accommodates)6   8.11e-01   6.13e-02   13.24  < 2e-16 ***
## as.factor(accommodates)7   8.95e-01   7.12e-02   12.57  < 2e-16 ***
## as.factor(accommodates)8   1.10e+00   6.85e-02   16.07  < 2e-16 ***
## as.factor(accommodates)9   1.21e+00   9.40e-02   12.90  < 2e-16 ***
## as.factor(accommodates)10  1.15e+00   9.16e-02   12.51  < 2e-16 ***
## as.factor(accommodates)11  8.90e-01   1.56e-01    5.71  1.2e-08 ***
## as.factor(accommodates)12  1.51e+00   1.09e-01   13.90  < 2e-16 ***
## as.factor(accommodates)13  6.89e-01   1.74e-01    3.97  7.3e-05 ***
## as.factor(accommodates)14  1.58e+00   1.56e-01   10.16  < 2e-16 ***
## as.factor(accommodates)15  1.64e+00   2.02e-01    8.14  4.7e-16 ***
## as.factor(accommodates)16  1.91e+00   1.11e-01   17.24  < 2e-16 ***
## cent_dist                 -1.73e-04   6.27e-06  -27.53  < 2e-16 ***
## room_typePrivate room      1.67e-01   5.50e-02    3.03   0.0025 ** 
## room_typeHotel room       -3.94e-01   2.64e-02  -14.93  < 2e-16 ***
## room_typeShared room      -8.53e-01   8.53e-02  -10.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.432 on 5641 degrees of freedom
## Multiple R-squared:  0.397,  Adjusted R-squared:  0.395 
## F-statistic:  196 on 19 and 5641 DF,  p-value: <2e-16
summary(model4)$r.squared #R2 0.362
## [1] 0.397
model4 %>% AIC() # 11846
## [1] 6582

Room types will impact prices, as people would pay a premium for better acommendation, threfore with the room types we could improve our model also.

# Model 5
model5 <- lm(log(price) ~ as.factor(accommodates) + room_type + bedrooms + bathrooms  + cent_dist, 
             data=na.omit(train))

summary(model5) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates) + room_type + 
##     bedrooms + bathrooms + cent_dist, data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6652 -0.2814 -0.0249  0.2390  1.9998 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.33e+00   6.14e-02   54.31  < 2e-16 ***
## as.factor(accommodates)2   2.95e-01   5.72e-02    5.16  2.5e-07 ***
## as.factor(accommodates)3   3.66e-01   5.91e-02    6.19  6.5e-10 ***
## as.factor(accommodates)4   5.26e-01   5.82e-02    9.03  < 2e-16 ***
## as.factor(accommodates)5   5.54e-01   6.12e-02    9.06  < 2e-16 ***
## as.factor(accommodates)6   6.50e-01   6.12e-02   10.62  < 2e-16 ***
## as.factor(accommodates)7   6.65e-01   7.13e-02    9.33  < 2e-16 ***
## as.factor(accommodates)8   7.87e-01   7.01e-02   11.23  < 2e-16 ***
## as.factor(accommodates)9   8.11e-01   9.53e-02    8.51  < 2e-16 ***
## as.factor(accommodates)10  6.86e-01   9.40e-02    7.30  3.4e-13 ***
## as.factor(accommodates)11  5.73e-01   1.54e-01    3.73  0.00019 ***
## as.factor(accommodates)12  1.02e+00   1.10e-01    9.24  < 2e-16 ***
## as.factor(accommodates)13  3.00e-01   1.71e-01    1.75  0.07962 .  
## as.factor(accommodates)14  9.07e-01   1.57e-01    5.78  7.8e-09 ***
## as.factor(accommodates)15  9.22e-01   2.02e-01    4.55  5.4e-06 ***
## as.factor(accommodates)16  1.18e+00   1.17e-01   10.14  < 2e-16 ***
## room_typePrivate room      1.02e-01   5.38e-02    1.90  0.05695 .  
## room_typeHotel room       -4.50e-01   2.59e-02  -17.38  < 2e-16 ***
## room_typeShared room      -9.15e-01   8.32e-02  -11.00  < 2e-16 ***
## bedrooms                   8.98e-02   1.12e-02    8.00  1.5e-15 ***
## bathrooms                  2.02e-01   1.55e-02   13.01  < 2e-16 ***
## cent_dist                 -1.72e-04   6.14e-06  -27.95  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.42 on 5639 degrees of freedom
## Multiple R-squared:  0.429,  Adjusted R-squared:  0.427 
## F-statistic:  202 on 21 and 5639 DF,  p-value: <2e-16
summary(model5)$r.squared # R2 0.379
## [1] 0.429
model5 %>% AIC() # 11616
## [1] 6278

In the next model we try to implement our distance variable, and more information about the flats. Although our model has higher R2 and AIC, it did not have a big effect.

# Model 6
model6 <- lm(log(price) ~ as.factor(accommodates) + room_type + bedrooms + bathrooms  + cent_dist + as.factor(neighbourhood) * cent_dist, 
             data=na.omit(train))

summary(model6) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates) + room_type + 
##     bedrooms + bathrooms + cent_dist + as.factor(neighbourhood) * 
##     cent_dist, data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5906 -0.2475 -0.0191  0.2165  2.0084 
## 
## Coefficients: (1 not defined because of singularities)
##                                                      Estimate Std. Error
## (Intercept)                                          1.95e+00   7.08e-01
## as.factor(accommodates)2                             2.27e-01   5.41e-02
## as.factor(accommodates)3                             3.04e-01   5.58e-02
## as.factor(accommodates)4                             4.45e-01   5.51e-02
## as.factor(accommodates)5                             4.79e-01   5.79e-02
## as.factor(accommodates)6                             5.55e-01   5.79e-02
## as.factor(accommodates)7                             5.60e-01   6.73e-02
## as.factor(accommodates)8                             7.11e-01   6.61e-02
## as.factor(accommodates)9                             7.15e-01   8.97e-02
## as.factor(accommodates)10                            6.02e-01   8.86e-02
## as.factor(accommodates)11                            4.84e-01   1.44e-01
## as.factor(accommodates)12                            8.84e-01   1.04e-01
## as.factor(accommodates)13                            2.10e-01   1.61e-01
## as.factor(accommodates)14                            8.26e-01   1.48e-01
## as.factor(accommodates)15                            9.08e-01   1.90e-01
## as.factor(accommodates)16                            1.05e+00   1.10e-01
## room_typePrivate room                                1.03e-01   5.16e-02
## room_typeHotel room                                 -3.97e-01   2.46e-02
## room_typeShared room                                -1.00e+00   7.99e-02
## bedrooms                                             1.10e-01   1.07e-02
## bathrooms                                            1.83e-01   1.47e-02
## cent_dist                                            1.91e-04   1.73e-04
## as.factor(neighbourhood)Agios Nikolaos               5.78e-01   8.57e-01
## as.factor(neighbourhood)Akadimia Platonos            2.76e+00   9.21e-01
## as.factor(neighbourhood)Ambelokipi                   9.96e-01   7.15e-01
## as.factor(neighbourhood)Attiki                       1.10e+00   7.75e-01
## as.factor(neighbourhood)Exarcheia                    1.00e+00   7.18e-01
## as.factor(neighbourhood)Gazi                         1.32e-01   1.07e+00
## as.factor(neighbourhood)Goudi                       -1.46e+00   2.90e+00
## as.factor(neighbourhood)Ilisia                       1.48e+00   7.96e-01
## as.factor(neighbourhood)Kerameikos                   1.53e+00   7.26e-01
## as.factor(neighbourhood)Kolonaki                     1.38e+00   7.08e-01
## as.factor(neighbourhood)Kolonos                      1.42e+00   8.14e-01
## as.factor(neighbourhood)Koukaki                      1.61e+00   7.11e-01
## as.factor(neighbourhood)Kypseli                      6.11e-01   7.61e-01
## as.factor(neighbourhood)Larissis                     9.45e-01   7.25e-01
## as.factor(neighbourhood)Metaxourgeio                 7.19e-01   7.21e-01
## as.factor(neighbourhood)Mets                         1.59e+00   7.19e-01
## as.factor(neighbourhood)Monastiraki                  1.31e+00   8.99e-01
## as.factor(neighbourhood)Neapoli                      9.11e-01   7.30e-01
## as.factor(neighbourhood)Neos Kosmos                  1.58e+00   7.09e-01
## as.factor(neighbourhood)Pangrati                     1.21e+00   7.12e-01
## as.factor(neighbourhood)Patisia                     -1.15e-01   8.13e-01
## as.factor(neighbourhood)Pedion Areos                 1.70e+00   7.63e-01
## as.factor(neighbourhood)Petralona                    1.47e+00   7.56e-01
## as.factor(neighbourhood)Plaka                        1.59e+00   7.07e-01
## as.factor(neighbourhood)Profitis Daniil              2.21e+00   1.01e+00
## as.factor(neighbourhood)Psyri                        8.93e-01   7.27e-01
## as.factor(neighbourhood)Rizoupoli                   -5.53e-01   2.65e+00
## as.factor(neighbourhood)Rouf                        -1.44e-02   4.14e-01
## as.factor(neighbourhood)Sepolia                      2.85e+00   2.66e+00
## as.factor(neighbourhood)Thiseio                      2.27e+00   7.74e-01
## as.factor(neighbourhood)Votanikos                    8.60e-01   1.47e+00
## cent_dist:as.factor(neighbourhood)Agios Nikolaos    -1.30e-04   2.33e-04
## cent_dist:as.factor(neighbourhood)Akadimia Platonos -7.26e-04   2.59e-04
## cent_dist:as.factor(neighbourhood)Ambelokipi        -1.87e-04   1.76e-04
## cent_dist:as.factor(neighbourhood)Attiki            -3.05e-04   2.14e-04
## cent_dist:as.factor(neighbourhood)Exarcheia         -2.05e-04   2.00e-04
## cent_dist:as.factor(neighbourhood)Gazi               1.98e-04   3.53e-04
## cent_dist:as.factor(neighbourhood)Goudi              4.76e-04   7.45e-04
## cent_dist:as.factor(neighbourhood)Ilisia            -3.50e-04   2.36e-04
## cent_dist:as.factor(neighbourhood)Kerameikos        -3.26e-04   1.87e-04
## cent_dist:as.factor(neighbourhood)Kolonaki          -2.27e-04   1.80e-04
## cent_dist:as.factor(neighbourhood)Kolonos           -3.69e-04   2.27e-04
## cent_dist:as.factor(neighbourhood)Koukaki           -4.52e-04   1.79e-04
## cent_dist:as.factor(neighbourhood)Kypseli           -1.20e-04   1.99e-04
## cent_dist:as.factor(neighbourhood)Larissis          -2.06e-04   1.93e-04
## cent_dist:as.factor(neighbourhood)Metaxourgeio       3.51e-05   1.89e-04
## cent_dist:as.factor(neighbourhood)Mets              -5.44e-04   2.03e-04
## cent_dist:as.factor(neighbourhood)Monastiraki       -2.49e-06   5.30e-04
## cent_dist:as.factor(neighbourhood)Neapoli           -6.33e-05   2.12e-04
## cent_dist:as.factor(neighbourhood)Neos Kosmos       -4.75e-04   1.76e-04
## cent_dist:as.factor(neighbourhood)Pangrati          -3.02e-04   1.82e-04
## cent_dist:as.factor(neighbourhood)Patisia            4.67e-05   1.99e-04
## cent_dist:as.factor(neighbourhood)Pedion Areos      -5.38e-04   2.15e-04
## cent_dist:as.factor(neighbourhood)Petralona         -3.41e-04   1.96e-04
## cent_dist:as.factor(neighbourhood)Plaka             -2.80e-04   1.82e-04
## cent_dist:as.factor(neighbourhood)Profitis Daniil   -4.10e-04   2.99e-04
## cent_dist:as.factor(neighbourhood)Psyri              1.47e-04   2.18e-04
## cent_dist:as.factor(neighbourhood)Rizoupoli          7.87e-05   5.21e-04
## cent_dist:as.factor(neighbourhood)Rouf                     NA         NA
## cent_dist:as.factor(neighbourhood)Sepolia           -7.68e-04   7.35e-04
## cent_dist:as.factor(neighbourhood)Thiseio           -6.74e-04   2.30e-04
## cent_dist:as.factor(neighbourhood)Votanikos         -7.58e-05   4.72e-04
##                                                     t value Pr(>|t|)    
## (Intercept)                                            2.76  0.00589 ** 
## as.factor(accommodates)2                               4.20  2.8e-05 ***
## as.factor(accommodates)3                               5.44  5.5e-08 ***
## as.factor(accommodates)4                               8.09  7.5e-16 ***
## as.factor(accommodates)5                               8.27  < 2e-16 ***
## as.factor(accommodates)6                               9.58  < 2e-16 ***
## as.factor(accommodates)7                               8.33  < 2e-16 ***
## as.factor(accommodates)8                              10.75  < 2e-16 ***
## as.factor(accommodates)9                               7.96  2.0e-15 ***
## as.factor(accommodates)10                              6.79  1.2e-11 ***
## as.factor(accommodates)11                              3.35  0.00081 ***
## as.factor(accommodates)12                              8.49  < 2e-16 ***
## as.factor(accommodates)13                              1.30  0.19200    
## as.factor(accommodates)14                              5.60  2.3e-08 ***
## as.factor(accommodates)15                              4.78  1.8e-06 ***
## as.factor(accommodates)16                              9.49  < 2e-16 ***
## room_typePrivate room                                  2.00  0.04546 *  
## room_typeHotel room                                  -16.16  < 2e-16 ***
## room_typeShared room                                 -12.56  < 2e-16 ***
## bedrooms                                              10.29  < 2e-16 ***
## bathrooms                                             12.46  < 2e-16 ***
## cent_dist                                              1.10  0.26997    
## as.factor(neighbourhood)Agios Nikolaos                 0.67  0.49995    
## as.factor(neighbourhood)Akadimia Platonos              3.00  0.00275 ** 
## as.factor(neighbourhood)Ambelokipi                     1.39  0.16346    
## as.factor(neighbourhood)Attiki                         1.42  0.15687    
## as.factor(neighbourhood)Exarcheia                      1.40  0.16200    
## as.factor(neighbourhood)Gazi                           0.12  0.90213    
## as.factor(neighbourhood)Goudi                         -0.50  0.61443    
## as.factor(neighbourhood)Ilisia                         1.86  0.06244 .  
## as.factor(neighbourhood)Kerameikos                     2.11  0.03468 *  
## as.factor(neighbourhood)Kolonaki                       1.95  0.05139 .  
## as.factor(neighbourhood)Kolonos                        1.74  0.08200 .  
## as.factor(neighbourhood)Koukaki                        2.26  0.02391 *  
## as.factor(neighbourhood)Kypseli                        0.80  0.42216    
## as.factor(neighbourhood)Larissis                       1.30  0.19240    
## as.factor(neighbourhood)Metaxourgeio                   1.00  0.31827    
## as.factor(neighbourhood)Mets                           2.21  0.02716 *  
## as.factor(neighbourhood)Monastiraki                    1.46  0.14380    
## as.factor(neighbourhood)Neapoli                        1.25  0.21156    
## as.factor(neighbourhood)Neos Kosmos                    2.23  0.02611 *  
## as.factor(neighbourhood)Pangrati                       1.70  0.08825 .  
## as.factor(neighbourhood)Patisia                       -0.14  0.88801    
## as.factor(neighbourhood)Pedion Areos                   2.23  0.02578 *  
## as.factor(neighbourhood)Petralona                      1.95  0.05165 .  
## as.factor(neighbourhood)Plaka                          2.25  0.02441 *  
## as.factor(neighbourhood)Profitis Daniil                2.19  0.02863 *  
## as.factor(neighbourhood)Psyri                          1.23  0.21909    
## as.factor(neighbourhood)Rizoupoli                     -0.21  0.83448    
## as.factor(neighbourhood)Rouf                          -0.03  0.97219    
## as.factor(neighbourhood)Sepolia                        1.07  0.28411    
## as.factor(neighbourhood)Thiseio                        2.93  0.00337 ** 
## as.factor(neighbourhood)Votanikos                      0.58  0.55986    
## cent_dist:as.factor(neighbourhood)Agios Nikolaos      -0.56  0.57543    
## cent_dist:as.factor(neighbourhood)Akadimia Platonos   -2.80  0.00515 ** 
## cent_dist:as.factor(neighbourhood)Ambelokipi          -1.07  0.28679    
## cent_dist:as.factor(neighbourhood)Attiki              -1.43  0.15379    
## cent_dist:as.factor(neighbourhood)Exarcheia           -1.03  0.30514    
## cent_dist:as.factor(neighbourhood)Gazi                 0.56  0.57380    
## cent_dist:as.factor(neighbourhood)Goudi                0.64  0.52324    
## cent_dist:as.factor(neighbourhood)Ilisia              -1.49  0.13749    
## cent_dist:as.factor(neighbourhood)Kerameikos          -1.74  0.08201 .  
## cent_dist:as.factor(neighbourhood)Kolonaki            -1.26  0.20687    
## cent_dist:as.factor(neighbourhood)Kolonos             -1.63  0.10391    
## cent_dist:as.factor(neighbourhood)Koukaki             -2.53  0.01149 *  
## cent_dist:as.factor(neighbourhood)Kypseli             -0.60  0.54823    
## cent_dist:as.factor(neighbourhood)Larissis            -1.07  0.28465    
## cent_dist:as.factor(neighbourhood)Metaxourgeio         0.19  0.85282    
## cent_dist:as.factor(neighbourhood)Mets                -2.68  0.00747 ** 
## cent_dist:as.factor(neighbourhood)Monastiraki          0.00  0.99626    
## cent_dist:as.factor(neighbourhood)Neapoli             -0.30  0.76545    
## cent_dist:as.factor(neighbourhood)Neos Kosmos         -2.69  0.00712 ** 
## cent_dist:as.factor(neighbourhood)Pangrati            -1.66  0.09645 .  
## cent_dist:as.factor(neighbourhood)Patisia              0.23  0.81454    
## cent_dist:as.factor(neighbourhood)Pedion Areos        -2.50  0.01257 *  
## cent_dist:as.factor(neighbourhood)Petralona           -1.74  0.08142 .  
## cent_dist:as.factor(neighbourhood)Plaka               -1.54  0.12345    
## cent_dist:as.factor(neighbourhood)Profitis Daniil     -1.37  0.16994    
## cent_dist:as.factor(neighbourhood)Psyri                0.68  0.49881    
## cent_dist:as.factor(neighbourhood)Rizoupoli            0.15  0.87980    
## cent_dist:as.factor(neighbourhood)Rouf                   NA       NA    
## cent_dist:as.factor(neighbourhood)Sepolia             -1.04  0.29613    
## cent_dist:as.factor(neighbourhood)Thiseio             -2.92  0.00347 ** 
## cent_dist:as.factor(neighbourhood)Votanikos           -0.16  0.87249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.393 on 5578 degrees of freedom
## Multiple R-squared:  0.507,  Adjusted R-squared:  0.499 
## F-statistic: 69.9 on 82 and 5578 DF,  p-value: <2e-16
summary(model6)$r.squared # R2 0.507
## [1] 0.507
model6 %>% AIC() # 5575
## [1] 5575

With the interaction between the distance and neigbourhood we achived our biggest improvement yet. Distance is important during flat hunting, but the neighbourhood also plays a huge role.

# Model 7
model7 <- lm(log(price) ~ as.factor(accommodates) + room_type + bathrooms  +
               as.factor(neighbourhood) * cent_dist + 
               review_scores_rating * reviews_per_month, 
             data=na.omit(train))

summary(model7) 
## 
## Call:
## lm(formula = log(price) ~ as.factor(accommodates) + room_type + 
##     bathrooms + as.factor(neighbourhood) * cent_dist + review_scores_rating * 
##     reviews_per_month, data = na.omit(train))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6798 -0.2427 -0.0203  0.2154  1.8630 
## 
## Coefficients: (1 not defined because of singularities)
##                                                      Estimate Std. Error
## (Intercept)                                          1.71e+00   6.90e-01
## as.factor(accommodates)2                             2.57e-01   5.22e-02
## as.factor(accommodates)3                             3.25e-01   5.39e-02
## as.factor(accommodates)4                             4.86e-01   5.30e-02
## as.factor(accommodates)5                             5.80e-01   5.49e-02
## as.factor(accommodates)6                             6.81e-01   5.46e-02
## as.factor(accommodates)7                             7.47e-01   6.34e-02
## as.factor(accommodates)8                             8.86e-01   6.15e-02
## as.factor(accommodates)9                             9.82e-01   8.41e-02
## as.factor(accommodates)10                            8.86e-01   8.24e-02
## as.factor(accommodates)11                            8.35e-01   1.38e-01
## as.factor(accommodates)12                            1.09e+00   9.78e-02
## as.factor(accommodates)13                            4.92e-01   1.54e-01
## as.factor(accommodates)14                            1.13e+00   1.40e-01
## as.factor(accommodates)15                            1.31e+00   1.79e-01
## as.factor(accommodates)16                            1.34e+00   1.01e-01
## room_typePrivate room                                9.23e-02   5.00e-02
## room_typeHotel room                                 -3.99e-01   2.38e-02
## room_typeShared room                                -1.04e+00   7.72e-02
## bathrooms                                            2.22e-01   1.36e-02
## as.factor(neighbourhood)Agios Nikolaos               4.75e-01   8.28e-01
## as.factor(neighbourhood)Akadimia Platonos            3.28e+00   8.89e-01
## as.factor(neighbourhood)Ambelokipi                   1.14e+00   6.90e-01
## as.factor(neighbourhood)Attiki                       1.17e+00   7.49e-01
## as.factor(neighbourhood)Exarcheia                    1.21e+00   6.93e-01
## as.factor(neighbourhood)Gazi                         2.37e-01   1.03e+00
## as.factor(neighbourhood)Goudi                       -1.70e+00   2.81e+00
## as.factor(neighbourhood)Ilisia                       1.68e+00   7.69e-01
## as.factor(neighbourhood)Kerameikos                   1.81e+00   7.01e-01
## as.factor(neighbourhood)Kolonaki                     1.60e+00   6.84e-01
## as.factor(neighbourhood)Kolonos                      1.66e+00   7.86e-01
## as.factor(neighbourhood)Koukaki                      1.84e+00   6.87e-01
## as.factor(neighbourhood)Kypseli                      8.85e-01   7.35e-01
## as.factor(neighbourhood)Larissis                     1.25e+00   7.00e-01
## as.factor(neighbourhood)Metaxourgeio                 1.04e+00   6.96e-01
## as.factor(neighbourhood)Mets                         1.81e+00   6.94e-01
## as.factor(neighbourhood)Monastiraki                  1.95e+00   8.69e-01
## as.factor(neighbourhood)Neapoli                      1.07e+00   7.05e-01
## as.factor(neighbourhood)Neos Kosmos                  1.77e+00   6.85e-01
## as.factor(neighbourhood)Pangrati                     1.38e+00   6.87e-01
## as.factor(neighbourhood)Patisia                      9.61e-02   7.85e-01
## as.factor(neighbourhood)Pedion Areos                 1.88e+00   7.37e-01
## as.factor(neighbourhood)Petralona                    1.65e+00   7.30e-01
## as.factor(neighbourhood)Plaka                        1.77e+00   6.83e-01
## as.factor(neighbourhood)Profitis Daniil              2.34e+00   9.74e-01
## as.factor(neighbourhood)Psyri                        9.94e-01   7.02e-01
## as.factor(neighbourhood)Rizoupoli                   -3.99e-01   2.56e+00
## as.factor(neighbourhood)Rouf                        -6.89e-02   4.00e-01
## as.factor(neighbourhood)Sepolia                      3.36e+00   2.57e+00
## as.factor(neighbourhood)Thiseio                      2.49e+00   7.48e-01
## as.factor(neighbourhood)Votanikos                    9.43e-01   1.42e+00
## cent_dist                                            2.24e-04   1.67e-04
## review_scores_rating                                 2.09e-03   8.80e-04
## reviews_per_month                                   -1.03e+00   7.53e-02
## as.factor(neighbourhood)Agios Nikolaos:cent_dist    -6.90e-05   2.25e-04
## as.factor(neighbourhood)Akadimia Platonos:cent_dist -8.66e-04   2.51e-04
## as.factor(neighbourhood)Ambelokipi:cent_dist        -2.19e-04   1.70e-04
## as.factor(neighbourhood)Attiki:cent_dist            -2.95e-04   2.06e-04
## as.factor(neighbourhood)Exarcheia:cent_dist         -2.53e-04   1.93e-04
## as.factor(neighbourhood)Gazi:cent_dist               1.81e-04   3.41e-04
## as.factor(neighbourhood)Goudi:cent_dist              5.34e-04   7.20e-04
## as.factor(neighbourhood)Ilisia:cent_dist            -4.06e-04   2.28e-04
## as.factor(neighbourhood)Kerameikos:cent_dist        -3.97e-04   1.81e-04
## as.factor(neighbourhood)Kolonaki:cent_dist          -3.22e-04   1.74e-04
## as.factor(neighbourhood)Kolonos:cent_dist           -4.17e-04   2.19e-04
## as.factor(neighbourhood)Koukaki:cent_dist           -5.06e-04   1.73e-04
## as.factor(neighbourhood)Kypseli:cent_dist           -1.97e-04   1.93e-04
## as.factor(neighbourhood)Larissis:cent_dist          -2.94e-04   1.86e-04
## as.factor(neighbourhood)Metaxourgeio:cent_dist      -5.47e-05   1.83e-04
## as.factor(neighbourhood)Mets:cent_dist              -6.18e-04   1.97e-04
## as.factor(neighbourhood)Monastiraki:cent_dist       -4.16e-04   5.12e-04
## as.factor(neighbourhood)Neapoli:cent_dist           -9.39e-05   2.05e-04
## as.factor(neighbourhood)Neos Kosmos:cent_dist       -5.22e-04   1.70e-04
## as.factor(neighbourhood)Pangrati:cent_dist          -3.47e-04   1.75e-04
## as.factor(neighbourhood)Patisia:cent_dist           -3.07e-07   1.93e-04
## as.factor(neighbourhood)Pedion Areos:cent_dist      -5.83e-04   2.08e-04
## as.factor(neighbourhood)Petralona:cent_dist         -3.80e-04   1.89e-04
## as.factor(neighbourhood)Plaka:cent_dist             -3.22e-04   1.75e-04
## as.factor(neighbourhood)Profitis Daniil:cent_dist   -4.36e-04   2.89e-04
## as.factor(neighbourhood)Psyri:cent_dist              2.00e-04   2.10e-04
## as.factor(neighbourhood)Rizoupoli:cent_dist          4.21e-05   5.03e-04
## as.factor(neighbourhood)Rouf:cent_dist                     NA         NA
## as.factor(neighbourhood)Sepolia:cent_dist           -9.06e-04   7.10e-04
## as.factor(neighbourhood)Thiseio:cent_dist           -7.40e-04   2.23e-04
## as.factor(neighbourhood)Votanikos:cent_dist         -8.10e-05   4.56e-04
## review_scores_rating:reviews_per_month               1.01e-02   7.80e-04
##                                                     t value Pr(>|t|)    
## (Intercept)                                            2.48  0.01309 *  
## as.factor(accommodates)2                               4.91  9.3e-07 ***
## as.factor(accommodates)3                               6.04  1.7e-09 ***
## as.factor(accommodates)4                               9.17  < 2e-16 ***
## as.factor(accommodates)5                              10.55  < 2e-16 ***
## as.factor(accommodates)6                              12.48  < 2e-16 ***
## as.factor(accommodates)7                              11.78  < 2e-16 ***
## as.factor(accommodates)8                              14.41  < 2e-16 ***
## as.factor(accommodates)9                              11.68  < 2e-16 ***
## as.factor(accommodates)10                             10.75  < 2e-16 ***
## as.factor(accommodates)11                              6.05  1.5e-09 ***
## as.factor(accommodates)12                             11.20  < 2e-16 ***
## as.factor(accommodates)13                              3.20  0.00138 ** 
## as.factor(accommodates)14                              8.07  8.8e-16 ***
## as.factor(accommodates)15                              7.30  3.2e-13 ***
## as.factor(accommodates)16                             13.17  < 2e-16 ***
## room_typePrivate room                                  1.85  0.06481 .  
## room_typeHotel room                                  -16.75  < 2e-16 ***
## room_typeShared room                                 -13.52  < 2e-16 ***
## bathrooms                                             16.29  < 2e-16 ***
## as.factor(neighbourhood)Agios Nikolaos                 0.57  0.56596    
## as.factor(neighbourhood)Akadimia Platonos              3.69  0.00022 ***
## as.factor(neighbourhood)Ambelokipi                     1.66  0.09773 .  
## as.factor(neighbourhood)Attiki                         1.57  0.11715    
## as.factor(neighbourhood)Exarcheia                      1.75  0.08065 .  
## as.factor(neighbourhood)Gazi                           0.23  0.81895    
## as.factor(neighbourhood)Goudi                         -0.61  0.54506    
## as.factor(neighbourhood)Ilisia                         2.18  0.02939 *  
## as.factor(neighbourhood)Kerameikos                     2.58  0.01003 *  
## as.factor(neighbourhood)Kolonaki                       2.33  0.01966 *  
## as.factor(neighbourhood)Kolonos                        2.11  0.03475 *  
## as.factor(neighbourhood)Koukaki                        2.67  0.00753 ** 
## as.factor(neighbourhood)Kypseli                        1.21  0.22814    
## as.factor(neighbourhood)Larissis                       1.79  0.07325 .  
## as.factor(neighbourhood)Metaxourgeio                   1.50  0.13427    
## as.factor(neighbourhood)Mets                           2.61  0.00913 ** 
## as.factor(neighbourhood)Monastiraki                    2.24  0.02489 *  
## as.factor(neighbourhood)Neapoli                        1.52  0.12962    
## as.factor(neighbourhood)Neos Kosmos                    2.59  0.00955 ** 
## as.factor(neighbourhood)Pangrati                       2.01  0.04421 *  
## as.factor(neighbourhood)Patisia                        0.12  0.90259    
## as.factor(neighbourhood)Pedion Areos                   2.55  0.01084 *  
## as.factor(neighbourhood)Petralona                      2.26  0.02365 *  
## as.factor(neighbourhood)Plaka                          2.59  0.00957 ** 
## as.factor(neighbourhood)Profitis Daniil                2.40  0.01636 *  
## as.factor(neighbourhood)Psyri                          1.42  0.15681    
## as.factor(neighbourhood)Rizoupoli                     -0.16  0.87582    
## as.factor(neighbourhood)Rouf                          -0.17  0.86320    
## as.factor(neighbourhood)Sepolia                        1.31  0.19103    
## as.factor(neighbourhood)Thiseio                        3.33  0.00087 ***
## as.factor(neighbourhood)Votanikos                      0.66  0.50799    
## cent_dist                                              1.34  0.17925    
## review_scores_rating                                   2.38  0.01752 *  
## reviews_per_month                                    -13.64  < 2e-16 ***
## as.factor(neighbourhood)Agios Nikolaos:cent_dist      -0.31  0.75896    
## as.factor(neighbourhood)Akadimia Platonos:cent_dist   -3.46  0.00055 ***
## as.factor(neighbourhood)Ambelokipi:cent_dist          -1.29  0.19736    
## as.factor(neighbourhood)Attiki:cent_dist              -1.43  0.15373    
## as.factor(neighbourhood)Exarcheia:cent_dist           -1.31  0.18941    
## as.factor(neighbourhood)Gazi:cent_dist                 0.53  0.59406    
## as.factor(neighbourhood)Goudi:cent_dist                0.74  0.45793    
## as.factor(neighbourhood)Ilisia:cent_dist              -1.78  0.07470 .  
## as.factor(neighbourhood)Kerameikos:cent_dist          -2.19  0.02823 *  
## as.factor(neighbourhood)Kolonaki:cent_dist            -1.85  0.06380 .  
## as.factor(neighbourhood)Kolonos:cent_dist             -1.91  0.05678 .  
## as.factor(neighbourhood)Koukaki:cent_dist             -2.93  0.00341 ** 
## as.factor(neighbourhood)Kypseli:cent_dist             -1.02  0.30651    
## as.factor(neighbourhood)Larissis:cent_dist            -1.58  0.11522    
## as.factor(neighbourhood)Metaxourgeio:cent_dist        -0.30  0.76495    
## as.factor(neighbourhood)Mets:cent_dist                -3.15  0.00166 ** 
## as.factor(neighbourhood)Monastiraki:cent_dist         -0.81  0.41671    
## as.factor(neighbourhood)Neapoli:cent_dist             -0.46  0.64681    
## as.factor(neighbourhood)Neos Kosmos:cent_dist         -3.06  0.00219 ** 
## as.factor(neighbourhood)Pangrati:cent_dist            -1.98  0.04793 *  
## as.factor(neighbourhood)Patisia:cent_dist              0.00  0.99873    
## as.factor(neighbourhood)Pedion Areos:cent_dist        -2.80  0.00508 ** 
## as.factor(neighbourhood)Petralona:cent_dist           -2.01  0.04406 *  
## as.factor(neighbourhood)Plaka:cent_dist               -1.84  0.06640 .  
## as.factor(neighbourhood)Profitis Daniil:cent_dist     -1.51  0.13071    
## as.factor(neighbourhood)Psyri:cent_dist                0.95  0.34217    
## as.factor(neighbourhood)Rizoupoli:cent_dist            0.08  0.93324    
## as.factor(neighbourhood)Rouf:cent_dist                   NA       NA    
## as.factor(neighbourhood)Sepolia:cent_dist             -1.28  0.20184    
## as.factor(neighbourhood)Thiseio:cent_dist             -3.33  0.00089 ***
## as.factor(neighbourhood)Votanikos:cent_dist           -0.18  0.85901    
## review_scores_rating:reviews_per_month                12.96  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.38 on 5576 degrees of freedom
## Multiple R-squared:  0.54,   Adjusted R-squared:  0.533 
## F-statistic: 77.9 on 84 and 5576 DF,  p-value: <2e-16
summary(model7)$r.squared #R2 0.54
## [1] 0.54
model7 %>% AIC() # 5184
## [1] 5184

In our final model we use 2 interactions. One for the distance, which we correct with neighbourhoods, and one for the reviews, where we try to weight the rating and frequency, giving a proxy for the demand of that airbnb.

With these adjustments our model our model outperforms any other model we tried, and is still fairly understandable. Our R2 is around 54%, and our Akaike infromation criterion is 5184, which is much lower than our first tries which were around 12000.

We now plot a graph showing the distribution of the residuals

ggplot(model7, aes(x = .fitted, y = .resid)) + 
  geom_point() +
  labs(title = "Residuals vary around zero") +
  ylab("Residual") +
  xlab("")

2.3.3 Statistical tests

Because of our data is dependent on human actions, we are going to test if the variance is constant in our model. If not, then we will correct for this using robust standard errors.

# Heteroscedasticity

ols_test_breusch_pagan(model7)
## 
##  Breusch Pagan Test for Heteroskedasticity
##  -----------------------------------------
##  Ho: the variance is constant            
##  Ha: the variance is not constant        
## 
##                  Data                  
##  --------------------------------------
##  Response : log(price) 
##  Variables: fitted values of log(price) 
## 
##        Test Summary         
##  ---------------------------
##  DF            =    1 
##  Chi2          =    55.3884 
##  Prob > Chi2   =    9.89e-14
# The test tells us that the variance is not constant accross our sample, therefore we will use robust standard errors.

lmrob_control <- lmrob.control()
lmrob_control$fast.s.large.n <- Inf

model7_rob <- lmrob(log(price) ~ 
                      as.factor(accommodates) + room_type + bathrooms  +
                      as.factor(neighbourhood) * cent_dist + 
                      review_scores_rating * reviews_per_month, 
      data=na.omit(train),
      control=lmrob_control)

summary(model7_rob)
## 
## Call:
## lmrob(formula = log(price) ~ as.factor(accommodates) + room_type + bathrooms + 
##     as.factor(neighbourhood) * cent_dist + review_scores_rating * reviews_per_month, 
##     data = na.omit(train), control = lmrob_control)
##  \--> method = "MM"
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6.34638 -0.22279 -0.00471  0.22794  1.89908 
## 
## Coefficients: (1 not defined because of singularities)
##                                                      Estimate Std. Error
## (Intercept)                                          1.83e+00   5.76e-01
## as.factor(accommodates)2                             1.99e-01   4.68e-02
## as.factor(accommodates)3                             2.70e-01   4.86e-02
## as.factor(accommodates)4                             4.13e-01   4.84e-02
## as.factor(accommodates)5                             5.14e-01   5.07e-02
## as.factor(accommodates)6                             6.16e-01   5.09e-02
## as.factor(accommodates)7                             6.91e-01   5.96e-02
## as.factor(accommodates)8                             7.96e-01   6.59e-02
## as.factor(accommodates)9                             9.40e-01   7.54e-02
## as.factor(accommodates)10                            9.05e-01   9.94e-02
## as.factor(accommodates)11                            8.43e-01   3.65e-01
## as.factor(accommodates)12                            1.10e+00   1.73e-01
## as.factor(accommodates)13                            4.75e-01   2.62e-01
## as.factor(accommodates)14                            1.14e+00   2.71e-01
## as.factor(accommodates)15                            1.28e+00   1.86e-01
## as.factor(accommodates)16                            1.36e+00   1.34e-01
## room_typePrivate room                                2.26e-01   1.22e-01
## room_typeHotel room                                 -4.57e-01   2.76e-02
## room_typeShared room                                -1.17e+00   9.17e-02
## bathrooms                                            2.30e-01   2.17e-02
## as.factor(neighbourhood)Agios Nikolaos               3.68e-01   7.27e-01
## as.factor(neighbourhood)Akadimia Platonos            2.96e+00   1.26e+00
## as.factor(neighbourhood)Ambelokipi                   1.05e+00   5.74e-01
## as.factor(neighbourhood)Attiki                       1.14e+00   6.06e-01
## as.factor(neighbourhood)Exarcheia                    1.09e+00   5.75e-01
## as.factor(neighbourhood)Gazi                         3.76e-01   8.52e-01
## as.factor(neighbourhood)Goudi                       -1.55e+00   2.02e+00
## as.factor(neighbourhood)Ilisia                       1.68e+00   6.14e-01
## as.factor(neighbourhood)Kerameikos                   1.67e+00   5.90e-01
## as.factor(neighbourhood)Kolonaki                     1.53e+00   5.69e-01
## as.factor(neighbourhood)Kolonos                      1.95e+00   7.23e-01
## as.factor(neighbourhood)Koukaki                      1.78e+00   5.72e-01
## as.factor(neighbourhood)Kypseli                      1.08e+00   6.25e-01
## as.factor(neighbourhood)Larissis                     1.25e+00   5.82e-01
## as.factor(neighbourhood)Metaxourgeio                 9.04e-01   5.86e-01
## as.factor(neighbourhood)Mets                         1.66e+00   5.77e-01
## as.factor(neighbourhood)Monastiraki                  1.76e+00   6.93e-01
## as.factor(neighbourhood)Neapoli                      9.08e-01   5.96e-01
## as.factor(neighbourhood)Neos Kosmos                  1.67e+00   5.69e-01
## as.factor(neighbourhood)Pangrati                     1.29e+00   5.72e-01
## as.factor(neighbourhood)Patisia                      5.58e-02   6.74e-01
## as.factor(neighbourhood)Pedion Areos                 1.71e+00   6.39e-01
## as.factor(neighbourhood)Petralona                    1.57e+00   5.98e-01
## as.factor(neighbourhood)Plaka                        1.72e+00   5.68e-01
## as.factor(neighbourhood)Profitis Daniil             -7.94e+00   5.49e+00
## as.factor(neighbourhood)Psyri                        1.43e+00   5.95e-01
## as.factor(neighbourhood)Rizoupoli                   -5.91e-02   4.59e+00
## as.factor(neighbourhood)Rouf                        -5.70e-02   1.09e-01
## as.factor(neighbourhood)Sepolia                      3.76e+00   1.67e+00
## as.factor(neighbourhood)Thiseio                      2.20e+00   6.20e-01
## as.factor(neighbourhood)Votanikos                    9.95e-01   1.25e+00
## cent_dist                                            2.06e-04   1.37e-04
## review_scores_rating                                 1.98e-03   9.56e-04
## reviews_per_month                                   -9.38e-01   8.80e-02
## as.factor(neighbourhood)Agios Nikolaos:cent_dist    -3.78e-05   2.00e-04
## as.factor(neighbourhood)Akadimia Platonos:cent_dist -7.90e-04   3.60e-04
## as.factor(neighbourhood)Ambelokipi:cent_dist        -1.94e-04   1.40e-04
## as.factor(neighbourhood)Attiki:cent_dist            -2.94e-04   1.61e-04
## as.factor(neighbourhood)Exarcheia:cent_dist         -1.96e-04   1.58e-04
## as.factor(neighbourhood)Gazi:cent_dist               1.22e-04   2.82e-04
## as.factor(neighbourhood)Goudi:cent_dist              4.99e-04   5.10e-04
## as.factor(neighbourhood)Ilisia:cent_dist            -4.15e-04   1.71e-04
## as.factor(neighbourhood)Kerameikos:cent_dist        -3.57e-04   1.54e-04
## as.factor(neighbourhood)Kolonaki:cent_dist          -3.12e-04   1.47e-04
## as.factor(neighbourhood)Kolonos:cent_dist           -5.44e-04   2.15e-04
## as.factor(neighbourhood)Koukaki:cent_dist           -4.97e-04   1.44e-04
## as.factor(neighbourhood)Kypseli:cent_dist           -2.76e-04   1.68e-04
## as.factor(neighbourhood)Larissis:cent_dist          -3.13e-04   1.55e-04
## as.factor(neighbourhood)Metaxourgeio:cent_dist      -2.33e-05   1.61e-04
## as.factor(neighbourhood)Mets:cent_dist              -5.65e-04   1.63e-04
## as.factor(neighbourhood)Monastiraki:cent_dist       -2.81e-04   4.16e-04
## as.factor(neighbourhood)Neapoli:cent_dist           -1.94e-05   1.95e-04
## as.factor(neighbourhood)Neos Kosmos:cent_dist       -4.85e-04   1.41e-04
## as.factor(neighbourhood)Pangrati:cent_dist          -3.16e-04   1.47e-04
## as.factor(neighbourhood)Patisia:cent_dist            6.96e-06   1.66e-04
## as.factor(neighbourhood)Pedion Areos:cent_dist      -5.18e-04   1.87e-04
## as.factor(neighbourhood)Petralona:cent_dist         -3.58e-04   1.51e-04
## as.factor(neighbourhood)Plaka:cent_dist             -3.45e-04   1.52e-04
## as.factor(neighbourhood)Profitis Daniil:cent_dist    3.28e-03   1.92e-03
## as.factor(neighbourhood)Psyri:cent_dist             -2.08e-04   2.00e-04
## as.factor(neighbourhood)Rizoupoli:cent_dist         -1.08e-05   8.97e-04
## as.factor(neighbourhood)Rouf:cent_dist                     NA         NA
## as.factor(neighbourhood)Sepolia:cent_dist           -1.01e-03   4.49e-04
## as.factor(neighbourhood)Thiseio:cent_dist           -6.17e-04   1.80e-04
## as.factor(neighbourhood)Votanikos:cent_dist         -1.01e-04   4.05e-04
## review_scores_rating:reviews_per_month               9.28e-03   9.08e-04
##                                                     t value Pr(>|t|)    
## (Intercept)                                            3.17  0.00154 ** 
## as.factor(accommodates)2                               4.25  2.2e-05 ***
## as.factor(accommodates)3                               5.57  2.7e-08 ***
## as.factor(accommodates)4                               8.55  < 2e-16 ***
## as.factor(accommodates)5                              10.14  < 2e-16 ***
## as.factor(accommodates)6                              12.11  < 2e-16 ***
## as.factor(accommodates)7                              11.60  < 2e-16 ***
## as.factor(accommodates)8                              12.07  < 2e-16 ***
## as.factor(accommodates)9                              12.47  < 2e-16 ***
## as.factor(accommodates)10                              9.11  < 2e-16 ***
## as.factor(accommodates)11                              2.31  0.02096 *  
## as.factor(accommodates)12                              6.37  2.0e-10 ***
## as.factor(accommodates)13                              1.81  0.07033 .  
## as.factor(accommodates)14                              4.21  2.5e-05 ***
## as.factor(accommodates)15                              6.89  6.1e-12 ***
## as.factor(accommodates)16                             10.19  < 2e-16 ***
## room_typePrivate room                                  1.86  0.06330 .  
## room_typeHotel room                                  -16.52  < 2e-16 ***
## room_typeShared room                                 -12.81  < 2e-16 ***
## bathrooms                                             10.61  < 2e-16 ***
## as.factor(neighbourhood)Agios Nikolaos                 0.51  0.61296    
## as.factor(neighbourhood)Akadimia Platonos              2.34  0.01923 *  
## as.factor(neighbourhood)Ambelokipi                     1.83  0.06658 .  
## as.factor(neighbourhood)Attiki                         1.88  0.06014 .  
## as.factor(neighbourhood)Exarcheia                      1.90  0.05805 .  
## as.factor(neighbourhood)Gazi                           0.44  0.65921    
## as.factor(neighbourhood)Goudi                         -0.77  0.44399    
## as.factor(neighbourhood)Ilisia                         2.74  0.00620 ** 
## as.factor(neighbourhood)Kerameikos                     2.83  0.00466 ** 
## as.factor(neighbourhood)Kolonaki                       2.69  0.00709 ** 
## as.factor(neighbourhood)Kolonos                        2.69  0.00711 ** 
## as.factor(neighbourhood)Koukaki                        3.11  0.00190 ** 
## as.factor(neighbourhood)Kypseli                        1.72  0.08538 .  
## as.factor(neighbourhood)Larissis                       2.14  0.03217 *  
## as.factor(neighbourhood)Metaxourgeio                   1.54  0.12295    
## as.factor(neighbourhood)Mets                           2.88  0.00398 ** 
## as.factor(neighbourhood)Monastiraki                    2.54  0.01106 *  
## as.factor(neighbourhood)Neapoli                        1.52  0.12788    
## as.factor(neighbourhood)Neos Kosmos                    2.93  0.00341 ** 
## as.factor(neighbourhood)Pangrati                       2.26  0.02387 *  
## as.factor(neighbourhood)Patisia                        0.08  0.93399    
## as.factor(neighbourhood)Pedion Areos                   2.67  0.00759 ** 
## as.factor(neighbourhood)Petralona                      2.62  0.00875 ** 
## as.factor(neighbourhood)Plaka                          3.03  0.00245 ** 
## as.factor(neighbourhood)Profitis Daniil               -1.45  0.14838    
## as.factor(neighbourhood)Psyri                          2.41  0.01602 *  
## as.factor(neighbourhood)Rizoupoli                     -0.01  0.98973    
## as.factor(neighbourhood)Rouf                          -0.52  0.60021    
## as.factor(neighbourhood)Sepolia                        2.26  0.02416 *  
## as.factor(neighbourhood)Thiseio                        3.54  0.00040 ***
## as.factor(neighbourhood)Votanikos                      0.80  0.42463    
## cent_dist                                              1.51  0.13203    
## review_scores_rating                                   2.07  0.03884 *  
## reviews_per_month                                    -10.66  < 2e-16 ***
## as.factor(neighbourhood)Agios Nikolaos:cent_dist      -0.19  0.85008    
## as.factor(neighbourhood)Akadimia Platonos:cent_dist   -2.19  0.02833 *  
## as.factor(neighbourhood)Ambelokipi:cent_dist          -1.39  0.16488    
## as.factor(neighbourhood)Attiki:cent_dist              -1.83  0.06747 .  
## as.factor(neighbourhood)Exarcheia:cent_dist           -1.24  0.21509    
## as.factor(neighbourhood)Gazi:cent_dist                 0.43  0.66626    
## as.factor(neighbourhood)Goudi:cent_dist                0.98  0.32801    
## as.factor(neighbourhood)Ilisia:cent_dist              -2.42  0.01536 *  
## as.factor(neighbourhood)Kerameikos:cent_dist          -2.32  0.02039 *  
## as.factor(neighbourhood)Kolonaki:cent_dist            -2.12  0.03420 *  
## as.factor(neighbourhood)Kolonos:cent_dist             -2.53  0.01152 *  
## as.factor(neighbourhood)Koukaki:cent_dist             -3.45  0.00056 ***
## as.factor(neighbourhood)Kypseli:cent_dist             -1.65  0.09913 .  
## as.factor(neighbourhood)Larissis:cent_dist            -2.02  0.04346 *  
## as.factor(neighbourhood)Metaxourgeio:cent_dist        -0.15  0.88460    
## as.factor(neighbourhood)Mets:cent_dist                -3.46  0.00055 ***
## as.factor(neighbourhood)Monastiraki:cent_dist         -0.68  0.49895    
## as.factor(neighbourhood)Neapoli:cent_dist             -0.10  0.92098    
## as.factor(neighbourhood)Neos Kosmos:cent_dist         -3.43  0.00060 ***
## as.factor(neighbourhood)Pangrati:cent_dist            -2.16  0.03104 *  
## as.factor(neighbourhood)Patisia:cent_dist              0.04  0.96651    
## as.factor(neighbourhood)Pedion Areos:cent_dist        -2.77  0.00561 ** 
## as.factor(neighbourhood)Petralona:cent_dist           -2.38  0.01732 *  
## as.factor(neighbourhood)Plaka:cent_dist               -2.27  0.02295 *  
## as.factor(neighbourhood)Profitis Daniil:cent_dist      1.71  0.08728 .  
## as.factor(neighbourhood)Psyri:cent_dist               -1.04  0.29936    
## as.factor(neighbourhood)Rizoupoli:cent_dist           -0.01  0.99042    
## as.factor(neighbourhood)Rouf:cent_dist                   NA       NA    
## as.factor(neighbourhood)Sepolia:cent_dist             -2.25  0.02435 *  
## as.factor(neighbourhood)Thiseio:cent_dist             -3.43  0.00060 ***
## as.factor(neighbourhood)Votanikos:cent_dist           -0.25  0.80250    
## review_scores_rating:reviews_per_month                10.22  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 0.336 
## Multiple R-squared:  0.589,  Adjusted R-squared:  0.582 
## Convergence in 28 IRWLS iterations
## 
## Robustness weights: 
##  16 observations c(156,1032,1153,1319,1654,2742,3084,3107,3153,4056,4401,4512,4947,5079,5415,5607)
##   are outliers with |weight| = 0 ( < 1.8e-05); 
##  473 weights are ~= 1. The remaining 5172 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.871   0.951   0.894   0.985   0.999 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##          1.55e+00          5.00e-01          4.69e+00          1.00e-07 
##           rel.tol         scale.tol         solve.tol       eps.outlier 
##          1.00e-07          1.00e-10          1.00e-07          1.77e-05 
##             eps.x warn.limit.reject warn.limit.meanrw 
##          1.04e-08          5.00e-01          5.00e-01 
##   nResample      max.it    best.r.s    k.fast.s       k.max maxit.scale 
##         500          50           2           1         200         200 
##   trace.lev         mts  compute.rd 
##           0        1000           0 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)
summary(model7_rob)$r.squared #R2 0.589
## [1] 0.589

We in fact had heteroscedasticity in our model and with the new standard errors our model’s R2 improved to be ~59%

2.3.4 Error analysis

Test our models with MSPE (mean squared prediction error)

# Model 1
mean((log(test$price) - predict.lm(model1, test)) ^ 2, na.rm=T)
## [1] 0.295
# Model 2
mean((log(test$price) - predict.lm(model2, test)) ^ 2, na.rm=T)
## [1] 0.293
# Model 3
mean((log(test$price) - predict.lm(model3, test)) ^ 2, na.rm=T)
## [1] 0.267
# Model 4
mean((log(test$price) - predict.lm(model4, test)) ^ 2, na.rm=T)
## [1] 0.258
# Model 5
mean((log(test$price) - predict.lm(model5, test)) ^ 2, na.rm=T)
## [1] 0.252
# Model 6
mean((log(test$price) - predict.lm(model6, test)) ^ 2, na.rm=T)
## [1] 0.229
# Model 7
mean((log(test$price) - predict.lm(model7, test)) ^ 2, na.rm=T)
## [1] 0.165
# Model 7 with Robust standard errors
mean((log(test$price) - predict.lm(model7_rob, test)) ^ 2, na.rm=T)
## [1] 0.167
# Our final model beats any other model on our test data also

We used stepwise method to look for the lowest possible AIC model, but it contained variables which would be hard to defend logically

  • full.model <- lm(log(price) ~., data = na.omit(train))

  • step.model <- stepAIC(full.model, direction = “both”, trace = FALSE)

  • step.model %>% summary() %>% select(coefficients)

  • as.data.frame(summary(step.model)$coefficients) %>% arrange(Estimate)

3 Final predictions

First let’s visualize is our errors could be explained by their location

pred_price <- exp(predict(model7, test, se.fit = TRUE)$fit)

pred_test <- test %>% 
  cbind(pred_price) %>% 
  mutate(
    pred_error = (test$price - pred_price) / pred_price
  ) %>% 
  na.omit()
  
athens_map +
  geom_point(data=pred_test, aes(x = longitude, y = latitude, color = pred_error*100)) +
  geom_point(aes(x = syntagma['latitude'], syntagma['longitude']), color = 'red', size = 5) +
  map_theme +
  labs(title = "Our residuals do not correlate with distance" , subtitle = "Colors represent error level") +
  scale_color_continuous(name="Error level (%)")

Finally let’s see how much would be a night if me and my fried would like to visit Athene and would like to leave in 1.5km radius of the centre.

pred_friend <- test %>% 
  filter(
    cent_dist <= 1500,
    accommodates == 2
  )

ggplot(data=pred_friend, aes(x=price, y=exp(predict(model7_rob, pred_friend, se.fit = TRUE)$fit))) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1) +
  labs(title = "Our model predicts well most of the prices, except some extreme prices",
       subtitle = "Predicted vs actual prices (1.5km from centre, 2 accomodation)") +
  ylab("Predicted price") +
  xlab("Actual price")

The predicted price was €42.24 with a 95% confidence interval from €39.50 to €45.27.

3.1 Create your own predictions

Would you like to travel with more friends? Or would you like to move further our?

Use this tool to find out more about our predictions and actuals

library(shiny)

ui <- fluidPage(
  titlePanel(title=h4("Predicted Athene airbnb prices", align="center")),
  sidebarPanel( 
    numericInput("cent_dist", label="How far from the centre (max meter)?", value=1500),
    numericInput("accom", label="How many people?", value=2),
    selectInput("bathrooms", label="How many bathrooms?", 
                choices = c("Any", unique(test$bathrooms)))),
  mainPanel(plotOutput("plot2")),
            tableOutput("table"))

server <- function(input,output){
  
  dat <- reactive({
    
    data <- pred_friend <- athens_data_final %>% 
      filter(
        cent_dist <= input$cent_dist,
        ifelse(input$bathrooms != "Any", 
               bathrooms == input$bathrooms, 
               bathrooms == bathrooms),
        accommodates == input$accom
        ) %>% 
      na.omit()
    
    return(data)
  })
  
  output$table <- renderTable({
    
    reac_data <- dat()
    table <- predict(model7_rob, newdata = reac_data, 
                                        interval = "confidence") %>% 
      exp() %>% 
      data.frame() %>%
      summarize(lower_bound = mean(lwr),
                predicted_price = mean(fit),
                upper_bound = mean(upr))
    
    names(table) <- c("Lower CI Prediction", "Mean Prediction", "Upper CI Prediction")
    
    return(table)
  })
  
  output$plot2<-renderPlot({
    
    reac_data <- dat()
    print(head(reac_data))
    
    
    acc_str <- paste0("Accomodates:", input$accom, sep=" ")
    cent_str <- paste0("Distance from centre:", input$cent_dist, sep=" ")
    bathrooms_str <- paste0("Bathrooms:", input$bathrooms, sep=" ")
    
    ggplot(data=reac_data, aes(x=price, 
                           y=exp(predict(model7_rob, reac_data, se.fit = TRUE)$fit))) +
      geom_point() +
      geom_abline(intercept = 0, slope = 1) +
      labs(title = "Prediction vs. Actual prices in Athene",
           subtitle = paste0(acc_str, cent_str, bathrooms_str, sep=" ")) +
      ylab("Predicted price") +
      xlab("Actual price")
    
    })
  }
  
shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents