Explored how local business density and types (e.g., cafes, bars, restaurants) correlate with bike availability at Lisbon’s CityBike stations. Integrated geospatial APIs, statistical modelling, and regression analysis.

Skills

Statistical Testing, Regression Modelling, Data Cleaning

Tools

Python, PostgreSQL, CityBike API, Foursquare API

Project Overview

This project investigates whether surrounding businesses influence bike availability at Lisbon’s CityBike stations. Using the CityBike and Foursquare APIs, I collected geospatial and venue-level data for over 9,600 locations near 195 stations. I then applied multivariate regression and statistical tests to evaluate predictors such as venue type, rating, and popularity.

Despite weak correlations overall, the number of nearby bars and cafes showed minor but statistically significant associations with bike availability. This analysis highlights both the potential and limitations of using public venue data to understand urban mobility patterns.

‍

Repository README

README

CityBike Lisbon: Location Data Modelling with Python

Project

This project explores the influence of surrounding businesses on bike availability at Lisbon CityBike stations. Using data from the CityBike and Foursquare APIs, I performed multivariate regression analysis to assess how proximity to bars, cafes, and other venues impacts station usage.

Goals

Identify predictor variables that influence bike availability at Lisbon CityBike stations.
Visualise trends between station-level bike availability and surrounding locations
Use multivariate regression analysis to determine whether location context meaningfully predicts bike availability.

Data

Tech Stack

Python (pandas, seaborn, matplotlib, statsmodels)
APIs: CityBike, Foursquare
PostgreSQL
Jupyter Notebooks

Process

Step 1

city_bikes.ipynb
Acquired CityBike Data from CityBike API
- Resulting 195 bike station coordinates
Parsing JSON File data into Dataframes for cleaning
CityBike API Data Timestamp = 12:39 AM Sunday, July 27 2025

Step 2

yelp_foursquare_EDA.ipynb
Acquired location data from FOURSQAURE APIs within 1,000metres of Bike Stations
- Data parameters included;
  - Location Name
  - Distance From CityBike Station
  - Location Category
  - Rating
  - Popularity
  - Hours
  - Hours Popular
  - Price
- 9,662 Locations within 1,000 metres radii of Bike Stations
  - Grouped by Bike Station, visualisation methods performed for cleaning, aggregation and modelling
Issue Unable to access YELP API without Business Contact
- YELP API Step Ignored

Final cleaned dataset:

176 Bike Stations
9,662 surrounding locations (filtered)

SQL Database

Created SQL Database for future access
- Created, linked and stored data in normalised tables

Step 3

Exploratory Data Analysis
- Conducted visual inspections with seaborn/matplotlib pairplots
  - Observed null or weak visual relationships
- Generated Pearson R correlations for proposed linear relationships & variance inflation factor values for Multicollinearity
- Observed weak linear correlations between x-variables and y-variables
  - y-variables
    - free_bikes, total_slots
- Cleaning Procedure
  - Foursquare API Limitation:
    - API limit at 50 locations per bike station, which introduces selection bias.
  - Filtered Locations distance < 298 metres to remove selection bias of 50 location limitations
    - Bike Station Maximum locations = 49
  - Cleaned Quantity : Bike Stations = 176

Step 4

Statistical Modelling
- Mulivariate Regression Models
  - Dependent Variables
    - Free Bikes
    - Total Bike Slots
  - Full model with all business categories
  - Refined model using only top independent variables
    - bar_count
    - cafe_counts
  - Adjusted R² improved in simplified model
- Linear Regression Models
  - Independent Variable
    - Average Popularity of locations
  - Dependent Variables
    - Free Bikes
    - Total Bike Slots
  - Strong correlations
    - Statistically Insignificant
    - Maximum of 1.4% explanatory power on the variability of dependent variables

Results

API Quality

Foursquare’s API returned a maximum of 50 location results per coordinate query, which introduced selection bias in areas with dense points of interest.
To mitigate this, we applied a filter of < 298 metres, reducing the max results per Bike Station to 49.
Final cleaned dataset includes 176 Bike Stations with confirmed nearby locations under this threshold.

Visual Observations

Pairplots showed no clear visual relationship between business categories, average popularity or average ratings, and the availability of Free Bikes or Total Bike Slots.
This held true both before and after bias correction, suggesting a weak influence of surrounding locations on bike availability.
Cleaned Dataframe Scatterplot Visualisations
- Scatterplot of Free Vikes Vs Ratings
- Scatterplot of Free Bikes Vs Categories
- Pairplot of all Bike Stations

OLS Regression Model Findings

Pearson R correlation and VIF analyses confirmed low multicollinearity for business categories, with a proposed threshold of 5. However, little predictive strength in further analysis.
- VIF analysis did find high multicollinearity among ratings and popularity. These variables removed from Multivariate Regression Analysis.
A linear regression model was applied using Average Popularity of Locations as the independent variable, and Free Bikes and Total Bike Slots as the dependent variables.
- Weak linear relationships were observed between top independent variables and dependent variables.
Selecting Top Independeent Variables:
- Coefficients
  - Each additional bar within 298 metres is associated with 1.67 more free bikes.
  - Each additional café is associated with 0.74 fewer free bikes.
- p-value
  - Both Correlations Statistically Significant at the level of 5% alpha
- Adjusted R-squared of multivariate models increased:
  - The model explains approximately 17.9% of the variance in free bike availability, indicating low predictive power.

Conclusion

The Foursquare API provided geospatial location context, but within Lisbon city, the nearby venue data utilised does not show strong predictive influence on city bike station metrics.
No strong predictive relationships or correlations were observed between Location Categories or Location Ratings with Total Bike Slots or Free Bikes of CityBike Stations.
- Location Categories are a poor predictor of both Bike dependent variables
- Location Ratings & Popularity are a poor predictor of both Bike dependent variables
However, bar_count and cafe_counts offered the strongest available predictive value for Free Bikes accounting for approximately 17.9% of the variance.
Alternative predictor variables and features not present in this analysis may be more valuable in predicting bike availability than nearby venue types or ratings

Future Goals

Additional Independent Variables
- Examples
  - Time Of Day
  - Season
- CityBike API Data Timestamp = 12:39 AM Sunday, July 27 2025
  - Further Goal to observe trends across daily, weekly & monthly timestamps
Cleaning further of categories
- Top 4 cleaned categories utilised from CityBike API
  - Restaurant
  - Bar
  - Cafe
  - Coffee
- Further cleaning and grouping of broader categories
  - Examples
    - Public / Government
    - Retail
    - Hospitality
Additional Statistical Analysis
- Statistical Testing
  - Models
    - Plot residuals vs fitted values
    - Generate Q-Q plot of residuals
    - Add a line of best fit to scatterplots using sns.regplot()
  - T-Tests to identify acquired sample data of locations as a reflection of true populations.
    - As a result of restricted FOURSQUARE API Response limits.
  - However, given poor predictor variables, significance of successful Statistical Testing minimal due to infered results

Challenges

Selection Bias in FOURSQUARE API Reponses.
- Restricted responses from FOURSQUARE API required a limitation of 50 locations for Bike Station Coordiantes.
- Restricted data available to assess full population of locations surrounding Bike Station, requiring cleaning and reduction of location distance threshold.
Low Predictive Strength of Independent Variables.
- Low/Weak linear relationships from all geospatially available data with our CityBike Stations dependent variables.
- Resulting conclusions are low model effectiveness and future goals to identify alternative independent variables.
Multicollinearity
- Strong multicollinearity present among top linear correlated variables, Average Rating and Average Popularity.
- Separation of multicollinearity variables was required to meet assumptions for accurate model predictions.

Github Repository

Visit Website

Climate Kegs – Carbon Neutrality Campaign

Partnering initially with Climate Neutral to measure emissions across RMU’s global operations, I redirected the project into a community-driven philanthropic sustainability campaign in response to funding limitations. This led to the creation of “Climate Kegs”, a tree-planting beer initiative launched in partnership with One Tree Planted. Each keg funded the planting of approximately 750 trees, equivalent to offsetting one tonne of CO₂ over 20 years, by turning hospitality-driven customer engagement into measurable local climate action.

Salifort Motors Attrition Rates

A capstone project for the Google Advanced Data Analytics certification, this exploration investigates trends and variable associations through classification models between employee departures at the fictional Salifort Motors company.

Sales Analytics: Ecommerce Growth to $500K

Led the development and expansion of RMU Skis’ ecommerce channel, achieving 40% year-over-year growth across two consecutive years. Analyzed website traffic, conversion rates, and SKU-level profitability to drive targeted marketing, optimize product strategy, and support data-driven decision-making.

CityBike Trends in Lisbon, PT