CityBike Trends in Lisbon, PT

Explored how local business density and types (e.g., cafes, bars, restaurants) correlate with bike availability at Lisbon’s CityBike stations. Integrated geospatial APIs, statistical modelling, and regression analysis.

Github Repository
Visit Website
Skills
Statistical Testing, Regression Modelling, Data Cleaning
Tools
Python, PostgreSQL, CityBike API, Foursquare API

Project Overview

This project investigates whether surrounding businesses influence bike availability at Lisbon’s CityBike stations. Using the CityBike and Foursquare APIs, I collected geospatial and venue-level data for over 9,600 locations near 195 stations. I then applied multivariate regression and statistical tests to evaluate predictors such as venue type, rating, and popularity.

Despite weak correlations overall, the number of nearby bars and cafes showed minor but statistically significant associations with bike availability. This analysis highlights both the potential and limitations of using public venue data to understand urban mobility patterns.

Repository README


CityBike Lisbon: Location Data Modelling with Python

Project

This project explores the influence of surrounding businesses on bike availability at Lisbon CityBike stations. Using data from the CityBike and Foursquare APIs, I performed multivariate regression analysis to assess how proximity to bars, cafes, and other venues impacts station usage.

Goals

  • Identify predictor variables that influence bike availability at Lisbon CityBike stations.
  • Visualise trends between station-level bike availability and surrounding locations.
  • Use multivariate regression analysis to determine whether location context meaningfully predicts bike availability.

Data

Tech Stack

  • Python (pandas, seaborn, matplotlib, statsmodels)
  • APIs: CityBike, Foursquare
  • PostgreSQL
  • Jupyter Notebooks

Process

Step 1

  • city_bikes.ipynb

  • Acquired CityBike data from the CityBike API

    • Returned coordinates for 195 bike stations
  • Parsed the JSON response into DataFrames for cleaning

  • CityBike API Data Timestamp = 12:39 AM Sunday, July 27 2025
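
As a rough illustration of the parsing step above, the sketch below flattens a station payload into a pandas DataFrame. The payload shape (a `network` object holding a `stations` list with `free_bikes` and `empty_slots`), the column names, and the derivation of `total_slots` as their sum are assumptions made for this example. The sample values are invented and are not project data.

```python
import pandas as pd

# Illustrative payload only: field names are assumed, values are made up.
sample_payload = {
    "network": {
        "stations": [
            {"id": "a1", "name": "Station A", "latitude": 38.71, "longitude": -9.14,
             "free_bikes": 4, "empty_slots": 11},
            {"id": "b2", "name": "Station B", "latitude": 38.72, "longitude": -9.15,
             "free_bikes": 0, "empty_slots": 20},
        ]
    }
}


def stations_to_frame(payload: dict) -> pd.DataFrame:
    """Flatten the station list into one row per station."""
    frame = pd.json_normalize(payload["network"]["stations"])
    # Assumption: total capacity = bikes docked + empty docks.
    frame["total_slots"] = frame["free_bikes"] + frame["empty_slots"]
    columns = ["id", "name", "latitude", "longitude", "free_bikes", "total_slots"]
    return frame[columns]


stations = stations_to_frame(sample_payload)
print(stations)
```

Keeping the parsing in a small function makes it easy to re-run on snapshots taken at other timestamps.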

Step 2

  • yelp_foursquare_EDA.ipynb

  • Acquired location data from the Foursquare API within 1,000 metres of Bike Stations

    • Data parameters included:

      • Location Name
      • Distance From CityBike Station
      • Location Category
      • Rating
      • Popularity
      • Hours
      • Hours Popular
      • Price
    • 9,662 locations within a 1,000-metre radius of Bike Stations

      • Grouped by Bike Station and visualised to guide cleaning, aggregation and modelling
  • Issue: unable to access the Yelp API without a business contact

    • Yelp API step skipped
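
The grouping step above can be sketched roughly as follows: venue rows are collapsed into one feature row per station, with a count per category plus mean rating and popularity. The column names (`station_id`, `category`, `rating`, `popularity`) and the derived feature names are hypothetical, and the rows are invented for illustration.

```python
import pandas as pd

# Invented rows standing in for the venue table; column names are assumptions.
venues = pd.DataFrame({
    "station_id": ["a1", "a1", "a1", "b2", "b2"],
    "category": ["Bar", "Cafe", "Bar", "Restaurant", "Cafe"],
    "rating": [8.0, 7.0, None, 9.0, 6.0],
    "popularity": [0.9, 0.5, 0.7, 0.8, 0.4],
})

# One row per station, one column per category, cell = number of venues.
category_counts = pd.crosstab(venues["station_id"], venues["category"])
category_counts.columns = [f"{c.lower()}_count" for c in category_counts.columns]

# Mean rating and popularity per station (missing ratings are skipped by mean()).
averages = venues.groupby("station_id")[["rating", "popularity"]].mean()
averages.columns = ["avg_rating", "avg_popularity"]

# Both frames are indexed by station_id, so a plain join lines them up.
station_features = category_counts.join(averages)
print(station_features)
```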

Final cleaned dataset:

  • 176 Bike Stations
  • 9,662 surrounding locations (filtered)

SQL Database

  • Created SQL Database for future access
    • Created, linked and stored data in normalised tables
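
One possible shape for the normalised tables is sketched below. It uses Python's built-in `sqlite3` as a stand-in so the example runs without a database server; the project itself used PostgreSQL. All table and column names are hypothetical, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Stations and venues live in their own tables; a link table records
# which venue is near which station, and at what distance.
conn.executescript("""
CREATE TABLE stations (
    station_id  TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    free_bikes  INTEGER,
    total_slots INTEGER
);
CREATE TABLE venues (
    venue_id TEXT PRIMARY KEY,
    name     TEXT NOT NULL,
    category TEXT
);
CREATE TABLE station_venues (
    station_id TEXT REFERENCES stations(station_id),
    venue_id   TEXT REFERENCES venues(venue_id),
    distance_m REAL,
    PRIMARY KEY (station_id, venue_id)
);
""")

# Invented sample rows.
conn.execute("INSERT INTO stations VALUES ('a1', 'Station A', 4, 15)")
conn.executemany("INSERT INTO venues VALUES (?, ?, ?)",
                 [("v1", "Bar One", "Bar"), ("v2", "Cafe Two", "Cafe")])
conn.executemany("INSERT INTO station_venues VALUES (?, ?, ?)",
                 [("a1", "v1", 120.0), ("a1", "v2", 250.0)])

rows = conn.execute("""
    SELECT s.name, COUNT(*) AS venue_count
    FROM stations AS s
    JOIN station_venues AS sv ON sv.station_id = s.station_id
    GROUP BY s.station_id
""").fetchall()
print(rows)
```

The link table avoids repeating venue details for every station that happens to be near the same venue.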

Step 3

  • Exploratory Data Analysis
    • Conducted visual inspections with seaborn/matplotlib pairplots

      • Observed null or weak visual relationships
    • Generated Pearson r correlations to test the proposed linear relationships, and variance inflation factor (VIF) values to check for multicollinearity

    • Observed weak linear correlations between x-variables and y-variables

      • y-variables
        • free_bikes, total_slots
    • Cleaning Procedure

      • Foursquare API limitation:
        • The API returns at most 50 locations per Bike Station, which introduces selection bias.
      • Kept only locations closer than 298 metres so that no Bike Station reaches the 50-location cap
        • Maximum locations per Bike Station after filtering = 49
      • Bike Stations remaining after cleaning = 176
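
The correlation and multicollinearity checks above can be illustrated with a minimal sketch on synthetic data. It uses plain NumPy and the identity that each VIF equals the matching diagonal entry of the inverse correlation matrix, rather than the statsmodels helper the notebook may have used. The variable names and the generated values are assumptions for illustration only; the threshold of 5 matches the one used in this analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic station-level features (not project data).
bar_count = rng.poisson(3, n).astype(float)
cafe_count = rng.poisson(5, n).astype(float)
avg_popularity = rng.normal(0.5, 0.1, n)
# avg_rating is built to track avg_popularity closely, so the pair is collinear.
avg_rating = 10 * avg_popularity + rng.normal(0, 0.1, n)

names = ["bar_count", "cafe_count", "avg_popularity", "avg_rating"]
X = np.column_stack([bar_count, cafe_count, avg_popularity, avg_rating])

corr = np.corrcoef(X, rowvar=False)   # pairwise Pearson r between predictors
vif = np.diag(np.linalg.inv(corr))    # VIF_j = j-th diagonal of the inverse

VIF_THRESHOLD = 5
for name, value in zip(names, vif):
    flag = "high" if value > VIF_THRESHOLD else "ok"
    print(f"{name}: VIF = {value:.2f} ({flag})")
```

In this synthetic setup the two count variables stay below the threshold while the rating and popularity pair exceeds it, which mirrors the pattern described in this README.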

Step 4

  • Statistical Modelling
    • Multivariate Regression Models

      • Dependent Variables
        • Free Bikes
        • Total Bike Slots
      • Full model with all business categories
      • Refined model using only top independent variables
        • bar_count
        • cafe_counts
      • Adjusted R² improved in the simplified model
    • Linear Regression Models

      • Independent Variable
        • Average Popularity of locations
      • Dependent Variables
        • Free Bikes
        • Total Bike Slots
      • Weak correlations
        • Statistically insignificant
        • At most 1.4% of the variability in the dependent variables explained
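
To make the full-versus-refined comparison above concrete, here is a minimal sketch that fits ordinary least squares with NumPy and reports adjusted R² for both models. NumPy's least-squares solver is swapped in for statsmodels to keep the example dependency-light. The data is synthetic, and the variable names and generating coefficients are invented. Whether the refined model scores higher depends on the data.

```python
import numpy as np


def adjusted_r2(X: np.ndarray, y: np.ndarray) -> float:
    """Fit OLS with an intercept and return adjusted R-squared."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - ss_res / ss_tot
    # Penalise R-squared for the number of predictors k.
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


rng = np.random.default_rng(42)
n = 176
bars = rng.poisson(3, n).astype(float)
cafes = rng.poisson(5, n).astype(float)
other_categories = rng.normal(size=(n, 6))   # stand-ins for uninformative predictors
free_bikes = 5 + 1.5 * bars - 0.7 * cafes + rng.normal(0, 4, n)

X_full = np.column_stack([bars, cafes, other_categories])
X_refined = np.column_stack([bars, cafes])

adj_full = adjusted_r2(X_full, free_bikes)
adj_refined = adjusted_r2(X_refined, free_bikes)
print(f"full: {adj_full:.3f}  refined: {adj_refined:.3f}")
```

Adjusted R² is the natural comparison here because plain R² never decreases when predictors are added, while the adjusted version penalises predictors that contribute little.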

Results

API Quality

  • Foursquare’s API returned a maximum of 50 location results per coordinate query, which introduced selection bias in areas with dense points of interest.
  • To mitigate this, we applied a filter of < 298 metres, reducing the max results per Bike Station to 49.
  • Final cleaned dataset includes 176 Bike Stations with confirmed nearby locations under this threshold.
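
The mitigation above can be sketched as a simple distance filter. The idea, under the assumption that a capped query drops the more distant venues, is to keep only a radius small enough that no station still reaches the cap. Column names and rows are hypothetical; the cap of 50 and the 298-metre threshold are the values reported in this README.

```python
import pandas as pd

API_CAP = 50        # maximum results per query, as reported above
THRESHOLD_M = 298   # distance cut-off used in this analysis

# Invented data: station a1 hits the cap, station b2 does not.
venues = pd.DataFrame({
    "station_id": ["a1"] * 50 + ["b2"] * 3,
    "distance_m": list(range(10, 510, 10)) + [100, 400, 900],
})

# Stations that hit the cap may have had further venues cut off.
per_station = venues.groupby("station_id").size()
capped = per_station[per_station >= API_CAP].index.tolist()

# Keep only venues inside the threshold, then re-check the per-station maximum.
filtered = venues[venues["distance_m"] < THRESHOLD_M]
max_after = filtered.groupby("station_id").size().max()

print(f"capped stations: {capped}")
print(f"max venues per station after filtering: {max_after}")
```

If the per-station maximum after filtering is still at the cap, the threshold would need to be lowered further.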

Visual Observations

  • Pairplots showed no clear visual relationship between business categories, average popularity or average ratings, and the availability of Free Bikes or Total Bike Slots.

  • This held true both before and after bias correction, suggesting a weak influence of surrounding locations on bike availability.

  • Cleaned Dataframe Scatterplot Visualisations

    • Scatterplot of Free Bikes vs Ratings
    • Scatterplot of Free Bikes vs Categories
    • Pairplot of all Bike Stations

OLS Regression Model Findings

  • Pearson r correlations and VIF analysis confirmed low multicollinearity among the business categories, using a VIF threshold of 5. However, these variables showed little predictive strength in further analysis.

    • VIF analysis did find high multicollinearity between ratings and popularity, so these variables were removed from the multivariate regression analysis.
  • A linear regression model was applied using Average Popularity of Locations as the independent variable, and Free Bikes and Total Bike Slots as the dependent variables.

    • Weak linear relationships were observed between top independent variables and dependent variables.
  • Selecting Top Independent Variables:

    • Coefficients
      • Each additional bar within 298 metres is associated with 1.67 more free bikes.
      • Each additional café is associated with 0.74 fewer free bikes.
    • p-value
      • Both coefficients are statistically significant at the 5% level (alpha = 0.05).
    • Adjusted R-squared of multivariate models increased:
      • The model explains approximately 17.9% of the variance in free bike availability, indicating low predictive power.
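
As a worked reading of the coefficients above, the short sketch below computes the model-implied difference in free bikes between two stations that differ only in their bar and café counts. The function name and the example counts are hypothetical. The intercept is not reported here, so only differences are computed, and they describe associations rather than causal effects.

```python
# Coefficients reported above (free bikes per additional venue within 298 metres).
BAR_COEF = 1.67
CAFE_COEF = -0.74


def expected_difference(extra_bars: int, extra_cafes: int) -> float:
    """Model-implied difference in free bikes between two stations.

    The intercept cancels out because only the difference is computed.
    """
    return BAR_COEF * extra_bars + CAFE_COEF * extra_cafes


# Hypothetical comparison: a station with 3 more bars and 2 more cafes nearby.
print(round(expected_difference(3, 2), 2))  # 3.53
```

With the model explaining roughly 17.9% of the variance, such differences should be read as loose tendencies, not station-level forecasts.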

Conclusion

  • The Foursquare API provided geospatial location context, but within Lisbon city, the nearby venue data utilised does not show strong predictive influence on city bike station metrics.

  • No strong predictive relationships or correlations were observed between Location Categories or Location Ratings with Total Bike Slots or Free Bikes of CityBike Stations.

    • Location Categories are poor predictors of both Bike dependent variables.
    • Location Ratings & Popularity are poor predictors of both Bike dependent variables.
  • However, bar_count and cafe_counts offered the strongest available predictive value for Free Bikes, accounting for approximately 17.9% of the variance.

  • Alternative predictor variables and features not present in this analysis may be more valuable in predicting bike availability than nearby venue types or ratings.

Future Goals

  • Additional Independent Variables

    • Examples
      • Time Of Day
      • Season
    • CityBike API Data Timestamp = 12:39 AM Sunday, July 27 2025
      • Further Goal to observe trends across daily, weekly & monthly timestamps
  • Further cleaning of categories

    • Top 4 cleaned categories utilised from the Foursquare API
      • Restaurant
      • Bar
      • Cafe
      • Coffee
    • Further cleaning and grouping of broader categories
      • Examples
        • Public / Government
        • Retail
        • Hospitality
  • Additional Statistical Analysis

    • Statistical Testing
      • Models
        • Plot residuals vs fitted values
        • Generate Q-Q plot of residuals
        • Add a line of best fit to scatterplots using sns.regplot()
      • T-tests to check whether the sampled locations reflect the true population of locations.
        • Needed because of the restricted Foursquare API response limits.
      • However, given the weak predictor variables, even successful statistical testing would add little to the inferred results.
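
The residual diagnostics listed above could be prepared along these lines. This is a minimal sketch on synthetic data that computes the numbers behind a residuals-vs-fitted plot and a Q-Q plot without drawing them; it uses NumPy and the standard library's `NormalDist` rather than a plotting or stats package. All variable names and values are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)   # synthetic data, not project data

# Simple OLS fit with an intercept.
design = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta
residuals = y - fitted   # plot these against `fitted` to look for patterns

# Q-Q pairs: sorted standardised residuals vs standard normal quantiles.
n = len(residuals)
probs = (np.arange(1, n + 1) - 0.5) / n
theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
observed = np.sort((residuals - residuals.mean()) / residuals.std(ddof=1))

# A correlation close to 1 means the Q-Q points lie near a straight line.
qq_corr = np.corrcoef(theoretical, observed)[0, 1]
print(f"Q-Q correlation: {qq_corr:.3f}")
```

Passing `fitted` and `residuals`, or `theoretical` and `observed`, to a scatter plot gives the two diagnostic charts.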

Challenges

  • Selection bias in Foursquare API responses.

    • The Foursquare API returned at most 50 locations per set of Bike Station coordinates.
    • This restricted the data available to assess the full population of locations surrounding each Bike Station, requiring cleaning and a reduced location distance threshold.
  • Low predictive strength of independent variables.

    • All available geospatial data showed weak linear relationships with the CityBike station dependent variables.
    • The resulting conclusions are low model effectiveness and a future goal of identifying alternative independent variables.
  • Multicollinearity

    • Strong multicollinearity was present between the most strongly correlated variables, Average Rating and Average Popularity.
    • These collinear variables had to be separated to meet the assumptions for accurate model predictions.