Saving Realtime Transit Data to a DataFrame

The earlier post, "Getting Realtime Transit Data from the STM API using Python", looked at getting realtime transit data and displaying it in a notebook.

This post explores saving the same data to a Polars DataFrame so it's easy to analyze. I like working with Polars because it has a really intuitive API.

Python setup

Install polars into the same environment as used in the Getting Realtime Transit Data post:

pip install polars

Notebook setup

In the first cell of the notebook, import the required libraries:

import requests
import gtfs_realtime_pb2
import polars as pl

The first two here are the same as in the Getting Realtime Transit Data post.

Getting the data

The following code comes from the earlier post; the only difference is that it is now wrapped in a function, realtime. Instead of hardcoding the API key, we add an api_key parameter to the function. This makes the code easier to work with in a notebook, as the function can be called multiple times to get realtime data for multiple points in time.

def realtime(api_key):
    url = "https://api.stm.info/pub/od/gtfs-rt/ic/v2/vehiclePositions"
    headers = {
        "accept": "application/x-protobuf",
        "apiKey": api_key,
    }
    response = requests.get(url, headers=headers)
    # Stop early on an HTTP error (for example, an invalid API key)
    response.raise_for_status()

    # The response body is a serialized GTFS Realtime FeedMessage
    protobuf_data = response.content

    message = gtfs_realtime_pb2.FeedMessage()
    message.ParseFromString(protobuf_data)

    return message
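One way to keep the key out of the notebook entirely is to read it from an environment variable. This is a minimal sketch, not part of the original post: the helper name load_api_key and the variable name STM_API_KEY are both hypothetical choices.

```python
import os


def load_api_key(env_var="STM_API_KEY"):
    """Return the API key stored in an environment variable.

    The default variable name is an assumption for illustration;
    use whatever name you exported the key under.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail with a clear message instead of sending an empty key to the API
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return api_key
```

With this in place, the function can be called as realtime(load_api_key()).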

Processing to a DataFrame

message contains the realtime data. To process the fields we want from that realtime data, we will initially represent each returned entity as a dictionary and then store each of those dictionaries in a list.

[
    {'trip_id':'123', 'route_id':'45', 'longitude'....},
    {'trip_id':'456', 'route_id':'29', 'longitude'....},
    ...
]

With a list of dictionaries, where each dictionary represents one entity, we'll then convert it to a DataFrame.

Here's what the code looks like when added to the earlier code in the realtime function. Note that the function now returns the DataFrame instead of message.

def realtime(api_key):
    ...
    # Create a list to store each entity in
    data = []

    # Get the timestamp from the message header
    header_timestamp = message.header.timestamp

    # Loop through the entities
    for entity in message.entity:

        # Create an empty dict to store the entity information
        entity_data = {}

        # Extract all the relevant fields and add them to the empty dict
        entity_data['header_timestamp'] = header_timestamp
        entity_data['entity_id'] = entity.id

        trip = entity.vehicle.trip
        entity_data['trip_id'] = trip.trip_id
        entity_data['start_time'] = trip.start_time
        entity_data['start_date'] = trip.start_date
        entity_data['route_id'] = trip.route_id

        position = entity.vehicle.position
        entity_data['latitude'] = position.latitude
        entity_data['longitude'] = position.longitude
        entity_data['bearing'] = position.bearing
        entity_data['speed'] = position.speed

        entity_data['current_stop_sequence'] = entity.vehicle.current_stop_sequence
        entity_data['current_status'] = entity.vehicle.current_status
        entity_data['timestamp'] = entity.vehicle.timestamp

        vehicle = entity.vehicle.vehicle
        entity_data['vehicle_id'] = vehicle.id

        entity_data['occupancy_status'] = entity.vehicle.occupancy_status

        # Add this record to the list
        data.append(entity_data)


    # Convert the list of dicts to a polars DataFrame
    df = pl.DataFrame(data)

    return df
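One thing to watch for: reading a field that a vehicle did not report still gives back a default value (such as 0.0 for speed) rather than an error, so missing values and real zeros look the same in the DataFrame. The sketch below is one possible way to tell them apart. It assumes gtfs_realtime_pb2 was generated from the proto2 GTFS Realtime definition, where optional fields support HasField; the helper name value_or_none is hypothetical.

```python
def value_or_none(msg, field_name):
    """Return the field's value if it was set on the message, otherwise None.

    Assumes msg is a protobuf message whose optional fields support HasField
    (the case for proto2-generated classes).
    """
    if msg.HasField(field_name):
        return getattr(msg, field_name)
    return None
```

Inside the loop, this could replace a direct read, for example entity_data['speed'] = value_or_none(position, 'speed'), so that unreported values end up as nulls in the DataFrame.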

Now call the function with an API key:

df = realtime("<api_key>")

This returns a Polars DataFrame:

View of transit data in dataframe

We can now start to explore the data. For example, we can use a filter to get the buses currently running on a particular route. Here, I use route 45... because it's the best in Montreal.

df.filter(pl.col("route_id") == "45")