On the Intersection of AI and Urban Mobility Systems

A reflection on how machine learning is reshaping the way cities move — and why the gap between research and implementation is still wider than we admit.

There is a moment in every research project when the gap between what works in the lab and what works on the street becomes impossible to ignore. For me, that moment arrived on a Tuesday morning in February, watching a prototype intersection-control system hesitate for two full seconds before making a decision a human traffic officer would have made in a fraction of a second.

Two seconds sounds trivial. On a road with vehicles moving at 60 km/h, two seconds is about thirty-three metres of travel. It is the difference between a safe merge and a near miss.

The literature is not exaggerating the gains. It is simply measuring them under conditions cities cannot always guarantee.

The Promise

The case for AI in urban mobility is compelling and well-documented. Adaptive signal control can cut average intersection delay; route optimisation redistributes flow before it congeals into congestion; predictive models anticipate a broken-down bus or a flooded underpass minutes before the disruption cascades.

Delay reduction: 15–25%
Sensors in trial: 847
Live, incident-free: 14 months

These are real numbers from real deployments. But every one of them carries an asterisk, and the asterisk is where the engineering actually lives.

Where Theory Meets Tarmac

The systems we build in research assume a certain quality of input: a camera never occluded by a delivery truck, a GPS trace accurate to two metres, a dataset that reflects an ordinary Monday rather than a Monday during a marathon, a religious holiday, and the school run all at once.

Three failure modes I now design around

  1. Sensor dropout. A node goes dark mid-peak. The model must degrade, not panic.
  2. Distribution shift. The model trained on Melbourne is asked to perform in Karachi.
  3. Latency spikes. The right decision delivered too late is the wrong decision.
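
The first failure mode can be made concrete with a small sketch. It assumes each sensor reports a vehicle count with a timestamp, treats any reading older than a staleness limit as missing, and substitutes a historical average for that sensor. The names `Reading`, `usable_counts`, `max_age_s`, and `historical_mean` are hypothetical, chosen for illustration rather than taken from the deployed system.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Reading:
    count: int          # vehicles counted in the last interval
    timestamp_s: float  # when the sensor last reported, in seconds


def usable_counts(
    readings: Dict[str, Optional[Reading]],
    historical_mean: Dict[str, float],
    now_s: float,
    max_age_s: float = 10.0,
) -> Tuple[Dict[str, float], float]:
    """Return per-sensor counts plus the fraction of sensors that were live.

    Missing or stale sensors fall back to a historical mean (an assumed
    policy), so the controller always receives a complete input.
    """
    counts: Dict[str, float] = {}
    live = 0
    for sensor_id, fallback_value in historical_mean.items():
        reading = readings.get(sensor_id)
        if reading is not None and now_s - reading.timestamp_s <= max_age_s:
            counts[sensor_id] = float(reading.count)
            live += 1
        else:
            counts[sensor_id] = fallback_value
    coverage = live / len(historical_mean) if historical_mean else 0.0
    return counts, coverage
```

A caller could use the returned coverage figure to hand control to a rule-based plan once too many sensors have gone dark, rather than letting the model guess from mostly imputed inputs.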

Lab vs Field vs Deployment

The same system lives three very different lives. The three passages below trace one intersection controller through each in turn: lab, then field, then deployment.

Clean room. Synthetic flows, perfect sensing, unlimited compute. The model converges quickly and the metrics look spectacular. This is where most papers stop.

It is also where the most dangerous assumptions are quietly baked in — chiefly that the inputs at inference will resemble the inputs at training.
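
One crude way to test that assumption at runtime is to compare a live feature against its training distribution. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The function names and the threshold of 2.0 are illustrative assumptions, and a production system would likely want a more careful drift test than this.

```python
from statistics import mean, stdev
from typing import Sequence


def mean_shift_score(train: Sequence[float], live: Sequence[float]) -> float:
    """Distance between live and training means, in training std units."""
    spread = stdev(train)  # sample standard deviation; needs >= 2 points
    gap = abs(mean(live) - mean(train))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


def looks_shifted(train: Sequence[float], live: Sequence[float],
                  threshold: float = 2.0) -> bool:
    """Flag the live window when its mean has drifted past the threshold."""
    return mean_shift_score(train, live) > threshold
```

Fed per-interval vehicle counts, a check like this would flag a marathon Monday as unlike the training data, which is a signal to trust the model less, not a correction in itself.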

Reality intrudes. A pilot at four instrumented intersections introduces occlusion, packet loss, and a maintenance crew who unplug a sensor for a week without telling anyone.

Accuracy drops. More importantly, variance rises — and variance, not mean error, is what makes operators nervous.
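
A toy illustration of why the mean hides what operators care about: the two error series below are invented for this example, not measured, and are built to share the same mean while differing sharply in spread and worst case.

```python
from statistics import mean, pstdev

# Invented per-cycle prediction errors in seconds (illustrative only).
lab_errors = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]      # tightly clustered
field_errors = [0.0, 1.0, 14.0, 0.0, 15.0, 0.0]  # same mean, heavy tail


def summarise(errors):
    """Mean, population standard deviation, and worst case of an error series."""
    return {"mean": mean(errors), "std": pstdev(errors), "worst": max(errors)}


lab_summary = summarise(lab_errors)
field_summary = summarise(field_errors)
```

Both summaries report the same mean error, yet the field series has a far larger standard deviation and a much worse worst case. A benchmark table that lists only the mean would treat the two as equivalent.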

It has to just work. No researcher on call, a hard latency ceiling, and a rule-based fallback that takes over the instant confidence drops. Glory is optional; predictability is not.

This is the phase nobody writes about, and the only one that matters to the people who actually use the road.

A Quick Comparison

Approach               Latency    Robustness to dropout   Interpretability
Rule-based controller  Very low   High                    High
Deep RL (end-to-end)   High       Low                     Low
Hybrid (ours)          Low        High                    Medium
Cloud-only inference   Variable   Very low                Medium

The hybrid row is not the most accurate option on paper. It is the one I would put on a road my own family drives through.

Field Notes

[Image gallery: a few frames from the Brisbane deployment, with captions.]

The Fallback, in Code

The whole philosophy reduces to a few lines. If the model cannot commit within budget, it does not get to decide.

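import time

def now_ms():
    return time.monotonic() * 1000.0  # monotonic, so clock adjustments cannot skew the budget

def decide(state, model, fallback, budget_ms=800):
    start = now_ms()
    # infer() is assumed to return None if it cannot finish by the deadline.
    pred = model.infer(state, deadline=budget_ms)
    if pred is None or pred.confidence < 0.6 or now_ms() - start > budget_ms:
        # Degrade gracefully: predictable beats clever.
        return fallback.decide(state)
    return pred.action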
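
The snippet above relies on `model.infer` honouring its `deadline` argument; if the call blocks, the budget check only runs after the delay has already happened. One possible way to enforce the ceiling from the caller's side is to run inference in a worker thread and stop waiting once the budget expires. This is a sketch under assumptions, not the deployed code: `Prediction`, `decide_with_hard_deadline`, and the use of plain callables for `infer` and `fallback` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from dataclasses import dataclass


@dataclass
class Prediction:
    action: str
    confidence: float


# Shared pool: a slow inference call keeps running in the background
# instead of blocking the caller while an executor shuts down.
_POOL = ThreadPoolExecutor(max_workers=2)


def decide_with_hard_deadline(state, infer, fallback,
                              budget_ms=800, min_confidence=0.6):
    """Return the model's action only if it arrives in time and is confident."""
    future = _POOL.submit(infer, state)
    try:
        pred = future.result(timeout=budget_ms / 1000.0)
    except FutureTimeout:
        # Stop waiting; the late result is discarded when it eventually arrives.
        return fallback(state)
    except Exception:
        # Treat any inference error the same way as a missed deadline.
        return fallback(state)
    if pred is None or pred.confidence < min_confidence:
        return fallback(state)
    return pred.action
```

The trade-off is that an abandoned inference call still occupies a worker thread until it finishes, so a run of slow calls can exhaust the pool. A real deployment would need to bound or monitor that.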

Not glamorous. Not publishable on its own. But it is the reason the system has run for fourteen months without an incident.

Frequently Asked

Does the fallback make the AI pointless?
No. The model still drives the large majority of decisions and captures most of the delay reduction. The fallback is insurance, not the policy. It changes the risk profile, not the average performance.

Why not just train a bigger model?
Scale helps accuracy but not the tail. The failures that matter in deployment are distributional and infrastructural, not capacity-limited. A larger model fails more confidently, which is worse.

How do you measure success after launch?
Operational logs, not benchmark scores. Fourteen months of telemetry showing safe degradation under real dropout is worth more than any single accuracy figure on a slide.
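
As a rough idea of what measuring from operational logs could look like, the sketch below reduces a list of decision records to a fallback rate and tail latency. The record format (dictionaries with `latency_ms` and `used_fallback` keys) and the function name are assumptions made for illustration; the real telemetry schema is not described here.

```python
from typing import Dict, List


def summarise_decisions(log: List[Dict]) -> Dict[str, float]:
    """Fallback rate and nearest-rank p99 latency from decision records."""
    if not log:
        raise ValueError("log is empty")
    latencies = sorted(rec["latency_ms"] for rec in log)
    n = len(latencies)
    # Nearest-rank percentile: integer ceiling of 0.99 * n, avoiding float error.
    rank = -(-99 * n // 100)
    fallbacks = sum(1 for rec in log if rec["used_fallback"])
    return {
        "fallback_rate": fallbacks / n,
        "p99_latency_ms": latencies[rank - 1],
        "max_latency_ms": latencies[-1],
    }
```

Tracking the fallback rate alongside tail latency shows whether degradation is happening safely and how often, which a single accuracy score cannot.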

What Research Can Do Better

Three changes would narrow the gap considerably — and none of them require a new architecture.

  • Publish failures as rigorously as successes. The lab-to-field delta is a result, not an embarrassment.
  • Lengthen partnerships. A twelve-month grant barely covers a sensor-procurement cycle; embed for years, not months.
  • Design for graceful degradation. When the AI fails — and it will — the fallback should be safe, predictable, and legible to whoever inherits it at 3am.

This post draws on ongoing work funded by the Australian Research Council. Views expressed are my own.