The automation had been working fine in development for two weeks. Every test passed. We deployed it. It did nothing.
No errors. No crashes. Just… nothing. The workflow ran, reached the step we’d built, and silently failed to interact with the page it was supposed to work on. Elements we could see were unreachable. Actions that worked locally produced no effect in production.
We spent the first hour looking at our code. Rewrote selectors. Added waits. Added more waits. Reordered steps. Nothing changed.
The second hour we looked at the environment. Maybe a network issue. Maybe a proxy interfering. We ruled both out.
The third hour we finally looked at the one thing we’d treated as a given: the version of the browser we were running in production. It was several major versions behind what we’d been using locally. The page we were automating had JavaScript that the production browser couldn’t fully execute. It rendered partially, failed silently, and our automation was faithfully clicking on a page that was already broken at load time.
Updating the browser version fixed it in five minutes.
The lesson here isn’t about browsers. It’s about assumptions. We had assumed the production environment matched our local one closely enough not to matter. It didn’t. And because the failure mode was silent — no exception, no timeout on our end, just actions that had no effect — we kept looking in the wrong places.
The bigger lesson: when something works perfectly locally and does nothing in production with no error, stop looking at your code first. Look at what’s different about the environment. Start with the most boring possible explanation and rule it out before you touch anything.
We now pin runtime versions explicitly across environments. Boring fix. Completely worth it.
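Beyond pinning, a cheap safety net is to fail loudly at startup if the runtime drifts from what was validated. This is a hypothetical sketch, not our actual code: the pinned major version is made up, and in practice the reported version string would come from your automation driver (Selenium, for example, exposes it via `driver.capabilities["browserVersion"]`) rather than being passed in by hand.

```python
# Hypothetical startup guard: refuse to run if the browser's major version
# differs from the one the automation was validated against. A loud failure
# here is far cheaper than silent no-ops in production.

PINNED_MAJOR = 124  # illustrative value, not a real pin from our setup


def assert_browser_version(reported: str, pinned_major: int = PINNED_MAJOR) -> None:
    """Raise if the reported browser version's major component drifts."""
    major = int(reported.split(".")[0])
    if major != pinned_major:
        raise RuntimeError(
            f"Browser major version {major} != pinned {pinned_major}; "
            "refusing to run against an unvalidated runtime."
        )


# Matching major version: passes silently.
assert_browser_version("124.0.6367.91")
```

Run this once at the start of the workflow; a version mismatch then surfaces as an immediate, explicit error instead of three hours of staring at selectors.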