How to kill staging without “testing in production”

Revelry Labs

An illustration of a person standing on a stage, pulling the rope to a long blue theatre curtain.

Most software teams have left behind the Waterfall(1) methodology in favor of approaches that move faster and flex more to accommodate changes and maximize the value of feedback. One thing that has come more slowly, though, is the movement away from waterfall-ish delivery practices like release schedules, code freezes, and extensive QA passes in a staging environment. Many of us hear stories from successful engineering organizations about shipping to production many times a day, but getting there ourselves can be easier said than done.

Kill staging?

The problems with staging environments have been pointed out before(2):

  • They require constant maintenance to stay in sync and up-to-date. As ad hoc changes accrue over time, many organizations lose track of the steps that would be required to rebuild the environment from scratch.
  • Staging data and usage patterns don’t match production, reducing the value of your testing.
  • As different people use staging simultaneously, their data and settings interact, injecting uncertainty and adding communication overhead: “Please nobody run any transactions on staging this afternoon while I’m testing the new TPS reports.”
  • Verifying changes in staging becomes a bottleneck in the release process, which can have cascading effects in the form of merge freezes and pull request backlogs.
  • It costs money to keep a separate environment running all the time.

There’s also a lot of good advice out there on practices that can help an organization move away from staging:

  • Have excellent automated testing and use a continuous integration service.
  • Invest in excellent error reporting and monitoring in production so you can catch problems immediately.
  • Deploy code and enable features in small increments to reduce risk and make it easy to track down what went wrong.
  • Use feature flags to separate “deploying the code” from “releasing the feature,” protecting real users from new code until it’s been verified.
  • Toggle the feature flags on for internal and/or alpha users to perform QA and other validation in production, while keeping an eye on that awesome monitoring.
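
To make the last two points concrete, here is a minimal sketch of a flag that separates deploying from releasing. Everything in it is an assumption for illustration: the `User` and `FeatureFlag` classes, the `example.com` internal domain, and the `render_report` function are hypothetical, and a real team would more likely use a feature-flag library or service than a hand-rolled class.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class User:
    email: str
    is_alpha: bool = False


@dataclass
class FeatureFlag:
    name: str
    enabled_for_everyone: bool = False  # flip to True to "release the feature"
    internal_domains: set[str] = field(default_factory=set)
    allow_alpha_users: bool = False

    def is_enabled(self, user: User) -> bool:
        if self.enabled_for_everyone:
            return True
        # Internal users are recognized by email domain (an assumption).
        domain = user.email.rsplit("@", 1)[-1].lower()
        if domain in self.internal_domains:
            return True
        return self.allow_alpha_users and user.is_alpha


def render_report(user: User, flag: FeatureFlag) -> str:
    # The new code path is deployed, but only reachable behind the flag.
    if flag.is_enabled(user):
        return "new TPS report"
    return "old TPS report"


flag = FeatureFlag(
    "new_tps_report",
    internal_domains={"example.com"},
    allow_alpha_users=True,
)
print(render_report(User("qa@example.com"), flag))
print(render_report(User("customer@elsewhere.test"), flag))
```

Internal and alpha users reach the new path while everyone else keeps the old one; setting `enabled_for_everyone` to `True` is the release.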

But is that enough?

Some organizations can make “test in production” a reality, but many aren’t there yet, and some may never want to be.

Feature flags are great for rolling out features, but I think using them to mitigate every kind of production risk quickly becomes ineffective. Even in the simplest cases, the code branches introduced by feature flags are a form of technical debt that needs to be cleaned up later. That makes the overhead questionable for small bug fixes, which can still introduce risk. At the other end of the scale, trying to use feature flags to cordon off an extensive refactoring can quickly add a lot of complexity and a substantial amount of tech debt. I’ve seen dogmatic use of feature flags result in a “bolt it on” culture where engineers would rather duplicate and tweak code for each flag combination than revisit any existing designs for fear of introducing non-flagged changes or unexpected interactions.
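
A quick way to see why flag combinations get out of hand: every independent boolean flag doubles the number of states the code can be in. The sketch below simply enumerates them; the flag names are made up for illustration.

```python
from itertools import product


def flag_combinations(flag_names):
    """Yield every on/off assignment for the given flags as a dict."""
    for values in product([False, True], repeat=len(flag_names)):
        yield dict(zip(flag_names, values))


# Hypothetical flags, not taken from any real codebase.
flags = ["new_checkout", "refactored_billing", "async_emails"]
combos = list(flag_combinations(flags))
print(len(combos))  # 2 ** 3 = 8
```

Three flags already mean eight combinations to reason about, and each additional flag doubles that number until old flags are cleaned up.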

Another thing that’s easy to neglect is that implementation in agile teams is often done without complete feature specifications, and it’s very helpful to get additional perspectives on the changes before they’re deployed. Sure, feature flags allow internal testers to access features before they’re generally available, but I think it’s better to tighten the feedback loop even more and get input from QA, design, product, and other stakeholders before changes get merged.

The missing piece? Review Apps

The best development experiences I’ve had were on teams that used some kind of “review app.” A review app is a standalone copy of your app, generated automatically for each pull request, that lets the changes from that pull request be tested in isolation. Doing things this way has many of the advantages of staging environments and few of the drawbacks:

  • Get feedback from QA, product, and stakeholders more quickly, even simultaneously with code review.
  • Allow code reviewers to play with the code in action without having to pull it down and run it locally.
  • Ship every PR to prod with confidence so you can keep moving forward, avoiding scary, bloated deployments and merge freezes.
  • Deployments are reproducible, and every review app is fresh and up-to-date.
  • Each feature gets its own isolated environment, avoiding the confusion of a shared application state.
  • Just deploying the review app forces the team to understand and document all of the things that need to be done to pave the way for the feature to be used, such as environment variables or dependencies on other systems, features, or settings.
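
The isolation and documentation points can be sketched as a single function that derives everything a review app needs from the pull request itself. This is only an illustration under stated assumptions: the naming scheme, the `review.example.com` domain, the environment variable names, and the sandbox payments URL are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class ReviewAppConfig:
    app_name: str
    url: str
    database_name: str
    env: dict


def slugify(text: str, max_length: int = 30) -> str:
    """Lowercase, replace runs of non-alphanumerics with '-', and trim."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return slug[:max_length].rstrip("-")


def review_app_config(app: str, pr_number: int, branch: str,
                      base_domain: str = "review.example.com") -> ReviewAppConfig:
    # One name per pull request keeps every environment isolated.
    name = f"{slugify(app)}-pr-{pr_number}"
    db_name = name.replace("-", "_")
    return ReviewAppConfig(
        app_name=name,
        url=f"https://{name}.{base_domain}",
        database_name=db_name,
        env={
            "APP_ENV": "review",
            "DATABASE_NAME": db_name,
            "GIT_BRANCH": branch,
            # Point integrations at sandboxes, never at production.
            "PAYMENTS_API_URL": "https://sandbox.payments.example.com",
        },
    )


config = review_app_config("TPS Reports", 482, "feature/new-cover-sheet")
print(config.url)
```

Writing the `env` mapping down in code is one way the review app ends up documenting what a feature needs before it can run anywhere.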

One drawback of this approach is that the tooling can be challenging for complex systems, and it can require some maintenance. At one organization we sometimes had to spin up separate sandbox apps for related services in order to test our work.

While review apps avoid the staleness problems of staging environments, they share one of staging’s weaknesses: they are not identical to production. You’ll likely be working with stubbed or sanitized data, and load isn’t going to be realistic. Testing against real data and real traffic probably does have to happen in production, where feature flags, canary deploys, science experiments(3), and monitoring are still your friends.
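
For readers who haven’t met the “science experiment” idea, here is a rough Python sketch of the pattern. It is not the API of the Ruby library in the footnote; `run_experiment`, `legacy_total`, and `new_total` are hypothetical names. The idea is to run the old and new code paths side by side in production, log any difference, and always return the old result.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("experiments")


def run_experiment(name: str, control: Callable[[], T],
                   candidate: Callable[[], T]) -> T:
    """Run both code paths, log any mismatch, and return the control result."""
    expected = control()  # errors in the existing path propagate as usual
    try:
        observed = candidate()
        if observed != expected:
            log.warning("experiment %s mismatch: control=%r candidate=%r",
                        name, expected, observed)
    except Exception:
        # A failing candidate must never affect the user-facing result.
        log.exception("experiment %s: candidate raised", name)
    return expected


def legacy_total(prices):
    return sum(prices)


def new_total(prices):
    return round(sum(prices), 2)


total = run_experiment(
    "order-total",
    lambda: legacy_total([1.5, 2.5]),
    lambda: new_total([1.5, 2.5]),
)
```

Because the caller only ever sees the control result, the candidate can be exercised against real data and load without being able to break anything for users.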

Finally, the cost can be a factor. If accounting is upset about paying for the staging server, they may not be interested in a plan that involves running a half-dozen different review apps at any given time. I think a lot of organizations with strong engineering cultures do recognize the payoffs of making development smoother and getting things done right, though.

How to build review apps?

GitLab and Heroku both offer Review App features, though the focus is slightly different. GitLab handles the orchestration on the code and CI side, but not the infrastructure proper(4). Heroku does the hosting part, of course, and its pipelines integrate with source control almost seamlessly, making setup easy for any simple-ish web app(5). You may need some custom scripts to seed a database or wire up integrations, but the heavy lifting is done for you.
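
As an example of the kind of custom script mentioned above, here is a minimal seed script sketch. It assumes a `users` table and two made-up reviewer accounts, and it uses SQLite from the standard library as a stand-in for whatever database the real app uses. The one design point worth copying is that it is safe to run more than once.

```python
import sqlite3

# Hypothetical seed data: accounts reviewers can use in a fresh review app.
SEED_USERS = [
    ("qa@example.com", "QA Tester"),
    ("design@example.com", "Design Reviewer"),
]


def seed(conn: sqlite3.Connection) -> int:
    """Create the table if needed, insert seed rows, return rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(email TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    before = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    # INSERT OR IGNORE skips rows whose primary key already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO users (email, name) VALUES (?, ?)",
        SEED_USERS,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    return after - before


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(seed(connection), "rows added")
```

A script like this would typically be wired to whatever post-deploy hook your platform provides, so that every new review app starts with usable data.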

If you’re hosting elsewhere, you’ll probably need to roll your own. How much work that is depends on your environment, which could be anything from a more batteries-included service like AWS Elastic Beanstalk(6) all the way down to custom scripts running on bare metal.
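
If you do roll your own, the core of it is usually a small piece of glue that maps pull request events to “deploy” and “tear down” calls. The sketch below is an assumption-heavy outline rather than a real integration: the action names are modeled on GitHub’s pull request webhook events, and the `deploy` and `destroy` callables stand in for whatever your hosting environment actually requires.

```python
from typing import Callable, Dict


class ReviewAppManager:
    """Maps pull request events to deploy and destroy calls."""

    def __init__(self, deploy: Callable[[str, str], None],
                 destroy: Callable[[str], None]):
        self._deploy = deploy
        self._destroy = destroy
        self.active: Dict[int, str] = {}  # PR number -> app name

    def handle(self, action: str, pr_number: int, commit_sha: str) -> None:
        name = f"review-pr-{pr_number}"
        if action in ("opened", "reopened", "synchronize"):
            # Deploy (or redeploy) the app at the latest commit.
            self._deploy(name, commit_sha)
            self.active[pr_number] = name
        elif action == "closed":
            # Tear down only apps we actually created, so repeats are harmless.
            if self.active.pop(pr_number, None) is not None:
                self._destroy(name)


calls = []
manager = ReviewAppManager(
    deploy=lambda name, sha: calls.append(("deploy", name, sha)),
    destroy=lambda name: calls.append(("destroy", name)),
)
manager.handle("opened", 7, "abc123")
manager.handle("synchronize", 7, "def456")
manager.handle("closed", 7, "def456")
print(calls)
```

Tearing the app down when the pull request closes is also what keeps the number of running environments, and therefore the bill, bounded.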

Conclusion

If you’d like to get rid of staging and streamline your delivery practices, but think you could never get away with deploying every merge straight to production, then review apps might be the tool for you.

Footnotes

  1. https://en.wikipedia.org/wiki/Waterfall_model
  2. https://readwrite.com/2016/01/22/staging-servers/
  3. https://github.com/github/scientist
  4. https://docs.gitlab.com/ee/ci/review_apps/#introduction
  5. https://blog.heroku.com/heroku-review-apps-ga
  6. https://blog.scottlogic.com/2018/02/23/review-apps-beanstalk.html