Cloud-Native Developer Environments

The term "cloud-native" has taken on a fairly broad meaning. When it came into use about a decade ago, it meant roughly "built to run in the cloud." Since then, it has evolved into a set of principles and best practices for operating 12-factor applications, and engineering teams can follow these principles for on-premises applications as well. There is now a foundation devoted to it and a large number of vendors offering tooling that they market as cloud-native.

I have been fortunate enough to serve as a software architect for many organizations as they modernized their engineering to adopt cloud-native best practices. A happy side effect of cloud-native is the opportunity to improve the developer experience by also modernizing the developer environment. I had the opportunity to witness and participate in the various approaches to these modernization efforts. What follows is a fictional Architectural Decision Record (ADR) that recounts my observations on this matter.

ADR-001: Cloud-Native, Ephemeral, Single-Tenant Developer Environments

Status: Fictional

Date: 2025-06-02

Deciders: You, dear reader.

Context:

Our current developer workflow faces several challenges impacting productivity and release velocity. These challenges center on operational congestion and conflict resolution in centralized development environments. Since ours is a microservice architecture and only a single release of any particular service can run there at a time, developers who are changing the same service in different pull requests must take turns deploying and testing their changes in that environment. Engineering teams encounter this issue more frequently when developing new features that require changes to multiple services implemented by different developers. Not only must each developer wait for their turn to deploy their version of the service, but they must also deploy corresponding versions of dependent services. The wait grows even longer if there are any developmental one-way doors, such as backward-breaking API or database schema changes.
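To get a feel for how taking turns adds up, here is a rough back-of-the-envelope sketch. It assumes all pull requests are ready at the same moment and that each one holds the shared environment for the same amount of time; the function name and the numbers are hypothetical, not measurements.

```python
def waiting_times(pull_requests: int, minutes_per_turn: int) -> list[int]:
    """Minutes each pull request waits before it gets the shared environment."""
    return [position * minutes_per_turn for position in range(pull_requests)]


# Hypothetical numbers: 4 pull requests touch the same service and each
# one needs the shared environment for 30 minutes.
waits = waiting_times(pull_requests=4, minutes_per_turn=30)
total_wait = sum(waits)                 # 0 + 30 + 60 + 90 = 180 minutes
average_wait = total_wait / len(waits)  # 45.0 minutes
```

The total wait grows quadratically with the number of competing pull requests, which is why congestion gets noticeably worse as more developers change the same service.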

Cloud-Native Aspirations & Parity:

We aim to leverage cloud-native principles for better automation and scalability.
There's a need for improved dev/prod parity to reduce "works on my machine" issues and ensure smoother deployments (that is, a lower change failure rate).
Automation for provisioning and deploying environments is crucial.

For your consideration, here are the terms used by all the approaches described below, along with their definitions.

In-house services: engineering teams own and maintain these Kubernetes-hosted services.
IaaS-like: Message brokers, databases, caches, search engines, service mesh.
PaaS-like: Martech, ecommerce, chat, API gateway, authentication, AI services, CMS, analytics. These are dependencies needed by the in-house services under development. As such, they don't have to be fully functional.
Stack: The core set of in-house services, IaaS, and PaaS assets needed by a developer in any particular team or division to vet their changes. Different teams or divisions may have different stacks. Specific stacks may include or exclude certain services.
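The definition of a stack can be made concrete as a small data structure. The sketch below is illustrative only: the `Stack` class, its fields, and the example service names are assumptions and are not prescribed by this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Hypothetical model of a team's stack; all names are illustrative."""

    name: str
    in_house_services: frozenset[str]          # Kubernetes-hosted, team-owned
    iaas_assets: frozenset[str] = frozenset()  # brokers, databases, caches
    paas_assets: frozenset[str] = frozenset()  # authentication, CMS, analytics

    def subset(self, services: set[str]) -> "Stack":
        """Return the same stack limited to the given in-house services."""
        unknown = set(services) - self.in_house_services
        if unknown:
            raise ValueError(f"not part of stack {self.name!r}: {sorted(unknown)}")
        return Stack(self.name, frozenset(services), self.iaas_assets, self.paas_assets)


# Example stack for an imaginary checkout team.
checkout = Stack(
    name="checkout",
    in_house_services=frozenset({"cart", "orders", "payments"}),
    iaas_assets=frozenset({"postgres", "kafka"}),
    paas_assets=frozenset({"auth"}),
)

# A developer working on a cart change may only need part of the stack.
task_stack = checkout.subset({"cart", "orders"})
```

Modeling a stack explicitly makes it possible to include or exclude services per team or per task without touching the shared IaaS and PaaS definitions.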

Decision Drivers:

Increase developer velocity and reduce friction.
Improve stability and isolation for development activities.
Enhance dev/prod parity.
Leverage cloud-native tooling and practices.
Reduce time spent on environment-related troubleshooting.

Considered Alternatives:

There are different ways to configure a developer environment in a cloud-native way. We also include the non-cloud-native option here for comparison.

Traditional Shared Environment (No Changes to the Status Quo):

Description: A shared, persistent, multi-tenant environment where all developers deploy and test their changes.

Pros: Lowest compute costs (theoretically, due to resource sharing).

Cons:

Highest congestion.
Prone to environmental issues that are hard to resolve.
Developers "wait their turn," leading to low morale.
It is challenging to manage conflicting changes and dependencies.

Entire Stack Per Developer (Dedicated VMs):

Description: Each developer provisions their own Kubernetes cluster and corresponding IaaS and PaaS assets similarly to production using IaC.

Pros:

Lowest congestion.
Lowest lead time for changes (once the environment is up).

Cons:

Highest compute costs.
Complex and potentially slow provisioning process.

Entire Stack Per Developer (Kubernetes only):

Description: Each developer is assigned their own Kubernetes cluster, where they deploy all in-house services, as well as IaaS and PaaS assets.

Pros:

Low congestion.
Potentially less expensive, because the IaaS and PaaS assets share the same dedicated VMs as the Kubernetes cluster.

Cons:

Since IaaS and PaaS will not run under Kubernetes in production, there is an increased risk of issues arising from the gap in dev/prod parity.
There is less flexibility in technology choices as cloud vendor-specific IaaS/PaaS technologies that cannot run in Kubernetes are not viable options.
Stateful systems, typical for IaaS and PaaS, tend to be less stable in Kubernetes.

Decision:

We will adopt ephemeral, single-tenant developer environments provisioned on-demand within a Kubernetes cluster, utilizing per-developer namespaces. IaaS and PaaS assets will be persistent and shared across all developers.

Platform: There is a single developer Kubernetes cluster, where each developer deploys the entire stack of in-house services to their non-shared namespace.

Scope: Each developer namespace will host an "entire stack" or a relevant, configurable subset of microservices required for their current task.

Lifecycle: Environments will be ephemeral, meaning they can be quickly created, reset, and destroyed.

Deployment: Per-developer namespace deployment.
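As a minimal sketch of per-developer namespaces, the following builds a Namespace manifest as a plain dictionary. The `apiVersion`, `kind`, and `metadata` fields are standard Kubernetes manifest fields; the `dev-` prefix, the `example.com/...` label keys, and the function names are assumptions made for illustration.

```python
import json
import re


def slugify(value: str) -> str:
    """Lowercase the value and collapse anything outside [a-z0-9] into hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a name from {value!r}")
    return slug


def namespace_name(developer: str) -> str:
    """Namespace names are DNS labels, so keep them to at most 63 characters."""
    return f"dev-{slugify(developer)}"[:63].rstrip("-")


def namespace_manifest(developer: str, stack: str) -> dict:
    """Build a Namespace manifest; the label keys are hypothetical."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": namespace_name(developer),
            "labels": {
                "example.com/owner": slugify(developer)[:63].rstrip("-"),
                "example.com/stack": slugify(stack)[:63].rstrip("-"),
            },
        },
    }


manifest = namespace_manifest("Jane.Doe@example.com", "checkout")
print(json.dumps(manifest, indent=2))
```

Labeling each namespace with its owner and stack is one possible way to let later automation find, reset, or destroy an environment without keeping a separate inventory.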

Consequences:

What are the benefits and drawbacks of this hybrid approach, compared to the alternatives, through the lens of the decision drivers?

Pros:

Medium-low congestion: Significant reduction in developer blocking and waiting.
Improved Isolation: Developers can work on features, bug fixes, and experiments without impacting others.
Faster Feedback Loops: Quicker to deploy and test changes in an isolated setting.
Enhanced Dev/Prod Parity: Using Kubernetes for stateless in-house services and cloud vendor IaaS and PaaS mirrors typical production environments more closely.
Facilitates Experimentation: Easy to spin up environments for testing risky changes or new ideas.

Cons:

Medium-high compute costs: While potentially lower than dedicated VMs for everything, running multiple full stacks (even if scaled down) will incur higher cloud bills than a single shared environment.
Higher certificate/identity management overhead: Managing access and identities for numerous ephemeral environments and their services might require robust automation.
Increased complexity: The automation for these environments requires initial setup and ongoing maintenance.
Resource Management: Requires careful monitoring and policies for resource consumption and cleanup of idle environments.
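One possible policy for resource consumption is a ResourceQuota per developer namespace. The sketch below builds such a manifest as a plain dictionary and estimates how many quota-sized environments a cluster could hold. The keys under `spec.hard` are standard ResourceQuota keys; the quota name, the function names, and all numbers are hypothetical.

```python
def resource_quota(namespace: str, cpu_cores: int, memory_gib: int, max_pods: int) -> dict:
    """Build a ResourceQuota manifest that caps one developer namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "dev-env-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu_cores),
                "requests.memory": f"{memory_gib}Gi",
                "pods": str(max_pods),
            }
        },
    }


def max_environments(cluster_cpu: int, cluster_mem_gib: int,
                     cpu_cores: int, memory_gib: int) -> int:
    """Environments that fit if every one of them used its full quota."""
    return min(cluster_cpu // cpu_cores, cluster_mem_gib // memory_gib)


# Hypothetical sizing: 4 cores and 8 GiB per developer environment
# on a cluster with 64 cores and 256 GiB available for workloads.
quota = resource_quota("dev-jane-doe", cpu_cores=4, memory_gib=8, max_pods=30)
capacity = max_environments(64, 256, cpu_cores=4, memory_gib=8)  # min(16, 32) = 16
```

The capacity estimate is deliberately pessimistic, since it assumes every environment consumes its whole quota at once.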

Risks:

Where can cloud-native development environments go wrong?

Backwards-Breaking Changes Management

While isolated environments help with testing, coordination for changes to database schemas, APIs, and message formats across teams and services remains crucial. Discipline and cross-team communication are still required.
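Part of that discipline can be automated. The following is a simplified sketch of a compatibility check between two versions of a message format, each described as a mapping of field name to type. It assumes consumers ignore unknown fields, so added fields are treated as compatible; the function name and the example fields are hypothetical.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes that would break consumers of the old format."""
    problems = []
    for name, old_type in sorted(old.items()):
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} ({old_type} -> {new[name]})")
    return problems


order_v1 = {"order_id": "string", "total": "number"}
order_v2 = {"order_id": "string", "total": "number", "currency": "string"}
order_v3 = {"order_id": "string", "total": "string"}

additive = breaking_changes(order_v1, order_v2)  # only adds a field
breaking = breaking_changes(order_v1, order_v3)  # changes the type of "total"
```

A check like this could run in CI to flag one-way doors early, but it does not replace agreeing on a migration plan with the teams that consume the format.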

Integration Bugs & Testing Strategy

Test Automation: Comprehensive test automation is paramount for confidence in isolated environments.

Divergence from Prod: Ensuring that the "isolated dev env" configuration does not significantly drift from "what is currently running in prod" requires diligence and good IaC practices.

Accurate End-to-End Testing: A strategy for less frequent, more comprehensive integration testing (potentially in a short-lived, shared "staging-like" environment built on the same principles) might still be needed to catch issues missed in isolated environments.

Remedy for Integration Issues: The process for identifying and fixing integration bugs needs to be efficient; relying solely on deploying to a shared integration environment post-development is what we aim to move away from.

CI/CD Pipeline Complexity

CI (Continuous Integration) and CD (Continuous Delivery/Deployment) pipelines can become more complicated to manage when deploying to numerous dynamic environments.

Ensuring efficient and fast pipeline execution for these ephemeral environments is key.

Assumptions:

What must already be in place in order to fully leverage the benefits of a more modern developer environment?

Stack Specifics & Test Doubles

Deployment: Kubernetes does not directly support any notion of a stack. Engineering must either implement stacks in-house or turn to third-party tooling.

Decommissioning/Reset: Environments can be regularly decommissioned (e.g., nightly, weekly) or reset to a clean state to manage costs and prevent drift.
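A decommissioning job could select environments by how long they have been idle. This is a minimal sketch, assuming the time of the last deployment is tracked for each namespace; the `DevEnvironment` class, the function name, and the 24-hour threshold are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DevEnvironment:
    namespace: str
    last_deployed_at: datetime


def idle_namespaces(envs: list[DevEnvironment], now: datetime,
                    max_idle: timedelta) -> list[str]:
    """Namespaces whose last deployment is older than the idle threshold."""
    return [env.namespace for env in envs if now - env.last_deployed_at > max_idle]


envs = [
    DevEnvironment("dev-alice", datetime(2025, 6, 1, 9, 0, tzinfo=timezone.utc)),
    DevEnvironment("dev-bob", datetime(2025, 6, 2, 8, 0, tzinfo=timezone.utc)),
]
now = datetime(2025, 6, 2, 12, 0, tzinfo=timezone.utc)

# dev-alice has been idle for 27 hours, dev-bob for 4 hours.
to_delete = idle_namespaces(envs, now, max_idle=timedelta(hours=24))
```

Passing `now` in as a parameter keeps the selection logic easy to test; a scheduled job would supply the current time and then delete or reset the returned namespaces.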

Test Doubles: Engineering must implement test doubles for "out-of-stack" services, including complex third-party SaaS and legacy monoliths that cannot be easily containerized.
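A test double does not have to be elaborate. The sketch below uses only the Python standard library to serve canned JSON responses in place of an out-of-stack dependency. The path `/v1/users/42` and the response body are made up for illustration and do not correspond to any real service.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned responses keyed by request path (hypothetical endpoint and payload).
CANNED_RESPONSES = {
    "/v1/users/42": {"id": 42, "plan": "trial"},
}


class StubHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = CANNED_RESPONSES.get(self.path)
        status = 200 if body is not None else 404
        payload = json.dumps(body if body is not None else {"error": "not stubbed"}).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # suppress per-request logging


def start_stub(port: int = 0) -> HTTPServer:
    """Start the stub in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

An in-house service under development would be pointed at `http://127.0.0.1:<port>` through its configuration, where `<port>` is `server.server_port` of the server returned by `start_stub()`. Call `server.shutdown()` when the stub is no longer needed.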

Cloud-Native Readiness

CI/CD: Robust pipelines are in place to support this model.

IaC: Engineering provides and supports Infrastructure as Code (e.g., Helm, Kustomize, Terraform) to accommodate all automated provisioning and deployment requirements.

Observability: Developers will need access to the APM and application logs for their isolated environments.

Developer Skills: Developers are comfortable with basic Kubernetes concepts and the tooling for managing their ephemeral environments.

Microservice Architecture: The existing system employs a microservice architecture, where services are independently deployable, configurable, and adhere to other 12-factor best practices.

Next Steps

This ADR is fictitious in that it is based on an organization that is an amalgamation of several organizations that I have worked with in the past. Your own organization may merit a different decision, perhaps one of the alternatives listed here or a different variation altogether. A lot of this depends on your past pain points and your current level of technological maturity. Feel free to reach out if you need help.
