Senior / Staff DevOps & Site Reliability Engineer

Full-Time | Remote / Hybrid | Engineering

About the Role

We're scaling fast — and we want to do it without the chaos that usually comes with it.

As our Senior/Staff DevOps & Site Reliability Engineer, you'll own the infrastructure that powers Superscale's AI platform. But this isn't a traditional "keep the lights on" SRE role. You'll be building an infrastructure layer designed for a new kind of engineering team: one where every developer works alongside multiple AI coding agents, and the infra itself is a force multiplier.

You'll be our first dedicated infrastructure hire, which means you get to set the standard — from observability and incident response to CI/CD pipelines and cloud architecture. You'll make sure we scale smoothly as load, team size, and AI workloads grow, and you'll be the counterpart engineers rely on to ship systems that are resilient from day one.

We believe in hiring for breadth and building leverage through AI tooling. We're not growing the team by stacking people in the same roles — we're hiring unique skill sets and amplifying everyone through best-in-class infrastructure and AI-native workflows. You'll be central to making that philosophy real.

Key Responsibilities

Own and evolve our AWS infrastructure: containerized services, networking, security, and cost optimization — building toward a setup that scales with both user load and AI workloads
Design and implement state-of-the-art monitoring, alerting, and observability with Datadog (no more "is this broken for everyone?" Slack messages — you'll know before anyone asks)
Build proactive systems for incident detection and response — shifting the team from reactive firefighting to confident, data-informed operations
Architect and deploy infrastructure for AI-native development: cloud-based coding agent environments where multiple agents per developer can build, test, and deploy in parallel
Prepare our infrastructure for AI-specific load patterns: bursty GPU/LLM workloads, intelligent request routing, and cost-efficient scaling strategies
Create a developer platform that treats coding agents as first-class citizens — giving them access to the same data, tools, secrets, and deployment pipelines that human engineers use
Design CI/CD pipelines and deployment workflows that are fast, reliable, and safe — optimized for high-frequency pushes from both humans and agents
Partner with the engineering team to build systems that are scalable and future-proof at the architecture level, not patched after the fact
Establish infrastructure-as-code practices, documentation, and runbooks that make the whole team more autonomous

Requirements

5+ years of experience in DevOps, SRE, or platform engineering, with deep hands-on AWS expertise
Strong experience with container orchestration (ECS or Kubernetes), infrastructure-as-code (Terraform, Pulumi), and modern CI/CD systems (e.g., GitHub Actions)
Proven track record of building observability stacks (Datadog, Grafana, Prometheus, CloudWatch, or similar) that actually prevent incidents, not just log them