We are at a critical inflection point. Our low-code platform is preparing for an immediate scale-up to 3,000,000 concurrent users. We currently operate on a GKE-based architecture with 78 microservices and a MongoDB Atlas backend. We need a Lead Site Reliability Engineer who can transform our current synchronous system into a high-concurrency, asynchronous engine capable of surviving massive traffic spikes without database or compute failure.
Responsibilities
- Decoupled Architecture: Transition synchronous API flows to Google Cloud Pub/Sub to act as a shock absorber for a MongoDB Atlas M60+ cluster.
- Database Guardrails: Implement and own the "Speed Limit" for our database. You will configure Subscriber-side Flow Control in Node.js and Kubernetes HPA to ensure we never exceed 10,000 IOPS or 32k connections.
- Resource Isolation: Isolate heavy Puppeteer/Chrome workloads from core platform services using Cloud Run or dedicated Spot VM node pools with taints/tolerations.
- Observability & Alerts: Build a "Nerve Center" using Cloud Monitoring. You must track Message Age, Disk Throughput, and Connection Saturation with millisecond precision.
- Platform Hardening: Work across our 78 microservices to optimize their container footprints using Vertical Pod Autoscaling (VPA) and efficient bin-packing.
Technical Requirements
- GCP Mastery: Deep experience with GKE, Pub/Sub, and Cloud Run. You should know how to request and manage high-scale CPU quotas.
- Advanced Node.js: You must understand how to manage the Event Loop under heavy load and how to properly ack/nack messages in a distributed queue.
- MongoDB at Scale: Experience with Atlas M60/M80 tiers. You must know how to diagnose Index Resident Memory issues and manage connection pooling at the platform level.
- The "SRE Mindset": You believe that a "Slow" system is better than a "Broken" one. You have experience implementing Backpressure and Circuit Breakers.