Site Reliability Engineer
San Francisco, US | Hybrid | Full time | Senior | Jan 21, 2026
About this role
About Plenful
Plenful is on a mission to move pharmacy forward through intelligent automation. We build AI-powered software that eliminates administrative burden, strengthens compliance, and unlocks revenue across critical pharmacy workflows, solving one of the biggest challenges in healthcare today: delayed patient care.
Built by a passionate team of former healthcare operators and world-class AI technologists, Plenful combines deep domain expertise with enterprise-grade technology to automate complex workflows across intake authorization, 340B program optimization, and pharmacy revenue reconciliation. Our AI platform is trusted by 95+ leading healthcare organizations to power smarter, faster, and more resilient pharmacy operations.
Backed by leading investors including Notable Capital, Bessemer Venture Partners, and TQ Ventures, Plenful is building the institutional memory for healthcare and powering the most complex, highest ROI healthcare workflows. We’re actively hiring as we continue to scale.
About the role
We’re hiring a Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of Plenful’s production systems as we continue to grow.
This role is centered on operating real systems at scale — not just building infrastructure, but deeply understanding how it behaves under load, fails in production, and recovers. You’ll define reliability standards, own production health, and build the feedback loops that make our systems more resilient over time.
You’ll work closely with backend, data, and ML engineers to ensure our platform is highly available, measurable, and continuously improving. This includes everything from incident response and performance debugging to SLO design and system-level optimization.
What You’ll Do
Reliability Engineering & System Ownership
Define and implement SLIs, SLOs, and error budgets across core services
Own production system health, including uptime, latency, and availability targets
Continuously improve system resilience through proactive reliability work
Identify and mitigate single points of failure across distributed systems
Production Operations & Incident Response
Participate in and improve on-call rotations and incident response processes
Lead incident triage, mitigation, and resolution in real time
Conduct blameless postmortems and ensure follow-through on action items
Build tooling and automation to reduce MTTR (Mean Time to Recovery)
Observability & System Insight
Design and evolve observability systems across:
  Metrics, logs, and distributed tracing (OpenTelemetry)
  Tooling including Datadog, CloudWatch, Grafana, Sentry
Improve signal quality to reduce noise and alert fatigue
Develop dashboards and alerts that reflect real system health and user impact
Use observability data to drive performance and reliability improvements
Performance & Scalability
Analyze system performance under load and identify bottlenecks
Optimize latency, throughput, and resource utilization across:
  Serverless systems (AWS Lambda)
  Containerized services (ECS)
  Data systems (Aurora Postgres, ClickHouse)
Partner with engineering teams to improve system efficiency and scaling behavior
Automation & Reliability Tooling
Build automation to eliminate repetitive operational work
Improve deployment safety through reliability checks and safeguards
Contribute to CI/CD pipelines (GitHub Actions) with a focus on system stability
Develop tools for:
  Incident response
  Debugging
  Capacity planning
Security, Compliance & Operational Maturity
Partner with security and compliance to ensure systems meet operational standards
Support audit readiness and reliability-related compliance requirements (Vanta)
Integrate monitoring and alerting into security and SIEM workflows
Help mature operational practices across the engineering team
Environment & Technical Context
You’ll work across a modern distributed stack:
Cloud: AWS (ECS, Lambda, RDS Aurora Postgres, CloudWatch)
Infrastructure: Terraform, Ansible, Linux
CI/CD: GitHub Actions
Observability: Datadog, Grafana, CloudWatch, OpenTelemetry, Sentry, pganalyze
Data Systems: Postgres, ClickHouse
Security & Compliance: Vanta, SIEM tooling
Product & Analytics: Amplitude
ML/Platform Infra: TrueFoundry
What Success Looks Like
Clear, enforced SLOs and error budgets across critical systems
Incidents are well-managed, rare, and decrease over time
Engineers have high-confidence signals about system health
Alerts are actionable, not noisy
Systems scale predictably under load without degradation
Postmortems lead to real, measurable improvements
Reliability is treated as a shared engineering responsibility, not a reactive function
Ideal Background
Must Have
5+ years in Site Reliability Engineering, SRE-adjacent roles, or production infrastructure
Strong experience operating and debugging distributed systems in production
Hands-on experience with:
  Observability tooling (Datadog, Grafana, OpenTelemetry, etc.)
  Incident response and on-call practices
  Performance and reliability debugging
Experience defining and working with SLOs / SLIs / error budgets
Familiarity with:
  AWS environments
  Serverless and container-based architectures
  Postgres or similar relational databases
Ability to write code/scripts (Python, Bash, etc.) for automation and tooling
Strong systems thinking and ability to reason about failure modes
Nice to Have
Experience in high-growth or high-scale environments
Background in regulated industries (healthcare, fintech)
Experience with ClickHouse or analytical systems at scale
Familiarity with chaos engineering or load testing frameworks
Exposure to ML infrastructure or data platforms
Plenful perks
Comprehensive Benefits Package: Enjoy unlimited PTO, fully covered health insurance (medical, dental, and vision), meal stipend, health & wellness stipend, 401(k) matching, and stock options.
Mission-Driven, World-Class Team: Join an exceptional group of professionals aligned around a meaningful mission and committed to making an impact.
Opportunities for Growth: Strengthen your partnership expertise through collaboration with experienced, high-performing leaders across the organization.
Flexible Work Environment: Employees based in the Bay Area enjoy two days per week in a brand-new downtown San Francisco office. Employees based in other cities enjoy a fully remote work environment with the ability to travel for collaboration.
Locations: San Francisco (Hybrid)