Site Reliability Engineering for high‑stakes systems

Practical reliability for fast‑moving teams. I help you ship faster and sleep better through well‑designed SLOs, observability, and incident response.

SLO / Error Budgets • Incident Response • Observability • Performance • Cloud Cost • Reliability Reviews

Fewer incidents

Stabilize core services with pragmatic guardrails, runbooks, and chaos‑safe changes.

Faster recovery

On‑call you can trust: clean escalation paths, actionable alerts, and blameless postmortems.

Predictable velocity

SLOs and golden signals drive product decisions without slowing delivery.

Services

Reliability Strategy

Define service tiering, SLOs, and error budgets. Establish change policy and reliability guardrails.

Observability

Metrics, logs, and traces that matter. Alerting tuned for signal over noise.

Incident Management

Incident playbooks, clearly defined roles, post‑incident reviews, and tooling integrations that reduce MTTR.

Performance & Resilience

Capacity planning, load testing, fault injection, and autoscaling strategies.

Platform & Cloud

IaC reviews, multi‑AZ patterns, safe deployments (blue/green, canary), and cost controls.

Advisory & Fractional SRE

Hands‑on guidance or part‑time leadership to bootstrap or level up your SRE practice.

How I work

Assess — short discovery to baseline reliability risks and opportunities.
Prioritize — focus on "boring, scalable" improvements with strong ROI.
Deliver — pair with your team, ship changes, and transfer knowledge.

At a glance

Timezone: America/Vancouver • Remote‑first
Tooling: AWS, Kubernetes, Terraform, Prometheus, Grafana, Datadog, Elastic, PagerDuty
Industries: games, fintech, SaaS, e‑commerce

Get in touch

Email jhavero@gmail.com or call +1 (778) 882-7514. Or send a message using the form below (it opens your email client through a mailto: link, so it works on S3 without a backend).




Contact

Raguero SCM Services
Burnaby, BC • Remote

Email: jhavero@gmail.com
Phone: +1 (778) 882‑7514

LinkedIn: https://www.linkedin.com/in/alex-raguero-472a20/