
Site Reliability Engineer (Datadog)
- Johannesburg, Gauteng
- Permanent
- Full-time
- Datadog Certified Fundamentals – Must have
- Degree in Information Technology or Computer Science
- Management of operations on virtualized and distributed infrastructures
- Management of operations on environments with clustering, replication, and load balancing
- ITIL Practitioner (V3) / ITIL Specialist (V4)
- Experience working with Windows Server - Advantage
- 1–3 years of experience working with a modern monitoring/observability tool, ideally Datadog (or alternatives like Prometheus, Grafana, New Relic, or Dynatrace).
- Experience in:
  - Deploying and configuring monitoring agents
  - Creating dashboards and monitors
  - Parameterizing tags and labels for proper data correlation
- Basic familiarity with cloud platforms (AWS, Azure, or GCP) and container environments (Docker/Kubernetes)
- Experience working with Centreon - Advantage
- Strong interest in monitoring, DevOps, SRE, or cloud infrastructure
- Knowledge of basic scripting (e.g., Bash, Python) is a plus
- Support the design, implementation, and optimization of Datadog monitoring solutions across infrastructure, applications, and services.
- Work alongside DevOps, infrastructure, and application teams to ensure complete observability using custom dashboards, alerts, and tagging strategies.
- Assist in the deployment and onboarding of new systems into the monitoring ecosystem.
- Serve as the go-to person for building visualizations, improving signal-to-noise ratios in alerting, and aligning monitoring with business objectives.
- Ideal for a young and motivated engineer looking to grow within observability and cloud-native monitoring.
- Deploy and configure Datadog agents across various environments (cloud and on-prem).
- Create and customize dashboards, monitors, and alerts for systems, services, containers, and applications.
- Implement tagging strategies to organize, filter, and correlate metrics and logs effectively.
- Integrate Datadog with various platforms (AWS, Azure, GCP, Kubernetes, Docker, etc.) to collect telemetry data.
- Collaborate with developers, DevOps, and infrastructure teams to identify key business and system metrics to monitor.
- Continuously tune and optimize monitors to reduce false positives and improve actionable alerting.
- Document dashboards, alert logic, best practices, and knowledge for cross-team enablement.
- Analyze incidents and outages post-mortem to identify monitoring gaps and enhance visibility.
- Assist in evangelizing observability practices within the organization and contribute to monitoring-as-code efforts (e.g., Terraform for Datadog resources).
- Stay up to date with new Datadog features and industry trends in observability and monitoring.
ExecutivePlacements.com