Position Details: Site Reliability Engineer (SRE)

Description:

Our end client is:
- 100% REMOTE
- FORTUNE 500 Top Workplaces USA for 2022
- A resume-building opportunity for all contractors

***ONLY USC, GC holder, H4, or GC EAD
Contractors must be on our payroll as W2 employees

***Please include your LinkedIn profile link, if applicable

Software Site Reliability Engineer focused on Operations and Production Readiness

Site Reliability Engineers (SREs) are responsible for keeping production systems running smoothly.
SREs are a blend of pragmatic operators and software developers who apply engineering principles, operational discipline, and mature automation to our operating environments.
SREs specialize in systems (operating systems, networks, observability), while implementing best practices to continuously improve availability, reliability, and scalability.

Responsibilities:

Develop and run SRE-owned tooling and observability using automation such as CI/CD and Kubernetes.
Build monitoring that alerts on symptoms rather than on outages.
Document every action so your findings turn into repeatable actions and then into automation.
Debug production issues across services and levels of the stack.
Plan the growth and reliability of services.

Technical:

Configuration management: use Chef and Ansible to effectively manage our infrastructure

Infrastructure as code:

Use Terraform and GitLab CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet goals

Systems:

Manage, configure, and troubleshoot operating systems, storage (block and object), networking (VPC, Virtual Private Cloud), proxies, and CDNs (Content Delivery Networks), and administer high-availability PostgreSQL and Redis clusters

Monitoring and instrumentation:

Implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations

Engineering practices:

Execution Planning:

Organization:

Management:

Collaboration and Communication:

Leading and contributing to scope and designs for issues, epics, and OKRs (Objectives and Key Results)
Contributing to the Handbook, creating and updating runbooks and general documentation, and writing blog posts
Completing Root Cause Analysis (RCA) investigations and performing readiness reviews
Improving team practices through code reviews and handoffs of work and incidents

Influence and Maturity:

Knowledge sharing and mentoring
Self-awareness, handling conflict in the team, and providing and receiving feedback
Maintaining good relationships with other engineering teams that help improve the product

Accountability:

Willing to proactively step in and do the right thing while providing candid and constructive feedback

Required Skills:

We love referrals!
You and the friend you refer receive a referral fee if we hire your referral, so refer away, if you prefer!
