Position Details: Site Reliability Engineer (SRE)
100% Remote |
3 |
$60-75/hr |
Description:
About our end client:
- 100% remote
- Fortune 500, Top Workplaces USA for 2022
- Resume-building opportunity for all contractors
***ONLY USC, GC holder, H4, or GC EAD
Contractor must be on our payroll as a W2 employee
***Please include your LinkedIn profile link, if applicable
Software Site Reliability Engineer focused on Operations and Production Readiness
- Site Reliability Engineers (SREs) are responsible for keeping production systems running smoothly.
- SREs are a blend of pragmatic operators and software developers who apply engineering principles, operational discipline, and mature automation to our operating environments.
- SREs specialize in systems (operating systems, networks, observability), while implementing best practices to continuously improve availability, reliability, and scalability.
Responsibilities:
- Develop and run the SRE team's own tooling and observability using automation such as CI/CD and Kubernetes.
- Build monitoring that alerts on symptoms rather than on outages.
- Document every action so your findings turn into repeatable actions and then into automation.
- Debug production issues across services and levels of the stack.
- Plan the growth and reliability of services.
Technical:
Configuration management:
- Use Chef and Ansible to manage our infrastructure effectively
Infrastructure as code:
- Use Terraform and GitLab CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet goals
Systems:
- Manage, configure, and troubleshoot operating systems, storage (block and object), and networking, including Virtual Private Cloud (VPC), proxies, and Content Delivery Network (CDN); administer high-availability PostgreSQL and Redis clusters
Monitoring and instrumentation:
- Implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations
Engineering practices:
- Availability, reliability, and scalability, as well as disaster recovery
Execution Planning:
- Familiar with agile methodologies; use epics and issues to drive projects
Organization:
- Workload organization, OKR (Objectives and Key Results) leadership
Management:
- A manager of one, able to self-organize and report asynchronously
Collaboration and Communication:
- Leading and contributing to scope and designs for issues, epics, and OKRs (Objectives and Key Results)
- Contributing to the Handbook, creating and updating runbooks and general documentation, and writing blog posts
- Completing Root Cause Analysis (RCA) investigations and performing readiness reviews
- Improving team practices through code reviews and handoffs of work and incidents
Influence and Maturity:
- Knowledge sharing and mentoring
- Self-awareness, handling conflict in the team, and providing and receiving feedback
- Maintaining good relationships with other engineering teams that help improve the product
Accountability:
- Willing to proactively step in and do the right thing while providing candid and constructive feedback
Required Skills:
- 5-8 years of experience in software development and technical operations monitoring
- Bachelor’s degree in IT or a related field required
- Programming skills/background – Java, Python, Ruby, etc.
- Ability to build CI/CD pipelines - Jenkins, Nexus
- Knowledge of 4 or more of the following technical areas, with deep knowledge in 1 area:
- AWS Cloud Practitioner, resource provisioning and configuration through CLI/API
- Chef (basic syntax, recipes, cookbooks) or Ansible (basic syntax, tasks, playbooks)
- Kubernetes understanding, CLI (Command Line Interface), service re-provisioning
- Provisioning and setting up metrics in AppD, Grafana, or Datadog
- Provisioning and setting up logs and queries for frequent questions
- Networking: VPC, proxies, and CDN (Content Delivery Network)
We love referrals!
You and the friend you refer receive a referral fee if we hire your referral, so refer away, if you prefer!