Site Reliability Engineer
Job Description :
Primary focus is to provide technical expertise, education, and tooling to ensure the highest level of reliability and availability for critical applications.
Able to provide consultation and strategic recommendations by quickly assessing and remediating complex availability issues. Responsible for driving automation and efficiencies to increase quality, availability and security.
Organizational Context :
Technical individual contributor integrated with technology and business partners to drive efficiencies that increase the quality, availability and security of technical platforms. Works individually and with teams to drive reliability goals and objectives across platforms.
Key Responsibilities :
Product Development :
- Enable creation and updating of logging standards to streamline dashboard creation and ensure usability of logging repository.
- Drive monitoring requirements to ensure business-service level visibility for all support teams.
- Participate in architectural decisions to ensure software transaction flows are appropriately supported and designed.
- Is an IT infrastructure Subject Matter Expert (SME) and works with Development teams to build to standards that drive the highest levels of availability.
- Provides guidance to software engineers related to design patterns that are resistant to failure.
- Communicates effectively with Development and Operations teams to align on requirements, driving SDLC requirements, capabilities, and limitations pertinent to delivering highly resilient applications.
- Responsible for evaluating and implementing orchestration, automation, and tooling solutions to ensure consistent processes and repetitive tasks are performed with a higher level of accuracy and fewer defects.
- Build, implement and advise on recovery tooling to adhere to enterprise standards and/or frameworks.
- Introduce new and impactful technologies to the production support tool chain that help minimize friction for production releases and support, and that help diagnose and recover from production incidents more quickly.
Operational Readiness :
- Responsible for the availability, proactive monitoring and alerting, capacity planning, and performance (reducing latency and increasing efficiency) of technical platforms, including testing.
- Partner with appropriate supporting teams to ensure operational readiness throughout the application lifecycle.
Production Support :
- Ensure application data flows are accurate and up to date, with the objective of increasing the knowledge base of all support teams and driving reliability.
- Facilitates the resolution of non-application issues (3rd party upstream issues, infrastructure issues, storage, database, network, file transfer, etc.).
Scope of Impact/Influence :
- Consults with teams to build standards that drive the highest levels of availability
- Mentors teams through ongoing development efforts
- Partner with development teams to adhere to SDLC standards
- Acts as a Center of Enablement coach, advising various teams on the SRE function and providing real-life examples when necessary
Qualifications :
- Bachelor's degree in a related field preferred; relevant industry experience can substitute.
- 8+ years of engineering and/or architecture experience in a complex environment, such as a large-scale web infrastructure or development team.
- Experience supporting a 24/7 enterprise environment with on-call responsibilities for production support.
- Experience in a broad range of software development and operations technologies such as infrastructure, virtualization, load balancing, containers, JVMs, web servers, application debugging, queueing technologies, caching technologies, databases (RDBMS and NoSQL), routing and switching, etc.
- Experience in high transaction volume OLTP sites or the Financial Services industry is preferred
High-performing Behaviors :
- Has an "Automation First" mindset: fundamentally will not accept doing things over and over by hand
- Combines deep technical expertise, a continuous improvement and automation mindset, and systematic and rational root cause analysis to identify opportunities to make things faster and better
- Challenges the status quo, identifies opportunities to adopt innovative technologies to enable business capabilities, generates creative ideas and solutions to difficult problems
- A recognized expert and highly sought-after consultant that is knowledgeable regarding current research and technology in the industry and uses that knowledg