RR0402: Systems Reliability Engineer
As a Systems Reliability Engineer, your mission will be to ensure the speed, availability, and scale of the systems, as well as their ability to withstand unprecedented increases in load. In this role you will be at the heart of solving production problems. Your scope runs from the kernel to the application. The position requires the flexibility to take a holistic approach to troubleshooting and the ability to delve deeply into technical details. The Systems Reliability Engineer will build automation tools for system health, write production acceptance tests to validate production changes, and ensure the system is well instrumented and highly fault tolerant.
- Manage the availability, latency, scalability and efficiency of applications by instilling reliability engineering into our development lifecycle, with a focus on fault-tolerant approaches
- Respond to and resolve unexpected and potential service problems and write software to prevent problem recurrence
- Drive capacity planning, performance analysis, instrumentation and other non-functional systems requirements
- Review and influence ongoing design, architecture, standards and methods for improving operating services
- Manage system releases, write production software acceptance tests, and coordinate all aspects of the release including coverage and communication plans
- Bachelor’s degree in Computer Science or equivalent
- 3+ years of experience as a software engineer or in the development of customer-facing, high-availability, large-scale distributed applications
- Experience in C or C++, Java technologies
- Experience with PHP, Python, Ruby or other scripting languages
- Extensive experience with Linux/Unix
- Prior successful experience as a systems performance or site/systems reliability engineer
- Extensive experience working with fault-tolerant approaches in large-scale distributed environments and high-performance systems
- Demonstrated experience working in large, complex systems environments
- Deep understanding of Internet and networking protocols
- Expertise in analyzing and troubleshooting large-scale distributed systems
- Knowledge of IP networking and of analyzing network, performance and application issues using standard tools such as tcpdump
- Ability to handle periodic on-call duty as well as out-of-band requests