Systems Reliability Engineer
At one of the largest global media companies in the world, the Systems Reliability Engineer holds one of the most important technical roles, ensuring best systems practices across the company. You will ensure the speed, availability, and scale of the systems, as well as their ability to withstand unprecedented increases in load. You will be at the heart of solving production problems. Your scope runs from the kernel to the application. You will build automation tools to ensure the system is well instrumented and highly fault tolerant.
• Manage the availability, latency, scalability, and efficiency of applications by instilling reliability engineering into the development lifecycle, with a focus on fault-tolerant approaches
• Respond to and resolve unexpected and potential service problems and write software to prevent problem recurrence
• Drive capacity planning, performance analysis, instrumentation and other non-functional systems requirements
• Review and influence ongoing design, architecture, standards and methods for improving operating services
• Manage system releases, write production software acceptance tests, and coordinate all aspects of the release including coverage and communication plans
• Bachelor’s degree in Computer Science or equivalent
• Software engineering or development experience with customer-facing, high-availability, large-scale distributed applications
• Experience developing in C, C++, or Java
• Experience with PHP, Python, Ruby, or other scripting languages
• Extensive experience with Linux/Unix
• Prior successful experience as a systems performance or site/systems reliability engineer
• Extensive experience working with fault-tolerant approaches in large-scale distributed environments and high-performance systems
• Demonstrated experience working in complex systems environments
• Deep understanding of Internet and networking protocols
• Expertise in analyzing and troubleshooting large-scale distributed systems
• Knowledge of IP networking, and experience analyzing network performance and application issues using standard tools such as tcpdump