We're looking for a good generalist Systems and Operations Engineer with skills spanning operating systems, databases, networking, hardware, and scripting/programming, who can help us scale a complex, large-scale environment that uses a variety of internal and external web services.
- Work with Engineering and Operations teams to implement and deploy scripts/tools to collect key application information into monitoring/alerting systems and other databases.
- Maintain and support large development, QA and integration environments.
- Work with the team to automate management tasks, streamline processes and perform standard administration functions as needed.
- Monitor system performance, scalability, and availability; recommend improvements and implement system-wide changes to enhance overall system efficiency.
- Participate in 24/7 on-call responsibility; perform initial investigation of production issues and identify whether any system or component is under stress or failing.
The ideal candidate should possess the following attributes:
- Working knowledge of Bash scripting and of at least one other scripting language such as Ruby (preferred), PHP, or Python. Has used a templating system (Ruby ERB, PHP, etc.)
- Experience building custom application monitoring and alerting, mining application logs, and writing scripts to integrate data/information into monitoring systems
- Java and PHP log analysis/debugging skills; mastery of regular expressions (e.g. for Splunk analysis)
- Experience setting up and maintaining monitoring and alerting infrastructure tools like Ganglia, Nagios, Hyperic, Zenoss, Cacti, etc.
- Good understanding of Linux (e.g. how to determine CPU%, disk image, open sockets, a bad network interface, etc.)
- Understanding of database design, administration, and profiling/tuning (MySQL or Oracle); ability to do advanced CRUD operations
- Understanding of load balancers, firewalls, and NAT; HTTP(S); tcpdump, Wireshark, ifconfig, curl
- Experience with Maven, RPM/yum, Rundeck
- Experience provisioning, maintaining, triaging and optimizing Java web applications; understanding of Java-specific performance issues and their resolution, e.g. garbage collection issues
- Experience with EC2/cloud environments
- Strong, creative problem solving and analytical skills
- Ability to work on multiple projects concurrently, prioritize tasks, and know when to escalate
- Ability to work independently and as part of a team.
- Self-motivated, energetic and tenacious.
- Strong communication skills. Able and willing to document operational procedures in detail
- Flexible and responsive to changing situations
- Handle pressure-filled situations in a professional manner
Zynga - 16 months ago