To ensure that our public-facing, cloud-based services are reliable, scalable and secure in the face of rapid growth and change.
What we'd like you to do in the first few months
Develop and implement a strategy to ensure our monitoring/alerting systems are complete and escalate reliably
Improve our reliability story by ensuring deployment works across multiple cloud providers
Analyze our fleet performance on different providers and ensure we're being cost-effective
How we expect you to operate to achieve the above outcomes
Automation - We automate anything and everything to ensure predictable, fast deployments.
Consistency - We're moving fast, but it's a marathon, not a sprint. We need people who know how to pace themselves.
Performance - We need someone who can reason about and optimize performance in a highly distributed, cloud-based environment with hundreds of instances and tens of billions of objects.
Communication - We are a team distributed across the US. The ability to communicate effectively and asynchronously across a variety of media (email, IM, group chat, Skype/Hangouts) is critical.
Responsiveness - We provide infrastructure for other apps. We are looking for someone who responds quickly when things go south (alongside the rest of the engineering team!).
Platform diversity - We work across multiple cloud providers. Being comfortable working across these platforms is essential.
Technology and Concepts you'll be working with
AWS, EC2, S3, Joyent Cloud, Ubuntu, Solaris, SmartOS, node.js, JSON, Ruby, Chef, Python, Fabric, OAuth, HTTP/S, MySQL, Redis