Senior Manager, Site Reliability Tools - San Francisco, CA


Founded in 1999, Salesforce is the enterprise cloud computing company that is leading customers in their transformation to become social enterprises. Social enterprises are able to connect with customers, partners, and employees in entirely new ways. Based on Salesforce's real-time, multitenant architecture, the company's platform and application services give customers the tools to create a true social front office and revolutionize the way they sell, service, market, collaborate, work, and innovate. As the first enterprise cloud computing company to exceed $2.5B in annual revenue run rate, with more than 9,000 employees and more than 100,000 customers worldwide, we are proud to contribute to the success of companies of all sizes and industries around the globe. We're also one of the "Best Places to Work" (FORTUNE). If you're passionate about innovation, come help revolutionize how companies collaborate and communicate with customers.

Salesforce is looking to hire a well-seasoned technology manager to lead our Site Reliability Tools team. If you enjoy an entrepreneurial role where you can develop solutions, solve complex computing problems, and drive teams to succeed, then this is the opportunity you have been waiting for!

This is a cutting-edge role with a team focused on continuous service availability through global cloud-scale event analysis, anomaly detection, full-stack and cross-platform correlation, and automated probable root cause identification. In addition, the team will be responsible for building resilience into the Salesforce cloud through operations orchestration, self-healing mechanisms, destructive testing, datacenter failover automation, and global traffic management for the greatest Enterprise Cloud Computing company in the world!

At Salesforce, we build reliability into everything we do. But at cloud scale, even reliable complex systems break from time to time. Because of this, we need to build resilience to failure into the platform and the surrounding infrastructure and support systems.

Key Responsibilities:

- Continuous service availability of massive-scale cloud platforms
(This is Mission Control for the Cloud)

- Lead a multifunctional team to implement cloud-scale incident detection and remediation.

- Pull signal from noise through global event stream aggregation, deduplication, correlation, and root cause analysis, using best-of-breed solutions chosen from open source, in-house designs, third-party products and platforms, and big data analysis.

- Provide superior incident remediation capability, through operations orchestration, failover automation, and self-healing techniques.

- Empower a 24/7 global team of reliability engineers to respond instantly to problems, through customized ITIL and ITSM implementations which shepherd process flow, automate and speed communications, and capture vital information for later analysis and mining.

- Evaluate technologies, both internal and external, to improve resilience. Perform vendor comparisons, proof-of-concept trials, and new technology prototypes. Drive vendor selection, and participate in negotiation and ongoing relationship management.

- Collaborate cross-functionally to build survivability, fast failover, and ease of operations into our product lines. Work with the development, infrastructure engineering, and architecture teams to drive site reliability requirements, sustainable operations, and alert notification into every product deployed.

- Drive global adoption of best practices, tools and processes across the entire Technology organization to support continuous availability.

- Determine long-term strategy and build a multi-year roadmap for Site Reliability tooling and platforms.


Requirements:

- BS in Computer Science or a related degree, or equivalent industry experience
- 6-15 years' experience in large-scale, high-transaction OLTP internet service engineering, development, architecture, or service management
- A minimum of 3 years in a direct people management position
- Java, Perl, Python, or C++ development experience
- Deep functional knowledge in several of the following areas:
  - High Availability architecture
  - Open source monitoring and orchestration technologies (Nagios, Kafka, Puppet, etc.)
  - Third-party monitoring and orchestration vendors (Tivoli, et al.)
  - ITSM and ITIL implementations (BMC, Axios, etc.)
  - Communications automation (xMatters, Everbridge, Mir3, etc.)
  - Block-based, filesystem-based, and log-based replication schemes
  - Modern distributed data storage technologies: HBase/Hadoop HDFS
  - Enterprise RHEL/Debian/BSD Linux systems management
  - High-end SAN storage (HDS, EMC) solutions
  - Large-scale NAS filers (HNAS, NetApp), and how to make them perform
  - High-volume database operations environments. Oracle/Postgres a plus

Preferred Qualifications:

- MS or PhD in Computer Science, Mathematics, or equivalent
- Chef/Puppet enterprise design/deployment
- Large-scale automation architecture/development
- Product Owner/Scrum Master ADM
- Social Enterprise Platform design/system support
- application development
