Senior SRE Manager

Tehran
Full-time
Technology

In the story of Snappfood, we believe in creating value that goes beyond the ordinary. We strive to set innovative trends and are eager to have you on our team to help us meet our business challenges with creativity, intelligence, and agility.
We are waiting for you to continue this story.

Responsibilities:

  • Lead and oversee incident management processes, ensuring quick resolution and minimal impact on systems and services.
  • Establish and maintain best practices for incident handling, root cause analysis, and post-incident reviews.
  • Supervise and mentor NOC Engineers, ensuring effective monitoring, troubleshooting, and support for production systems.
  • Collaborate with the NOC and SRE teams to enhance system reliability and establish seamless workflows.
  • Manage release pipelines to ensure efficient, safe, and automated deployments while minimizing production risks.
  • Coordinate with software development teams to align deployment processes with reliability and scalability goals.
  • Define and track service level objectives (SLOs) and indicators (SLIs) to maintain high availability and performance.
  • Drive initiatives to improve system reliability, capacity planning, and disaster recovery readiness.
  • Foster a culture of automation to reduce manual interventions and improve operational efficiency.
  • Act as a strategic leader, collaborating with cross-functional teams to identify and implement improvements in infrastructure and operations.
  • Stay informed about industry trends, emerging tools, and best practices, and advocate for adopting those that align with organizational goals and improve infrastructure and operations at scale.

Requirements:

  • Excellent leadership and team management skills.
  • Strong problem-solving and decision-making abilities under pressure.
  • Effective communication and collaboration skills for cross-functional engagement.
  • Ability to prioritize and manage competing demands in a fast-paced environment.
  • Conflict management skills to address and resolve disagreements constructively, promoting healthy team dynamics.
  • Bachelor’s degree in computer science, software engineering, or a related field. A master’s degree is preferred.
  • Minimum of 7 years of experience in Site Reliability Engineering, including at least 3 years in a managerial role.
  • Strong expertise in incident management and post-incident processes.
  • Proven ability to lead efforts to identify and address potential issues impacting system reliability, including capacity planning, disaster recovery, and incident management.
  • Experience developing and overseeing the implementation of automated solutions that enhance system reliability, scalability, and operational efficiency.
  • Ability to monitor and analyze system metrics and provide insights that proactively resolve potential issues and improve overall system performance and availability.
  • Experience designing, implementing, and overseeing comprehensive monitoring, alerting, and observability systems for real-time incident detection and response.
  • Ability to provide strategic direction and guidance to teams designing and deploying applications that meet production standards and reliability goals.
  • Hands-on experience with infrastructure automation tools.
  • Proficiency with cloud-native architectures, microservices, and container orchestration.
  • Familiarity with release automation tools.
  • Solid networking and systems troubleshooting skills.

Benefits:

  • Vouchers for vacation, gym, therapy sessions, and internet costs
  • Supplementary insurance
  • Access to an educational platform with advanced courses
  • Snappfood discount codes
  • Loans
