Resiliency Engineer

Description

In this role, you will be responsible for designing, implementing, and managing systems and processes that enhance the resilience of our platforms. You will play a key role in ensuring our services are robust, fault-tolerant, and able to recover gracefully from failures.

Key Responsibilities:

  • Develop strategies and architectures for high availability, fault tolerance, and disaster recovery to ensure our systems can withstand and recover from failures.
  • Build and maintain monitoring systems to detect anomalies and failures, and lead the response to incidents to minimize downtime and impact on users.
  • Analyze system performance and usage patterns to identify bottlenecks and optimize resource allocation for peak loads.
  • Adopt best practices in reliability, including service level objectives (SLOs), service level indicators (SLIs), and error budgets.
  • Conduct chaos engineering experiments to test the resilience of systems and identify weaknesses before they become critical issues.
  • Develop and maintain automation tools and scripts that enhance the efficiency of deployment and recovery processes.
  • Create comprehensive documentation for resiliency practices and provide training to engineering teams on best practices for building resilient systems.
  • Work closely with cross-functional teams, including software developers, product managers, and operations, to ensure alignment on reliability goals and practices.

Basic Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field.
  • 10 years of experience in systems engineering, site reliability engineering, or a related field with a focus on resiliency.

Preferred Qualifications:

  • Strong understanding of distributed systems, microservices architecture, and cloud infrastructure (AWS, GCP, Azure).
  • Proficiency in programming/scripting languages and frameworks (e.g., Python, Go, Java, .NET, or similar).
  • Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog, Dynatrace) and incident management systems.
  • Familiarity with chaos engineering principles and tools (e.g., Gremlin, Chaos Monkey).
  • Excellent problem-solving skills and a proactive approach to identifying and addressing issues.
  • Strong communication and collaboration skills, with the ability to work effectively across teams.
  • Experience in a DevOps or SRE role within a high-availability environment.
  • Experience with big data and NoSQL platforms.
  • Knowledge of containers and container orchestration platforms (e.g., Docker, Kubernetes).
  • Familiarity with infrastructure as code (IaC) tools (e.g., Terraform, Ansible).

Exempt Status: Yes (not eligible for overtime pay)

Workplace Type: Hybrid

Huntington is an equal opportunity and affirmative action employer and is committed to providing equal employment opportunities for all regardless of race, color, religion, sex, national origin, age, disability, sexual orientation, veteran status, gender identity and expression, genetic information, or any other basis protected by local, state, or federal law.

Tobacco-Free Hiring Practice: Visit Huntington's Career Web Site for more details.

Agency Statement: Huntington does not accept solicitation from third-party recruiters for any position.