SRE (Site Reliability Engineer)
Job description
TON Foundation is a non-profit organization supporting the growth of the TON Blockchain and its ecosystem. Founded in Switzerland in 2023 and backed by a global community, the Foundation empowers developers, creators, and businesses through grants, technical resources, and strategic partnerships. TON operates as a decentralized, open-source network, independent of centralized control and open to contributions from all.
We are looking for a Site Reliability Engineer to ensure a resilient, secure, and production-ready platform that enables the safe and efficient deployment of applications and services. This role focuses on improving service availability, monitoring, incident response, and system reliability, while supporting operational teams and driving continuous improvements in scalability, uptime, and platform stability.
Responsibilities
Increase the resilience and reliability of our PaaS solutions through work such as:
Configure and maintain monitoring and alerting for our Kubernetes clusters and production services
Load testing and performance tuning across our production services
Build dashboards, monitoring, and alerting mechanisms
Develop and integrate automation-first solutions that improve and maintain reliability across the production estate and make recovery easier
Design and implement fault-tolerant solutions across stateful services and supporting infrastructure
Design and track uptime and performance metrics, keeping service health highly visible
Collaborate closely with all other engineering functions to provide timely feedback from our environments
Participate in the on-call rota and support incident response and service recovery
Requirements
Experience with monitoring systems such as Prometheus, Grafana, and VictoriaMetrics
Experience designing and supporting fault-tolerant Redis, RabbitMQ, and PostgreSQL clusters
Strong understanding of scaling, resilience, and high availability under load
Proficiency in load testing and performance tooling such as k6
Strong Linux and scripting skills for platform automation and troubleshooting
Ability to work closely with engineering teams to improve delivery, reliability, and developer experience