In the short term, we need to increase the resilience and reliability of our current PaaS solution by:
Improving the maintainability of our infrastructure as code
Building dashboards, monitoring & alerting mechanisms with Datadog
Load testing and performance tuning our production services
Managing the lifecycle and maintenance of our Kubernetes clusters
In the medium to long term, you’ll get to:
Implement new and shiny technologies on top of Kubernetes as you see fit to ensure our tech can scale with the business.
Develop and integrate solutions with a bias for automation in order to improve and maintain reliability across the production estate and make recovery easier.
Design and track metrics for site uptime and performance, ensuring high levels of visibility are maintained.
Own the deployment pipelines and continuously improve our monitoring and alerting capabilities.
Collaborate closely with all other engineering functions to provide timely feedback from our environments.
Support Engineering on their journey to deliver better software, faster and more safely (think “It’s OK to deploy on Fridays” 😎).
You have strong systems administration skills, know the difference between a container and a virtual machine, and know your way around a Linux terminal
You have platform engineering/SRE experience at leading startups or fast-growing tech companies
You have either had experience with some of our tech stack or are confident you can cross-train and upskill quickly
You have experience working in a regulated industry
You are confident working with and guiding developers on monitoring and logging of complex systems at scale
You have worked on complex projects
You reflexively reach for AI agents to assist in researching and solving your problems
You can work collaboratively with different teams, e.g. Security, Data, and Engineering
You want to forge and own MoonPay’s reliability & recovery processes
You’ve got at least a basic understanding of complex reliability structures, theories, principles, and best practices
You have worked with JavaScript codebases and frameworks, e.g. TypeScript, Node.js and React
Our tech stack:
TypeScript as our programming language of choice
Node.js as our backend platform
TypeORM, TypeDI, TypeGraphQL and routing-controllers as our backend libraries
React and Next.js hosted on Vercel as our frontend
Google Cloud Platform to host our services
Postgres as our core database
Redis for caching
Bull to manage background tasks
Datadog for logging and monitoring
ArgoCD for continuous deployment on Kubernetes
GitHub to manage our source code
Jest to run our tests ✅