This role reports directly to the Director of Reliability.
Why Are We Hiring
Being a professional is hard. Not only do you need to be an expert in your field, but you also need to worry about how you communicate, manage your time, store and share information, keep your data secure, make your tooling reliable, and a million other details. Filevine dreams of a day when all of these details are taken care of so that professionals can get back to focusing on what they love.
We are building our team because we recognize that, to make good on that dream, we need the help of amazingly talented people like you.
About the Team and Job
To let professionals focus on what they love, Filevine products need certain key qualities. They need to be reliable, scalable, performant, cost-effective, and secure, and they need a way to recover in the event of a disaster.
The Reliability team is responsible for thinking through these problems and engineering solutions to them. We hire excellent engineers who apply software engineering to these problems to create autonomous systems that take care of these details for us. We use the principle of continuous improvement to make each iteration of these autonomous systems better than the last.
Our autonomous systems are nascent, with the foundational pieces either recently completed or currently under development.
As a Site Reliability Engineer, you will be embedded with a cross-functional team that has key responsibilities for certain portions of our systems. Over the course of the first year, you will gain the context needed to be truly effective and move at speed in the Filevine environment. In subsequent years, you will be given specific mission-critical objectives that help build out and improve our autonomous systems while building your personal brand as an exceptional engineer who has built and maintained systems that can grow to internet scale.
Qualifications
Curiosity, a willingness to learn, a passion to continually improve, and unbridled enthusiasm to make things better every day without needing to be directed to do so
Proficiency in all of the skills expected of our SRE IIs
A bachelor's degree in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience
A minimum of 8 years of experience in hands-on technical roles
A minimum of 2 years of Site Reliability Engineering experience
Experience building autonomous systems that manage software operational details without human intervention
Preferred Qualifications
M.S. in computer science, information systems, or a related field; comparable certifications; or equivalent direct work experience
2-6 years of Site Reliability Engineering experience
Experience developing, deploying, and maintaining internet-scale applications
Experience incorporating Artificial Intelligence or Machine Learning into internet-scale applications
Responsibilities
Provide leadership, mentoring, and excellent judgement by being responsible for:
Developing autonomous systems that manage the details necessary to build, deploy, test, and operate all Filevine Inc. products
Being the voice of Reliability on your team throughout the SDLC
Collecting, monitoring, aggregating, dashboarding, and alerting on software and server events
Improving the CI/CD pipeline
Developing playbooks, tools, and scripts to streamline processes and shorten problem resolution time
Identifying and fixing gaps in the availability of systems
Improving and defending the security of software and systems
Documenting and diagramming processes, procedures, and best practices
Finding, learning, improving, or creating new tools that are reliable, usable, and helpful to enable other engineers to perform their work more efficiently
Work within your assigned team to complete duties as assigned, while mentoring and training more junior engineers and reviewing their work
Work either individually or in conjunction with other engineers to complete assignments
Be part of an on-call rotation with other team members to provide 24/7/365 production reliability support
Be part of an on-call rotation with other team members to provide escalated emergency support for the services your team owns
Communicate frequently, clearly, and effectively with various technical and management audiences