Filevine is hiring a VP of Engineering, Reliability to lead one of the most critical functions in our engineering organization. This leader will own the strategy, people, operations, and outcomes for the teams responsible for infrastructure, site reliability, database engineering, observability, and incident management across Filevine's platform.
This is not a maintenance role. We are looking for a leader who will assess our current reliability posture and operating model with fresh eyes, define a forward-looking vision for how reliability engineering should work at Filevine, and execute that vision at the pace of an AI business. The right candidate has led reliability organizations through similar inflection points, brings strong convictions about what "good" looks like at scale, and has the operational credibility and executive presence to drive meaningful change.
Responsibilities
Strategic Vision: Define and execute the reliability engineering roadmap, aligning infrastructure and AI-native architecture with Filevine’s enterprise growth and platform modernization.
Operating Model Evolution: Balance centralized platform capabilities with distributed ownership, ensuring the reliability model scales across a diversifying technology portfolio.
Performance Frameworks: Establish and manage SLO/SLI/error budget frameworks to create a shared language for balancing feature velocity with system stability.
Efficiency & Planning: Lead infrastructure cost management (optimization and forecasting), capacity planning, and disaster recovery to meet rigorous enterprise contractual commitments.
Organizational Development: Lead and scale a multi-disciplinary organization (DevOps, SRE, DBRE, Tooling), fostering a culture of ownership, high craftsmanship, and clear career growth.
Operational Excellence: Drive continuous improvement through DORA metrics, incident trend analysis, and systematic toil reduction to enhance service availability and deployment health.
Developer Empowerment: Deliver self-service tooling, guardrails, and documentation that allow feature teams to operate their own services effectively without bottlenecks.
Security & Compliance: Act as the primary engineering interface for the CISO to advance compliance posture (FedRAMP, SOC 2, CJIS, ISO) and translate security needs into pragmatic action.
Executive Partnership: Collaborate with the CTO, CPO, and Architect to communicate risks and investment needs, positioning reliability as a key enabler for enterprise go-to-market success.
Qualifications
Extensive Leadership: 15+ years of engineering experience, with 7+ years specifically leading infrastructure, reliability, or platform teams at scale in product-driven companies.
Organizational Scale: Proven track record managing organizations of 40+ engineers across SRE, DevOps, and Tooling, including developing multiple layers of management.
Strategic Evolution: Demonstrated experience evolving reliability operating models to meet the shifting needs of a scaling business.
High-Trust Environments: Deep expertise operating in regulated sectors (Legal Tech, Fintech, Government, or Healthcare) where compliance and data sensitivity are primary constraints.
SRE Mastery: Practical, production-hardened understanding of SRE principles, including SLOs, error budgets, toil reduction, and incident management.
Cloud-Native Fluency: Strong technical command of AWS, container orchestration, Terraform (IaC), CI/CD, and modern observability stacks.
Financial & Resource Stewardship: Direct experience owning cloud infrastructure budgets and successfully driving meaningful cost optimization and forecasting.
AI/ML Infrastructure: Familiarity with the reliability requirements for modern AI workloads, such as model serving, vector search, and data pipeline integrity.
Executive Presence: Ability to engage the C-suite on risk trade-offs and transformation progress, paired with a "builder mentality" and a drive to solve complex, high-stakes problems.
What You'll Be Working On
Transforming the Reliability Operating Model: You will assess Filevine's current reliability posture with fresh eyes and define what "good" looks like for our next stage of growth. That means redesigning how the reliability organization operates — clarifying ownership boundaries, reducing toil, and building the self-service platforms that allow feature engineering teams to own their services with confidence. You will move us from a model where reliability is a bottleneck to one where it is a true force multiplier.
Building and Leading a High-Performing Team: You will develop the people and leadership bench across DevOps, SRE, DBRE, and Tooling. That means investing in your managers and tech leads, establishing clear career paths, and building a culture where reliability engineers take genuine pride in their craft. You will make smart decisions about where headcount creates leverage and where structural or tooling improvements are the better investment.
Establishing SLOs and a Reliability-Velocity Framework: You will introduce SLOs, SLIs, and error budgets as a shared language across engineering and product — giving teams a principled way to make trade-off decisions between shipping fast and staying stable. This isn't about slowing things down; it's about making risk visible and giving teams the tools to make informed decisions.
Owning Cloud Infrastructure Cost and Capacity: You will turn cloud cost management into an active discipline — driving optimization, building forecasting rigor, and creating real accountability across engineering. You will also lead capacity planning and disaster recovery strategy, ensuring Filevine can meet the contractual and operational expectations of enterprise customers.
Partnering on Security and Compliance: You will serve as the primary engineering interface with the CISO, translating compliance requirements across FedRAMP, SOC 2, CJIS, ISO, and other frameworks into pragmatic engineering decisions. You will bring credibility and clear judgment to risk trade-off conversations at the executive level — helping the business invest in the right places and manage risk proportionally.
Enabling Filevine's AI-Native Platform: As Filevine transitions to an AI-native architecture, you will ensure our infrastructure evolves to meet it — including the reliability patterns, failure modes, and scaling demands introduced by AI/ML workloads, vector search, and agentic systems. You will work closely with the Reliability Architect and Platform leadership to make reliability a foundation for LOIS, not an afterthought.