Description
The Opportunity: Build the Data & Evaluation Backbone for AI-Native Developer Workflows
This isn’t a typical DS role focused on optimizing a mature funnel. As the first Data Scientist at Guild.ai, you’ll establish the company’s “truth layer”—from product instrumentation and decision metrics to evaluation frameworks for autonomous, event-driven AI systems.
We’re tackling one of the hardest—and most important—problems in software engineering: helping developers understand, evolve, and operate complex systems using autonomous and event-driven AI. Your work will ensure we ship the right things, know whether they’re working, and continuously improve quality, reliability, and user trust.
If you thrive in ambiguity, love turning messy signals into crisp insight, and want to build the measurement culture for a 0→1 product with real technical depth, this role is for you.
What You Will Do
- Define What “Good” Means: Partner with founders, engineering, and design to define product KPIs and quality metrics—especially around AI behaviors (helpfulness, correctness, reliability, latency, cost, user trust).
- Build the Measurement Foundation: Establish event taxonomy, instrumentation standards, and core datasets. Ensure we can answer product questions quickly and confidently.
- Create AI Evaluation & Monitoring Systems: Develop offline/online evaluation approaches for agentic workflows (e.g., golden sets, human review loops, heuristic + model-based scoring, regression detection, error taxonomies).
- Run Experiments That Change Decisions: Design and analyze A/B tests and quasi-experiments; pair statistical rigor with iteration speed.
- Turn Insight into Action: Produce analyses, narratives, and recommendations that directly shape roadmap tradeoffs and product direction.
- Enable Self-Serve Analytics: Build dashboards and lightweight tooling that help the entire team understand usage, performance, and customer outcomes.
- Be a Cross-Functional Glue Layer: Work tightly with engineering on logging/telemetry, with PM on prioritization, and with GTM on customer conversations to connect product behavior to real-world impact.
- Define Data Science at Guild.ai: Establish best practices for metrics, experimentation, and decision-making frameworks that scale as the team grows.
What You Will Bring
- Strong foundations in statistics, experimentation, and causal reasoning, with a track record of driving product decisions through data
- Fluency in SQL and Python, and comfort working across the data stack (from raw events to analysis-ready datasets)
- Experience building analytics and measurement systems in a fast-moving environment (startup and/or high-ownership teams)
- Ability to translate ambiguous questions into well-scoped analyses and clear recommendations
- High judgment and crisp communication—especially when data is incomplete or messy
- A founder’s mentality: comfortable building from scratch, prioritizing ruthlessly, and owning outcomes end-to-end
Bonus Points
- Experience evaluating or monitoring LLMs / agentic systems (quality measurement, human-in-the-loop evals, regression testing, safety/reliability metrics)
- Familiarity with developer tools, infrastructure, observability, or Git-based workflows
- Comfort with modern data tooling (warehouses, dbt, orchestration, BI) and event-driven architectures
- Experience establishing experimentation and analytics culture at an early-stage startup
Benefits & Perks
- Significant equity in an early-stage, venture-backed startup
- Comprehensive health benefits (medical, dental, vision)
- Flexible PTO to ensure you have the time you need to recharge
Thank you for your interest—we can’t wait to meet you.