navdeepgill
ServiceNow Employee

At ServiceNow, building enterprise-grade AI goes beyond choosing powerful models. It requires a structured, responsible, and repeatable process that translates research into solutions that minimize costly workflow downtime and drive real-world business value.

 

This four-part series takes you inside our AI Software Development Lifecycle (AI SDLC), highlighting the methods and principles that guide how we evaluate, train, align, validate, and deploy GenAI models across the Now Platform.

 

You’ll learn how we:

 

  • Scope models based on business priorities and platform requirements
  • Tune with targeted data to optimize performance and align behavior with platform standards
  • Deploy responsibly at scale—monitoring, measuring, and continuously improving

Let’s begin with Part 1…

 

Part 1: Pick the Right Model Before You Train: How AI Model Development Begins at ServiceNow

As organizations increasingly rely on AI to drive productivity and innovation, it’s critical that the models powering these systems are built with care, clarity, and purpose. At ServiceNow, the AI model development lifecycle begins long before a model is trained or deployed; it starts with rigorous planning, research, and evaluation.

 

In this first post of our four-part series, we’ll explore how ServiceNow scopes, vets, and prepares candidate models for development, ensuring that the right choices are made before any code is written.

 

Laying the Groundwork: Why Planning Matters

Developing an AI model isn’t just about picking a high-performing system from a leaderboard. It’s about aligning technical capabilities with business needs, ensuring legal and ethical compliance, and building a foundation for ongoing development. That’s why the first phase of ServiceNow’s lifecycle is entirely focused on research, vetting, and planning.

 

Landscape Scanning and Model Discovery

Between major model releases, our teams continuously scan the landscape of academic research, open-source projects, and proprietary systems. We evaluate new models based on several key dimensions:

 

  • Reported performance on standard benchmarks
  • Model architecture and size
  • Context window capabilities
  • Licensing and usage constraints
  • Relevance to platform and domain-specific needs

Our research team typically screens dozens of models each quarter, but only a select few advance to feasibility testing. This disciplined scanning process ensures we stay ahead of emerging advancements while focusing our resources on the most promising candidates.
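
To make these screening dimensions concrete, here is a minimal sketch of how such a paper screen over published model metadata could be expressed in code. The field names, thresholds, and license allowlist are illustrative assumptions, not ServiceNow’s actual criteria.

```python
from dataclasses import dataclass

# Illustrative screening record for one externally published model.
# Fields and thresholds are hypothetical, not ServiceNow's actual criteria.
@dataclass
class CandidateModel:
    name: str
    params_b: float       # model size in billions of parameters
    context_window: int   # maximum context length in tokens
    license: str          # e.g., "apache-2.0", "mit", "research-only"
    benchmark_avg: float  # reported average on standard benchmarks (0-100)

ALLOWED_LICENSES = {"apache-2.0", "mit"}  # assumed acceptable for enterprise use
MIN_CONTEXT = 8_192                       # assumed minimum context window
MIN_BENCHMARK = 60.0                      # assumed reported-performance floor

def passes_screen(m: CandidateModel) -> bool:
    """Return True if the model clears the initial paper screen."""
    return (
        m.license in ALLOWED_LICENSES
        and m.context_window >= MIN_CONTEXT
        and m.benchmark_avg >= MIN_BENCHMARK
    )

candidates = [
    CandidateModel("model-a", 7, 32_768, "apache-2.0", 68.4),
    CandidateModel("model-b", 13, 4_096, "research-only", 72.1),
]
shortlist = [m.name for m in candidates if passes_screen(m)]
print(shortlist)  # ['model-a'] -- model-b fails on license and context window
```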

 

Preliminary Evaluation and Feasibility Testing

Once promising models are identified, they undergo preliminary evaluation using a controlled suite of tests designed to validate claimed capabilities on neutral data the models haven’t encountered before. For example, a model may need to achieve a ROUGE-L score of 80 or higher on our ITSM summarization benchmark to advance.

 

The focus is not on perfect accuracy, but on understanding each model’s baseline abilities, how well it generalizes, and whether it shows promise for further tuning. Evaluation results are documented and reviewed to inform next steps.
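
As an illustration of the kind of gate described above, the following minimal sketch scores model summaries with the open-source rouge-score package and applies the 80-point threshold. The example inputs, the 0–100 scaling, and the pass/fail logic are assumptions for illustration, not the actual evaluation harness.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

THRESHOLD = 80.0  # minimum ROUGE-L (scaled to 0-100) required to advance

def mean_rouge_l(references: list[str], predictions: list[str]) -> float:
    """Average ROUGE-L F-measure over a summarization evaluation set."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical evaluation pair: a gold summary vs. a model-generated summary.
refs = ["User cannot access VPN after password reset; ticket escalated to network team."]
preds = ["After a password reset the user lost VPN access, so the ticket was escalated to networking."]

score = mean_rouge_l(refs, preds)
print(f"ROUGE-L: {score:.1f} -> {'advance' if score >= THRESHOLD else 'reject'}")
```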

 

Initial Legal and Licensing Review

Before any development work begins, models are reviewed to ensure their usage complies with licensing terms and regulatory expectations. This includes:

 

  • Verifying that models are appropriately licensed for enterprise use
  • Identifying any restrictions on downstream usage or data handling
  • Flagging potential risks related to intellectual property, compliance, or other business factors that could hinder suitability for production use

This legal check allows development to proceed responsibly, while flagging any cases that may require deeper legal involvement later.

 

Setting Release Goals with Product Teams

AI development at ServiceNow is goal-driven. Technical teams work closely with product management to define what the next model release should achieve, whether that’s unlocking new capabilities, improving quality, or enhancing performance.

 

For example, in the Yokohama release, the ServiceNow SLM incorporated Text2Flow, which previously ran on a separate model. This consolidation simplified model architecture and reduced complexity in production.

 

Release goals like these typically focus on:

 

  • New capabilities (e.g., supporting new languages or use cases)
  • Improved quality in specific domains or languages
  • Performance and efficiency enhancements

This collaboration ensures tight alignment between engineering, product, and design from the very beginning.

 

Gathering and Prioritizing Requirements

With high-level goals in place, requirements are gathered from across the organization. These may include:

 

  • Dependency requests from feature teams
  • Observed gaps in prior model releases
  • Emerging customer needs or feedback

Requirements are then formalized and prioritized based on business value, technical feasibility, and alignment with the platform’s strategic direction. This prioritization shapes which models are selected and where development effort is focused.
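
For illustration, the sketch below ranks hypothetical requirements with a simple weighted score over the three criteria above. The weights, scoring scale, and requirement names are assumptions, not the actual prioritization model.

```python
# Hypothetical weights over the three prioritization criteria (must sum to 1.0).
WEIGHTS = {"business_value": 0.5, "feasibility": 0.3, "strategic_alignment": 0.2}

# Illustrative requirements scored 1-5 on each criterion.
requirements = [
    {"name": "Expand language coverage", "business_value": 5, "feasibility": 3, "strategic_alignment": 4},
    {"name": "Close summarization gaps from prior release", "business_value": 4, "feasibility": 5, "strategic_alignment": 4},
    {"name": "Support a new feature-team dependency", "business_value": 3, "feasibility": 4, "strategic_alignment": 3},
]

def priority(req: dict) -> float:
    """Weighted sum of the scored criteria for one requirement."""
    return sum(WEIGHTS[k] * req[k] for k in WEIGHTS)

for req in sorted(requirements, key=priority, reverse=True):
    print(f"{priority(req):.2f}  {req['name']}")
```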

 

Selecting Candidate Models

Selecting candidate models for development is a critical phase in the model lifecycle. This step ensures that promising models are identified and thoroughly evaluated before investment. At a minimum, the selection process includes:

 

  • The incumbent model – the current production model, used as a baseline for comparison
  • One or more new candidate models – selected based on eligibility and potential for improvement

This multi-candidate approach supports a competitive, data-driven evaluation process that helps identify the most suitable model for further development.

 

Assessing Base Abilities

To evaluate candidate models, a comprehensive assessment of their base abilities is performed. This assessment uses a mix of public, academic, industry-standard, and internal ServiceNow-specific benchmarks. These benchmarks include evaluation datasets with structured questions and verifiable answers to measure performance across various dimensions.

 

Categories of Abilities Evaluated:

  • Basic Abilities:
    • Linguistic understanding
    • Reasoning and commonsense
    • General knowledge (breadth and depth)
    • Domain-specific expertise (e.g., math, scientific reasoning, coding)
    • Content moderation, security, and truthfulness
  • Advanced Abilities:
    • Complex reasoning
    • Instruction following
    • Conversational fluency
    • Multilingual capabilities
  • ServiceNow-Specific Criteria:
    • Chat summarization
    • Case summarization
    • KB article generation

Evaluation Approach:

Model outputs are compared against:

  • Ground-truth (verifiable) answers
  • Judging models and rubrics designed for qualitative and quantitative analysis

Assessment Objectives:

  1. Detect Regressions
    Ensure there is no decline in performance compared to the incumbent model based on industry benchmarks (a minimal sketch of this check follows the list below).
  2. Validate Against Internal Requirements
    Confirm that ServiceNow-specific needs, especially formatting and domain alignment, are met.
  3. Uncover Gaps and Improvement Potential
    Highlight discrepancies between current and desired performance, establishing a baseline and identifying limitations for future improvement.
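
To make objective 1 concrete, here is a minimal sketch that compares a candidate’s benchmark scores against the incumbent’s and flags declines beyond a small tolerance. The benchmark names, scores, and tolerance are hypothetical, chosen only to show the shape of the check.

```python
# Illustrative scores (0-100) for the incumbent and one candidate model.
INCUMBENT = {"mmlu": 68.2, "gsm8k": 61.5, "chat_summarization": 74.0, "kb_article_gen": 70.3}
CANDIDATE = {"mmlu": 71.0, "gsm8k": 60.1, "chat_summarization": 76.5, "kb_article_gen": 69.8}

TOLERANCE = 1.0  # allowed drop (in points) before a benchmark counts as a regression

def find_regressions(incumbent: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return benchmarks where the candidate trails the incumbent by more than the tolerance."""
    return {
        name: round(candidate[name] - incumbent[name], 1)
        for name in incumbent
        if candidate[name] < incumbent[name] - TOLERANCE
    }

regressions = find_regressions(INCUMBENT, CANDIDATE)
if regressions:
    print("Regressions detected:", regressions)  # e.g. {'gsm8k': -1.4}
else:
    print("No regressions beyond tolerance; candidate may advance.")
```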

 

Conclusion: Building with Intention

Before a single line of training code is executed, ServiceNow invests in thorough research, vetting, and goal setting to ensure AI models are built with clarity and purpose. This foundation is what enables us to innovate responsibly and deliver meaningful outcomes to our customers.

In our next post, we’ll walk through how candidate models are tuned, evaluated, and refined through rigorous experimentation.

 

Coming up next: “Tuning the Core: From Candidate Models to Capable Systems.”
Download our Responsible AI whitepaper to explore our approach in more depth.

Comments
Rich2
ServiceNow Employee

I would love to see a guide on moving Now Assist configurations from a dev instance to production. Any special considerations, gotchas to be aware of, etc.? This would be a great help to many, and I have been unable to find a resource. Thanks.
