navdeepgill

In Part 1 of this series, we explored how ServiceNow rigorously scopes and evaluates AI models before any training begins—focusing on planning, feasibility testing, legal review, and requirement setting. This foundational work ensures we choose the right models for the right reasons.

 

Now, in Part 2, we unpack how those models are tuned, evaluated, and refined for production readiness, combining academic rigor with enterprise-grade operational practices.

 

After candidate models are identified and preliminarily vetted, the next critical phase in the AI Software Development Lifecycle at ServiceNow is model tuning. This phase transforms promising candidates into high-performing, production-ready systems.

 

But tuning at ServiceNow isn’t just about improving benchmark scores; it’s a rigorous, iterative process grounded in empirical evaluation, responsible data management, and structured experimentation, aimed at enhancing performance for specific use cases while preserving the overall reliability, safety, and security of responses.

 

Assess Base Abilities

To understand a model's inherent capabilities before applying any fine-tuning, we benchmark its base performance across three key categories:

 

1. Generic and Domain-Specific Capabilities

These are evaluated using a mix of public, academic, and industry-standard benchmarks that test foundational skills. Key areas of focus include:

 

  • Math and scientific reasoning
  • Code generation and domain-specific knowledge
  • Linguistic fluency, common-sense reasoning, and general knowledge breadth

These core tasks are highly predictive of a model’s overall reasoning ability and quality.

 

2. ServiceNow-Specific Criteria

Internal benchmarks are used to measure alignment with platform and product-specific needs, such as:

 

  • Output formatting (e.g., generating accurate JSON responses with English keys and localized values; supporting complex, schema-aligned outputs for enterprise use cases such as Flow Designer, code snippets, Cypher queries, and agent notes across multiple languages); a minimal check of this kind is sketched after this list
  • Performance on ServiceNow platform tasks (e.g., strong results on proprietary benchmarks for Agent Assist, Text2Flow, Text2Code, and Text2Cypher; demonstrated alignment with customer workflows and domain-specific schema; consistent performance across multilingual enterprise tasks in Glide, Flow Designer, and content moderation pipelines)

These criteria ensure models are evaluated for tasks directly relevant to our use cases.
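
To make the output-formatting criterion above concrete, here is a minimal, illustrative check for JSON responses with English keys and localized values. The expected keys and helper name are assumptions for the sketch, not ServiceNow’s actual evaluation harness.

    import json

    # Hypothetical expected keys for a structured response: English keys, localized values.
    EXPECTED_KEYS = {"title", "summary", "next_steps"}

    def check_formatting(raw_response: str) -> dict:
        """Score one model response against basic structured-output rules."""
        result = {"parses": False, "has_expected_keys": False, "keys_are_ascii": False}
        try:
            payload = json.loads(raw_response)
        except json.JSONDecodeError:
            return result
        if not isinstance(payload, dict):
            return result
        result["parses"] = True
        result["has_expected_keys"] = EXPECTED_KEYS.issubset(payload.keys())
        # Keys must stay English/ASCII even when values are localized.
        result["keys_are_ascii"] = all(key.isascii() for key in payload)
        return result

    # German values with English keys should pass every check.
    sample = '{"title": "VPN-Ausfall", "summary": "Wiederholte Abbrüche", "next_steps": "Client aktualisieren"}'
    print(check_formatting(sample))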

 

3. Trust and Safety

We assess model behavior across key safety dimensions, including:

 

  • Content Moderation
    Assessed using internal benchmarks that measure the model’s ability to flag unsafe content across categories like bias, toxicity, and jailbreak prompts, scored on precision, recall, F1, and false positive rate (a minimal metric sketch follows this list).
  • Truthfulness
    Measured using the TruthfulQA benchmark, which evaluates how reliably the model provides factually accurate responses over misleading or fabricated ones.
  • Security
    Evaluated on a large-scale adversarial prompt set covering jailbreaks, prompt injection, and role-playing attacks. Metrics focus on the model’s ability to resist exploitation while maintaining appropriate filtering.
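
As a concrete, deliberately simplified illustration of the content-moderation metrics listed above, the sketch below computes precision, recall, F1, and false positive rate for a binary unsafe/safe flag; the toy labels are invented for the example.

    def moderation_metrics(y_true, y_pred):
        """Precision, recall, F1, and false positive rate for a binary unsafe (1) / safe (0) flag."""
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
        tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        return {"precision": precision, "recall": recall, "f1": f1, "false_positive_rate": fpr}

    # Toy example: ground-truth labels vs. model flags.
    print(moderation_metrics([1, 0, 1, 0, 1, 0], [1, 0, 0, 1, 1, 0]))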

 

Assessments serve three key goals:

  1. Prevent regressions from baseline or previous model versions
  2. Validate alignment with ServiceNow-specific requirements
  3. Identify performance gaps to guide downstream tuning efforts

Evaluation methods include automated scoring against verifiable answers, rubric-based assessments, and reviews by judge models: specialized large language models (LLMs) used to evaluate the quality, accuracy, and relevance of responses generated by other models.
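
As a rough sketch of the judge-model pattern (not ServiceNow’s internal tooling), a rubric prompt can be wrapped around a candidate response and sent to an evaluator LLM; call_judge_model and the rubric fields below are assumptions.

    import json

    RUBRIC = (
        "Rate the candidate response from 1-5 on accuracy, relevance, and safety. "
        'Reply with JSON only: {"accuracy": int, "relevance": int, "safety": int, "rationale": str}'
    )

    def call_judge_model(prompt: str) -> str:
        """Placeholder for a call to the judge LLM; a real model-serving client would go here."""
        raise NotImplementedError

    def judge_response(task_prompt: str, candidate_response: str) -> dict:
        """Ask the judge model to score one prompt-response pair against the rubric."""
        judge_prompt = (
            f"{RUBRIC}\n\nTask:\n{task_prompt}\n\nCandidate response:\n{candidate_response}"
        )
        return json.loads(call_judge_model(judge_prompt))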

 

These benchmarks provide a baseline view of model strengths and limitations, helping inform and shape the next phases of the development cycle.

 

Narrowing the Field: Selecting the Primary Candidate

Once base abilities are benchmarked, a primary candidate is chosen based on:

 

  • How well it aligns with the release goals
  • Stability and consistency across benchmarks
  • Comparative performance vs. other models, including the incumbent

The primary candidate becomes the focus of in-depth experimentation, while secondary candidates are maintained as viable backups and undergo similar training for comparative purposes.

 

Tuning Techniques: SFT and CPT

At ServiceNow, model tuning is driven by a strategic combination of techniques, selected based on the model’s architecture, observed deficiencies, and alignment with release goals.

 

Continual Pretraining (CPT)

When deeper performance improvements are required, CPT is applied as the first step. This method involves training the model on large volumes of unlabeled or weakly labeled data to build broader capabilities. CPT is compute-intensive and typically spans 2–4 weeks of elapsed time, depending on the data mixture and sequence length. For context, recent CPT phases used up to 26 nodes with H100 GPUs, processing over 9 billion tokens with sequence lengths reaching 32K tokens. Extensive iteration is often required to avoid degrading the model’s initial performance and to maintain stability across downstream tasks.
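
For a sense of scale, the figures above can be written down as a rough, illustrative run profile; the field names and the GPUs-per-node value are assumptions, not ServiceNow’s actual configuration.

    # Rough CPT run profile based on the figures cited above.
    cpt_profile = {
        "nodes": 26,                     # up to 26 H100 GPU nodes
        "gpus_per_node": 8,              # assumed typical node size
        "total_tokens": 9_000_000_000,   # over 9 billion tokens of unlabeled/weakly labeled data
        "max_sequence_length": 32_768,   # sequence lengths reaching 32K tokens
        "elapsed_weeks": (2, 4),         # typical wall-clock duration per CPT phase
    }

    # Back-of-the-envelope: number of full-length sequences the corpus would yield.
    print(cpt_profile["total_tokens"] // cpt_profile["max_sequence_length"], "sequences of 32K tokens")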

 

Supervised Fine-Tuning (SFT)

All models undergo SFT. This technique uses labeled, high-quality instruction-response pairs to refine and align model behavior. SFT is faster, more efficient, and more targeted than CPT; when CPT is used, SFT always follows it.
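
To illustrate what labeled instruction-response pairs look like in practice, here is a minimal, hypothetical SFT record and the flattened text a trainer might consume; the field names and template are assumptions for the sketch, not ServiceNow’s internal schema.

    # Hypothetical SFT record: one labeled instruction-response pair.
    sft_example = {
        "instruction": "Summarize this incident for the assigned agent.",
        "input": "User reports the VPN client drops every 30 minutes since the 10.2 update.",
        "response": "VPN disconnects recur roughly every 30 minutes and began after the 10.2 "
                    "client update; recommend rolling back or patching the client.",
    }

    def to_training_text(example: dict) -> str:
        """Flatten one pair into the prompt/response text used during fine-tuning."""
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['response']}"
        )

    print(to_training_text(sft_example))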

 

The appropriate tuning path (SFT alone, or CPT followed by SFT) is selected based on each model’s needs and the expected performance lift. Key hyperparameters such as batch size, gradient accumulation, sequence length, and number of epochs are selected and adjusted based on internal experimentation. In addition, stratified sampling is used to guide data exposure per epoch, contributing to more stable and controlled training outcomes. While not exhaustively optimized, these parameters are tuned through practical iteration to ensure effective model alignment.
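
The sketch below shows, with placeholder values, the kind of hyperparameter block and per-epoch stratified sampling described above; none of the numbers or field names reflect ServiceNow’s actual settings.

    import random
    from collections import defaultdict

    # Placeholder hyperparameters of the kind tuned through internal experimentation.
    train_config = {
        "per_device_batch_size": 8,
        "gradient_accumulation_steps": 16,
        "sequence_length": 8192,
        "num_epochs": 2,
    }

    def stratified_epoch(records, per_skill_quota, seed=0):
        """Draw a fixed number of records per skill so each epoch sees a controlled data mix."""
        rng = random.Random(seed)
        by_skill = defaultdict(list)
        for record in records:
            by_skill[record["skill"]].append(record)
        epoch = []
        for skill, quota in per_skill_quota.items():
            pool = by_skill.get(skill, [])
            epoch.extend(rng.sample(pool, min(quota, len(pool))))
        rng.shuffle(epoch)
        return epoch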

 

Refining the Dataset

ServiceNow maintains a continuously evolving, high-quality dataset repository that supports model development across releases. For each tuning cycle, datasets are enhanced through a structured process focused on performance lift, safety, and relevance.

 

1. Gap Identification

Model evaluation results and release-specific goals drive the identification of capability gaps. These insights inform what new data is required and where enhancements should focus.

 

2. Data Sourcing

New data is drawn from three main sources:

 

  • Open-Source Data
    Open-source datasets undergo a rigorous permissibility check, guided by legal counsel, to verify licensing terms, data provenance, method of generation, and content legality. As with all datasets, whether open-source, synthetic, vendor-provided, or proprietary, they also pass a legal review covering licensing, usage rights, and enterprise deployment standards.
  • Vendor-Purchased Data
    Commercially acquired datasets come pre-validated but are still subjected to ServiceNow’s internal quality checks to ensure consistency, safety, and value.
  • Synthetic Data
    Synthetic data is generated through a custom instruction-based framework that evolves and refines prompts to mimic realistic scenarios, including those resembling customer data. The process relies entirely on prompt-driven generation, producing a large volume of instruction-completion pairs (typically 10 to 20 million), which are then distilled down to a high-quality subset of approximately 500,000 examples. This method is especially effective for addressing underrepresented skills or edge-case scenarios.
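
The generate-then-distill flow in the synthetic-data step can be sketched as follows; generate_completion and score_quality are placeholders for model calls, and the whole snippet is illustrative rather than ServiceNow’s actual framework.

    def generate_completion(prompt: str) -> str:
        """Placeholder for the generator model call."""
        raise NotImplementedError

    def score_quality(prompt: str, completion: str) -> float:
        """Placeholder for a judge-model quality score."""
        raise NotImplementedError

    def build_synthetic_set(seed_prompts, target_size=500_000):
        """Generate instruction-completion pairs, then keep only the highest-scoring subset."""
        candidates = []
        for prompt in seed_prompts:  # in practice, prompts are first evolved and refined
            completion = generate_completion(prompt)
            candidates.append((prompt, completion, score_quality(prompt, completion)))
        candidates.sort(key=lambda item: item[2], reverse=True)  # distill by quality
        return candidates[:target_size]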

3. Data Preparation

Each dataset, regardless of source, undergoes refinement through a standardized pipeline:

 

  • Reannotation
    Prompts are reused, but completions are regenerated using a preferred, high-quality model. This boosts consistency, safety, and overall output quality, particularly when evolving from smaller to larger model checkpoints.
  • Quality Filtering
    Judge models evaluate prompt-completion pairs, flagging and removing unsafe, low-quality, or incoherent samples.
  • Deduplication
    Near-duplicate records are removed to reduce overfitting and maintain dataset diversity.
  • Rebalancing
    Skills and use cases that are underrepresented are explicitly targeted. This is achieved through:
    • Up-sampling in the dataset (replicating records), or
    • Up-sampling during training (increasing sampling frequency in each epoch)

The result is a finely tuned, balanced, and compliant dataset, optimized not only for performance, but also for trust, safety, and domain relevance. All data sources, including open-source, synthetic, and vendor-provided, undergo legal and compliance review to ensure suitability for enterprise use, including redistribution, model training, and commercial deployment.
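
Put together, a preparation pass like the one above can be approximated by the simplified sketch below; the record fields, the naive hash-based duplicate key, and the score threshold are assumptions, not ServiceNow’s pipeline.

    import hashlib

    def duplicate_key(text: str) -> str:
        """Crude near-duplicate key: lowercase, collapse whitespace, hash."""
        return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()

    def prepare(records, min_quality=0.8, upsample=None):
        """Quality-filter, deduplicate, then up-sample underrepresented skills by replication."""
        upsample = upsample or {}
        seen, prepared = set(), []
        for record in records:
            if record["quality_score"] < min_quality:       # judge-model quality filtering
                continue
            key = duplicate_key(record["prompt"] + record["completion"])
            if key in seen:                                  # deduplication
                continue
            seen.add(key)
            copies = upsample.get(record.get("skill"), 1)    # rebalancing by replication
            prepared.extend([record] * copies)
        return prepared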

 

The Experimentation Cycle

Model training at ServiceNow is a structured, iterative process, not a one-off task. Each tuning cycle follows a disciplined sequence designed to continuously refine and improve performance:

 

  1. Train the model using the latest curated dataset, based on a defined strategy
  2. Evaluate against internal and public benchmarks using automated and rubric-based methods
  3. Analyze results to identify quality gaps or misalignments with release goals
  4. Refine the training approach; this may involve updating the dataset, changing the training configuration, or both
  5. Repeat the process until the model either meets the quality bar or further improvements plateau. Success is defined by a combination of internal and public benchmark performance, alignment with defined business metrics, and qualitative evaluations such as instruction-following, tone, and output safety. A model is considered ready when it consistently meets or exceeds these thresholds across priority use cases and evaluation scenarios.
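
In code form, the cycle above reduces to a simple loop; train, evaluate, and refine_strategy below are placeholders for the real training and evaluation pipelines, and the plateau logic is only a sketch of the stopping conditions.

    def train(model, dataset, strategy): ...                # placeholder: launch a training run
    def evaluate(checkpoint): ...                            # placeholder: internal + public benchmarks
    def refine_strategy(scores, dataset, strategy): ...      # placeholder: update data and/or config

    def tuning_cycle(model, dataset, strategy, quality_bar, max_cycles=10, patience=2):
        """Iterate train -> evaluate -> analyze -> refine until the bar is met or gains plateau."""
        best_overall, stalled, checkpoint = None, 0, None
        for _ in range(max_cycles):
            checkpoint = train(model, dataset, strategy)
            scores = evaluate(checkpoint)                    # e.g. {"overall": 0.82, "safety": 0.95}
            if all(scores[metric] >= quality_bar[metric] for metric in quality_bar):
                return checkpoint                            # quality bar met
            if best_overall is not None and scores["overall"] <= best_overall:
                stalled += 1
                if stalled >= patience:
                    break                                    # improvements have plateaued
            else:
                best_overall, stalled = scores["overall"], 0
            dataset, strategy = refine_strategy(scores, dataset, strategy)
        return checkpoint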

Training may occur on ServiceNow infrastructure or on approved and vetted third-party platforms, with checkpoints stored on GPU-accessible systems to support ongoing evaluation and iteration.

 

Evaluating Progress and Selecting Final Candidates

After several tuning cycles, the best-performing models are shortlisted for deeper evaluation and release consideration. Selection goes beyond benchmark performance and is based on a comprehensive review that includes:

 

  • Robustness across diverse prompts and inputs – Models are tested across a wide range of input types, including edge cases, ambiguous language, and inconsistent formatting, to ensure consistent and resilient behavior.
  • Generalization to new tasks, domains, and formats – Evaluation includes the model’s ability to adapt to novel instructions, emerging domains, and varying structural formats without requiring task-specific tuning.
  • Compatibility with downstream Now Platform use cases – Models are assessed for their effectiveness in real-world workflows, ensuring they align with platform requirements and deliver value in practical enterprise scenarios.
  • Bias and fairness assessment through automated and human-in-the-loop evaluation – Model outputs are analyzed to detect patterns of bias or exclusion, ensuring that responses are inclusive, balanced, and aligned with fairness principles.

Once a candidate list is formed, final selections are made collaboratively, including decisions on which model(s) to move forward and whether any complementary prompt refinements are needed prior to release.

 

Conclusion: Engineering for Intelligence

Model development at ServiceNow is driven by data, iteration, and precision. Our goal isn’t to chase benchmarks; it’s to develop intelligent systems that solve real problems, integrate seamlessly with our platform, and behave reliably at scale.

 

Stay tuned for Blog 3: “Alignment and Assurance: Preparing Models for Production”, where we’ll explore how human alignment techniques, like preference modeling and safety tuning, are used to shape model behavior, and how validation frameworks ensure that every model meets ServiceNow’s standards for reliability, inclusiveness, and platform readiness.

 

If you missed the earlier post(s), be sure to check them out:

Part 1 | Inside the AI SDLC at ServiceNow

Download our Responsible AI whitepaper to explore our approach in more depth.
