Evaluating the prompt

  • Release version: Zurich
  • Updated July 31, 2025
  • 2 minutes to read

    Prompt evaluation is an ongoing process: it takes place while you develop the prompt and again after the prompt is complete.

    Prompt evaluation overview

    To determine how effective your prompt is, evaluate it on batches of test data. Copy the model-generated responses and perform the evaluation outside of Now Assist Skill Kit.

    During prompt development

    Ongoing, informal evaluation should take place alongside the development of the prompt, so that you can adapt the prompt based on the model outputs you observe. It may be tempting to test a change to a prompt against just one or two examples. However, to avoid reacting to noise, you should look at larger batches and consider the statistical significance of the performance differences that you observe.

    Figure: Chart that shows a comparison of prompt performance.
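
    One way to check whether a difference between two prompt versions is more than noise is an exact paired sign test (McNemar's exact test) on per-example pass/fail results. The following is a minimal sketch, assuming both versions were scored on the same examples; the function name `exact_mcnemar_p` and the pass/fail lists are hypothetical and are not part of Now Assist Skill Kit.

```python
from math import comb


def exact_mcnemar_p(a_correct, b_correct):
    """Two-sided exact McNemar (sign) test on paired pass/fail results.

    a_correct and b_correct are hypothetical lists of booleans: whether
    prompt version A and prompt version B passed on the same example.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both prompts must be scored on the same examples")

    # Only examples where the two prompts disagree carry information.
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference

    # Probability of a split at least this lopsided if both prompts
    # were equally good (each disagreement is a fair coin flip).
    k = min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Two examples, both favoring version A: not convincing on its own.
    print(exact_mcnemar_p([True, True], [False, False]))
    # Ten examples, all favoring version A: much stronger evidence.
    print(exact_mcnemar_p([True] * 10, [False] * 10))
```

    With only two disagreements the p-value is 0.5, which illustrates why one or two examples cannot separate a real improvement from noise.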

    Final performance evaluation

    Before you deploy a skill, you should test the prompt on a representative batch of data that was isolated from the development process, that is, “test” data. The reason to use isolated test data is a phenomenon known as prompt overfitting. Iteratively editing a prompt based on the model outputs generated on the same data that is used for testing can lead to significant overestimates of performance, because the prompt can become overspecialized to the specific examples used in development. Even though the effect is typically less dramatic than what occurs when fitting machine learning model parameters to a test dataset, it is rooted in the same underlying principles and should be avoided.
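
    A simple way to keep test data isolated is to split the available records once, before prompt development starts, and to leave the test portion untouched until the final evaluation. The following is a minimal sketch of such a split; the function name `split_dev_test`, the 30% test fraction, and the fixed seed are illustrative assumptions, not recommendations from the product.

```python
import random


def split_dev_test(records, test_fraction=0.3, seed=0):
    """Shuffle records once and return (development, test) lists.

    The fixed seed makes the split reproducible, so the same records
    stay in the held-out test set across prompt iterations.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")

    shuffled = list(records)  # copy, so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)

    # Hold out at least one record for testing.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Hypothetical record identifiers standing in for real test inputs.
    dev, test = split_dev_test([f"record-{i}" for i in range(10)])
    print(len(dev), "development records;", len(test), "held-out test records")
```

    Iterate on the prompt using only the development records, and run the test records once the prompt is considered final.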

    Evaluation metrics

    Selecting the right metrics for evaluation is an important consideration. The following list provides a few approaches, each of which may be more or less appropriate depending on the use case.

    • Classification-based assessment of short generations

      This approach requires labeled records, and it works best when the labels are short, well-defined “right answers,” for example, true or false, multiple-choice, or category selection. In these cases, the model outputs can usually be parsed into the same format as the labels, and metrics such as precision, recall, and F1 score can then be calculated directly.

    • Assessment of longer generations

      Many of the most interesting generative AI use cases require longer model generations for which there are many possible “right answers.” In these cases, human evaluators can score the output along several different axes, for example:

      • Faithfulness

        Is the generated text faithful to the context provided in the skill prompt? (The opposite of faithfulness is hallucination, in which the model injects information that is not present in the context.)

      • Correctness

        Is the generated text correct relative to the skill instruction?

      • Helpfulness

        Is the generated text helpful relative to the task that the skill wants to accomplish? (Helpfulness is subjective but it’s important to try to measure. Doing so properly requires a solid understanding of the needs of the people who will ultimately be using the skill.)

      • Fluency

        Is the generated text grammatically correct? Does it have any typos, problems with coherence, and so on?

      Note:
      It’s useful to score these properties on a scale, like 1-5, rather than with yes or no.
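
    Both approaches in the list above can be sketched in a few lines. The example below assumes that the model outputs were copied out of the tool as plain strings and that human ratings were recorded as one dictionary per response; the function names, the label normalization, and the axis keys are hypothetical choices made for illustration.

```python
from statistics import mean

# Hypothetical axis names, mirroring the list above.
AXES = ("faithfulness", "correctness", "helpfulness", "fluency")


def normalize_label(raw):
    """Parse a short model output such as ' True.' into the label 'true'."""
    return raw.strip().strip(".").lower()


def precision_recall_f1(expected, predicted, positive="true"):
    """Precision, recall, and F1 score for one label treated as positive."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if p == positive and e == positive)
    fp = sum(1 for e, p in pairs if p == positive and e != positive)
    fn = sum(1 for e, p in pairs if p != positive and e == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_rubric_scores(ratings):
    """Average the 1-5 human scores on each axis across all responses."""
    return {axis: mean(r[axis] for r in ratings) for axis in AXES}


if __name__ == "__main__":
    # Short generations: labeled records versus raw model outputs.
    labels = ["true", "true", "false", "false"]
    outputs = ["True.", " true", "true", "False"]
    predicted = [normalize_label(o) for o in outputs]
    print(precision_recall_f1(labels, predicted))

    # Longer generations: one dictionary of 1-5 ratings per response.
    ratings = [
        {"faithfulness": 4, "correctness": 5, "helpfulness": 3, "fluency": 5},
        {"faithfulness": 2, "correctness": 5, "helpfulness": 4, "fluency": 5},
    ]
    print(mean_rubric_scores(ratings))
```

    Averaging each axis separately keeps a low faithfulness score visible instead of letting it be hidden by high fluency scores.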