ServiceNow recherche

Large Language Models

Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models
Abstention Ability (AA) is a critical aspect of Large Language Model (LLM) reliability, referring to an LLM’s capability to …
AgentMerge: Enhancing Generalization in Fine-Tuned LLM Agents
Recent advancements in large language models (LLMs) have spurred interest in developing autonomous agents capable of performing complex …
Context is Key: A Benchmark for Forecasting with Essential Textual Information
Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks …
Evaluating Interventional Reasoning Capabilities of Large Language Models
Numerous decision-making tasks require estimating causal effects under interventions on different parts of a system. As practitioners …
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data …
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference …
Change Is the Only Constant: Dynamic LLM Slicing based on Layer Redundancy
This paper introduces a novel model compression approach through dynamic layer-specific pruning in Large Language Models (LLMs), …
Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences
Direct Preference Optimization (DPO) is an effective technique that leverages pairwise preference data (usually one chosen and rejected …
XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference
In-context learning (ICL) approaches typically leverage prompting to condition decoder-only language model generation on reference …
An Ecosystem for Web Agents: WorkArena, BrowserGym, AgentLab and more
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those …