Monitors use LLM judges to passively score production traffic, surfacing trends and issues in your LLM applications. For example, you can monitor your application’s responses for correctness or helpfulness, or monitor user input to identify trends in what users ask your agents about. Monitors automatically store all scoring results in Weave’s database, so you can analyze historical trends and patterns. You can monitor text, images, and audio in your application’s input and output.

Monitors require no code changes to your application: set them up using the W&B Weave UI. If you need to actively intervene in your application’s behavior based on scores, use guardrails instead.

Enable preset signals

Signals are preset classifier monitors that automatically score production traces for common quality issues and error categories. Each signal uses a benchmarked LLM prompt to assign a binary label (true/false), along with a confidence score and reasoning. Signals require no prompt engineering or scorer configuration: enable them from the Monitors page to start classifying traces immediately. Signals use a W&B Inference model to score traces, so no external API keys are required.

Available signals

Weave provides 13 preset signals organized into two groups.

Quality signals

Quality signals evaluate successful root-level traces for output quality and safety issues.
| Signal | What it detects |
|---|---|
| Hallucination | Fabricated facts or claims that contradict the provided input context |
| Low quality | Responses with poor format, insufficient effort, or incomplete content |
| User frustration | Signs of user frustration such as repeated questions, negative sentiment, or complaints |
| Jailbreaking | Prompt injection and jailbreak attempts that try to bypass safety guidelines |
| NSFW | Explicit, violent, or otherwise inappropriate content in inputs or outputs |
| Lazy | Low-effort responses such as excessive brevity, refusals to help, or deferred work |
| Forgetful | Failure to use context from earlier in the conversation, ignoring previously stated facts or instructions |

Error signals

Error signals categorize failed traces by root cause to help you identify and resolve infrastructure and application issues.
| Signal | What it detects |
|---|---|
| Network Error | DNS failures, timeouts, connection resets, and other connectivity issues |
| Ratelimited | HTTP 429 responses, quota exhaustion, and throttling from upstream APIs |
| Request Too Large | Requests exceeding size or token limits, such as context window exceeded |
| Bad Request | Client-side errors where the server rejected the request (4xx except 429) |
| Bad Response | Invalid, unexpected, or unusable responses from remote services (5xx) |
| Bug | Flaws in application code such as KeyError, TypeError, or logic errors |
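As a rough illustration of how these categories partition failures, a rule-of-thumb classifier might look like the sketch below. Note that this is purely illustrative: Weave's actual error signals classify failed traces with an LLM judge, not hand-written rules.

```python
# Illustrative only: a hypothetical rule-based mapping from an HTTP status
# code or a Python exception to the error-signal categories above.
# Weave's real signals use an LLM judge instead of rules like these.

def classify_error(status_code=None, exception=None):
    if exception is not None and isinstance(exception, (KeyError, TypeError)):
        return "Bug"
    if exception is not None and isinstance(exception, (TimeoutError, ConnectionError)):
        return "Network Error"
    if status_code == 429:
        return "Ratelimited"
    if status_code == 413:
        return "Request Too Large"
    if status_code is not None and 400 <= status_code < 500:
        return "Bad Request"
    if status_code is not None and 500 <= status_code < 600:
        return "Bad Response"
    return "Bug"  # default: treat unexplained failures as application flaws
```

An LLM judge handles the long tail that rules like these miss, such as a 200 response whose body reveals quota exhaustion.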

Enable signals from the Monitors page

To enable signals:
  1. Open the W&B UI and then open your Weave project.
  2. From the Weave side-nav, select Monitors.
  3. At the top of the Monitors page, a row of suggested signal cards appears. Each card shows the signal name, a description, and an Enable button.
  4. To enable a single signal, select the Enable button on the signal card. The signal begins scoring new traces immediately.
  5. To enable multiple signals at once, select the Add signals button. This opens a drawer that lists all available signals grouped by category (Quality and Error). Select the signals you want to enable, then select Apply.
After enabling signals, Weave scores incoming traces and stores the results as feedback on each Call object. View signal results in the Traces tab by selecting a trace and reviewing the feedback panel.

Manage active signals

To view or remove active signals:
  1. From the Monitors page, select the Manage signals button (gear icon). This opens a drawer listing all currently active signals grouped by category.
  2. Hover over a signal and select the Remove button (trash icon) to disable the signal.
Removing a signal stops scoring new traces. Existing scores from the signal are preserved.

How signals work

Each signal uses an LLM-as-a-judge approach to classify traces:
  1. Trace selection: Quality signals evaluate successful root-level traces. Error signals evaluate failed traces. Child spans and intermediate calls are not scored.
  2. Prompt construction: Weave constructs a prompt that includes the trace metadata, inputs, outputs, exception details (if any), and the operation’s source code. The signal’s classifier prompt is appended with instructions for the specific issue to detect.
  3. LLM scoring: A W&B Inference model evaluates the trace and returns a structured JSON response with:
    • A binary classification (whether the issue was detected)
    • A confidence score (0.0 to 1.0)
    • A reason citing specific evidence from the trace
  4. Result storage: Results are stored as feedback on the Call object and are queryable from the Traces tab.
When multiple signals from the same group (Quality or Error) are active, Weave batches the signals into a single LLM call for efficiency. The model evaluates all active classifiers in one pass and returns results for each.
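A batched judge response for the Quality group might parse like the sketch below. The exact schema is an assumption for illustration; the documentation only specifies that each signal returns a classification, a confidence score, and a reason.

```python
import json

# Hypothetical batched judge output: one entry per active signal, each with
# the three fields described above (classification, confidence, reason).
raw = """
{
  "hallucination": {"detected": false, "confidence": 0.92,
                    "reason": "All claims appear in the input context."},
  "lazy": {"detected": true, "confidence": 0.81,
           "reason": "The response defers the work to the user."}
}
"""

results = json.loads(raw)

# Collect signals that fired with high confidence.
flagged = [name for name, r in results.items()
           if r["detected"] and r["confidence"] >= 0.8]
print(flagged)  # -> ['lazy']
```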

Signals compared to custom monitors

|  | Signals | Custom monitors |
|---|---|---|
| Configuration | One-click enable, no prompt writing | Full control over scoring prompt, model, and parameters |
| Scope | Preset quality and error classifiers | Any evaluation criteria you define |
| Trace selection | Automatic (successful root traces for quality, failed traces for errors) | Configurable operations, filters, and sampling rate |
| Model | W&B Inference (preset) | Any commercial or W&B Inference model |
| Use case | Quick production monitoring with proven classifiers | Custom evaluation criteria specific to your application |
Use signals to get started with production monitoring quickly, then create custom monitors for evaluation criteria specific to your application.

How to create a monitor in Weave

To create a monitor in Weave:
  1. Open the W&B UI and then open your Weave project.
  2. From the Weave side-nav, select Monitors and then select the + New Monitor button. This opens the Create new monitor modal dialog.
  3. In the Create new monitor menu, configure the following fields:
    • Name: Must start with a letter or number. Can contain letters, numbers, hyphens, and underscores.
    • Description (Optional): Explain what the monitor does.
    • Active monitor toggle: Turn the monitor on or off.
    • Calls to monitor:
      • Operations: Choose one or more @weave.ops to monitor. You must log at least one trace that uses the op before it appears in the list of available ops.
      • Filter (Optional): Narrow down which calls are eligible (for example, by max_tokens or top_p).
      • Sampling rate: The percentage of calls to score (0% to 100%).
        A lower sampling rate reduces costs, since each scoring call has an associated cost.
    • LLM-as-a-judge configuration:
      • Scorer name: Must start with a letter or number. Can contain letters, numbers, hyphens, and underscores.
      • Score Audio: Filters the available LLM models to display only audio-enabled models, and opens the Media Scoring JSON Paths field.
      • Score Images: Filters the available LLM models to display only image-enabled models, and opens the Media Scoring JSON Paths field.
      • Judge model: Select the model to score your ops. The menu contains commercial LLM models you have configured in your W&B account, as well as W&B Inference models. Audio-enabled models have an Audio Input label beside their names. For the selected model, configure the following settings:
        • Configuration name: A name for this model configuration.
        • System prompt: Defines the judging model’s role and persona, for example, “You are an impartial AI judge.”
        • Response format: The format of the judge’s response, such as json_object or plain text.
        • Scoring prompt: The evaluation task used to score your ops. You can reference prompt variables from your ops in your scoring prompts. For example, “Evaluate whether {output} is accurate based on {ground_truth}.”
      • Media Scoring JSON Paths: Specify JSONPath expressions (RFC 9535) to extract media from your trace data. If no paths are specified, all scorable media from user messages will be included. This field appears when you enable Score Audio or Score Images.
  4. Once you have configured the monitor’s fields, click Create monitor. This adds the monitor to your Weave project. When your code starts generating traces, you can review the scores in the Traces tab by selecting the monitor’s name and reviewing the data in the resulting panel.
You can also compare and visualize the monitor’s trace data in the Weave UI, or download it in various formats (such as CSV and JSON) using the download button in the Traces tab. Weave automatically stores all scorer results in the Call object’s feedback field.
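Scoring-prompt variables like {output} and {ground_truth} are filled in from the monitored call's inputs and outputs. The sketch below illustrates the substitution with Python's built-in str.format; Weave's internal templating may differ.

```python
# Sketch of how a scoring prompt's {variables} map onto a monitored call.
# Weave performs this substitution internally when it invokes the judge.
scoring_prompt = (
    "Evaluate whether {output} is accurate based on {ground_truth}."
)

# Hypothetical inputs/output captured from one traced call.
call_inputs = {"ground_truth": "The Earth revolves around the Sun."}
call_output = "The Sun revolves around the Earth."

rendered = scoring_prompt.format(output=call_output, **call_inputs)
print(rendered)
```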

Example: Create a truthfulness monitor

The following example creates a monitor that evaluates the truthfulness of generated statements.
  1. Define a function that generates statements. Some statements are truthful, others are not:
import weave
import random
import openai

weave.init("my-team/my-weave-project")

client = openai.OpenAI()

@weave.op()
def generate_statement(ground_truth: str) -> str:
    # Half of the time, ask the model for an intentionally incorrect
    # statement; otherwise return the ground truth verbatim.
    if random.random() < 0.5:
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=[
                {
                    "role": "user",
                    "content": f"Generate a statement that is incorrect based on this fact: {ground_truth}"
                }
            ]
        )
        return response.choices[0].message.content
    else:
        return ground_truth

generate_statement("The Earth revolves around the Sun.")
  2. Run the function at least once to log a trace in your project. This makes the op available for monitoring in the W&B UI.
  3. Open your Weave project in the W&B UI and select Monitors from the side-nav. Then select New Monitor.
  4. In the Create new monitor menu, configure the fields using the following values:
    • Name: truthfulness-monitor
    • Description: Evaluates the truthfulness of generated statements.
    • Active monitor: Toggle on.
    • Operations: Select generate_statement.
    • Sampling rate: Set to 100% to score every call.
    • Scorer name: truthfulness-scorer
    • Judge model: o3-mini-2025-01-31
    • System prompt: You are an impartial AI judge. Your task is to evaluate the truthfulness of statements.
    • Response format: json_object
    • Scoring prompt:
      Evaluate whether the output statement is accurate based on the input statement.
      
      This is the input statement: {ground_truth}
      
      This is the output statement: {output}
      
      The response should be a JSON object with the following fields:
      - is_true: a boolean stating whether the output statement is true or false based on the input statement.
      - reasoning: your reasoning as to why the statement is true or false.
      
  5. Click Create monitor. This adds the monitor to your Weave project.
  6. In your script, invoke your function using statements of varying degrees of truthfulness to test the scoring function:
generate_statement("The Earth revolves around the Sun.")
generate_statement("Water freezes at 0 degrees Celsius.")
generate_statement("The Great Wall of China was built over several centuries.")
  7. After running the script using several different statements, open the W&B UI and navigate to the Traces tab. Select any LLMAsAJudgeScorer.score trace to see the results.
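Because the scoring prompt requests a JSON object with is_true and reasoning fields, each score stored by the monitor can be consumed programmatically. The payload below is illustrative, showing how such a result might be parsed:

```python
import json

# Illustrative judge response following the schema requested in the
# truthfulness monitor's scoring prompt (is_true + reasoning).
judge_response = """
{
  "is_true": false,
  "reasoning": "The output claims the Sun revolves around the Earth, which contradicts the input statement."
}
"""

score = json.loads(judge_response)
if not score["is_true"]:
    print("Flagged as untruthful:", score["reasoning"])
```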