Overview

Datasets are structured test cases for your prompts. Each row is one case. Each column is a value the prompt, evaluator, or review workflow can use. Use a dataset when a single hand-picked Playground run is not enough and you want to test a prompt against many cases: golden examples, CSV imports, multimodal inputs, production logs, generated cases, and known regressions.

Dataset table showing manually entered rows and columns

What Belongs In A Dataset

Case type	Use it for
Golden examples	Canonical inputs that should keep passing.
Edge cases	Ambiguous, long, malformed, unsafe, or uncommon requests.
Production examples	Real spans copied from Monitor when real usage exposes a useful case.
Regression cases	Failures that should be tested before the next release.
Synthetic cases	Generated variants that broaden coverage around a Behavior or Improve cycle.

Keep rows specific. A dataset row should make it clear what input is being tested and what good output looks like.

How Datasets Work

Dataset columns usually map to prompt variables. If a prompt expects {{request_genre}}, the dataset should have a request_genre column. Extra columns can hold expected output, labels, notes, IDs, or evaluator context. Each column is one of three types:

Column type	Use it when
Static	The value is typed, imported, or copied into the dataset.
Dynamic API	The value should be fetched from your API per row.
Dynamic prompt	The value should be generated by another prompt in the project.
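
One way to picture the difference between static and dynamic columns is as a per-row resolution step: static values are stored as-is, and dynamic values are computed for each row. The sketch below is not the Platform's API. The functions `lookup_plan`, `summarize_request`, and `resolve_row`, and the column names `user_id`, `request`, `plan`, and `summary`, are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable

Row = dict[str, str]

def lookup_plan(row: Row) -> str:
    """Stand-in for a per-row API call (hypothetical; no network access here)."""
    plans = {"u1": "pro", "u2": "free"}
    return plans.get(row["user_id"], "unknown")

def summarize_request(row: Row) -> str:
    """Stand-in for a value generated by another prompt (hypothetical)."""
    return row["request"][:20]

def resolve_row(static: Row, dynamic: dict[str, Callable[[Row], str]]) -> Row:
    """Return a copy of the row with every dynamic column filled in."""
    resolved = dict(static)
    for name, resolver in dynamic.items():
        # Each dynamic column is computed from the row's static values.
        resolved[name] = resolver(static)
    return resolved

row = resolve_row(
    {"user_id": "u1", "request": "Cancel my subscription today"},
    {"plan": lookup_plan, "summary": summarize_request},
)
print(row)
```

In this sketch the static columns are typed in once, while `plan` and `summary` are recomputed for every row, which mirrors the static versus dynamic distinction in the table above.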

Dataset cells can hold text, images, or PDFs. Use multimodal cells when your prompt consumes files or visual context.
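
The column-to-variable mapping described in this section can be checked with a few lines of code before a run. The following is a minimal local sketch, assuming prompt variables use the `{{name}}` syntax shown above. The helper names `prompt_variables` and `missing_columns`, the example template, and the `viewer_age` variable are hypothetical and not part of the Platform.

```python
import re

# Matches {{variable_name}} placeholders, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")

def prompt_variables(template: str) -> set[str]:
    """Return the set of variable names referenced in a prompt template."""
    return set(PLACEHOLDER.findall(template))

def missing_columns(template: str, columns: list[str]) -> set[str]:
    """Return prompt variables that have no dataset column with the same name."""
    return prompt_variables(template) - set(columns)

# Hypothetical prompt and dataset columns, for illustration only.
template = "Recommend a {{request_genre}} film for a viewer aged {{viewer_age}}."
columns = ["request_genre", "expected_output", "notes"]

print(sorted(missing_columns(template, columns)))  # → ['viewer_age']
```

Extra columns such as `expected_output` and `notes` are ignored by the check, since only variables the prompt requires need a matching column.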

Common Workflows

Create or import rows

Add rows manually, upload a CSV, or copy a useful production span from Monitor.

Match columns to prompt variables

Make sure required prompt variables have matching dataset columns.

Add expected output or labels

Store what the evaluator should check, what the reviewer should notice, or why the case matters.

Run evaluations

Use the dataset with one or more evaluators to score prompt output.

Keep useful failures

Promote important production failures or Improve evidence into long-lived regression coverage.
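
The steps above can be approximated locally to see how the pieces relate. This is a sketch under stated assumptions, not the Platform's evaluation engine: the CSV contents, the `fake_model` stand-in for a prompt run, and the `exact_match` evaluator are all hypothetical.

```python
import csv
import io

# Hypothetical CSV import: one row per test case, one column per value.
CSV_TEXT = """id,request_genre,expected_output,notes
1,comedy,comedy-pick,golden example
2,,ASK_FOR_GENRE,edge case: empty genre
3,horror,horror-pick,known regression
"""

def load_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text into a list of rows keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def fake_model(request_genre: str) -> str:
    """Stand-in for a real prompt run; replace with your own model call."""
    canned = {"comedy": "comedy-pick"}
    return canned.get(request_genre, "ASK_FOR_GENRE")

def exact_match(output: str, expected: str) -> bool:
    """Trivial evaluator: pass only when output equals the expected value."""
    return output.strip() == expected.strip()

rows = load_rows(CSV_TEXT)

# Score every row, keyed by its id column.
results = {
    row["id"]: exact_match(fake_model(row["request_genre"]), row["expected_output"])
    for row in rows
}

# Failing rows are the ones worth keeping as regression coverage.
failures = [row["id"] for row in rows if not results[row["id"]]]

print(results)   # → {'1': True, '2': True, '3': False}
print(failures)  # → ['3']
```

Row 3 fails because the stand-in model has no answer for that genre, which is the kind of case the last step suggests keeping until the prompt handles it.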

Where Datasets Fit

Datasets connect the rest of the Platform:

Prompts use dataset rows as repeatable inputs.
Evaluators score prompt responses against dataset cases.
Monitor turns real production spans into dataset rows.
Behaviors reveal repeated patterns worth preserving as coverage.
Improve uses linked and generated datasets to compare prompt candidates before review.

The goal is not to build the biggest dataset. The goal is to keep the examples that make release decisions clearer.

Next Steps

Set up a dataset

Create a dataset, add rows, and map columns to prompt variables.

Import CSV into dataset

Bulk-import text, image, or PDF test cases.

Use multimodal cells

Add text, image, and PDF values to dataset rows.

Use dynamic columns

Fetch row values from APIs or other prompts.

Build datasets from logs

Preserve useful production spans as test cases.

Evaluate prompts

Run prompts against datasets and evaluators.
