
What Belongs In A Dataset
| Case type | Use it for |
|---|---|
| Golden examples | Canonical inputs that should keep passing. |
| Edge cases | Ambiguous, long, malformed, unsafe, or uncommon requests. |
| Production examples | Real spans copied from Monitor when users expose a useful case. |
| Regression cases | Previously observed failures that should be re-tested before the next release. |
| Synthetic cases | Generated variants that broaden coverage around a Behavior or Improve cycle. |
How Datasets Work
Dataset columns usually map to prompt variables. If a prompt expects {{request_genre}}, the dataset should have a request_genre column. Extra columns can hold expected output, labels, notes, IDs, or evaluator context.
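As a minimal sketch of this mapping, the following Python snippet extracts {{variable}} placeholders from a prompt template and reports which ones have no matching dataset column. The helper names (prompt_variables, missing_columns), the sample template, the reader_age variable, and the placeholder regex are assumptions for illustration, not part of the Platform's API.

```python
import re

# Assumption: placeholders look like {{name}}, optionally with inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def prompt_variables(template: str) -> set[str]:
    """Return the set of variable names referenced in a prompt template."""
    return set(PLACEHOLDER.findall(template))


def missing_columns(template: str, columns: list[str]) -> set[str]:
    """Return prompt variables that have no matching dataset column."""
    return prompt_variables(template) - set(columns)


# Hypothetical template and column list.
template = "Recommend a {{request_genre}} book for a reader aged {{reader_age}}."
columns = ["request_genre", "expected_output", "notes"]

print(sorted(missing_columns(template, columns)))  # prints ['reader_age']
```

Extra columns such as expected_output and notes are ignored by the check, since only prompt variables need a matching column.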
Columns can be:
| Column type | Use it when |
|---|---|
| Static | The value is typed, imported, or copied into the dataset. |
| Dynamic API | The value should be fetched from your API per row. |
| Dynamic prompt | The value should be generated by another prompt in the project. |
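One way to picture the difference between the three column types is to treat static columns as stored values and dynamic columns as functions evaluated once per row. The sketch below is a conceptual model only: resolve_row, lookup_customer_tier, summarize_request, and all column names are hypothetical, and the two resolver functions are local stand-ins for an API call and a prompt call rather than real Platform behavior.

```python
from typing import Callable

Row = dict[str, str]
# A dynamic column is modeled as a function that computes a value from the row.
Resolver = Callable[[Row], str]


def lookup_customer_tier(row: Row) -> str:
    """Stand-in for a per-row API call (hypothetical; no network access)."""
    return "premium" if row["customer_id"].startswith("P") else "standard"


def summarize_request(row: Row) -> str:
    """Stand-in for a value generated by another prompt (hypothetical)."""
    return f"Summary of: {row['request_text'][:20]}"


def resolve_row(static: Row, dynamic: dict[str, Resolver]) -> Row:
    """Copy the static values, then fill each dynamic column from its resolver."""
    resolved = dict(static)
    for name, resolver in dynamic.items():
        resolved[name] = resolver(static)
    return resolved


row = resolve_row(
    {"customer_id": "P-104", "request_text": "Cancel my subscription today"},
    {"customer_tier": lookup_customer_tier, "request_summary": summarize_request},
)
print(row)
```

Each resolver receives only the static values here, so dynamic columns in this sketch cannot depend on one another.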
Common Workflows
Create or import rows
Add rows manually, upload a CSV, or copy a useful production span from Monitor.
Match columns to prompt variables
Make sure required prompt variables have matching dataset columns.
Add expected output or labels
Store what the evaluator should check, what the reviewer should notice, or why the case matters.
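The steps above can be sketched offline by preparing a CSV before uploading it. In this hedged example, request_genre matches the prompt variable, while expected_output, case_type, and notes are hypothetical extra columns; the exact CSV layout the Platform accepts is not specified here, so treat the header names as assumptions.

```python
import csv
import io

# request_genre feeds the prompt variable; the other columns are hypothetical
# extras that hold evaluator and reviewer context.
FIELDNAMES = ["request_genre", "expected_output", "case_type", "notes"]

rows = [
    {
        "request_genre": "mystery",
        "expected_output": "Recommends a mystery title",
        "case_type": "golden",
        "notes": "Canonical input that should keep passing.",
    },
    {
        "request_genre": "",
        "expected_output": "Asks the user to specify a genre",
        "case_type": "edge",
        "notes": "Empty genre; checks graceful handling.",
    },
]


def rows_to_csv(rows: list[dict[str, str]]) -> str:
    """Serialize dataset rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the reason a case matters in a notes column means the rationale travels with the row when the file is imported.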
Where Datasets Fit
Datasets connect the rest of the Platform:
- Prompts use dataset rows as repeatable inputs.
- Evaluators score prompt responses against dataset cases.
- Monitor turns real production spans into dataset rows.
- Behaviors reveal repeated patterns worth preserving as coverage.
- Improve uses linked and generated datasets to compare prompt candidates before review.
Next Steps
Set up a dataset
Create a dataset, add rows, and map columns to prompt variables.
Import CSV into dataset
Bulk-import text, image, or PDF test cases.
Use multimodal cells
Add text, image, and PDF values to dataset rows.
Use dynamic columns
Fetch row values from APIs or other prompts.
Build datasets from logs
Preserve useful production spans as test cases.
Evaluate prompts
Run prompts against datasets and evaluators.