Generating Representative Synthetic Data with LLMs - Zero to Hero

10 min read ~1.5 hour project ~$2 Beginner

This example demonstrates how you can quickly, easily, and inexpensively generate synthetic data using the Sutro Python SDK.

The Goal

Our goal today will be to generate a dataset of 20,000 high-quality, synthetic product reviews. This could be useful for:

Training/evaluating sentiment analysis models, recommendation systems, spam classifiers, and other machine learning models
Market research simulations, A/B testing, customer segmentation
… and more!

We’ll use a crawl, walk, run approach:

Start by generating a basic dataset of 100 product reviews.
Add structure and randomness (diversity) to the reviews.
Add representation to the reviews, so that they are representative of underlying real-world data.
Scale up to 20,000 reviews, seamlessly and inexpensively.

Let’s get started!

Baby Steps

First, make sure you have the Sutro Python SDK installed. This will include all dependencies required for the examples. Let’s start by creating a basic dataset of 100 product reviews.

import sutro as so
import polars as pl

system_prompt = "Generate a novel product review."
inputs = [""] * 100 # <--- Generate 100 reviews.

results = so.infer(
    inputs, 
    model="qwen-3-4b", 
    system_prompt=system_prompt
)

And that’s it! In five lines of code, we’ve generated 100 product reviews. As a prototyping (p0) job, this should take a couple of minutes to run. You should see something like the following when you run the job (GIF sped up for brevity):

Once it’s done running, we can inspect the results in the Sutro Web UI:

As you can see, we have data, but it’s not very useful (at least not yet)! There is far too much commonality between the product reviews, including the type of product being reviewed, the rating, the sentiment, and more. This can generally be expected from an LLM until we introduce further steps for obtaining structure, heterogeneity, and representativeness. So, let’s do that next.
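
One way to put a number on that commonality is to measure how often the outputs repeat. The sketch below is not part of the Sutro SDK. It assumes you have already pulled the generated reviews into a plain Python list of strings (the `reviews` list here is a hypothetical stand-in for real outputs) and reports the share of unique texts along with the most common opening phrases.

```python
from collections import Counter


def diversity_report(reviews, opening_words=3):
    """Summarize how repetitive a list of generated texts is."""
    # Lowercase and collapse whitespace so trivial differences don't count.
    normalized = [" ".join(r.lower().split()) for r in reviews]
    unique_ratio = len(set(normalized)) / len(normalized) if normalized else 0.0
    # Tally the first few words of each review to spot repeated openings.
    openings = Counter(
        " ".join(text.split()[:opening_words]) for text in normalized
    )
    return {
        "unique_ratio": unique_ratio,
        "top_openings": openings.most_common(3),
    }


# Hypothetical sample standing in for real model outputs.
reviews = [
    "I love this blender, it works great.",
    "I love this blender, it works great.",
    "I love this vacuum, very quiet.",
    "Terrible headphones, broke in a week.",
]
report = diversity_report(reviews)
print(report)
```

A low unique ratio or a single opening phrase that dominates the tally is a quick signal that the prompt needs more structure or randomness.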

Let’s Walk - Adding Structure & Randomness

To begin, let’s see if we can introduce some more diversity and structure into our reviews. We’ll do this in a few ways:

Update the system prompt to include specific fields.
Add a random numerical seed to each of the inputs to increase diversity.
Modify the temperature sampling parameter to increase randomness.
Add a Pydantic model to enforce a schema structure of the reviews we want to generate.
Increase the model size from qwen-3-4b to qwen-3-14b to sample from more world knowledge.

Let’s implement these changes now.

import sutro as so
import polars as pl
from pydantic import BaseModel

system_prompt = """
Generate a novel product review. 
Include a title, text, author, product name, 
product description, product category, 
and rating out of 5.
"""

inputs = [""] * 100

sampling_params = {
    "temperature": 1.1,
}

class ProductReview(BaseModel):
    review_title: str
    review_text: str
    review_author: str
    product_name: str
    product_description: str
    product_category: str
    rating_out_of_5: int

results = so.infer(
    inputs,
    model="qwen-3-14b", 
    system_prompt=system_prompt,
    output_schema=ProductReview,
    sampling_params=sampling_params, # <-- applies the higher temperature defined above
    random_seed_per_input=True # <-- uses a random seed for each input
)

This is a significant improvement! We now have more product diversity overall, and a consistent schema for the reviews we’ve generated. However, we can still do better. The review titles, product names/descriptions, author names, and ratings are still very similar across the reviews we’ve generated.
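
To check whether fields such as the rating are still clustered, you can tally them. This is a minimal sketch that assumes each output is available as a JSON string using the `ProductReview` field names, and that valid ratings run from 1 to 5 (the source only says "out of 5"). The `raw_outputs` list is made-up illustrative data, not real results, and `summarize_ratings` is a hypothetical helper.

```python
import json
from collections import Counter

REQUIRED_FIELDS = {
    "review_title", "review_text", "review_author", "product_name",
    "product_description", "product_category", "rating_out_of_5",
}


def summarize_ratings(raw_outputs):
    """Count ratings across outputs, skipping records that fail basic checks."""
    ratings = Counter()
    invalid = 0
    for raw in raw_outputs:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            invalid += 1
            continue
        if not isinstance(record, dict) or not REQUIRED_FIELDS.issubset(record):
            invalid += 1
            continue
        rating = record["rating_out_of_5"]
        # Assumption: a valid rating is an integer between 1 and 5.
        if not isinstance(rating, int) or not 1 <= rating <= 5:
            invalid += 1
            continue
        ratings[rating] += 1
    return ratings, invalid


def make_record(rating):
    """Build a placeholder record with every required field filled in."""
    record = {field: "placeholder" for field in REQUIRED_FIELDS}
    record["rating_out_of_5"] = rating
    return json.dumps(record)


# Made-up outputs: three valid records, one out-of-range rating, one bad string.
raw_outputs = [make_record(5), make_record(5), make_record(4), make_record(9), "not json"]
ratings, invalid = summarize_ratings(raw_outputs)
print(dict(ratings), "invalid:", invalid)
```

If nearly all of the valid records land on one or two rating values, that is the kind of similarity the next section addresses.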

Time to Run - Adding Representation

For most valuable use cases, we want a final dataset that is representative of real-world data. To achieve this, we want not just diversity, but diversity that follows the real-world distribution of the data we’re trying to represent. There are various levels of complexity to achieve this, but for this example we’ll take a simple approach and use two other “seed” datasets to produce the representation we’re looking for. In our example, we’re creating product reviews, so we probably need a set of products to review, right? In the real world, if you’re running an e-commerce business, you’d probably want to use your own product dataset for this. For our toy example, however, we’ll use an Amazon Products sample dataset from Hugging Face. It works well for our purposes: it contains 33,000 products with associated product names, descriptions, and prices.

This will certainly help us get more product representation, but how about reviewer representation? For that, we can use a personas dataset, in this case our very own Synthetic Humans 50k dataset.

This dataset contains 50,000 personas sampled from actual US demographics, and contains qualitative descriptions of each persona. For this example, let’s say we want to generate product reviews from 22-30 year olds living in popular US cities for the first 20,000 products in the Amazon Products dataset. To do this, we’ll need to merge the two datasets, sampling a random persona for each product. We’ll then run our previous inference job over the merged dataset. Let’s do that now.

import sutro as so
import polars as pl
from pydantic import BaseModel
from random import randint

products_df = pl.read_parquet('hf://datasets/ckandemir/amazon-products/data/train-00000-of-00001.parquet')[0:20000]

personas_df = pl.read_parquet('hf://datasets/sutro/synthetic-humans-50k/chunk_0.parquet')
personas_df = personas_df.filter(
    (pl.col('age') >= 22) & 
    (pl.col('age') <= 30) & 
    (pl.col('location').is_in([
        'New York, New York', 
        'Los Angeles, California', 
        'Chicago, Illinois', 
        'Houston, Texas', 
        'Miami, Florida', 
        'Seattle, Washington', 
        'Boston, Massachusetts', 
        'San Francisco, California',
        'Washington, D.C.',
        'Atlanta, Georgia',
        'Philadelphia, Pennsylvania',
        'Phoenix, Arizona',
        'San Diego, California',
        'San Jose, California',
        'Austin, Texas',
    ]))
)

def get_random_persona_demographic_summary():
    row = personas_df.sample(1, seed=randint(0, 1000000))
    return row['demographic_summary'][0]

random_personas = [get_random_persona_demographic_summary() for _ in range(len(products_df))]
products_df = products_df.with_columns(
    pl.Series("persona", random_personas)
)

system_prompt = """You will be given a product name, description, and price.
You will also be given a reviewer persona.
Your task is to generate a novel product review from the reviewer persona's perspective. 
Include a title, text, author, product name, product 
description, product category, and rating out of 5.
"""

class ProductReview(BaseModel):
    review_title: str
    review_text: str
    review_author: str
    product_name: str
    product_description: str
    product_category: str
    rating_out_of_5: int

results = so.infer(
    products_df[0:100],
    column=["Product Name: ", "Product Name", " Product Description: ", "Description", " Price: ", "Selling Price", " Reviewer Persona: ", "persona"],
    model="qwen-3-4b", 
    system_prompt=system_prompt,
    output_schema=ProductReview
)

As you can see, we actually went back to the qwen-3-4b model and removed our other diversity-boosting techniques, as we can gain sufficient representation with just the two seed datasets. We’re also using Sutro’s helpful column concatenation feature to assemble a single input (string) from multiple columns and literal strings.
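
To make the concatenation concrete, here is a plain-Python sketch of what we assume the `column` list does: entries that match a column name are replaced by that row’s value, and everything else is treated as a literal string. This illustrates the idea only; it is not Sutro’s actual implementation, `build_input` is a hypothetical helper, and the example row is invented.

```python
def build_input(row, parts):
    """Join literals and column values into one prompt string.

    Assumption: an entry in `parts` that is a key of `row` is a column
    reference; any other entry is literal text.
    """
    return "".join(str(row[p]) if p in row else p for p in parts)


parts = [
    "Product Name: ", "Product Name",
    " Product Description: ", "Description",
    " Price: ", "Selling Price",
    " Reviewer Persona: ", "persona",
]

# Invented example row using the column names from the code above.
row = {
    "Product Name": "Trail Mug",
    "Description": "Insulated steel camping mug",
    "Selling Price": "$14.99",
    "persona": "A 27-year-old nurse living in Seattle, Washington",
}

prompt = build_input(row, parts)
print(prompt)
```

Note that the literal labels end with a colon and a space, so they never collide with the column names themselves.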

Much, much better! We now have a wide array of products and reviewers, faithful to our underlying real-world data. We even see cases where the rating is lower because of the mismatch between the product and reviewer demographics. Now, let’s scale up our example to 20,000 product reviews!

Scaling Up

Scaling up to our 20,000 product reviews is dead simple! We just need to make a couple of changes to our code above.

... # previous code

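results = so.infer(
    products_df, # <-- we're now using the full dataset
    column=["Product Name: ", "Product Name", " Product Description: ", "Description", " Price: ", "Selling Price", " Reviewer Persona: ", "persona"], # <-- same concatenation as before
    model="qwen-3-4b", 
    system_prompt=system_prompt,
    output_schema=ProductReview,
    job_priority=1 # <-- we're now setting the job priority to 1
)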

All we need to do is remove the slicing of products_df and set the job priority to 1. Previously, our jobs were running as priority 0 (p0), which is the default for small-scale testing (see Job Priority for more details). This should take less than an hour to run and generate all 20,000 product reviews. You can inspect the progress and sample results in the Sutro Web UI, and cancel the job if the samples don’t look promising.

In this case, it took 29 minutes to run, and cost only $1.88 using Sutro! Once it’s done running, you can grab the results using the SDK, or download the results directly from the Web UI.
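
For budgeting, it can help to turn those figures into per-unit numbers. The arithmetic below uses only the values reported above ($1.88, 20,000 reviews, 29 minutes). Treat any extrapolation to larger jobs as a rough guide, since nothing here confirms that cost and runtime scale linearly.

```python
# Figures reported for this run.
total_cost_usd = 1.88
total_reviews = 20_000
runtime_minutes = 29

cost_per_review = total_cost_usd / total_reviews
cost_per_thousand = cost_per_review * 1_000
reviews_per_minute = total_reviews / runtime_minutes

print(f"Cost per review: ${cost_per_review:.6f}")
print(f"Cost per 1,000 reviews: ${cost_per_thousand:.3f}")
print(f"Throughput: {reviews_per_minute:.0f} reviews/minute")
```

That works out to a little under a hundredth of a cent per review, or roughly 690 reviews per minute for this job.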

results = so.get_job_results('job-0daebfba-ce27-462e-9d1c-0bc566238a50', include_inputs=True)
results.write_parquet('results.parquet')

Sutro will automatically unpack the JSON fields in the results into separate columns, so you can access them like any other column. You can view the resulting dataset directly on Hugging Face:

Recap

In this example, we demonstrated how you can easily create synthetic data with LLMs using the Sutro Python SDK. Our final 20,000 product review dataset was:
✅ Created using a few dozen lines of code.
✅ Representative of our underlying real-world data.
✅ Built with zero infrastructure setup.
✅ Generated in less than an hour.
✅ Produced for less than $2!

With the Sutro Python SDK, you can easily create synthetic data with LLMs for your own use cases. Try it out today by requesting access to Sutro!

Addendum

If you want to generate even more variations, you can set the n sampling parameter, which will produce n samples for each input.

sampling_params = {
    "n": 5
}

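results = so.infer(
    products_df[0:100],
    column=["Product Name: ", "Product Name", " Product Description: ", "Description", " Price: ", "Selling Price", " Reviewer Persona: ", "persona"],
    model="qwen-3-4b", 
    system_prompt=system_prompt,
    output_schema=ProductReview,
    sampling_params=sampling_params
)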

With n set to 5, each input yields 5 samples. If we applied this to the full dataset of 20,000 products, we would generate 100,000 reviews.

As you can see, this now has 5 distinct outputs, with slight variations on each review.
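
If you want to check how different those variations really are, one option is to compare them pairwise. The sketch below uses Python’s standard `difflib` module. It assumes the samples for a single input are available as a list of strings, `pairwise_similarity` is a hypothetical helper, and the three sample texts are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_similarity(samples):
    """Return the similarity ratio (0 to 1) for every pair of samples."""
    return {
        (i, j): SequenceMatcher(None, samples[i], samples[j]).ratio()
        for i, j in combinations(range(len(samples)), 2)
    }


# Invented variations standing in for the samples of one input.
samples = [
    "Great mug, keeps my coffee hot on long hikes.",
    "Great mug, keeps my coffee warm on long hikes.",
    "The lid leaks, so I would not buy it again.",
]

scores = pairwise_similarity(samples)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

Pairs with a ratio close to 1 are near-duplicates that you may want to drop before using the dataset for training or evaluation.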
