Damien Henry
June 4, 2026
Jasper Research releases MONET, the largest open text-image dataset (104.9M samples), plus nano-t2i, a codebase to train a competitive T2I model on a single GPU.

Jasper Research is releasing MONET, the largest open image–text dataset to date. It was built from 2.9 billion images and refined to 104.9 million high-quality samples. The launch comes with nano-t2i, a minimal codebase for training a competitive diffusion model from scratch on a single GPU in a couple of days.
Together, these give researchers everything they need to train production-grade text-to-image models without the prohibitive cost and complexity that have long gatekept the field.
Resources
Curious how Jasper Research is used in production? Discover our APIs
AI image generators like DALL-E, Stable Diffusion, and Midjourney can conjure almost anything from a text description. But training these models requires enormous collections of high-quality images paired with detailed descriptions. Building such a collection is expensive and time-consuming.
Until now, that meant only a handful of well-resourced AI labs could train truly competitive image models. Existing open datasets like LAION-5B were huge but messy: full of duplicates, low-quality images, harmful content, and short, uninformative captions scraped from the web. More curated alternatives existed but were either too small for serious pre-training or kept proprietary.
Reproducibility gap. Academic researchers and smaller companies couldn't train models that matched closed-source commercial systems. Not because the techniques were secret, but because the training data was.
MONET bridges this gap. It's the first openly released, filtered, deduplicated, and multi-captioned dataset designed specifically for pre-training large text-to-image models at scale. It's free to use for commercial purposes under the Apache 2.0 license.
Think of MONET as a giant filtering funnel. The team started with 2.9 billion images gathered from the open web and ran them through six carefully designed stages to keep only the best.
The challenge? Every filtering decision involves trade-offs. Be too strict and you lose rare, valuable images. Be too lenient and you pollute the dataset with junk that degrades model quality.
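To get a feel for how aggressive the funnel is, here is a small back-of-the-envelope sketch. It treats the 2.9 billion and 104.9 million figures as the funnel's input and output. The per-stage number assumes all six stages discard the same share of images, which is purely an illustrative assumption; the post does not give per-stage counts.

```python
# Figures quoted in the post: 2.9 billion source images, 104.9 million kept.
SOURCE_IMAGES = 2_900_000_000
KEPT_SAMPLES = 104_900_000
NUM_STAGES = 6


def retention_rate(kept: int, total: int) -> float:
    """Fraction of the source pool that survives the whole funnel."""
    return kept / total


def uniform_stage_keep_rate(overall: float, stages: int) -> float:
    """Per-stage keep rate IF every stage kept the same share (an assumption)."""
    return overall ** (1 / stages)


overall = retention_rate(KEPT_SAMPLES, SOURCE_IMAGES)
per_stage = uniform_stage_keep_rate(overall, NUM_STAGES)

print(f"overall retention: {overall:.2%}")
print(f"per-stage keep rate under the equal-stages assumption: {per_stage:.1%}")
```

Overall, fewer than 4 in 100 source images make it through. Real stages almost certainly differ in strictness, so the per-stage figure is only a way to picture the scale of the cut.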

To understand the dataset's coverage, the team classified all 104.9M images using CLIP, a model that can match images to text labels without being trained on those specific labels (zero-shot classification). The result is a surprisingly balanced distribution across the subjects that matter most for real-world image generation.
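The idea behind CLIP-style classification can be sketched in a few lines: embed the image and each candidate label into the same vector space, then pick the label with the highest cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real CLIP embeddings, and the label prompts are hypothetical, not the ones used for MONET.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_label(image_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(
        label_embeddings,
        key=lambda label: cosine(image_embedding, label_embeddings[label]),
    )


# Toy vectors standing in for real CLIP text embeddings (hypothetical values).
label_embeddings = {
    "a photo of wildlife": [0.9, 0.1, 0.0],
    "a photo of food": [0.1, 0.9, 0.1],
    "a piece of digital art": [0.0, 0.2, 0.9],
}

# Toy vector standing in for one image's CLIP embedding.
image_embedding = [0.2, 0.8, 0.1]

predicted = zero_shot_label(image_embedding, label_embeddings)
print(predicted)
```

In a real pipeline the embeddings would come from a CLIP image encoder and text encoder, and adding a category only requires writing a new label prompt, not retraining.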

MONET dataset distribution: (left) YOLO-based content classification, (middle) CLIP-based content classification, (right) Qwen3-VL-8B-Instruct based image style.
Unlike some image datasets that are dominated by a single category (e.g., mostly product photos or stock imagery), MONET spans a genuine breadth of human visual culture, from street scenes and wildlife to digital art and food. This diversity is what makes it possible to train a general-purpose image model from MONET alone.
Here's a counter-intuitive insight from recent AI research: the quality of the text descriptions matters at least as much as the quality of the images themselves. A stunning photograph paired with a vague caption teaches the model much less than a good photo paired with a rich, detailed description.
Original web captions are typically short, noisy alt-text like "photo.jpg" or "beautiful sunset". MONET replaces these with AI-generated descriptions from four different vision-language models (VLMs), each offering a different perspective and level of detail.
Why four captioners instead of one? Using a single AI to describe everything creates blind spots. Mix multiple captioners and you get more robust, varied descriptions that generalize better to user prompts.

Every image in MONET ships with up to five captions: the original web caption plus one from each of the four VLMs. During training, a caption is sampled randomly, exposing the model to the full range of prompt styles it will encounter in the real world.
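A minimal sketch of that sampling step might look like the following. The field names are hypothetical (the real MONET schema may differ), and drawing uniformly among the available captions is an assumption; the post only says a caption is sampled randomly.

```python
import random

# Hypothetical field names; the real MONET schema may differ.
CAPTION_FIELDS = ["web_caption", "vlm_1", "vlm_2", "vlm_3", "vlm_4"]


def sample_caption(record, rng):
    """Pick one caption uniformly at random among those present on the record."""
    available = [record[name] for name in CAPTION_FIELDS if record.get(name)]
    if not available:
        raise ValueError("record has no caption to sample")
    return rng.choice(available)


record = {
    "web_caption": "beautiful sunset",
    "vlm_1": "A sunset over a calm sea.",
    "vlm_2": "An orange sun sinks below the horizon, reflecting on gentle waves.",
    "vlm_3": None,  # a caption may be missing, hence "up to five"
    "vlm_4": "Wide shot of a beach at dusk with warm orange and purple tones.",
}

rng = random.Random(0)  # seeded so the example is repeatable
drawn = [sample_caption(record, rng) for _ in range(1000)]
print(len(set(drawn)), "distinct captions seen across 1000 draws")
```

Because a fresh caption is drawn each time the image is visited, the model sees the same picture described tersely in one epoch and in rich detail in another.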
One of MONET's most interesting design decisions is mixing real and AI-generated images. Synthetic data can fill coverage gaps and improve alignment, but too much creates problems.
The team ran a systematic experiment: train the same image model with varying proportions of synthetic data and measure quality using the FID score (Fréchet Inception Distance; lower means more realistic images). The results reveal a clear optimum around the 50% mark.
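For reference, the standard definition of FID compares the statistics of Inception-network features for real and generated images. Here μ_r and Σ_r are the feature mean and covariance for real images, and μ_g and Σ_g are the same quantities for generated images; these symbols are the usual notation, not taken from the post.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Two identical feature distributions give a FID of 0, and the score grows as the generated images drift away from the real ones in either average appearance or variety.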

The catastrophic jump at 100% synthetic (FID = 15.0 vs. ~7–8 for real-data mixtures) illustrates the "AI eating itself" problem: when a model trains only on images generated by other AI models, quality degrades rapidly as errors amplify through the feedback loop.
MONET's 13% synthetic ratio sits comfortably in the beneficial zone, improving text-image alignment without incurring the risks of synthetic data saturation.
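One simple way to realize a target synthetic proportion during training is to decide, slot by slot, whether to draw from the synthetic pool or the real pool. The sketch below is an illustration under that assumption; the function name and toy pools are hypothetical, and the post does not describe how the team actually built its mixtures.

```python
import random


def mixed_batch(real, synthetic, synthetic_fraction, batch_size, rng):
    """Draw a batch where each slot is synthetic with probability synthetic_fraction."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be between 0 and 1")
    batch = []
    for _ in range(batch_size):
        pool = synthetic if rng.random() < synthetic_fraction else real
        batch.append(rng.choice(pool))
    return batch


# Toy pools standing in for real and AI-generated samples.
real = [f"real_{i}" for i in range(100)]
synthetic = [f"synthetic_{i}" for i in range(100)]

rng = random.Random(0)  # seeded so the example is repeatable
batch = mixed_batch(real, synthetic, synthetic_fraction=0.13, batch_size=10_000, rng=rng)

observed = sum(item.startswith("synthetic") for item in batch) / len(batch)
print(f"observed synthetic share: {observed:.3f}")
```

Sweeping `synthetic_fraction` from 0.0 to 1.0 and training one model per setting is the shape of the experiment described above, with 1.0 corresponding to the fully synthetic case.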
All this curation work is only valuable if it actually produces better AI models. To validate MONET, the team trained a 4-billion-parameter image model exclusively on this dataset and measured it against existing commercial and research models.

The results on GenEval — a benchmark that tests whether a model can accurately depict objects, colors, counts, and spatial relationships described in a prompt — are striking. MONET's 4B model outperforms much larger models like DALL-E 3 and FLUX.1 Dev (12B parameters), despite training exclusively on open data.

A 4.1B MONET model scores 0.74 on GenEval and 85.56 on DPG, competitive with models 3–5× its size trained on closed, proprietary datasets. This suggests that data quality and curation can matter more than raw data scale.
On the DPG benchmark (which tests longer, more complex prompts), MONET's model is similarly competitive, ranking above DALL-E 3, SD3, and FLUX.1 Dev — systems backed by enormous research teams and confidential training data. The remaining gap with the very largest models (Qwen-Image at 20B, Z-Image at 6B) is largely attributable to parameter count and likely additional fine-tuning data, not MONET's intrinsic quality.
Building MONET meant moving billions of images through an iterative pipeline against a hard NeurIPS deadline. Xet-backed Buckets gave exactly the layer we were missing: fast, mutable storage where re-processing a dataset only ever paid for the bytes that actually changed. We could stage terabytes, overwrite freely, and stream shards straight into training, and then promote the finished, annotated dataset to a clean public Dataset repo. It let a small team ship a 105M-pair open dataset at a scale that used to require a dedicated infra team.
nano-t2i is a minimal text-to-image diffusion training repository built specifically for the MONET dataset. If you want to go from dataset to a working T2I model without wading through a large codebase, this is the fastest path.
The codebase makes it super easy to get started:

Training time: roughly 1 day on a single H200 to get good results, or about 3 hours on 8× H200s to reach similar quality.
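The two training options work out to the same compute budget, which a quick calculation makes explicit. This sketch takes the approximate figures above at face value (24 hours on one GPU, 3 hours on eight), so treat the result as a rough estimate.

```python
def gpu_hours(num_gpus: int, wall_clock_hours: float) -> float:
    """Total compute spent, in GPU-hours."""
    return num_gpus * wall_clock_hours


single_gpu = gpu_hours(num_gpus=1, wall_clock_hours=24)  # about 1 day on one H200
eight_gpu = gpu_hours(num_gpus=8, wall_clock_hours=3)    # about 3 hours on 8x H200

speedup = 24 / 3          # wall-clock speedup from going to 8 GPUs
efficiency = speedup / 8  # 1.0 would mean perfectly linear scaling

print(f"single GPU: {single_gpu} GPU-hours")
print(f"eight GPUs: {eight_gpu} GPU-hours")
print(f"scaling efficiency: {efficiency:.0%}")
```

Both setups come to about 24 GPU-hours, so on these numbers adding GPUs buys wall-clock time without costing extra compute.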
Open research thrives on shared foundations. MONET and nano-t2i are Jasper Research's contribution to that foundation: a carefully curated, commercially licensed dataset and a minimal training codebase that together remove the two biggest barriers to competitive text-to-image research, data and complexity.
Jasper Research · NeurIPS 2026
Blog post written May 2026. All data sourced from numbers reported in the paper.





