# Model Grading System
We grade our models on a simple 1.0 to 5.0 scale.
The point of the score is practical: how good is this model under normal use? Not whether it can produce one cherry-picked hero image, but whether it holds together across ordinary prompts, poses, and compositions.
## Allowed scores
We only use full points and half points:
1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0
We do not use quarter-points or arbitrary decimals.
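If you track these grades in tooling, the rule is easy to enforce mechanically. Here is a minimal Python sketch; the function name is hypothetical, not part of any tooling we ship:

```python
def is_valid_grade(score: float) -> bool:
    """Return True if score is a full or half point between 1.0 and 5.0."""
    # Work in half-point units so the comparison stays exact
    # (all half points are exactly representable as floats).
    half_points = score * 2
    return half_points == int(half_points) and 2 <= half_points <= 10

assert is_valid_grade(3.5)        # allowed half point
assert not is_valid_grade(3.25)   # quarter points are not used
assert not is_valid_grade(0.5)    # below the 1.0 floor
```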
## Release threshold
- Scores below 3.0 are usually not worth releasing
- Scores 3.0 and above can be released
- Scores 4.0+ are normal release territory for strong catalog entries
That threshold matters because some models can be interesting without being dependable. We do not grade for novelty alone. We grade for usable quality.
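Expressed as code, the policy is just a pair of thresholds. A hypothetical sketch (the tier labels are ours, for illustration):

```python
def release_tier(score: float) -> str:
    """Map a grade onto the release policy above (labels are illustrative)."""
    if score >= 4.0:
        return "normal release territory"
    if score >= 3.0:
        return "releasable"
    return "usually not worth releasing"

print(release_tier(4.5))  # -> normal release territory
print(release_tier(2.5))  # -> usually not worth releasing
```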
## The scale
### 5.0 — Excellent
This is the top tier.
- excellent detail
- flexible across prompts, scenes, and poses
- strong prompt adherence
- low artifact rate
These are the models we can trust broadly. They do not need prompt gymnastics to stay on target.
### 4.0 to 4.5 — Good
These models are good, but more rigid than a true 5.
- solid detail
- generally clean generations
- some narrower behavior or prompting friction
- still dependable enough for normal release use
4.5 means the model is close to excellent, but not quite as flexible or robust as a flagship.
### 3.0 to 3.5 — Average
These models are usable, but clearly limited.
- more rigid
- sometimes underdetailed
- less dependable across prompt types
- quality issues show up often enough to matter
3.0 is the bottom of the releasable range. 3.5 is better than average, but not fully in the strong tier.
### 2.0 to 2.5 — Poor
These models are usually not worth releasing.
- weak prompt adherence
- frequent visual issues
- lightly overtrained behavior is common
- narrow usefulness even with extra prompting effort
2.5 is a caution score, not a quiet endorsement.
### 1.0 to 1.5 — Bad
This is the bottom of the scale.
- badly overtrained or unstable
- major artifacts
- poor or nonexistent prompt adherence
- not a usable release candidate
1.0 is the minimum score we assign.
## What we are actually grading
When we score a model, we are looking at:
- detail quality
- prompt adherence
- flexibility across prompt types
- anatomy reliability
- pose reliability
- artifact frequency
- signs of overtraining
The score is based on repeated normal-use testing, not a single lucky render.
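One way to keep repeated test passes comparable is to record each dimension explicitly. A hypothetical structure in Python (not tooling we ship):

```python
from dataclasses import dataclass

@dataclass
class EvaluationNotes:
    """Notes from one round of normal-use testing (hypothetical structure)."""
    detail_quality: str
    prompt_adherence: str
    flexibility: str           # behavior across prompt types
    anatomy_reliability: str
    pose_reliability: str
    artifact_frequency: str
    overtraining_signs: str
```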
## Short version
- 5.0: excellent, flexible
- 4.0: good, more rigid
- 3.0: average, rigid or underdetailed
- 2.0: poor, weak adherence or lightly overtrained
- 1.0: bad, badly overtrained, major artifacts
Half points give us extra resolution without turning the rubric into fake precision.
## Why we publish the score
The rating is there to set expectations. Some models are broad tools. Some are narrower. Some are interesting experiments that do not hold together well enough to recommend widely.
The number is a quick read on that difference.