Best Cloud GPU for Stable Diffusion: RTX 4090 vs A10 vs L4
Stable Diffusion XL (SDXL) is the most demanding widely-deployed image generation model, and GPU choice has a large impact on both generation speed and cost per image. This guide compares three GPUs commonly available on cloud platforms — RTX 4090, A10, and L4 — across the metrics that matter for production workloads: images per hour and cost per 100 images.
VRAM Requirements
Getting VRAM sizing right is a prerequisite for everything else. Run out and you’ll either hit an out-of-memory (OOM) crash or be forced to fall back to slower CPU offloading.
Stable Diffusion 1.x and 2.x
SD 1.5 and 2.1 are lightweight by current standards:
- Minimum: 4 GB VRAM (at 512×512, half precision)
- Comfortable: 6–8 GB (1024×1024, half precision, no offloading)
Almost any GPU can handle SD 1.5 and 2.1 in FP16. Even an RTX 3060 (12 GB) has headroom to spare.
Stable Diffusion XL (SDXL)
SDXL uses a larger UNet and two text encoders (CLIP-L + OpenCLIP-G), which adds up:
- SDXL base model (FP16): ~6.5 GB for UNet + ~1.4 GB for text encoders = ~7.9 GB total
- SDXL base + refiner (FP16): Both loaded simultaneously: ~13–14 GB
- Practical minimum (base only): 8 GB, but tight — 10+ GB for stability
- Comfortable for base + refiner: 16–24 GB
With VAE decode and activation buffers, 12 GB is a practical floor for SDXL with the refiner. 24 GB gives you headroom for higher resolutions, ControlNet models, and larger batch sizes.
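A quick way to sanity-check a pipeline before provisioning is to add up the FP16 component sizes and leave headroom for activations and VAE decode. The sketch below uses the approximate figures from this section; the refiner-UNet size and the 2 GB activation reserve are rough assumptions, not measured values.

```python
# Rough SDXL VRAM budget check (FP16 figures from the section above;
# the refiner size and activation reserve are assumptions).
SDXL_FP16_GB = {
    "unet": 6.5,            # SDXL base UNet
    "text_encoders": 1.4,   # CLIP-L + OpenCLIP-G
    "refiner_unet": 6.0,    # assumed; base + refiner lands at ~13-14 GB
}

ACTIVATION_RESERVE_GB = 2.0  # VAE decode + activation buffers at 1024x1024

def fits_in_vram(components, vram_gb, reserve_gb=ACTIVATION_RESERVE_GB):
    """Return (total_weights_gb, fits) for a set of FP16 components."""
    total = sum(SDXL_FP16_GB[c] for c in components)
    return total, total + reserve_gb <= vram_gb

base_total, fits_8gb = fits_in_vram(["unet", "text_encoders"], 8)
_, fits_12gb = fits_in_vram(["unet", "text_encoders", "refiner_unet"], 12)
_, fits_24gb = fits_in_vram(["unet", "text_encoders", "refiner_unet"], 24)
```

With these figures, a base-only pipeline is marginal on an 8 GB card once activations are counted, while 24 GB comfortably holds base + refiner — matching the recommendations above.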
SDXL with ControlNet / IP-Adapter
Adding ControlNet or IP-Adapter adds another 1.5–3 GB depending on the adapter:
- SDXL + ControlNet (FP16): 10–12 GB base only, 16–18 GB with refiner
- SDXL + IP-Adapter XL: ~11 GB with base model only
For full SDXL + refiner + ControlNet pipelines, 24 GB is the practical minimum. This is where GPU choice becomes consequential.
GPU Specs Relevant to Image Generation
Image generation throughput is primarily determined by two factors: compute (TFLOPS for the UNet forward pass) and memory bandwidth (for loading model weights and KV tensors during attention).
| Spec | RTX 4090 | A10 | L4 |
|---|---|---|---|
| VRAM | 24 GB GDDR6X | 24 GB GDDR6 | 24 GB GDDR6 |
| Memory Bandwidth | 1,008 GB/s | 600 GB/s | 300 GB/s |
| FP16 TFLOPS (Tensor Core) | 82.6 | 31.2 | 30.3 |
| FP8 TFLOPS | 165.2 | — | 60.6 |
| TDP | 450 W | 150 W | 72 W |
| Architecture | Ada Lovelace | Ampere | Ada Lovelace |
The RTX 4090 has dramatically higher memory bandwidth and compute than both the A10 and L4. The A10 and L4 are similar in raw TFLOPS, but the L4 is more power-efficient and has better FP8 support. The A10 has 2× the memory bandwidth of the L4, which matters for attention-heavy pipelines at high resolution.
Benchmark: SDXL Images per Hour
The following throughput figures use SDXL generation at 1024×1024, 20 DPM++ steps, FP16 precision, and diffusers with xFormers attention. The bs=1 rows represent single-image sequential generation; batching improves GPU utilization on all three cards, as the bs=4 row shows.
| Configuration | RTX 4090 | A10 | L4 |
|---|---|---|---|
| SDXL base, bs=1, 20 steps | ~110 img/hr | ~38 img/hr | ~35 img/hr |
| SDXL base + refiner, bs=1 | ~55 img/hr | ~19 img/hr | ~18 img/hr |
| SDXL base, bs=4, 20 steps | ~320 img/hr | ~115 img/hr | ~95 img/hr |
| SDXL + ControlNet, bs=1 | ~95 img/hr | ~33 img/hr | ~30 img/hr |
| SD 1.5, bs=1, 20 steps | ~500 img/hr | ~175 img/hr | ~160 img/hr |
The RTX 4090’s throughput advantage over the A10 is approximately 2.9× at batch size 1. That is much closer to the compute ratio (82.6 / 31.2 ≈ 2.65×) than to the memory bandwidth ratio (1,008 / 600 = 1.68×), which suggests the UNet forward pass is compute-bound rather than bandwidth-bound at bs=1.
The A10 and L4 are nearly identical in image generation throughput despite the L4’s lower memory bandwidth — this suggests that at bs=1, both are limited by the compute bottleneck rather than memory bandwidth at standard resolutions. At batch size 4, the A10 pulls ahead of the L4 by ~20% due to its bandwidth advantage becoming more relevant at higher memory load.
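Checking which hardware ratio the measured speedup tracks takes only a few lines of arithmetic (spec and benchmark numbers taken from the tables above):

```python
# Which hardware ratio explains the 4090-vs-A10 speedup at bs=1?
BW_GBPS = {"rtx4090": 1008, "a10": 600, "l4": 300}
FP16_TFLOPS = {"rtx4090": 82.6, "a10": 31.2, "l4": 30.3}
IMG_PER_HR = {"rtx4090": 110, "a10": 38, "l4": 35}  # SDXL base, bs=1

measured = IMG_PER_HR["rtx4090"] / IMG_PER_HR["a10"]          # ~2.89x
compute_ratio = FP16_TFLOPS["rtx4090"] / FP16_TFLOPS["a10"]   # ~2.65x
bw_ratio = BW_GBPS["rtx4090"] / BW_GBPS["a10"]                # 1.68x

# The measured speedup sits much closer to the compute ratio,
# consistent with a compute-bound UNet forward pass at bs=1.
compute_bound = abs(measured - compute_ratio) < abs(measured - bw_ratio)
```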
Cloud Pricing and Cost per Image
Representative on-demand cloud pricing for these GPUs (single GPU; rates fluctuate, so check providers for current figures):
| GPU | Provider Examples | Typical Price |
|---|---|---|
| RTX 4090 | RunPod, Vast.ai | $0.59–0.79/hr |
| A10 | AWS (g5.xlarge), CoreWeave | $1.00–1.10/hr |
| A10G | AWS g5.xlarge on-demand | ~$1.006/hr |
| L4 | GCP (g2-standard-4), CoreWeave | $0.60–0.80/hr |
Note: the A10 and A10G are different GPUs. The A10G (used in AWS G5 instances) carries the same 24 GB of memory but has lower rated tensor throughput than the A10, making it noticeably slower for image generation. Avoid confusing the two when comparing benchmarks.
Cost per 100 Images
Using throughput from the SDXL base, batch size 1 benchmarks:
| GPU | Images/hr | $/hr | Hours per 100 img | Cost per 100 images |
|---|---|---|---|---|
| RTX 4090 | 110 | $0.69 | 0.91 | $0.63 |
| A10 | 38 | $1.05 | 2.63 | $2.76 |
| L4 | 35 | $0.70 | 2.86 | $2.00 |
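The cost column is just hourly price times hours per 100 images; a small helper makes it easy to rerun the comparison with your own provider’s rates (the prices below are the midpoints assumed in the table):

```python
def cost_per_100(price_per_hr: float, imgs_per_hr: float) -> float:
    """Dollar cost to generate 100 images at a given throughput."""
    return round(price_per_hr * 100 / imgs_per_hr, 2)

# SDXL base, bs=1 figures from the table above
rtx4090 = cost_per_100(0.69, 110)  # ~$0.63
a10     = cost_per_100(1.05, 38)   # ~$2.76
l4      = cost_per_100(0.70, 35)   # ~$2.00
```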
With batch size 4 (batch-optimized pipeline):
| GPU | Images/hr (bs=4) | $/hr | Cost per 100 images |
|---|---|---|---|
| RTX 4090 | 320 | $0.69 | $0.22 |
| A10 | 115 | $1.05 | $0.91 |
| L4 | 95 | $0.70 | $0.74 |
The RTX 4090 wins decisively on cost per image in every configuration tested. Its consumer-tier cloud pricing, combined with Ada Lovelace’s large advantages in both compute and memory bandwidth, makes it the clear choice for image generation workloads.
The L4 is the second-best option on cost per image, and unlike the consumer RTX 4090 it offers ECC memory (useful for long-running production deployments, where bit-flip errors in VRAM can produce subtly corrupted outputs). The A10 is the weakest option cost-wise and is hard to justify for image generation unless it’s the only GPU available at your cloud provider.
Optimization Techniques That Change the Math
The benchmarks above use a baseline diffusers setup. Several optimizations can significantly increase throughput:
torch.compile
PyTorch 2.0+ torch.compile on the UNet forward pass can improve throughput by 20–40% on Ada Lovelace GPUs (RTX 4090, L4) but has inconsistent results on older Ampere (A10). Compilation takes 2–5 minutes on first run but is cached for subsequent calls.
```python
unet = torch.compile(unet, mode="reduce-overhead")
```
For the RTX 4090, this alone can push SDXL base throughput from 110 to ~145 img/hr at bs=1.
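Since compilation is a one-time 2–5 minute cost, it only pays off beyond a certain job size. A quick break-even estimate, assuming the 110 → ~145 img/hr figures above and a 3-minute compile:

```python
import math

uncompiled_s = 3600 / 110   # ~32.7 s per image, baseline
compiled_s = 3600 / 145     # ~24.8 s per image with torch.compile
compile_cost_s = 180        # assumed one-time 3-minute compilation

saved_per_image_s = uncompiled_s - compiled_s   # ~7.9 s saved per image
break_even_images = math.ceil(compile_cost_s / saved_per_image_s)
```

Roughly 23 images under these assumptions: skip compilation for one-off generations, but for any sustained workload it pays for itself within minutes.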
Flash Attention / xFormers
xFormers memory-efficient attention is standard in most diffusers deployments and provides a 15–25% speedup over naive attention. Flash Attention 2 is available through PyTorch 2’s scaled dot-product attention, which recent diffusers versions use by default, and provides similar or slightly better gains. Both run on all three GPUs discussed here.
SDXL Turbo / LCM Distillation
SDXL Turbo (a 4-step distilled model) generates quality comparable to SDXL base at 20 steps, and cutting 20 steps to 4 delivers close to a 5× throughput gain:
| GPU | SDXL Turbo (4 steps, bs=1) | Cost per 100 images |
|---|---|---|
| RTX 4090 | ~500 img/hr | $0.14 |
| A10 | ~165 img/hr | $0.64 |
| L4 | ~155 img/hr | $0.45 |
If your use case tolerates SDXL Turbo quality (which is excellent for most commercial applications), the cost savings are substantial. The RTX 4090 with SDXL Turbo at $0.14/100 images is hard to beat.
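The observed multiplier lands slightly under the naive 5× because a small fixed per-image cost (text encoding, VAE decode) doesn’t shrink with step count. Treating the RTX 4090 numbers above as two data points lets you back out the split; this is an illustrative linear fit, not a measurement:

```python
# Model: per-image time = fixed_overhead + steps * per_step_cost
# Two data points from the tables above (RTX 4090, bs=1):
t20 = 3600 / 110   # SDXL base, 20 steps: ~32.7 s/image
t4 = 3600 / 500    # SDXL Turbo, 4 steps: ~7.2 s/image

per_step = (t20 - t4) / (20 - 4)   # ~1.6 s per denoising step
fixed = t20 - 20 * per_step        # ~0.8 s of fixed overhead per image

# The fixed term keeps the realized speedup just below the 5x step ratio
realized_multiplier = t20 / t4     # ~4.5x
```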
Decision Guide
Use RTX 4090 when:
- Cost per image is your primary metric
- Batch sizes are small (1–4) and you need low latency per image
- You’re running a full SDXL + refiner + ControlNet pipeline (24 GB headroom matters)
- Consumer cloud pricing is acceptable (RunPod, Vast.ai)
Use L4 when:
- You need ECC memory for production reliability (L4 is enterprise-grade)
- Power budget is constrained (72W TDP vs 450W for 4090)
- You’re deploying on GCP and want managed infrastructure
- torch.compile performance gains are important (the L4 is Ada Lovelace, where compile is most effective)
Use A10 when:
- You’re already in the AWS ecosystem (G5 instances) and switching would be disruptive
- Your workload is SD 1.5/2.x (the performance gap narrows significantly)
- You need the specific AWS G5 instance features (instance profiles, VPC integration)
For pure image generation workloads optimizing cost, the RTX 4090 is the winner by a significant margin. The A10’s cloud pricing premium makes it difficult to justify for this specific use case.