
Best Cloud GPU for Stable Diffusion: RTX 4090 vs A10 vs L4

Tags: stable-diffusion, sdxl, rtx-4090, a10, l4, image-generation, inference, cost

Stable Diffusion XL (SDXL) is the most demanding widely-deployed image generation model, and GPU choice has a large impact on both generation speed and cost per image. This guide compares three GPUs commonly available on cloud platforms — RTX 4090, A10, and L4 — across the metrics that matter for production workloads: images per hour and cost per 100 images.

VRAM Requirements

Getting VRAM sizing right is a prerequisite to everything else. Run out, and you'll either hit an OOM crash or be forced into slower CPU offloading.

Stable Diffusion 1.x and 2.x

SD 1.5 and 2.1 are lightweight by current standards:

  • Minimum: 4 GB VRAM (at 512×512, half precision)
  • Comfortable: 6–8 GB (1024×1024, half precision, no offloading)

Almost any GPU can handle SD 1.5 and 2.1 in FP16. Even an RTX 3060 (12 GB) has headroom to spare.

Stable Diffusion XL (SDXL)

SDXL uses a larger UNet and two text encoders (CLIP-L + OpenCLIP-G), which adds up:

  • SDXL base model (FP16): ~6.5 GB for UNet + ~1.4 GB for text encoders = ~7.9 GB total
  • SDXL base + refiner (FP16): Both loaded simultaneously: ~13–14 GB
  • Practical minimum (base only): 8 GB, but tight — 10+ GB for stability
  • Comfortable for base + refiner: 16–24 GB

With VAE decode and activation buffers, 12 GB is a practical floor for SDXL with the refiner. 24 GB gives you headroom for higher resolutions, ControlNet models, and larger batch sizes.

SDXL with ControlNet / IP-Adapter

Adding ControlNet or IP-Adapter adds another 1.5–3 GB depending on the adapter:

  • SDXL + ControlNet (FP16): 10–12 GB base only, 16–18 GB with refiner
  • SDXL + IP-Adapter XL: ~11 GB with base model only

For full SDXL + refiner + ControlNet pipelines, 24 GB is the practical minimum. This is where GPU choice becomes consequential.
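As a rough planning aid, the component sizes above can be folded into a small fit-check helper. This is a sketch using the estimates from this section, not measured values; `required_vram_gb` and `fits_on` are hypothetical names:

```python
# Rough SDXL VRAM fit check using the FP16 estimates from this section.
# All GB figures are approximations from the text, not measurements.
COMPONENT_GB = {
    "base": 7.9,        # UNet (~6.5 GB) + text encoders (~1.4 GB)
    "refiner": 5.5,     # loaded alongside the base model
    "controlnet": 2.5,  # midpoint of the 1.5-3 GB range quoted above
}
OVERHEAD_GB = 3.0       # VAE decode + activation buffers (also an estimate)

def required_vram_gb(components):
    """Estimated VRAM needed to hold the given components plus working buffers."""
    return sum(COMPONENT_GB[c] for c in components) + OVERHEAD_GB

def fits_on(vram_gb, components):
    """True if the pipeline is expected to fit in the given VRAM budget."""
    return required_vram_gb(components) <= vram_gb

print(round(required_vram_gb(["base"]), 1))            # ~10.9 GB, base only
print(fits_on(24, ["base", "refiner", "controlnet"]))  # True on a 24 GB card
print(fits_on(8, ["base"]))                            # False: 8 GB is too tight
```

The base-only estimate of ~11 GB lines up with the "10+ GB for stability" guidance above, and the full pipeline lands under 24 GB with a few gigabytes of headroom.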

GPU Specs Relevant to Image Generation

Image generation throughput is primarily determined by two factors: compute (TFLOPS for the UNet forward pass) and memory bandwidth (for streaming model weights and attention activations on every denoising step).

| Spec | RTX 4090 | A10 | L4 |
|---|---|---|---|
| VRAM | 24 GB GDDR6X | 24 GB GDDR6 | 24 GB GDDR6 |
| Memory bandwidth | 1,008 GB/s | 600 GB/s | 300 GB/s |
| FP32 TFLOPS (shader) | 82.6 | 31.2 | 30.3 |
| FP8 TFLOPS | 165.2 | — | 60.6 |
| TDP | 450 W | 150 W | 72 W |
| Architecture | Ada Lovelace | Ampere | Ada Lovelace |

The RTX 4090 has dramatically higher memory bandwidth and compute than both the A10 and L4. The A10 and L4 are similar in raw TFLOPS, but the L4 is far more power-efficient and, being Ada Lovelace, supports FP8, which the Ampere-based A10 does not. The A10 has 2× the memory bandwidth of the L4, which matters for attention-heavy pipelines at high resolution.

Benchmark: SDXL Images per Hour

The following throughput figures are based on SDXL base-only generation (1024×1024), 20 DPM++ steps, batch size 1, using diffusers with xFormers attention and FP16 precision. They represent single-image sequential generation; batching shifts the bottleneck toward raw compute and changes the relative numbers, as the bs=4 rows show.

| Configuration | RTX 4090 | A10 | L4 |
|---|---|---|---|
| SDXL base, bs=1, 20 steps | ~110 img/hr | ~38 img/hr | ~35 img/hr |
| SDXL base + refiner, bs=1 | ~55 img/hr | ~19 img/hr | ~18 img/hr |
| SDXL base, bs=4, 20 steps | ~320 img/hr | ~115 img/hr | ~95 img/hr |
| SDXL + ControlNet, bs=1 | ~95 img/hr | ~33 img/hr | ~30 img/hr |
| SD 1.5, bs=1, 20 steps | ~500 img/hr | ~175 img/hr | ~160 img/hr |
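For latency planning, images per hour converts directly to per-image latency. A quick helper (throughput values taken from the bs=1 row above):

```python
def seconds_per_image(img_per_hr):
    """Convert throughput (images/hour) to per-image latency in seconds."""
    return 3600.0 / img_per_hr

# SDXL base, bs=1, 20 steps — figures from the benchmark table
for gpu, img_hr in [("RTX 4090", 110), ("A10", 38), ("L4", 35)]:
    print(f"{gpu}: ~{seconds_per_image(img_hr):.0f} s/image")
# RTX 4090: ~33 s/image, A10: ~95 s/image, L4: ~103 s/image
```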

The RTX 4090’s throughput advantage over the A10 is approximately 2.9–3.0× at batch size 1. This tracks the compute ratio (82.6 / 31.2 ≈ 2.65×) far more closely than the memory bandwidth ratio (1,008 / 600 = 1.68×), indicating that SDXL generation at bs=1 is primarily compute-bound on these GPUs, with the 4090’s bandwidth and clock advantages covering the remaining gap.

The A10 and L4 are nearly identical in image generation throughput despite the L4’s lower memory bandwidth — this suggests that at bs=1, both are limited by the compute bottleneck rather than memory bandwidth at standard resolutions. At batch size 4, the A10 pulls ahead of the L4 by ~20% due to its bandwidth advantage becoming more relevant at higher memory load.
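The ratios in this analysis follow directly from the spec and benchmark tables; a quick arithmetic check (values copied from the tables above):

```python
# Spec and benchmark figures from the tables above
bw = {"rtx4090": 1008, "a10": 600, "l4": 300}         # memory bandwidth, GB/s
tflops = {"rtx4090": 82.6, "a10": 31.2, "l4": 30.3}   # per-GPU TFLOPS
img_hr = {"rtx4090": 110, "a10": 38}                  # SDXL base, bs=1

print(round(bw["rtx4090"] / bw["a10"], 2))          # bandwidth ratio: 1.68
print(round(tflops["rtx4090"] / tflops["a10"], 2))  # compute ratio: 2.65
print(round(img_hr["rtx4090"] / img_hr["a10"], 2))  # observed speedup: 2.89
```

The observed 2.89× speedup sits much closer to the compute ratio than to the bandwidth ratio, which is the basis for calling the workload compute-bound at bs=1.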

Cloud Pricing and Cost per Image

Current cloud pricing for these GPUs (on-demand, single GPU):

| GPU | Provider examples | Typical price |
|---|---|---|
| RTX 4090 | RunPod, Vast.ai | $0.59–0.79/hr |
| A10 | CoreWeave | $1.00–1.10/hr |
| A10G | AWS (g5.xlarge, on-demand) | ~$1.006/hr |
| L4 | GCP (g2-standard-4), CoreWeave | $0.60–0.80/hr |

Note: the A10 and A10G are different GPUs. The A10G (used in AWS G5 instances) has the same 600 GB/s memory bandwidth as the A10 but substantially lower Tensor Core throughput, making it noticeably slower for image generation. Don’t confuse the two.

Cost per 100 Images

Using throughput from the SDXL base, batch size 1 benchmarks:

| GPU | Images/hr | $/hr | Hours per 100 img | Cost per 100 images |
|---|---|---|---|---|
| RTX 4090 | 110 | $0.69 | 0.91 | $0.63 |
| A10 | 38 | $1.05 | 2.63 | $2.76 |
| L4 | 35 | $0.70 | 2.86 | $2.00 |

With batch size 4 (batch-optimized pipeline):

| GPU | Images/hr (bs=4) | $/hr | Cost per 100 images |
|---|---|---|---|
| RTX 4090 | 320 | $0.69 | $0.22 |
| A10 | 115 | $1.05 | $0.91 |
| L4 | 95 | $0.70 | $0.74 |
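Every cost column in both tables reduces to one formula: cost = (100 / images_per_hour) × price_per_hour. A minimal helper to reproduce the figures:

```python
def cost_per_100_images(img_per_hr, price_per_hr):
    """Dollar cost to generate 100 images at a given throughput and hourly rate."""
    return 100.0 / img_per_hr * price_per_hr

# bs=1 figures from the first table
print(round(cost_per_100_images(110, 0.69), 2))  # RTX 4090: 0.63
print(round(cost_per_100_images(38, 1.05), 2))   # A10: 2.76
# bs=4 figures from the second table
print(round(cost_per_100_images(320, 0.69), 2))  # RTX 4090: 0.22
```

This also makes it easy to re-run the comparison with your own provider's spot or reserved pricing.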

The RTX 4090 wins decisively on cost per image in every configuration tested. Its consumer-tier pricing, combined with Ada Lovelace’s generational gains in compute and its much larger L2 cache, makes it the clear choice for image generation workloads.

The L4 is the second-best option for cost per image, especially if you need ECC memory (important for long-running production deployments where bit-flip errors in VRAM can produce subtly corrupted outputs). The A10 is the weakest option cost-wise and should generally be avoided for image generation unless it’s the only available option at your cloud provider.

Optimization Techniques That Change the Math

The benchmarks above use a baseline diffusers setup. Several optimizations can significantly increase throughput:

torch.compile

PyTorch 2.0+ torch.compile on the UNet forward pass can improve throughput by 20–40% on Ada Lovelace GPUs (RTX 4090, L4) but has inconsistent results on older Ampere (A10). Compilation takes 2–5 minutes on first run but is cached for subsequent calls.

unet = torch.compile(unet, mode="reduce-overhead")

For the RTX 4090, this alone can push SDXL base throughput from 110 to ~145 img/hr at bs=1.

Flash Attention / xFormers

xFormers memory-efficient attention is standard in most diffusers deployments and provides a 15–25% speedup over naive attention. Flash Attention is also available through PyTorch 2.0’s scaled_dot_product_attention (SDPA), which recent diffusers versions use by default and which dispatches to Flash Attention kernels where the hardware supports them, with similar or slightly better results. Both approaches work across all three architectures here.

SDXL Turbo / LCM Distillation

SDXL Turbo (4-step distilled model) generates comparable quality to SDXL base at 20 steps using only 4 steps, providing a 5× throughput multiplier:

| GPU | SDXL Turbo (4 steps, bs=1) | Cost per 100 images |
|---|---|---|
| RTX 4090 | ~500 img/hr | $0.14 |
| A10 | ~165 img/hr | $0.64 |
| L4 | ~155 img/hr | $0.45 |

If your use case tolerates SDXL Turbo quality (which is excellent for most commercial applications), the cost savings are substantial. The RTX 4090 with SDXL Turbo at $0.14/100 images is hard to beat.

Decision Guide

Use RTX 4090 when:

  • Cost per image is your primary metric
  • Batch sizes are small (1–4) and you need low latency per image
  • You’re running a full SDXL + refiner + ControlNet pipeline (24 GB headroom matters)
  • Consumer cloud pricing is acceptable (RunPod, Vast.ai)

Use L4 when:

  • You need ECC memory for production reliability (L4 is enterprise-grade)
  • Power budget is constrained (72W TDP vs 450W for 4090)
  • You’re deploying on GCP and want managed infrastructure
  • torch.compile performance gains are important (L4 is Ada Lovelace)

Use A10 when:

  • You’re already in the AWS ecosystem (G5 instances) and switching would be disruptive
  • Your workload is SD 1.5/2.x (the performance gap narrows significantly)
  • You need the specific AWS G5 instance features (instance profiles, VPC integration)

For pure image generation workloads optimizing cost, the RTX 4090 is the winner by a significant margin. The A10’s cloud pricing premium makes it difficult to justify for this specific use case.
