| Platform | H100 Rate | Cold Start | Best For |
|---|---|---|---|
| RunPod Serverless | $0.00069/s | 2–8s | Custom models, PyTorch |
| Modal | $0.00090/s | 1–4s | Python-native workflows |
| Replicate | $0.00115/s | 5–30s | Pre-built models |
| Fal.ai | $0.00080/s | 1–3s | Image gen, fast APIs |
Serverless GPU platforms charge by the second — you pay only when your code is running. No idle costs, no reserved capacity. For bursty workloads (image generation APIs, occasional inference) this can be 10–100x cheaper than an on-demand instance. The tradeoff: cold starts add latency.
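To make the "10–100x cheaper" claim concrete, here is a back-of-envelope comparison for a bursty workload, using the RunPod rate from the table above; the always-on hourly rate and the request volume are illustrative assumptions, not quotes from any provider:

```python
# Back-of-envelope: per-second serverless billing vs. an always-on instance.
SERVERLESS_RATE = 0.00069   # $/s for an H100 (RunPod, from the table above)
ON_DEMAND_HOURLY = 2.79     # $/hr, hypothetical always-on H100 rate

requests_per_day = 2_000
seconds_per_request = 3     # GPU-busy time per request

serverless_daily = requests_per_day * seconds_per_request * SERVERLESS_RATE
on_demand_daily = ON_DEMAND_HOURLY * 24  # billed around the clock, busy or not

print(f"Serverless: ${serverless_daily:.2f}/day")  # Serverless: $4.14/day
print(f"On-demand:  ${on_demand_daily:.2f}/day")   # On-demand:  $66.96/day
print(f"Ratio: {on_demand_daily / serverless_daily:.0f}x")  # Ratio: 16x
```

The gap widens as traffic gets burstier: halve the request volume and the serverless bill halves too, while the always-on instance costs the same.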
RunPod Serverless
RunPod Serverless lets you deploy any Docker container as a serverless endpoint. You define a handler function, and RunPod scales workers automatically.
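For context on what the handler receives: clients invoke a deployed endpoint through RunPod's HTTP API, and the `input` field of the request body is what arrives as `job["input"]` inside the handler. A minimal client sketch (the endpoint ID and API key are placeholders; `runsync` blocks until the job completes):

```python
import requests  # third-party HTTP client

ENDPOINT_ID = "your-endpoint-id"   # placeholder
API_KEY = "your-runpod-api-key"    # placeholder

def call_endpoint(prompt: str) -> dict:
    """Synchronously invoke the RunPod endpoint.

    The JSON body's "input" field becomes job["input"] in the handler.
    """
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()  # includes job status and the handler's return value
```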
```python
# runpod_handler.py
import runpod

def handler(job):
    input_data = job["input"]
    prompt = input_data.get("prompt", "")
    # Your inference logic here
    result = run_model(prompt)
    return {"output": result}

runpod.serverless.start({"handler": handler})

# Deploy with:
#   runpod deploy --image your-docker-image:latest
# Scales to 0 workers when idle
# Pricing: ~$0.00069/s for H100
```

Modal
Modal has the most Pythonic API — you decorate functions with @app.function(gpu="H100") and it handles everything:
```python
import modal

app = modal.App("my-inference-app")

image = modal.Image.debian_slim().pip_install(
    "transformers", "torch", "accelerate"
)

@app.function(gpu="H100", image=image, timeout=300)
def run_inference(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")
    return pipe(prompt, max_new_tokens=200)[0]["generated_text"]

@app.local_entrypoint()
def main():
    result = run_inference.remote("Hello, how are you?")
    print(result)

# Deploy: modal deploy inference.py
# Cost: ~$0.0009/s for H100, free tier available
```

Replicate
Replicate hosts pre-built models — you don't need to write deployment code. Great for image generation and popular open-source models:
```python
import replicate

# Run SDXL image generation
output = replicate.run(
    "stability-ai/sdxl:39ed52f2319f9b68ef0a5ef6e27d5e7a7ab10bfb",
    input={
        "prompt": "A futuristic city at night",
        "width": 1024,
        "height": 1024,
    },
)
print(output[0])  # URL to generated image

# Run Llama 3.1
output = replicate.run(
    "meta/meta-llama-3.1-8b-instruct",
    input={"prompt": "Explain quantum computing"},
)
print("".join(output))  # the model streams tokens; join them into one string
```

Fal.ai
Fal.ai specializes in image generation with the fastest cold starts in the comparison. It also supports custom model deployment:
```python
import fal_client

# Run FLUX image generation (sub-second cold start)
result = fal_client.run(
    "fal-ai/flux/dev",
    arguments={
        "prompt": "A futuristic data center",
        "image_size": "landscape_4_3",
        "num_images": 1,
    },
)
print(result["images"][0]["url"])
```

Custom functions are deployed with the `fal` SDK (a separate package from the `fal_client` used above):

```python
import fal

@fal.function(machine_type="GPU-A100")
def my_model(prompt: str) -> str:
    result = ...  # your inference logic here
    return result
```

Which to Pick?
| If you need… | Use |
|---|---|
| Lowest cost, full control over container | RunPod Serverless |
| Python-native code, clean API, free tier | Modal |
| Pre-built models, no deployment code | Replicate |
| Fastest cold start for image gen | Fal.ai |
| Scale to thousands of concurrent requests | Modal or RunPod |
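To turn the rate column into a budget, a quick sketch that estimates monthly spend on each platform from the per-second H100 rates in the comparison table; the request volume and per-request GPU time are illustrative assumptions, and cold-start time (billed on some platforms) is ignored:

```python
# Per-second H100 rates from the comparison table above.
RATES = {
    "RunPod Serverless": 0.00069,
    "Modal": 0.0009,
    "Replicate": 0.00115,
    "Fal.ai": 0.0008,
}

requests_per_month = 50_000
seconds_per_request = 4   # assumed GPU-busy time per request
total_seconds = requests_per_month * seconds_per_request

for platform, rate in RATES.items():
    # e.g. RunPod Serverless ~$138/month, Replicate ~$230/month
    print(f"{platform:18s} ~${total_seconds * rate:,.0f}/month")
```

At this volume the cheapest and most expensive options differ by well under 2x, so cold-start latency and deployment workflow may matter more than the rate itself.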