Cerebras uses custom wafer-scale processors to achieve inference speeds that far exceed those of GPU-based providers. Llama 3.1 70B runs at over 2,000 tokens/second, roughly 20x faster than typical cloud GPU inference.
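To put those figures in context, a quick back-of-envelope check (both rates are taken from the claim above; the ~100 tokens/second GPU baseline is the value implied by the "20x" comparison, not a measurement):

```python
# Back-of-envelope check on the quoted throughput figures.
# Both rates are assumptions from the surrounding text, not benchmarks.
cerebras_rate = 2000  # tokens/second, Llama 3.1 70B on Cerebras
gpu_rate = 100        # tokens/second, implied typical cloud GPU baseline

speedup = cerebras_rate / gpu_rate
full_output_seconds = 8192 / cerebras_rate  # time to emit the 8K max output

print(speedup)                         # → 20.0
print(round(full_output_seconds, 1))   # → 4.1
```

At these rates, even a maximum-length 8K completion streams out in about four seconds.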

Supported Models

| Model | ID | Context | Max Output | Tools | Notes |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 70B | llama3.1-70b | 128K | 8K | Yes | Fastest 70B available |
| Llama 3.1 8B | llama3.1-8b | 128K | 8K | Yes | Extreme speed |

Setup

1. Get API access

   Sign up at inference.cerebras.ai. Access is currently limited.

2. Set the environment variable

   export CEREBRAS_API_KEY=csk-...

3. Verify

   profclaw doctor --provider cerebras

Environment Variables

CEREBRAS_API_KEY (string, required)
Your Cerebras API key.

Configuration Example

CEREBRAS_API_KEY=csk-...

Model Aliases

| Alias | Model |
| --- | --- |
| cerebras | llama3.1-70b |

Usage Examples

# Ultra-fast streaming response
profclaw chat --model cerebras "Stream this long document analysis"
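Because the endpoint is OpenAI-compatible, streamed responses arrive as Server-Sent Events, one JSON chunk per `data:` line. A minimal sketch of consuming such a stream (the `collect_stream` helper and the sample chunks are illustrative, not part of profclaw or the Cerebras API):

```python
import json

def collect_stream(lines):
    """Concatenate content deltas from OpenAI-style SSE 'data:' lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank separator lines
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)

# Illustrative chunks in the OpenAI streaming shape (not captured output).
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))  # → Hello
```

The same parsing applies to any OpenAI-compatible streaming backend; only the base URL and API key change.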

Notes

  • API endpoint: https://api.cerebras.ai/v1 (OpenAI-compatible)
  • Status: Experimental. Availability is hardware-specific and capacity may be constrained.
  • Best use case: real-time streaming, bulk generation tasks, low-latency chat.
  • Cerebras does not support vision or image inputs.
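As a sketch of what a request to this OpenAI-compatible endpoint looks like, the snippet below builds (but does not send) the headers and JSON body. The `build_request` helper is hypothetical; only the URL, bearer-token header scheme, and body shape follow the OpenAI-compatible convention noted above:

```python
import json
import os

# OpenAI-compatible chat completions path on the Cerebras endpoint.
API_URL = "https://api.cerebras.ai/v1/chat/completions"

def build_request(prompt, model="llama3.1-70b", stream=True, api_key=None):
    """Return (headers, body) for a chat completion request. Nothing is sent."""
    key = api_key or os.environ.get("CEREBRAS_API_KEY", "csk-...")
    headers = {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    })
    return headers, body

headers, body = build_request("Summarize this document")
```

Any OpenAI-compatible client can be pointed at `https://api.cerebras.ai/v1` with `CEREBRAS_API_KEY` as the key.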