Overview

profClaw supports local LLM inference through Ollama and LM Studio. Run AI agents entirely on your own hardware with no API keys or cloud dependencies.

Ollama Setup

1. Install Ollama

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

2. Pull a Model

# General purpose
ollama pull llama3.2

# Coding focused
ollama pull codellama:13b

# Small and fast
ollama pull phi3:mini

3. Configure profClaw

export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_MODEL=llama3.2
Or in settings.yml:
providers:
  default: ollama
  ollama:
    baseUrl: http://localhost:11434
    model: llama3.2

4. Start and Test

ollama serve &
profclaw serve
profclaw chat
> Hello, are you running locally?
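Under the hood, profClaw talks to Ollama's HTTP API. You can sanity-check the endpoint yourself; the sketch below builds a request for Ollama's /api/generate endpoint (the exact fields profClaw sends are an assumption, but model, prompt, and stream are the documented basics):

```python
import json
import urllib.request

# JSON body for Ollama's /api/generate endpoint.
# "model" and "prompt" are required; stream=False returns one JSON reply
# instead of a stream of partial chunks.
payload = {
    "model": "llama3.2",
    "prompt": "Hello, are you running locally?",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once `ollama serve` is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["response"])
```

If the server is up, the same check works from the command line with curl against http://localhost:11434/api/tags, which lists the models you have pulled.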

LM Studio Setup

1. Install LM Studio

Download from lmstudio.ai. Available for macOS, Windows, and Linux.

2. Download a Model

Open LM Studio, browse the model catalog, and download a model (e.g., Llama 3.2, Mistral, Phi-3).

3. Start the Server

In LM Studio, go to the Local Server tab and click Start Server. Default port is 1234.

4. Configure profClaw

export LMSTUDIO_BASE_URL=http://localhost:1234
export LMSTUDIO_MODEL=your-model-name
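Unlike Ollama's native API, LM Studio's local server speaks the OpenAI-compatible chat-completions format, so requests look like standard /v1/chat/completions calls. A minimal sketch (the field values are examples, not profClaw's exact internals; the model name must match whatever you loaded in LM Studio):

```python
import json
import urllib.request

# OpenAI-style chat payload for LM Studio's /v1/chat/completions endpoint.
payload = {
    "model": "your-model-name",  # must match the model loaded in LM Studio
    "messages": [
        {"role": "user", "content": "Hello, are you running locally?"},
    ],
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the LM Studio server is started:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```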

Choosing a Model

Model               Size    Best For              VRAM Needed
Llama 3.2 3B        2GB     Quick tasks, chat     4GB
Llama 3.1 8B        4.7GB   General purpose       8GB
CodeLlama 13B       7.4GB   Code generation       16GB
Mistral 7B          4.1GB   Balanced performance  8GB
Phi-3 Mini          2.2GB   Edge devices          4GB
DeepSeek Coder V2   8.9GB   Code tasks            16GB
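The VRAM column follows a rough rule of thumb: the quantized weight file plus overhead for the KV cache and runtime, rounded up to a common card size. A back-of-the-envelope sketch (the bits-per-weight and overhead figures are assumptions, not exact requirements):

```python
# Rough VRAM estimate for a quantized model: weights take
# (params * bits_per_weight / 8) bytes, plus runtime overhead for
# the KV cache, activations, and the serving process.

def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Back-of-the-envelope VRAM estimate in GB (not a guarantee)."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# A 7B model at ~4.5 bits/weight needs roughly 4 GB for weights alone,
# so an 8 GB card is a comfortable fit.
print(round(estimate_vram_gb(7), 1))  # prints 5.4
```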

Hybrid Setup

Use local models for simple tasks and cloud providers for complex ones:
providers:
  default: ollama
  ollama:
    baseUrl: http://localhost:11434
    model: llama3.2
  anthropic:
    apiKey: ${ANTHROPIC_API_KEY}
    model: claude-sonnet-4-6
Switch providers per conversation:
profclaw chat --provider anthropic
profclaw chat --provider ollama
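The same routing decision can be made programmatically. The heuristic below is purely illustrative (route_provider and its thresholds are hypothetical, not part of profClaw): keep short, simple prompts on the local model and send long or code-heavy ones to the cloud provider.

```python
# Hypothetical routing rule: local model for short/simple prompts,
# cloud provider for long or code-heavy ones.

def route_provider(prompt: str, max_local_chars: int = 2000) -> str:
    """Pick a provider name for a prompt (illustrative heuristic only)."""
    code_markers = ("```", "def ", "class ", "#include", "traceback")
    looks_like_code = any(m in prompt.lower() for m in code_markers)
    if len(prompt) > max_local_chars or looks_like_code:
        return "anthropic"  # complex: use the cloud provider
    return "ollama"         # simple: stay local

print(route_provider("What's the weather like?"))  # ollama
print(route_provider("def fib(n): ..."))           # anthropic
```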

Docker with Ollama

Run both profClaw and Ollama in Docker. Note that the Ollama container starts with an empty model store, so pull a model into the named volume after the first docker compose up (for example, docker compose exec ollama ollama pull llama3.2):
services:
  profclaw:
    image: profclaw/profclaw:latest
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_MODEL=llama3.2
    depends_on:
      - ollama

  ollama:
    image: ollama/ollama:latest
    volumes:
      - ollama-models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]  # GPU passthrough

volumes:
  ollama-models:

Performance Tips

Ollama uses your GPU automatically when one is available. Check with ollama ps while a model is loaded: the PROCESSOR column shows whether the model is running on the GPU or the CPU.
Local models typically have smaller context windows than cloud models, and long prompts take longer to process locally. Set POOL_TIMEOUT_MS higher if requests with large contexts time out.
Use quantized models (Q4_K_M, Q5_K_M) for better speed with minimal quality loss. Exact tag names vary by model in the Ollama library, for example:
ollama pull llama3.2:3b-instruct-q4_K_M