# Documentation

MindBalancer is a high-performance load balancer and reverse proxy for AI/LLM APIs. Think ProxySQL, but for AI.
## Installation

### Requirements

- Go 1.20 or later
- SQLite (included)
- Any Linux, macOS, or Windows system
### From Source

```bash
# Clone the repository
git clone https://github.com/mindbalancer/mindbalancer-labs.git
cd mindbalancer-labs

# Build
make build

# Binaries will be in ./bin/
ls -la bin/
# mindbalancer  mindsql
```
### Using Go Install

```bash
go install github.com/mindbalancer/mindbalancer-labs/cmd/mindbalancer@latest
go install github.com/mindbalancer/mindbalancer-labs/cmd/mindsql@latest
```
## Quick Start

Get MindBalancer running in under 5 minutes.
### 1. Create Configuration

```ini
# mindbalancer.cnf
[mindbalancer]
proxy_bind_address = 0.0.0.0
proxy_port = 6034
admin_bind_address = 127.0.0.1
admin_port = 6032
data_dir = /var/lib/mindbalancer

# Optional: 32-character key for API key encryption
api_key_encryption_key = your-32-character-encryption-key
```
### 2. Start MindBalancer

```bash
./bin/mindbalancer -config mindbalancer.cnf
```

You should see:

```
Starting proxy server on 0.0.0.0:6034
Starting admin MySQL server on 127.0.0.1:6032
Starting admin HTTP server on 127.0.0.1:6033
```
### 3. Add Your First Server

```bash
# Connect to the admin interface
./bin/mindsql
```

```sql
-- Add an OpenAI server
mindsql> INSERT INTO ai_servers (name, provider_type, endpoint, api_key_encrypted, weight, status)
         VALUES ('openai-primary', 'openai', 'https://api.openai.com', 'sk-your-key', 100, 'ONLINE');

-- Verify
mindsql> SELECT * FROM ai_servers;
```
### 4. Send Your First Request

```bash
curl http://localhost:6034/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
## Configuration

MindBalancer uses an INI-style configuration file.

| Setting | Default | Description |
|---|---|---|
| `proxy_port` | `6034` | Port for the OpenAI-compatible API |
| `admin_port` | `6032` | Port for the mindsql admin interface |
| `data_dir` | `/var/lib/mindbalancer` | Directory for the SQLite database |
| `health_check_interval_ms` | `5000` | Health check frequency (ms) |
| `circuit_breaker_threshold` | `5` | Failures before the circuit opens |
| `max_retries` | `3` | Max retry attempts |
| `cache_enabled` | `true` | Enable response caching |
| `cache_ttl_ms` | `300000` | Cache entry TTL (5 min) |
| `rate_limit_enabled` | `true` | Enable per-user rate limiting |
## Architecture

MindBalancer sits between your application and AI providers, handling:

- Load Balancing — Distribute requests across multiple providers
- Health Checks — Continuously monitor provider health
- Circuit Breaking — Prevent cascade failures
- Response Caching — Cache deterministic responses
- Request Routing — Route by model, pattern, or user

```
┌─────────────┐      ┌──────────────────┐      ┌─────────────┐
│ Application │─────▶│   MindBalancer   │─────▶│   OpenAI    │
│  (OpenAI    │      │                  │      └─────────────┘
│   SDK)      │      │  ┌────────────┐  │      ┌─────────────┐
└─────────────┘      │  │  Balancer  │  │─────▶│  Anthropic  │
                     │  │  Router    │  │      └─────────────┘
                     │  │  Cache     │  │      ┌─────────────┐
                     │  │  Metrics   │  │─────▶│   Ollama    │
                     │  └────────────┘  │      └─────────────┘
                     └──────────────────┘
```
## Load Balancing

### Strategies

MindBalancer supports multiple load balancing strategies:

- Weighted Round-Robin — Distribute based on server weights (sketched below)
- Least Connections — Route to the server with the fewest active requests
- Latency-based — Prefer servers with the lowest latency
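
The weighted strategy is easiest to picture in code. Here is a minimal sketch of "smooth" weighted round-robin (the variant nginx popularized); it illustrates the idea and is not MindBalancer's actual implementation:

```go
package main

import "fmt"

// Server is a simplified stand-in for a row in ai_servers.
type Server struct {
	Name          string
	Weight        int // static weight from the ai_servers table
	currentWeight int // internal counter for smooth WRR
}

// pick bumps every server's current weight by its static weight, selects
// the server with the highest current weight, then debits the total
// weight from the winner so selections spread out over time.
func pick(servers []*Server) *Server {
	total := 0
	var best *Server
	for _, s := range servers {
		s.currentWeight += s.Weight
		total += s.Weight
		if best == nil || s.currentWeight > best.currentWeight {
			best = s
		}
	}
	best.currentWeight -= total
	return best
}

func main() {
	pool := []*Server{{Name: "openai-primary", Weight: 100}, {Name: "groq-1", Weight: 50}}
	for i := 0; i < 6; i++ {
		fmt.Println(pick(pool).Name) // openai-primary appears about twice as often as groq-1
	}
}
```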
### Hostgroups

Organize servers into hostgroups for different workloads:

```sql
-- Fast models for chat
INSERT INTO ai_servers (name, hostgroup, ...) VALUES ('groq-1', 1, ...);

-- Powerful models for complex tasks
INSERT INTO ai_servers (name, hostgroup, ...) VALUES ('openai-1', 2, ...);

-- Route based on model
INSERT INTO routing_rules (match_model, destination_hostgroup)
VALUES ('llama*', 1), ('gpt-4*', 2);
```
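
The `match_model` values above look like glob patterns. A sketch of how a request's model name could be resolved to a hostgroup, assuming simple glob semantics (the real matcher may differ):

```go
package main

import (
	"fmt"
	"path"
)

// Rule mirrors a row in routing_rules.
type Rule struct {
	MatchModel           string // glob pattern, e.g. "gpt-4*"
	DestinationHostgroup int
}

// route returns the hostgroup of the first rule whose pattern matches
// the requested model, falling back to a default hostgroup.
func route(rules []Rule, model string, defaultHostgroup int) int {
	for _, r := range rules {
		if ok, _ := path.Match(r.MatchModel, model); ok {
			return r.DestinationHostgroup
		}
	}
	return defaultHostgroup
}

func main() {
	rules := []Rule{{"llama*", 1}, {"gpt-4*", 2}}
	fmt.Println(route(rules, "llama-3.1-70b", 0)) // 1
	fmt.Println(route(rules, "gpt-4o", 0))        // 2
}
```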
## Failover

### Health Checks

MindBalancer continuously monitors server health with configurable checks:

```sql
mindsql> SHOW HEALTH STATUS;
+---------------+---------+---------+---------------------+
| server        | healthy | latency | last_check          |
+---------------+---------+---------+---------------------+
| openai-main   | Yes     | 245ms   | 2025-01-25 10:30:05 |
| anthropic-1   | Yes     | 312ms   | 2025-01-25 10:30:05 |
| ollama-local  | No      | -       | 2025-01-25 10:30:04 |
+---------------+---------+---------+---------------------+
```
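
Conceptually, each server is probed on a fixed interval and its latency recorded. A minimal sketch, assuming a plain HTTP probe against the provider endpoint (the actual checks are provider-aware):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkOnce probes an endpoint and reports health plus latency. Treating
// any non-5xx response as healthy is an illustrative simplification.
func checkOnce(endpoint string) (healthy bool, latency time.Duration) {
	client := &http.Client{Timeout: 3 * time.Second}
	start := time.Now()
	resp, err := client.Get(endpoint)
	if err != nil {
		return false, 0
	}
	defer resp.Body.Close()
	return resp.StatusCode < 500, time.Since(start)
}

func main() {
	// Mirrors health_check_interval_ms = 5000 from the configuration.
	ticker := time.NewTicker(5000 * time.Millisecond)
	defer ticker.Stop()
	for range ticker.C {
		healthy, latency := checkOnce("http://localhost:11434")
		fmt.Printf("healthy=%v latency=%s\n", healthy, latency)
	}
}
```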
### Circuit Breaker

After consecutive failures, the circuit breaker opens to prevent cascading failures:

- Closed — Normal operation, requests flow through
- Open — Too many failures, requests fail fast
- Half-Open — Testing whether the service has recovered
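
A minimal sketch of the three-state machine, assuming the `circuit_breaker_threshold` semantics from the configuration table (the cooldown duration and probe policy here are illustrative assumptions):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type state int

const (
	closed   state = iota // normal operation
	open                  // failing fast
	halfOpen              // probing for recovery
)

// Breaker is a minimal three-state circuit breaker.
type Breaker struct {
	state     state
	failures  int
	threshold int           // e.g. circuit_breaker_threshold = 5
	openedAt  time.Time
	cooldown  time.Duration // how long to stay open before probing
}

var errOpen = errors.New("circuit open: failing fast")

// Call wraps an upstream request with circuit-breaker bookkeeping.
func (b *Breaker) Call(fn func() error) error {
	if b.state == open {
		if time.Since(b.openedAt) < b.cooldown {
			return errOpen
		}
		b.state = halfOpen // cooldown elapsed, allow one probe through
	}
	if err := fn(); err != nil {
		b.failures++
		if b.state == halfOpen || b.failures >= b.threshold {
			b.state = open
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	b.state = closed
	return nil
}

func main() {
	b := &Breaker{threshold: 5, cooldown: 10 * time.Second}
	fmt.Println(b.Call(func() error { return errors.New("upstream 503") }))
}
```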
### Retry with Backoff

Failed requests automatically retry with exponential backoff:

```ini
[mindbalancer]
max_retries = 3
retry_initial_delay_ms = 100
retry_max_delay_ms = 5000
retry_multiplier = 2.0
```
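
With these settings, delays grow as 100 ms, 200 ms, 400 ms and would be capped at 5 s. A sketch of the delay computation implied by the settings above:

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay computes the delay before retry attempt n (0-based):
// initial * multiplier^n, capped at the maximum delay.
func backoffDelay(attempt int, initial, max time.Duration, multiplier float64) time.Duration {
	d := float64(initial)
	for i := 0; i < attempt; i++ {
		d *= multiplier
	}
	if d > float64(max) {
		return max
	}
	return time.Duration(d)
}

func main() {
	for attempt := 0; attempt < 3; attempt++ { // max_retries = 3
		fmt.Println(backoffDelay(attempt, 100*time.Millisecond, 5*time.Second, 2.0))
	}
	// Prints: 100ms, 200ms, 400ms
}
```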
## Caching

MindBalancer caches deterministic responses to reduce costs and latency.

### How It Works

- Only caches when `temperature=0` (deterministic output)
- Cache key = hash(model + messages + temperature + max_tokens)
- Response includes an `X-Cache: HIT` or `X-Cache: MISS` header
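
A sketch of such a cache key, assuming JSON serialization and SHA-256 (the exact serialization and hash MindBalancer uses are internal details):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Message matches the OpenAI-style chat message shape.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// cacheKey hashes the fields listed above: model, messages, temperature,
// and max_tokens. Identical deterministic requests yield the same key.
func cacheKey(model string, messages []Message, temperature float64, maxTokens int) string {
	payload, _ := json.Marshal(struct {
		Model       string    `json:"model"`
		Messages    []Message `json:"messages"`
		Temperature float64   `json:"temperature"`
		MaxTokens   int       `json:"max_tokens"`
	}{model, messages, temperature, maxTokens})
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(cacheKey("gpt-4", []Message{{"user", "Hello!"}}, 0, 1000))
}
```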
### Managing Cache

```sql
-- Check cache status
mindsql> SHOW CACHE STATUS;
+------------------+------------------+
| Variable         | Value            |
+------------------+------------------+
| status           | enabled          |
| hits             | 1247             |
| misses           | 523              |
| hit_rate         | 0.70             |
| evictions        | 12               |
| size_bytes       | 2458624          |
| item_count       | 847              |
+------------------+------------------+

-- Enable/disable caching
mindsql> CACHE ENABLE;
mindsql> CACHE DISABLE;

-- Clear cache
mindsql> CACHE CLEAR;
```
### HTTP API

```bash
# Get cache status
curl http://localhost:6033/api/cache

# Enable cache
curl -X PUT http://localhost:6033/api/cache -d '{"enabled": true}'

# Clear cache
curl -X POST http://localhost:6033/api/cache/clear
```
## API Reference

MindBalancer exposes an OpenAI-compatible API.

### Chat Completions

`POST /v1/chat/completions`

```json
{
  "model": "gpt-4",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 1000,
  "stream": false
}
```
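
Because the endpoint is OpenAI-compatible, any HTTP client (or an OpenAI SDK with its base URL pointed at MindBalancer) works unchanged. A standard-library sketch:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body := []byte(`{
		"model": "gpt-4",
		"messages": [{"role": "user", "content": "Hello!"}]
	}`)

	// Point the request at MindBalancer instead of the provider.
	resp, err := http.Post(
		"http://localhost:6034/v1/chat/completions",
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// MindBalancer-specific response headers (see the table below).
	fmt.Println("X-Request-ID:", resp.Header.Get("X-Request-ID"))
	fmt.Println("X-Cache:", resp.Header.Get("X-Cache"))

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```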
### Models List

`GET /v1/models`
### Response Headers

| Header | Description |
|---|---|
| `X-Request-ID` | Unique request identifier |
| `X-Cache` | `HIT` or `MISS` (caching status) |
| `X-Retry-Count` | Number of retries (if any) |
| `X-RateLimit-Remaining` | Remaining requests in window |
| `X-RateLimit-Reset` | Unix timestamp when limit resets |
## mindsql CLI

`mindsql` is a MySQL-compatible CLI for managing MindBalancer.

### Connection

```bash
# Default connection
./bin/mindsql

# Custom host/port
./bin/mindsql -h 192.168.1.100 -P 6032

# Execute a single command
./bin/mindsql -e "SELECT * FROM ai_servers"
```
### Commands

```sql
-- Server Management
SELECT * FROM ai_servers;
INSERT INTO ai_servers (name, provider_type, endpoint, api_key_encrypted, weight, status)
VALUES ('name', 'openai', 'https://...', 'sk-...', 100, 'ONLINE');
DELETE FROM ai_servers WHERE name = 'server-name';

-- Routing Rules
SELECT * FROM routing_rules;

-- Users & Rate Limits
SELECT * FROM ai_users;

-- Monitoring
SHOW HEALTH STATUS;
SHOW API KEYS;
SHOW STATS;
SHOW VARIABLES;

-- Cache Management
SHOW CACHE STATUS;
CACHE ENABLE;
CACHE DISABLE;
CACHE CLEAR;

-- Configuration
SET max_retries = 5;
```
## Providers

MindBalancer supports multiple AI providers through a unified interface.

| Provider | Type | Endpoint |
|---|---|---|
| OpenAI | `openai` | `https://api.openai.com` |
| Anthropic | `anthropic` | `https://api.anthropic.com` |
| Azure OpenAI | `azure` | `https://YOUR.openai.azure.com` |
| Ollama | `ollama` | `http://localhost:11434` |
| Groq | `groq` | `https://api.groq.com` |
| Google AI | `google` | `https://generativelanguage.googleapis.com` |
## Monitoring

### Prometheus Metrics

MindBalancer exposes Prometheus-compatible metrics at `:9090/metrics`.

```
# Request metrics
mindbalancer_requests_total{server, model, status}
mindbalancer_request_duration_seconds{server, model}
mindbalancer_tokens_total{server, model, type}

# Cost tracking
mindbalancer_cost_usd_total{server, model, provider_type}

# Cache metrics
mindbalancer_cache_hits_total{model}
mindbalancer_cache_misses_total{model}

# Health metrics
mindbalancer_server_health{server}
```
### Web Dashboard

Access the built-in dashboard at `http://localhost:6033/` for real-time monitoring.
## Security

### API Key Encryption

API keys are encrypted at rest using AES-256-GCM. Set a 32-character encryption key:

```ini
[mindbalancer]
api_key_encryption_key = your-32-character-encryption-key
```
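
For reference, this is the general shape of AES-256-GCM encryption with a 32-byte key (a sketch; how MindBalancer stores the nonce and ciphertext is an internal detail):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"io"
)

// encryptAPIKey encrypts a value with AES-256-GCM: a 32-byte key, a
// random nonce per value, and nonce||ciphertext stored together.
func encryptAPIKey(key []byte, plaintext string) (string, error) {
	block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
	if err != nil {
		return "", err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return "", err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return "", err
	}
	sealed := gcm.Seal(nonce, nonce, []byte(plaintext), nil)
	return hex.EncodeToString(sealed), nil
}

func main() {
	key := []byte("your-32-character-encryption-key") // exactly 32 bytes
	enc, err := encryptAPIKey(key, "sk-your-key")
	if err != nil {
		panic(err)
	}
	fmt.Println(enc)
}
```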
### Rate Limiting

Configure per-user rate limits:

```ini
[mindbalancer]
rate_limit_enabled = true
default_requests_per_minute = 60
default_tokens_per_minute = 100000
```
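
A per-user limit can be pictured as a counter per user per window. A minimal fixed-window sketch, assuming `default_requests_per_minute` semantics (a production limiter would more likely use a sliding window or token bucket):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Limiter counts requests per user within the current one-minute window.
type Limiter struct {
	mu     sync.Mutex
	limit  int // e.g. default_requests_per_minute = 60
	counts map[string]int
	window time.Time
}

func NewLimiter(perMinute int) *Limiter {
	return &Limiter{limit: perMinute, counts: map[string]int{}, window: time.Now().Truncate(time.Minute)}
}

// Allow reports whether the user may make another request, resetting
// all counters when the minute window rolls over.
func (l *Limiter) Allow(user string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now().Truncate(time.Minute)
	if now.After(l.window) {
		l.window = now
		l.counts = map[string]int{}
	}
	if l.counts[user] >= l.limit {
		return false
	}
	l.counts[user]++
	return true
}

func main() {
	l := NewLimiter(60)
	fmt.Println(l.Allow("alice")) // true until alice hits 60 requests this minute
}
```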
### Best Practices

- Run the admin interface on localhost only (`admin_bind_address = 127.0.0.1`)
- Use TLS in production (configure `tls_cert_file` and `tls_key_file`)
- Rotate the encryption key periodically
- Monitor rate limit headers for abuse detection