Training · QLoRA fine-tuning pipeline

A training agent that ships a private code LLM for the cost of a lunch.

Train Qwen 2.5 Coder 7B on your codebase in one weekend for about $6 in compute. Quantize to GGUF, deploy locally via Ollama, and get dramatically better autocomplete on your internal SDK — with zero external API calls at inference.

~$6
Compute cost per training run
45–80ms
Local inference latency per token
0
External API calls at inference
The Challenge

Public LLMs don't know your SDK. And every autocomplete request leaks code to a cloud you don't own.

An engineer triggers autocomplete on an internal function. The model suggests a signature from a generic open-source library — which doesn't match the internal SDK, compiles cleanly, and quietly introduces a bug that shows up two weeks later. The LLM bill for the week is $500+, and a snippet of your proprietary code just trained somebody else's future model.

Both failures are structural. Public models train on public repos — your internal patterns are unknown. Cloud APIs charge per token and round-trip at 500ms–2s, which makes the autocomplete feel worse than a local tool. Compliance blocks the whole thing in regulated industries, and the workaround is usually "engineers write without autocomplete," which is a productivity tax nobody wanted.

The root cause isn't the quality of the base models. It's that there's no owned pipeline between "your codebase" and "a model that understands it". Fine-tuning is orchestration across data collection, synthetic Q&A, GPU training, quantization, and local deployment — each with failure modes a script can't recover from.

This agent owns the whole pipeline: idempotent data collection, 3–5k synthetic Q&A pairs grounded in real code, QLoRA training on a rented A100, GGUF quantization, and a local Ollama endpoint — reproducible per quarter as the codebase grows.

How the agent handles it

Five isolated stages. Reproducible per quarter. Everything ends in an Ollama endpoint.

[Pipeline diagram] Source: codebase (~/Desktop/Accelevents/) plus KB snapshots, documentation, and architecture notes, staged into ~/ACCELEVENTS-TRAINING/ → Data Collection (extract files & docs) → Synthetic Q&A (3,000–5,000 pairs, JSONL) → QLoRA Training (RunPod A100, 2–4 hours) → Export (GGUF, Q5_K_M, ~5GB) → Deploy (local Ollama).
Output artifact: accelevents:latest, a fine-tuned Qwen 2.5 Coder 7B running locally via Ollama. Autocompletes the internal SDK, knows the API docs, higher accuracy vs base, zero external API calls, one-time cost ~$6.
1

Training artifacts live outside your source tree.

Every intermediate file — manifests, JSONL pairs, adapter weights, GGUF exports — lands in an isolated ~/accelevents-training/. Source code stays read-only. Git hooks block accidental commits of model weights to the main repo. Re-runs are cheap and rollback is easy.
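As an illustration of that guard rail, a pre-commit hook along these lines keeps weight files out of the repo. This is a minimal sketch with illustrative extensions and wording, not the exact hook the pipeline installs:

```python
#!/usr/bin/env python3
# Hypothetical pre-commit hook: refuse to commit model weights or other
# training artifacts into the source repo. Extensions are illustrative.
import subprocess
import sys

BLOCKED_SUFFIXES = (".gguf", ".safetensors", ".pt", ".jsonl")

def staged_files() -> list[str]:
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line]

def main() -> int:
    offenders = [f for f in staged_files() if f.endswith(BLOCKED_SUFFIXES)]
    if offenders:
        print("Blocked: training artifacts belong in ~/accelevents-training/, not the repo:")
        for f in offenders:
            print(f"  {f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```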

2

Synthetic Q&A is grounded in your real code, not generic prompts.

The pipeline extracts actual functions, docstrings, type hints, and API endpoints — then generates 3–5k pairs of (instruction, input, output) where each task is one your team actually performs. "Complete this function," "explain this endpoint" — not abstract LLM trivia.
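A minimal sketch of the grounding step, assuming a Python codebase and Alpaca-style (instruction, input, output) records. The real pipeline layers an LLM over these extracts to reach the 3–5k pairs; the paths, prompt wording, and output file name below are illustrative only:

```python
# Hypothetical sketch: walk a source tree, pull real functions and docstrings
# with ast, and emit instruction-tuning records as JSONL.
import ast
import json
from pathlib import Path

SRC = Path.home() / "Desktop" / "Accelevents"                  # read-only source tree
OUT = Path.home() / "accelevents-training" / "qa_pairs.jsonl"  # isolated artifacts dir

def function_records(py_file: Path):
    """Yield (name, docstring, source) for every non-trivial function in a file."""
    source = py_file.read_text(encoding="utf-8")
    try:
        tree = ast.parse(source, filename=str(py_file))
    except SyntaxError:
        return
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            src = ast.get_source_segment(source, node)
            doc = ast.get_docstring(node) or ""
            if src and len(src) > 80:                          # skip trivial stubs
                yield node.name, doc, src

def to_pairs(name: str, doc: str, src: str):
    """Turn one real function into (instruction, input, output) records."""
    signature = src.split("\n", 1)[0]
    yield {  # "Complete this function", grounded in the real implementation
        "instruction": "Complete this internal SDK function.",
        "input": signature,
        "output": src,
    }
    if doc:
        yield {  # "Explain this function", grounded in the real docstring
            "instruction": f"Explain what `{name}` does and when to call it.",
            "input": src,
            "output": doc,
        }

def main():
    OUT.parent.mkdir(parents=True, exist_ok=True)
    with OUT.open("w", encoding="utf-8") as fh:
        for py_file in SRC.rglob("*.py"):
            for record in function_records(py_file):
                for pair in to_pairs(*record):
                    fh.write(json.dumps(pair) + "\n")

if __name__ == "__main__":
    main()
```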

3

QLoRA on a rented A100 — not a full fine-tune.

Adapter weights only. 10× cheaper than full training at comparable quality for domain specialization. A 2–4 hour run on a $1.50/hr A100 lands at about $6 total. Quarterly re-trains stay predictable instead of "let's raise a budget for this."
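For a sense of what adapter-only training looks like, here is a hedged sketch using the common Hugging Face stack (4-bit NF4 base via bitsandbytes, LoRA adapters via peft). The instruct model ID and every hyperparameter are assumed defaults, not the tuned values from the reference run:

```python
# Hypothetical QLoRA setup: load the base model in 4-bit NF4 and attach LoRA
# adapters, so only the small adapter weights are trained.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "Qwen/Qwen2.5-Coder-7B-Instruct"   # assumed instruct variant of the base

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # QLoRA: frozen 4-bit base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=16,                                  # adapter rank: tiny next to a 7B base
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # roughly ~1% of parameters train

# Training then runs over the JSONL pairs with trl's SFTTrainer or a plain
# transformers Trainer; a 2-4 hour pass on a rented A100 covers 3-5k pairs.
```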

4

The output is a local Ollama endpoint. Zero per-token cost.

Merge adapter with base, quantize to GGUF Q5_K_M (~5GB, no measurable accuracy loss on code tasks), and register with Ollama. The IDE plugin now talks to localhost:11434. Your code never leaves your infrastructure, and the inference bill is rent on a laptop.
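A hedged sketch of that export-and-register flow, assuming llama.cpp's converter and quantizer are installed alongside peft and Ollama. Script names, paths, and the Modelfile contents are illustrative and vary by tool version:

```python
# Hypothetical export flow: merge the LoRA adapter into the base model,
# convert to GGUF with llama.cpp, quantize to Q5_K_M, and register with Ollama.
import subprocess
from pathlib import Path
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

WORK = Path.home() / "accelevents-training"
ADAPTER = WORK / "adapter"                     # adapter weights from the training run
MERGED = WORK / "merged"
GGUF_F16 = WORK / "accelevents-f16.gguf"
GGUF_Q5 = WORK / "accelevents-q5_k_m.gguf"

# 1. Merge adapter weights back into the full-precision base model.
model = AutoPeftModelForCausalLM.from_pretrained(str(ADAPTER), torch_dtype="auto")
model = model.merge_and_unload()
model.save_pretrained(str(MERGED))
AutoTokenizer.from_pretrained(str(ADAPTER)).save_pretrained(str(MERGED))

# 2. Convert to GGUF, then quantize (script and binary names depend on your
#    llama.cpp checkout; run from its directory or adjust paths).
subprocess.run(["python", "convert_hf_to_gguf.py", str(MERGED),
                "--outfile", str(GGUF_F16)], check=True)
subprocess.run(["llama-quantize", str(GGUF_F16), str(GGUF_Q5), "Q5_K_M"], check=True)

# 3. Register with Ollama so the IDE plugin can talk to localhost:11434.
(WORK / "Modelfile").write_text(f"FROM {GGUF_Q5}\n")
subprocess.run(["ollama", "create", "accelevents:latest",
                "-f", str(WORK / "Modelfile")], check=True)
```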

What you get

Three things change once the model is local.

~$6 per run

Total compute cost per training cycle

2–4 hours of A100 plus quantization. Amortized quarterly vs the cloud LLM bill this replaces.

45–80ms/token

Local inference latency

CPU-only Ollama, no GPU required. IDE autocomplete feels instant — 500ms–2s cloud round trips are gone.

0 external calls

Code leaving your infrastructure at inference

The model runs entirely locally. No per-token costs, no data leakage to cloud providers, no licensing exposure.
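As a sanity check that nothing leaves the machine, the same local endpoint the IDE plugin uses can be queried directly. The prompt below is illustrative; the model name matches the reference deployment:

```python
# Hypothetical smoke test against the local Ollama endpoint on localhost:11434.
# Everything stays on the machine; no external API is involved.
import json
import urllib.request

payload = json.dumps({
    "model": "accelevents:latest",
    "prompt": "Complete this internal SDK call: client.events.",
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```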

Numbers observed in Brilworks' internal reference deployment training Qwen 2.5 Coder 7B on the Accelevents codebase. Actual figures on your stack will depend on codebase size, GPU rental rates, and quantization tolerance.

Is this right for you?

Honest fit criteria. We'd rather say no than oversell.

Strong fit if

  • You have 150K+ lines of proprietary code with domain-specific patterns
  • Your team uses LLM-based code tools daily and is spending $500+ per month on cloud LLM APIs
  • You can't send code snippets to cloud APIs because of compliance or IP concerns
  • You have engineering capacity to integrate a local LLM into your IDE workflow

Not a fit if

  • Your codebase is under 50K lines — not enough signal for a meaningful fine-tune
  • You're okay with public models seeing your code patterns
  • You need real-time model updates, not quarterly retraining cadence
  • Your team has no ML or infra experience to debug training failures

Book a 30-minute scoping call.

We'll walk through your codebase, your current LLM bill, and your compliance constraints — then tell you honestly whether a private local model is worth the weekend.