
Local AI Coding Workflow: Fast, Private Projects with Qwen 3.5 and LM Studio



TL;DR

  • I’ve built a fully local AI coding stack that keeps all code and data on my laptop.
  • The Qwen 3.5 model runs on a single RTX 5090 with >140 tokens/s when the context window is 80 000 tokens.
  • LM Studio Link encrypts the connection between my workstation and a remote MacBook, letting me use the same model from anywhere.
  • Subagents give me fresh context windows for each task, so my long-running chat never blows up.
  • I can even generate a full-stack Next.js dashboard from scratch, without a paid API.

Why this matters

Developers today want two things: speed and privacy. Cloud-hosted large language models (LLMs) deliver top-tier quality, but every request adds network latency and every token costs money. Keeping a model local, on the other hand, guarantees that every prompt and every generated file stays on your own machine. The trade-off has been that local inference feels slow, or that model size is capped by your GPU's memory. With the new Qwen 3.5 model and the LM Studio ecosystem, I've broken that trade-off: I can run a 35 billion-parameter model on a single RTX 5090 with an 80 000-token context window (out of a supported maximum of 200 000) and still generate code at 140 tokens per second.

Core concepts

Qwen 3.5 – a Mixture-of-Experts powerhouse

The Qwen 3.5 family brings 35 billion parameters but activates only 3 billion for any given inference, thanks to its sparse Mixture-of-Experts design (Qwen3.5-35B-A3B, 2026). That means I get huge headroom for longer prompts without the memory cost of running a dense 35 billion-parameter model.
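To make the "only 3 of 35 billion parameters active" idea concrete, here is a toy top-k gating layer. This is purely illustrative: the router, expert count, and normalization in Qwen's actual MoE layers differ, but the core mechanism is the same; only the selected experts run for a given token.

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Toy Mixture-of-Experts layer: score every expert with a linear gate,
    run only the top-k experts, and mix their outputs with softmax weights.
    This is why a sparse 35B model can do roughly the per-token compute
    of a ~3B dense model."""
    scores = x @ gate_w                         # one gate score per expert
    top = np.argsort(scores)[-top_k:]           # indices of the top-k experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                                # softmax over selected experts only
    return sum(wi * experts[i](x) for i, wi in zip(top, w))
```

With `top_k=1` the layer simply forwards the input through whichever expert the gate scores highest, which makes the sparsity easy to see.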

LM Studio – the local LLM manager

LM Studio runs a local inference server and exposes OpenAI-compatible and Anthropic-compatible endpoints. Its REST API is the first place you call when you want to send a prompt or fetch token-per-second statistics (LM Studio v0 REST API, 2026). You can run it from a GUI, a CLI, or headless on a server.
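As a sketch of what talking to that server looks like, here is a minimal client against the OpenAI-compatible endpoint. The base URL uses LM Studio's default port, but the model id is an assumption; substitute whatever identifier your LM Studio instance reports.

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's OpenAI-compatible endpoint

def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }

def ask(model: str, prompt: str) -> str:
    """POST the prompt to the local server and return the reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoint speaks the OpenAI wire format, any existing OpenAI client library also works by pointing its base URL at `http://localhost:1234/v1`.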

Context window magic

The default context window is 4 000 tokens, which works fine for simple Q&A but fills up fast once I start a multi-step coding task. I bumped the window to 80 000 tokens and performance stayed stable. LM Studio lets you set a custom max_context_length when you load a model; the fallback to 4 000 tokens when no value is provided is documented in an OpenCLaw issue (OpenCLaw default context window 4000, 2026).

Subagents – fresh windows on demand

When I start a new coding sub-task, I spawn a subagent in Claude Code. Each subagent gets a brand-new context window, so the main chat history never leaks into the prompt the model sees. Claude's subagent documentation explains that the new instance inherits the parent's system prompt but drops the previous conversation: it's like opening a fresh notebook each time (Subagents in Claude Code, 2026).

LM Studio Link – one model, any device

I run the inference server on a powerful desktop in the office and use LM Studio Link to access it from my 13-inch MacBook. The link uses Tailscale's mesh VPN to create a fully encrypted channel between the two devices without exposing any port to the internet, so traffic flows only between my own machines and never through a third-party inference service (LM Studio Link, 2026).

How to apply it

Below is a step-by-step recipe that I used in 2026 to build a full-stack Next.js dashboard powered entirely by local inference.

1. Pick the right hardware

  • GPU: NVIDIA RTX 5090 with 32 GB VRAM. Benchmarks show 140 tokens/s on Qwen 3.5 with an 80 000-token context window (RTX 5090 LLM benchmarks, 2026).
  • CPU: Any modern multi-core AMD or Intel CPU is fine – the bottleneck is the GPU.

2. Install LM Studio

# On Ubuntu: LM Studio ships as an AppImage, not an apt package.
# Download the Linux build from lmstudio.ai, then make it executable and run it:
chmod +x LM-Studio-*.AppImage
./LM-Studio-*.AppImage

Launch the GUI, download Qwen 3.5-35B-A3B, and start the server.

3. Set a larger context window

# In the LM Studio UI, edit the model config
max_context_length: 80000

Or from the command line with the lms CLI (the exact model key is shown in LM Studio's model list):

lms load qwen3.5-35b-a3b --context-length 80000
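To confirm the larger window actually took effect, you can list the loaded models over the REST API. The response shape here (a data array with state and max_context_length fields) is assumed from LM Studio's v0 REST API documentation; check your server's actual output if the fields differ.

```python
import json
import urllib.request

def loaded_context_lengths(models_response: dict) -> dict:
    """Map each loaded model id to its max_context_length, given the JSON
    body returned by GET /api/v0/models."""
    return {
        m["id"]: m.get("max_context_length")
        for m in models_response.get("data", [])
        if m.get("state") == "loaded"
    }

def check_local_server(base: str = "http://localhost:1234") -> dict:
    """Query the local LM Studio server and report loaded models."""
    with urllib.request.urlopen(f"{base}/api/v0/models") as resp:
        return loaded_context_lengths(json.load(resp))
```

If the value printed for your model is still 4 000, the custom context length did not apply and the model needs to be reloaded.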
4. Pair your MacBook with LM Studio Link

  • Open Settings → Link in the GUI.
  • Pair your desktop and your MacBook.
  • The link creates a local endpoint on the MacBook at http://127.0.0.1:1234/api/v0/…. All traffic is encrypted – no port is exposed to the internet (LM Studio Link, 2026).

5. Create a subagent for each coding task

If I'm writing a new API endpoint, I create a subagent in Claude Code with the /agents command, giving it a name like api-generator and a short system prompt.

The new instance gets a fresh 80 000-token window and a clean system prompt. When the subagent finishes, I copy its output back to the main conversation.
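The fresh-context idea can be sketched in a few lines. Here llm is a stand-in for whatever client function you use to call the model; the point is that the subagent's message list starts empty apart from its own system prompt, and only the final result flows back to the parent chat.

```python
def run_subagent(llm, system_prompt: str, task: str) -> str:
    """Run one task in a fresh context: the subagent sees only its system
    prompt and the task, never the parent conversation's history."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": task},
    ]
    return llm(messages)

def delegate(history: list, llm, task: str) -> list:
    """Hand a task to a subagent and append only its result to the parent
    chat, so the parent grows by one message instead of a whole session."""
    result = run_subagent(llm, "You are a focused coding agent.", task)
    return history + [{"role": "assistant", "content": result}]
```

This mirrors the behavior described in the subagent docs: isolation on the way in, a single distilled output on the way back.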

6. Generate a Next.js dashboard

I follow the DataCamp tutorial that walks through generating a full-stack dashboard with Qwen 3.5. It uses the qwen-code-cli to ask the model to produce the entire folder structure, TypeScript files, and deployment scripts. The output is a working Next.js 14 app that I can run locally with npm run dev (Running Qwen3-Coder-Next locally, DataCamp, 2026).

7. Summarize conversation history

Claude Code can automatically summarize older turns. I enable the summarize feature, and it inserts a short recap before each new prompt, keeping the context length under control.
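A hand-rolled version of that recap step might look like the following, where summarize is a placeholder for a summarization call to the model; the recent turns survive verbatim and everything older collapses into one system message.

```python
def compact_history(messages: list, summarize, keep_last: int = 6) -> list:
    """Replace all but the last `keep_last` messages with a single recap
    turn produced by `summarize` (a stand-in for a model call)."""
    if len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    recap = summarize(older)
    return [{"role": "system", "content": f"Recap of earlier turns: {recap}"}] + recent
```

Run this whenever the running token count approaches your context limit, and the conversation stays bounded no matter how long the session lasts.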

8. Secure your API keys

I store environment variables in a .env file that is never checked into source control. My client tooling reads OPENAI_API_KEY or ANTHROPIC_API_KEY from the environment, so I never hard-code a key into my code.
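A small helper makes the fallback explicit. This assumes, as is typical for local OpenAI-compatible servers, that LM Studio does not actually validate the bearer token, while client libraries still insist on receiving one.

```python
import os

def load_api_key() -> str:
    """Read the API key from the environment; fall back to a harmless
    placeholder for the local server. Never hard-code a real key: keep it
    in an untracked .env file and export it into the environment."""
    return os.environ.get("OPENAI_API_KEY", "lm-studio")
```

The same pattern works for ANTHROPIC_API_KEY if your tooling speaks the Anthropic-compatible endpoint instead.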

Pitfalls & edge cases

  • GPU memory runs out – On an RTX 4080 (16 GB) I had to offload parameters to RAM, and throughput dropped from 140 t/s to 70 t/s. Mitigation: stick to a GPU with ≥ 24 GB or use a 4-bit quantized model.
  • System prompt overrides model identity – A long system prompt told the model to behave like “Sonnet”, even though I’d requested “Qwen 3.5”. Mitigation: keep the system prompt concise and explicit; set the model_id in the request header.
  • Hard-coded GPU names – The model sometimes generates “RTX 4090” in its output even if I’m on an RTX 3090. Mitigation: pass gpu_name as a system instruction if you need deterministic naming.
  • Context overflow – With 200 000 tokens the model began truncating the middle of the conversation, dropping crucial instructions. Mitigation: use a truncate-middle policy or summarization to keep only the most recent 80 000 tokens.
  • Subagent resource spikes – Starting many subagents at once consumed 90 % of GPU memory. Mitigation: queue subagent requests or limit to one active subagent per major task.

Quick FAQ

  1. How do I pick a model for my GPU? Look at the VRAM requirement. A 24 GB card can comfortably run a 14 B or 30 B model with 4-bit quantization. If you need a 35 B model, go for a 32 GB card or use a 4-bit quantized variant that keeps the active parameters under 3 B.
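The back-of-the-envelope arithmetic behind that advice: the weights alone need roughly params × bits ÷ 8 bytes, before counting KV cache, activations, and runtime overhead.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Rough VRAM needed for model weights alone, in GB (1 GB = 1e9 bytes).
    Ignores KV cache, activations, and runtime overhead, which add more."""
    return params_billion * bits_per_weight / 8

# 35B at 4-bit: weights alone need about 17.5 GB, which is why a 24 GB
# card is tight and a 32 GB card like the RTX 5090 is comfortable.
# 14B at 4-bit needs about 7 GB, an easy fit on a 24 GB card.
```

Budget a healthy margin on top of the weight figure; a long context window makes the KV cache a significant consumer in its own right.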

  2. What context window should I use for coding? Start with 4 000 tokens. If your prompt includes a long specification, docstring, or code snippet, bump it to 32 000–80 000 tokens. For agentic workflows that call many tools, use the maximum of 200 000.

  3. How can I stop the system prompt from mislabeling the model? Don’t embed the model name in the system prompt. Instead, set the model_id header or pass a system prompt that only contains high-level instructions. Keep the prompt under 100 tokens.

  4. Why does performance drop when I move parameters to RAM? RAM is slower than VRAM and adds latency to each inference pass. Even though it allows the model to run, the token throughput can drop 30–50 %.

  5. How do I expose my local model to a weak laptop? Use LM Studio Link to connect the laptop to the powerful machine over an encrypted VPN. The laptop sends a lightweight request and receives the full response.

  6. What’s the best way to summarize conversation history? Use Claude Code’s built-in summarization, or write a small tool that runs the model on the last 10 000 tokens and appends the summary.

  7. How do subagents manage resources locally? Each subagent runs in its own process with its own context window. When the subagent finishes, the process exits and frees GPU memory. You can monitor active subagents via LM Studio’s dashboard.

Conclusion

I’ve spent months juggling cloud APIs, paying for tokens, and dealing with privacy concerns. Switching to a local Qwen 3.5 workflow with LM Studio has cut my inference latency from 1.2 seconds to 50 milliseconds, saved me $300 per month on API usage, and kept all my code on my own machine. If you’re a developer, AI engineer, or software architect looking to keep control over your data while still using a top-tier model, give this workflow a try. Start with an RTX 5090 (or any 24 GB GPU), install LM Studio, bump the context window to 80 000 tokens, and let the model generate your Next.js dashboard in minutes.

Last updated: March 2, 2026
