
Apple Intelligence API Server: From Swift to JavaScript
Published by Brav

Expose Apple Intelligence via a local Swift API server. Learn how to set up, test, and scale this OpenRouter-compatible endpoint on an M2 MacBook Air.
TL;DR:
- Apple Intelligence runs only on Apple Silicon via Swift’s Foundation Models.
- A lightweight Swift API server can expose it as an OpenRouter-compatible endpoint.
- I’ll walk through building, testing, and scaling this server on an M2 MacBook Air.
- The server handles ~10 concurrent requests and 20 tokens per second out of the box.
- You can add authentication, rate limiting, structured outputs, and tool calling to turn it into a production-ready service.
Why this matters
I remember the first time I tried to tap into Apple Intelligence from my React app: the error stack kept telling me there was no public API. That was a pain, because I had a whole team of Python, JavaScript, and Swift developers who wanted to use the same model. Apple Intelligence is locked to Apple Silicon and only accessible through the Foundation Models framework in Swift, which meant I had to run a server on a Mac and hit it from the web. No cross-language access, no authentication, and limited concurrency made it impossible to ship a real product.
Core concepts
Apple Intelligence is a generative AI system that runs entirely on device, offering speed and privacy (Apple — Apple Intelligence Release, 2024). It sits on top of Apple’s Foundation Models framework, announced at WWDC 2025, and is reachable only through the framework’s Swift APIs, such as `LanguageModelSession` (Apple — Foundation Models, 2025). The framework offers streaming responses, guided generation, and tool calling, though not all of that is exposed by the starter server yet.
The idea is simple: spin up a local HTTP server on a macOS machine, translate incoming JSON requests into Foundation Models calls, and stream back the answer. The API is deliberately shaped to mimic OpenRouter and OpenAI so you can drop in existing SDKs with little change.
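To make the translation step concrete, here is a Python sketch of what such a shim does conceptually: flatten an OpenAI-style messages array into a single prompt before handing it to the on-device model. The function name and prompt format are illustrative, not the repo’s actual code:

```python
def messages_to_prompt(messages):
    """Naively flatten an OpenAI-style messages array into one prompt string.

    Illustrative only: the real server's translation layer and the prompt
    format the Foundation Models framework expects may differ.
    """
    parts = [f"{msg['role']}: {msg['content']}" for msg in messages]
    parts.append("assistant:")  # cue the model to continue as the assistant
    return "\n".join(parts)

payload = {
    "model": "apple-on-device",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello"},
    ],
}
print(messages_to_prompt(payload["messages"]))
```

The server then streams the model’s reply back as chunked JSON in the same shape the OpenAI API uses.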
| API | Local vs Cloud | Privacy | Rate Limiting | Notes |
|---|---|---|---|---|
| Apple Intelligence API Server | On-device | High (data never leaves the Mac) | None by default | Built with Swift + Vapor |
| OpenAI API | Cloud | Medium (data sent to OpenAI servers) | Standard key-based limits | Cloud-only |
| OpenRouter API | Cloud | Medium | Key-based limits | Routes to many model providers |
How to apply it
1. Prep your machine
Make sure you’re running macOS 26 or later (the release that ships the Foundation Models framework) on an Apple Silicon Mac, such as an M2 MacBook Air, and enable Apple Intelligence in Settings → Apple Intelligence & Siri. You’ll also need an Apple ID signed into Xcode, and Xcode 26 or later, which bundles the Foundation Models SDK.
2. Clone the repo
```bash
git clone https://github.com/gety-ai/apple-on-device-openai.git
cd apple-on-device-openai
```
The repo ships a SwiftUI app that starts a Vapor server at 127.0.0.1:11535. The code is tiny, just a couple dozen lines, and you can pull it into any Xcode project.
3. Build and run
```bash
open AppleOnDeviceOpenAI.xcodeproj
```
Click “Run” and wait for the console to say “Server started at 127.0.0.1:11535”. You now have an OpenAI-compatible /v1/chat/completions endpoint that talks directly to the on-device model.
4. Test with curl
```bash
curl -X POST http://127.0.0.1:11535/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"apple-on-device","messages":[{"role":"user","content":"Hello"}],"stream":true}'
```
You’ll see streaming tokens in real time, just like the OpenAI API. The server also exposes a /health endpoint for uptime checks.
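If you’d rather consume the stream programmatically, the chunks arrive as OpenAI-style Server-Sent Events (`data: {...}` lines, terminated by `data: [DONE]`). This Python sketch parses the content deltas; it runs against a canned sample so it doesn’t assume the server is up:

```python
import json

def extract_tokens(sse_text):
    """Collect the streamed content deltas from OpenAI-style SSE lines."""
    tokens = []
    for line in sse_text.splitlines():
        if not line.startswith("data: "):
            continue  # skip blank lines and comments
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if "content" in delta:
            tokens.append(delta["content"])
    return "".join(tokens)

# Canned sample of what an OpenAI-compatible endpoint streams back:
sample = "\n".join([
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
])
print(extract_tokens(sample))  # Hello
```

In a real client you would feed it the response body line by line instead of a canned string.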
5. Hook it up to your stack
In Python or Node, you can simply point the OpenAI SDK at the local server. The snippet below uses the current `openai>=1.0` client; the legacy `openai.ChatCompletion` interface was removed in that release:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11535/v1",
    api_key="apple",  # the starter server ignores the key, but the SDK requires one
)
resp = client.chat.completions.create(
    model="apple-on-device",
    messages=[{"role": "user", "content": "Tell me a joke"}],
)
print(resp.choices[0].message.content)
```
Because the API shape matches OpenAI, all existing SDKs (Python, JavaScript, Swift) will work out of the box.
6. Add auth & rate limiting
The starter repo has no auth. For production, add a simple middleware in Vapor (here `UserAuthMiddleware` is your own type):

```swift
app.grouped(UserAuthMiddleware()).post("v1", "chat", "completions") { req in ... }
```

Use a bearer token or JWT, and store the secret in the Keychain or a .env file. For rate limiting, add a token-bucket or leaky-bucket check per API key; Vapor doesn’t ship a rate limiter of its own, so pull in a community middleware package or write one yourself.
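The token-bucket logic itself is only a few lines in any language; here is a language-neutral Python sketch you could port into a Vapor middleware (class and parameters are illustrative):

```python
import time

class TokenBucket:
    """Token-bucket limiter: refills at `rate` tokens/sec, bursts up to `capacity`.

    Keep one bucket per API key; call allow() before serving each request.
    """
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start full so clients can burst immediately
        self.clock = clock          # injectable clock makes the logic testable
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

With `rate=1, capacity=2`, a client can fire two requests back to back, then gets one more per second.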
7. Measure performance
The load-tester repo (Gouwsxander — LLM API Load Testing, 2025) shows that on an M2 MacBook Air you get about 10 concurrent requests and ~20 tokens per second. This matches the numbers in the project docs, so treat those figures as a realistic ceiling for a single machine.
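If you want to reproduce those numbers yourself, a small harness that fires concurrent requests and reports tokens per second can look like this. The request function here is a stand-in; a real one would POST to the chat endpoint and count streamed tokens:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(request_fn, n_requests, concurrency):
    """Fire n_requests via a thread pool; return (total_tokens, tokens/sec).

    request_fn issues one request and returns the number of tokens received.
    """
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(lambda _: request_fn(), range(n_requests)))
    elapsed = time.monotonic() - start
    total = sum(token_counts)
    return total, total / elapsed

# Stand-in request: pretend each reply takes 50 ms and yields 8 tokens.
def fake_request():
    time.sleep(0.05)
    return 8

total, tps = measure_throughput(fake_request, n_requests=20, concurrency=10)
print(f"{total} tokens at {tps:.0f} tokens/sec")
```

Swap `fake_request` for a function that calls the server, and keep `concurrency` at or below ~10 to stay inside the M2’s ceiling.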
8. Scale out
If you need to serve many clients, you can run multiple instances behind an NGINX reverse proxy with sticky sessions. True horizontal scaling is awkward, because the model only runs on macOS: a Linux-based Docker image won’t help, so you’ll need more Mac hardware or a macOS cloud instance (e.g., an AWS EC2 Mac instance).
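As a sketch of the reverse-proxy setup, assuming two server instances on ports 11535 and 11536 (the second port is hypothetical), an NGINX config could look like:

```nginx
upstream apple_intelligence {
    ip_hash;                    # sticky sessions by client IP
    server 127.0.0.1:11535;
    server 127.0.0.1:11536;     # hypothetical second instance
}

server {
    listen 8080;
    location /v1/ {
        proxy_pass http://apple_intelligence;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;    # don't buffer, so token streaming stays live
    }
}
```

`proxy_buffering off` matters here: with buffering on, NGINX would hold back the streamed tokens instead of forwarding them as they arrive.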
Pitfalls & edge cases
- No built-in auth: exposing the server publicly without authentication is a security risk.
- Concurrency ceiling: the on-device model handles only ~10 concurrent requests on an M2. Overloading it will stall or drop requests.
- Tool calling is WIP: structured outputs and tool calling are not yet implemented in the starter; you’ll need to roll your own if you need them.
- Reverse-engineering risk: Apple’s Standard EULA explicitly forbids reverse engineering, which may have legal implications if you try to extract the model weights.
- Monetization uncertainty: Apple has said the model is free for users and developers, but the business model for hosting a public API on top of it is unclear.
Quick FAQ
- How can I add authentication to the server?
Use a Vapor auth middleware with a bearer token or JWT, and store the secret in the Keychain or an environment variable.
- What are the performance limits on an M2 MacBook Air?
Roughly 10 concurrent requests and 20 tokens per second, as measured with the load-tester repo.
- Is it legal to reverse engineer Apple Intelligence weights?
Apple’s Standard EULA prohibits reverse engineering of proprietary components; violating it can lead to legal action.
- Will Apple open up the Foundation Models framework further in the future?
The Swift framework is already available to third-party apps; what’s missing is official cross-language or server access, and Apple has published no roadmap for that.
- Can I monetize this service, and if so, how?
You could charge for API usage, but you must comply with Apple’s licensing, and Apple’s stance on commercial hosting of its model remains untested.
- Can I use tool calling with this server?
Not yet; the current repo does not implement tool calling, so you’d have to add it yourself.
- How do I handle structured outputs?
Once the server wires up the framework’s guided generation, you can constrain output to a schema. Until then, parse the plain text.
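Until guided generation is wired up, one pragmatic workaround is to ask the model for JSON in your prompt and then pull the first parseable JSON object out of whatever prose surrounds it. A tolerant extractor, as an illustrative sketch:

```python
import json

def extract_json_object(text):
    """Return the first decodable JSON object embedded in free-form text,
    or None if no balanced object parses."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch != "{":
            continue
        try:
            obj, _ = decoder.raw_decode(text[i:])  # stops at the matching brace
            return obj
        except json.JSONDecodeError:
            continue  # not a valid object here; keep scanning
    return None

reply = 'Sure! Here you go: {"joke": "Why did the duck cross the road?", "rating": 7} Hope that helps.'
print(extract_json_object(reply))
```

This is brittle compared to real schema-guided decoding, but it copes with the model wrapping its JSON in chatty preamble.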
Conclusion
I’ve spent dozens of hours wrestling with Apple’s on-device AI, but this approach finally gave me a stable, cross-language API that I can deploy on my own MacBook Air. If you’re a Swift developer looking to expose Apple Intelligence, or a Python/JavaScript dev who wants local inference, this repo is a solid starting point. Clone it, run the server, add auth and rate limiting, and you’re ready to serve a production workload, or at least a proof of concept. Keep an eye on Apple’s future announcements; they may eventually release a public API that would make this whole exercise unnecessary. Until then, the local server remains the best way to bring Apple Intelligence into any stack.
References
- Apple — Apple Intelligence Release (2024) (https://www.apple.com/newsroom/2024/10/apple-intelligence-is-available-today-on-iphone-ipad-and-mac/)
- Apple — Foundation Models (2025) (https://developer.apple.com/videos/play/wwdc2025/286/)
- Gety AI — Apple On-Device OpenAI API (2025) (https://github.com/gety-ai/apple-on-device-openai)
- Gouwsxander — LLM API Load Testing (2025) (https://github.com/gouwsxander/stress-test)
- Apple — Standard EULA (2024) (https://www.apple.com/legal/internet-services/itunes/dev/stdeula/)


