Local AI Server

Run GGUF language models locally with llama.cpp, Vulkan GPU acceleration, and llama-swap. The server exposes an OpenAI-compatible API and discovers models placed in ~/ai/models.

What it provides

OpenAI-compatible chat and completion endpoints
Vulkan acceleration on supported NVIDIA, AMD, and Intel GPUs
Automatic discovery of .gguf model files
On-demand model loading and switching through llama-swap
A systemd user service
Update, start, stop, and configuration helper scripts

OpenAI-compatible client
          |
          v
 llama-swap (localhost)
          |
          v
 llama.cpp + Vulkan
          |
          v
    GGUF model files

Requirements

Ubuntu or Debian on an x86-64 machine
A Vulkan-capable GPU and working Vulkan driver
sudo access during installation
Enough RAM and VRAM for the model and quantization you choose

The installer pins llama.cpp b9672 and llama-swap v226, a combination known to be compatible. The separate update script checks for newer releases.

Install

git clone https://github.com/hossbit/localai.git
cd localai
chmod +x ./*.sh
./install-local-ai.sh

The installer:

Installs required system packages.
Downloads the pinned llama.cpp b9672 and llama-swap v226 releases.
Creates ~/ai/bin, ~/ai/models, and the helper scripts.
Selects an available port, beginning at 11435.
Creates a systemd user service.
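
The port selection step can be pictured with the following Python sketch. It is not the installer's code: the function name, the search limit, and the bind-to-probe approach are assumptions made for illustration. Only the starting port 11435 and the loopback address come from this README.

```python
import socket


def find_free_port(start: int = 11435, limit: int = 100, host: str = "127.0.0.1") -> int:
    """Return the first port >= start that can currently be bound on host.

    Hypothetical helper: the real installer may probe ports differently.
    """
    last = min(start + limit, 65536)
    for port in range(start, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use or not permitted; try the next one
            return port
    raise RuntimeError(f"no free port in range {start}-{last - 1}")
```

Calling find_free_port() with no arguments returns 11435 when that port is free and the next bindable port otherwise. A port that is free at probe time can still be taken before the server starts, so a check like this is best treated as a hint rather than a reservation.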

The installer does not start the server automatically. Add at least one model, then start it with:

systemctl --user start localai

To start it now and automatically each time you log in:

systemctl --user enable --now localai

Add a model

Place one or more .gguf files in:

~/ai/models

For example, with the Hugging Face CLI:

python3 -m pip install --user huggingface_hub
hf auth login

hf download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF \
  Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf \
  --local-dir ~/ai/models

Some model repositories require a Hugging Face account and a read token. See the Hugging Face documentation on access tokens.

The model ID exposed by the API is the filename without .gguf. For example:

Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

becomes:

Qwen2.5-Coder-7B-Instruct-Q4_K_M
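
The filename-to-ID rule can be expressed as a small Python sketch. The function names are hypothetical, and scanning only the top level of the models directory is an assumption. The project's own discovery logic lives in rebuild-config.sh and may differ.

```python
from pathlib import Path


def model_id(filename: str) -> str:
    """Strip a trailing .gguf extension and leave the rest of the name intact."""
    suffix = ".gguf"
    return filename[: -len(suffix)] if filename.endswith(suffix) else filename


def discover_model_ids(models_dir: Path) -> list[str]:
    """Return the IDs of all .gguf files directly inside models_dir, sorted.

    Assumption: subdirectories are not searched.
    """
    return sorted(model_id(path.name) for path in models_dir.glob("*.gguf"))
```

For example, discover_model_ids(Path.home() / "ai" / "models") lists the IDs you would expect to see from the /v1/models endpoint, provided the server applies the same rule.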

Choose a quantization

Quantization	Relative quality	Relative memory use
Q2_K	Lowest	Smallest
Q3_K_M	Good	Low
Q4_K_M	Recommended balance	Medium
Q5_K_M	Better	High
Q6_K	Very good	Higher
Q8_0	Near FP16	Highest

Q4_K_M is a useful starting point for GPUs with limited VRAM. Actual memory use also depends on model size, context length, and GPU-offloaded layers.
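
To see why context length matters, the following sketch estimates the size of the KV cache alone, using the common approximation of one key vector and one value vector per layer, KV head, and token. The function name and parameters are illustrative. The layer count, KV head count, and head dimension must be taken from the metadata of the model you actually use. The model weights and compute buffers come on top of this figure.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_size: int, bytes_per_element: float) -> float:
    """Approximate KV cache size in bytes.

    The factor 2 accounts for storing both keys and values.
    bytes_per_element is 2 for an f16 cache; quantized cache types use less.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_element
```

For a hypothetical model with 28 layers, 4 KV heads, and a head dimension of 128, an f16 cache at a context of 32768 tokens comes to 1.75 GiB, while a context of 8192 tokens needs a quarter of that. Reducing the context size is therefore often the quickest way to fit a model into limited memory.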

Use the server

Read the selected port:

PORT=$(cat ~/ai/port)

List available models:

curl "http://127.0.0.1:${PORT}/v1/models"
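
The same listing can be done from Python with only the standard library. This is a minimal sketch: the function names are hypothetical, and it assumes the response follows the usual OpenAI list shape, an object whose data array holds entries with an id field.

```python
import json
from pathlib import Path
from urllib.request import urlopen


def base_url(port_file: Path = Path.home() / "ai" / "port") -> str:
    """Build the API base URL from the port file written by the installer."""
    port = port_file.read_text().strip()
    return f"http://127.0.0.1:{port}/v1"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(url: str) -> list[str]:
    """Fetch {url}/models and return the model IDs it reports."""
    with urlopen(f"{url}/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```

With the server running, print(list_models(base_url())) shows the IDs that can be passed as the model field of a request.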

Send a chat request:

MODEL="Qwen2.5-Coder-7B-Instruct-Q4_K_M"

curl "http://127.0.0.1:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"${MODEL}\",
    \"messages\": [
      {\"role\": \"user\", \"content\": \"What is Linux?\"}
    ]
  }"

Python with the OpenAI SDK:

from pathlib import Path
from openai import OpenAI

port = Path.home().joinpath("ai/port").read_text().strip()
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="local")

response = client.chat.completions.create(
    model="Qwen2.5-Coder-7B-Instruct-Q4_K_M",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)

The local server does not validate api_key, but OpenAI client libraries usually require a non-empty value.

Service and helper commands

# Start, stop, restart, and inspect the systemd service
systemctl --user start localai
systemctl --user stop localai
systemctl --user restart localai
systemctl --user status localai

# Follow service output
journalctl --user -u localai -f

# Run the helpers directly
~/ai/start.sh
~/ai/stop.sh
~/ai/rebuild-config.sh
~/ai/update-local-ai.sh
~/ai/uninstall-local-ai.sh

When the server is run directly through the helper scripts rather than through systemd, its logs are written to:

~/ai/logs/llama-swap.log

Configuration

rebuild-config.sh creates ~/ai/config.yaml from every .gguf file in ~/ai/models. It runs automatically whenever the server starts.

The defaults are:

Context size: 32768
GPU layers: 10
KV cache: q8_0
Idle model timeout: 900 seconds
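
As a rough picture of what a generated configuration could look like, the sketch below renders a llama-swap style config from a list of model files and the defaults above. It is an illustration, not the output of rebuild-config.sh. The function name is hypothetical, the models, cmd, and ttl keys and the ${PORT} macro follow llama-swap's configuration format as I understand it, and whether the project passes exactly these llama-server flags is an assumption.

```python
from pathlib import Path


def render_config(model_paths: list[Path], server_bin: str,
                  ctx_size: int = 32768, n_gpu_layers: int = 10,
                  kv_cache_type: str = "q8_0", ttl_seconds: int = 900) -> str:
    """Render one config entry per model file, keyed by filename without .gguf."""
    lines = ["models:"]
    for path in sorted(model_paths):
        # Assumed flag set; the real script may pass different options.
        cmd = (
            f"{server_bin} --port ${{PORT}} -m {path} "
            f"--ctx-size {ctx_size} --n-gpu-layers {n_gpu_layers} "
            f"--cache-type-k {kv_cache_type} --cache-type-v {kv_cache_type}"
        )
        lines += [
            f'  "{path.stem}":',
            f"    cmd: {cmd}",
            f"    ttl: {ttl_seconds}",  # seconds of idle time before unloading
        ]
    return "\n".join(lines) + "\n"
```

Comparing the output of a sketch like this with the real ~/ai/config.yaml is a quick way to see which settings the project actually applies.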

Override context size or GPU layers for one start:

CTX_SIZE=8192 N_GPU_LAYERS=20 ~/ai/start.sh
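
The override behaviour amounts to "use the environment variable if it is set, otherwise the default". A minimal Python sketch of that rule follows. The function and constant names are hypothetical, and treating an empty value as unset is an assumption. The variable names and default values are the ones documented above.

```python
import os
from typing import Mapping

# Defaults documented in this README.
DEFAULTS = {"CTX_SIZE": 32768, "N_GPU_LAYERS": 10}


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, int]:
    """Return each setting from env when present and non-empty, else its default."""
    settings = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "").strip()
        settings[name] = int(raw) if raw else default
    return settings
```

For instance, resolve_settings({"CTX_SIZE": "8192"}) keeps the default of 10 GPU layers and changes only the context size.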

If you use systemd and want persistent overrides, add them with:

systemctl --user edit localai

Then enter:

[Service]
Environment=CTX_SIZE=8192
Environment=N_GPU_LAYERS=20

Apply the change:

systemctl --user daemon-reload
systemctl --user restart localai

Update

From the cloned repository:

./update-local-ai.sh

Or use the installed copy:

~/ai/update-local-ai.sh

The updater checks GitHub for the latest compatible releases, updates outdated components, and preserves models and the configured port. When run from the repository, it also refreshes the installed helper scripts. By default it starts the server after an update, using the systemd user service if it is installed; pass --no-start to leave the server stopped.
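
A release check of this kind can be sketched in Python as follows. This is not the updater's implementation. The function names are hypothetical, and the comparison assumes tags consist of an optional letter prefix followed by a number, as in b9672 and v226. The lookup uses the GitHub REST endpoint for a repository's latest release, which returns the tag in its tag_name field.

```python
import json
import re
from urllib.request import urlopen


def build_number(tag: str) -> int:
    """Extract the numeric part of a tag such as 'b9672' or 'v226'."""
    match = re.fullmatch(r"[A-Za-z]*(\d+)", tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return int(match.group(1))


def is_newer(installed: str, latest: str) -> bool:
    """Return True when the latest tag has a higher number than the installed one."""
    return build_number(latest) > build_number(installed)


def latest_tag(repo: str) -> str:
    """Ask the GitHub API for the latest release tag of 'owner/name'."""
    url = f"https://api.github.com/repos/{repo}/releases/latest"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)["tag_name"]
```

For example, is_newer("b9672", latest_tag("ggml-org/llama.cpp")) reports whether a newer llama.cpp build has been published, assuming that is the repository the project tracks. Unauthenticated GitHub API requests are rate limited, so a check like this should not be run in a tight loop.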

Uninstall

Remove the user service and installed helper files:

~/ai/uninstall-local-ai.sh

By default the uninstaller keeps ~/ai/models. To remove downloaded models too:

~/ai/uninstall-local-ai.sh --remove-models

To also remove the shared llama-swap binary installed in /usr/local/bin:

~/ai/uninstall-local-ai.sh --remove-llama-swap

Troubleshooting

Check the configured port and models:

cat ~/ai/port
ls -lh ~/ai/models
curl "http://127.0.0.1:$(cat ~/ai/port)/v1/models"

Check GPU detection:

~/ai/bin/llama-server --list-devices

Check logs:

tail -n 100 ~/ai/logs/llama-swap.log
journalctl --user -u localai -n 100 --no-pager

If a Hugging Face download returns 401 Unauthorized:

hf auth logout
hf auth login
hf auth whoami

Security

The helper scripts bind llama-swap to 127.0.0.1, so the API is available only on the local machine by default. Do not expose it to a network without adding authentication, TLS, and appropriate firewall rules.

Credits

This project is built on top of llama.cpp and llama-swap. Special thanks to the maintainers and contributors of these projects.

LocalAI focuses on simplifying installation, configuration, model management, and service deployment for local LLM environments.

License

MIT License
