Troubleshooting¶
All troubleshooting starts by SSHing into the GCE instance:
Container-Optimized OS (COS)
The GCE instance runs COS, which has a read-only root filesystem. You cannot install packages with apt, write to /etc, or modify system files. All mutable state lives on the persistent disk at /mnt/disks/data/. Docker is the only way to run software. Keep this in mind for all troubleshooting steps below.
General diagnostics¶
Run these first to understand the current state:
# Container running?
docker ps -a
# All three processes healthy?
docker exec obsidian-palace supervisorctl status
# Expected:
# mcp-server RUNNING pid 42, uptime 1:23:45
# nginx RUNNING pid 12, uptime 1:23:45
# obsidian-sync RUNNING pid 37, uptime 1:23:45
# Recent container logs (all processes interleaved)
docker logs obsidian-palace --tail 200
# Per-process logs via supervisord
docker exec obsidian-palace supervisorctl tail -f mcp-server
docker exec obsidian-palace supervisorctl tail -f obsidian-sync
docker exec obsidian-palace supervisorctl tail -f nginx
# Persistent disk mounted?
mount | grep /mnt/disks/data
ls /mnt/disks/data/
# Expected: vault/ chromadb/ chroma-cache/ obsidian-config/ letsencrypt/ certbot-webroot/ docker-config/ state/
# Health check from inside the instance (bypasses nginx/SSL)
curl http://localhost:8080/health
supervisorctl socket issues
If supervisorctl status returns a socket connection error, the supervisord unix socket may not be configured. You can still check process status with docker exec obsidian-palace ps aux and view logs with docker logs.
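When checking several processes at once, the status output can be filtered mechanically rather than read by eye. The following is a minimal sketch, not part of the project: `not_running` is a hypothetical helper that assumes the two-column `name STATE ...` layout shown in the expected output above.

```shell
# not_running: read `supervisorctl status` output on stdin and print the
# name of every process whose state column is anything other than RUNNING.
# (Hypothetical helper; assumes the "name STATE ..." layout shown above.)
not_running() {
  awk 'NF && $2 != "RUNNING" { print $1 }'
}

# Usage on the instance (prints nothing when all processes are healthy):
#   docker exec obsidian-palace supervisorctl status | not_running
```

An empty result means every supervised process reported RUNNING; any name printed is the process whose log is worth tailing first.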
SSL certificate not issued¶
If curl https://YOUR_DOMAIN/health fails with a certificate error, certbot may not have run successfully.
# Check if certs exist
ls /mnt/disks/data/letsencrypt/live/
# Re-run certbot manually if needed (stop the container first to free port 80)
docker stop obsidian-palace
docker run --rm \
-v /mnt/disks/data/letsencrypt:/etc/letsencrypt \
-v /mnt/disks/data/certbot-webroot:/var/www/certbot \
-p 80:80 \
certbot/certbot certonly \
--standalone \
--non-interactive \
--agree-tos \
--email your-email@gmail.com \
-d YOUR_DOMAIN
docker start obsidian-palace
Common causes:
- DNS not propagated yet -- verify with dig YOUR_DOMAIN that it resolves to your static IP
- Port 80 blocked -- the firewall rule must allow HTTP for certbot's challenge
- Certbot rate limits -- Let's Encrypt allows only 5 duplicate certificates (same set of hostnames) per week, so repeated re-issues for the same domain can exhaust the limit
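The DNS check in the first cause can be scripted so that it fails loudly instead of relying on a visual comparison. This is a sketch only: `resolves_to` is a hypothetical helper, and `YOUR_STATIC_IP` stands for the instance's reserved address. Run it from any machine that has dig installed.

```shell
# resolves_to EXPECTED_IP: succeed if EXPECTED_IP appears as a whole line
# among the addresses read from stdin (one per line, as `dig +short` prints).
# (Hypothetical helper, not part of the project.)
resolves_to() {
  grep -Fxq -- "$1"
}

# Usage:
#   dig +short YOUR_DOMAIN | resolves_to YOUR_STATIC_IP && echo "DNS OK"
```

Matching whole lines with fixed strings avoids a false positive when one address is a prefix of another.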
Obsidian Sync: ob CLI troubleshooting¶
The ob CLI (obsidian-headless) is the Node.js sidecar that syncs your vault. Most sync issues trace back to auth tokens or sync configuration.
Verify auth token¶
# Check the symlink and auth token on the persistent disk
docker exec obsidian-palace ls -la /root/.obsidian-headless/
# Should be a symlink to /data/obsidian-config/headless/
docker exec obsidian-palace cat /data/obsidian-config/headless/auth_token
# Should be a non-empty token string
If the auth token is missing or the symlink is broken, re-authenticate by running ob login inside the container (docker exec -it obsidian-palace ob login).
The entrypoint script symlinks /root/.obsidian-headless/ -> /data/obsidian-config/headless/ on startup, so the token persists on the data disk across container restarts.
Verify sync configuration¶
ob sync-setup writes config to ~/.config/obsidian-headless/sync/<vault-id>/config.json, which the entrypoint symlinks to /data/obsidian-config/config/. If this config is missing, sync won't start even if the auth token is valid (sync-guard.sh Gate 2 will block it).
# Check if sync config exists on the persistent disk
docker exec obsidian-palace find /data/obsidian-config/config/sync -name 'config.json' -type f
# Should show: /data/obsidian-config/config/sync/<vault-id>/config.json
# Verify the symlink
docker exec obsidian-palace ls -la /root/.config/obsidian-headless
# Should be a symlink to /data/obsidian-config/config/
# List available vaults (to find or verify your vault ID)
docker exec obsidian-palace ob list-vaults
# Output: vault ID (32-char hex), name, region
If sync config is missing, re-run setup:
docker exec -it obsidian-palace ob sync-setup \
--vault YOUR_VAULT_ID \
--path /data/vault \
--device-name obsidian-palace
Vault ID
The vault ID is a 32-character hex string (e.g., a4a2ccb7cd82d034751c55ad5e38c4a3), NOT the vault name. Use ob list-vaults to find it.
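A quick format check catches the common mistake of passing the vault name instead of the ID. The sketch below uses a hypothetical helper, `is_vault_id`, and assumes that either lowercase or uppercase hex digits are acceptable.

```shell
# is_vault_id VALUE: succeed only if VALUE is exactly 32 hexadecimal characters.
# (Hypothetical helper; accepting uppercase hex is an assumption.)
is_vault_id() {
  printf '%s\n' "$1" | grep -Eq '^[0-9a-fA-F]{32}$'
}

# Usage:
#   is_vault_id "YOUR_VAULT_ID" || echo "That does not look like a vault ID"
```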
Run sync manually¶
If ob sync --continuous is failing in supervisord, run it manually to see the error output in real time:
# Stop the supervised sync process
docker exec obsidian-palace supervisorctl stop obsidian-sync
# Run sync manually in the foreground
docker exec -it obsidian-palace ob sync --continuous --path /data/vault
# Once diagnosed, restart the supervised process
docker exec obsidian-palace supervisorctl start obsidian-sync
Common sync errors:
| Symptom | Cause | Fix |
|---|---|---|
| auth_token not found | Token missing or symlink broken | Re-run ob login |
| No sync config found / Gate 2 blocked | Sync config missing | Re-run ob sync-setup |
| Vault has N files, expected at least M / Gate 3 blocked | Vault below percentage of last known good count | Check vault content; delete /data/state/last_vault_count to reset, or lower OBSIDIAN_PALACE_MIN_VAULT_PERCENT (default: 80) |
| network error / ECONNREFUSED | Obsidian Sync servers unreachable | Check instance egress; try again later |
| Sync starts but no files appear | Wrong vault ID or empty vault | Verify vault ID with ob list-vaults |
| FATAL on startup, immediate exit | Auth token expired | Re-run ob login (tokens expire after extended periods) |
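The Gate 3 row involves a small calculation: the minimum acceptable file count is a percentage of the last known good count. The sketch below illustrates that arithmetic only. It is not the code in sync-guard.sh; the function names are hypothetical, and rounding down is an assumption.

```shell
# min_vault_files LAST_GOOD_COUNT [PERCENT]
# Print the smallest file count that would pass, using integer arithmetic
# (rounds down). PERCENT defaults to 80, matching the documented default
# of OBSIDIAN_PALACE_MIN_VAULT_PERCENT.
min_vault_files() {
  echo $(( $1 * ${2:-80} / 100 ))
}

# vault_count_ok CURRENT_COUNT LAST_GOOD_COUNT [PERCENT]
# Succeed if CURRENT_COUNT meets the threshold.
vault_count_ok() {
  [ "$1" -ge "$(min_vault_files "$2" "${3:-80}")" ]
}

# Usage: with a last known good count of 1000 and the default 80 percent,
#   min_vault_files 1000        prints 800
#   vault_count_ok 750 1000     fails, which corresponds to "Gate 3 blocked"
```

Working through the numbers this way shows whether lowering the percentage or resetting the stored count is the more appropriate fix.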
Sync config lost after container rebuild¶
Both the auth token and sync config are persisted on the data disk and symlinked into the container filesystem by entrypoint.sh on every boot:
| Container path | Symlink target (persistent disk) |
|---|---|
| ~/.obsidian-headless/ | /data/obsidian-config/headless/ |
| ~/.config/obsidian-headless/ | /data/obsidian-config/config/ |
Because the symlinks are recreated on every container start, rebuilding and redeploying the image does not lose either credential. sync-guard.sh verifies both exist before allowing ob sync to start.
If sync still breaks after a deploy:
- Check that the auth token exists: ls -la /data/obsidian-config/headless/auth_token
- Check that the sync config exists: find /data/obsidian-config/config/sync/ -name config.json
- If the auth token is missing, re-run ob login inside the container
- If the sync config is missing, re-run ob sync-setup inside the container
- Restart the container: docker restart obsidian-palace
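The first two checks in that list can be combined into one function that reports which item is missing. This is a sketch under the path layout described above; `missing_sync_state` is a hypothetical helper and does not reproduce what sync-guard.sh does internally.

```shell
# missing_sync_state DATA_ROOT
# Print "auth_token" and/or "sync_config" for whichever persisted item is
# absent under DATA_ROOT. Prints nothing when both are present.
# (Hypothetical helper; paths follow the layout described in this section.)
missing_sync_state() {
  root=$1
  # -s: the token file must exist and be non-empty
  [ -s "$root/obsidian-config/headless/auth_token" ] || echo "auth_token"
  if [ -z "$(find "$root/obsidian-config/config/sync" -name config.json -type f 2>/dev/null)" ]; then
    echo "sync_config"
  fi
}

# Usage inside the container:
#   missing_sync_state /data
```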
Server starts but doesn't respond to requests¶
Symptom: Container is running, docker ps shows it as healthy, but curl https://YOUR_DOMAIN/health hangs or times out.
Likely cause: Vault indexing blocking the server startup. This was a bug in early versions where index_vault() ran during the FastAPI lifespan startup, blocking uvicorn from serving requests. On e2-small with a 600MB vault, indexing can take several minutes.
Current behavior: Indexing now runs as a background asyncio.create_task(). The server starts immediately and search returns empty results until indexing completes. If you're seeing this issue, make sure you're on the latest image.
Diagnosis:
# Check if uvicorn is even listening
docker exec obsidian-palace curl -s http://localhost:8080/health
# Check MCP server logs for indexing progress
docker exec obsidian-palace supervisorctl tail -f mcp-server
# Look for: "Vault indexing started in background"
# Then later: "Background vault indexing complete: N files, M drawers"
# Check memory usage (indexing + ChromaDB can spike)
docker exec obsidian-palace cat /proc/meminfo | head -5
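To tell a server that is still starting from one that is stuck, it can help to poll the health endpoint for a bounded time rather than running curl once. The sketch below is a generic retry loop; `retry` is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between tries. Returns 0 on success, 1 otherwise.
# (Hypothetical helper, not part of the project.)
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    # Sleep only if another attempt will follow
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage (up to 30 tries, 10 seconds apart):
#   retry 30 10 docker exec obsidian-palace curl -sf http://localhost:8080/health
```

If the loop never succeeds, the server is not listening at all, which points away from background indexing and toward a startup failure in the mcp-server log.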
Slow cold starts (ONNX model download)¶
Symptom: First startup after a fresh container deploy takes 2-5 extra minutes. Subsequent restarts are fast.
Cause: MemPalace/ChromaDB uses an ONNX embedding model (~79MB) that is downloaded from the internet on first use. This happens inside index_vault() during the background indexing task.
Current behavior: The model is cached on the persistent disk at /data/chroma-cache/onnx_models/. The container's entrypoint.sh symlinks /root/.cache/chroma → /data/chroma-cache/ so the model survives container rebuilds and redeploys. The first boot after creating a new persistent disk still downloads the model, but all subsequent deploys reuse the cached copy.
# Verify the model is cached
docker exec obsidian-palace ls -la /data/chroma-cache/onnx_models/
# Should show: all-MiniLM-L6-v2/
If the model re-downloads on every deploy
Verify the symlink exists: docker exec obsidian-palace ls -la /root/.cache/chroma. It should point to /data/chroma-cache. If it's missing, the entrypoint may have failed — check docker logs obsidian-palace | head -30.
MCP client connection issues¶
Wrong transport endpoint¶
Different MCP clients use different transports:
| Client | Transport | Endpoint |
|---|---|---|
| Claude Desktop | SSE | https://YOUR_DOMAIN/sse |
| Claude Code | SSE | https://YOUR_DOMAIN/sse |
| Claude iOS / claude.ai | SSE | https://YOUR_DOMAIN/sse |
| OpenCode | Streamable HTTP | https://YOUR_DOMAIN/mcp |
Common mistake: Configuring OpenCode with the SSE endpoint (/sse). OpenCode uses Streamable HTTP (POST to a single endpoint), so it sends a POST to /sse which returns 405 Method Not Allowed. Fix: point OpenCode at /mcp.
{
"mcp": {
"obsidian-palace": {
"type": "remote",
"url": "https://YOUR_DOMAIN/mcp"
}
}
}
OAuth discovery chain¶
MCP clients follow a specific OAuth discovery sequence. If any step fails, the client won't connect. You can verify each step manually:
# 1. Unauthenticated request should return 401 with WWW-Authenticate header
curl -sI https://YOUR_DOMAIN/sse
# Look for: WWW-Authenticate: Bearer resource_metadata="..."
# 2. Protected resource metadata
curl -s https://YOUR_DOMAIN/.well-known/oauth-protected-resource | python3 -m json.tool
# 3. Authorization server metadata
curl -s https://YOUR_DOMAIN/.well-known/oauth-authorization-server | python3 -m json.tool
# 4. Dynamic client registration (should accept POST)
curl -s -X POST https://YOUR_DOMAIN/register \
-H "Content-Type: application/json" \
-d '{"redirect_uris": ["http://localhost:9999/callback"], "client_name": "test"}'
If any of these return errors, check the MCP server logs.
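Step 1 is easier to verify if the metadata URL is pulled out of the response headers and then fetched directly. The sketch below assumes the header has the `resource_metadata="..."` form shown in step 1; `resource_metadata_url` is a hypothetical helper.

```shell
# resource_metadata_url: read HTTP response headers on stdin and print the
# URL from the resource_metadata parameter of the WWW-Authenticate header.
# Header names are matched case-insensitively because HTTP/2 responses
# are printed in lowercase by curl. (Hypothetical helper.)
resource_metadata_url() {
  tr -d '\r' |
    sed -n 's/^[Ww][Ww][Ww]-[Aa]uthenticate:.*resource_metadata="\([^"]*\)".*/\1/p'
}

# Usage:
#   curl -sI https://YOUR_DOMAIN/sse | resource_metadata_url
```

If nothing is printed, the 401 response is missing the header a client needs to begin discovery, so the later steps are never reached.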
OAuth discovery returns localhost¶
Symptom: MCP clients fail to connect. curl https://YOUR_DOMAIN/.well-known/oauth-protected-resource returns URLs pointing to https://localhost:8080/ instead of your public domain.
Cause: The OBSIDIAN_PALACE_SERVER_URL environment variable is not set or not being passed to the container. The server defaults to https://localhost:8080.
Fix: Ensure the docker run command includes -e OBSIDIAN_PALACE_SERVER_URL="https://YOUR_DOMAIN". In the Terraform-managed startup script, this is set automatically from var.domain. Verify the domain variable is set in your TFC workspace.
# Check what the container sees
docker exec obsidian-palace env | grep SERVER_URL
# Expected: OBSIDIAN_PALACE_SERVER_URL=https://YOUR_DOMAIN
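The same symptom can be detected from outside the instance by scanning the discovery documents for loopback URLs. This is a sketch with a hypothetical helper, `leaks_localhost`; it pattern-matches the raw JSON text rather than parsing it.

```shell
# leaks_localhost: succeed if the text on stdin contains an http(s) URL
# whose host is localhost or 127.0.0.1. (Hypothetical helper; a plain
# text match, not a JSON parser.)
leaks_localhost() {
  grep -Eq 'https?://(localhost|127\.0\.0\.1)([:/"]|$)'
}

# Usage:
#   curl -s https://YOUR_DOMAIN/.well-known/oauth-protected-resource |
#     leaks_localhost && echo "OBSIDIAN_PALACE_SERVER_URL is probably unset"
```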
OAuth token issues¶
Symptom: Auth flow completes (browser redirects back) but the MCP client gets 401 on subsequent requests.
Known bug (fixed): In Python, if token.expires_at and ... skips validation when expires_at=0 because 0 is falsy. The fix is if token.expires_at is not None and .... If you're seeing this, make sure you're on the latest image.
Diagnosis:
# Check server logs during the auth flow (supervisorctl's -N is a byte count, not a line count)
docker exec obsidian-palace supervisorctl tail -50000 mcp-server | grep -i "token\|auth\|oauth"
Other OAuth issues:
- Redirect URI mismatch: The Google OAuth client's redirect URI must be exactly https://YOUR_DOMAIN/oauth2/callback. No trailing slash, no port number.
- Test user not added: While the OAuth consent screen is in "Testing" status, only explicitly added test users can authenticate. Go to Google Cloud Console > APIs & Services > OAuth consent screen > Test users.
- Wrong client type: The OAuth client must be "Web application," not "Desktop." Desktop clients don't support redirect URIs.
Container not starting¶
# Check container status and exit code
docker ps -a
# Look at the STATUS column — if it says "Exited (1)", check logs
# Container logs (includes entrypoint + supervisord output)
docker logs obsidian-palace --tail 200
# Startup script logs (COS metadata script runner)
sudo journalctl -u google-startup-scripts.service --no-pager | tail -100
Common causes:
| Symptom | Cause | Fix |
|---|---|---|
| Container never starts | Image pull failed (Artifact Registry auth) | Check docker pull manually; verify service account has artifactregistry.reader |
| Exits immediately | Missing environment variables | Check startup script injects secrets from Secret Manager |
| Exits after a few seconds | supervisord config error | Check docker logs for supervisord parse errors |
| Running but no ports exposed | Docker run missing -p flags | Check the startup script's docker run command in instance metadata |
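The exit code in the STATUS column narrows down which row of the table applies. By shell convention, a code above 128 means the process was terminated by signal number (code - 128), so 137 corresponds to SIGKILL, which is also what the kernel OOM killer sends. The sketch below applies that convention; `describe_exit` is a hypothetical helper.

```shell
# describe_exit CODE: print a short interpretation of a container exit code.
# Codes above 128 follow the 128 + signal-number convention.
# (Hypothetical helper, not part of the project.)
describe_exit() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "exited with error code $code"
  fi
}

# Usage:
#   describe_exit "$(docker inspect --format '{{.State.ExitCode}}' obsidian-palace)"
```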
Re-run the startup script¶
The COS startup script handles pulling the image, injecting secrets, and starting the container. If something went wrong during boot, you can re-run it with sudo google_metadata_script_runner startup.
This is safe to re-run -- it's idempotent. It will stop the existing container (if any), pull the latest image, and start a new container.
Persistent disk issues¶
Verify disk is mounted¶
mount | grep /mnt/disks/data
# Should show: /dev/sdb on /mnt/disks/data type ext4 (rw,relatime)
ls /mnt/disks/data/
# Expected directories: vault/ chromadb/ chroma-cache/ obsidian-config/ letsencrypt/ certbot-webroot/ docker-config/ state/
If the disk isn't mounted, the startup script should handle it. Re-run it with sudo google_metadata_script_runner startup.
Check data integrity after instance reset¶
After a gcloud compute instances reset, the persistent disk is reattached but the mount may need to be re-established. Verify:
# Check disk is attached
lsblk
# Should show sdb (or similar) with the correct size
# Check mount
mount | grep /mnt/disks/data
# If not mounted, mount it manually
sudo mkdir -p /mnt/disks/data
sudo mount /dev/sdb /mnt/disks/data
# Then restart the container
sudo google_metadata_script_runner startup
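The mount check can be made exact instead of relying on a grep substring match, which would also match paths such as /mnt/disks/data-old. The sketch below reads a mount table in /proc/mounts format; `is_mounted` is a hypothetical helper.

```shell
# is_mounted MOUNTPOINT: read a mount table in /proc/mounts format on stdin
# and succeed only if the second field of some line equals MOUNTPOINT exactly.
# (Hypothetical helper, not part of the project.)
is_mounted() {
  awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }'
}

# Usage:
#   is_mounted /mnt/disks/data < /proc/mounts || echo "data disk is not mounted"
```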
ChromaDB out of memory¶
The e2-small instance has 2 GB RAM. With a 600MB vault, the Python server + ChromaDB + ONNX embeddings + Node.js sync process typically uses 1.2-1.6 GB. If your vault is significantly larger (>1 GB of markdown), you may hit OOM.
Diagnosis:
# Check memory usage
free -h
docker stats obsidian-palace --no-stream
# Check if OOM killer fired
dmesg | grep -i "out of memory\|oom"
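For a single number to compare against the 1.2-1.6 GB estimate, memory pressure can be computed from /proc/meminfo. The sketch below reports used memory as a percentage of total, treating MemAvailable as the free portion; `mem_used_percent` is a hypothetical helper.

```shell
# mem_used_percent: read /proc/meminfo-formatted text on stdin and print
# (MemTotal - MemAvailable) as a whole-number percentage of MemTotal.
# (Hypothetical helper, not part of the project.)
mem_used_percent() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }'
}

# Usage on the instance:
#   mem_used_percent < /proc/meminfo
```

A value that stays near 100 while indexing runs is consistent with the OOM scenario described here.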
Fix: Upgrade to e2-medium (4 GB RAM). Set the machine_type Terraform variable:
Alternatively, disable MemPalace indexing to save memory (you lose semantic search):