Troubleshooting

All troubleshooting starts by SSHing into the GCE instance:

gcloud compute ssh obsidian-palace --zone=us-central1-a --project=YOUR_PROJECT_ID

Container-Optimized OS (COS)

The GCE instance runs COS, which has a read-only root filesystem and no package manager. You cannot install packages or modify system files, and changes under /etc do not survive a reboot. All mutable state lives on the persistent disk at /mnt/disks/data/. Docker is the only way to run software. Keep this in mind for all troubleshooting steps below.

General diagnostics

Run these first to understand the current state:

# Container running?
docker ps -a

# All three processes healthy?
docker exec obsidian-palace supervisorctl status
# Expected:
#   mcp-server       RUNNING   pid 42, uptime 1:23:45
#   nginx            RUNNING   pid 12, uptime 1:23:45
#   obsidian-sync    RUNNING   pid 37, uptime 1:23:45

# Recent container logs (all processes interleaved)
docker logs obsidian-palace --tail 200

# Per-process logs via supervisord
docker exec obsidian-palace supervisorctl tail -f mcp-server
docker exec obsidian-palace supervisorctl tail -f obsidian-sync
docker exec obsidian-palace supervisorctl tail -f nginx

# Persistent disk mounted?
mount | grep /mnt/disks/data
ls /mnt/disks/data/
# Expected: vault/ chromadb/ chroma-cache/ obsidian-config/ letsencrypt/ certbot-webroot/ docker-config/ state/

# Health check from inside the instance (bypasses nginx/SSL)
curl http://localhost:8080/health
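
If you script these checks, the supervisorctl status output can be parsed instead of read by eye. The following is a minimal sketch, assuming the whitespace-separated `name STATE description` layout shown in the expected output above. `unhealthy_processes` is a hypothetical helper, not part of the project.

```python
def unhealthy_processes(status_output: str) -> list[str]:
    """Return the names of processes whose state is not RUNNING."""
    unhealthy = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, state = parts[0], parts[1]
        if state != "RUNNING":
            unhealthy.append(name)
    return unhealthy


# Sample text modelled on the expected output, with one process failing.
sample = """\
mcp-server       RUNNING   pid 42, uptime 1:23:45
nginx            RUNNING   pid 12, uptime 1:23:45
obsidian-sync    FATAL     Exited too quickly (process log may have details)
"""
print(unhealthy_processes(sample))
```

Feed it the captured output of docker exec obsidian-palace supervisorctl status; an empty list means all processes report RUNNING.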

supervisorctl socket issues

If supervisorctl status returns a socket connection error, the supervisord unix socket may not be configured. You can still check process status with docker exec obsidian-palace ps aux and view logs with docker logs.

SSL certificate not issued

If curl https://YOUR_DOMAIN/health fails with a certificate error, certbot may not have run successfully.

# Check if certs exist
ls /mnt/disks/data/letsencrypt/live/

# Re-run certbot manually if needed (stop the container first to free port 80)
docker stop obsidian-palace
docker run --rm \
  -v /mnt/disks/data/letsencrypt:/etc/letsencrypt \
  -v /mnt/disks/data/certbot-webroot:/var/www/certbot \
  -p 80:80 \
  certbot/certbot certonly \
    --standalone \
    --non-interactive \
    --agree-tos \
    --email your-email@gmail.com \
    -d YOUR_DOMAIN
docker start obsidian-palace

Common causes:

DNS not propagated yet -- verify with dig YOUR_DOMAIN that it resolves to your static IP
Port 80 blocked -- the firewall rule must allow HTTP for certbot's challenge
Let's Encrypt rate limits -- repeated attempts can hit the limits of 5 duplicate certificates (same set of hostnames) per week and 5 failed validations per hostname per hour

Obsidian Sync: `ob` CLI troubleshooting

The ob CLI (obsidian-headless) is the Node.js sidecar that syncs your vault. Most sync issues trace back to auth tokens or sync configuration.

Verify auth token

# Check the symlink and auth token on the persistent disk
docker exec obsidian-palace ls -la /root/.obsidian-headless/
# Should be a symlink to /data/obsidian-config/headless/

docker exec obsidian-palace cat /data/obsidian-config/headless/auth_token
# Should be a non-empty token string

If the auth token is missing or the symlink is broken, re-authenticate:

docker exec -it obsidian-palace ob login
# Enter: email, password, MFA code (interactive)

The entrypoint script symlinks /root/.obsidian-headless/ -> /data/obsidian-config/headless/ on startup, so the token persists on the data disk across container restarts.

Verify sync configuration

ob sync-setup writes its config to ~/.config/obsidian-headless/sync/<vault-id>/config.json. The entrypoint symlinks ~/.config/obsidian-headless/ to /data/obsidian-config/config/, so the file is stored on the persistent disk. If this config is missing, sync won't start even if the auth token is valid (sync-guard.sh Gate 2 will block it).

# Check if sync config exists on the persistent disk
docker exec obsidian-palace find /data/obsidian-config/config/sync -name 'config.json' -type f
# Should show: /data/obsidian-config/config/sync/<vault-id>/config.json

# Verify the symlink
docker exec obsidian-palace ls -la /root/.config/obsidian-headless
# Should be a symlink to /data/obsidian-config/config/

# List available vaults (to find or verify your vault ID)
docker exec obsidian-palace ob list-vaults
# Output: vault ID (32-char hex), name, region

If sync config is missing, re-run setup:

docker exec -it obsidian-palace ob sync-setup \
  --vault YOUR_VAULT_ID \
  --path /data/vault \
  --device-name obsidian-palace

Vault ID

The vault ID is a 32-character hex string (e.g., a4a2ccb7cd82d034751c55ad5e38c4a3), NOT the vault name. Use ob list-vaults to find it.
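
A quick format check can catch a vault name pasted where the ID belongs. This is a minimal sketch based only on the "32-character hex string" description above. `looks_like_vault_id` is a hypothetical helper, and accepting uppercase hex digits is an assumption.

```python
import re

# 32 hexadecimal characters, nothing else (uppercase accepted as an assumption).
_VAULT_ID = re.compile(r"[0-9a-fA-F]{32}")


def looks_like_vault_id(value: str) -> bool:
    """Return True if value has the shape of a vault ID."""
    return _VAULT_ID.fullmatch(value) is not None


print(looks_like_vault_id("a4a2ccb7cd82d034751c55ad5e38c4a3"))
print(looks_like_vault_id("My Notes"))
```

A passing check only confirms the shape of the string; ob list-vaults remains the way to confirm the ID belongs to your account.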

Run sync manually

If ob sync --continuous is failing in supervisord, run it manually to see the error output in real time:

# Stop the supervised sync process
docker exec obsidian-palace supervisorctl stop obsidian-sync

# Run sync manually in the foreground
docker exec -it obsidian-palace ob sync --continuous --path /data/vault

# Once diagnosed, restart the supervised process
docker exec obsidian-palace supervisorctl start obsidian-sync

Common sync errors:

| Symptom | Cause | Fix |
| --- | --- | --- |
| `auth_token not found` | Token missing or symlink broken | Re-run `ob login` |
| `No sync config found` / Gate 2 blocked | Sync config missing | Re-run `ob sync-setup` |
| `Vault has N files, expected at least M` / Gate 3 blocked | Vault below percentage of last known good count | Check vault content; delete `/data/state/last_vault_count` to reset, or lower `OBSIDIAN_PALACE_MIN_VAULT_PERCENT` (default: 80) |
| `network error` / `ECONNREFUSED` | Obsidian Sync servers unreachable | Check instance egress; try again later |
| Sync starts but no files appear | Wrong vault ID or empty vault | Verify vault ID with `ob list-vaults` |
| `FATAL` on startup, immediate exit | Auth token expired | Re-run `ob login` (tokens expire after extended periods) |
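
The Gate 3 row describes a percentage threshold against the last known good file count. The real check lives in sync-guard.sh; the sketch below only illustrates the arithmetic. The function name, the rounding-up of the threshold, and the treatment of a missing baseline as a pass are all assumptions.

```python
def vault_passes_gate(current_count: int, last_known_good: int,
                      min_percent: int = 80) -> bool:
    """Return True if the vault holds at least min_percent of the baseline.

    min_percent mirrors the default of OBSIDIAN_PALACE_MIN_VAULT_PERCENT.
    """
    if last_known_good <= 0:
        # No baseline recorded yet (assumed behaviour): nothing to compare.
        return True
    # Integer ceiling of last_known_good * min_percent / 100.
    required = (last_known_good * min_percent + 99) // 100
    return current_count >= required


print(vault_passes_gate(800, 1000))  # exactly 80% of the baseline
print(vault_passes_gate(799, 1000))  # one file short of 80%
```

Under these assumptions, a baseline of 1000 files requires at least 800 files at the default threshold, which is why deleting the stored count or lowering the percentage unblocks a vault that legitimately shrank.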

Sync config lost after container rebuild

Both the auth token and sync config are persisted on the data disk and symlinked into the container filesystem by entrypoint.sh on every boot:

| Container path | Symlink target (persistent disk) |
| --- | --- |
| `~/.obsidian-headless/` | `/data/obsidian-config/headless/` |
| `~/.config/obsidian-headless/` | `/data/obsidian-config/config/` |

Because the symlinks are recreated on every container start, rebuilding and redeploying the image loses neither the auth token nor the sync config. sync-guard.sh verifies both exist before allowing ob sync to start.

If sync still breaks after a deploy:

Check that the auth token exists: docker exec obsidian-palace ls -la /data/obsidian-config/headless/auth_token
Check that the sync config exists: docker exec obsidian-palace find /data/obsidian-config/config/sync/ -name config.json
If the auth token is missing, re-run ob login inside the container
If the sync config is missing, re-run ob sync-setup inside the container
Restart the container: docker restart obsidian-palace

Server starts but doesn't respond to requests

Symptom: Container is running, docker ps shows it as healthy, but curl https://YOUR_DOMAIN/health hangs or times out.

Likely cause: Vault indexing is blocking server startup. In early versions, index_vault() ran during the FastAPI lifespan startup, which kept uvicorn from serving requests until indexing finished. On e2-small with a 600 MB vault, indexing can take several minutes.

Current behavior: Indexing now runs as a background asyncio.create_task(). The server starts immediately and search returns empty results until indexing completes. If you're seeing this issue, make sure you're on the latest image.
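
The difference between the old and current behavior comes down to awaiting the indexing coroutine versus scheduling it. Below is a minimal stdlib-only sketch of the scheduling pattern. It is not the project's code: `index_vault` here is a stand-in that sleeps, and the `state` dictionary and `startup` function are hypothetical.

```python
import asyncio


async def index_vault(state: dict) -> None:
    """Stand-in for the real indexer: pretend to work, then mark done."""
    await asyncio.sleep(0.05)
    state["indexed"] = True


async def startup(state: dict) -> None:
    # Schedule indexing without awaiting it, so startup returns immediately.
    # Keeping a reference prevents the task from being garbage collected.
    state["task"] = asyncio.create_task(index_vault(state))
    state["serving"] = True


async def main() -> tuple[bool, bool]:
    state = {"indexed": False, "serving": False}
    await startup(state)
    # The server is already "serving" while the index is still empty.
    serving_before_index = state["serving"] and not state["indexed"]
    await state["task"]
    return serving_before_index, state["indexed"]


print(asyncio.run(main()))
```

Replacing the create_task call with a direct await of index_vault would reproduce the old blocking behavior, where nothing is served until indexing finishes.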

Diagnosis:

# Check if uvicorn is even listening
docker exec obsidian-palace curl -s http://localhost:8080/health

# Check MCP server logs for indexing progress
docker exec obsidian-palace supervisorctl tail -f mcp-server
# Look for: "Vault indexing started in background"
# Then later: "Background vault indexing complete: N files, M drawers"

# Check memory usage (indexing + ChromaDB can spike)
docker exec obsidian-palace cat /proc/meminfo | head -5

Slow cold starts (ONNX model download)

Symptom: First startup after a fresh container deploy takes 2-5 extra minutes. Subsequent restarts are fast.

Cause: MemPalace/ChromaDB uses an ONNX embedding model (~79 MB) that is downloaded from the internet on first use. This happens inside index_vault() during the background indexing task.

Current behavior: The model is cached on the persistent disk at /data/chroma-cache/onnx_models/. The container's entrypoint.sh symlinks /root/.cache/chroma → /data/chroma-cache/ so the model survives container rebuilds and redeploys. The first boot after creating a new persistent disk still downloads the model, but all subsequent deploys reuse the cached copy.

# Verify the model is cached
docker exec obsidian-palace ls -la /data/chroma-cache/onnx_models/
# Should show: all-MiniLM-L6-v2/

If the model re-downloads on every deploy

Verify the symlink exists: docker exec obsidian-palace ls -la /root/.cache/chroma. It should point to /data/chroma-cache. If it's missing, the entrypoint may have failed — check docker logs obsidian-palace | head -30.

MCP client connection issues

Wrong transport endpoint

Different MCP clients use different transports:

| Client | Transport | Endpoint |
| --- | --- | --- |
| Claude Desktop | SSE | `https://YOUR_DOMAIN/sse` |
| Claude Code | SSE | `https://YOUR_DOMAIN/sse` |
| Claude iOS / claude.ai | SSE | `https://YOUR_DOMAIN/sse` |
| OpenCode | Streamable HTTP | `https://YOUR_DOMAIN/mcp` |

Common mistake: Configuring OpenCode with the SSE endpoint (/sse). OpenCode uses Streamable HTTP (POST to a single endpoint), so it sends a POST to /sse which returns 405 Method Not Allowed. Fix: point OpenCode at /mcp.

~/.config/opencode/opencode.json

{
  "mcp": {
    "obsidian-palace": {
      "type": "remote",
      "url": "https://YOUR_DOMAIN/mcp"
    }
  }
}

OAuth discovery chain

MCP clients follow a specific OAuth discovery sequence. If any step fails, the client won't connect. You can verify each step manually:

# 1. Unauthenticated request should return 401 with WWW-Authenticate header
curl -sI https://YOUR_DOMAIN/sse
# Look for: WWW-Authenticate: Bearer resource_metadata="..."

# 2. Protected resource metadata
curl -s https://YOUR_DOMAIN/.well-known/oauth-protected-resource | python3 -m json.tool

# 3. Authorization server metadata
curl -s https://YOUR_DOMAIN/.well-known/oauth-authorization-server | python3 -m json.tool

# 4. Dynamic client registration (should accept POST)
curl -s -X POST https://YOUR_DOMAIN/register \
  -H "Content-Type: application/json" \
  -d '{"redirect_uris": ["http://localhost:9999/callback"], "client_name": "test"}'

If any of these return errors, check the MCP server logs.
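
Step 1 hands the client the URL used in step 2, so it helps to confirm that the header actually carries it. A minimal sketch, assuming the header has the `resource_metadata="..."` form shown in the comment for step 1. `resource_metadata_url` is a hypothetical helper.

```python
import re
from typing import Optional


def resource_metadata_url(www_authenticate: str) -> Optional[str]:
    """Extract the resource_metadata URL from a WWW-Authenticate value."""
    match = re.search(r'resource_metadata="([^"]*)"', www_authenticate)
    return match.group(1) if match else None


# Example header value; YOUR_DOMAIN is the same placeholder used above.
header = ('Bearer resource_metadata='
          '"https://YOUR_DOMAIN/.well-known/oauth-protected-resource"')
print(resource_metadata_url(header))
```

If the helper returns None for the header your server sends, the client has no way to reach step 2, which points at the server configuration rather than the client.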

OAuth discovery returns `localhost`

Symptom: MCP clients fail to connect. curl https://YOUR_DOMAIN/.well-known/oauth-protected-resource returns URLs pointing to https://localhost:8080/ instead of your public domain.

Cause: The OBSIDIAN_PALACE_SERVER_URL environment variable is not set or not being passed to the container. The server defaults to https://localhost:8080.

Fix: Ensure the docker run command includes -e OBSIDIAN_PALACE_SERVER_URL="https://YOUR_DOMAIN". In the Terraform-managed startup script, this is set automatically from var.domain. Verify the domain variable is set in your TFC workspace.

# Check what the container sees
docker exec obsidian-palace env | grep SERVER_URL
# Expected: OBSIDIAN_PALACE_SERVER_URL=https://YOUR_DOMAIN
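
To spot this symptom without reading the JSON by hand, the metadata document can be scanned for loopback hosts. A minimal sketch: `localhost_urls` is a hypothetical helper, and the sample field names follow the OAuth protected resource metadata format rather than output captured from this server.

```python
from urllib.parse import urlparse

LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def localhost_urls(value) -> list:
    """Recursively collect string values whose URL host is a loopback name."""
    found = []
    if isinstance(value, dict):
        for item in value.values():
            found.extend(localhost_urls(item))
    elif isinstance(value, list):
        for item in value:
            found.extend(localhost_urls(item))
    elif isinstance(value, str):
        try:
            host = urlparse(value).hostname
        except ValueError:
            host = None  # not a parseable URL, so not a match
        if host in LOCAL_HOSTS:
            found.append(value)
    return found


# Illustrative document showing the misconfigured case.
metadata = {
    "resource": "https://localhost:8080/",
    "authorization_servers": ["https://localhost:8080/"],
    "scopes_supported": ["read"],
}
print(localhost_urls(metadata))
```

Pass it the parsed JSON from the /.well-known/oauth-protected-resource response; any non-empty result means OBSIDIAN_PALACE_SERVER_URL is not reaching the container.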

OAuth token issues

Symptom: Auth flow completes (browser redirects back) but the MCP client gets 401 on subsequent requests.

Known bug (fixed): An earlier version guarded the expiry check with if token.expires_at and .... Because 0 is falsy in Python, the check was skipped whenever expires_at was 0. The fix is if token.expires_at is not None and .... If you're seeing this, make sure you're on the latest image.
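
The truthiness pitfall is easy to reproduce in isolation. The sketch below is not the server's code: the `Token` class and the comparison after `and` are assumptions made for illustration, since the source elides the rest of the condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    # Unix timestamp; None is taken here to mean "no expiry recorded".
    expires_at: Optional[int]


def is_expired_buggy(token: Token, now: int) -> bool:
    # 0 is falsy, so the comparison never runs when expires_at == 0.
    return bool(token.expires_at and token.expires_at < now)


def is_expired_fixed(token: Token, now: int) -> bool:
    # Only None skips the comparison; 0 is compared like any other value.
    return token.expires_at is not None and token.expires_at < now


epoch_token = Token(expires_at=0)
print(is_expired_buggy(epoch_token, now=1_700_000_000))
print(is_expired_fixed(epoch_token, now=1_700_000_000))
```

The two functions agree for every value except 0, which is why the bug only surfaced for tokens carrying that specific expiry.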

Diagnosis:

# Check server logs during the auth flow
# Note: supervisorctl tail -N returns the last N bytes, not lines
docker exec obsidian-palace supervisorctl tail -20000 mcp-server | grep -i "token\|auth\|oauth"

Other OAuth issues:

Redirect URI mismatch: The Google OAuth client's redirect URI must be exactly https://YOUR_DOMAIN/oauth2/callback. No trailing slash, no port number.
Test user not added: While the OAuth consent screen is in "Testing" status, only explicitly added test users can authenticate. Go to Google Cloud Console > APIs & Services > OAuth consent screen > Test users.
Wrong client type: The OAuth client must be "Web application," not "Desktop." Desktop clients do not let you register a custom redirect URI.

Container not starting

# Check container status and exit code
docker ps -a
# Look at the STATUS column — if it says "Exited (1)", check logs

# Container logs (includes entrypoint + supervisord output)
docker logs obsidian-palace --tail 200

# Startup script logs (COS metadata script runner)
sudo journalctl -u google-startup-scripts.service --no-pager | tail -100

Common causes:

| Symptom | Cause | Fix |
| --- | --- | --- |
| Container never starts | Image pull failed (Artifact Registry auth) | Run `docker pull` manually; verify the service account has `roles/artifactregistry.reader` |
| Exits immediately | Missing environment variables | Check that the startup script injects secrets from Secret Manager |
| Exits after a few seconds | supervisord config error | Check `docker logs` for supervisord parse errors |
| Running but no ports exposed | Docker run missing `-p` flags | Check the startup script's `docker run` command in instance metadata |

Re-run the startup script

The COS startup script handles pulling the image, injecting secrets, and starting the container. If something went wrong during boot, you can re-run it:

sudo google_metadata_script_runner startup

This is safe to re-run -- it's idempotent. It will stop the existing container (if any), pull the latest image, and start a new container.

Persistent disk issues

Verify disk is mounted

mount | grep /mnt/disks/data
# Should show: /dev/sdb on /mnt/disks/data type ext4 (rw,relatime)

ls /mnt/disks/data/
# Expected directories: vault/ chromadb/ chroma-cache/ obsidian-config/ letsencrypt/ certbot-webroot/ docker-config/ state/

If the disk isn't mounted, the startup script should handle it. Re-run:

sudo google_metadata_script_runner startup

Check data integrity after instance reset

After a gcloud compute instances reset, the persistent disk stays attached, but the mount may need to be re-established. Verify:

# Check disk is attached
lsblk
# Should show sdb (or similar) with the correct size

# Check mount
mount | grep /mnt/disks/data

# If not mounted, mount it manually
sudo mkdir -p /mnt/disks/data
sudo mount /dev/sdb /mnt/disks/data

# Then restart the container
sudo google_metadata_script_runner startup

ChromaDB out of memory

The e2-small instance has 2 GB RAM. With a 600 MB vault, the Python server + ChromaDB + ONNX embeddings + Node.js sync process typically uses 1.2-1.6 GB. If your vault is significantly larger (>1 GB of markdown), you may hit OOM.

Diagnosis:

# Check memory usage
free -h
docker stats obsidian-palace --no-stream

# Check if OOM killer fired
dmesg | grep -i "out of memory\|oom"

Fix: Upgrade to e2-medium (4 GB RAM). Set the machine_type Terraform variable:

variable "machine_type" {
  default = "e2-medium"  # 4 GB RAM, ~$26/mo
}

Alternatively, disable MemPalace indexing to save memory (you lose semantic search):

# Set in Terraform or directly as an env var
OBSIDIAN_PALACE_MEMPALACE_ENABLED=false
