Manual Recovery Guide

Overview

Claude-mem’s manual recovery system helps you recover observations that get stuck in the processing queue after worker crashes, system restarts, or unexpected shutdowns.

Key Change in v5.x: Automatic recovery on worker startup is now disabled. This gives you explicit control over when reprocessing happens and prevents unexpected duplicate observations.

When Do You Need Manual Recovery?

You should trigger manual recovery when:
  • Worker crashed or restarted - Observations were queued, but the worker stopped before processing them
  • No new summaries appearing - Observations are being saved but not processed into summaries
  • Stuck messages detected - Messages have shown as “processing” for more than 5 minutes
  • System crashes - Unexpected shutdowns left messages in incomplete states

Quick Start

The interactive CLI tool is the safest and easiest way to recover stuck observations:
# Check status and prompt for recovery
bun scripts/check-pending-queue.ts
This will:
  1. Check worker health
  2. Show queue summary (pending, processing, failed, stuck counts)
  3. Display sessions with pending work
  4. Prompt you to confirm recovery
  5. Show recently processed messages for feedback

Auto-Process Without Prompts

For scripting or when you’re confident recovery is needed:
# Auto-process without prompting
bun scripts/check-pending-queue.ts --process

# Limit to 5 sessions
bun scripts/check-pending-queue.ts --process --limit 5

Understanding Queue States

Messages progress through these lifecycle states:
  1. pending → Queued, waiting to process
  2. processing → Currently being processed by the SDK agent
  3. processed → Completed successfully
  4. failed → Failed after 3 retry attempts
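
The lifecycle above can be modeled as a small state type. This is a minimal sketch, not claude-mem's actual implementation: `QueueMessage` and `afterAttempt` are hypothetical names, and the assumption that a message returns to pending between failed attempts is ours. Only the four state names and the limit of 3 come from this guide.

```typescript
// Lifecycle states as listed in this guide.
type MessageStatus = "pending" | "processing" | "processed" | "failed";

// From the guide: a message is marked failed after 3 retry attempts.
const MAX_RETRIES = 3;

// Hypothetical shape; field names follow the pending_messages columns.
interface QueueMessage {
  id: number;
  status: MessageStatus;
  retry_count: number;
}

// Hypothetical helper: the state a message moves to after one processing
// attempt. Going back to "pending" between attempts is an assumption.
function afterAttempt(msg: QueueMessage, succeeded: boolean): QueueMessage {
  if (succeeded) {
    return { ...msg, status: "processed" };
  }
  const retryCount = msg.retry_count + 1;
  const status: MessageStatus = retryCount >= MAX_RETRIES ? "failed" : "pending";
  return { ...msg, status, retry_count: retryCount };
}
```

Under these assumptions, a message that has already failed twice moves to failed on its next failure and is no longer retried automatically.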

Stuck Detection

Messages in processing state for >5 minutes are considered stuck:
  • They’re automatically reset to pending on worker startup
  • They’re NOT automatically reprocessed (requires manual trigger)
  • They appear in the stuckCount field when checking queue status
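
The stuck rule can be expressed as a short predicate. The sketch below assumes each message carries a `started_processing_at_epoch` timestamp in milliseconds, matching the database column used in the SQL later in this guide. `isStuck` and `countStuck` are hypothetical helpers, not functions exported by claude-mem.

```typescript
// 5 minutes, the stuck threshold stated in this guide.
const STUCK_THRESHOLD_MS = 5 * 60 * 1000;

interface ProcessingInfo {
  status: "pending" | "processing" | "processed" | "failed";
  // Epoch milliseconds; null when processing has not started.
  started_processing_at_epoch: number | null;
}

// Only "processing" messages older than the threshold count as stuck.
function isStuck(msg: ProcessingInfo, nowMs: number): boolean {
  if (msg.status !== "processing" || msg.started_processing_at_epoch === null) {
    return false;
  }
  return nowMs - msg.started_processing_at_epoch > STUCK_THRESHOLD_MS;
}

// Number of stuck messages in a list, comparable to the stuckCount field.
function countStuck(msgs: ProcessingInfo[], nowMs: number): number {
  return msgs.filter((m) => isStuck(m, nowMs)).length;
}
```

Passing the current time as a parameter, rather than calling `Date.now()` inside, keeps the check deterministic and easy to test.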

Recovery Methods

Method 1: Interactive CLI Tool

Best for: Regular users, interactive sessions, when you want visibility into what’s happening
bun scripts/check-pending-queue.ts
Example Output:
Checking worker health...
Worker is healthy ✓

Queue Summary:
  Pending: 12 messages
  Processing: 2 messages (1 stuck)
  Failed: 0 messages
  Recently Processed: 5 messages in last 30 minutes

Sessions with pending work: 3
  Session 44: 5 pending, 1 processing (age: 2m)
  Session 45: 4 pending, 1 processing (age: 7m - STUCK)
  Session 46: 2 pending

Would you like to process these pending queues? (y/n)
Features:
  • ✅ Pre-flight health check (verifies worker is running)
  • ✅ Detailed queue breakdown by session
  • ✅ Age tracking for stuck detection
  • ✅ Confirmation prompt (prevents accidental reprocessing)
  • ✅ Non-interactive mode with --process flag
  • ✅ Session limit control with --limit N

Method 2: HTTP API

Best for: Automation, scripting, integration with monitoring systems

Check Queue Status

curl http://localhost:37777/api/pending-queue
Response:
{
  "queue": {
    "messages": [
      {
        "id": 123,
        "session_db_id": 45,
        "claude_session_id": "abc123",
        "message_type": "observation",
        "status": "pending",
        "retry_count": 0,
        "created_at_epoch": 1730886600000
      }
    ],
    "totalPending": 12,
    "totalProcessing": 2,
    "totalFailed": 0,
    "stuckCount": 1
  },
  "recentlyProcessed": [...],
  "sessionsWithPendingWork": [44, 45, 46]
}
Key Fields:
  • totalPending - Messages waiting to process
  • totalProcessing - Messages currently processing
  • stuckCount - Processing messages >5 minutes old
  • sessionsWithPendingWork - Session IDs needing recovery
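
For scripts that consume this response, a typed sketch may help. The interface below only mirrors the fields shown in the sample response above; the element types of `messages` and `recentlyProcessed` are left as `unknown` because the sample does not fully specify them. `summarizeQueue` and `hasStuckMessages` are hypothetical helpers.

```typescript
// Shape inferred from the sample response in this guide.
interface PendingQueueResponse {
  queue: {
    messages: unknown[];
    totalPending: number;
    totalProcessing: number;
    totalFailed: number;
    stuckCount: number;
  };
  recentlyProcessed: unknown[];
  sessionsWithPendingWork: number[];
}

// One-line summary suitable for logs or alerts.
function summarizeQueue(res: PendingQueueResponse): string {
  const q = res.queue;
  const sessions = res.sessionsWithPendingWork.length;
  return (
    `${q.totalPending} pending, ${q.totalProcessing} processing ` +
    `(${q.stuckCount} stuck), ${q.totalFailed} failed across ${sessions} session(s)`
  );
}

// True when at least one processing message has exceeded the stuck threshold.
function hasStuckMessages(res: PendingQueueResponse): boolean {
  return res.queue.stuckCount > 0;
}
```

In practice the response object would come from parsing the body returned by `GET /api/pending-queue`. The helpers take an already-parsed object so they can be exercised without a running worker.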

Trigger Recovery

curl -X POST http://localhost:37777/api/pending-queue/process \
  -H "Content-Type: application/json" \
  -d '{"sessionLimit": 10}'
Response:
{
  "success": true,
  "totalPendingSessions": 15,
  "sessionsStarted": 10,
  "sessionsSkipped": 2,
  "startedSessionIds": [44, 45, 46, 47, 48, 49, 50, 51, 52, 53]
}
Response Fields:
  • totalPendingSessions - Total sessions with pending messages in database
  • sessionsStarted - Sessions for which this request started processing
  • sessionsSkipped - Sessions already processing (prevents duplicate agents)
  • startedSessionIds - Database IDs of sessions we started
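
The request and response above can be wrapped in typed helpers. This is a sketch under stated assumptions: `buildProcessRequest` and `describeResult` are hypothetical names, the base URL is the one used in the curl examples, and the positive-integer check on `sessionLimit` is our own guard rather than documented API behavior.

```typescript
// Base URL taken from the curl examples in this guide.
const BASE_URL = "http://localhost:37777";

interface ProcessRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Fields as shown in the sample response.
interface ProcessResponse {
  success: boolean;
  totalPendingSessions: number;
  sessionsStarted: number;
  sessionsSkipped: number;
  startedSessionIds: number[];
}

// Builds the same request as the curl command, without sending it.
function buildProcessRequest(sessionLimit: number): ProcessRequest {
  if (!Number.isInteger(sessionLimit) || sessionLimit < 1) {
    throw new RangeError("sessionLimit must be a positive integer");
  }
  return {
    url: `${BASE_URL}/api/pending-queue/process`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sessionLimit }),
  };
}

// Turns a parsed response into a short human-readable line.
function describeResult(res: ProcessResponse): string {
  if (!res.success) {
    return "recovery request was not successful";
  }
  return (
    `started ${res.sessionsStarted} of ${res.totalPendingSessions} pending session(s), ` +
    `skipped ${res.sessionsSkipped} already processing`
  );
}
```

The object returned by `buildProcessRequest` can be passed to any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.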

Best Practices

1. Always Check Before Recovery

# Check queue status first
curl http://localhost:37777/api/pending-queue

# Or use CLI tool which checks automatically
bun scripts/check-pending-queue.ts

2. Start with Low Session Limits

# Process only 5 sessions at a time
bun scripts/check-pending-queue.ts --process --limit 5
This prevents overwhelming the worker with too many concurrent SDK agents.

3. Monitor During Recovery

Watch worker logs while recovery runs:
npm run worker:logs
Look for:
  • SDK agent starts: Starting SDK agent for session...
  • Processing completions: Processed observation...
  • Errors: ERROR or Failed to process...
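
If you capture the log output, the three patterns above can be tallied with a small classifier. This sketch assumes log lines contain the literal phrases listed above; the exact log format is not specified here, so plain substring matching is used. `classifyLogLine` and `tallyLogLines` are hypothetical helpers.

```typescript
type LogEvent = "agent_start" | "processed" | "error" | "other";

// Classify one log line by the phrases this guide says to look for.
function classifyLogLine(line: string): LogEvent {
  // Check errors first so a failing line is never counted as progress.
  if (line.includes("ERROR") || line.includes("Failed to process")) {
    return "error";
  }
  if (line.includes("Starting SDK agent for session")) {
    return "agent_start";
  }
  if (line.includes("Processed observation")) {
    return "processed";
  }
  return "other";
}

// Count how many lines fall into each category.
function tallyLogLines(lines: string[]): Record<LogEvent, number> {
  const counts: Record<LogEvent, number> = {
    agent_start: 0,
    processed: 0,
    error: 0,
    other: 0,
  };
  for (const line of lines) {
    counts[classifyLogLine(line)] += 1;
  }
  return counts;
}
```

A rising `processed` count with zero `error` lines suggests recovery is progressing. Any `error` lines are worth reading in full before you trigger another batch.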

4. Verify Recovery Success

Check recently processed messages:
curl http://localhost:37777/api/pending-queue | jq '.recentlyProcessed'
Or use the CLI tool which shows this automatically.

5. Handle Failed Messages

Messages that fail 3 times are marked failed and won’t auto-retry:
# View failed messages
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT id, session_db_id, message_type, retry_count
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC;
"
You can manually reset them if needed:
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', retry_count = 0
  WHERE status = 'failed';
"

Troubleshooting

Recovery Not Working

Symptom: Triggered recovery but messages still pending
Solutions:
  1. Verify worker health:
    curl http://localhost:37777/health
    
  2. Check worker logs for errors:
    npm run worker:logs | grep -i error
    
  3. Restart worker:
    claude-mem restart
    
  4. Check database integrity:
    sqlite3 ~/.claude-mem/claude-mem.db "PRAGMA integrity_check;"
    

Messages Stuck Forever

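Symptom: Messages show as “processing” for hours
Solution: Force reset stuck messages
# Reset every message marked 'processing' to pending. This also resets
# messages that are actively being processed, not only stuck ones.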
sqlite3 ~/.claude-mem/claude-mem.db "
  UPDATE pending_messages
  SET status = 'pending', started_processing_at_epoch = NULL
  WHERE status = 'processing';
"

# Then trigger recovery
bun scripts/check-pending-queue.ts --process

Worker Crashes During Recovery

Symptom: Worker stops while processing recovered messages
Solutions:
  1. Check available memory:
    npm run worker:status
    
  2. Reduce session limit:
    bun scripts/check-pending-queue.ts --process --limit 3
    
  3. Check for SDK errors in logs:
    npm run worker:logs | grep -i "sdk"
    
  4. Increase worker memory (if using custom runner):
    export NODE_OPTIONS="--max-old-space-size=4096"
    claude-mem restart
    

Advanced Usage

Direct Database Inspection

View all pending messages:
sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    status,
    retry_count,
    datetime(created_at_epoch/1000, 'unixepoch') as created_at,
    datetime(started_processing_at_epoch/1000, 'unixepoch') as started_at,
    CAST((strftime('%s', 'now') * 1000 - started_processing_at_epoch) / 60000 AS INTEGER) as age_minutes
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  ORDER BY created_at_epoch;
"

Count Messages by Status

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT status, COUNT(*) as count
  FROM pending_messages
  GROUP BY status;
"

Find Sessions with Pending Work

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    session_db_id,
    COUNT(*) as pending_count,
    GROUP_CONCAT(message_type) as message_types
  FROM pending_messages
  WHERE status IN ('pending', 'processing')
  GROUP BY session_db_id;
"

View Recent Failures

sqlite3 ~/.claude-mem/claude-mem.db "
  SELECT
    id,
    session_db_id,
    message_type,
    retry_count,
    datetime(completed_at_epoch/1000, 'unixepoch') as failed_at
  FROM pending_messages
  WHERE status = 'failed'
  ORDER BY completed_at_epoch DESC
  LIMIT 10;
"

Integration Examples

Cron Job for Automatic Recovery

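#!/bin/bash
# Run every hour (for example from cron) to process stuck queues

# cron does not start in the repository, so change to it first.
# CLAUDE_MEM_DIR is a placeholder: set it to your claude-mem checkout.
cd "${CLAUDE_MEM_DIR:?set CLAUDE_MEM_DIR to your claude-mem checkout}" || exit 1

# Check if worker is healthy
if curl -f http://localhost:37777/health > /dev/null 2>&1; then
  # Auto-process up to 5 sessions
  bun scripts/check-pending-queue.ts --process --limit 5
else
  echo "Worker not healthy, skipping recovery" >&2
  exit 1
fi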

Monitoring Script

#!/bin/bash
# Alert if stuck count exceeds threshold

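# Default to 0 when the worker is unreachable or the field is missing,
# so the numeric test below never sees an empty value
STUCK_COUNT=$(curl -s http://localhost:37777/api/pending-queue | jq -r '.queue.stuckCount // 0')

if [ "${STUCK_COUNT:-0}" -gt 5 ]; then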
  echo "WARNING: $STUCK_COUNT stuck messages detected"
  # Send alert (email, Slack, etc.)
fi

Pre-Shutdown Recovery

#!/bin/bash
# Process pending queues before system shutdown

echo "Processing pending queues before shutdown..."
bun scripts/check-pending-queue.ts --process --limit 20

echo "Waiting for processing to complete..."
sleep 10

echo "Stopping worker..."
claude-mem stop

Migration Note

If you’re upgrading from v4.x to v5.x:
v4.x Behavior (Automatic Recovery):
  • Worker automatically recovered stuck messages on startup
  • No user control over reprocessing timing
v5.x Behavior (Manual Recovery):
  • Stuck messages detected but NOT automatically reprocessed
  • User must explicitly trigger recovery via CLI or API
  • Prevents unexpected duplicate observations
  • Provides explicit control over when processing happens
Migration Steps:
  1. Upgrade to v5.x
  2. Check for stuck messages: bun scripts/check-pending-queue.ts
  3. Process if needed: bun scripts/check-pending-queue.ts --process
  4. Add recovery to your workflow (cron job, pre-shutdown script, etc.)

See Also