Health Check System

Comprehensive health monitoring for Promenade Platform dependencies - PostgreSQL, Redis, and Event Bus.

Overview

The health check system provides 3-level monitoring (healthy, degraded, unhealthy) with:

4 HTTP endpoints for monitoring
5-second combined timeout for all checks
Graceful degradation for optional dependencies
Proper HTTP status codes (200 for healthy/degraded, 503 for unhealthy)

Architecture

Components

health.Checker (internal/infrastructure/health/health.go)
- Core health check logic
- Checks: Database, Redis (optional), Event Bus
- 5-second timeout for all checks combined
health.Handler (internal/infrastructure/health/handler.go)
- HTTP endpoints with Gin
- 4 routes: /health, /health/db, /health/redis, /health/bus
Integration (cmd/api/main.go)
- Dependency injection (db, redisClient, eventBus)
- Replaces old simple health endpoint

API Endpoints

1. Overall Health Check

GET /health

Returns overall system health with all dependency checks.

Response (200 OK - Healthy):

json

{
  "status": "healthy",
  "checks": {
    "database": {
      "name": "PostgreSQL",
      "status": "healthy",
      "message": "database connection ok",
      "duration_ms": 1234567,
      "timestamp": "2025-12-29T18:00:00Z"
    },
    "redis": {
      "name": "Redis",
      "status": "healthy",
      "message": "redis connection ok",
      "duration_ms": 567890,
      "timestamp": "2025-12-29T18:00:00Z"
    },
    "event_bus": {
      "name": "Event Bus",
      "status": "healthy",
      "message": "event bus operational",
      "duration_ms": 123456,
      "timestamp": "2025-12-29T18:00:00Z"
    }
  },
  "timestamp": "2025-12-29T18:00:00Z",
  "version": "1.0.0"
}

Response (200 OK - Degraded):

json

{
  "status": "degraded",
  "checks": {
    "database": {
      "status": "degraded",
      "message": "database ping ok but query failed"
    },
    "redis": { "status": "healthy" },
    "event_bus": { "status": "healthy" }
  }
}

Response (503 Service Unavailable - Unhealthy):

json

{
  "status": "unhealthy",
  "checks": {
    "database": {
      "status": "unhealthy",
      "message": "database ping failed: connection refused"
    }
  }
}
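
A consumer of this endpoint can decode the payload with plain structs. The following is a minimal sketch that assumes only the field names shown in the examples above; the type names HealthResponse and CheckResult and the helpers parseHealth and failing are hypothetical and are not taken from the platform's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// CheckResult mirrors one entry of the "checks" object shown above.
// Fields missing from a shortened response decode to their zero values.
type CheckResult struct {
	Name       string `json:"name"`
	Status     string `json:"status"`
	Message    string `json:"message"`
	DurationMS int64  `json:"duration_ms"`
	Timestamp  string `json:"timestamp"`
}

// HealthResponse mirrors the top-level body returned by GET /health.
type HealthResponse struct {
	Status    string                 `json:"status"`
	Checks    map[string]CheckResult `json:"checks"`
	Timestamp string                 `json:"timestamp"`
	Version   string                 `json:"version"`
}

// parseHealth decodes a /health response body.
func parseHealth(body []byte) (HealthResponse, error) {
	var resp HealthResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return HealthResponse{}, fmt.Errorf("decode health response: %w", err)
	}
	return resp, nil
}

// failing returns the sorted keys of all checks whose status is not "healthy".
func failing(resp HealthResponse) []string {
	var names []string
	for key, check := range resp.Checks {
		if check.Status != "healthy" {
			names = append(names, key)
		}
	}
	sort.Strings(names)
	return names
}
```

Passing the degraded example above to parseHealth and then to failing yields a single entry, "database", which is usually the first thing an operator wants to know.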

2. Database Health Check

GET /health/db

Returns PostgreSQL database health only.

Response (200 OK):

json

{
  "name": "PostgreSQL",
  "status": "healthy",
  "message": "database connection ok",
  "duration_ms": 1234567,
  "timestamp": "2025-12-29T18:00:00Z"
}

Response (503 Service Unavailable):

json

{
  "name": "PostgreSQL",
  "status": "unhealthy",
  "message": "database ping failed: connection refused"
}

3. Redis Health Check

GET /health/redis

Returns Redis health (if configured).

Response (200 OK - Configured):

json

{
  "name": "Redis",
  "status": "healthy",
  "message": "redis connection ok"
}

Response (200 OK - Not Configured):

json

{
  "name": "Redis",
  "status": "healthy",
  "message": "redis not configured (optional)"
}

4. Event Bus Health Check

GET /health/bus

Returns Event Bus health.

Response (200 OK):

json

{
  "name": "Event Bus",
  "status": "healthy",
  "message": "event bus operational"
}

Status Levels

Healthy

All dependencies are operational.

HTTP Status: 200 OK
Criteria: All checks pass
Action: No action needed

Degraded

System is operational but with issues.

HTTP Status: 200 OK
Criteria: At least one check is degraded (e.g., database ping works but query fails)
Action: Investigate warnings, monitor closely

Unhealthy

Critical dependency is down.

HTTP Status: 503 Service Unavailable
Criteria: At least one check failed completely
Action: Immediate investigation required
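
The three levels and their HTTP codes can be summarized in a few lines of Go. This is a sketch of the rules described above, not the code in health.go; the names Status, overall, and httpCode are hypothetical, and it assumes that an unhealthy check takes precedence over a degraded one.

```go
package main

import "net/http"

// Status is one of the three health levels.
type Status string

const (
	Healthy   Status = "healthy"
	Degraded  Status = "degraded"
	Unhealthy Status = "unhealthy"
)

// overall returns the worst status among the individual checks:
// any unhealthy check wins, then any degraded check, otherwise healthy.
func overall(checks ...Status) Status {
	result := Healthy
	for _, s := range checks {
		switch s {
		case Unhealthy:
			return Unhealthy
		case Degraded:
			result = Degraded
		}
	}
	return result
}

// httpCode maps a status to the response code: 503 for unhealthy,
// 200 for healthy and degraded.
func httpCode(s Status) int {
	if s == Unhealthy {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}
```

Because degraded maps to 200, a readiness probe on /health keeps routing traffic to a degraded instance; only an unhealthy result removes it from rotation.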

Usage Examples

cURL

bash

# Check overall health
curl http://localhost:8081/health

# Check database only
curl http://localhost:8081/health/db

# Check Redis only
curl http://localhost:8081/health/redis

# Check Event Bus only
curl http://localhost:8081/health/bus

Kubernetes Liveness Probe

yaml

livenessProbe:
  httpGet:
    path: /health/db
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Kubernetes Readiness Probe

yaml

readinessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 5
  failureThreshold: 2

Prometheus Monitoring

yaml

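# prometheus.yml
# Note: Prometheus parses scraped bodies as its text exposition format.
# Because /health returns JSON, a direct scrape like this reports the target
# as down unless an exporter or probe translates the response.
scrape_configs:
  - job_name: 'promenade-health'
    metrics_path: /health
    static_configs:
      - targets: ['promenade:8081']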

Implementation Details

Database Check

Ping: db.PingContext(ctx) - Basic connectivity
Query: SELECT 1 - Database can execute queries (this does not verify that it is writable)

Status:

Healthy: Both pass
Degraded: Ping passes, query fails
Unhealthy: Ping fails
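
The two-step check can be sketched with the standard library alone. The version below uses *sql.DB (which sqlx.DB embeds) and assumes the messages shown in the API examples; the names checkDatabase, classify, and errExample are hypothetical and not taken from health.go.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
)

// errExample stands in for a driver error when exercising classify.
var errExample = errors.New("connection refused")

// classify applies the rules above to the outcome of the two steps.
func classify(pingErr, queryErr error) (status, message string) {
	switch {
	case pingErr != nil:
		return "unhealthy", "database ping failed: " + pingErr.Error()
	case queryErr != nil:
		return "degraded", "database ping ok but query failed"
	default:
		return "healthy", "database connection ok"
	}
}

// checkDatabase pings the database and, only if the ping succeeds,
// runs SELECT 1 to confirm that queries execute.
func checkDatabase(ctx context.Context, db *sql.DB) (status, message string) {
	if err := db.PingContext(ctx); err != nil {
		return classify(err, nil)
	}
	var one int
	err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one)
	return classify(nil, err)
}
```

Keeping the classification separate from the I/O makes the three outcomes easy to test without a database: classify(errExample, nil) reports unhealthy, and classify(nil, errExample) reports degraded.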

Redis Check

Optional: Returns "healthy" if not configured
Ping: redis.Ping(ctx) - Connectivity check

Status:

Healthy: Ping passes or not configured
Unhealthy: Ping fails
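
The optional-dependency rule can be sketched as follows. The Pinger interface, Result type, stubPinger, and helper names are hypothetical; a real go-redis client returns a command object from Ping, so it would need a small adapter that calls Ping(ctx).Err().

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Pinger is a hypothetical one-method view of a Redis client.
type Pinger interface {
	Ping(ctx context.Context) error
}

// Result is a reduced form of the per-check response.
type Result struct {
	Name    string
	Status  string
	Message string
}

// checkRedis treats a nil client as "not configured" and reports healthy.
// Callers must pass an untyped nil: a nil *Client stored in the interface
// is not equal to nil and would be pinged.
func checkRedis(ctx context.Context, client Pinger) Result {
	if client == nil {
		return Result{Name: "Redis", Status: "healthy", Message: "redis not configured (optional)"}
	}
	if err := client.Ping(ctx); err != nil {
		return Result{Name: "Redis", Status: "unhealthy", Message: "redis ping failed: " + err.Error()}
	}
	return Result{Name: "Redis", Status: "healthy", Message: "redis connection ok"}
}

// stubPinger is a stand-in used only to exercise the sketch.
type stubPinger struct{ err error }

func (s stubPinger) Ping(ctx context.Context) error { return s.err }

// errRefused simulates a refused connection.
var errRefused = errors.New("connection refused")

// checkRedisOnce runs checkRedis under the 5-second budget.
func checkRedisOnce(client Pinger) Result {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	return checkRedis(ctx, client)
}
```

With these definitions, checkRedisOnce(nil) reports healthy with the "not configured" message, while checkRedisOnce(stubPinger{err: errRefused}) reports unhealthy.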

Event Bus Check

Health method: eventBus.Health(ctx) - Internal health check

Status:

Healthy: Health check passes
Unhealthy: Health check fails

Timeout Behavior

All checks have a 5-second combined timeout:

ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()

Behavior:

If the checks together take longer than 5 seconds, the context is cancelled and any check still pending is reported as unhealthy
Prevents hanging requests
Fast-fail for slow dependencies
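
The effect of the deadline can be illustrated with a simulated dependency. This is a sketch with hypothetical names (runCheck, sleepCheck, demo) and a 50 ms budget in place of the 5 seconds used by the checker, so that it finishes quickly.

```go
package main

import (
	"context"
	"time"
)

// runCheck runs check under a deadline and reports "unhealthy" if the
// check returns an error, including the error caused by the deadline.
func runCheck(parent context.Context, timeout time.Duration, check func(context.Context) error) string {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	if err := check(ctx); err != nil {
		return "unhealthy"
	}
	return "healthy"
}

// sleepCheck simulates a dependency that needs d to answer and that
// stops waiting as soon as the context is cancelled.
func sleepCheck(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo runs a fast and a slow simulated check under a 50 ms budget.
func demo() (fast, slow string) {
	budget := 50 * time.Millisecond
	fast = runCheck(context.Background(), budget, sleepCheck(time.Millisecond))
	slow = runCheck(context.Background(), budget, sleepCheck(time.Second))
	return fast, slow
}
```

The slow check returns after roughly 50 ms rather than a full second, which is the fast-fail behavior described above. This only works when the dependency call honors the context, as the database, Redis, and event bus calls shown in this document do by accepting ctx.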

Testing

Unit Tests

Location: internal/infrastructure/health/health_test.go (11 tests)

bash

go test ./internal/infrastructure/health -v

Tests include:

TestChecker_CheckDatabase_Healthy
TestChecker_CheckDatabase_Unhealthy_PingFailed
TestChecker_CheckDatabase_Degraded_QueryFailed
TestChecker_CheckRedis_Healthy
TestChecker_CheckRedis_Unhealthy
TestChecker_CheckRedis_NotConfigured
TestChecker_CheckEventBus_Healthy
TestChecker_CheckAll_AllHealthy
TestChecker_CheckAll_DatabaseUnhealthy
TestChecker_CheckAll_WithTimeout (verifies 5s timeout)

Handler Tests

Location: internal/infrastructure/health/handler_test.go (10 tests)

Tests include:

TestHandler_CheckAll_Healthy
TestHandler_CheckAll_Unhealthy
TestHandler_CheckAll_Degraded
TestHandler_CheckDatabase_Healthy
TestHandler_CheckDatabase_Unhealthy
TestHandler_CheckRedis
TestHandler_CheckEventBus
TestHandler_RegisterRoutes

Mock Dependencies

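// Database mock (inside a test function); MonitorPingsOption lets the
// test set expectations on Ping.
db, mock, err := sqlmock.New(sqlmock.MonitorPingsOption(true))
if err != nil {
    t.Fatalf("create sqlmock: %v", err)
}
defer db.Close()
sqlxDB := sqlx.NewDb(db, "sqlmock")

// Redis mock
redisClient, redisMock := redismock.NewClientMock()

// Event Bus: a real in-memory bus stands in for a mock
eventBus := memory.NewMemoryBus(bus.Config{
    WorkerPoolSize: 1,
    BufferSize:     10,
})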

Monitoring Integration

Grafana Dashboard

Create dashboard with panels for:

Overall status (gauge: healthy/degraded/unhealthy)
Database response time (graph)
Redis response time (graph)
Event Bus status (status history)

Alerting

Configure alerts for:

Critical: Status = unhealthy for > 1 minute
Warning: Status = degraded for > 5 minutes

Logging

Healthy results are logged at DEBUG level; degraded and unhealthy results are logged at WARN and ERROR:

[DEBUG] Health check: status=healthy db=1ms redis=2ms bus=1ms
[WARN] Health check: status=degraded db=degraded
[ERROR] Health check: status=unhealthy db=unhealthy
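
The level selection can be sketched with the standard log/slog package (Go 1.21 or later). The function names levelFor and logHealth are hypothetical, the platform's actual logger may differ, and treating an unknown status as an error is an assumption.

```go
package main

import (
	"context"
	"log/slog"
)

// levelFor picks the log level for an overall status. Any status other
// than healthy or degraded is logged as an error (an assumption).
func levelFor(status string) slog.Level {
	switch status {
	case "healthy":
		return slog.LevelDebug
	case "degraded":
		return slog.LevelWarn
	default:
		return slog.LevelError
	}
}

// logHealth emits one record per health check at the level chosen by levelFor.
func logHealth(ctx context.Context, logger *slog.Logger, status string) {
	logger.Log(ctx, levelFor(status), "Health check", "status", status)
}
```

With a logger whose minimum level is INFO, healthy results are dropped and only degraded and unhealthy results reach the log.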

Configuration

Main Application

Location: cmd/api/main.go

// Initialize health checker
healthChecker := health.NewChecker(db, redisClient, eventBus, cfg.App.Version)
healthHandler := health.NewHandler(healthChecker)

// Register routes
healthHandler.RegisterRoutes(r)

Dependencies

PostgreSQL: Required (sqlx.DB)
Redis: Optional (nil if not configured)
Event Bus: Required (bus.IBus)
Version: App version from config

Graceful Degradation

Redis Not Configured

If Redis is not configured (nil), health check returns:

json

{
  "name": "Redis",
  "status": "healthy",
  "message": "redis not configured (optional)"
}

Behavior:

System continues to operate
Overall status not affected
Token revocation disabled (graceful fallback)

Best Practices

DO

Monitor /health endpoint - Set up alerts for unhealthy status
Use readiness probes - Prevent traffic to unhealthy instances
Check individual endpoints - Debug specific dependency issues
Set timeouts - Prevent hanging health checks
Log health changes - Track status transitions

DON'T

DON'T poll too frequently - Adds load, use 5-10 second intervals
DON'T expose to public - Health endpoints should be internal
DON'T treat degraded as unhealthy - System still operational
DON'T skip Redis check - Optional but important for full picture

Troubleshooting

Database Unhealthy

json

{
  "status": "unhealthy",
  "message": "database ping failed: connection refused"
}

Possible causes:

Database down
Connection pool exhausted
Network issues
Wrong credentials

Actions:

Check PostgreSQL is running: docker ps | grep postgres
Verify connection string in config
Check database logs
Test connection: psql -h localhost -U system -d promenade_dev

Database Degraded

json

{
  "status": "degraded",
  "message": "database ping ok but query failed"
}

Possible causes:

Read-only mode
Disk full
Permissions issue

Actions:

Check database mode: SHOW transaction_read_only;
Check disk space: df -h
Verify user permissions

Redis Unhealthy

json

{
  "status": "unhealthy",
  "message": "redis ping failed: connection refused"
}

Possible causes:

Redis down
Network issues
Wrong address/port

Actions:

Check Redis is running: docker ps | grep redis
Verify Redis address in config
Test connection: redis-cli -h localhost -p 6379 ping

Related Documentation

Main README - Project overview
Configuration Guide - App configuration

Last Updated: December 29, 2025
Status: Production-ready
Test Coverage: 21 tests, 100% passing
Maintainer: Promenade Team
