ADR-0008 · Observability: Powertools + X-Ray + structured logging + alarms

  • Status: Accepted
  • Date: 2026-04-30
  • Deciders: Cristian Fernández (Zerviz Group)
  • Related: ADR-0001, ADR-0002, docs/discovery/00-phase1-prerequisites.md R-09 + capability gap #13/#14.

Context

The legacy platform has zero structured observability (R-09). The decision was already settled: Powertools + X-Ray from the first Lambda onward. This ADR formalizes the full stack.

Decision

Stack

  • AWS Lambda Powertools for TypeScript v2.x as a mandatory dependency of services/* and packages/*, shipped either as the shared Layer ze-{env}-powertools-layer or inside the container image.
  • AWS X-Ray active tracing ON for all Lambdas and APIGW.
  • CloudWatch Logs with mandatory retention per env: dev=14d, qa=30d, prod=90d.
  • CloudWatch Logs Insights + saved queries per service.
  • Custom CloudWatch Metrics + Powertools metrics (EMF embedded in logs).
  • CloudWatch Alarms routed to PagerDuty (prod) and Slack #ze-alerts (qa+dev).
  • CloudWatch RUM, optional, for apps/builder-spa and apps/widget-spa.
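
The per-environment values in this list can live in one typed map so that infrastructure code and services read the same numbers. This is a minimal sketch: `Env`, `EnvSettings`, `ENV_SETTINGS`, `isEnv` and `settingsFor` are hypothetical names, and only the retention periods and alarm destinations come from this ADR.

```typescript
// Hypothetical typed view of the per-environment settings listed above.
type Env = "dev" | "qa" | "prod";

interface EnvSettings {
  logRetentionDays: number; // CloudWatch Logs retention
  alarmDestination: "pagerduty" | "slack";
}

const ENV_SETTINGS: Record<Env, EnvSettings> = {
  dev: { logRetentionDays: 14, alarmDestination: "slack" },
  qa: { logRetentionDays: 30, alarmDestination: "slack" },
  prod: { logRetentionDays: 90, alarmDestination: "pagerduty" },
};

// Type guard: narrows an arbitrary string to a known environment name.
function isEnv(value: string): value is Env {
  return value === "dev" || value === "qa" || value === "prod";
}

// Rejects unknown environment names instead of silently falling back to a default.
function settingsFor(env: string): EnvSettings {
  if (!isEnv(env)) {
    throw new Error(`unknown env: ${env}`);
  }
  return ENV_SETTINGS[env];
}
```

Failing on an unknown name is a deliberate choice in this sketch: a typo in ENV should not quietly produce a Lambda with dev retention in prod.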

Logger contract (packages/observability)

import { Logger } from '@aws-lambda-powertools/logger';

export const logger = new Logger({
  serviceName: process.env.SERVICE_NAME,    // e.g. 'flow-engine'
  logLevel: process.env.LOG_LEVEL || 'INFO',
  persistentLogAttributes: {
    env: process.env.ENV,                    // dev|qa|prod
    region: process.env.AWS_REGION,
  },
});

// Tenant-aware middleware (ctx is the per-request context resolved upstream):
logger.addPersistentLogAttributes({
  tenant_id: ctx.tenantId,
  correlation_id: ctx.correlationId,
});
  • Mandatory JSON fields in every log: timestamp, level, message, service, tenant_id, correlation_id, request_id (Lambda), xray_trace_id.
  • Direct console.log is forbidden; a custom no-console lint rule enforces this, with an explicit whitelist for exceptions.
  • Sanitization: a Powertools LogFormatter subclass redacts the keys password, token, secret, authorization, cookie, apikey (case-insensitive).
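
The redaction rule can be sketched as a pure function that a custom formatter would call on the log attributes before serializing them. This is an illustration under assumptions, not the packages/observability implementation: `redact`, `SENSITIVE_KEYS` and the `[REDACTED]` placeholder are hypothetical, and wiring it into a LogFormatter is left out.

```typescript
// Keys whose values must never reach the logs (matched case-insensitively).
const SENSITIVE_KEYS = new Set([
  "password", "token", "secret", "authorization", "cookie", "apikey",
]);

const REDACTED = "[REDACTED]"; // placeholder value is an assumption

// Returns a deep copy of `value` with sensitive keys masked at any nesting depth.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map((item) => redact(item));
  }
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? REDACTED : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

The match is on the exact key name, so a key such as access_token would pass through untouched; whether the contract should also cover substrings is a decision this sketch does not make.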

Tracer

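import { Tracer } from '@aws-lambda-powertools/tracer';
import { captureLambdaHandler } from '@aws-lambda-powertools/tracer/middleware';
import middy from '@middy/core'; // assumes Middy is available to function handlers

export const tracer = new Tracer({ serviceName: process.env.SERVICE_NAME });

// In each handler. tracer.captureLambdaHandler() is a class-method decorator,
// so function handlers are wrapped through the Middy middleware instead:
export const handler = middy(baseHandler).use(captureLambdaHandler(tracer));

// In strategy clients (the instrumented client is returned):
const tracedS3 = tracer.captureAWSv3Client(s3Client);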
  • Mandatory annotations: tenant_id, service, connector_id (when applicable).
  • One subsegment per external integration (Meta, 360, Five9, OpenAI).
  • Sampling: 100% in dev/qa, 10% in prod (custom X-Ray sampling rule).
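
The sampling policy maps onto the fields of an X-Ray sampling rule. The sketch below builds one rule per environment; the field names follow the X-Ray SamplingRule structure, while the rule name, priority and reservoir size are placeholder assumptions rather than values fixed by this ADR.

```typescript
type Env = "dev" | "qa" | "prod";

// Mirror of the fields X-Ray expects when creating a sampling rule.
interface SamplingRule {
  RuleName: string;
  Priority: number;      // lower number is evaluated first
  FixedRate: number;     // fraction sampled once the reservoir is exhausted
  ReservoirSize: number; // requests per second sampled before FixedRate applies
  ServiceName: string;
  ServiceType: string;
  Host: string;
  HTTPMethod: string;
  URLPath: string;
  ResourceARN: string;
  Version: number;
}

function samplingRuleFor(env: Env): SamplingRule {
  return {
    RuleName: `ze-${env}-default`, // hypothetical name
    Priority: 100,                 // assumption
    FixedRate: env === "prod" ? 0.1 : 1.0, // 10% in prod, 100% in dev/qa
    ReservoirSize: 1,              // assumption
    ServiceName: "*",
    ServiceType: "*",
    Host: "*",
    HTTPMethod: "*",
    URLPath: "*",
    ResourceARN: "*",
    Version: 1,
  };
}
```

Because the reservoir is sampled in addition to the fixed rate, the effective rate in prod sits slightly above 10% at low traffic.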

Metrics (EMF)

import { Metrics, MetricUnit } from '@aws-lambda-powertools/metrics';
export const metrics = new Metrics({
  namespace: 'ZEngine',
  serviceName: process.env.SERVICE_NAME,
  defaultDimensions: { Env: process.env.ENV },
});

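// Per request (ctx is the per-request context):
metrics.addDimension('tenant_id', ctx.tenantId);
metrics.addMetric('MessageReceived', MetricUnit.Count, 1);
metrics.publishStoredMetrics(); // flush the buffered EMF record before the handler returns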
  • Standard metrics per service (auto-emitted):
      • LambdaInvocations, LambdaErrors, LambdaDuration (CloudWatch native).
      • BusinessEventsPublished per DetailType.
      • ConnectorLatencyMs (p50/p95/p99) per strategy.
      • BreakerOpenCount, BreakerHalfOpenTrials per connector.
      • DLQDepth per queue.
Alarms baseline (Terraform module infra/modules/observability)

| Alarm | Threshold | Severity | Destination |
|---|---|---|---|
| <service>-error-rate | Errors / Invocations > 1% for 5 min | High | PagerDuty (prod), Slack |
| <service>-p99-latency | Duration p99 > 3 s for 10 min | Medium | Slack |
| dlq-depth | ApproximateNumberOfMessagesVisible > 0 for 5 min | High | PagerDuty (prod) |
| breaker-open | BreakerOpenCount > 0 for 1 min | High | PagerDuty (prod) |
| cognito-throttle | ThrottledRequests > 10/min | Medium | Slack |
| rds-cpu | CPUUtilization > 80% for 15 min | Medium | Slack |
| rds-connections | > 80% of max_connections for 5 min | High | PagerDuty |
| kms-key-usage-spike | > 10× baseline | Medium | Slack |
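
The error-rate threshold is a ratio of two metrics, so the evaluation needs a rule for periods with no traffic. The following sketch shows the comparison in isolation; `errorRateBreached` is a hypothetical helper, and treating zero invocations as "not breached" is an assumption rather than something this ADR specifies.

```typescript
// True when errors / invocations exceeds thresholdPct percent.
// The comparison is cross-multiplied to avoid dividing by zero and to keep
// the boundary case (exactly 1%) free of floating-point rounding.
function errorRateBreached(
  errors: number,
  invocations: number,
  thresholdPct: number = 1,
): boolean {
  if (invocations <= 0) {
    return false; // no traffic in the period: assumed not to alarm
  }
  return errors * 100 > thresholdPct * invocations;
}
```

The comparison is strict, matching the ">" in the table: 1 error in 100 invocations does not breach, 2 in 100 does.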

Dashboards

  • ze-{env}-platform: global view (RPS, latencies, errors, DLQ).
  • ze-{env}-tenant-<id> (auto-generated for tenants on the physical tier): per-tenant view.
  • ze-{env}-connectors: health of strategies + breakers.

Audit log transversal

  • An EventBridge rule captures all events and sends them to Firehose → S3 ze-{env}-data/audit/yyyy/mm/dd/.
  • Retention: 7 years (forward-looking compliance: ISO + 19.628).
  • Object Lock in compliance mode is mandatory.
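
To make the audit layout concrete, the date-partitioned location can be expressed as a small helper, for example when a query or lifecycle tool needs to find one day of events. In practice Firehose writes the prefix itself; `auditBucket` and `auditObjectPrefix` are hypothetical names, and using UTC for the partition date is an assumption.

```typescript
// Bucket that holds the audit stream for one environment.
function auditBucket(env: string): string {
  return `ze-${env}-data`;
}

// Date-partitioned prefix under which one day of audit events is stored.
// The partition date is taken in UTC (assumption).
function auditObjectPrefix(date: Date): string {
  const yyyy = String(date.getUTCFullYear()).padStart(4, "0");
  const mm = String(date.getUTCMonth() + 1).padStart(2, "0"); // months are 0-based
  const dd = String(date.getUTCDate()).padStart(2, "0");
  return `audit/${yyyy}/${mm}/${dd}/`;
}

// Events from the day this ADR was accepted.
const prefix = auditObjectPrefix(new Date(Date.UTC(2026, 3, 30)));
// prefix === "audit/2026/04/30/"
```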

Consequences

Positive

  • Multi-tenant logs are filterable by tenant_id.
  • The X-Ray service map eliminates mystery debugging.
  • Baseline alarms ⇒ on-call is useful from day 1.

Negative

  • Powertools adds ~5–10 ms to cold start. Acceptable.
  • 100% trace sampling in dev/qa can strain the budget; mitigated with a sampling rule.

Alternatives considered

  • Datadog / New Relic: rejected, cost + lock-in.
  • OpenTelemetry-only: evaluated; to be adopted once AWS ships OTel for Lambda with performance parity. Until then, Powertools.
  • Unstructured logs: rejected (that is the current problem).

Tenant-isolation impact

  • tenant_id in every log/trace/metric ⇒ per-tenant filters and dashboards.
  • Per-tenant retention is configurable (lifecycle by prefix).

Blast radius

  • Powertools layer unavailable ⇒ Lambdas fail at import. Mitigation: layer version pinned per env, no auto-update.

Cost note

  • CloudWatch Logs: ~USD 0.50/GB ingested. Estimated USD 5/month in dev; USD 100–500/month in prod.
  • X-Ray: USD 5 per million traces. 10% sampling in prod ⇒ < USD 50/month.
  • CloudWatch Alarms: USD 0.10 per alarm per month. ~30 baseline alarms.
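
The arithmetic behind these figures can be checked with a small estimator. The unit prices are the ones quoted in this note, with the alarm price read as a monthly charge; the usage numbers in the example are illustrative assumptions, not measurements, and log storage and Logs Insights queries are left out.

```typescript
// Unit prices quoted in this ADR (USD).
const PRICES = {
  logsIngestPerGb: 0.5,
  xrayPerMillionTraces: 5,
  alarmPerMonth: 0.1,
};

interface MonthlyUsage {
  logsGb: number;            // GB ingested into CloudWatch Logs
  requests: number;          // total requests eligible for tracing
  traceSamplingRate: number; // 0.1 for the 10% prod rule
  alarms: number;
}

function estimateMonthlyCostUsd(u: MonthlyUsage) {
  const logs = u.logsGb * PRICES.logsIngestPerGb;
  const sampledTraces = u.requests * u.traceSamplingRate;
  const xray = (sampledTraces / 1e6) * PRICES.xrayPerMillionTraces;
  const alarms = u.alarms * PRICES.alarmPerMonth;
  return { logs, xray, alarms, total: logs + xray + alarms };
}

// Illustrative prod month: 200 GB of logs, 50M requests at 10% sampling, 30 alarms.
const prodEstimate = estimateMonthlyCostUsd({
  logsGb: 200,
  requests: 50e6,
  traceSamplingRate: 0.1,
  alarms: 30,
});
// Roughly USD 100 for logs, USD 25 for X-Ray and USD 3 for alarms.
```

Read the other way round, the "< USD 50/month" X-Ray bound corresponds to fewer than 10 million sampled traces, or under 100 million requests per month at 10% sampling.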

ISO 27001 controls touched

  • A.12.4.1 (event logging): structured logs.
  • A.12.4.2 (protection of log information): KMS + Object Lock.
  • A.12.4.3 (administrator and operator logs): audit Firehose stream.
  • A.16.1.4 (assessment of and decision on information security events): alarms + on-call.

Sources

  • docs/discovery/00-phase1-prerequisites.md R-09 (Powertools+X-Ray check).
  • AWS docs: Lambda Powertools v2, X-Ray, EMF.