ADR-0008 · Observability: Powertools + X-Ray + structured logging + alarms
- Status: Accepted
- Date: 2026-04-30
- Deciders: Cristian Fernández (Zerviz Group)
- Related: ADR-0001, ADR-0002, docs/discovery/00-phase1-prerequisites.md R-09 + capability gap #13/#14.
Context
The legacy system has zero structured observability (R-09). The decision was already settled: Powertools + X-Ray from the first Lambda onward. This ADR formalizes the full set.
Decision
Stack
- AWS Lambda Powertools for TypeScript v2.x as a mandatory dependency of services/* and packages/*. Imported as the shared Layer ze-{env}-powertools-layer or via container image.
- AWS X-Ray active tracing ON for all Lambdas and API Gateway.
- CloudWatch Logs with mandatory retention per env: dev=14d, qa=30d, prod=90d.
- CloudWatch Logs Insights + saved queries per service.
- CloudWatch Metrics custom + Powertools metrics (EMF embedded in logs).
- CloudWatch Alarms with destinations: PagerDuty (prod), Slack #ze-alerts (qa+dev).
- CloudWatch RUM optional for apps/builder-spa and apps/widget-spa.
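The per-environment values in the list above could be captured in one typed lookup so that infrastructure code and services read them from a single place. This is a minimal sketch; the names Env, LOG_RETENTION_DAYS and powertoolsLayerName are hypothetical and not defined by this ADR.

```typescript
// Hypothetical helper: identifiers are illustrative, not part of the ADR.
type Env = 'dev' | 'qa' | 'prod';

// CloudWatch Logs retention in days, as listed in the Stack section.
const LOG_RETENTION_DAYS: Record<Env, number> = { dev: 14, qa: 30, prod: 90 };

// Shared layer name following the ze-{env}-powertools-layer pattern.
function powertoolsLayerName(env: Env): string {
  return `ze-${env}-powertools-layer`;
}
```

Typing the key as a closed union means a missing environment fails at compile time instead of silently falling back to a default retention.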
Logger contract (packages/observability)
import { Logger } from '@aws-lambda-powertools/logger';

export const logger = new Logger({
  serviceName: process.env.SERVICE_NAME, // e.g. 'flow-engine'
  logLevel: process.env.LOG_LEVEL || 'INFO',
  persistentLogAttributes: {
    env: process.env.ENV, // dev|qa|prod
    region: process.env.AWS_REGION,
  },
});

// Tenant-aware middleware: once the request context (ctx) is resolved,
// attach the tenant and correlation IDs to every subsequent log line.
logger.addPersistentLogAttributes({
  tenant_id: ctx.tenantId,
  correlation_id: ctx.correlationId,
});
- Mandatory JSON fields in every log: timestamp, level, message, service, tenant_id, correlation_id, request_id (Lambda), xray_trace_id.
- Direct console.log is prohibited; a custom no-console lint rule enforces this, except for an explicit whitelist.
- Sanitization: the Powertools LogFormatter redacts the keys password, token, secret, authorization, cookie, apikey (case-insensitive).
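The sanitization rule above can be sketched as a standalone function that a log formatter might call before serializing attributes. This is an illustration only: the function name redact, the '[REDACTED]' placeholder and the exact-match policy are assumptions, and the sketch deliberately does not depend on the Powertools API.

```typescript
// Keys to redact, taken from the sanitization rule above.
const SENSITIVE_KEYS = new Set(['password', 'token', 'secret', 'authorization', 'cookie', 'apikey']);

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Recursively replaces the value of any sensitive key with a placeholder.
// Assumption: keys match exactly after lower-casing; the ADR does not say
// whether partial matches (e.g. "access_token") must also be redacted.
function redact(value: Json): Json {
  if (Array.isArray(value)) {
    return value.map(redact);
  }
  if (value !== null && typeof value === 'object') {
    const out: { [key: string]: Json } = {};
    for (const [key, inner] of Object.entries(value)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : redact(inner);
    }
    return out;
  }
  return value; // primitives pass through unchanged
}
```

The function returns a new structure rather than mutating its input, so the caller's original objects are left untouched.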
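Tracer
import { Tracer } from '@aws-lambda-powertools/tracer';
import { captureLambdaHandler } from '@aws-lambda-powertools/tracer/middleware';
import middy from '@middy/core';

export const tracer = new Tracer({ serviceName: process.env.SERVICE_NAME });

// In every handler: captureLambdaHandler is applied as Middy middleware (assumed here)
// or as the class-method decorator @tracer.captureLambdaHandler().
// lambdaHandler stands for the service's own handler function.
export const handler = middy(lambdaHandler).use(captureLambdaHandler(tracer));

// In strategy clients:
tracer.captureAWSv3Client(s3Client);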
- Mandatory annotations: tenant_id, service, connector_id (when applicable).
- Subsegments per external integration (Meta, 360, Five9, OpenAI).
- Sampling: 100% in dev/qa, 10% in prod (custom rule in X-Ray).
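The sampling policy above could be expressed as an X-Ray sampling rule generated per environment. The sketch below builds an object with the field names of the X-Ray SamplingRule structure; the rule name, priority and reservoir size are assumptions chosen for illustration, not values from this ADR.

```typescript
type Env = 'dev' | 'qa' | 'prod';

// Field names follow the X-Ray SamplingRule structure.
interface SamplingRule {
  RuleName: string;
  Priority: number;
  FixedRate: number;     // fraction of requests sampled beyond the reservoir
  ReservoirSize: number; // requests per second always sampled
  ServiceName: string;
  ServiceType: string;
  Host: string;
  HTTPMethod: string;
  URLPath: string;
  ResourceARN: string;
  Version: number;
}

// 100% sampling in dev/qa, 10% in prod, matching the policy above.
function samplingRuleFor(env: Env): SamplingRule {
  return {
    RuleName: `ze-${env}-default`, // hypothetical name
    Priority: 100,                 // assumption
    FixedRate: env === 'prod' ? 0.1 : 1.0,
    ReservoirSize: 1,              // assumption
    ServiceName: '*',
    ServiceType: '*',
    Host: '*',
    HTTPMethod: '*',
    URLPath: '*',
    ResourceARN: '*',
    Version: 1,
  };
}
```

The wildcard matchers apply the rule to every service in the account; a narrower rule per service would only change the ServiceName field.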
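Metrics (EMF)
import { Metrics, MetricUnit } from '@aws-lambda-powertools/metrics';

export const metrics = new Metrics({
  namespace: 'ZEngine',
  serviceName: process.env.SERVICE_NAME,
  defaultDimensions: { Env: process.env.ENV },
});

// Per invocation, once the tenant context (ctx) is resolved:
metrics.addDimension('tenant_id', ctx.tenantId);
metrics.addMetric('MessageReceived', MetricUnit.Count, 1);
// Flush as EMF; not needed if the logMetrics middleware or decorator already does it.
metrics.publishStoredMetrics();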
- Standard metrics per service (auto-emitted):
  - LambdaInvocations, LambdaErrors, LambdaDuration (CloudWatch native).
  - BusinessEventsPublished by DetailType.
  - ConnectorLatencyMs (p50/p95/p99) per strategy.
  - BreakerOpenCount, BreakerHalfOpenTrials per connector.
  - DLQDepth per queue.
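For readers unfamiliar with EMF, the sketch below builds by hand the kind of JSON document that CloudWatch extracts metrics from when it appears in a log line. It follows the public Embedded Metric Format layout (the _aws metadata block plus top-level dimension and metric values). The function name buildEmf and its parameters are hypothetical, and Powertools produces this document itself, so the code is explanatory rather than something services would ship.

```typescript
interface EmfMetric {
  name: string;
  unit: 'Count' | 'Milliseconds';
  value: number;
}

// Builds a single-metric EMF document for one dimension set.
function buildEmf(
  namespace: string,
  dimensions: Record<string, string>,
  metric: EmfMetric,
  timestampMs: number,
): Record<string, unknown> {
  return {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          // One dimension set containing every dimension name.
          Dimensions: [Object.keys(dimensions)],
          Metrics: [{ Name: metric.name, Unit: metric.unit }],
        },
      ],
    },
    // Dimension values and the metric value live at the top level.
    ...dimensions,
    [metric.name]: metric.value,
  };
}
```

Each distinct combination of dimension values becomes a separate CloudWatch metric, which is worth keeping in mind when a high-cardinality value such as tenant_id is used as a dimension.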
Alarms baseline (Terraform module infra/modules/observability)
| Alarm | Threshold | Severity | Destination |
|---|---|---|---|
| <service>-error-rate | Errors / Invocations > 1% for 5 min | High | PagerDuty (prod), Slack |
| <service>-p99-latency | Duration p99 > 3s for 10 min | Medium | Slack |
| dlq-depth | ApproximateNumberOfMessagesVisible > 0 for 5 min | High | PagerDuty (prod) |
| breaker-open | BreakerOpenCount > 0 for 1 min | High | PagerDuty (prod) |
| cognito-throttle | ThrottledRequests > 10/min | Medium | Slack |
| rds-cpu | CPUUtilization > 80% for 15 min | Medium | Slack |
| rds-connections | > 80% of max_connections for 5 min | High | PagerDuty |
| kms-key-usage-spike | > 10× baseline | Medium | Slack |
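The error-rate row combines two metrics, so its condition is a ratio rather than a raw threshold. A minimal sketch of that evaluation for one window follows; the function name errorRateBreached and the zero-traffic behavior are assumptions, and the real alarm would be defined in the Terraform module rather than in application code.

```typescript
// Returns true when errors exceed thresholdPct percent of invocations
// within the evaluation window (5 minutes in the table above).
function errorRateBreached(errors: number, invocations: number, thresholdPct = 1): boolean {
  // Assumption: a window with no traffic is treated as healthy.
  if (invocations === 0) {
    return false;
  }
  // Cross-multiplied form of errors / invocations > thresholdPct / 100,
  // which avoids floating-point division at the boundary.
  return errors * 100 > thresholdPct * invocations;
}
```

With the default threshold, 2 errors in 100 invocations breaches the alarm, while exactly 1 in 100 does not, because the condition is strictly greater than 1%.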
Dashboards
- ze-{env}-platform: global view (RPS, latencies, errors, DLQ).
- ze-{env}-tenant-<id> (auto-generated for the physical tier): per-tenant view.
- ze-{env}-connectors: health of strategies + breakers.
Cross-cutting audit log
- An EventBridge rule captures all events and delivers them via Firehose → S3 ze-{env}-data/audit/yyyy/mm/dd/.
- Retention: 7 years (forward-looking compliance, ISO + 19.628).
- Object Lock in compliance mode is mandatory.
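The date-partitioned path and the 7-year retention above can be illustrated with two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, dates are computed in UTC, and in practice Firehose would generate the prefix itself while Object Lock would enforce the retain-until date.

```typescript
// Left-pads a month or day number to two digits.
function pad2(n: number): string {
  return (n < 10 ? '0' : '') + String(n);
}

// Builds the ze-{env}-data/audit/yyyy/mm/dd/ location for a given instant (UTC assumed).
function auditPrefix(env: string, when: Date): string {
  const yyyy = when.getUTCFullYear();
  const mm = pad2(when.getUTCMonth() + 1); // getUTCMonth is zero-based
  const dd = pad2(when.getUTCDate());
  return `ze-${env}-data/audit/${yyyy}/${mm}/${dd}/`;
}

// Computes the date until which an object written at writtenAt must be retained.
function retainUntil(writtenAt: Date, years = 7): Date {
  const until = new Date(writtenAt.getTime()); // copy, do not mutate the input
  until.setUTCFullYear(until.getUTCFullYear() + years);
  return until;
}
```

An event written on 2026-04-30 would land under ze-prod-data/audit/2026/04/30/ in prod and stay locked until 2033-04-30.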
Consequences
Positive
- Multi-tenant logs filterable by tenant_id.
- The X-Ray service map eliminates mystery debugging.
- Alarms baseline ⇒ useful on-call from day 1.
Negative
- Powertools adds ~5–10 ms to cold start. Acceptable.
- 100% trace sampling in dev/qa can strain the budget; mitigated with a sampling rule.
Alternatives considered
- Datadog / New Relic: rejected, cost + lock-in.
- OpenTelemetry-only: evaluated; adopt once AWS ships OTel for Lambda with performance parity; Powertools in the meantime.
- Unstructured logs: rejected (this is the current problem).
Tenant-isolation impact
- tenant_id in every log/trace/metric ⇒ per-tenant filters and dashboards.
- Per-tenant retention is configurable (lifecycle per prefix).
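Because every log line carries tenant_id, a per-tenant saved query in CloudWatch Logs Insights reduces to a filter on that field. The sketch below builds such a query string; the function name tenantLogsQuery, the selected fields and the allowed tenant-ID character set are assumptions for illustration.

```typescript
// Builds a Logs Insights query returning the latest log lines for one tenant.
function tenantLogsQuery(tenantId: string, limit = 50): string {
  // Assumption: tenant IDs use only letters, digits, '_' and '-'.
  // Rejecting anything else keeps quotes out of the generated query.
  if (!/^[A-Za-z0-9_-]+$/.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return [
    'fields @timestamp, level, service, message, correlation_id',
    `| filter tenant_id = "${tenantId}"`,
    '| sort @timestamp desc',
    `| limit ${limit}`,
  ].join('\n');
}
```

The same string could be stored as a saved query per service or passed to the StartQuery API when investigating a single tenant.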
Blast radius
- Powertools layer unavailable ⇒ Lambdas fail on import. Mitigation: layer version pinned per env, no auto-update.
Cost note
- CloudWatch Logs: ~ USD 0.50/GB ingested. Estimated dev USD 5/month; prod USD 100–500/month.
- X-Ray: USD 5 per million traces. 10% sampling in prod ⇒ < USD 50/month.
- CloudWatch Alarms: USD 0.10/alarm. ~30 baseline alarms.
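The figures above can be checked with simple arithmetic. The sketch below uses only the unit prices quoted in this section; the function names are hypothetical, and free tiers, storage and query costs are ignored as a simplifying assumption.

```typescript
// X-Ray: price per million recorded traces, applied to the sampled share of requests.
function xrayMonthlyUsd(requestsPerMonth: number, samplingRate: number, usdPerMillion = 5): number {
  return ((requestsPerMonth * samplingRate) / 1e6) * usdPerMillion;
}

// CloudWatch Logs: ingestion price per GB.
function logsIngestMonthlyUsd(gbIngested: number, usdPerGb = 0.5): number {
  return gbIngested * usdPerGb;
}

// CloudWatch Alarms: flat price per alarm.
function alarmsMonthlyUsd(alarmCount: number, usdPerAlarm = 0.1): number {
  return alarmCount * usdPerAlarm;
}
```

Under these prices, the "< USD 50/month" X-Ray estimate corresponds to fewer than 100 million prod requests per month at 10% sampling, the dev logs estimate of USD 5 corresponds to about 10 GB ingested, and 30 baseline alarms cost about USD 3.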
ISO 27001 controls touched
- A.12.4.1 (event logging): structured logs.
- A.12.4.2 (protection of log information): KMS + Object Lock.
- A.12.4.3 (administrator and operator logs): audit Firehose stream.
- A.16.1.4 (assessment of and decision on information security events): alarms + on-call.
Sources
- docs/discovery/00-phase1-prerequisites.md R-09 (Powertools+X-Ray check).
- AWS docs: Lambda Powertools v2, X-Ray, EMF.