
Debugging Production Issues: The Node.js Playbook

For .NET engineers who know: attaching the Visual Studio debugger to a running process, capturing memory dumps with ProcDump or dotMemory, reading Application Insights live metrics, and using structured logs in Azure Monitor.

You’ll learn: how production debugging works in a Node.js stack — a fundamentally different model from the attach-and-step approach you’re used to.

Time: 15-20 min read


The .NET Way (What You Already Know)

In .NET production debugging, you have several overlapping tools. When something goes wrong, you can attach the Visual Studio debugger to the running IIS or Kestrel process, set breakpoints, and inspect live state — often without restarting. For deeper issues, you capture a memory dump with ProcDump, pull it into WinDbg or Visual Studio, and examine heap objects, thread stacks, and GC roots. Application Insights gives you distributed traces, dependency call timing, live metrics, and a query language (KQL) for slicing logs by request, exception type, or custom dimension.

The core assumption in .NET production debugging is that you can reconstruct program state after the fact. A memory dump is a snapshot of the entire heap. A call stack tells you exactly which thread was doing what. The CLR’s type system survives into the dump — you can ask WinDbg “show me every Order object on the heap” and it works.

That assumption does not transfer to Node.js.


The Node.js Way

Why the Debugging Model Is Different

Node.js is a single-threaded runtime executing on the V8 engine. There is no concept of “attaching to the process and pausing execution” in production — doing so would block all request handling for every user. There are no native memory dump utilities comparable to ProcDump. There is no built-in equivalent to Application Insights that ships with the runtime.

What Node.js production debugging gives you instead:

  1. Structured logs — your primary source of truth for what happened
  2. Error tracking with stack traces — Sentry captures exceptions with source-mapped stacks
  3. Heap snapshots — V8’s heap profiler, accessed via --inspect or programmatic APIs
  4. CPU profiles — V8’s sampling profiler for identifying hot code paths
  5. Process metrics — memory usage, event loop lag, handle counts

The mental shift: in .NET, you debug by reconstructing state. In Node.js, you debug by reading what the code logged before it broke. This is closer to how you approach distributed systems debugging — trace IDs, log correlation, structured events — than traditional debugger usage.

Sentry: Your Application Insights Equivalent

Sentry is the primary error tracking tool. Install it early; retrofitting it later is painful.

pnpm add @sentry/node @sentry/profiling-node

// src/instrument.ts — load this FIRST, before any other imports
import * as Sentry from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.RENDER_GIT_COMMIT, // Render sets this automatically
  integrations: [
    nodeProfilingIntegration(),
  ],
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  profilesSampleRate: 1.0,
});
// src/main.ts — instrument.ts must be the very first import
import './instrument';
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(process.env.PORT ?? 3000);
}
bootstrap();

In NestJS, add the Sentry exception filter to capture unhandled errors:

// src/filters/sentry-exception.filter.ts
import { Catch, ArgumentsHost, HttpException } from '@nestjs/common';
import { BaseExceptionFilter } from '@nestjs/core';
import * as Sentry from '@sentry/node';

@Catch()
export class SentryExceptionFilter extends BaseExceptionFilter {
  catch(exception: unknown, host: ArgumentsHost) {
    // Only capture unexpected errors; don't report 4xx HTTP exceptions
    if (!(exception instanceof HttpException)) {
      Sentry.captureException(exception);
    }
    super.catch(exception, host);
  }
}
// src/main.ts
import { HttpAdapterHost } from '@nestjs/core';
import { SentryExceptionFilter } from './filters/sentry-exception.filter';

const { httpAdapter } = app.get(HttpAdapterHost);
app.useGlobalFilters(new SentryExceptionFilter(httpAdapter));

Sentry Error Triage Workflow

When an alert fires or a user reports an error:

  1. Open the Sentry issue. The issue page groups all occurrences of the same error by stack trace fingerprint — equivalent to Application Insights’ “exceptions by problem ID.”

  2. Read the breadcrumbs. Sentry automatically records HTTP requests, console logs, and database queries in the 60 seconds before the error. This is your timeline of what the process was doing.

  3. Check the source-mapped stack trace. If source maps are configured correctly, you see TypeScript line numbers, not compiled JavaScript. If you see minified code (at e (index.js:1:4823)), source maps are not uploaded — fix this before debugging further.

  4. Check the user context and tags. Sentry shows which user triggered the error, what request they made, and what environment variables your code attached. Set meaningful context early:

// In your auth middleware or guard
Sentry.setUser({ id: user.id, email: user.email });
Sentry.setTag('tenant', user.tenantId);
  5. Check for similar events. The “Events” tab shows every occurrence. Look for patterns: does it happen only for specific users, specific inputs, or specific times of day?

  6. Mark it as “In Progress” and assign it before you start fixing it.

Source Maps in Production

Without source maps, Sentry shows minified stack traces that are unreadable. There are two approaches.

Option A: Upload source maps during CI/CD (recommended)

# Install the Sentry CLI
pnpm add -D @sentry/cli

# In your CI/CD pipeline, after building:
npx sentry-cli releases new "$RENDER_GIT_COMMIT"
npx sentry-cli releases files "$RENDER_GIT_COMMIT" upload-sourcemaps ./dist
npx sentry-cli releases finalize "$RENDER_GIT_COMMIT"

Add this to your GitHub Actions workflow, not your production startup. Source maps contain your full source code — never expose them to the public.

Option B: Sentry Vite/webpack plugin (build-time upload)

// vite.config.ts (for Vite-based frontends)
import { sentryVitePlugin } from '@sentry/vite-plugin';

export default defineConfig({
  plugins: [
    sentryVitePlugin({
      authToken: process.env.SENTRY_AUTH_TOKEN,
      org: 'your-org',
      project: 'your-project',
    }),
  ],
  build: {
    sourcemap: true, // required
  },
});

For NestJS (built with tsc or webpack), configure source maps in tsconfig.json:

{
  "compilerOptions": {
    "sourceMap": true,
    "inlineSources": true
  }
}

Reading Node.js Stack Traces

A Node.js stack trace lists frames innermost-first, the same convention as CLR stack traces. The error type and message appear first, then frames from innermost to outermost:

TypeError: Cannot read properties of undefined (reading 'id')
    at OrderService.getOrderTotal (order.service.ts:47:24)
    at OrdersController.getTotal (orders.controller.ts:23:38)
    at Layer.handle [as handle_request] (express/lib/router/layer.js:95:5)
    at next (express/lib/router/route.js:137:13)
    ...

Frame format: at [function] ([file]:[line]:[column]). The frames without recognizable file names (express/lib/router/layer.js) are framework internals — equivalent to ASP.NET Core’s middleware pipeline frames that you learn to skip past.

The key frames are the first ones with your application code. In the example above: order.service.ts:47 is where the error originated. orders.controller.ts:23 is what called it. Everything below those two frames is the framework plumbing.

Async stack traces are the Node.js-specific challenge. In synchronous .NET code, the call stack is linear. In async Node.js code, awaited calls resume from the microtask queue, which discards the original synchronous call stack. Node.js 12+ reconstructs async stack traces by default (V8’s zero-cost async stack traces), but only across await boundaries, and the output can be verbose:

at async OrderService.getOrderTotal (order.service.ts:47:24)
at async OrdersController.getTotal (orders.controller.ts:23:38)

If you see a stack trace that terminates at the async boundary and doesn’t show what called the async function, the caller did not await properly — a classic Node.js footgun.
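To see this concretely, here is a minimal runnable sketch (function names are hypothetical). The throw happens after an await, so the synchronous stack no longer contains the caller; on Node.js 12+, V8 re-attaches the awaited caller as an `at async handleRequest` frame, while a fire-and-forget call would drop it:

```typescript
// Hypothetical names for illustration: queryOrders throws after crossing an
// async boundary, so its synchronous stack no longer includes the caller.
async function queryOrders(): Promise<never> {
  await Promise.resolve(); // resume from the microtask queue
  throw new Error('query failed');
}

async function handleRequest(): Promise<string> {
  try {
    // Awaited call: V8 can reconstruct "at async handleRequest" in the trace.
    await queryOrders();
    return 'unreachable';
  } catch (err) {
    return (err as Error).stack ?? '';
  }
}

handleRequest().then((stack) => console.log(stack));
```

Run this with and without the `await` on `queryOrders()` and compare the printed traces: the unawaited variant terminates at the async boundary.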

Structured Log Analysis

Use a structured logger from the start. pino is the standard in NestJS projects.

pnpm add pino pino-pretty nestjs-pino

// src/app.module.ts
import { LoggerModule } from 'nestjs-pino';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: process.env.NODE_ENV === 'production' ? 'info' : 'debug',
        transport: process.env.NODE_ENV !== 'production'
          ? { target: 'pino-pretty' }
          : undefined, // In production, output raw JSON for log aggregators
        serializers: {
          req: (req) => ({ method: req.method, url: req.url, id: req.id }),
          res: (res) => ({ statusCode: res.statusCode }),
        },
      },
    }),
  ],
})
export class AppModule {}

Production logs output as JSON, one object per line. Each log entry has at minimum:

  • level — numeric level (10=trace, 20=debug, 30=info, 40=warn, 50=error, 60=fatal)
  • time — Unix timestamp in milliseconds
  • msg — the log message
  • pid — process ID
  • Any custom fields you added

When triaging an incident, correlate logs using request IDs:

// Log meaningful context alongside the message
this.logger.log({
  msg: 'Order created',
  orderId: order.id,
  userId: user.id,
  durationMs: Date.now() - startTime,
});
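The idea behind request-ID correlation can be shown without any framework. A dependency-free sketch (in a real NestJS app, `nestjs-pino` attaches the request ID for you):

```typescript
import { randomUUID } from 'crypto';

// Every log line produced for one request carries the same requestId field,
// so a log aggregator can reassemble the request's full timeline.
type LogFields = Record<string, unknown>;

function makeRequestLogger(): (fields: LogFields) => string {
  const requestId = randomUUID(); // one ID per request
  return (fields) => JSON.stringify({ requestId, time: Date.now(), ...fields });
}

const log = makeRequestLogger();
const first = JSON.parse(log({ msg: 'Order created', orderId: 'ord_123' }));
const second = JSON.parse(log({ msg: 'Confirmation email queued' }));
console.log(first.requestId === second.requestId); // true: same request, same ID
```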

Render Log Viewer

Render, the hosting platform assumed throughout this playbook, provides a web interface for live and historical logs. Key features:

  • Live tail — equivalent to kubectl logs -f or Azure’s Log Stream
  • Filter by text — basic grep-style filtering; for complex queries, export to your log aggregator
  • Log retention — Render retains logs for 7 days on free plans, longer on paid plans

For production incidents, use the Render log viewer to:

  1. Confirm the error is occurring (not a Sentry reporting issue)
  2. Correlate timestamps with deployment events
  3. Check for patterns before the error: unusual traffic volume, repeated retry storms

For serious analysis beyond basic text search, pipe logs to a proper aggregator. Render supports log drains (forwarding to Datadog, Papertrail, etc.). For simpler setups, the Render CLI can tail and export logs:

render logs --service YOUR_SERVICE_ID --tail

Memory Leak Detection

The symptoms: Node.js process memory grows steadily over hours or days. The application becomes slower, then starts throwing out-of-memory errors, or the hosting platform restarts it.

The cause: In Node.js, the most common memory leaks are:

  1. Closures holding references to large objects longer than expected
  2. Event listeners added but never removed
  3. Caches with no eviction policy
  4. Database connection pools not being released
  5. Global state accumulating data over time
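Cause 3 is the easiest to prevent up front. A minimal sketch of a size-capped cache (FIFO-style eviction; `BoundedCache` is illustrative, not a library class), exploiting the fact that `Map` preserves insertion order:

```typescript
// An unbounded cache grows forever; capping it keeps heap usage flat.
// Map iterates keys in insertion order, so the first key is the oldest.
class BoundedCache<K, V> {
  private store = new Map<K, V>();
  constructor(private maxEntries: number) {}

  set(key: K, value: V): void {
    if (this.store.has(key)) this.store.delete(key); // refresh position
    this.store.set(key, value);
    if (this.store.size > this.maxEntries) {
      const oldest = this.store.keys().next().value as K;
      this.store.delete(oldest); // evict the oldest entry
    }
  }

  get(key: K): V | undefined {
    return this.store.get(key);
  }

  get size(): number {
    return this.store.size;
  }
}

const cache = new BoundedCache<string, string>(2);
cache.set('a', '1');
cache.set('b', '2');
cache.set('c', '3'); // evicts 'a'
console.log(cache.size);     // 2
console.log(cache.get('a')); // undefined
```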

Detection step 1: Monitor memory trend

Node.js exposes process.memoryUsage():

// Scheduled health check or metrics endpoint
const usage = process.memoryUsage();
this.logger.log({
  msg: 'Memory usage',
  heapUsed: Math.round(usage.heapUsed / 1024 / 1024) + 'MB',
  heapTotal: Math.round(usage.heapTotal / 1024 / 1024) + 'MB',
  rss: Math.round(usage.rss / 1024 / 1024) + 'MB',
  external: Math.round(usage.external / 1024 / 1024) + 'MB',
});

If heapUsed grows linearly over time without plateauing, you have a leak.
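“Grows linearly without plateauing” can be turned into a crude automated check. A sketch (the window size and strictly-increasing heuristic are arbitrary choices, not an established algorithm):

```typescript
// Keep a sliding window of heapUsed samples and flag a leak suspicion when
// every sample in the window is strictly higher than the one before it.
function makeTrendDetector(windowSize: number) {
  const samples: number[] = [];
  return (heapUsed: number): boolean => {
    samples.push(heapUsed);
    if (samples.length > windowSize) samples.shift();
    if (samples.length < windowSize) return false; // not enough data yet
    return samples.every((v, i) => i === 0 || v > samples[i - 1]);
  };
}

const suspectLeak = makeTrendDetector(3);
console.log(suspectLeak(100)); // false: not enough samples
console.log(suspectLeak(110)); // false
console.log(suspectLeak(120)); // true: three strictly increasing samples
```

In practice, feed this from the same scheduled job that logs `process.memoryUsage()` and alert rather than act automatically; GC makes individual samples noisy.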

Detection step 2: Heap snapshot in staging

Never run --inspect in production (it opens a debugging port, is a security risk, and has performance overhead). Use staging:

# Start Node.js with the inspector open
node --inspect dist/main.js

# Or with NestJS:
pnpm start:debug

Open Chrome DevTools (chrome://inspect), connect to the Node.js process, go to the Memory tab, and take heap snapshots. Compare two snapshots taken minutes apart. The “Comparison” view shows which object types grew between snapshots — the same workflow as dotMemory’s “Object Sets” comparison.

Detection step 3: Programmatic heap snapshot

import { writeHeapSnapshot } from 'v8';
import { existsSync, mkdirSync } from 'fs';

// Trigger via HTTP endpoint (only in staging, never in production)
@Get('debug/heap-snapshot')
async takeHeapSnapshot() {
  if (process.env.NODE_ENV === 'production') {
    throw new ForbiddenException();
  }
  const dir = '/tmp/heapdumps';
  if (!existsSync(dir)) mkdirSync(dir, { recursive: true });
  // writeHeapSnapshot expects a file path, not a bare directory
  const filename = writeHeapSnapshot(`${dir}/${Date.now()}.heapsnapshot`);
  return { filename };
}

Open the .heapsnapshot file in Chrome DevTools Memory tab.

CPU Profiling

If the application is slow without obvious memory growth, CPU profiling identifies hot code paths. The approach is similar to dotTrace’s sampling profiler.

import { Session } from 'inspector';
import { writeFileSync } from 'fs';

// Staging-only endpoint
@Get('debug/cpu-profile')
async takeCpuProfile(@Query('durationMs') durationMs = '5000') {
  if (process.env.NODE_ENV === 'production') {
    throw new ForbiddenException();
  }
  const session = new Session();
  session.connect();

  await new Promise<void>((resolve) => {
    session.post('Profiler.enable', () => {
      session.post('Profiler.start', () => {
        setTimeout(() => {
          session.post('Profiler.stop', (err, { profile }) => {
            writeFileSync('/tmp/cpu-profile.cpuprofile', JSON.stringify(profile));
            resolve();
          });
        }, parseInt(durationMs));
      });
    });
  });

  return { filename: '/tmp/cpu-profile.cpuprofile' };
}

Load the .cpuprofile file in Chrome DevTools Performance tab. The flame chart shows which functions consumed the most CPU time. Functions wider in the flame chart are hotter.

Common Node.js Production Issues

Event loop blocking

This is the Node.js-specific issue with no direct .NET equivalent. If your code runs a synchronous operation that takes more than a few milliseconds, all other requests queue behind it. Common culprits:

  • JSON.parse() on very large payloads (megabytes)
  • Synchronous crypto operations
  • fs.readFileSync() in request handlers
  • Complex regular expressions on large strings (ReDoS)
  • Tight loops over large arrays

Detect event loop lag:

import { monitorEventLoopDelay } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

setInterval(() => {
  const lagMs = histogram.mean / 1e6; // Convert nanoseconds to milliseconds
  if (lagMs > 100) {
    this.logger.warn({ msg: 'Event loop lag detected', lagMs });
  }
  histogram.reset();
}, 10_000);

Unhandled promise rejections

In Node.js, a promise rejection with no handler attached crashes the process on Node.js 15+ and is silently swallowed (after a console warning) on older versions. Both behaviors are bad.

// In main.ts — catch what everything else misses
process.on('unhandledRejection', (reason, promise) => {
  logger.error({ msg: 'Unhandled promise rejection', reason });
  Sentry.captureException(reason);
  // Don't crash immediately — give Sentry time to flush
  setTimeout(() => process.exit(1), 1000);
});

process.on('uncaughtException', (error) => {
  logger.fatal({ msg: 'Uncaught exception', error });
  Sentry.captureException(error);
  setTimeout(() => process.exit(1), 1000);
});

Memory leaks from closures

// Leak: each request registers an event listener that holds a closure
// referencing the response object — it is never removed
app.get('/events', (req, res) => {
  const handler = (data) => res.write(data); // closure over res
  emitter.on('data', handler);
  // Bug: emitter.off('data', handler) is never called when req closes
});

// Fix: clean up listeners when the connection closes
app.get('/events', (req, res) => {
  const handler = (data) => res.write(data);
  emitter.on('data', handler);
  req.on('close', () => emitter.off('data', handler)); // cleanup
});

Connection pool exhaustion

Prisma, pg, and most database clients maintain a connection pool. If queries hold connections longer than the timeout, new queries queue indefinitely.

// Check pool status in health endpoint
@Get('health')
async health() {
  const result = await this.prisma.$queryRaw`SELECT 1 as alive`;
  return { status: 'ok', db: result };
}

If the health check times out but the process is otherwise running, suspect pool exhaustion. Check for long-running transactions or queries whose promises are never awaited — with most clients the query still executes and holds a connection until it completes.
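One way to make exhaustion visible is to bound the probe with a timeout, so it surfaces as a clear error instead of a hanging health endpoint. A minimal sketch (`queryDb` is a hypothetical stand-in for the real database call; `withTimeout` is not a library function):

```typescript
// Race the probe against a timer: whichever settles first wins.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error(`db probe timed out after ${ms}ms`)), ms),
    ),
  ]);
}

// Hypothetical probe that resolves quickly while the pool is healthy.
const queryDb = () =>
  new Promise<string>((resolve) => setTimeout(() => resolve('ok'), 10));

withTimeout(queryDb(), 1_000)
  .then((result) => console.log(result)) // 'ok' while the pool is healthy
  .catch((err) => console.error((err as Error).message));
```

Note that `Promise.race` does not cancel the losing probe; it only stops you from waiting on it.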


Key Differences

.NET Approach | Node.js Approach
--- | ---
Attach debugger to live process | Read logs and Sentry; use staging for interactive debug
Memory dump + WinDbg/dotMemory | Heap snapshot via V8 inspector in staging
Thread-per-request model | Single-threaded; event loop lag is the unique failure mode
Application Insights built-in | Sentry + structured logs (pino) assembled manually
CLR type information in dumps | Heap snapshots show object shapes but not TypeScript types
Thread stacks show concurrent work | Single thread; “concurrent” issues are async sequencing bugs
ConfigureAwait(false) matters | No thread switching; all code runs on the event loop thread
Background threads for CPU work | CPU work blocks the event loop; offload to worker threads or external processes

Gotchas for .NET Engineers

Gotcha 1: Source maps must be explicitly configured or Sentry is useless. In .NET, Application Insights records stack traces against your PDB symbols automatically. In Node.js, Sentry receives minified JavaScript stack traces. Until you configure source map upload in your CI/CD pipeline, every Sentry error will show at e (index.js:1:4823). Set this up before your first deployment, not after the first production incident.

Gotcha 2: --inspect in production is a security hole. The Node.js --inspect flag opens a WebSocket debugging port. If that port is exposed — even accidentally — an attacker can execute arbitrary code in your process. Never pass --inspect to production Node.js. Use it only locally or in isolated staging environments with network-level access controls. If you see --inspect in a production Dockerfile or a hosting platform’s start command, treat it as a critical security finding.

Gotcha 3: Synchronous operations in async code block every user. A .NET developer’s instinct is that synchronous work only blocks the current thread. In Node.js, synchronous work blocks the event loop, which is the only thread handling all requests. JSON.parse() on a 10MB payload, fs.readFileSync() in a route handler, or a CPU-intensive loop in a service method will stall every in-flight request until it completes. The fix is to either move the work to a Worker thread (Node.js built-in) or break it into smaller async chunks.
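A sketch of the “smaller async chunks” option (the chunk size is an arbitrary tuning knob): yield back to the event loop every N iterations so queued requests can interleave with the CPU work.

```typescript
// Sum a large array without monopolizing the event loop: after each chunk,
// setImmediate defers the next chunk behind any pending I/O callbacks.
async function sumLarge(values: number[], chunkSize = 100_000): Promise<number> {
  let total = 0;
  for (let i = 0; i < values.length; i++) {
    total += values[i];
    if (i % chunkSize === chunkSize - 1) {
      await new Promise<void>((resolve) => setImmediate(resolve)); // yield
    }
  }
  return total;
}

sumLarge(Array.from({ length: 1_000_000 }, (_, i) => i % 10)).then((total) => {
  console.log(total); // 4500000
});
```

The trade-off: total wall-clock time for the computation goes up slightly, but no single request stalls behind it.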

Gotcha 4: Unhandled promise rejections are silent data loss in older code. In pre-Node.js 15 codebases, a rejected promise without a .catch() handler silently discards the error. If you see code like someAsyncFunction() without await or .catch(), and that code can reject, you have silent failures. Always await async calls in request handlers. Always add process.on('unhandledRejection') as a safety net.

Gotcha 5: Memory profiling requires staging access, not production. In .NET, you can often take a memory dump from a production server (pausing one process on a multi-instance deployment). In Node.js on Render, you do not have direct process access. Plan your debugging workflow before you need it: have a staging environment with --inspect accessible, have a process for reproducing production traffic in staging, and have heap snapshot endpoints (protected) deployed in staging at all times.


Hands-On Exercise

Set up the full observability stack for a NestJS application and trigger a real error through each layer.

Part 1: Sentry integration

  1. Create a free Sentry account at sentry.io, create a Node.js project, and copy the DSN.
  2. Install @sentry/node and initialize it as the first import in main.ts.
  3. Add the SentryExceptionFilter as shown above.
  4. Throw a deliberate error in a route handler: throw new Error('Test Sentry integration').
  5. Call the endpoint. Verify the error appears in Sentry with a readable stack trace.
  6. If the stack trace shows minified code, configure sourceMap: true in tsconfig.json and redeploy.

Part 2: Structured logging

  1. Install pino and nestjs-pino.
  2. Configure LoggerModule with JSON output for production and pretty output for development.
  3. Replace all console.log calls in one service with this.logger.log({ msg: '...', ...context }).
  4. Run the application and verify log output is structured JSON.

Part 3: Memory monitoring

  1. Add a /health endpoint that returns process.memoryUsage() formatted as MB.
  2. Add a memory warning log that fires when heapUsed exceeds 80% of heapTotal.
  3. Write a test that calls the endpoint repeatedly in a loop and watches for memory growth.

Part 4: Event loop monitoring

  1. Add monitorEventLoopDelay from perf_hooks as shown above.
  2. Write a route that performs a CPU-intensive synchronous operation (e.g., summing 10 million numbers in a loop).
  3. Call that route while also calling a fast route in parallel. Observe that the fast route is delayed.
  4. Log the event loop lag before and after the slow call.
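Steps 2 and 3 can be previewed without a web server; the same mechanism is visible with a timer (the loop bound is an arbitrary choice for illustration):

```typescript
// A zero-delay timer scheduled BEFORE the synchronous loop cannot fire until
// the loop finishes: the event loop is blocked for the loop's entire duration.
function busySum(n: number): number {
  let total = 0;
  for (let i = 0; i < n; i++) total += i;
  return total;
}

const scheduled = Date.now();
setTimeout(() => {
  console.log(`timer fired ${Date.now() - scheduled}ms late`);
}, 0);

const total = busySum(10_000_000); // runs to completion before the timer
console.log(total); // 49999995000000
```

In the exercise, the fast route plays the role of the timer: it is ready to run but queued behind the synchronous work.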

Quick Reference

Observability Stack

Need | Tool | .NET Equivalent
--- | --- | ---
Error tracking | Sentry | Application Insights Exceptions
Structured logs | pino + nestjs-pino | Serilog + Application Insights
Log viewer (hosted) | Render log viewer | Azure Log Stream
Heap profiling | V8 inspector (Chrome DevTools) | dotMemory / WinDbg
CPU profiling | V8 profiler (.cpuprofile) | dotTrace
Event loop lag | perf_hooks.monitorEventLoopDelay | No direct equivalent
Process metrics | process.memoryUsage() | Azure Monitor / Kusto

Incident Response Runbook

1. CHECK SENTRY
   - Is there a new error group with rising occurrences?
   - Is the stack trace readable (source maps working)?
   - What are the breadcrumbs showing?

2. CHECK RENDER LOGS
   - Is the process restarting? (look for process exit logs)
   - Is there a burst of errors at a specific timestamp?
   - Does the error correlate with a deployment?

3. CHECK MEMORY
   GET /health -> heapUsed
   If growing: take heap snapshot in staging, compare object counts

4. CHECK EVENT LOOP
   - Is response latency elevated across ALL endpoints? (event loop blocking)
   - Is it only specific endpoints? (slow query, external API)

5. ROLLBACK IF NEEDED
   Render dashboard -> Deploys -> Rollback to previous deploy
   This takes ~60 seconds; do it immediately if a deploy caused the issue

6. DOCUMENT
   - What was the error?
   - What was the root cause?
   - What was the fix?
   - Add to runbook if it could recur

Common Production Issues and Fixes

Symptom | Likely Cause | Investigation
--- | --- | ---
All endpoints slow simultaneously | Event loop blocking | Check event loop lag metric; look for sync operations in hot paths
Memory grows without bound | Closure leak / listener leak | Heap snapshot comparison in staging
Process crashes with OOM | Memory leak reached limit | Heap snapshot before crash; check Render restart logs
UnhandledPromiseRejection in logs | Missing await somewhere | Search codebase for async calls without await; add global handler
500 errors on specific endpoint | Uncaught exception in handler | Check Sentry for that endpoint; verify error handling
DB queries timing out | Pool exhaustion | Check for missing await on Prisma; look for long transactions
Sentry shows minified traces | Source maps not uploaded | Configure source map upload in CI; verify sourceMap: true in tsconfig

Key Commands

# Run with inspector open (staging only)
node --inspect dist/main.js

# Take a heap snapshot to a file programmatically
node -e "const v8 = require('v8'); v8.writeHeapSnapshot();"

# Check Render logs via CLI
render logs --service SERVICE_ID --tail

# Trigger a heap dump from a running process
kill -USR2 <PID>  # works only if a SIGUSR2 handler (e.g. the heapdump package) is registered

Sentry Configuration Checklist

  • SENTRY_DSN set in environment variables
  • instrument.ts imported before all other imports in main.ts
  • SentryExceptionFilter registered globally
  • Source maps configured (sourceMap: true, inlineSources: true in tsconfig)
  • Source maps uploaded in CI/CD pipeline (not at runtime)
  • release set to git commit SHA for version tracking
  • environment set to production / staging / development
  • tracesSampleRate reduced in production (0.05–0.1 for high-traffic services)

Further Reading