Python as the Middle Tier: AI/ML/NLP Backends with a TypeScript Frontend

For .NET engineers who know: ASP.NET Core Minimal APIs, async/await, dependency injection, and strongly typed API contracts
You’ll learn: When Python is the pragmatic backend choice for AI/ML workloads, how to build a type-safe bridge between FastAPI and a TypeScript frontend, and how to stream LLM responses in real time
Time: 25-30 min read


The .NET Way (What You Already Know)

When you build a standard backend in .NET, the full stack is coherent: C# types define the domain model, EF Core maps them to the database, the ASP.NET Core pipeline handles auth and middleware, and Swashbuckle generates an OpenAPI spec. The compiler enforces the contract between every layer.

For standard CRUD and business logic, this is an excellent setup. But when your product requires machine learning inference, LLM orchestration, or NLP pipelines, you run into a wall that no amount of C# skill resolves: the AI/ML ecosystem is in Python, and it is not moving.

PyTorch, TensorFlow, Hugging Face Transformers, LangChain, LlamaIndex, scikit-learn, spaCy, NumPy, pandas, and the entire vector search ecosystem (Pinecone, Weaviate, pgvector clients) have their canonical implementations in Python. The .NET equivalents are either thin wrappers, significantly behind the Python versions, or simply absent.

When your product needs ML inference, you do not write an ONNX wrapper in C# to avoid learning Python. You pick up FastAPI, which is — as you will see — closer to ASP.NET Core Minimal APIs than it is to anything alien, and you build a type-safe bridge between it and your TypeScript frontend.


The Architecture

graph TD
    Browser["Browser"]

    subgraph FE["Next.js / Nuxt (Vercel / Render)"]
        SC["Server Components\n(static data, SEO content)"]
        CC["Client Components\n(streaming chat, interactive UI)"]
    end

    subgraph PY["FastAPI (Python)\nAI / ML / NLP endpoints"]
        HF["Hugging Face Models"]
        LC["LangChain / RAG"]
        VS["Vector Search"]
        NLP["spaCy / NLTK"]
    end

    subgraph TS["NestJS or ASP.NET Core\nStandard CRUD endpoints"]
        EF["EF Core / Prisma"]
        BL["Business Logic\nAuth / Billing"]
    end

    subgraph DATA["Data Layer"]
        PG["PostgreSQL + pgvector"]
        PIN["Pinecone / Weaviate"]
        HFH["Hugging Face Hub"]
        OAI["OpenAI / Anthropic APIs"]
    end

    Browser -->|HTTPS| FE
    SC -->|"Generated TS types\nZod validation"| PY
    CC -->|"SSE / fetch streaming"| PY
    SC --> TS
    CC --> TS
    PY --> PG
    PY --> PIN
    PY --> HFH
    PY --> OAI
    TS --> PG

The frontend speaks to two backends:

  • FastAPI handles everything AI-related: inference, embeddings, vector search, LLM orchestration, streaming responses.
  • NestJS or ASP.NET Core handles standard CRUD: users, billing, settings, content management — anything that fits a relational model and does not need ML.

Both backends expose OpenAPI specifications. Your TypeScript frontend generates types from both and talks to each directly.


Why Python? The Honest Technical Case

A .NET engineer deserves a straightforward answer, not marketing.

Python IS the right choice for:

  • ML model inference: PyTorch and TensorFlow are written in C++, with Python bindings as the primary interface. Hugging Face’s transformers library has thousands of pretrained models with three-line inference code (sketched just after this list). The ONNX Runtime has a .NET SDK, but the Hugging Face model hub is Python-native — the gap in available models is enormous.

  • LLM orchestration: LangChain, LlamaIndex, and DSPy are Python-first. They have Node.js ports, but those ports lag behind the Python versions by months and lack many features. If you are building RAG pipelines, AI agents, or multi-model chains, you want the Python versions.

  • Vector search and embeddings: Generating embeddings, indexing them in pgvector or Pinecone, and performing semantic search is a first-class operation in Python. Every vector database has a mature Python client. The .NET clients exist but are often community-maintained.

  • Data science APIs: If your product surfaces ML-derived analytics — clustering, anomaly detection, recommendation scores — Python’s scientific stack (NumPy, pandas, scikit-learn) is the right tool. Implementing these algorithms in C# is possible but there is no ecosystem equivalent.

  • NLP pipelines: spaCy for entity recognition, NLTK for text preprocessing, Sentence Transformers for semantic similarity — these have no serious .NET equivalents.
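
The “three-line inference code” mentioned above, as a concrete sketch (the model name is the transformers library’s default SST-2 sentiment checkpoint; any sentiment model from the Hub works the same way):

from transformers import pipeline

# Downloads the model on first use, then serves from the local cache
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("FastAPI feels surprisingly familiar coming from Minimal APIs."))
# [{'label': 'POSITIVE', 'score': 0.99...}]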

Python is NOT the right choice for:

  • Standard CRUD: FastAPI can do CRUD just as well as ASP.NET Core, but there is no reason to prefer it for record-level database operations. Use your existing .NET API or NestJS.

  • High-concurrency real-time systems: Python’s GIL (Global Interpreter Lock) is a real architectural constraint for CPU-bound concurrency. More on this below.

  • Complex business logic with deep domain models: Python’s type system is opt-in and structural. For complex domains with invariants, C#’s compiler-enforced type system catches more bugs. Python works, but you trade away compile-time guarantees.

  • Teams with no Python experience: FastAPI is approachable, but if your team has zero Python exposure and your use case does not specifically require the ML ecosystem, use NestJS. Learning Python and the ML ecosystem simultaneously while shipping product is challenging.


FastAPI vs. ASP.NET Core: The Mental Model Bridge

FastAPI is the closest thing Python has to ASP.NET Core Minimal APIs. If you can read Minimal API code, you can read FastAPI code within hours.

ASP.NET Core Minimal API:

// Program.cs
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddScoped<IProductService, ProductService>();
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();

var app = builder.Build();

app.MapGet("/api/products/{id}", async (
    int id,
    IProductService service,
    CancellationToken ct) =>
{
    var product = await service.GetAsync(id, ct);
    return product is not null ? Results.Ok(product) : Results.NotFound();
});

app.Run();

FastAPI equivalent:

# main.py
from fastapi import FastAPI, Depends, HTTPException
from contextlib import asynccontextmanager
from services.product_service import ProductService
from models.product import ProductDto  # illustrative module path; the Pydantic DTO itself is shown in the next section

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup (equivalent to builder.Services.AddScoped, etc.)
    yield
    # Shutdown

app = FastAPI(lifespan=lifespan)

def get_product_service() -> ProductService:
    return ProductService()  # DI is manual or via a library like dependency-injector

@app.get("/api/products/{product_id}", response_model=ProductDto)
async def get_product(
    product_id: int,
    service: ProductService = Depends(get_product_service)
):
    product = await service.get(product_id)
    if not product:
        raise HTTPException(status_code=404, detail="Product not found")
    return product

The structure is nearly identical. Route registration with path parameters, dependency injection, async handlers, automatic OpenAPI generation. The differences are syntax, not architecture.
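
A structural note before the DTO comparison: FastAPI has no service container, but Depends combined with a generator function covers the common AddScoped pattern (create per request, clean up afterwards). A minimal sketch, with Session and its methods as stand-ins for a real database session:

from collections.abc import AsyncIterator
from fastapi import Depends, FastAPI

app = FastAPI()

class Session:
    async def fetch_product(self, product_id: int) -> dict:
        return {"id": product_id, "name": "demo"}

    async def close(self) -> None:
        pass

async def get_session() -> AsyncIterator[Session]:
    session = Session()        # created per request, like a scoped service
    try:
        yield session          # injected into the handler below
    finally:
        await session.close()  # runs after the response is sent, like Dispose()

@app.get("/api/products/{product_id}")
async def get_product(product_id: int, session: Session = Depends(get_session)):
    return await session.fetch_product(product_id)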

Pydantic vs. C# DTOs:

Pydantic models are the Python equivalent of C# record types with Data Annotations — they define the shape of data and validate it at instantiation:

# Pydantic — Python
from pydantic import BaseModel, Field, field_validator
from datetime import datetime
from enum import Enum

class ProductStatus(str, Enum):
    active = "active"
    discontinued = "discontinued"
    out_of_stock = "out_of_stock"

class ProductDto(BaseModel):
    id: int
    name: str = Field(min_length=1, max_length=200)
    price: float = Field(gt=0)
    stock_count: int = Field(ge=0)
    status: ProductStatus
    created_at: datetime

    @field_validator("name")
    @classmethod
    def name_must_not_be_empty_after_strip(cls, v: str) -> str:
        stripped = v.strip()
        if not stripped:
            raise ValueError("Name cannot be only whitespace")
        return stripped

// Equivalent C# DTO
public enum ProductStatus { Active, Discontinued, OutOfStock }

public record ProductDto
{
    public int Id { get; init; }

    [Required, StringLength(200, MinimumLength = 1)]
    public string Name { get; init; } = string.Empty;

    [Range(double.Epsilon, double.MaxValue, ErrorMessage = "Price must be positive")]
    public decimal Price { get; init; }

    [Range(0, int.MaxValue)]
    public int StockCount { get; init; }

    public ProductStatus Status { get; init; }
    public DateTime CreatedAt { get; init; }
}

Pydantic validates at instantiation time, similar to how ASP.NET Core validates model binding before your controller action runs. FastAPI feeds incoming request bodies through the Pydantic model automatically, returning a structured 422 Unprocessable Entity error if validation fails — analogous to ASP.NET Core’s automatic 400 Bad Request with [ApiController].
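
To see that validation happens at instantiation rather than at compile time, a short sketch against the ProductDto defined above (the invalid values trip the Field constraints and the custom validator):

from pydantic import ValidationError

# Assumes the ProductDto model from the previous snippet is in scope
try:
    ProductDto(
        id=1,
        name="   ",                         # passes min_length, fails the whitespace validator
        price=-5,                           # violates Field(gt=0)
        stock_count=0,
        status="active",                    # plain string is coerced into ProductStatus
        created_at="2024-01-01T00:00:00Z",  # ISO string is parsed into datetime
    )
except ValidationError as exc:
    print(exc.errors())                     # structured error list, akin to ModelState errors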

Python async/await — similar concept, different threading model:

# Python async/await looks familiar
async def get_embedding(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

But the underlying model is different. C#’s async/await runs on a thread pool — await suspends the current method and frees a thread to do other work, and when the awaited task completes, execution resumes on a thread pool thread. Python’s asyncio event loop is single-threaded: there is one loop per process, and await yields control back to that loop, which picks up the next ready coroutine. No additional threads are involved in standard Python async code.

This is fine for I/O-bound work (HTTP calls, database queries, LLM API calls). It is a problem for CPU-bound work — running a heavy ML model inference on the async event loop blocks the entire event loop until inference completes.

The solution is to run CPU-bound work in a thread pool executor or a process pool:

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

async def run_model_inference(text: str) -> str:
    loop = asyncio.get_event_loop()
    # Run the CPU-bound inference in a thread pool
    # This yields control back to the event loop while inference runs
    result = await loop.run_in_executor(
        executor,
        lambda: model_pipeline(text)  # synchronous, CPU-heavy call
    )
    return result

The GIL — Python’s biggest web serving limitation:

The Global Interpreter Lock (GIL) prevents multiple Python threads from executing Python bytecode simultaneously. In practice:

  • For I/O-bound async work: the GIL is released during I/O waits, so it rarely matters.
  • For CPU-bound work in threads: the GIL means your threads do not actually run in parallel on multiple cores.
  • The solution for CPU-intensive workloads is multiprocessing (separate Python processes, each with its own GIL) or libraries like NumPy and PyTorch, which release the GIL during their C-level operations.

For a FastAPI service handling LLM API calls (which are network I/O), the GIL is essentially irrelevant. For a service doing real-time model inference in pure Python, you need to think about multiprocessing or model servers like Triton Inference Server.

The practical guidance: FastAPI with async I/O and background tasks handles typical AI API workloads (calling OpenAI, running Hugging Face inference, semantic search) without GIL issues. If you need to saturate multiple CPU cores with Python bytecode, that is a specialized workload that requires a different architecture.
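
When you do need several cores for pure-Python work inside an async service, the usual compromise is a process pool driven from the event loop. A minimal sketch (cpu_heavy_score is a hypothetical stand-in for any pure-Python, CPU-bound function):

import asyncio
from concurrent.futures import ProcessPoolExecutor

_process_pool = ProcessPoolExecutor(max_workers=2)

def cpu_heavy_score(text: str) -> float:
    # Pure-Python CPU work holds the GIL inside its own process,
    # so a separate process per task gives real parallelism
    return sum(len(word) for word in text.split()) / max(len(text), 1)

async def score_async(text: str) -> float:
    loop = asyncio.get_running_loop()
    # The function must be defined at module level and its arguments picklable,
    # because it is executed in another process
    return await loop.run_in_executor(_process_pool, cpu_heavy_score, text)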


Type Safety Across the Python Boundary

Step 1: FastAPI + Pydantic Generates OpenAPI Automatically

FastAPI generates an OpenAPI spec from your Pydantic models and route definitions automatically — no additional setup required:

# main.py
from fastapi import FastAPI
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum

app = FastAPI(
    title="AI API",
    version="1.0.0",
    description="ML inference and NLP endpoints"
)

class SentimentResult(BaseModel):
    text: str
    sentiment: str  # "positive" | "negative" | "neutral"
    confidence: float = Field(ge=0.0, le=1.0)
    processing_time_ms: float

class ClassifyRequest(BaseModel):
    text: str = Field(min_length=1, max_length=10_000)
    model_id: Optional[str] = None  # Override default model

@app.post("/api/classify/sentiment", response_model=SentimentResult)
async def classify_sentiment(request: ClassifyRequest) -> SentimentResult:
    ...

The spec is available at http://localhost:8000/openapi.json. Feed this to openapi-typescript exactly as you would the ASP.NET Core Swashbuckle spec:

npx openapi-typescript http://localhost:8000/openapi.json \
  --output src/lib/ai-api-types.gen.ts

Your Next.js project can consume types from two generated files simultaneously:

// src/lib/api-types.gen.ts     — from ASP.NET Core / NestJS
// src/lib/ai-api-types.gen.ts  — from FastAPI

Step 2: Pydantic ↔ Zod Translation Guide

The schemas you write on the Python side have direct equivalents in Zod on the TypeScript side. Maintaining both in sync is the key discipline.

# Python / Pydantic
from pydantic import BaseModel, Field
from typing import Optional, List
from datetime import datetime
from enum import Enum

class Sentiment(str, Enum):
    positive = "positive"
    negative = "negative"
    neutral = "neutral"

class Entity(BaseModel):
    text: str
    label: str
    start: int
    end: int
    score: float = Field(ge=0.0, le=1.0)

class AnalysisResult(BaseModel):
    id: str
    original_text: str
    sentiment: Sentiment
    entities: List[Entity]
    summary: Optional[str] = None
    processed_at: datetime
    token_count: int = Field(ge=0)

// TypeScript / Zod — mirrors the Pydantic schema
import { z } from "zod";

// str Enum with (str, Enum) -> z.enum()
const SentimentSchema = z.enum(["positive", "negative", "neutral"]);

// Nested model -> nested z.object()
const EntitySchema = z.object({
  text: z.string(),
  label: z.string(),
  start: z.number().int().nonnegative(),
  end: z.number().int().nonnegative(),
  score: z.number().min(0).max(1),
});

// Optional[str] = None -> .nullable().optional() or .nullish()
// datetime -> z.string().datetime() with transform
const AnalysisResultSchema = z.object({
  id: z.string(),
  original_text: z.string(),
  sentiment: SentimentSchema,
  entities: z.array(EntitySchema),
  summary: z.string().nullable().optional(),   // Optional[str] = None
  processed_at: z.string().datetime().transform((v) => new Date(v)),
  token_count: z.number().int().nonnegative(),
});

export type AnalysisResult = z.infer<typeof AnalysisResultSchema>;

Pydantic to Zod field mapping:

Pydantic                    Zod equivalent
str                         z.string()
int                         z.number().int()
float                       z.number()
bool                        z.boolean()
datetime                    z.string().datetime().transform(v => new Date(v))
Optional[T]                 z.T().nullable().optional()
List[T]                     z.array(z.T())
Dict[str, T]                z.record(z.T())
str Enum                    z.enum([...values])
Field(ge=0, le=1)           .min(0).max(1)
Field(min_length=1)         .min(1)
Literal["a", "b"]           z.literal("a").or(z.literal("b"))

Contract testing with schemathesis:

Schemathesis is a Python library that fuzzes your FastAPI endpoints against their own OpenAPI spec — it generates random valid and invalid inputs and verifies the responses match the declared schema:

pip install schemathesis
schemathesis run http://localhost:8000/openapi.json --checks all

Add to your Python CI:

# .github/workflows/python-api.yml
- name: Run schemathesis contract tests
  run: |
    schemathesis run http://localhost:8000/openapi.json \
      --checks all \
      --stateful=links \
      --max-examples=50
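
Schemathesis can also run inside pytest against the ASGI app directly, which avoids standing up a live server in CI. A sketch, assuming the FastAPI instance is importable from main:

# tests/test_contract.py
import schemathesis

from main import app

schema = schemathesis.from_asgi("/openapi.json", app)

@schema.parametrize()
def test_api_matches_its_own_schema(case):
    # Generates inputs from the OpenAPI spec, calls the endpoint in-process,
    # and checks the response against the declared schema
    case.call_and_validate()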

Step 3: Building a FastAPI ML Endpoint

Here is a complete FastAPI endpoint running a Hugging Face sentiment analysis model. This is the kind of code that has no clean equivalent in .NET:

# api/routes/classify.py
from fastapi import APIRouter, HTTPException, BackgroundTasks
from pydantic import BaseModel, Field
from typing import Optional
import asyncio
import time
import logging
from concurrent.futures import ThreadPoolExecutor

from transformers import pipeline
from functools import lru_cache

logger = logging.getLogger(__name__)
router = APIRouter(prefix="/api/classify", tags=["classification"])

# Thread pool for CPU-bound inference
_executor = ThreadPoolExecutor(max_workers=2)

# Models are heavy — cache them at module level
@lru_cache(maxsize=3)
def get_sentiment_pipeline(model_id: str):
    logger.info(f"Loading model: {model_id}")
    return pipeline(
        "sentiment-analysis",
        model=model_id,
        device=-1,  # -1 = CPU, 0 = first GPU
        truncation=True,
        max_length=512
    )

class SentimentRequest(BaseModel):
    text: str = Field(min_length=1, max_length=10_000)
    model_id: str = "distilbert-base-uncased-finetuned-sst-2-english"

class SentimentResponse(BaseModel):
    text: str
    label: str
    score: float = Field(ge=0.0, le=1.0)
    model_id: str
    processing_time_ms: float

@router.post("/sentiment", response_model=SentimentResponse)
async def classify_sentiment(request: SentimentRequest) -> SentimentResponse:
    start = time.perf_counter()

    pipe = get_sentiment_pipeline(request.model_id)

    # Run CPU-bound inference off the event loop
    loop = asyncio.get_event_loop()
    try:
        result = await loop.run_in_executor(
            _executor,
            lambda: pipe(request.text)[0]
        )
    except Exception as e:
        logger.error(f"Inference failed for model {request.model_id}: {e}")
        raise HTTPException(
            status_code=503,
            detail=f"Model inference failed: {str(e)}"
        )

    processing_ms = (time.perf_counter() - start) * 1000

    return SentimentResponse(
        text=request.text,
        label=result["label"].lower(),  # "POSITIVE" -> "positive"
        score=result["score"],
        model_id=request.model_id,
        processing_time_ms=processing_ms
    )

# main.py
from fastapi import FastAPI
from contextlib import asynccontextmanager
from api.routes import classify
import uvicorn

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up the default model on startup
    # (avoids cold start latency on first request)
    from api.routes.classify import get_sentiment_pipeline
    get_sentiment_pipeline("distilbert-base-uncased-finetuned-sst-2-english")
    yield
    # Cleanup on shutdown if needed

app = FastAPI(
    title="AI API",
    version="1.0.0",
    lifespan=lifespan
)

app.include_router(classify.router)

# CORS
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000", "https://yourapp.vercel.app"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Run locally:

pip install fastapi uvicorn[standard] transformers torch
uvicorn main:app --reload --port 8000

The OpenAPI spec is now at http://localhost:8000/openapi.json.
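
A quick way to exercise the endpoint from Python is an async test with httpx, the counterpart to WebApplicationFactory in the Key Differences table later in this chapter. A sketch assuming pytest-asyncio and the app importable from main (the first call loads the model, so expect it to be slow):

# tests/test_classify.py
import pytest
from httpx import ASGITransport, AsyncClient

from main import app

@pytest.mark.asyncio
async def test_sentiment_endpoint():
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        resp = await client.post(
            "/api/classify/sentiment",
            json={"text": "I love this product"},
        )
        assert resp.status_code == 200
        body = resp.json()
        assert body["label"] in {"positive", "negative"}
        assert 0.0 <= body["score"] <= 1.0

        # Empty text violates Field(min_length=1) -> FastAPI returns 422
        bad = await client.post("/api/classify/sentiment", json={"text": ""})
        assert bad.status_code == 422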


Step 4: Streaming LLM Responses with Server-Sent Events

This is the most important section for any product involving LLMs. Streaming is not optional for LLM UX — users will not wait 10 seconds staring at a spinner while a response generates. The pattern is Server-Sent Events (SSE): the server sends a stream of chunks, and the frontend renders each chunk as it arrives.

The FastAPI streaming endpoint:

# api/routes/chat.py
from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import AsyncGenerator
from openai import AsyncOpenAI
import json

router = APIRouter(prefix="/api/chat", tags=["chat"])
client = AsyncOpenAI()  # Reads OPENAI_API_KEY from env

class ChatMessage(BaseModel):
    role: str   # "user" | "assistant" | "system"
    content: str

class ChatRequest(BaseModel):
    messages: list[ChatMessage] = Field(min_length=1)
    model: str = "gpt-4o-mini"
    max_tokens: int = Field(default=1024, ge=1, le=4096)

async def generate_stream(request: ChatRequest) -> AsyncGenerator[str, None]:
    """
    Yields Server-Sent Events formatted strings.
    SSE format: 'data: <json>\n\n'
    """
    try:
        async with client.beta.chat.completions.stream(
            model=request.model,
            messages=[m.model_dump() for m in request.messages],
            max_tokens=request.max_tokens,
        ) as stream:
            async for event in stream:
                if event.type == "content.delta":
                    # Each chunk is a small piece of the response text
                    chunk_data = json.dumps({
                        "type": "delta",
                        "content": event.delta
                    })
                    yield f"data: {chunk_data}\n\n"

                elif event.type == "content.done":
                    # Signal completion with usage information
                    final_data = json.dumps({
                        "type": "done",
                        "usage": {
                            "prompt_tokens": event.parsed_completion.usage.prompt_tokens
                                if event.parsed_completion.usage else None,
                            "completion_tokens": event.parsed_completion.usage.completion_tokens
                                if event.parsed_completion.usage else None,
                        }
                    })
                    yield f"data: {final_data}\n\n"

    except Exception as e:
        error_data = json.dumps({"type": "error", "message": str(e)})
        yield f"data: {error_data}\n\n"

@router.post("/stream")
async def chat_stream(request: ChatRequest):
    return StreamingResponse(
        generate_stream(request),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",  # Disable Nginx buffering
            "Connection": "keep-alive",
        }
    )
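
Before wiring up the frontend, you can sanity-check the stream from Python with httpx’s streaming client. A minimal sketch against the local server:

# scripts/check_stream.py
import asyncio
import json

import httpx

async def main() -> None:
    payload = {"messages": [{"role": "user", "content": "Say hello in five words."}]}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream(
            "POST", "http://localhost:8000/api/chat/stream", json=payload
        ) as response:
            async for line in response.aiter_lines():
                if not line.startswith("data: "):
                    continue
                event = json.loads(line[len("data: "):])
                if event["type"] == "delta":
                    print(event["content"], end="", flush=True)
                elif event["type"] == "done":
                    print()  # newline after the streamed answer

asyncio.run(main())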

The Next.js streaming chat UI — complete implementation:

// src/components/ChatInterface.tsx
"use client";

import { useState, useRef, useCallback } from "react";

interface Message {
  role: "user" | "assistant";
  content: string;
}

interface StreamEvent {
  type: "delta" | "done" | "error";
  content?: string;
  message?: string;
  usage?: {
    prompt_tokens: number | null;
    completion_tokens: number | null;
  };
}

export function ChatInterface() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState("");
  const [isStreaming, setIsStreaming] = useState(false);
  const [error, setError] = useState<string | null>(null);
  const abortControllerRef = useRef<AbortController | null>(null);

  const sendMessage = useCallback(async () => {
    if (!input.trim() || isStreaming) return;

    const userMessage: Message = { role: "user", content: input.trim() };
    const updatedMessages = [...messages, userMessage];

    setMessages(updatedMessages);
    setInput("");
    setIsStreaming(true);
    setError(null);

    // Add empty assistant message that will be filled as chunks arrive
    setMessages((prev) => [...prev, { role: "assistant", content: "" }]);

    abortControllerRef.current = new AbortController();

    try {
      const response = await fetch(
        `${process.env.NEXT_PUBLIC_AI_API_URL}/api/chat/stream`,
        {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            messages: updatedMessages,
            model: "gpt-4o-mini",
          }),
          signal: abortControllerRef.current.signal,
        }
      );

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${await response.text()}`);
      }

      // ReadableStream for SSE processing
      const reader = response.body?.getReader();
      if (!reader) throw new Error("No response body");

      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");

        // Keep the last potentially incomplete line in the buffer
        buffer = lines.pop() ?? "";

        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;

          const jsonStr = line.slice(6).trim();
          if (!jsonStr) continue;

          try {
            const event = JSON.parse(jsonStr) as StreamEvent;

            if (event.type === "delta" && event.content) {
              // Append chunk to the last (assistant) message
              setMessages((prev) => {
                const updated = [...prev];
                const last = updated[updated.length - 1];
                if (last.role === "assistant") {
                  updated[updated.length - 1] = {
                    ...last,
                    content: last.content + event.content,
                  };
                }
                return updated;
              });
            } else if (event.type === "error") {
              setError(event.message ?? "An error occurred");
            }
          } catch {
            // Malformed JSON chunk — skip
          }
        }
      }
    } catch (err) {
      if (err instanceof Error && err.name === "AbortError") {
        // User cancelled — that is fine
      } else {
        setError(err instanceof Error ? err.message : "Connection failed");
        // Remove the empty assistant message on error
        setMessages((prev) => prev.slice(0, -1));
      }
    } finally {
      setIsStreaming(false);
      abortControllerRef.current = null;
    }
  }, [input, messages, isStreaming]);

  const cancelStream = useCallback(() => {
    abortControllerRef.current?.abort();
  }, []);

  return (
    <div className="flex flex-col h-screen max-w-2xl mx-auto p-4">
      <div className="flex-1 overflow-y-auto space-y-4 mb-4">
        {messages.map((message, i) => (
          <div
            key={i}
            className={`p-3 rounded-lg ${
              message.role === "user"
                ? "bg-blue-100 ml-8"
                : "bg-gray-100 mr-8"
            }`}
          >
            <div className="text-xs text-gray-500 mb-1 font-medium">
              {message.role === "user" ? "You" : "Assistant"}
            </div>
            <div className="whitespace-pre-wrap">
              {message.content}
              {/* Blinking cursor on the last message while streaming */}
              {isStreaming && i === messages.length - 1 && (
                <span className="inline-block w-2 h-4 ml-0.5 bg-gray-700 animate-pulse" />
              )}
            </div>
          </div>
        ))}
        {error && (
          <div className="p-3 bg-red-50 border border-red-200 rounded-lg text-red-700 text-sm">
            {error}
          </div>
        )}
      </div>

      <div className="flex gap-2">
        <textarea
          value={input}
          onChange={(e) => setInput(e.target.value)}
          onKeyDown={(e) => {
            if (e.key === "Enter" && !e.shiftKey) {
              e.preventDefault();
              sendMessage();
            }
          }}
          placeholder="Type a message..."
          disabled={isStreaming}
          className="flex-1 border rounded-lg p-2 resize-none"
          rows={2}
        />
        {isStreaming ? (
          <button
            onClick={cancelStream}
            className="px-4 py-2 bg-red-500 text-white rounded-lg"
          >
            Stop
          </button>
        ) : (
          <button
            onClick={sendMessage}
            disabled={!input.trim()}
            className="px-4 py-2 bg-blue-500 text-white rounded-lg disabled:opacity-50"
          >
            Send
          </button>
        )}
      </div>
    </div>
  );
}

Step 5: A Complete RAG (Retrieval-Augmented Generation) Endpoint

To illustrate how the AI stack fits together, here is a FastAPI endpoint implementing a simple RAG pipeline — the pattern behind most AI-powered search and Q&A products:

# api/routes/rag.py
from fastapi import APIRouter, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field
from typing import AsyncGenerator
from openai import AsyncOpenAI
import asyncio
import json
import numpy as np

# PostgreSQL driver; pgvector is queried through raw SQL below
import psycopg2

router = APIRouter(prefix="/api/rag", tags=["rag"])
client = AsyncOpenAI()

class RAGRequest(BaseModel):
    question: str = Field(min_length=1, max_length=2000)
    collection: str = "documents"
    top_k: int = Field(default=5, ge=1, le=20)

class SourceDocument(BaseModel):
    id: str
    title: str
    excerpt: str
    score: float

async def get_embedding(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def vector_search(
    embedding: list[float],
    collection: str,
    top_k: int
) -> list[SourceDocument]:
    # PostgreSQL with pgvector extension
    conn = psycopg2.connect(...)  # connection pool in production
    cursor = conn.cursor()
    cursor.execute(
        """
        SELECT id, title, content,
               1 - (embedding <=> %s::vector) as similarity
        FROM documents
        WHERE collection = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (embedding, collection, embedding, top_k)
    )
    rows = cursor.fetchall()
    return [
        SourceDocument(
            id=str(row[0]),
            title=row[1],
            excerpt=row[2][:500],  # First 500 chars as excerpt
            score=float(row[3])
        )
        for row in rows
    ]

async def generate_rag_stream(
    question: str,
    sources: list[SourceDocument]
) -> AsyncGenerator[str, None]:
    context = "\n\n".join(
        f"[{s.title}]\n{s.excerpt}" for s in sources
    )

    # First, yield the source documents so the UI can render them
    # before the answer starts streaming
    sources_event = json.dumps({
        "type": "sources",
        "sources": [s.model_dump() for s in sources]
    })
    yield f"data: {sources_event}\n\n"

    system_prompt = (
        "You are a helpful assistant. Answer the user's question based "
        "only on the provided context. If the context does not contain "
        "enough information, say so clearly.\n\n"
        f"Context:\n{context}"
    )

    async with client.beta.chat.completions.stream(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ],
        max_tokens=1024,
    ) as stream:
        async for event in stream:
            if event.type == "content.delta":
                delta_event = json.dumps({
                    "type": "delta",
                    "content": event.delta
                })
                yield f"data: {delta_event}\n\n"
            elif event.type == "content.done":
                yield f"data: {json.dumps({'type': 'done'})}\n\n"

@router.post("/query")
async def rag_query(request: RAGRequest):
    # Get embedding for the question (async I/O — no GIL concern)
    embedding = await get_embedding(request.question)

    # Vector search (blocking DB call — run in executor)
    loop = asyncio.get_event_loop()
    sources = await loop.run_in_executor(
        None,
        lambda: vector_search(embedding, request.collection, request.top_k)
    )

    if not sources:
        raise HTTPException(
            status_code=404,
            detail="No relevant documents found in the collection"
        )

    return StreamingResponse(
        generate_rag_stream(request.question, sources),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        }
    )
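
The query endpoint assumes documents and their embeddings are already in the table. For completeness, a minimal ingestion sketch under the same assumptions (a documents table with a vector column, connection details elided exactly as in vector_search above, chunking left out):

# scripts/ingest.py
import asyncio

import psycopg2
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ingest(docs: list[dict], collection: str = "documents") -> None:
    # One embeddings call for the whole batch (the API accepts a list of inputs)
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=[d["content"] for d in docs],
    )
    conn = psycopg2.connect(...)  # same connection details as vector_search
    cursor = conn.cursor()
    for doc, item in zip(docs, response.data):
        cursor.execute(
            "INSERT INTO documents (title, content, collection, embedding) "
            "VALUES (%s, %s, %s, %s::vector)",
            (doc["title"], doc["content"], collection, item.embedding),
        )
    conn.commit()
    cursor.close()
    conn.close()

asyncio.run(ingest([{"title": "Demo", "content": "FastAPI streams LLM output over SSE."}]))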

Step 6: Deployment on Render

Render is a common choice for hosting both Next.js and FastAPI services with minimal infrastructure overhead.

Dockerfile for FastAPI:

# Dockerfile.api
FROM python:3.12-slim

WORKDIR /app

# Install system dependencies for ML libraries
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Pre-download model weights at build time (avoids cold start)
RUN python -c "
from transformers import pipeline
pipeline('sentiment-analysis', 'distilbert-base-uncased-finetuned-sst-2-english')
"

COPY . .

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

requirements.txt:

fastapi==0.115.0
uvicorn[standard]==0.30.0
pydantic==2.8.0
openai==1.50.0
transformers==4.44.0
torch==2.4.0
numpy==1.26.0
psycopg2-binary==2.9.9
python-dotenv==1.0.1

Note the --workers 2 flag in the uvicorn command. Each worker is a separate Python process with its own GIL, allowing true parallelism for handling concurrent requests. For ML inference, be careful: each worker loads the model into memory, so two workers with a 2 GB model require 4 GB of RAM. Size your Render instance accordingly.

render.yaml:

services:
  - type: web
    name: ai-api
    env: python
    dockerfilePath: ./Dockerfile.api
    healthCheckPath: /health
    envVars:
      - key: OPENAI_API_KEY
        sync: false  # Set in Render dashboard, not committed
      - key: DATABASE_URL
        fromDatabase:
          name: main-db
          property: connectionString

  - type: web
    name: frontend
    env: node
    buildCommand: npm ci && npm run build
    startCommand: npm run start
    envVars:
      - key: NEXT_PUBLIC_AI_API_URL
        value: https://ai-api.onrender.com
      - key: NEXT_PUBLIC_API_URL
        value: https://your-dotnet-api.azurewebsites.net
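
The render.yaml above points healthCheckPath at /health, so the FastAPI app needs to expose that route. A minimal sketch to add to main.py:

@app.get("/health", include_in_schema=False)
async def health() -> dict[str, str]:
    # Kept out of the OpenAPI spec; Render only needs a 200 response
    return {"status": "ok"}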

Key Differences

Concern              ASP.NET Core                          FastAPI (Python)
Type system          Nominal, compiler-enforced            Structural, runtime-validated via Pydantic
Concurrency          CLR thread pool, true parallelism     asyncio event loop (single-threaded) + thread pool for CPU
OpenAPI              Swashbuckle, attribute-based          Automatic from Pydantic models and route decorators
DI container         IServiceCollection, lifetimes         Depends() — functional, no container
Middleware           IMiddleware, pipeline                 Starlette middleware, decorators
Validation           Data Annotations, FluentValidation    Pydantic Field(), @field_validator
Error responses      ProblemDetails (RFC 7807)             HTTPException detail, 422 Unprocessable Entity
Streaming            IAsyncEnumerable<T>, SignalR          StreamingResponse + AsyncGenerator
Background tasks     IHostedService, BackgroundService     BackgroundTasks (per-request), Celery for queues
Testing              xUnit, Moq, WebApplicationFactory     pytest, pytest-asyncio, httpx.AsyncClient

Gotchas for .NET Engineers

Gotcha 1: Python Indentation Is Structural, Not Stylistic — and Type Hints Are Optional

In C#, indentation is a style convention enforced by your linter. In Python, indentation is the block delimiter. Wrong indentation is a SyntaxError or, worse, a logic error that the interpreter accepts but does not do what you intended.

# This looks like an if/else, but the else is not attached to the if
if condition:
    do_something()
  else:             # IndentationError — else is indented differently from if
    do_other()

# This runs do_other() unconditionally — not a syntax error, but wrong
if condition:
    do_something()
do_other()          # Not indented — runs regardless of condition

More relevant: Python type hints are not enforced at runtime. Pydantic enforces its own models, but arbitrary function signatures with type hints can be called with wrong types and Python will not complain:

def process_items(items: list[int]) -> int:  # Hint says list[int] -> int
    return sum(items)

result = process_items(["a", "b", "c"])  # Python accepts this
# sum() fails at runtime with TypeError, not at parse time

Fix: Use mypy or pyright as a static type checker in CI. FastAPI itself relies on type hints at runtime for validation and dependency injection, but nothing checks your own function signatures unless you run a checker. Without one, type hints in Python are documentation — valuable, but not enforced by the interpreter.

pip install mypy
mypy api/ --ignore-missing-imports --strict

Gotcha 2: Pydantic v1 vs. v2 — Two Incompatible APIs

Pydantic underwent a complete rewrite in version 2 (released 2023) that broke compatibility with v1. FastAPI 0.100+ supports Pydantic v2. Many tutorials, Stack Overflow answers, and GitHub repositories still show v1 syntax.

The most common breaking change:

# Pydantic v1
class MyModel(BaseModel):
    name: str

    class Config:
        allow_population_by_field_name = True

    @validator("name")
    def name_must_be_valid(cls, v):
        return v.strip()

instance = MyModel(name="test")
data = instance.dict()  # v1 method
# Pydantic v2 — different decorator, different method names
class MyModel(BaseModel):
    name: str

    model_config = ConfigDict(populate_by_name=True)  # replaces class Config

    @field_validator("name")  # replaces @validator
    @classmethod
    def name_must_be_valid(cls, v: str) -> str:
        return v.strip()

instance = MyModel(name="test")
data = instance.model_dump()  # replaces .dict()
json_str = instance.model_dump_json()  # replaces .json()

If you install FastAPI and Pydantic from scratch, you get v2. If you install into an existing Python project with a requirements.txt that pins pydantic<2, you get v1. Check pip show pydantic to confirm the version. Do not mix v1 and v2 syntax — the error messages are often confusing.

Gotcha 3: Python datetime Is Timezone-Naive by Default

In C#, DateTime without a Kind is ambiguous (local vs. UTC), and DateTimeOffset makes the offset explicit. Python has the same distinction: a datetime without tzinfo is naive (no timezone), and one with tzinfo is aware.

from datetime import datetime, timezone

# Naive — no timezone information
naive = datetime.now()          # Local time, no tzinfo
naive_utc = datetime.utcnow()   # UTC by convention, but still no tzinfo! (deprecated since Python 3.12)

# Aware — explicit UTC
aware = datetime.now(timezone.utc)   # Correct way to get current UTC time

The trap: datetime.utcnow() returns the current UTC time as a naive datetime. If you store this in a database and later compare it to an aware datetime, you get a TypeError. Pydantic’s plain datetime type accepts both naive and aware values, so the API boundary will not catch this for you by default; use Pydantic’s AwareDatetime type or a validator like the one below to require timezone-aware timestamps.

Fix: Always use datetime.now(timezone.utc) for current timestamps. Configure Pydantic to require timezone-aware datetimes:

from pydantic import BaseModel, field_validator
from datetime import datetime, timezone

class EventModel(BaseModel):
    occurred_at: datetime

    @field_validator("occurred_at")
    @classmethod
    def must_be_timezone_aware(cls, v: datetime) -> datetime:
        if v.tzinfo is None:
            raise ValueError("occurred_at must be timezone-aware")
        return v.astimezone(timezone.utc)  # Normalize to UTC

On the TypeScript side, z.string().datetime({ offset: true }) rejects strings without an explicit offset — add this to your Zod schemas for all date fields from Python APIs.

Gotcha 4: Python’s Async/Await Is Not Drop-In Parallelism

The single-threaded nature of asyncio means that code that blocks the event loop blocks all requests, not just the current one:

# This blocks the ENTIRE event loop for all concurrent requests
@app.post("/api/classify")
async def classify(request: ClassifyRequest):
    result = model(request.text)   # Synchronous, CPU-intensive — BLOCKS event loop
    return {"result": result}

# This is correct — runs blocking code off the event loop
@app.post("/api/classify")
async def classify(request: ClassifyRequest):
    loop = asyncio.get_event_loop()
    result = await loop.run_in_executor(None, lambda: model(request.text))
    return {"result": result}

The tell is the function signature: if you call a synchronous (blocking) function directly from an async def without await run_in_executor, you are blocking the event loop. Always check whether library functions you call are async (safe to await) or synchronous (must go to executor).

Alternatively, use asyncio.to_thread() (Python 3.9+), which has cleaner syntax:

import asyncio

result = await asyncio.to_thread(model, request.text)

Gotcha 5: Python’s None Becomes null in JSON — the Exclude-None Pattern

In Python, None is the absence of a value. When Pydantic serializes a model with None fields, those fields appear in the JSON output as null by default. If your TypeScript schema uses z.string().optional() (expecting the field to be absent, not null), the Zod parse will fail.

class SearchResult(BaseModel):
    id: str
    title: str
    description: Optional[str] = None  # Optional field

# Default serialization includes the None field:
# { "id": "1", "title": "Test", "description": null }

# Exclude None fields — matches TypeScript Optional behavior:
result.model_dump(exclude_none=True)
# { "id": "1", "title": "Test" }

Pick a convention and apply it consistently. Excluding None fields makes the JSON smaller and matches TypeScript’s optional semantics. Including them as null makes the schema more explicit. The critical thing is that your Zod schema and your Pydantic serialization agree:

// If Python sends null:
description: z.string().nullable().optional()

// If Python excludes the field entirely:
description: z.string().optional()

If excluding None is your convention, apply it on the route so FastAPI serializes the response model that way:

@app.get("/api/search/{result_id}", response_model=SearchResult, response_model_exclude_none=True)
async def get_search_result(result_id: str) -> SearchResult:
    ...

Hands-On Exercise

Goal: Build a FastAPI sentiment analysis endpoint and consume it from Next.js with streaming output and full type safety.

Prerequisites:

  • Python 3.12+, pip
  • Next.js 14 app
  • An OpenAI API key (or use Hugging Face’s free inference API)

Step 1 — Set up FastAPI:

mkdir ai-api && cd ai-api
python -m venv venv && source venv/bin/activate
pip install fastapi uvicorn openai pydantic python-dotenv transformers torch

Step 2 — Write the sentiment endpoint:

Create main.py with the sentiment analysis endpoint from the “Building a FastAPI ML Endpoint” section above. Run it:

uvicorn main:app --reload --port 8000

Verify the OpenAPI spec at http://localhost:8000/openapi.json.

Step 3 — Generate TypeScript types:

cd ../your-nextjs-app
npx openapi-typescript http://localhost:8000/openapi.json \
  -o src/lib/ai-api-types.gen.ts

Open the generated file. Identify the SentimentResponse type. Compare its shape to your Pydantic model.

Step 4 — Write the Zod schema:

Write a Zod schema that mirrors your SentimentResponse Pydantic model. Include:

  • label as a union type: z.enum(["positive", "negative", "neutral"])
  • score as z.number().min(0).max(1)
  • processing_time_ms as a positive number

Step 5 — Add the streaming chat endpoint:

Add the streaming chat endpoint from the “Streaming LLM Responses” section to your FastAPI app. Copy the ChatInterface component into your Next.js app.

Test the streaming: you should see tokens appear progressively in the UI as they stream from the LLM API.

Step 6 — Introduce a contract violation:

Change the label field in your Pydantic SentimentResponse to sentiment_label. Run the type generator again. Observe where TypeScript now errors. Fix the Zod schema to match, observe the runtime validation fail if you send the old response shape. This is the feedback loop you rely on in production.

Step 7 — Add schemathesis:

pip install schemathesis
schemathesis run http://localhost:8000/openapi.json --checks all

Observe what schemathesis tests automatically. Note that it will send null, empty strings, very long strings, and other edge cases to your endpoints. If any cause 500 errors, that is a bug in your Pydantic validation — fix it.


Quick Reference

FastAPI ↔ ASP.NET Core mapping
  @app.get("/path")        ->  app.MapGet("/path", handler)
  @app.post("/path")       ->  app.MapPost("/path", handler)
  BaseModel                ->  record / class with Data Annotations
  Field(ge=0, le=1)        ->  [Range(0, 1)]
  Optional[T] = None       ->  T? with nullable reference types
  Depends(get_service)     ->  constructor injection
  HTTPException(404, ...)  ->  return Results.NotFound(...)
  lifespan context manager ->  IHostedService startup/shutdown

Pydantic to Zod field mapping
  str                      ->  z.string()
  int                      ->  z.number().int()
  float                    ->  z.number()
  bool                     ->  z.boolean()
  datetime                 ->  z.string().datetime({ offset: true }).transform(v => new Date(v))
  Optional[T] = None       ->  z.T().nullable().optional()
  List[T]                  ->  z.array(z.T())
  Dict[str, T]             ->  z.record(z.T())
  str Enum                 ->  z.enum(["val1", "val2"])
  Field(ge=0)              ->  .min(0)
  Field(min_length=1)      ->  .min(1) (on z.string() or z.array())

Run FastAPI locally
  uvicorn main:app --reload --port 8000

OpenAPI spec URL (FastAPI)
  http://localhost:8000/openapi.json

Generate TS types from FastAPI
  npx openapi-typescript http://localhost:8000/openapi.json -o src/lib/ai-api-types.gen.ts

Streaming response pattern (FastAPI)
  return StreamingResponse(generator(), media_type="text/event-stream")
  yield f"data: {json.dumps(payload)}\n\n"

SSE client (TypeScript)
  const reader = response.body.getReader()
  // Loop: reader.read() -> decode -> split on "\n" -> parse "data: " lines

CPU-bound work off event loop (Python)
  await asyncio.to_thread(sync_function, arg1, arg2)
  # or: await loop.run_in_executor(executor, lambda: sync_function(arg1))

Pydantic v2 key method names (not v1)
  .model_dump()            (was .dict())
  .model_dump_json()       (was .json())
  @field_validator         (was @validator)
  model_config = ConfigDict(...) (was class Config)

Exclude None from response
  model_instance.model_dump(exclude_none=True)

Contract testing
  pip install schemathesis
  schemathesis run http://localhost:8000/openapi.json --checks all

Type checking Python
  pip install mypy
  mypy api/ --ignore-missing-imports

uvicorn production (multiple workers)
  uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2
  # Each worker = separate process + separate GIL
  # Memory: model_size_GB * num_workers

Further Reading