
The Return of the Thin Stack: Software in the Age of AI Synthesis
We traded machine efficiency for developer speed. Now that AI writes the code, it is time to buy back the bare metal.
Late on a Saturday night, I ran an experiment I hadn't been able to stop thinking about.
The question was simple. For thirty years, software engineering has operated on one clean trade-off: we spent machine resources to buy human developer speed.
We moved away from Assembly and C to Java, Python, and TypeScript. We wrapped our code in heavy runtimes, dynamic types, and nested package trees. We didn't care that a simple "Hello World" required a 30 Megabyte runtime engine; we cared that a human could write it in ten seconds.
We built a world of thick, bloated stacks because human cognitive capacity was the bottleneck.
But now, the builder has changed.
If an AI can refactor, type-check, and compile code instantly, why are we still paying the performance tax of dynamic, runtime-heavy languages?
I ran the experiment. Here is what I found, and I'd genuinely like to know if my thinking holds. I built the exact same "Hello World" output in six configurations covering five languages:
The Bare-Metal Comparison
| Metric | Assembly (GAS) | C++ (g++) | Java (JVM) | Python 3 | TypeScript (Bun Run) | TypeScript (Bun Compile) |
|---|---|---|---|---|---|---|
| Source File Size | 284 B | 97 B | 119 B | 23 B | 30 B | 30 B |
| Time to Build | <1 ms | 530 ms | 590 ms | No build | No build | 440 ms |
| Compiled Binary Size | 8.8 KB | 16.3 KB | 417 Bytes (.class) | 23 B (+7.7 MB Interpreter) | 98.0 MB (Runtime) | 101.8 MB |
| Peak Memory (RAM) | 264 KB | 3.4 MB | 38.6 MB | 9.4 MB | 32.1 MB | 28.6 MB |
| Execution Time | <1 ms | <1 ms | 40 ms | 20 ms | 10 ms | 10 ms |
| Input Tokens (Prompt) | 850 | 850 | 510 | 850 | 850 | 850 |
| Output Tokens (Code) | 250 | 180 | 120 | 80 | 140 | 140 |
| AI Generation Time | 3.2s | 2.5s | 1.8s | 1.4s | 2.1s | 2.1s |
| Generation Cost (USD) | ~$0.0031 | ~$0.0028 | ~$0.0018 | ~$0.0026 | ~$0.0027 | ~$0.0027 |
The TypeScript (Bun Compile) standalone binary (the one you'd actually ship as a microservice) comes in at 101.8 MB and peaks at 28.6 MB of RAM. The raw Assembly equivalent: 8.8 Kilobytes compiled and just 264 Kilobytes at runtime. That is roughly 110 times lighter in memory and more than 11,000 times smaller on disk.
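A note on scale: peak RAM spans about two orders of magnitude across these builds (264 KB to 38.6 MB), and standalone binary size spans about four (8.8 KB for Assembly to 101.8 MB for Bun Compile). If you plotted these numbers on a chart rather than a table, you would need a logarithmic axis to make them readable.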
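The ratios quoted above can be recomputed directly from the table. The sketch below does that for peak RAM; it assumes 1 KB = 1024 bytes (the table does not say which convention the measurements used), and the function name `ratio_to_baseline` is mine, not something from the benchmark suite.

```python
import math

# Peak RAM figures copied from the "Hello World" table.
# Assumption: KB and MB are binary units (1024-based).
KB = 1024
MB = 1024 * KB

peak_ram_bytes = {
    "Assembly (GAS)": 264 * KB,
    "C++ (g++)": 3.4 * MB,
    "Java (JVM)": 38.6 * MB,
    "Python 3": 9.4 * MB,
    "TypeScript (Bun Run)": 32.1 * MB,
    "TypeScript (Bun Compile)": 28.6 * MB,
}


def ratio_to_baseline(values, baseline="Assembly (GAS)"):
    """Express every entry as a multiple of the baseline entry."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


ram_ratios = ratio_to_baseline(peak_ram_bytes)

for name, ratio in ram_ratios.items():
    # log10 of the ratio is the "orders of magnitude" gap to the baseline.
    print(f"{name:26s} {ratio:8.1f}x  (10^{math.log10(ratio):.2f})")
```

Swapping in the binary-size row gives the on-disk comparison in the same way, though the Python and Java figures there need a decision about whether to count the interpreter or JVM.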
Going Deeper: The File Parsing & Filter Tax
To see how these runtimes behave beyond a simple print statement, I ran a second experiment: reading a file of CSV record data representing vinyl listings (Artist, Title, Year, Genre) and performing case-insensitive keyword filtering based on standard user inputs.
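For readers who want to picture the workload, here is a minimal Python sketch of that kind of filter. It is not the code from the benchmark suite: the sample rows, the `filter_records` name, and the choice to match the keyword against every column are my assumptions, and it reads from an in-memory string rather than a file so that it stays self-contained.

```python
import csv
import io

FIELDS = ("Artist", "Title", "Year", "Genre")

# Hypothetical sample data in the column layout the article describes.
SAMPLE_CSV = """Artist,Title,Year,Genre
Miles Davis,Kind of Blue,1959,Jazz
Pink Floyd,The Wall,1979,Rock
"""


def filter_records(csv_text, keyword):
    """Return rows where any known column contains keyword, ignoring case."""
    needle = keyword.casefold()
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        # row.get() may return None for short rows, hence the `or ""`.
        if any(needle in (row.get(field) or "").casefold() for field in FIELDS)
    ]


matches = filter_records(SAMPLE_CSV, "rock")
for row in matches:
    print(row["Artist"], "|", row["Title"])
```

To run it against a real file, pass the result of `open(path, newline="").read()` as `csv_text`.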
Here is the resource footprint for the CSV parsing application:
| Metric | Assembly (GAS) | C (Native Compiled) | C++ (g++) | Java (JVM) | Python 3 | TypeScript (Bun Run) | TypeScript (Bun Compile) |
|---|---|---|---|---|---|---|---|
| Source File Size | 9.3 KB | 2.2 KB | 1.8 KB | 2.1 KB | 1.2 KB | 1.4 KB | 1.4 KB |
| Compiled Binary Size | 10.6 KB | 16.6 KB | 32.3 KB | 3.0 KB (.class) | 1.2 KB (+7.7 MB Interpreter) | 98.0 MB (Runtime) | 101.8 MB |
| Peak Memory (RAM) | 264 KB | 1.7 MB | 3.4 MB | 38.9 MB | 9.4 MB | 32.2 MB | 28.6 MB |
| Execution Time | <1 ms | <1 ms | <1 ms | 45 ms | 20 ms | 10 ms | 10 ms |
| Input Tokens (Prompt) | 920 | 920 | 850 | 510 | 850 | 850 | 850 |
| Output Tokens (Code) | 550 | 350 | 320 | 280 | 180 | 220 | 220 |
| AI Generation Time | 5.5s | 4.1s | 3.8s | 2.9s | 2.0s | 2.4s | 2.4s |
Look at the Peak Memory column. The Assembly binary parses the same file on just 264 Kilobytes of RAM. The C binary uses 1.7 Megabytes. Both are sub-millisecond. Both are done before the JVM has finished loading its garbage collector.
TypeScript as a compiled standalone binary consumes 28.6 Megabytes of RAM for the same operation: roughly 17x the C figure, and over 100x the Assembly one. The Java Virtual Machine weighs in at 38.9 Megabytes just to spin up for a 2-kilobyte program.
Why This Matters for the Future of Tech Stacks
Historically, writing Assembly or low-level C++ was too slow and risky. We chose TypeScript and Python because they were safer and faster for us to write.
But AI is more patient with strict typing, memory allocation, and pointer safety than most developers are. It will follow the register contracts. It can write structured, compiled, low-dependency code as readily as it writes a script. Whether it can do this safely is a different question, and one the security section below answers honestly.
If that bottleneck has shifted, what follows? I don't know with certainty, but here is what I think might be worth examining:
- The return of the Thin Stack: If the cost of generating compiled code falls, organisations may stop defaulting to heavy runtime frameworks for work that does not need them. AI could generate custom, zero-dependency, compiled binaries tailored precisely to the job.
- Languages as a design choice again: Programming languages might shift toward compiler safety and machine efficiency as selection criteria, rather than developer familiarity alone. Strong typing and ahead-of-time compilation become more viable when the AI is the one navigating the constraints.
- A different cloud calculation: If memory footprints shrink by 10–100x and execution times follow, that may translate to meaningful infrastructure savings at scale. How direct that translation is depends on how your platform prices RAM and idle compute, but it seems worth modelling.
These are hypotheses, not forecasts. The experiment I ran over a weekend does not settle them. It opens them up.
The Missing Equation: Token Cost vs. Carbon Lifecycle
When we discuss the carbon footprint of AI, we focus entirely on the energy consumed by the GPU during token generation. But we are completely ignoring the compounding lifecycle footprint of the bloated code those tokens produce.
- The One-Off Token Cost: You pay a small, one-time energy and token cost to have the AI write and debug a low-level, compiled Assembly or C program. Let's call it $0.05 and a few watt-hours of GPU power.
- The Infinite Execution Footprint: That software is then deployed. It is stored on disks, pushed through CI/CD pipelines, loaded into RAM, and run millions of times in production.
  - Deploying a 100 MB TypeScript/Bun standalone binary means every transfer, idle second in memory, and cold start consumes electricity, bandwidth, and CPU cycles globally.
  - Deploying an 8 KB Assembly binary cuts that operational footprint to a tiny fraction. It loads almost instantly, copies almost instantly, and runs on less than 300 KB of RAM.
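One way to make the trade-off above concrete is a break-even count: how many production runs it takes before a cheaper-to-run build repays whatever extra it cost to generate. The sketch below is only a toy model. The per-run saving is a placeholder I made up, not a measurement, and the model treats the article's whole $0.05 as "extra" cost, which is the least favourable assumption for the low-level build. Amounts are in micro-dollars so the arithmetic stays in integers.

```python
def break_even_runs(extra_generation_micro_usd, saving_per_run_micro_usd):
    """Smallest number of runs at which the per-run saving covers the
    extra one-off generation cost (both in millionths of a dollar)."""
    if saving_per_run_micro_usd <= 0:
        raise ValueError("the lighter build must actually save something per run")
    # Integer ceiling division without floating point.
    return -(-extra_generation_micro_usd // saving_per_run_micro_usd)


# $0.05 from the article, expressed in micro-dollars.
EXTRA_GENERATION_MICRO_USD = 50_000

# HYPOTHETICAL: assume each run of the lighter binary saves 2 micro-dollars
# of compute, memory and transfer. Replace with a measured figure.
SAVING_PER_RUN_MICRO_USD = 2

runs = break_even_runs(EXTRA_GENERATION_MICRO_USD, SAVING_PER_RUN_MICRO_USD)
print(f"break-even after {runs:,} runs")
```

The same shape works for energy instead of money: swap micro-dollars for milliwatt-hours and plug in whatever your own measurements show.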
Going Further: The Bare-Metal HTTP API Server
Finally, I built a fully functional HTTP API server that loads the CSV records database into memory and serves filtered JSON query results on port 8080 (e.g. /api/records?q=Rock). I compared a hybrid Assembly/C++ binary (writing socket assembly but calling C++ for parsing) against clean C++ and TypeScript (via Bun).
Here is how they stack up:
| Metric | Assembly API Server | C++ API Server | TypeScript (Bun Run) | TypeScript (Bun Compile) |
|---|---|---|---|---|
| Source File Size | 14.8 KB | 4.8 KB | 1.4 KB | 1.4 KB |
| Compiled Binary Size | 21.8 KB | 23.6 KB | 98.0 MB (Runtime) | 101.8 MB |
| Peak Memory (RAM) | 1.37 MB | 1.66 MB | 44.1 MB | 40.4 MB |
| Response Time (RTT) | <1 ms | <1 ms | <2 ms | <2 ms |
The numbers are striking. The Assembly API server's binary is roughly 4,700 times smaller (21.8 KB vs 101.8 MB), and it consumes about 30 times less RAM (1.37 MB vs 40.4 MB) than its compiled TypeScript counterpart.
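To show what the endpoint does functionally, here is a rough Python equivalent built on the standard library. It is an illustration of the request shape described above (/api/records?q=Rock returning filtered JSON), not a port of the benchmarked servers; the in-memory `RECORDS` list, the lower-case JSON keys, and the helper names are all assumptions of mine.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the CSV database loaded at startup.
RECORDS = [
    {"artist": "Miles Davis", "title": "Kind of Blue", "year": 1959, "genre": "Jazz"},
    {"artist": "Pink Floyd", "title": "The Wall", "year": 1979, "genre": "Rock"},
    {"artist": "Nirvana", "title": "Nevermind", "year": 1991, "genre": "Rock"},
]


def query_records(path, records):
    """Pure request logic: map a request path to (status, JSON body)."""
    parsed = urlparse(path)
    if parsed.path != "/api/records":
        return 404, json.dumps({"error": "not found"})
    needle = parse_qs(parsed.query).get("q", [""])[0].casefold()
    hits = [
        record
        for record in records
        if any(needle in str(value).casefold() for value in record.values())
    ]
    return 200, json.dumps(hits)


class RecordsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = query_records(self.path, RECORDS)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking call; start it manually, e.g. serve() then curl the URL."""
    HTTPServer(("127.0.0.1", port), RecordsHandler).serve_forever()
```

Keeping the query logic in a plain function, separate from the socket handling, mirrors the hybrid build described above, where the socket layer and the parsing layer live in different languages.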
The Security Tax: API Key vs. HMAC vs. HTTPS (TLS)
To understand how adding security logic affects a bare-metal executable vs. a runtime-heavy engine, I ran three iterations of security on the API server:
- API Key (Experiment 4a): Simple header string search (X-API-Key: secure123).
- HMAC Tokens (Experiment 4b): Cryptographic verification using OpenSSL HMAC from libcrypto.so.3 with signature hex digests.
- Transport Security (Experiment 4c): Application-level TLS/HTTPS handshakes using libssl.so and libcrypto.so.
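The first two levels are easy to sketch in Python's standard library. This is a hedged illustration, not the benchmarked code: the article does not say which hash the HMAC experiment used, so SHA-256 is my assumption, the shared secret is a placeholder, and I use a constant-time comparison for the API key where the experiment describes a simple string search.

```python
import hashlib
import hmac

# Key value quoted in the article; a real service would load it from
# configuration rather than hard-coding it.
API_KEY = "secure123"

# HYPOTHETICAL shared secret for the HMAC example.
SHARED_SECRET = b"demo-shared-secret"


def check_api_key(headers):
    """Level 4a: accept the request only if X-API-Key matches."""
    supplied = headers.get("X-API-Key", "")
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(supplied.encode(), API_KEY.encode())


def sign(message):
    """Hex HMAC of the message. SHA-256 is an assumption, see above."""
    return hmac.new(SHARED_SECRET, message, hashlib.sha256).hexdigest()


def verify(message, signature_hex):
    """Level 4b: recompute the HMAC and compare in constant time."""
    return hmac.compare_digest(sign(message).encode(), signature_hex.encode())


signature = sign(b"/api/records?q=Rock")
print(check_api_key({"X-API-Key": "secure123"}), verify(b"/api/records?q=Rock", signature))
```

Transport security (4c) is deliberately left out here; TLS termination is a much larger surface than a header check and does not reduce to a short sketch.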
Here is the resource overhead of security additions:
| Security level | Metric | Assembly API Server | C++ API Server | TypeScript (Bun Compiled) |
|---|---|---|---|---|
| Unsecured | Binary / RAM | 21.8 KB / 1.37 MB | 23.6 KB / 1.66 MB | 101.8 MB / 40.4 MB |
| API Key Auth | Binary / RAM | 22.2 KB / 1.37 MB | 23.6 KB / 1.66 MB | 101.8 MB / 40.3 MB |
| HMAC Verification | Binary / RAM | 26.0 KB / 2.67 MB | 27.2 KB / 3.02 MB | 101.8 MB / 41.3 MB |
| HTTPS (TLS) | Binary / RAM | 28.3 KB / 6.58 MB | 33.7 KB / 6.73 MB | 101.8 MB / 42.1 MB |
By offloading the cryptographic stack to the system's dynamic libraries (libssl.so / libcrypto.so), the Assembly binary remains tiny (under 30 KB). However, dynamically loading SSL libraries brings a ~1.3 MB RAM tax for HMAC and a ~5.2 MB RAM tax for TLS handshake allocations.
Crucially, even with full transport-level encryption, the Assembly server consumes just 6.58 Megabytes of RAM: about 6.1 times less than TypeScript's unsecured baseline (40.4 MB) and 6.4 times less than TypeScript with TLS enabled (42.1 MB).
A candid finding from running a security review against this code: Experiment 5 was labelled "hardened," but a diff against Experiment 4 returns empty. The source files are byte-for-byte identical. The only addition is a pentest script, and it probes the one bound that was already correct. The AI generated a directory with a confidence-inspiring name, produced a passing test, and shipped zero hardening.
This is not a flaw in the experiment; it is the point. AI reduces the cost of writing low-level code. It does not reduce the cost of verifying it. The security review that surfaces a finding like this requires human judgment that no amount of code generation replaces. The thin stack argument stands, but with eyes open: the cost saved on the developer is re-spent on the auditor.
The Containerized Density and Rate-Limiting Gateway
To simulate a production-like microservice architecture, I introduced a multi-container Docker Compose stack featuring a gateway layer. In this experiment, an NGINX reverse-proxy acts as the gateway (port 8080), applying rate-limiting (up to 100 req/sec) and routing requests to the underlying Assembly (port 8081), C++ (port 8082), and TypeScript/Bun (port 8083) containers.
Here are the load-testing results (500 requests at 10 concurrent threads through NGINX). The table also lists a Rust implementation of the API alongside the three services above:
| Environment | Docker Image Size | Baseline RAM | Peak RAM (under load) | Peak CPU (under load) | Avg Latency |
|---|---|---|---|---|---|
| Assembly API | 27.0 MB | 992 KB | 1.65 MB | 0.00% | 15.22 ms |
| C++ API | 27.0 MB | 988 KB | 1.41 MB | 0.00% | 16.20 ms |
| TypeScript (Bun) | 146.0 MB | 9.81 MB | 12.11 MB | 0.49% | 16.29 ms |
| Rust API | 14.6 MB | 592 KB | 1.36 MB | 0.00% | 13.95 ms |
Under parallel execution constraints behind NGINX, the Assembly, C++, and Rust services maintain sub-2 MB memory footprints, while Bun hovers around 12 MB peak RAM. This confirms that the lightweight memory footprint is preserved even when wrapped in separate minimal Docker containers.
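For intuition about what a 100 req/sec limit at the gateway means, here is a small token-bucket limiter in Python. To be clear about what this is: a generic sketch of rate limiting, not NGINX's implementation (its limit_req module is documented as a leaky-bucket scheme) and not the configuration used in the experiment. The burst size and the injected `now` timestamps are illustrative choices.

```python
class TokenBucket:
    """Allow up to `rate_per_sec` requests per second on average,
    with short bursts of at most `burst` requests."""

    def __init__(self, rate_per_sec, burst):
        self.rate = float(rate_per_sec)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) passes."""
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# 100 requests/second as in the article; the burst of 5 is hypothetical.
bucket = TokenBucket(rate_per_sec=100, burst=5)
burst_results = [bucket.allow(now=0.0) for _ in range(6)]
later_result = bucket.allow(now=0.02)
print(burst_results, later_result)
```

Passing the clock in as an argument keeps the sketch deterministic; a real limiter would read `time.monotonic()` and would need a lock if shared across threads.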
The Relational SQL Database Overhead
To test how these microservices behave when integrated with a persistent state storage engine, I introduced a shared SQLite relational database (joining records, likes, and customers tables) queried concurrently by Assembly (port 8081), C++ (port 8082), and TypeScript (port 8083).
An interesting finding emerged regarding connection caching: in initial drafts, Assembly and C++ opened and closed the database file on every request, adding significant kernel file-lock overhead. Once refactored to cache a persistent connection handle (mirroring the Bun driver's behavior), their average latencies came in within a fraction of a millisecond of Bun's.
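The two patterns are easy to contrast with Python's built-in sqlite3 module. This is a sketch of the idea only: the single-table schema, the sample rows, and the names `query_per_request` and `CachedDb` are hypothetical, and the real experiment joins three tables from Assembly and C++ rather than Python.

```python
import os
import sqlite3
import tempfile

QUERY = "SELECT title FROM records WHERE genre = ? ORDER BY title"


def make_demo_db():
    """Create a throwaway database file with a simplified records table."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE records (title TEXT, genre TEXT)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [("Kind of Blue", "Jazz"), ("The Wall", "Rock")],
    )
    conn.commit()
    conn.close()
    return path


def query_per_request(db_path, genre):
    """Initial pattern: open and close the file on every request."""
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute(QUERY, (genre,))]
    finally:
        conn.close()


class CachedDb:
    """Refactored pattern: open once, reuse the handle for every request."""

    def __init__(self, db_path):
        self._conn = sqlite3.connect(db_path)

    def query(self, genre):
        return [row[0] for row in self._conn.execute(QUERY, (genre,))]

    def close(self):
        self._conn.close()
```

Both paths return the same rows; the difference is how often the file is opened and locked. One caveat if you adapt this: by default a Python sqlite3 connection may only be used from the thread that created it, so a threaded server needs one cached handle per thread or explicit locking.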
Here are the load-testing results (500 requests at 10 concurrent threads):
| Environment | Docker Image Size | Baseline RAM | Peak RAM (under load) | Peak CPU (under load) | Avg Latency |
|---|---|---|---|---|---|
| Assembly API | 29.5 MB | 620 KB | 1.93 MB | 0.00% | 16.48 ms |
| C++ API | 29.5 MB | 640 KB | 2.50 MB | 0.00% | 16.65 ms |
| TypeScript (Bun) | 146.0 MB | 8.46 MB | 11.05 MB | 0.43% | 16.62 ms |
| Rust API | 17.9 MB | 728 KB | 2.00 MB | 0.00% | 13.85 ms |
While query latency is close across the board (Rust is a few milliseconds ahead), the memory density of the compiled micro-binaries remains striking. Under peak concurrent query traffic, the Assembly and Rust SQL microservices peak at about 2 Megabytes of RAM (1.93 MB and 2.00 MB), a 5.5x to 5.7x memory reduction over TypeScript's 11.05 MB.

The Path Forward: AI-Synthesized Micro-Binaries
How do we make this practical without drowning in segmentation faults?
The answer lies in structural constraints. Instead of asking an AI to write low-level code from scratch, we provide it with an audited, vendor-agnostic Assembly library, a standardised set of register contracts for I/O, string operations, and TCP sockets. The AI simply synthesizes the business logic within these safe constraints.
This points to a new architecture: Bare-Metal Microservices.
Instead of writing APIs inside bloated containers, we build micro-services as tiny compiled binaries. By delegating complex networking concerns (like TLS decryption, rate-limiting, and DoS mitigation) to a high-performance edge proxy (like NGINX or Envoy), our services can remain extremely focused.
One question worth addressing directly: the binary still runs on a host OS. That OS carries its own footprint: a Linux kernel plus system processes typically uses 50–150 MB. But that cost is shared across every service on the host, not charged per service. The comparisons in this article are application-layer RAM, measured above that shared floor. The OS overhead is symmetric: a TypeScript container and an Assembly container both sit on the same kernel. The gap between them, roughly 39 MB in the unsecured API server test (40.4 MB vs 1.37 MB), is pure application-layer waste, and that is what compounds at scale. The logical endpoint of this architecture is a unikernel, where the binary IS the kernel, with no general-purpose OS underneath at all, bringing total per-service memory including everything down to 2–5 MB. That is a separate experiment.
This unlocks three meaningful architectural advantages:
- Unprecedented Density: Running a service at roughly 22 KB on disk and 1.4 MB of RAM means you can host thousands of microservices on a single cheap server where you could previously only fit a handful of Node.js or JVM containers.
- Near-Zero Cold-Start Latency: With no runtime engine to boot (no V8, no JVM), startup is close to instantaneous; the Assembly and C++ programs above ran start to finish in under 1 ms. Serverless "scale-to-zero" architectures pay a far smaller cold-start penalty.
- Decoupled Security: The edge proxy acts as a hardened shield, meaning the microservices can focus purely on executing business logic at native, bare-metal speeds.
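The density claim in the first bullet can be sanity-checked with a few lines of arithmetic. The sketch below uses the API server's measured peak RAM from the earlier table (1.37 MB for Assembly, 40.4 MB for compiled TypeScript) and the upper end of the shared OS estimate (150 MB). The 4 GB host is a hypothetical figure, and the result is a memory-only ceiling: it ignores CPU, file descriptors, container overhead, and load-dependent growth.

```python
def services_per_host(host_ram_mb, shared_os_mb, per_service_mb):
    """Memory-only upper bound on how many services fit on one host."""
    usable_mb = host_ram_mb - shared_os_mb
    if usable_mb <= 0 or per_service_mb <= 0:
        return 0
    return int(usable_mb // per_service_mb)


HOST_RAM_MB = 4096   # HYPOTHETICAL small server
SHARED_OS_MB = 150   # upper end of the article's 50-150 MB estimate

assembly_count = services_per_host(HOST_RAM_MB, SHARED_OS_MB, 1.37)
typescript_count = services_per_host(HOST_RAM_MB, SHARED_OS_MB, 40.4)

print(f"Assembly API servers:   {assembly_count}")
print(f"TypeScript API servers: {typescript_count}")
```

On those assumptions the ceiling is a little under three thousand Assembly services against just under a hundred TypeScript ones, which is the shape of the "thousands versus a handful" claim, though real deployments will hit other limits first.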
Maybe the paradigm is shifting. If the AI is writing the code, optimisation deserves to be a design question again, not something we defer because the human cost of addressing it was too high.
I ran this experiment over a weekend. I think the data points somewhere real, but I've probably made assumptions worth challenging, whether on the methodology, the security analysis, or the broader architectural claims.
The full benchmark suite, source code, deployment scripts, and security review findings are on GitHub: github.com/ShadowRustRuby/thin-stack-benchmarks. The hardware spec and methodology limitations are documented there too.
If my thinking is wrong, tell me. Raise an issue. That's the point.