What "Memory Compiler" Actually Means: From Bitcells to GDS Tiling

May 29, 2026 [compiler-design] #memory-compiler #sram #eda #layout-vs-schematic #tiling

Most people see the name "memory compiler" and have no idea what it actually does.


1. Compiler Representation

The classical representation is the T-diagram (or Tombstone diagram). It characterizes a compiler by three languages: the source language it reads (A), the target language it emits (B), and the implementation language it is written in (C):

     ┌───┬───┐
     │ A → B │
     └───┴───┘
         │ C

Read it as: "a program that translates A into B, written in C."

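The textbook example is C → machine code. But the definition is broader than that: the source can just as well be a set of parameters plus leaf-cell views, and the target a set of layout, netlist, and timing files.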


2. The Raw Material: Inside a Bitcell

The fundamental building block of any SRAM (Static Random-Access Memory) is the bitcell — a tiny circuit that holds one bit. Hardware teams design and characterize these cells by hand; the memory compiler's job is to replicate them at scale.

The most common variant is the 6T bitcell (six transistors). Three signal lines connect it to the outside world: WL (Word Line — selects a row), BL (Bit Line), and BLB (Bit Line Bar, the complementary signal):

                  WL
  ══════════════════╦══════════════════════════════╦════════
                    ║                              ║
              ┌─────╫──────────────────────────────╫─────┐
              │   [ M5 ]                         [ M6 ]  │
              │     │                              │     │
         BL ──┼─────●───── Q               Q' ─────●─────┼── BLB
              │            │               │             │
              │            │   ┌───────┐   │             │
              │            ├─►─┤ INV_R ├───┤             │
              │            │   │(M2,M4)│   │             │
              │            │   └───────┘   │             │
              │            │               │             │
              │            │   ┌───────┐   │             │
              │            ├───┤ INV_L ├─◄─┤             │
              │            │   │(M1,M3)│   │             │
              │            │   └───────┘   │             │
              │                                          │
              │                bitcell (6T)              │
              └──────────────────────────────────────────┘

Transistor count: 2 access transistors (M5, M6) + 2 cross-coupled inverters × 2 transistors each (M1 to M4) = 6.

Positive Feedback: How a Bit Gets Locked

Two inverters connected head-to-tail (cross-coupled):

  1. Assume Q = 1 (high, Vdd)
  2. INV_R receives 1, outputs 0 to Q'
  3. INV_L receives 0, outputs 1 back to Q
  4. Q stays 1 — a perfect closed loop

This is positive feedback + bistability. As long as Vdd is present, Q and Q' are locked at opposite voltages indefinitely. The stored bit is literally just the voltage sitting on those two nodes. SRAM is called "Static" because — unlike DRAM — it never needs to refresh.
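The four steps above can be replayed as a tiny logic-level simulation. This is only a sketch: the inverters are ideal Boolean functions rather than transistors, and the names `inv` and `settle` are made up for illustration.

```python
def inv(x: int) -> int:
    """Ideal logic inverter: 0 -> 1, 1 -> 0."""
    return 1 - x


def settle(q: int, steps: int = 4) -> tuple:
    """Propagate a starting value of Q around the cross-coupled loop.

    INV_R drives Q' from Q; INV_L drives Q back from Q'.
    """
    q_bar = inv(q)
    for _ in range(steps):
        q_bar = inv(q)   # INV_R: Q -> Q'
        q = inv(q_bar)   # INV_L: Q' -> Q
    return q, q_bar


for start in (0, 1):
    print(start, "->", settle(start))
```

Whichever value the loop starts with, it keeps: both (Q, Q') = (1, 0) and (0, 1) are stable, which is what bistability means at the logic level.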

Three Operations

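Hold (Standby)

  1. WL stays low; M5 and M6 are off
  2. The cell is isolated from BL and BLB
  3. The cross-coupled inverters hold Q and Q' for as long as Vdd is present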

Read

  1. Precharge BL and BLB to Vdd
  2. Assert WL high; M5 and M6 turn on
  3. The side storing 0 pulls its bitline down by ~100 mV
  4. Sense amplifier amplifies the small differential → 1 bit out

Read Disturbance: the moment WL turns on, the high-voltage BLB can pull up Q' (the node storing 0) slightly through M6. If it rises past the switching threshold of INV_L, the cell flips — a Destructive Read. The fix is making the pull-down NMOS M2 stronger (wider) than the access transistor M6. This strength ratio is the Beta Ratio (β = W_pull-down / W_access), typically required to be > 1. This is why the six transistors in a standard bitcell are not all the same size.

Write

  1. Write driver forces BL and BLB to target values (one high, one low)
  2. Assert WL high
  3. The write driver, acting through the access transistor, overpowers the cell's PMOS pull-up → the cell flips
  4. The sizing that governs this contest is the pull-up ratio (W_pull-up / W_access): it must be low enough for the access transistor to win
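The two sizing constraints, the Beta Ratio from the read discussion and the pull-up ratio from the write steps, can be checked in a few lines. This is a minimal sketch: the widths are invented numbers rather than values from any PDK, the default limits of 1.0 are placeholders (only β > 1 comes from the text), and it assumes M1/M2 are the pull-downs and M3/M4 the pull-ups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSizing:
    """Transistor widths in nm (illustrative values only)."""
    w_pull_down: float  # NMOS pull-down (assumed M1, M2)
    w_pull_up: float    # PMOS pull-up (assumed M3, M4)
    w_access: float     # NMOS access (M5, M6)

    @property
    def beta_ratio(self) -> float:
        # Read stability: the pull-down must beat the access transistor.
        return self.w_pull_down / self.w_access

    @property
    def pull_up_ratio(self) -> float:
        # Writability: the access transistor must beat the pull-up.
        return self.w_pull_up / self.w_access


def check(cell: CellSizing, min_beta: float = 1.0, max_pull_up: float = 1.0) -> list:
    """Return the violated constraints (an empty list means both hold)."""
    problems = []
    if cell.beta_ratio <= min_beta:
        problems.append("read stability: beta ratio too low")
    if cell.pull_up_ratio >= max_pull_up:
        problems.append("writability: pull-up ratio too high")
    return problems


cell = CellSizing(w_pull_down=150.0, w_pull_up=80.0, w_access=100.0)
print(cell.beta_ratio, cell.pull_up_ratio, check(cell))
```

The two checks pull the access transistor in opposite directions: widening it helps the write and hurts the read, which is why the three device sizes have to be chosen together.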

Beyond 6T: Other Bitcell Types

6T is not the only option. Different use cases demand different trade-offs:

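| Cell | Transistors | Key property | Typical use |
|------|-------------|--------------|-------------|
| 6T | 6 | High density, standard read/write | L2/L3 caches |
| 8T | 8 | Isolated read port, no read disturb | L1 caches, register files |
| 10T | 10 | Ultra-low voltage operation | Near-threshold designs |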

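The 8T cell adds a dedicated read path (two extra transistors) so the read operation never touches the storage nodes. That eliminates read disturbance entirely, at the cost of ~33% more transistors per cell plus an extra read wordline and read bitline to route, so the array grows accordingly.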


3. Why a Memory Compiler Exists

When a hardware team designs an SoC (System on Chip), they need SRAM: lots of it, at many different sizes. The fundamental unit they work with is a cell, the smallest verified building block of an SRAM array. But a chip might need an SRAM of depth 512 × width 32 in one place, and depth 4096 × width 64 in another. Hand-building a new array and its periphery from scratch for every configuration is not feasible.

The solution is a parameterized IP: instead of a fixed design, you ship a tool that accepts parameters and generates the correct implementation automatically. That tool is the memory compiler. Given depth, width, port count, and a few other knobs, it produces a complete, tapeout-ready SRAM macro.

That macro is self-contained: the bitcell array plus all the peripheral circuitry needed to operate it (row/column decoders, sense amplifiers, write drivers, and self-timed control). What the compiler does not generate is the memory controller, the system-level logic that decides what to read and write, handles arbitration and BIST sequencing, and lives in the SoC RTL outside the macro.

  Leaf cell views              ┌──────────────────────────────┐     Generated View
  (from hardware team):        │        Memory Compiler       │
                               │                              │
   .gds / .oas (layout) ──────>│                              │───> .gds / .oas
   .lef        (abstract) ────>│                              │───> .lef
   .cdl / .sp  (netlist) ─────>│                              │───> .lib
   .lib        (timing)  ─────>│                              │───> .v  (Verilog)
                               │                              │───> .sp / .cdl
  Parameters:                  │                              │───> .cpf / .upf
                               │                              │───> .pat  (ATPG/MBIST)
   depth            ──────────>│                              │───> .pdf  (Datasheet)
   width            ──────────>│                              │
   port             ──────────>│                              │
   mux factor       ──────────>│                              │
   corner (PVT)     ──────────>│                              │
                               └──────────────────────────────┘

Output Views

| View | Format | Used by |
|------|--------|---------|
| Layout | GDS / OASIS | Tapeout, sent to fab |
| Abstract | LEF | P&R (Place and Route); exposes pins + blockage only |
| Timing | .lib (Liberty) | STA (Static Timing Analysis): setup/hold, access time, power |
| Behavioral | Verilog / SV | RTL simulation |
| Netlist | SPICE / CDL | SPICE simulation, LVS |
| Power | .lib / CPF / UPF | Power analysis |
| Test | ATPG / MBIST pattern | DFT (Design for Testability) / manufacturing test |

4. Two Teams, One Boundary

A memory compiler sits at the intersection of two teams with very different jobs.

The Hardware Team: Designing the Leaf Cell

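The hardware team (circuit designers and layout engineers) designs and verifies the leaf cell: the smallest possible SRAM unit. Designing a single 6T bitcell involves resolving multiple interacting constraints: read stability (the beta ratio), writability (the pull-up ratio), and a layout that stays DRC-clean when the cell abuts its neighbors.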

The hardware team delivers a small set of verified, hand-crafted files — GDS, CDL, LEF, .lib — for that one leaf cell. Their job ends at the single-cell boundary.

The CAD Team: Building the Compiler

The CAD team (or memory compiler team) takes those verified leaf cells and builds the automation that scales them to any size. Their job is to answer: given a leaf cell that works correctly in isolation, how do you tile it into an array of arbitrary depth × width while guaranteeing:

  1. LVS passes — GDS and netlist tiling must match exactly
  2. DRC passes — abutment boundaries must be clean at every process node's design rules
  3. All output views are consistent — the .lib timing model, the Verilog behavioral model, the SPICE netlist, and the GDS all describe the same circuit

Why Not Just Write a Script?

A script-based flow (e.g., using Tcl or Python) is a common initial approach for structural assembly. While functional for simple cases, it introduces specific limitations as complexity scales:

| Problem | Why a script struggles |
|---------|------------------------|
| Multiple output formats | Each format has different tiling logic; keeping them in sync manually is error-prone |
| LVS correctness | A single off-by-one in row numbering silently produces an LVS mismatch across thousands of nets |
| New configurations | Adding a new mux ratio or port combination requires touching tiling logic in every format independently |
| Regression safety | No shared data model means a fix in GDS tiling doesn't automatically propagate to netlist tiling |
| Bitcell swap | Changing the leaf cell requires hunting down every hardcoded assumption across every format script |

The last point deserves emphasis. In a script-based flow, the bitcell's geometry and netlist structure leak into the tiling logic — net names, transistor counts, pin locations get hardcoded. Swap a 6T cell for an 8T cell, and the script breaks in multiple places simultaneously, across multiple files, in ways that may not be immediately obvious.

A memory compiler treats the bitcell views (GDS, CDL, .lib) as inputs, not assumptions baked into the code. The tiling engine doesn't know or care what's inside the cell — it only knows how to replicate and connect it. Swap the input files, and every output format updates automatically. Consistency is structural, not manual.

A memory compiler solves this by having one data structure drive all emitters. The tiling logic runs once; GDS, CDL, Verilog, and .lib are all derived outputs. This is the same architectural insight behind any compiler: separate the representation of intent from the multiple backends that render it.
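To make "one data structure drives all emitters" concrete, here is a minimal sketch in which a single model derives the port list once and two emitters render it. Everything here is hypothetical: the class `MacroModel`, the port names (CLK, WEN, A, D, Q), and the output shapes are illustrative and not taken from any real compiler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MacroModel:
    """Single source of truth; every emitter reads only this."""
    name: str
    depth: int  # number of words
    width: int  # bits per word

    @property
    def addr_bits(self) -> int:
        # Bits needed to address `depth` words (at least 1).
        return max(1, (self.depth - 1).bit_length())

    @property
    def ports(self) -> list:
        # (port name, bit width), derived once and shared by all views.
        return [("CLK", 1), ("WEN", 1), ("A", self.addr_bits),
                ("D", self.width), ("Q", self.width)]


def emit_verilog_stub(m: MacroModel) -> str:
    """Render the port list as a Verilog module header."""
    decls = []
    for port, bits in m.ports:
        direction = "output" if port == "Q" else "input"
        rng = f"[{bits - 1}:0] " if bits > 1 else ""
        decls.append(f"  {direction} {rng}{port}")
    return f"module {m.name} (\n" + ",\n".join(decls) + "\n);\nendmodule\n"


def emit_cdl_header(m: MacroModel) -> str:
    """Render the same port list as a .SUBCKT line with expanded buses."""
    pins = []
    for port, bits in m.ports:
        pins += [port] if bits == 1 else [f"{port}[{i}]" for i in range(bits)]
    return f".SUBCKT {m.name} " + " ".join(pins)


model = MacroModel("sram_512x32", depth=512, width=32)
print(emit_verilog_stub(model))
print(emit_cdl_header(model)[:60], "...")
```

Neither emitter computes the address width itself, so the two views cannot disagree about it. Changing `depth` changes both outputs in one place.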


5. The Parameter Space

depth × width — Array Shape

        <──── width = 32 bits ────>
   ┌──┬──┬──┬──┬──┬──┬──┬── ... ──┐  ↑
   │  │  │  │  │  │  │  │         │  │
   ├──┼──┼──┼──┼──┼──┼──┼── ... ──┤  │
   │  │  │  │  │  │  │  │         │  depth = 1024 words
   ├──┼──┼──┼──┼──┼──┼──┼── ... ──┤  │
   │  │  │  │  │  │  │  │         │  │
   └──┴──┴──┴──┴──┴──┴──┴── ... ──┘  ↓
   Total bits = 1024 × 32 = 32,768 bits

port — Independent Access Channels

| Feature | 1-Port (1P / 1RW) | Multi-Port (e.g., 2R1W) |
|---------|-------------------|-------------------------|
| Address buses | 1 | 2+ (separate read addr, write addr) |
| Concurrency | Read or Write | Read and Write simultaneously |
| Bitcell | Standard 6T | 8T (isolated read port) |
| Area | High density | ~1.5x–2.0x penalty |
| Applications | L1/L2/L3 caches | CPU register files, FIFOs |

Register files require multi-port arrays to prevent pipeline stalls. Executing ADD R1, R2, R3 requires reading R2 and R3 simultaneously, while an earlier instruction may be writing its own result back in the same cycle. A 1-port SRAM would serialize these accesses and break pipeline concurrency. Consequently, register files absorb the 8T area penalty to sustain throughput.

mux factor — Column Multiplexer Ratio

Sense amplifiers are expensive in area. The mux factor decides how many bitline columns share one sense amp:

  mux=1 (no mux):              mux=4:
  one SA per column             one SA per 4 columns
  ┌─┬─┬─┬─┐                   ┌─┬─┬─┬─┐
  │ │ │ │ │                   │ │ │ │ │
 SA SA SA SA                  └─┴─┴─┴─┘
                                     SA
  fastest, widest layout        slower, narrower layout

  // SA = Sense Amplifier: detects the small voltage differential
  //      on BL/BLB and amplifies it to a full logic level

An easy-to-miss point: mux factor doesn't only affect area; it directly affects latency. The column mux sits in series with the read path, so column decode and the mux's pass gates add delay and capacitance between the bitlines and the sense amp, which pushes out the sense amp enable. A significant portion of cache access latency is hiding inside the column mux. This is a first-class trade-off knob in L1 microarchitecture design, not just a layout convenience.
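The effect of the mux factor on array shape can be worked out with simple arithmetic. The sketch below assumes one common organisation, in which `mux` words are interleaved along each physical row and the column mux selects one of them. That layout and the function name `physical_shape` are assumptions, not something the leaf cell dictates.

```python
def physical_shape(depth: int, width: int, mux: int) -> dict:
    """Map a logical depth x width onto physical rows and columns.

    Assumes `mux` words share one wordline, so the column mux picks
    one word out of each row.
    """
    if mux < 1 or depth % mux != 0:
        raise ValueError("depth must be a multiple of the mux factor")
    return {
        "rows": depth // mux,        # wordlines; sets bitline length
        "cols": width * mux,         # bitline pairs; sets wordline length
        "sense_amps": width,         # one per output bit after the mux
        "total_bits": depth * width,
    }


for mux in (1, 4, 8):
    print(mux, physical_shape(1024, 32, mux))
```

Under this assumption the total bit count never changes; the mux factor only trades rows for columns, which is why it shows up in both area and timing.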

corner — PVT (Process, Voltage, Temperature)

  Process:  TT (typical), SS (slow), FF (fast)
  Voltage:  0.72V (low),  0.8V (nom),  0.88V (high)
  Temp:     -40°C, 25°C, 125°C

  SS + high temp + low voltage  → worst-case timing (setup check)
  FF + low temp  + high voltage → best-case timing  (hold check)

Timing is characterized separately at each corner (in practice, usually as one .lib file per PVT corner), and the generated macro must meet timing in Static Timing Analysis (STA) at every corner the SoC signs off.
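The corner list is a cross product of the three axes, which is easy to enumerate. This sketch uses the values from the table above; the corner naming scheme in `corner_name` is hypothetical, since every vendor has its own.

```python
from itertools import product

PROCESS = ("SS", "TT", "FF")
VOLTAGE = (0.72, 0.80, 0.88)  # volts, from the table above
TEMP_C = (-40, 25, 125)

corners = list(product(PROCESS, VOLTAGE, TEMP_C))


def corner_name(process: str, volt: float, temp: int) -> str:
    """Build a label such as 'ss_0p72v_125c' (naming is illustrative)."""
    v = f"{volt:.2f}".replace(".", "p")
    t = f"m{-temp}" if temp < 0 else str(temp)
    return f"{process.lower()}_{v}v_{t}c"


# The two extremes called out in the text.
setup_corner = ("SS", min(VOLTAGE), max(TEMP_C))  # slowest: setup check
hold_corner = ("FF", max(VOLTAGE), min(TEMP_C))   # fastest: hold check

print(len(corners))
print(corner_name(*setup_corner), corner_name(*hold_corner))
```

Three values on each axis already give 27 combinations, which is why characterization is automated rather than done by hand.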

Putting It Together: Full Array Structure

                    BL_0  BL_1  BL_2  BL_3  BL_4  BL_5  BL_6  BL_7
                     │     │     │     │     │     │     │     │
  addr[A-1:M]        ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
  (high bits)   ┌────────────────────────────────────────────────────┐
  ┌───────┐     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
  │   R   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_0
  │   o   │     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
  │   w   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_1
  │       │     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
  │   D   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_2
  │   e   │     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
  │   c   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_3
  └───────┘     │     6T bitcell array (depth rows × width cols)     │
                └────────────────────────────────────────────────────┘
                     │     │     │     │     │     │     │     │
                     ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
  addr[M-1:0]   ┌────────────────────────────────────────────────────┐
  (low bits)    │                Column MUX / Decoder                │
  select col    └─────────────────────────┬──────────────────────────┘
                                          │  (mux=2: 8 cols → 4 sense amps → 4 outputs)
                                          ▼
                ┌────────────────────────────────────────────────────┐
                │        Sense Amplifiers (one per mux group)        │
                └─────────────────────────┬──────────────────────────┘
                                          │
                                          ▼
                                Data I/O (width bits)
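The address split in the diagram (high bits to the row decoder, low M bits to the column select) comes down to a shift and a mask. A minimal sketch, assuming the mux factor is a power of two; the function name `split_address` is made up for illustration.

```python
def split_address(addr: int, depth: int, mux: int) -> tuple:
    """Return (row, column_select) for a word address.

    The low log2(mux) bits pick the column within a mux group and
    the remaining high bits pick the wordline.
    """
    if mux < 1 or mux & (mux - 1):
        raise ValueError("mux must be a power of two")
    if not 0 <= addr < depth:
        raise ValueError("address out of range")
    m_bits = mux.bit_length() - 1  # M in addr[M-1:0]
    return addr >> m_bits, addr & (mux - 1)


print(split_address(0b1011, depth=16, mux=4))
print(split_address(5, depth=1024, mux=1))
```

With mux=1 there are no column-select bits at all, so the whole address goes to the row decoder.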

6. Cache Hierarchy and SRAM Selection

Different cache levels have different requirements, and those requirements directly determine which bitcell to use:

| Cache | Distance from CPU | Priority | Cell choice | Reason |
|-------|-------------------|----------|-------------|--------|
| L1 | Nearest | Speed, concurrent R/W | 6T or 8T (design-dependent) | Pipeline needs read + write in the same cycle; high-perf designs often use fast 6T, multi-port 8T is more common in register files |
| L2 | Middle | Balance | 6T 1-port | Speed matters, but so does density |
| L3 | Farthest | Capacity, low power | 6T 1-port (dense) | Area is the primary concern; PPA must hold |

L1 benefits most from concurrent read and write because the pipeline demands it, which is the argument for 8T there. But an 8T array is 1.5x–2x the area of 6T. At L3 scale (tens of megabytes), the area overhead of 8T becomes prohibitive, making it difficult to satisfy standard PPA (Power, Performance, Area) targets.

In SoC modeling, all of these decisions need to be captured together, ideally in a cycle-latency-accurate performance simulator, so different cache hierarchy configurations can be evaluated against actual performance targets. I built Stratum for exactly this — a configurable cache simulator where hierarchy topology is bound at compile time, mirroring how memory systems are fixed at hardware synthesis.


7. How the Array Is Built: Tiling in Two Representations

How does a memory compiler go from a single leaf cell to a full array? Tiling — like laying floor tiles:

leaf cell (6T)
┌──┐
│  │  ×  (depth rows × width cols)  →  complete bitcell array
└──┘

The critical constraint: the compiler must produce two representations simultaneously, and they must match exactly:

GDS  --[extraction]--> netlist A --+
                                   +--> [LVS tool] --> PASS or FAIL
CDL / SPICE netlist B -------------+

This is LVS (Layout vs Schematic). An EDA (Electronic Design Automation) tool like Calibre reads the GDS polygons, infers transistors and connections, and compares the resulting netlist against the schematic netlist you generated. One mismatched net name, one missing transistor → LVS fails → tapeout blocked.

GDS Tiling (layout layer)
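On the layout side, tiling means placing one reference to the leaf cell per (row, column) on a regular pitch. The sketch below only computes placements; it does not write GDS. The instance naming scheme and the cell dimensions are invented for illustration, and row mirroring is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    inst: str  # instance name; must match the netlist side
    x: float   # origin in microns
    y: float


def tile_layout(rows: int, cols: int, cell_w: float, cell_h: float) -> list:
    """Place one leaf cell per (row, col) on a gapless grid."""
    return [
        Placement(f"X_r{r}_c{c}", c * cell_w, r * cell_h)
        for r in range(rows)
        for c in range(cols)
    ]


placements = tile_layout(rows=4, cols=8, cell_w=0.5, cell_h=0.25)
print(len(placements), placements[-1])
```

Because cells abut with no gap, shared wires such as wordlines and bitlines connect by geometry alone. That is exactly what the extraction step of LVS later has to rediscover.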

Netlist Tiling (schematic layer)
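On the schematic side, the same grid becomes a list of subcircuit instances whose net names encode the sharing: one wordline per row, one bitline pair per column. This is a sketch under assumptions: the leaf cell name `bitcell_6t` and its pin order (BL BLB WL VDD VSS) are hypothetical.

```python
def tile_netlist(rows: int, cols: int, cell: str = "bitcell_6t") -> list:
    """Emit one subcircuit instance line per (row, col).

    Assumed leaf-cell pin order: BL BLB WL VDD VSS.
    Cells in a row share WL_<r>; cells in a column share BL_<c>/BLB_<c>.
    """
    lines = []
    for r in range(rows):
        for c in range(cols):
            lines.append(f"X_r{r}_c{c} BL_{c} BLB_{c} WL_{r} VDD VSS {cell}")
    return lines


netlist = tile_netlist(rows=4, cols=8)
print(netlist[0])
print(netlist[-1])
```

The row and column indices here must be the same indices the layout tiler uses; a different numbering on either side is precisely the kind of mismatch LVS reports.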

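Common LVS failures in a memory compiler: a net name that differs between the layout labels and the netlist, an off-by-one in row or column numbering, and an instance present in one representation but missing from the other.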

This is why memory compilers demand extreme rigor: GDS and netlist tiling logic must share a single source of truth — one data structure drives both emitters. Any new feature (new mux ratio, new port combination) must not break LVS, and automated regression tests (pytest) are the only way to sleep at night.

Cell Flipping: Mirrored Abutment

When tiling, adjacent rows of bitcells are not simply copied — they are vertically mirrored before being placed:

WL0: normal orientation  → top: Vdd,  bottom: GND
WL1: flipped upside-down → top: GND,  bottom: Vdd

→ WL0's bottom GND and WL1's top GND overlap perfectly
→ two rows share a single GND rail; no extra DRC spacing needed
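The mirroring rule can be expressed as a per-row orientation plus an origin shift. This is a minimal sketch with several assumptions: row 0 sits at the top and rows stack downward (as in the WL0/WL1 example), the leaf cell's origin is its bottom-left corner, and "R0"/"MX" are used as orientation labels in the style of common layout tools.

```python
def row_rails(row: int) -> tuple:
    """(top_rail, bottom_rail) for a row; odd rows are mirrored."""
    return ("VDD", "GND") if row % 2 == 0 else ("GND", "VDD")


def row_origin(row: int, cell_h: float) -> tuple:
    """(y_origin, orientation) for a row of bitcells.

    A cell mirrored about the x-axis ("MX") extends below its origin
    instead of above it, so mirrored rows use the top edge of their
    slot as the origin.
    """
    slot_top = -row * cell_h
    if row % 2 == 0:
        return slot_top - cell_h, "R0"
    return slot_top, "MX"


def shared_rails(rows: int) -> list:
    """Rail shared across each boundary between row r and row r + 1."""
    shared = []
    for r in range(rows - 1):
        bottom_of_upper = row_rails(r)[1]
        top_of_lower = row_rails(r + 1)[0]
        assert bottom_of_upper == top_of_lower, "abutting rails must match"
        shared.append(bottom_of_upper)
    return shared


print(shared_rails(4))
print([row_origin(r, 0.25) for r in range(4)])
```

The shared rails alternate between GND and VDD down the array, so every boundary is shared and no row needs a private rail on both edges.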

8. What Makes It Production-Grade

EMA — A Post-Silicon Timing Knob

Memory internal read/write timing is asynchronous and self-timed: dummy bitlines and delay cells decide when to fire the sense amp, independent of the external clock.

When process variation causes read failures or insufficient write margin, EMA (Extra Margin Adjustment) pins provide a post-fabrication fix by adjusting the internal self-timing delay.

Increasing EMA gives the bitcell more time to pull down the bitline → larger differential voltage at the sense amp → more read margin → better yield, at the cost of slightly longer access time. This is a trade-off between speed and yield that can still be tuned after silicon comes back.

Low Power: VDDC vs VDDP Dual-Rail

SRAM arrays are often the largest, leakiest structures on an SoC. Memory compilers must support power gating and generate corresponding UPF/CPF files:

| Rail | Supplies | Can be shut off? |
|------|----------|------------------|
| VDDC | 6T bitcell array itself | No (kills data) |
| VDDP | Address decoders, sense amps, control | Yes |

Three power states:

  1. Light Sleep: gate peripheral clocks, reduce dynamic switching power
  2. Deep Sleep (Retention): shut off VDDP (periphery fully off), reduce VDDC voltage just enough to keep cross-coupled inverters locked — data retained
  3. Shutdown: both VDDC and VDDP off — all data lost
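The three power states can be written down as a small table of rail conditions. This sketch adds an "active" state for comparison, which is not in the list above; the state names and the `Rail` enum are illustrative.

```python
from enum import Enum


class Rail(Enum):
    ON = "on"
    RETENTION = "retention"  # lowered, but high enough to hold the latch
    OFF = "off"


# state -> (VDDC, VDDP)
POWER_STATES = {
    "active":      (Rail.ON,        Rail.ON),
    "light_sleep": (Rail.ON,        Rail.ON),   # only peripheral clocks gated
    "deep_sleep":  (Rail.RETENTION, Rail.OFF),
    "shutdown":    (Rail.OFF,       Rail.OFF),
}


def data_retained(state: str) -> bool:
    """Contents survive as long as the array rail VDDC is not fully off."""
    vddc, _ = POWER_STATES[state]
    return vddc is not Rail.OFF


def accessible(state: str) -> bool:
    """Reads and writes need both rails up and the clocks running."""
    vddc, vddp = POWER_STATES[state]
    return state == "active" and vddc is Rail.ON and vddp is Rail.ON


for name in POWER_STATES:
    print(name, data_retained(name), accessible(name))
```

A table like this is roughly the information a generated UPF/CPF file has to encode: which supply feeds which part of the macro, and which combinations are legal.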

Redundancy — Repairing Manufacturing Defects

Tiling millions of deep sub-micron transistors at maximum density means manufacturing defects are statistically unavoidable. Discarding a whole SoC for one dead bitcell would be economically catastrophic.

The solution is built-in redundancy:

  1. The compiler automatically generates spare redundant columns alongside the main array
  2. During manufacturing test, the MBIST engine identifies all failing addresses
  3. An eFuse is permanently blown, recording the bad column
  4. On every power-up, the fuse value is read back and MUX logic redirects that column's I/O to the redundant spare
  5. The memory appears completely intact to the outside world
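The redirection in step 4 can be sketched as a column remap. The version below uses shift redundancy, where every column at or beyond the defect moves over by one toward the spare. That is one common scheme and an assumption here; the steps above do not say which scheme a given compiler uses.

```python
def remap_column(logical_col: int, bad_col, n_cols: int) -> int:
    """Map a logical column to a physical column, skipping `bad_col`.

    Physical columns are 0..n_cols, where index n_cols is the spare.
    `bad_col` is the fused defect location, or None if nothing was repaired.
    """
    if not 0 <= logical_col < n_cols:
        raise ValueError("logical column out of range")
    if bad_col is None or logical_col < bad_col:
        return logical_col
    return logical_col + 1  # shift past the defect toward the spare


n_cols, bad = 8, 3
mapping = [remap_column(c, bad, n_cols) for c in range(n_cols)]
print(mapping)
```

The logical column numbers seen from outside never change; only the physical column behind each one does, which is why the repair is invisible to the rest of the chip.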

9. Two Directions of Tiling

The concept of "tiling" exists in both EDA and ML compiler domains, mapping to structurally opposite operations.

Memory Compiler Tiling: Small → Large (Assembling)

Memory compiler tiling takes a designed leaf cell and replicates it into a full array, like laying floor tiles:

small leaf cell  →  tile into  →  large SRAM array

   ┌──┐                             ┌────────────┐
   │6T│  ×  (R rows × C cols)   →   │            │
   └──┘                             │   array    │
                                    │            │
                                    └────────────┘

Direction: bottom-up, assembling

ML Compiler Tiling: Large → Small (Decomposing)

ML compiler tiling (e.g., GEMM (General Matrix Multiplication) on an NPU (Neural Processing Unit)) takes a large computation and cuts it into pieces that fit the hardware:

full GEMM computation (M × N × K)  →  cut into  →  hardware-sized tiles

┌──────────────────┐                   ┌──┬──┬──┐
│                  │                   │  │  │  │
│   A (M×K)        │  →   tiling   →   ├──┼──┼──┤
│   × B (K×N)      │                   │  │  │  │
│                  │                   └──┴──┴──┘
└──────────────────┘             each tile fits in SRAM + systolic array

The constraint: an NPU's systolic array is a fixed size, and the on-chip SRAM is bounded. You can't feed the whole matrix at once. Tiling decomposes the computation into sizes that maximize hardware utilization and minimize memory bandwidth bubbles.
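A tiled matrix multiply shows the decomposition directly. This is a plain-Python sketch on lists of lists, not how an NPU compiler actually emits code: the single `tile` parameter stands in for whatever block size the systolic array and SRAM can hold.

```python
def matmul_tiled(a, b, tile: int):
    """C = A x B computed tile by tile.

    a is M x K, b is K x N. Each (i0, j0, k0) step touches only
    tile x tile blocks, the part that would sit in on-chip SRAM.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # min() handles edge tiles when a dimension is not
                # a multiple of `tile`.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]       # 2 x 3
b = [[7, 8], [9, 10], [11, 12]]  # 3 x 2
print(matmul_tiled(a, b, tile=2))
```

The result is the same for any tile size; what changes is how much data has to be resident at once, which is the quantity the hardware constrains.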

Direction: top-down, decomposing

The Symmetry

| Context | Input → Output | Direction |
|---------|----------------|-----------|
| Memory Compiler | leaf → array | small → large |
| ML Compiler | computation → tile | large → small |

10. Two Open Questions

Chisel and the Memory Compiler Interface

Chisel treats memory as a first-class language primitive. When you write SyncReadMem(1024, UInt(32.W)), you are not instantiating a specific SRAM macro — you are declaring intent: I need something that behaves like a 1024-entry synchronous-read memory. The implementation is deliberately unspecified.

This abstraction breaks down at tapeout. Modern CAD tools cannot synthesize SRAM macros from an RTL description; without intervention, FIRRTL maps all SeqMem instances to flip-flop arrays, which are functionally correct but physically unroutable at scale. The fix is a FIRRTL transform called ReplSeqMem: it scans the design, converts every SeqMem above a size threshold into an external module reference (a black box with only pins visible), and outputs a .conf file listing every unique SRAM configuration the design requires.

That .conf file is then consumed by MacroCompiler (part of Chipyard's Tapeout-Tools). MacroCompiler is also given an .mdf file describing either the available vendor SRAM macros or the capabilities of the foundry's memory compiler. It matches the requested configurations against what is available and emits the technology-mapped Verilog — or, if no direct match exists, passes the request to the memory compiler itself to generate a new macro.

Chisel SyncReadMem           FIRRTL ReplSeqMem          MacroCompiler
(abstract intent)   ─────►  (.conf: what is needed) ──► (maps to vendor SRAM or calls memory compiler)

Each layer separates intent from implementation: Chisel at the language level, ReplSeqMem at the IR level, and MacroCompiler at the physical level. The traditional memory compiler sits at the bottom of this stack, still responsible for generating actual GDS and netlists, but it is now invoked programmatically rather than by hand.

What remains unsettled is the interface. The .conf / .mdf format is UCB-specific and not an industry standard. As Chisel and CIRCT (the LLVM/MLIR-based FIRRTL compiler) gain adoption, this plumbing will need to standardize.


Compute-In-Memory and the Storage/Compute Boundary

Throughout this article the memory macro has been pure storage: data in, data out, computation elsewhere. Compute-In-Memory (CIM) erases that separation.

The core idea is straightforward. In a standard read, a single WL is asserted and one row drives the bitlines. In analog CIM, multiple WLs are asserted simultaneously. Each active bitcell contributes a small current to the shared BL proportional to its stored bit. The total BL discharge becomes a current accumulation — a dot product, computed in the analog domain without ever moving data out of the array. An ADC at the column sense amp converts the result.

Standard read:   assert 1 WL  → read 1 row
Analog CIM:      assert N WLs → BL current ∝ Σ(weight_i × input_i) = dot product
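At the level of arithmetic, the two lines above differ only in how many wordlines are active. The sketch below is an idealised model with one unit of current per conducting cell and single-bit weights and inputs. It ignores every analog non-ideality, and the function name `bitline_sum` is made up for illustration.

```python
def bitline_sum(weights: list, inputs: list) -> int:
    """Idealised analog CIM column: units of current on the bitline.

    weights[i] is the bit stored in row i; inputs[i] is 1 if WL_i is
    asserted. A cell adds one unit only if both are 1.
    """
    if len(weights) != len(inputs):
        raise ValueError("one input per row is required")
    return sum(w * x for w, x in zip(weights, inputs))


weights = [1, 0, 1, 1, 0, 1, 0, 1]

# Standard read: exactly one wordline asserted, result is that row's bit.
one_hot = [0, 0, 1, 0, 0, 0, 0, 0]
print(bitline_sum(weights, one_hot))

# Analog CIM: several wordlines asserted, result is a dot product.
inputs = [1, 1, 0, 1, 0, 1, 1, 1]
print(bitline_sum(weights, inputs))
```

A standard read is the special case of a one-hot input vector, which is why the same array can serve both roles.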

The 6T cell's physics — the same physics that creates Read Disturbance and necessitates the Beta Ratio — becomes the compute primitive. The bitcell is not being repurposed; it is being used for something its physics already enables, just never intentionally.

Digital CIM (DCIM) takes a different path: it adds explicit logic gates alongside the bitcells so that computation is fully digital and deterministic, at the cost of area. Recent work (e.g., 12nm DCIM at 137 TOPS/W) fits this approach into foundry 8T bitcells, making it compatible with standard memory compiler flows.

Traditional memory compilers have a strict internal model: the array is storage, the periphery is control, and the two are generated separately. CIM breaks this model. The array now participates in computation; the sense amplifier periphery doubles as an ADC; power and timing budgets span both. A CIM-aware memory compiler cannot treat the array and its periphery as independent subsystems. Some researchers have noted that CIM macros are amenable to automated design via memory compilers, but what that actually requires — a compiler that co-generates storage geometry and compute periphery from a unified specification — does not yet exist in any standardized form.