What "Memory Compiler" Actually Means: From Bitcells to GDS Tiling

May 29, 2026 [compiler-design] #memory-compiler #sram #eda #layout-vs-schematic #tiling

Most people see the name "memory compiler" and have no idea what it actually does.

1. Compiler Representation

The classical representation is the T-diagram (or Tombstone diagram). It characterizes a compiler by three languages: the source language it reads (A), the target language it emits (B), and the implementation language it is written in (C):

┌───┬───┐
│ A → B │
└───┴───┘
    │ C

Read it as: "a program that translates A into B, written in C."

The textbook example is C → machine code. But the definition is broader than that:

Typical compiler: C / C++ → assembly / machine code
The compiler I work on at Synopsys: VHDL / Verilog RTL → simulation / debugging database
Memory compiler: design parameters (depth, width, port…) → all views needed to tape out a chip
Shader compiler: GLSL / HLSL → GPU machine code (e.g., Mesa's NIR pipeline, DXC)
Query compiler: SQL → a physical execution plan (e.g., PostgreSQL's planner/executor)
Bytecode compiler: Java source → JVM bytecode (javac), or JavaScript → V8 bytecode (Ignition)

2. The Raw Material: Inside a Bitcell

The fundamental building block of any SRAM (Static Random-Access Memory) is the bitcell — a tiny circuit that holds one bit. Hardware teams design and characterize these cells by hand; the memory compiler's job is to replicate them at scale.

The most common variant is the 6T bitcell (six transistors). Three signal lines connect it to the outside world: WL (Word Line — selects a row), BL (Bit Line), and BLB (Bit Line Bar, the complementary signal):

                WL
══════════════════╦══════════════════════════════╦════════
                  ║                              ║
            ┌─────╫──────────────────────────────╫─────┐
            │   [ M5 ]                         [ M6 ]  │
            │     │                              │     │
       BL ──┼─────●───── Q               Q' ─────●─────┼── BLB
            │            │               │             │
            │            │   ┌───────┐   │             │
            │            ├─►─┤ INV_R ├───┤             │
            │            │   │(M2,M4)│   │             │
            │            │   └───────┘   │             │
            │            │               │             │
            │            │   ┌───────┐   │             │
            │            ├───┤ INV_L ├─◄─┤             │
            │            │   │(M1,M3)│   │             │
            │            │   └───────┘   │             │
            │                                          │
            │                bitcell (6T)              │
            └──────────────────────────────────────────┘

Transistor count:

INV_L: PMOS M3 + NMOS M1 = 2
INV_R: PMOS M4 + NMOS M2 = 2
Access transistors: M5, M6 = 2
Total: 6 → "6T"

Positive Feedback: How a Bit Gets Locked

Two inverters connected head-to-tail (cross-coupled):

Assume Q = 1 (high, Vdd)
INV_R receives 1, outputs 0 to Q'
INV_L receives 0, outputs 1 back to Q
Q stays 1 — a perfect closed loop

This is positive feedback + bistability. As long as Vdd is present, Q and Q' are locked at opposite voltages indefinitely. The stored bit is literally just the voltage sitting on those two nodes. SRAM is called "Static" because — unlike DRAM — it never needs to refresh.

Three Operations

Hold (Standby)

WL = 0, access transistors off
Cross-coupled inverters maintain state entirely on their own; only leakage current flows
Strictly speaking, Hold is a state, not an operation — but datasheets list it alongside Read/Write because designers need leakage current figures for power budgeting

Read

Precharge BL and BLB to Vdd
Assert WL high; M5 and M6 turn on
The side storing 0 pulls its bitline down by ~100 mV
Sense amplifier amplifies the small differential → 1 bit out

Read Disturbance: the moment WL turns on, the high-voltage BLB can pull up Q' (the node storing 0) slightly through M6. If it rises past the switching threshold of INV_L, the cell flips — a Destructive Read. The fix is making the pull-down NMOS M2 stronger (wider) than the access transistor M6. This strength ratio is the Beta Ratio (β = W_pull-down / W_access), typically required to be > 1. This is why the six transistors in a standard bitcell are not all the same size.

Write

Write driver forces BL and BLB to target values (one high, one low)
Assert WL high
Write driver strength overcomes the cell's pull-up → forces the cell to flip
The sizing trade-off between these forces is called the pull-up ratio

Beyond 6T: Other Bitcell Types

6T is not the only option. Different use cases demand different trade-offs:

Cell	Transistors	Key property	Typical use
6T	6	High density, standard read/write	L2/L3 caches
8T	8	Isolated read port, no read disturb	L1 caches, register files
10T	10	Ultra-low voltage operation	Near-threshold designs

The 8T cell adds a dedicated read path (two extra transistors) so the read operation never touches the storage nodes — eliminating read disturbance entirely at the cost of ~33% more area.

3. Why a Memory Compiler Exists

When a hardware team designs an SoC (System on Chip), they need SRAM — lots of it, at many different sizes. The fundamental unit they work with is a cell: the smallest verified building block of an SRAM array. But a chip might need an SRAM of depth 512 × width 32 in one place, and depth 4096 × width 64 in another. Redesigning a new cell from scratch for every configuration is not feasible.

The solution is a parameterized IP: instead of a fixed design, you ship a tool that accepts parameters and generates the correct implementation automatically. That tool is the memory compiler. Given depth, width, port count, and a few other knobs, it produces a complete, tapeout-ready SRAM macro.

A memory compiler generates a complete, self-contained SRAM macro — the bitcell array plus all the peripheral circuitry needed to operate it: row/column decoders, sense amplifiers, write drivers, and self-timed control. What it does not generate is the memory controller — the system-level logic that decides what to read and write, handles arbitration and BIST sequencing, and lives in the SoC RTL outside the macro.

Leaf cell views              ┌──────────────────────────────┐     Generated View
(from hardware team):        │        Memory Compiler       │
                             │                              │
 .gds / .oas (layout) ──────>│                              │───> .gds / .oas
 .lef        (abstract) ────>│                              │───> .lef
 .cdl / .sp  (netlist) ─────>│                              │───> .lib
 .lib        (timing)  ─────>│                              │───> .v  (Verilog)
                             │                              │───> .sp / .cdl
Parameters:                  │                              │───> .cpf / .upf
                             │                              │───> .pat  (ATPG/MBIST)
 depth            ──────────>│                              │───> .pdf  (Datasheet)
 width            ──────────>│                              │
 port             ──────────>│                              │
 mux factor       ──────────>│                              │
 corner (PVT)     ──────────>│                              │
                             └──────────────────────────────┘

Output Views

View	Format	Used by
Layout	GDS / OASIS	Tapeout, sent to fab
Abstract	LEF	P&R (Place and Route) — exposes pins + blockage only
Timing	`.lib` (Liberty)	STA (Static Timing Analysis) — setup/hold, access time
Behavioral	Verilog / SV	RTL simulation
Netlist	SPICE / CDL	SPICE simulation, LVS
Power	`.lib` / CPF / UPF	Power analysis
Test	ATPG / MBIST pattern	DFT (Design for Testability) / manufacturing test

4. Two Teams, One Boundary

A memory compiler sits at the intersection of two teams with very different jobs.

The Hardware Team: Designing the Leaf Cell

The hardware team — circuit designers and layout engineers — designs and verifies the leaf cell: the smallest possible SRAM unit. Designing a single 6T bitcell involves resolving multiple interacting constraints:

Transistor sizing (Beta ratio, pull-up ratio) to balance read stability vs. write ability
Full-custom layout at the process node's design rules
Characterization across all PVT corners (dozens of SPICE simulations)
Sign-off: DRC (Design Rule Check), LVS (Layout vs Schematic), antenna checks, electromigration

The hardware team delivers a small set of verified, hand-crafted files — GDS, CDL, LEF, .lib — for that one leaf cell. Their job ends at the single-cell boundary.

The CAD Team: Building the Compiler

The CAD team (or memory compiler team) takes those verified leaf cells and builds the automation that scales them to any size. Their job is to answer: given a leaf cell that works correctly in isolation, how do you tile it into an array of arbitrary depth × width while guaranteeing:

LVS passes — GDS and netlist tiling must match exactly
DRC passes — abutment boundaries must be clean at every process node's design rules
All output views are consistent — the .lib timing model, the Verilog behavioral model, the SPICE netlist, and the GDS all describe the same circuit

Why Not Just Write a Script?

A script-based flow (e.g., using Tcl or Python) is a common initial approach for structural assembly. While functional for simple cases, it introduces specific limitations as complexity scales:

Problem	Why a script struggles
Multiple output formats	Each format has different tiling logic; keeping them in sync manually is error-prone
LVS correctness	A single off-by-one in row numbering silently produces an LVS mismatch across thousands of nets
New configurations	Adding a new mux ratio or port combination requires touching tiling logic in every format independently
Regression safety	No shared data model means a fix in GDS tiling doesn't automatically propagate to netlist tiling
Bitcell swap	Changing the leaf cell requires hunting down every hardcoded assumption across every format script

The last point deserves emphasis. In a script-based flow, the bitcell's geometry and netlist structure leak into the tiling logic — net names, transistor counts, pin locations get hardcoded. Swap a 6T cell for an 8T cell, and the script breaks in multiple places simultaneously, across multiple files, in ways that may not be immediately obvious.

A memory compiler treats the bitcell views (GDS, CDL, .lib) as inputs, not assumptions baked into the code. The tiling engine doesn't know or care what's inside the cell — it only knows how to replicate and connect it. Swap the input files, and every output format updates automatically. Consistency is structural, not manual.

A memory compiler solves this by having one data structure drive all emitters. The tiling logic runs once; GDS, CDL, Verilog, and .lib are all derived outputs. This is the same architectural insight behind any compiler: separate the representation of intent from the multiple backends that render it.

5. The Parameter Space

depth × width — Array Shape

     <──── width = 32 bits ────>
┌──┬──┬──┬──┬──┬──┬──┬── ... ──┐  ↑
│  │  │  │  │  │  │  │         │  │
├──┼──┼──┼──┼──┼──┼──┼── ... ──┤  │
│  │  │  │  │  │  │  │         │  depth = 1024 words
├──┼──┼──┼──┼──┼──┼──┼── ... ──┤  │
│  │  │  │  │  │  │  │         │  │
└──┴──┴──┴──┴──┴──┴──┴── ... ──┘  ↓
Total bits = 1024 × 32 = 32,768 bits

port — Independent Access Channels

Feature	1-Port (1P / 1RW)	Multi-Port (e.g., 2R1W)
Address buses	1	2+ (separate read addr, write addr)
Concurrency	Read or Write	Read and Write simultaneously
Bitcell	Standard 6T	8T (isolated read port)
Area	High density	~1.5x–2.0x penalty
Applications	L1/L2/L3 caches	CPU register files, FIFOs

Register files require multi-port arrays to prevent pipeline stalls. Executing ADD R1, R2, R3 requires reading R2 and R3 simultaneously while writing back a previous instruction's result into R1. A 1-port SRAM forces serial execution, breaking pipeline concurrency. Consequently, register files absorb the 8T area penalty to sustain throughput.

mux factor — Column Multiplexer Ratio

Sense amplifiers are expensive in area. The mux factor decides how many bitline columns share one sense amp:

 mux=1 (no mux):              mux=4:
 one SA per column             one SA per 4 columns
 ┌─┬─┬─┬─┐                   ┌─┬─┬─┬─┐
 │ │ │ │ │                   │ │ │ │ │
SA SA SA SA                  └─┴─┴─┴─┘
                                    SA
 fastest, widest layout        slower, narrower layout

 // SA = Sense Amplifier: detects the small voltage differential
 //      on BL/BLB and amplifies it to a full logic level

An easy-to-miss point: mux factor doesn't only affect area — it directly affects latency. Every cycle spent selecting which column connects to the sense amp is latency. Higher mux factor means longer bitlines, more capacitance, and a slower sense amp enable. A significant portion of cache access latency is hiding inside the column mux — this is a first-class trade-off knob in L1 microarchitecture design, not just a layout convenience.

corner — PVT (Process, Voltage, Temperature)

Process:  TT (typical), SS (slow), FF (fast)
Voltage:  0.72V (low),  0.8V (nom),  0.88V (high)
Temp:     -40°C, 25°C, 125°C

SS + high temp + low voltage  → worst-case timing (setup check)
FF + low temp  + high voltage → best-case timing  (hold check)

The .lib file contains one timing table per corner. The memory compiler must pass Static Timing Analysis (STA) at every corner.

Putting It Together: Full Array Structure

                  BL_0  BL_1  BL_2  BL_3  BL_4  BL_5  BL_6  BL_7
                   │     │     │     │     │     │     │     │
addr[A-1:M]        ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
(high bits)   ┌────────────────────────────────────────────────────┐
┌───────┐     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
│   R   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_0
│   o   │     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
│   w   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_1
│       │     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
│   D   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_2
│   e   │     │   ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐  ┌──┐   │
│   c   │────►│   └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘  └──┘   │◄─ WL_3
└───────┘     │     6T bitcell array (depth rows × width cols)     │
              └────────────────────────────────────────────────────┘
                   │     │     │     │     │     │     │     │
                   ▼     ▼     ▼     ▼     ▼     ▼     ▼     ▼
              ┌────────────────────────────────────────────────────┐
              │         Sense Amplifiers (one per BL pair)         │
              └─────────────────────────┬──────────────────────────┘
                                        │  (mux=2: 8 cols → 4 sense amps → 4 outputs)
addr[M-1:0]                             ▼
(low bits)    ┌────────────────────────────────────────────────────┐
select col    │                Column MUX / Decoder                │
              └─────────────────────────┬──────────────────────────┘
                                        │
                                        ▼
                              Data I/O (width bits)

6. Cache Hierarchy and SRAM Selection

Different cache levels have different requirements, and those requirements directly determine which bitcell to use:

Cache	Distance from CPU	Priority	Cell choice	Reason
L1	Nearest	Speed, concurrent R/W	6T or 8T (design-dependent)	Pipeline needs read + write in the same cycle; high-perf designs often use fast 6T, multi-port 8T is more common in register files
L2	Middle	Balance	6T 1-port	Speed matters, but so does density
L3	Farthest	Capacity, low power	6T 1-port (dense)	Area is the primary concern; PPA must hold

L1 needs concurrent read and write because the pipeline demands it — hence 8T. But an 8T array is 1.5x–2x the area of 6T. At L3 scale (tens of megabytes), the area overhead of 8T becomes prohibitive, making it difficult to satisfy standard PPA (Power, Performance, Area) targets.

In SoC modeling, all of these decisions need to be captured together, ideally in a cycle-latency-accurate performance simulator, so different cache hierarchy configurations can be evaluated against actual performance targets. I built Stratum for exactly this — a configurable cache simulator where hierarchy topology is bound at compile time, mirroring how memory systems are fixed at hardware synthesis.

7. How the Array Is Built: Tiling in Two Representations

How does a memory compiler go from a single leaf cell to a full array? Tiling — like laying floor tiles:

leaf cell (6T)
┌──┐
│  │  ×  (depth rows × width cols)  →  complete bitcell array
└──┘

The critical constraint: the compiler must produce two representations simultaneously, and they must match exactly:

GDS  --[extraction]--> netlist A --+
                                   +--> [LVS tool] --> PASS or FAIL
CDL / SPICE netlist B -------------+

This is LVS (Layout vs Schematic). An EDA (Electronic Design Automation) tool like Calibre reads the GDS polygons, infers transistors and connections, and compares the resulting netlist against the schematic netlist you generated. One mismatched net name, one missing transistor → LVS fails → tapeout blocked.

GDS Tiling (layout layer)

Operates on geometry: polygons, vias, metal layers
Places GDS sub-cells on a grid with abutment-aligned boundaries: power rails, BL, and WL must connect cleanly at every edge
Boundary handling (abutment) is the hardest part: the bitcell array must interface with the row decoder, column mux, sense amp, and control logic
Output: .gds / .oas

Netlist Tiling (schematic layer)

Operates on instances and nets
Follows the same tiling pattern to instantiate and wire SPICE/CDL sub-circuits
Output: .sp / .cdl / Verilog netlist

Common LVS failures in a memory compiler:

BL_0 in the netlist is connected to BL_1 in the GDS abutment
GDS tiling added one extra row of bitcells but the netlist wasn't updated
A missing via in GDS leaves two nets floating that should be shorted

This is why memory compilers demand extreme rigor: GDS and netlist tiling logic must share a single source of truth — one data structure drives both emitters. Any new feature (new mux ratio, new port combination) must not break LVS, and automated regression tests (pytest) are the only way to sleep at night.

Cell Flipping: Mirrored Abutment

When tiling, adjacent rows of bitcells are not simply copied — they are vertically mirrored before being placed:

WL0: normal orientation  → top: Vdd,  bottom: GND
WL1: flipped upside-down → top: GND,  bottom: Vdd

→ WL0's bottom GND and WL1's top GND overlap perfectly
→ two rows share a single GND rail; no extra DRC spacing needed

Power rail sharing: Vdd/GND rails are shared between adjacent rows, eliminating one full metal rail
Bitline contact sharing: access transistors in adjacent rows share the same BL contacts, cutting contact count in half
N-Well / P-Well continuity: PMOS transistors live in N-Well; mirrored placement lets adjacent N-Wells merge seamlessly, avoiding the large Well-to-Well isolation spacing penalty

8. What Makes It Production-Grade

EMA — A Post-Silicon Timing Knob

Memory internal read/write timing is asynchronous and self-timed: dummy bitlines and delay cells decide when to fire the sense amp, independent of the external clock.

When process variation causes read failures or insufficient write margin, EMA (Extra Margin Adjustment) pins provide a post-fabrication fix:

EMA: adjusts WL pulse width and SA enable delay (read margin)
EMAW: adjusts timing margin for write operations

Increasing EMA gives the bitcell more time to pull down the bitline → larger differential voltage → better Read SNM (Static Noise Margin) → better yield, at the cost of slightly longer access time. This is a three-way trade-off between area, speed, and yield.

Low Power: VDDC vs VDDP Dual-Rail

SRAM arrays are often the largest, leakiest structures on an SoC. Memory compilers must support power gating and generate corresponding UPF~/~CPF files:

Rail	Supplies	Can be shut off?
VDDC	6T bitcell array itself	No (kills data)
VDDP	Address decoders, sense amps, control	Yes

Three power states:

Light Sleep: gate peripheral clocks, reduce dynamic switching power
Deep Sleep (Retention): shut off VDDP (periphery fully off), reduce VDDC voltage just enough to keep cross-coupled inverters locked — data retained
Shutdown: both VDDC and VDDP off — all data lost

Redundancy — Repairing Manufacturing Defects

Tiling millions of deep sub-micron transistors at maximum density means manufacturing defects are statistically unavoidable. Discarding a whole SoC for one dead bitcell would be economically catastrophic.

The solution is built-in redundancy:

The compiler automatically generates spare redundant columns alongside the main array
During manufacturing test, the MBIST engine identifies all failing addresses
An eFuse is permanently blown, recording the bad column
On every power-up, identical MUX logic redirects that column's I/O to the redundant spare
The memory appears completely intact to the outside world

9. Two Directions of Tiling

The concept of "tiling" exists in both EDA and ML compiler domains, mapping to structurally opposite operations.

Memory Compiler Tiling: Small → Large (Assembling)

Memory compiler tiling takes a designed leaf cell and replicates it into a full array, like laying floor tiles:

small leaf cell  →  tile into  →  large SRAM array

   ┌──┐                             ┌────────────┐
   │6T│  ×  (R rows × C cols)   →   │            │
   └──┘                             │   array    │
                                    │            │
                                    └────────────┘

Direction: bottom-up, assembling

ML Compiler Tiling: Large → Small (Decomposing)

ML compiler tiling (e.g., GEMM (General Matrix Multiplication) on an NPU (Neural Processing Unit)) takes a large computation and cuts it into pieces that fit the hardware:

full GEMM computation (M × N × K)  →  cut into  →  hardware-sized tiles

┌──────────────────┐                   ┌──┬──┬──┐
│                  │                   │  │  │  │
│   A (M×K)        │  →   tiling   →   ├──┼──┼──┤
│   × B (K×N)      │                   │  │  │  │
│                  │                   └──┴──┴──┘
└──────────────────┘             each tile fits in SRAM + systolic array

The constraint: an NPU's systolic array is a fixed size, and the on-chip SRAM is bounded. You can't feed the whole matrix at once. Tiling decomposes the computation into sizes that maximize hardware utilization and minimize memory bandwidth bubbles.

Direction: top-down, decomposing

The Symmetry

Context	Input → Output	Direction
Memory Compiler	leaf → array	small → large
ML Compiler	computation → tile	large → small

10. Two Open Questions

Chisel and the Memory Compiler Interface

Chisel treats memory as a first-class language primitive. When you write SyncReadMem(1024, UInt(32.W)), you are not instantiating a specific SRAM macro — you are declaring intent: I need something that behaves like a 1024-entry synchronous-read memory. The implementation is deliberately unspecified.

This abstraction breaks down at tapeout. Modern CAD tools cannot synthesize SRAM macros from an RTL description; without intervention, FIRRTL maps all SeqMem instances to flip-flop arrays, which are functionally correct but physically unroutable at scale. The fix is a FIRRTL transform called ReplSeqMem: it scans the design, converts every SeqMem above a size threshold into an external module reference (a black box with only pins visible), and outputs a .conf file listing every unique SRAM configuration the design requires.

That .conf file is then consumed by MacroCompiler (part of Chipyard's Tapeout-Tools). MacroCompiler is also given an .mdf file describing either the available vendor SRAM macros or the capabilities of the foundry's memory compiler. It matches the requested configurations against what is available and emits the technology-mapped Verilog — or, if no direct match exists, passes the request to the memory compiler itself to generate a new macro.

Chisel SyncReadMem           FIRRTL ReplSeqMem          MacroCompiler
(abstract intent)   ─────►  (.conf: what is needed) ──► (maps to vendor SRAM or calls memory compiler)

Chisel does it at the language level; ReplSeqMem does it at the IR level; MacroCompiler does it at the physical level. The traditional memory compiler sits at the bottom of this stack, still responsible for generating actual GDS and netlists — but it is now invoked programmatically rather than by hand.

What remains unsettled is the interface. The .conf / .mdf format is UCB-specific and not an industry standard. As Chisel and CIRCT (the LLVM/MLIR-based FIRRTL compiler) gain adoption, this plumbing will need to standardize.

Compute-In-Memory and the Storage/Compute Boundary

Throughout this article the memory macro has been pure storage: data in, data out, computation elsewhere. Compute-In-Memory (CIM) erases that separation.

The core idea is straightforward. In a standard read, a single WL is asserted and one row drives the bitlines. In analog CIM, multiple WLs are asserted simultaneously. Each active bitcell contributes a small current to the shared BL proportional to its stored bit. The total BL discharge becomes a current accumulation — a dot product, computed in the analog domain without ever moving data out of the array. An ADC at the column sense amp converts the result.

Mode	Operation
Standard read	assert 1 `WL` → read 1 row
Analog CIM	assert \(N\) `WLs` → \(I_{BL} \propto \sum_{i} w_i \cdot x_i\)

The column BL current accumulates a dot product in the analog domain:

\[I_{BL} \propto \sum_{i=1}^{N} w_i \cdot x_i\]

where \(w_i \in \{0, 1\}\) is the stored bit in row \(i\) and \(x_i\) is the input activation (encoded as WL drive strength).

The 6T cell's physics — the same physics that creates Read Disturbance and necessitates the Beta Ratio — becomes the compute primitive. The bitcell is not being repurposed; it is being used for something its physics already enables, just never intentionally.

Digital CIM (DCIM) takes a different path: it adds explicit logic gates alongside the bitcells so that computation is fully digital and deterministic, at the cost of area. Recent work (e.g., 12nm DCIM at 137 TOPS/W) fits this approach into foundry 8T bitcells, making it compatible with standard memory compiler flows.

Traditional memory compilers have a strict internal model: the array is storage, the periphery is control, and the two are generated separately. CIM breaks this model. The array now participates in computation; the sense amplifier periphery doubles as an ADC; power and timing budgets span both. A CIM-aware memory compiler cannot treat the array and its periphery as independent subsystems. Some researchers have noted that CIM macros are amenable to automated design via memory compilers, but what that actually requires — a compiler that co-generates storage geometry and compute periphery from a unified specification — does not yet exist in any standardized form.

> share on twitter / bluesky / mastodon / facebook / reddit / telegram / email

> related articles: