All About C & C++ Strings
Background
When I was implementing Coogle, I discovered something bizarre about C++ strings. That is why I wrote this note to summarize what I learned about C++ strings.
Coogle is a search engine that I implemented in C++. It is designed to quickly and efficiently find function signatures in large codebases.
For example, there is a function signature like this:
int add(int a, int b);
When I search "int(int, int)", Coogle should be able to find this signature quickly.
However, when I search "std::string(std::string)" to find function signatures that take and return std::string, I found that Coogle could not find any results, even though there are many such functions in the codebase.
After some investigation, I realized that the issue was related to how C++ handles strings. In C++, std::string is actually a typedef for std::basic_string<char>, and there are some subtle differences in how these types are treated in function signatures.
As a compiler engineer, strings and trees are my bread and butter. I decided to dig deeper into the C++ string implementation and document my findings here.
PART 1: C Strings and Character Types
Back to the Basics: Char
Character Types in C
From the C language's perspective, char is a type that represents how data in memory is interpreted. It is a byte of data with a width of CHAR_BIT bits (at least 8, and exactly 8 on most modern systems). The C standard does not define whether char is signed or unsigned; the choice is implementation-defined. Common x86 ABIs make char signed, while the Linux ABIs for ARM and PowerPC make it unsigned.
According to C99 §6.2.5 (Types), there are three distinct character types:
- char: Implementation-defined signedness (§6.2.5 ¶15)
  - Whether char has the same range, representation, and behavior as signed char or unsigned char is implementation-defined
  - CHAR_MIN is either 0 (unsigned) or SCHAR_MIN (signed); see §5.2.4.2.1
  - Must be able to represent any member of the basic execution character set (§6.2.5 ¶3)
  - Used for character data and strings
- signed char: Always signed (§6.2.5 ¶4)
  - Range: at least -127 to +127 (§5.2.4.2.1)
  - Typically -128 to +127 (2's complement: SCHAR_MIN = -128, SCHAR_MAX = 127)
  - Treated as a small integer type, not for character data
- unsigned char: Always unsigned (§6.2.5 ¶6)
  - Range: 0 to at least 255 (UCHAR_MAX ≥ 255) (§5.2.4.2.1)
  - No padding bits (§6.2.6.1 ¶4), pure binary representation
  - Used to inspect object representations of any type
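The guarantees in this list can be checked at compile time. The following is a minimal sketch, written in C++ so it can use static_assert; the helper name plain_char_is_signed is made up for illustration and is not part of any library.

```cpp
#include <climits>
#include <type_traits>

// Hypothetical helper: reports what this implementation chose for plain char.
constexpr bool plain_char_is_signed() {
    return std::is_signed<char>::value;
}

// All three character types are exactly one byte wide by definition.
static_assert(sizeof(char) == 1 && sizeof(signed char) == 1 && sizeof(unsigned char) == 1,
              "character types are one byte wide");
// Minimum ranges guaranteed by <limits.h>.
static_assert(SCHAR_MIN <= -127 && SCHAR_MAX >= 127, "signed char range");
static_assert(UCHAR_MAX >= 255, "unsigned char range");
// CHAR_MIN is 0 when char is unsigned and SCHAR_MIN when it is signed.
static_assert(CHAR_MIN == (plain_char_is_signed() ? SCHAR_MIN : 0),
              "CHAR_MIN follows the implementation's choice");
```

If any of these assertions failed, the translation unit would not compile, which makes them a cheap portability check.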
Why Three Distinct Types? Historical Context
When C was being standardized (late 1970s - 1980s), different architectures had different ideas about bytes:
- PDP-11 (where C was born): char was 8 bits and naturally signed, because byte loads sign-extend
- IBM mainframes: Used EBCDIC, not ASCII; different character handling
- Signedness debates: Some CPUs made signed arithmetic faster, others unsigned
- Existing codebases: Millions of lines of code with different assumptions about char
The Three-Type Solution
The committee couldn't just pick "signed" or "unsigned" for char without breaking half the existing code. So they made a brilliant compromise:
1. char — The Compatibility Type
Purpose: Preserve existing code and allow hardware-specific optimization
char str[] = "Hello"; // Text data - don't care about sign
char *filename = "/tmp/file"; // String operations
Why implementation-defined:
- Lets each platform choose what's most efficient for their CPU
- x86: Signed by default (sign-extend is slightly cheaper)
- ARM: Unsigned by default (zero-extend is cheaper)
- Old code continues to work on its original platform
The contract: "If you use char for text/strings where values are ASCII (0-127), you're safe everywhere"
2. signed char — The Small Integer Type
Purpose: When you need a guaranteed signed 8-bit integer
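// Example: Delta encoding in compression (samples is an illustrative input array)
signed char deltas[4]; // Differences can be negative
for (int i = 0; i < 4; i++) deltas[i] = (signed char)(samples[i + 1] - samples[i]);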
Why separate from char:
- You NEED negative values
- Can't rely on char being signed (it might be unsigned on ARM!)
- Makes intent explicit: "I'm using this as a number, not a character"
3. unsigned char — The Byte Manipulation Type
Purpose: Raw memory access and binary data
This is the most important one! Per C99 §6.2.6.1, unsigned char has special properties:
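// Serialize an int to bytes (well-defined on every platform)
int value = 0x12345678;
unsigned char bytes[sizeof(int)];
memcpy(bytes, &value, sizeof(int));
// bytes[0], bytes[1], bytes[2], bytes[3] are guaranteed to contain
// the byte representation (in the platform's own byte order)
// Read binary file data (the file name is illustrative)
FILE *f = fopen("data.bin", "rb");
unsigned char buffer[1024];
fread(buffer, 1, sizeof(buffer), f);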
Why separate and why no padding bits:
- Type punning safety: A pointer to a character type can legally alias any object (§6.5 ¶7); unsigned char* is the preferred choice because every bit pattern is a valid value
- No padding: Every bit pattern is valid; all 256 values are guaranteed
- Wrap-around: Conversion to unsigned char is well-defined (values reduce modulo UCHAR_MAX + 1)
- Low-level code: Networks, crypto, compression all need predictable byte access
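To make the byte-access guarantee concrete, here is a small C++ sketch that copies an integer's object representation into unsigned char storage and back. The helper names bytes_of, from_bytes, and is_little_endian are invented for this example.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Copies the object representation of a 32-bit value into a byte vector.
std::vector<unsigned char> bytes_of(std::uint32_t value) {
    std::vector<unsigned char> bytes(sizeof value);
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// Rebuilds the value from its bytes; the inverse of bytes_of on the same machine.
// Expects at least sizeof(std::uint32_t) bytes.
std::uint32_t from_bytes(const std::vector<unsigned char>& bytes) {
    std::uint32_t value = 0;
    std::memcpy(&value, bytes.data(), sizeof value);
    return value;
}

// True on little-endian machines, where the least significant byte comes first.
bool is_little_endian() {
    return bytes_of(1u)[0] == 1;
}
```

The round trip is exact on any platform, but the order of the bytes depends on endianness, so a portable wire format still has to pick a byte order explicitly.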
Type System Distinctions
From the compiler's perspective (and this is crucial for you as a compiler engineer!):
char a;
signed char b;
unsigned char c;
// These are THREE DIFFERENT TYPES in the type system!
// Even if char == signed char at runtime, they're distinct at compile time
char *p1;
signed char *p2;
p1 = p2; // Constraint violation: needs a diagnostic (typically a warning in C, an error in C++)
Per C99 §6.2.5 ¶15, even though char must have the same representation as one of the others, they remain distinct types for type checking purposes. Pointers to them are therefore incompatible, and assigning one to the other without a cast violates the constraints on simple assignment (§6.5.16.1). C compilers typically report this as a warning (for example under -Wpointer-sign in GCC and Clang), while C++ rejects it outright.
Why compilers warn:
Even though char has the same representation as either signed char or unsigned char on a given platform, treating them as interchangeable violates the type system's semantic distinctions. The three types exist to express different intent: char for text, signed char for small signed integers, and unsigned char for raw bytes. Mixing pointers to these types suggests you may be confused about what your data represents.
C++ Overloading Context
Why this matters for function overloading in C++:
void f(char); // Overload 1
void f(signed char); // Overload 2 - DISTINCT!
void f(unsigned char); // Overload 3 - DISTINCT!
// All three can coexist as separate overloads!
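A runnable version of the same idea, as a minimal sketch: the function name kind is hypothetical, and the static_asserts confirm that the three types never collapse into each other, whatever signedness the platform picks for char.

```cpp
#include <string>
#include <type_traits>

static_assert(!std::is_same<char, signed char>::value, "char is not signed char");
static_assert(!std::is_same<char, unsigned char>::value, "char is not unsigned char");

// Three overloads that differ only in the character type they accept.
std::string kind(char)          { return "char"; }
std::string kind(signed char)   { return "signed char"; }
std::string kind(unsigned char) { return "unsigned char"; }
```

Calling kind('a') selects the first overload, because a character literal has type char in C++.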
Summary: The Design Wins
- Backward compatibility: Old code works on its original platform
- Performance: Each platform uses the most efficient representation for char
- Type safety: Explicit signed char/unsigned char prevents bugs
- Low-level power: unsigned char gives guaranteed byte access
- Portability: Code that needs specific signedness can request it
The "redundancy" is actually separation of concerns:
- char = "I want text, optimize for this platform"
- signed char = "I need signed arithmetic"
- unsigned char = "I need raw bytes"
This design is why C succeeded—it balanced portability with "trust the programmer" philosophy and hardware efficiency!
String in C is Just a Char Array
A fact that surprises approximately no one, but let's quickly recap.
In C, strings are represented as arrays of characters (char), terminated by a null character ('\0'). This means that a string in C is essentially a sequence of char values stored in contiguous memory locations.
For example, the string "Hello, World!" can be represented in C as:
char str[] = "Hello, World!";
(Visualization of memory layout)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ H │ e │ l │ l │ o │ , │ │ W │ o │ r │ l │ d │ ! │ \0 │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│ 72 │ 101 │ 108 │ 108 │ 111 │ 44 │ 32 │ 87 │ 111 │ 114 │ 108 │ 100 │ 33 │ 0 │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│0x48 │0x65 │0x6C │0x6C │0x6F │0x2C │0x20 │0x57 │0x6F │0x72 │0x6C │0x64 │0x21 │0x00 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
^ ^
| |
str (pointer to first character) Null terminator ('\0')
As mentioned earlier, unsigned char is the proper way to inspect raw bytes:
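char str[] = "Hello, World!";
unsigned char *bytes = (unsigned char *)str;
for (size_t i = 0; i < sizeof(str); i++)
    printf("bytes[%2zu] = 0x%02X ('%c')\n", i, bytes[i], isprint(bytes[i]) ? bytes[i] : '?');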
// Output:
// bytes[ 0] = 0x48 ('H')
// bytes[ 1] = 0x65 ('e')
// ...
// bytes[13] = 0x00 ('?') ← The null terminator
String Manipulation
The C standard library provides various functions for manipulating C strings, such as strlen, strcpy, and strcat, all of which rely on the null terminator to determine the end of the string.
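How such a function depends on the terminator can be seen in a hand-written length routine. This is a sketch of the general technique, not the code of any real C library; my_strlen is a made-up name.

```cpp
#include <cstddef>

// Hypothetical re-implementation of strlen: walk forward until the terminator.
std::size_t my_strlen(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') {
        ++n;
    }
    return n;
}
```

Because the loop stops at the first '\0', a buffer such as "ab\0cd" is reported as two characters long, and a buffer with no terminator at all makes the loop read past the end.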
Why This Matters for Compiler Engineers
Understanding this memory layout is crucial because:
- String literals in read-only memory: Compilers typically place string literals in the .rodata section
- Pointer vs. array semantics: char *p = "Hello" vs. char arr[] = "Hello" have different properties
  char *p = "Hello";    // Pointer to string literal (read-only)
  char arr[] = "Hello"; // Array initialized with string (writable)
  Visualization:
  char *p = "Hello";                  char arr[] = "Hello";
  ┌──────────┐                        ┌─────┬─────┬─────┬─────┬─────┬─────┐
  │ pointer  │──────┐                 │  H  │  e  │  l  │  l  │  o  │ \0  │
  └──────────┘      │                 └─────┴─────┴─────┴─────┴─────┴─────┘
   (8 bytes)        │                 (6 bytes, modifiable)
                    │                  ↑
                    │                  └─ arr (address)
                    ↓
  .rodata (read-only)
  ┌─────┬─────┬─────┬─────┬─────┬─────┐
  │  H  │  e  │  l  │  l  │  o  │ \0  │
  └─────┴─────┴─────┴─────┴─────┴─────┘
  (6 bytes, read-only)
- String manipulation: Functions like strcpy, strcat rely on \0 to know when to stop
- Buffer overflows: Classic security vulnerabilities arise from not accounting for the null terminator
  // Common mistake:
  char buffer[13];                 // Only space for "Hello, World!" WITHOUT \0
  strcpy(buffer, "Hello, World!"); // WARNING: BUFFER OVERFLOW! Needs 14 bytes
The problem with C strings
- (P1) No length metadata:
  - C strings rely on the null terminator to indicate the end of the string. This means that functions like strlen have to traverse the entire string to find its length, leading to O(n) time complexity for length retrieval.
- (P2) Buffer overflows:
  - Since C strings do not have built-in bounds checking, it is easy to accidentally write beyond the allocated memory for a string, leading to buffer overflows and potential security vulnerabilities.
- (P3) Ambiguous ownership:
  - char *p = "Hello"; points to read-only memory (.rodata), but looks mutable.
  - Writing to it → undefined behavior.
- (P4) Manual memory management:
  - Dynamic strings require malloc/free.
  - Easy to leak or double-free.
- (P5) Encoding issues:
  - C strings are just byte arrays, so handling multi-byte encodings (like UTF-8, UTF-16) requires extra care.
- (P6) Error-prone APIs:
  - strncpy pads with zeros, doesn’t guarantee null termination.
  - strlen returns size excluding '\0', leading to off-by-one bugs.
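The strncpy pitfall from (P6) is easy to reproduce. The sketch below uses two invented helpers: is_terminated checks whether a buffer holds a terminator at all, and copy_truncated shows one common workaround. Neither is a standard function.

```cpp
#include <cstddef>
#include <cstring>

// True if the first cap bytes of buf contain a '\0'.
bool is_terminated(const char* buf, std::size_t cap) {
    return std::memchr(buf, '\0', cap) != nullptr;
}

// Hypothetical helper: truncating copy that always terminates dst.
// Requires cap > 0.
void copy_truncated(char* dst, std::size_t cap, const char* src) {
    std::strncpy(dst, src, cap - 1);  // copy at most cap - 1 characters
    dst[cap - 1] = '\0';              // then terminate unconditionally
}
```

Copying "Hello, World!" into a 10-byte buffer with plain strncpy(dst, src, 10) fills all ten bytes with characters, so is_terminated returns false; copy_truncated leaves "Hello, Wo" plus a terminator instead.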
PART 2: std::string and Templates
So the C++98 Standard Library Introduced std::string
To address the problems with C strings, the C++98 standard library introduced std::string, which is a more robust and user-friendly string class. Let's see how std::string solves each of the problems (P1-P6):
How std::string Solves C String Problems
Solution to P1: Length Metadata
Problem: C strings require O(n) traversal to find length.
Solution: std::string stores the length as a member variable!
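class string {        // simplified sketch, not a real implementation
    char *data_;      // pointer to the character buffer
    size_t size_;     // current length, returned by size()
    size_t capacity_; // allocated capacity
};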
Benefits:
std::string s = "Hello, World!";
// O(1) time complexity - just returns the stored size_
s.size(); // Returns 13 instantly
s.length(); // Same as size(), returns 13
// Compare to C:
char c_str[] = "Hello, World!";
strlen(c_str); // O(n) - must traverse entire string
Memory trade-off: Extra sizeof(size_t) bytes (typically 8 bytes on 64-bit) to store the length, but huge performance gain!
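Storing the length has a second consequence beyond speed: a std::string can contain '\0' characters, which a C string cannot. A small sketch, with the helper names c_length and stored_length chosen only for this example:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Length as seen by C code: stops at the first '\0'.
std::size_t c_length(const std::string& s) {
    return std::strlen(s.c_str());
}

// Length as stored by std::string: counts every char, including embedded '\0'.
std::size_t stored_length(const std::string& s) {
    return s.size();
}
```

For std::string s("ab\0cd", 5), stored_length(s) is 5 while c_length(s) is 2, so the two notions of length only agree when the string has no embedded terminator.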
Solution to P2: Buffer Overflows
Problem: No bounds checking in C string operations.
Solution: std::string automatically manages buffer size and reallocates when needed!
std::string s = "Hello";
s += ", World!"; // OK: Automatically resizes if needed
s += " How are you?"; // OK: Still safe, grows as needed
// C equivalent - dangerous:
char buffer[10] = "Hello";
strcat(buffer, ", World!"); // WARNING: BUFFER OVERFLOW! Only 10 bytes allocated
Bounds-checked access:
std::string s = "Hello";
// Safe access with bounds checking:
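try { char c = s.at(10); } catch (const std::out_of_range &e) { /* index 10 is past the end: handle the error */ }
// Unchecked access (like C, for performance):
char c = s[10]; // WARNING: Undefined behavior, but faster (no bounds check)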
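The checked accessor can be wrapped so that callers never see the exception. This is a minimal sketch; char_or is a hypothetical helper, not part of the standard library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Returns the character at index i, or fallback when i is out of range.
char char_or(const std::string& s, std::size_t i, char fallback) {
    try {
        return s.at(i);  // throws std::out_of_range when i >= s.size()
    } catch (const std::out_of_range&) {
        return fallback;
    }
}
```

Note that at(s.size()) also throws, even though operator[] is allowed to read the terminator at that index.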
Solution to P3: Ambiguous Ownership
Problem: Unclear whether char* points to read-only or writable memory.
Solution: std::string owns its data with clear semantics!
// C - ambiguous:
char *p1 = "Hello"; // Points to read-only .rodata
char *p2 = malloc(6); // Points to heap memory
free(p2); // Need to track who owns what
// C++ - clear ownership:
std::string s1 = "Hello"; // s1 OWNS a copy of the data
std::string s2 = s1; // s2 OWNS its own independent copy (deep copy)
s1[0] = 'h'; // OK: Safe, s1 = "hello"
// s2 is still "Hello" (unaffected)
RAII (Resource Acquisition Is Initialization):
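void process() { std::string s = "Hello"; /* use s */ } // OK: Destructor automatically frees memory - no manual cleanup!
// C equivalent:
void process_c(void) { char *s = malloc(6); strcpy(s, "Hello"); /* use s */ free(s); } // Must free manually on every exit path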
Solution to P4: Manual Memory Management
Problem: Dynamic C strings require manual malloc/free.
Solution: std::string uses automatic memory management (RAII)!
// C++ - automatic:
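std::string s;
for (int i = 0; i < 1000; i++) s += 'x'; // Grows automatically
// OK: Memory automatically freed when s goes out of scope
// C - manual nightmare:
char *s = malloc(1);
s[0] = '\0';
size_t capacity = 1;
for (size_t len = 0; len < 1000; len++) { if (len + 2 > capacity) { capacity *= 2; s = realloc(s, capacity); /* failure unchecked */ } s[len] = 'x'; s[len + 1] = '\0'; }
free(s); // WARNING: Must remember to free!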
Copy semantics:
std::string s1 = "Hello";
std::string s2 = s1; // Deep copy - s2 gets its own memory
s1[0] = 'h'; // s1 = "hello", s2 = "Hello" (independent)
// Move semantics (C++11):
std::string s3 = std::move(s1); // s3 takes ownership of s1's memory
// s1 is now in a valid but unspecified state (typically empty)
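Both behaviors can be exercised in a few lines. The sketch below uses two invented helpers; it deliberately checks only the destination of the move, because the standard leaves the source's contents unspecified.

```cpp
#include <string>
#include <utility>

// Copying gives the new string its own buffer, so edits do not leak across.
bool copies_are_independent() {
    std::string s1 = "Hello";
    std::string s2 = s1;  // deep copy
    s1[0] = 'h';
    return s1 == "hello" && s2 == "Hello";
}

// Moving transfers the contents out of source into the returned string.
// Afterwards source is still a valid object and can be assigned to or cleared.
std::string move_out(std::string& source) {
    return std::move(source);
}
```

A moved-from string is safe to reuse, for example by assigning a new value to it, but code should not rely on it being empty.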
Solution to P5: Encoding Issues
(The real howto is in the next section)
Problem: C strings are just byte arrays, no encoding awareness.
Solution: std::string is still byte-based BUT provides a foundation for encoding-aware types!
// C++98: std::string is still byte-based
std::string utf8_str = u8"Hello, 世界"; // C++11 UTF-8 literal (since C++20 u8 literals are char8_t and need std::u8string)
// Each char is still one byte, but you can work with UTF-8 data
// C++11 added encoding-specific types:
std::u16string utf16_str = u"Hello, 世界"; // UTF-16
std::u32string utf32_str = U"Hello, 世界"; // UTF-32
std::wstring wide_str = L"Hello, 世界"; // Platform-dependent wide char
// All have the same interface as std::string!
utf16_str.size(); // Number of char16_t units
utf16_str += u"!"; // Concatenation works
Better than C:
// C - manual UTF-8 handling:
char utf8[] = "世界"; // How many characters? Need external library!
// strlen(utf8) gives BYTES, not character count
// C++ - at least you have type safety:
std::string utf8 = u8"世界";
// Still need external library for character count, but:
// - Memory is automatically managed
// - Can use standard algorithms
// - Type-safe operations
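For the specific question of how many code points a UTF-8 string holds, a library is not strictly required as long as the input is known to be valid UTF-8. The following sketch relies on that assumption and does no validation; count_code_points is a made-up name.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string that is assumed to hold valid UTF-8.
// Every code point has exactly one byte that is not a continuation byte.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {  // continuation bytes look like 10xxxxxx
            ++count;
        }
    }
    return count;
}
```

This counts code points, not user-perceived characters; combining sequences and emoji clusters still need a Unicode-aware library.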
Solution to P6: Error-Prone APIs
Problem: C string APIs are inconsistent and error-prone.
Solution: std::string provides consistent, intuitive, and safe APIs!
Comparison table:
| Operation | C (error-prone) | C++ (safe & intuitive) |
|---|---|---|
| Copy | strcpy(dest, src) - no size checking | dest = src; - automatic resizing |
| Copy with limit | strncpy(dest, src, n) - may not null-terminate! | dest.assign(src, 0, n); - always valid |
| Concatenate | strcat(dest, src) - no size checking | dest += src; - automatic resizing |
| Length | strlen(s) - O(n) traversal | s.size() - O(1) |
| Compare | strcmp(s1, s2) - returns int | s1 == s2 - returns bool |
| Substring | Manual pointer arithmetic | s.substr(pos, len) |
| Find | strstr(haystack, needle) - returns pointer | s.find(str) - returns size_t position |
Examples:
// C - error prone:
char dest[10];
strncpy(dest, "Hello, World!", sizeof(dest)); // WARNING: Not null-terminated!
dest[sizeof(dest) - 1] = '\0'; // Must manually add null terminator
// C++ - safe:
std::string dest = "Hello, World!";
dest = dest.substr(0, 10); // OK: "Hello, Wor" - properly terminated
// C - confusing return values:
if (strcmp(s1, s2) == 0) { /* equal */ } // 0 means equal? Confusing!
// C++ - intuitive:
if (s1 == s2) { /* equal */ } // OK: Natural boolean comparison
// C - pointer arithmetic for substring:
char str[] = "Hello, World!";
char *world = str + 7; // Points to "World!"
// WARNING: No bounds checking, lifetime tied to str
// C++ - safe substring:
std::string str = "Hello, World!";
std::string world = str.substr(7); // "World!" - independent copy
Summary: What std::string Provides
std::string s; // Empty string
s = "Hello"; // Assignment from C string
s += ", World"; // Concatenation with automatic memory management
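s[0] = 'h'; // Mutable access: "hello, World"
s.size(); // O(1) length: 12
s.substr(0, 5); // Substring: "hello"
s.find("World"); // Find position: 7
if (s == "hello, World") { /* ... */ } // Natural comparison with ==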
// OK: No manual memory management
// OK: No buffer overflow worries (with proper usage)
// OK: Clear ownership semantics
// OK: Consistent, intuitive API
// OK: Automatic cleanup (RAII)
The price you pay:
- Small overhead: extra bytes for size/capacity
- Potential heap allocations (though SSO mitigates this - more on that later!)
- Need to understand value semantics (copies vs. references)
But the safety and convenience are usually worth it!
The Real Story: std::basic_string Template
Now we get to the heart of the matter—and this ties directly back to the Coogle problem mentioned at the beginning!
This section covers:
- Template nature: Why std::string is actually a typedef
- Character traits: Customizing string behavior
- Allocators: Memory management customization (including C++17 PMR)
- Practical implications: How this affects tools like Coogle
std::string is Actually a Typedef!
Here's the big reveal from the C++ standard library:
// From <string> header (simplified):
typedef basic_string<char> string;
This means:
std::string s = "Hello";
// Is actually:
std::basic_string<char, std::char_traits<char>, std::allocator<char>> s = "Hello";
Why Does This Matter? (Back to the Coogle Problem!)
Remember from the introduction:
// Searching for this signature:
std::string f(std::string);
// But looking for:
"std::string(std::string)"
// Might not match if the compiler sees it as:
"std::basic_string<char>(std::basic_string<char>)"
The compiler's type system sees the full template instantiation, not the typedef!
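The equivalence can be verified directly with type traits. In this sketch the function shout and the pointer shout_ptr are invented for illustration; the point is that a function declared with std::string has exactly the type spelled with std::basic_string<char>.

```cpp
#include <memory>
#include <string>
#include <type_traits>

static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is an alias, not a new type");
static_assert(std::is_same<std::string,
                           std::basic_string<char, std::char_traits<char>,
                                             std::allocator<char>>>::value,
              "the default template arguments are part of the same type");
static_assert(!std::is_same<std::string, std::wstring>::value,
              "a different character type gives a different type");

// Declared with the alias...
std::string shout(std::string s) { return s + "!"; }
// ...and referenced through the expanded spelling: the types match exactly.
std::basic_string<char> (*shout_ptr)(std::basic_string<char>) = &shout;
```

A signature search therefore has to treat both spellings as the same query, since the compiler does.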
What Are Character Traits?
Character traits define how characters behave. They're a policy class that abstracts character operations:
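template <typename CharT>
struct char_traits { // simplified
    static bool eq(CharT a, CharT b);                                // equality
    static bool lt(CharT a, CharT b);                                // ordering
    static int compare(const CharT *s1, const CharT *s2, size_t n);  // three-way comparison
    static size_t length(const CharT *s);                            // distance to the terminator
    static const CharT *find(const CharT *s, size_t n, const CharT &c);
    static CharT *copy(CharT *dest, const CharT *src, size_t n);
};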
Why Separate Traits from the Character Type?
Design principle: Separate the data representation (char, wchar_t) from the behavior (comparison, copying, etc.)
Example - Case-Insensitive String:
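// Custom traits for case-insensitive comparison
struct ci_char_traits : public std::char_traits<char> { /* eq, lt, compare, find redefined to ignore case (bodies omitted) */ };
// Now create a case-insensitive string type!
typedef std::basic_string<char, ci_char_traits> ci_string;
int main() { ci_string a = "Hello", b = "HELLO"; return a == b ? 0 : 1; } // a == b holds once the overrides are filled in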
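A fuller version of the case-insensitive traits idea might look like the sketch below. It follows the well-known pattern of deriving from std::char_traits<char> and overriding the comparison functions; the helper upper and the exact set of overrides are choices made for this example, and the conversion is only meaningful for single-byte characters in the current C locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

struct ci_char_traits : public std::char_traits<char> {
    // Fold a character to upper case; the cast avoids undefined behavior for negative char values.
    static char upper(char c) {
        return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return upper(a) == upper(b); }
    static bool lt(char a, char b) { return upper(a) < upper(b); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        for (; n != 0; ++s1, ++s2, --n) {
            if (upper(*s1) < upper(*s2)) return -1;
            if (upper(*s1) > upper(*s2)) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char a) {
        for (; n != 0; ++s, --n) {
            if (upper(*s) == upper(a)) return s;
        }
        return nullptr;
    }
};

using ci_string = std::basic_string<char, ci_char_traits>;
```

With these traits, ci_string("Hello") == ci_string("HELLO") evaluates to true. Note that ci_string and std::string are unrelated types, so converting between them needs an explicit copy of the characters.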
Memory Layout of std::basic_string
A typical implementation (simplified):
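template <typename CharT, typename Traits, typename Alloc>
class basic_string {    // simplified; real layouts differ between libraries
    CharT *data_;       // pointer to the characters
    size_t size_;       // number of characters in use
    size_t capacity_;   // allocated capacity
    // Alloc is usually stored with zero overhead (empty base optimization)
};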
Visualizing std::string vs std::basic_string<char>
Type System View:
std::string
↓ (typedef expansion)
std::basic_string<char, std::char_traits<char>, std::allocator<char>>
↓ (template instantiation)
[Concrete class with all methods specialized for char]
Memory Layout (example with SSO):
sizeof(std::string) = 32 bytes on typical 64-bit system
┌───────────────────────────────────┐
│ Union (24 bytes) │
│ ┌─────────────────────────────┐ │
│ │ Option 1: Small (≤ 15 chars)│ │
│ │ buffer[16 chars] │ │
│ │ size (1 byte) │ │
│ ├─────────────────────────────┤ │
│ │ Option 2: Large (> 15 chars)│ │
│ │ ptr (8 bytes) │ │
│ │ size (8 bytes) │ │
│ │ capacity (8 bytes) │ │
│ └─────────────────────────────┘ │
├───────────────────────────────────┤
│ Allocator (varies, often 0) │
└───────────────────────────────────┘
Small String Optimization (SSO)
Notice the union in the memory layout above? That's the key to Small String Optimization (SSO), one of the most important performance optimizations in modern C++ implementations.
The Problem SSO Solves
Every heap allocation is expensive:
- Allocator overhead (malloc/new do bookkeeping, may take locks, and occasionally make a system call)
- Cache misses (heap data is far from the string object)
- Memory fragmentation
- Deallocation overhead (free/delete)
But most strings in real programs are short:
- In many workloads, the large majority of strings are 15 characters or less
- Function names, variable names, error messages, JSON keys, etc.
Why allocate on the heap for "Hello"?
How SSO Works
Instead of always heap-allocating, std::string uses a clever trick:
std::string short_str = "Hello"; // SSO: stored inline, no heap allocation!
std::string long_str = "This is a much longer string that exceeds SSO limit";
// Heap: allocated on heap
// Both are 32 bytes (or 24, or 16 depending on implementation)
sizeof(short_str) == sizeof(long_str)
// But behavior is different:
// short_str: data is INSIDE the object
// long_str: data is OUTSIDE the object (heap)
SSO Performance Benefits
Before SSO:
std::string s = "Hello"; // Allocate 6 bytes on heap
// - malloc() call
// - Cache miss when accessing "Hello"
// - free() call on destruction
With SSO:
std::string s = "Hello"; // Store inline in the 32-byte object
// - No malloc()
// - Data in cache (next to object)
// - No free() needed
Typical impact:
- Often several times faster for short string operations (the exact factor depends on the allocator and the workload)
- Better cache locality: String data is adjacent to the object
- Reduced memory fragmentation: Fewer heap allocations
Implementation Variations
Different standard library implementations use different SSO buffer sizes:
| Implementation | SSO Size | Total sizeof(std::string) |
|---|---|---|
| libstdc++ (GCC) | 15 bytes | 32 bytes (64-bit) |
| libc++ (Clang) | 22 bytes | 24 bytes (64-bit) |
| MSVC STL | 15 bytes | 32 bytes (64-bit) |
Why different sizes?
- Trade-off between object size and inline storage
- ABI (Application Binary Interface) stability concerns
- Different optimization strategies
How to Detect SSO in Action
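void check_sso(const std::string &s) {
    auto obj = reinterpret_cast<std::uintptr_t>(&s);
    auto data = reinterpret_cast<std::uintptr_t>(s.data());
    std::printf("Object at: %p\nData at:   %p\n", (const void *)&s, (const void *)s.data());
    bool is_inline = data >= obj && data < obj + sizeof(s); // data lies inside the object itself
    std::puts(is_inline ? "SSO: Data stored INLINE" : "Heap: Data stored on HEAP");
}
int main() {
    check_sso("Hello");
    check_sso("This string is definitely too long for SSO");
}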
// Typical output:
// Object at: 0x7ffc1234abc0
// Data at: 0x7ffc1234abc0 ← Same! Data is inside object
// SSO: Data stored INLINE
//
// Object at: 0x7ffc1234abe0
// Data at: 0x55a8d9e0f2c0 ← Different! Data is on heap
// Heap: Data stored on HEAP
When SSO Doesn't Apply
SSO is disabled when:
- String is too long: Exceeds the buffer size (typically 15-22 chars)
- Custom allocator used: Some allocators may not support SSO
- Shared ownership: If string shares data (rare, mostly removed in C++11)
// SSO applies
std::string s1 = "Hello";
// SSO does NOT apply - too long
std::string s2 = "This string is definitely too long for SSO";
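// PMR with null resource - SSO is the ONLY option!
std::pmr::memory_resource *null_mr = std::pmr::null_memory_resource(); // a function returning a resource, not a type
std::pmr::string small_str("Hi", null_mr); // OK: Fits in SSO buffer
std::pmr::string large_str("This string is definitely too long for SSO", null_mr); // RUNTIME ERROR: throws std::bad_alloc, can't allocate!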
Why This Matters
- Performance: Short strings are extremely common; SSO makes them fast
- Memory: Fewer heap allocations = less fragmentation
- Cache: Better locality = fewer cache misses
- Predictability: Short strings have deterministic performance (no malloc)
The "free lunch": Most modern C++ programs get SSO optimization automatically without any code changes!
Type Identity Problem for Compilers
Here's why this matters for your Coogle tool:
void foo(std::string);
// Mangled name might be: _Z3fooNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
void bar(std::basic_string<char>);
// Mangled name: _Z3barNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
// They're THE SAME type, but name lookup might differ!
For your search engine, you need to handle:
- Typedef aliases: std::string == std::basic_string<char>
- Default template arguments: std::basic_string<char> == std::basic_string<char, std::char_traits<char>, std::allocator<char>>
- Namespace qualifications: string vs std::string vs ::std::string
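One way to handle the first and third cases is to rewrite known aliases to a single canonical spelling before comparing signatures. The sketch below is purely illustrative and is not how Coogle is implemented: canonicalize and its two-entry alias table are assumptions, and a real tool would more likely ask the compiler front end for the canonical type.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: rewrites well-known aliases to one canonical spelling.
std::string canonicalize(std::string sig) {
    // Longer spellings first, so "::std::string" is not half-matched.
    static const std::vector<std::pair<std::string, std::string>> aliases = {
        {"::std::string", "std::basic_string<char>"},
        {"std::string", "std::basic_string<char>"},
    };
    // Characters that can continue a qualified name on either side of a match.
    auto is_name_char = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_' || c == ':';
    };
    for (const auto& alias : aliases) {
        const std::string& from = alias.first;
        const std::string& to = alias.second;
        std::size_t pos = 0;
        while ((pos = sig.find(from, pos)) != std::string::npos) {
            std::size_t end = pos + from.size();
            bool left_ok = pos == 0 || !is_name_char(sig[pos - 1]);
            bool right_ok = end == sig.size() || !is_name_char(sig[end]);
            if (left_ok && right_ok) {
                sig.replace(pos, from.size(), to);  // whole-name match: rewrite it
                pos += to.size();
            } else {
                pos = end;  // part of a longer name such as std::string_view: skip
            }
        }
    }
    return sig;
}
```

The boundary checks keep names like std::string_view and std::pmr::string untouched, which matters because those are genuinely different types.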
Template Instantiation Example
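// When you write:
std::string s = "Hello";
// The compiler instantiates:
std::basic_string<char, std::char_traits<char>, std::allocator<char>>
// And calls the constructor:
basic_string(const char *s, const std::allocator<char> &alloc = std::allocator<char>());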
PART 3: Advanced Topics (PMR, Allocators, Encodings)
Memory Management: The Allocator Parameter
The third template parameter (Allocator) controls how std::basic_string allocates memory:
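template <class CharT, class Traits = std::char_traits<CharT>, class Allocator = std::allocator<CharT>>
class basic_string;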
This section covers two approaches to custom memory management:
- Traditional allocators (C++98): Type-based, compile-time selection
- Polymorphic allocators (C++17): Runtime selection with type compatibility
Why Customize Allocators?
- Performance: Custom allocation strategies for specific use cases
- Debugging: Track memory usage, detect leaks
- Embedded systems: Pre-allocated memory pools
- Memory locality: Keep related data together in cache
Traditional Allocators (C++98)
Example - Custom allocator:
// Using a custom allocator
template <typename T> struct MyAllocator { /* value_type, allocate(), deallocate() */ };
typedef std::basic_string<char, std::char_traits<char>, MyAllocator<char>> my_string;
my_string s = "Hello"; // Uses MyAllocator for memory management
The Problem with Traditional Allocators:
// These are DIFFERENT TYPES because of different allocators!
std::basic_string<char, std::char_traits<char>, std::allocator<char>> s1;
std::basic_string<char, std::char_traits<char>, MyAllocator<char>> s2;
s1 = s2; // ERROR: COMPILE ERROR! Incompatible types!
Different allocators create different types, making code inflexible and hard to compose.
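To show what a traditional allocator involves, here is a minimal sketch of one that counts allocations. CountingAllocator and counted_string are names invented for this example; the members shown are the minimum that std::allocator_traits needs in C++11 and later.

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// Hypothetical minimal allocator that counts how often allocate() runs.
template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ++allocation_count();
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }

    // One counter per element type T.
    static std::size_t& allocation_count() {
        static std::size_t count = 0;
        return count;
    }
};

// All instances are interchangeable, so they always compare equal.
template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using counted_string = std::basic_string<char, std::char_traits<char>, CountingAllocator<char>>;
static_assert(!std::is_same<counted_string, std::string>::value,
              "the allocator is part of the string's type");
```

Constructing a counted_string that is too long for the small-string buffer bumps the counter, while the static_assert confirms the type mismatch described above.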
Polymorphic Memory Resources (C++17)
C++17 introduced std::pmr to solve the allocator type problem. This is a major improvement for flexible memory management.
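The Solution:
namespace std::pmr {
    template <class CharT, class Traits = std::char_traits<CharT>>
    using basic_string = std::basic_string<CharT, Traits, polymorphic_allocator<CharT>>;
    using string = basic_string<char>; // one type, whatever memory resource is used
}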
Key Benefits of PMR
- Same type: All pmr::string objects have the same type regardless of memory resource
- Runtime selection: Choose memory resource at runtime, not compile time
- Interoperability: Can assign between strings using different resources
Basic Usage Example
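int main() {
    char buffer[1024];
    std::pmr::monotonic_buffer_resource pool(buffer, sizeof(buffer)); // memory comes from the stack buffer
    std::pmr::string s("Hello, PMR!", &pool);                         // same type as any other pmr::string
}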
Available Memory Resources
C++17 provides several built-in memory resources for different use cases:
// 1. Default heap allocator
std::pmr::string s1 = "Hello"; // Uses new/delete (default)
// 2. Monotonic buffer - fast allocation, no individual deallocation
char buffer[1024];
std::pmr::monotonic_buffer_resource mono(buffer, sizeof(buffer));
std::pmr::string s2("Hello", &mono); // Allocates from buffer
// 3. Unsynchronized pool - fast, single-threaded
std::pmr::unsynchronized_pool_resource pool;
std::pmr::string s3("Hello", &pool);
// 4. Synchronized pool - thread-safe
std::pmr::synchronized_pool_resource sync_pool;
std::pmr::string s4("Hello", &sync_pool);
// 5. Null memory resource - allocations fail (for testing)
std::pmr::memory_resource *null_mr = std::pmr::null_memory_resource(); // a function, not a type
std::pmr::string s5("Hi", null_mr); // Only works if SSO applies!
Memory Resource Hierarchy:
std::pmr::memory_resource (abstract base class)
│
├── std::pmr::new_delete_resource() - default heap
├── std::pmr::null_memory_resource() - always fails
├── std::pmr::monotonic_buffer_resource - append-only, fast
├── std::pmr::unsynchronized_pool_resource - pooled, single-threaded
└── std::pmr::synchronized_pool_resource - pooled, thread-safe
Real-World Example: Arena Allocation
A common pattern in high-performance code is arena allocation (also called region-based allocation). All memory for a request is allocated from a buffer and freed in one operation:
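void handle_request() {
    std::pmr::monotonic_buffer_resource arena(64 * 1024); // one arena per request
    std::pmr::vector<std::pmr::string> parts(&arena);      // the vector and its strings share the arena
    parts.emplace_back("GET /index.html");
} // OK: Arena destroyed, all memory freed at once (super fast!)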
// No individual deallocations needed!
Why this is faster:
- No individual delete calls
- Better cache locality (data packed together)
- Reduced memory fragmentation
- Common in game engines, servers, compilers
Comparison: Traditional vs PMR Strings
// Traditional string - allocator is part of the type
std::string s1 = "Hello";
std::basic_string<char, std::char_traits<char>, MyAllocator<char>> s2 = "World";
// s1 and s2 are DIFFERENT TYPES - cannot assign!
// PMR string - allocator chosen at runtime
std::pmr::monotonic_buffer_resource pool1;
std::pmr::monotonic_buffer_resource pool2;
std::pmr::string pmr_s1("Hello", &pool1);
std::pmr::string pmr_s2("World", &pool2);
// pmr_s1 and pmr_s2 are the SAME TYPE - can assign!
pmr_s1 = pmr_s2; // OK: Works!
When to Use PMR
Use std::pmr::string when:
- You need to control memory allocation strategy
- Working with embedded systems or real-time systems
- Building high-performance servers (arena allocation)
- Need containers with strings to share memory resources
- Want runtime flexibility without type proliferation
Stick with std::string when:
- Default heap allocation is fine
- Code simplicity is priority
- No special memory requirements
- C++17 not available
PMR and Type Identity (Important for Coogle!)
// For Coogle, you need to handle:
"std::string" → "std::basic_string<char>"
"std::pmr::string" → "std::basic_string<char, std::char_traits<char>, std::pmr::polymorphic_allocator<char>>"
// They're DIFFERENT types despite similar names!
void f(std::string); // Type 1
void f(std::pmr::string); // Type 2 - DIFFERENT!
// Cannot implicitly convert:
std::string s1 = "Hello";
std::pmr::string s2 = s1; // ERROR: Compile error!
// Must explicitly construct:
std::pmr::string s3(s1.data(), s1.size()); // OK
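The explicit conversion and the cross-resource assignment can be combined in one sketch. The helper to_pmr is hypothetical; the calls it makes (the pointer-plus-length constructor, get_allocator().resource()) are standard C++17.

```cpp
#include <memory_resource>
#include <string>
#include <type_traits>

static_assert(!std::is_same<std::string, std::pmr::string>::value,
              "different allocator, different type");

// Hypothetical helper: copies the characters of a std::string into a
// pmr string that allocates from the given memory resource.
std::pmr::string to_pmr(const std::string& s, std::pmr::memory_resource* mr) {
    return std::pmr::string(s.data(), s.size(), mr);
}
```

Assigning one pmr string to another copies the characters but leaves the destination's memory resource unchanged, because polymorphic_allocator does not propagate on copy assignment. That is why two pmr strings backed by different pools can still be assigned to each other safely.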
Different Character Encodings
Beyond char, std::basic_string supports multiple character types for different encodings:
// All these use the same basic_string template:
std::string s8 = "Hello"; // char
std::wstring ws = L"Hello"; // wchar_t (2 or 4 bytes)
std::u16string s16 = u"Hello"; // char16_t (2 bytes, C++11)
std::u32string s32 = U"Hello"; // char32_t (4 bytes, C++11)
std::u8string u8s = u8"Hello"; // char8_t (1 byte, C++20)
// Memory layout for "Hello":
// s8: [H][e][l][l][o][\0] 6 bytes
// ws: [H\0][e\0][l\0][l\0][o\0][\0\0] 12 bytes (UTF-16) or 24 (UTF-32)
// s16: [H\0][e\0][l\0][l\0][o\0][\0\0] 12 bytes
// s32: [H\0\0\0][e\0\0\0]... 24 bytes
// PMR versions (C++17):
std::pmr::string pmr_s8 = "Hello"; // Uses polymorphic allocator
std::pmr::wstring pmr_ws = L"Hello";
std::pmr::u16string pmr_s16 = u"Hello";
std::pmr::u32string pmr_s32 = U"Hello";
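The byte counts in the layout comments follow from size() multiplied by the width of one code unit. A small generic sketch, with payload_bytes as an invented name:

```cpp
#include <cstddef>
#include <string>

// Bytes occupied by the characters themselves (terminator excluded).
template <typename String>
std::size_t payload_bytes(const String& s) {
    return s.size() * sizeof(typename String::value_type);
}
```

Note that size() counts code units, not characters: a code point outside the Basic Multilingual Plane such as U+1D11E takes two char16_t units in a std::u16string but a single char32_t unit in a std::u32string.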
Design Rationale and Trade-offs
Why This Template-Based Design?
Benefits:
- Code reuse: One implementation works for all character types
- Type safety: std::string and std::wstring are incompatible types
- Customization: Can provide custom traits or allocators
- Performance: Template specialization allows optimization
- Consistency: Same interface for all character types
Trade-offs:
- Compilation time: Templates increase compile time
- Code bloat: Each instantiation generates code
- Complex error messages: Template errors can be verbose
- Binary size: Multiple instantiations = larger binaries
Summary: The Template Nature of C++ Strings
┌─────────────────────────────────────────────────────────────┐
│ std::basic_string<CharT, Traits, Alloc> (Template) │
└─────────────────────────────────────────────────────────────┘
│
┌──────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌───────────┐ ┌────────────┐
│ string │ │ wstring │ │ u16string │
│ (char) │ │ (wchar_t) │ │ (char16_t) │
└──────────┘ └───────────┘ └────────────┘
Same template, different character types!
Key takeaway: Understanding that std::string is a template instantiation (not a primitive type) is crucial for:
- Building tools like Coogle that analyze C++ code
- Understanding compilation errors
- Knowing when conversions are allowed
- Optimizing performance (e.g., move semantics)
- Creating custom string types with different behaviors
Summary: String Evolution Timeline
C Era:
- 1972: C introduced null-terminated strings (char* with \0)
- 1989: ANSI C standardized string functions (strcpy, strlen, etc.) and compile-time concatenation of adjacent string literals ("Hello" "World")
C++ Evolution:
- 1998 (C++98): std::string introduced with RAII and automatic memory management
- 2011 (C++11):
  - Move semantics for efficient string transfers
  - UTF-16/UTF-32 strings (std::u16string, std::u32string)
  - Raw string literals (R"(text)")
  - User-defined literals support
  - std::to_string() for converting numbers to strings
  - std::stoi(), std::stol(), std::stof() for string-to-number conversions
  - Range-based for loops work with strings
  - shrink_to_fit() to reduce capacity to size
- 2014 (C++14):
  - Standard user-defined string literals (""s operator)
  - Heterogeneous lookup for std::string in associative containers
- 2017 (C++17):
  - std::string_view for non-owning string references
  - std::pmr::basic_string (polymorphic-allocator versions of std::string, std::wstring, etc.)
  - String deduction guides
  - std::to_chars() and std::from_chars() for low-level, locale-independent conversions
- 2020 (C++20):
  - std::format for type-safe string formatting
  - std::u8string for UTF-8 (with char8_t)
  - constexpr std::string support (limited: a string created during constant evaluation must be destroyed before that evaluation ends)
  - Compile-time format string checking
  - String prefix/suffix operations (starts_with, ends_with)
  - std::erase() and std::erase_if() for removing elements
- 2023 (C++23):
  - std::print and std::println for simplified output
  - More constexpr string operations
  - std::string::contains() for substring checking
  - std::string::resize_and_overwrite() for efficient buffer manipulation
- 2026 (C++26) (proposed/in progress):
- Further constexpr improvements
- Potential text encoding conversions in standard library
Key Milestones:
- SSO (Small String Optimization): Widely adopted in implementations around 2005-2010
- Copy-on-write removal: Most implementations removed COW after C++11 move semantics (2011-2015)
- ABI stability issues: GCC's dual-ABI support for std::string (2015) to maintain compatibility
- String view adoption: Widespread use after C++17 for avoiding unnecessary copies
Topics for Further Study
The following topics are worth exploring to deepen your understanding of C++ strings and the type system:
- Stream operators: How the << operator works with strings and the iostream library
- String literals in templates: Template deduction rules and how string literals behave in template contexts
- C++ type system design: How templates shape the fundamental design of C++'s type system