All About C & C++ Strings

All About C & C++ Strings

Background

When I was implementing Coogle, I discovered something bizarre about C++ strings. That is why I wrote this note to summarize what I learned about C++ strings.

Coogle is a search engine that I implemented in C++. It is designed to quickly and efficiently find function signatures in large codebases.

For example, there is a function signature like this:

int add(int a, int b);

When I search "int(int, int)", Coogle should be able to find this signature quickly.

However, when I search "std::string(std::string)" to find function signatures that take and return std::string, I found that Coogle could not find any results, even though there are many such functions in the codebase.

After some investigation, I realized that the issue was related to how C++ handles strings. In C++, std::string is actually a typedef for std::basic_string<char>, and there are some subtle differences in how these types are treated in function signatures.

As a compiler engineer, strings and trees are my bread and butter. I decided to dig deeper into the C++ string implementation and document my findings here.


PART 1: C Strings and Character Types

Back to the Basics: Char

Character Types in C

From the C language's perspective, char is a type that represents how data in memory is interpreted. It is a byte of data with a width of CHAR_BIT bits (typically 8 bits on most modern systems). The C standard does not define whether char is signed or unsigned—it is implementation-defined. However, in most implementations, char is signed.

According to C99 §6.2.5 (Types), there are three distinct character types:

  1. char - Implementation-defined signedness (§6.2.5 ¶15)

    • Whether char has the same range, representation, and behavior as signed char or unsigned char is implementation-defined
    • CHAR_MIN is either 0 (unsigned) or SCHAR_MIN (signed) — see §5.2.4.2.1
    • Must be able to represent any member of the basic execution character set (§6.2.5 ¶3)
    • Used for character data and strings
  2. signed char - Always signed (§6.2.5 ¶4)

    • Range: at least -127 to +127 (§5.2.4.2.1)
    • Typically -128 to +127 (2's complement: SCHAR_MIN = -128, SCHAR_MAX = 127)
    • Treated as a small integer type, not for character data
  3. unsigned char - Always unsigned (§6.2.5 ¶6)

    • Range: 0 to at least 255 (UCHAR_MAX ≥ 255) (§5.2.4.2.1)
    • No padding bits (§6.2.6.1 ¶4) — pure binary representation
    • Used to inspect object representations of any type

Why Three Distinct Types? Historical Context

When C was being standardized (late 1970s - 1980s), different architectures had different ideas about bytes:

  • PDP-11 (where C was born): char was 8 bits, naturally unsigned
  • IBM mainframes: Used EBCDIC, not ASCII; different character handling
  • Signedness debates: Some CPUs made signed arithmetic faster, others unsigned
  • Existing codebases: Millions of lines of code with different assumptions about char

The Three-Type Solution

The committee couldn't just pick "signed" or "unsigned" for char without breaking half the existing code. So they made a brilliant compromise:

1. char — The Compatibility Type

Purpose: Preserve existing code and allow hardware-specific optimization

char str[] = "Hello";           // Text data - don't care about sign
char *filename = "/tmp/file";   // String operations

Why implementation-defined:

  • Lets each platform choose what's most efficient for their CPU
  • x86: Signed by default (sign-extend is slightly cheaper)
  • ARM: Unsigned by default (zero-extend is cheaper)
  • Old code continues to work on its original platform

The contract: "If you use char for text/strings where values are ASCII (0-127), you're safe everywhere"

2. signed char — The Small Integer Type

Purpose: When you need a guaranteed signed 8-bit integer

// Example: Delta encoding in compression
signed char deltas[100];  // Differences can be negative
for (int i = 1; i < 100; i++) {
    deltas[i] = data[i] - data[i-1];  // Might be negative
}

Why separate from char:

  • You NEED negative values
  • Can't rely on char being signed (it might be unsigned on ARM!)
  • Makes intent explicit: "I'm using this as a number, not a character"
3. unsigned char — The Byte Manipulation Type

Purpose: Raw memory access and binary data

This is the most important one! Per C99 §6.2.6.1, unsigned char has special properties:

// Serialize an int to bytes (portable!)
int value = 0x12345678;
unsigned char bytes[sizeof(int)];
memcpy(bytes, &value, sizeof(int));
// bytes[0], bytes[1], bytes[2], bytes[3] are guaranteed to contain
// the byte representation

// Read binary file data
FILE *f = fopen("image.png", "rb");
unsigned char buffer[4096];
fread(buffer, 1, 4096, f);

Why separate and why no padding bits:

  • Type punning safety: Only unsigned char* can legally alias any object (§6.5 ¶7)
  • No padding: Every bit pattern is valid; all 256 values are guaranteed
  • Wrap-around: Overflow is well-defined (wraps at UCHAR_MAX + 1)
  • Low-level code: Networks, crypto, compression all need predictable byte access

Type System Distinctions

From the compiler's perspective (and this is crucial for you as a compiler engineer!):

char a;
signed char b;
unsigned char c;

// These are THREE DIFFERENT TYPES in the type system!
// Even if char == signed char at runtime, they're distinct at compile time

char *p1;
signed char *p2;
p1 = p2;  // Compiles, but produces a warning about incompatible pointer types

Per C99 §6.2.5 ¶15, even though char must have the same representation as one of the others, they remain distinct types for type checking purposes. The C standard allows implicit conversions between incompatible pointer types, but good compilers will warn about it because it's a sign of confused intent.

Why compilers warn:

Even though char has the same representation as either signed char or unsigned char on a given platform, treating them as interchangeable violates the type system's semantic distinctions. The three types exist to express different intent: char for text, signed char for small signed integers, and unsigned char for raw bytes. Mixing pointers to these types suggests you may be confused about what your data represents.

C++ Overloading Context

Why this matters for function overloading in C++:

void foo(char c);           // Overload 1
void foo(signed char c);    // Overload 2 - DISTINCT!
void foo(unsigned char c);  // Overload 3 - DISTINCT!

// All three can coexist as separate overloads!

Summary: The Design Wins

  1. Backward compatibility: Old code works on its original platform
  2. Performance: Each platform uses the most efficient representation for char
  3. Type safety: Explicit signed char / unsigned char prevents bugs
  4. Low-level power: unsigned char gives guaranteed byte access
  5. Portability: Code that needs specific signedness can request it

The "redundancy" is actually separation of concerns:

  • char = "I want text, optimize for this platform"
  • signed char = "I need signed arithmetic"
  • unsigned char = "I need raw bytes"

This design is why C succeeded—it balanced portability with "trust the programmer" philosophy and hardware efficiency!


String in C is Just a Char Array

A fact that surprises approximately no one, but let's quickly recap.

In C, strings are represented as arrays of characters (char), terminated by a null character ('\0'). This means that a string in C is essentially a sequence of char values stored in contiguous memory locations.

For example, the string "Hello, World!" can be represented in C as:

char str[] = "Hello, World!";

(Visulization of memory layout)

┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│  H  │  e  │  l  │  l  │  o  │  ,  │     │  W  │  o  │  r  │  l  │  d  │  !  │ \0  │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│ 72  │ 101 │ 108 │ 108 │ 111 │ 44  │ 32  │ 87  │ 111 │ 114 │ 108 │ 100 │ 33  │  0  │
├─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┼─────┤
│0x48 │0x65 │0x6C │0x6C │0x6F │0x2C │0x20 │0x57 │0x6F │0x72 │0x6C │0x64 │0x21 │0x00 │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
  [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]  [10]  [11]  [12]  [13]

  ^                                                                               ^
  |                                                                               |
  str (pointer to first character)                            Null terminator ('\0')

As mentioned earlier, unsigned char is the proper way to inspect raw bytes:

char str[] = "Hello, World!";
unsigned char *bytes = (unsigned char *)str;

for (int i = 0; i < sizeof(str); i++) {
    printf("bytes[%2d] = 0x%02X ('%c')\n",
           i, bytes[i], bytes[i] ? bytes[i] : '?');
}

// Output:
// bytes[ 0] = 0x48 ('H')
// bytes[ 1] = 0x65 ('e')
// ...
// bytes[13] = 0x00 ('?')  ← The null terminator

String Manipulation

There are various of functions to help manipulate C strings, such as strlen, strcpy, strcat, etc., all of which rely on the null terminator to determine the end of the string.

Why This Matters for Compiler Engineers

Understanding this memory layout is crucial because:

  1. String literals in read-only memory: Compilers typically place string literals in .rodata section

  2. Pointer vs. array semantics: char *p = "Hello" vs. char arr[] = "Hello" have different properties

    char *p = "Hello";    // Pointer to string literal (read-only)
    char arr[] = "Hello"; // Array initialized with string (writable)
    

    Visualization:

    char *p = "Hello";              char arr[] = "Hello";
    
    ┌──────────┐                    ┌─────┬─────┬─────┬─────┬─────┬─────┐
    │ pointer  │──────┐             │  H  │  e  │  l  │  l  │  o  │ \0  │
    └──────────┘      │             └─────┴─────┴─────┴─────┴─────┴─────┘
        (8 bytes)     │                      (6 bytes, modifiable)
                      │                       ↑
                      │                       └─ arr (address)
                .rodata (read-only)
                ┌─────┬─────┬─────┬─────┬─────┬─────┐
                │  H  │  e  │  l  │  l  │  o  │ \0  │
                └─────┴─────┴─────┴─────┴─────┴─────┘
                         (6 bytes, read-only)
    
  3. String manipulation: Functions like strcpy, strcat rely on \0 to know when to stop

  4. Buffer overflows: Classic security vulnerabilities arise from not accounting for the null terminator

    // Common mistake:
    char buffer[13];  // Only space for "Hello, World!" WITHOUT \0
    strcpy(buffer, "Hello, World!");  // WARNING: BUFFER OVERFLOW! Needs 14 bytes
    

The problem with C strings

  1. (P1) No length metadata:

    • C strings rely on the null terminator to indicate the end of the string. This means that functions like strlen have to traverse the entire string to find its length, leading to O(n) time complexity for length retrieval.
  2. (P2) Buffer overflows

    • Since C strings do not have built-in bounds checking, it is easy to accidentally write beyond the allocated memory for a string, leading to buffer overflows and potential security vulnerabilities.
  3. (P3) Ambiguous ownership:

    • char *p = "Hello"; points to read-only memory (.rodata), but looks mutable.
    • Writing to it → undefined behavior.
  4. (P4) Manual memory management

    • Dynamic strings require malloc/free.
    • Easy to leak or double-free.
  5. (P5) Encoding issues:

    • C strings are just byte arrays, so handling multi-byte encodings (like UTF-8, UTF-16) requires extra care.
  6. (P6) Error-prone APIs

    • strncpy pads with zeros, doesn’t guarantee null termination.
    • strlen returns size excluding '\0', leading to off-by-one bugs.

PART 2: std::string and Templates

So the C++98 Standard Library Introduced std::string

To address the problems with C strings, the C++98 standard library introduced std::string, which is a more robust and user-friendly string class. Let's see how std::string solves each of the problems (P1-P6):

How std::string Solves C String Problems

Solution to P1: Length Metadata

Problem: C strings require O(n) traversal to find length.

Solution: std::string stores the length as a member variable!

class basic_string {
    char* data_;      // Pointer to character data
    size_t size_;     // Length of string (excluding '\0')
    size_t capacity_; // Allocated capacity
    // ...
};

Benefits:

std::string s = "Hello, World!";

// O(1) time complexity - just returns the stored size_
s.size();     // Returns 13 instantly
s.length();   // Same as size(), returns 13

// Compare to C:
char c_str[] = "Hello, World!";
strlen(c_str); // O(n) - must traverse entire string

Memory trade-off: Extra sizeof(size_t) bytes (typically 8 bytes on 64-bit) to store the length, but huge performance gain!

Solution to P2: Buffer Overflows

Problem: No bounds checking in C string operations.

Solution: std::string automatically manages buffer size and reallocates when needed!

std::string s = "Hello";
s += ", World!";  // OK: Automatically resizes if needed
s += " How are you?";  // OK: Still safe, grows as needed

// C equivalent - dangerous:
char buffer[10] = "Hello";
strcat(buffer, ", World!");  // WARNING: BUFFER OVERFLOW! Only 10 bytes allocated

Bounds-checked access:

std::string s = "Hello";

// Safe access with bounds checking:
try {
    char c = s.at(100);  // Throws std::out_of_range exception
} catch (const std::out_of_range& e) {
    std::cout << "Caught: " << e.what() << std::endl;
}

// Unchecked access (like C, for performance):
char c = s[100];  // WARNING: Undefined behavior, but faster (no bounds check)
Solution to P3: Ambiguous Ownership

Problem: Unclear whether char* points to read-only or writable memory.

Solution: std::string owns its data with clear semantics!

// C - ambiguous:
char *p1 = "Hello";          // Points to read-only .rodata
char *p2 = malloc(10);       // Points to heap memory
strcpy(p2, "Hello");         // Need to track who owns what

// C++ - clear ownership:
std::string s1 = "Hello";    // s1 OWNS a copy of the data
std::string s2 = s1;         // s2 OWNS its own independent copy (deep copy)
s1[0] = 'h';                 // OK: Safe, s1 = "hello"
                             // s2 is still "Hello" (unaffected)

RAII (Resource Acquisition Is Initialization):

void foo() {
    std::string s = "Hello";
    // ... use s ...
}  // OK: Destructor automatically frees memory - no manual cleanup!

// C equivalent:
void foo_c() {
    char *s = malloc(6);
    strcpy(s, "Hello");
    // ... use s ...
    free(s);  // WARNING: Easy to forget! Memory leak if not called
}
Solution to P4: Manual Memory Management

Problem: Dynamic C strings require manual malloc/free.

Solution: std::string uses automatic memory management (RAII)!

// C++ - automatic:
std::string s;
for (int i = 0; i < 1000; i++) {
    s += "word ";  // Automatically allocates/reallocates as needed
}
// OK: Memory automatically freed when s goes out of scope

// C - manual nightmare:
char *s = malloc(1);
s[0] = '\0';
size_t capacity = 1;
for (int i = 0; i < 1000; i++) {
    size_t needed = strlen(s) + strlen("word ") + 1;
    if (needed > capacity) {
        capacity *= 2;
        char *new_s = realloc(s, capacity);
        if (!new_s) {
            free(s);  // WARNING: Must remember to free on error
            return;
        }
        s = new_s;
    }
    strcat(s, "word ");
}
free(s);  // WARNING: Must remember to free!

Copy semantics:

std::string s1 = "Hello";
std::string s2 = s1;  // Deep copy - s2 gets its own memory
s1[0] = 'h';          // s1 = "hello", s2 = "Hello" (independent)

// Move semantics (C++11):
std::string s3 = std::move(s1);  // s3 takes ownership of s1's memory
// s1 is now in a valid but unspecified state (typically empty)
Solution to P5: Encoding Issues

(The real howto is in the next section)

Problem: C strings are just byte arrays, no encoding awareness.

Solution: std::string is still byte-based BUT provides a foundation for encoding-aware types!

// C++98: std::string is still byte-based
std::string utf8_str = u8"Hello, 世界";  // C++11 UTF-8 literal
// Each char is still one byte, but you can work with UTF-8 data

// C++11 added encoding-specific types:
std::u16string utf16_str = u"Hello, 世界";  // UTF-16
std::u32string utf32_str = U"Hello, 世界";  // UTF-32
std::wstring wide_str = L"Hello, 世界";     // Platform-dependent wide char

// All have the same interface as std::string!
utf16_str.size();     // Number of char16_t units
utf16_str += u"!";    // Concatenation works

Better than C:

// C - manual UTF-8 handling:
char utf8[] = "世界";  // How many characters? Need external library!
// strlen(utf8) gives BYTES, not character count

// C++ - at least you have type safety:
std::string utf8 = u8"世界";
// Still need external library for character count, but:
// - Memory is automatically managed
// - Can use standard algorithms
// - Type-safe operations
Solution to P6: Error-Prone APIs

Problem: C string APIs are inconsistent and error-prone.

Solution: std::string provides consistent, intuitive, and safe APIs!

Comparison table:

OperationC (error-prone)C++ (safe & intuitive)
Copystrcpy(dest, src) - no size checkingdest = src; - automatic resizing
Copy with limitstrncpy(dest, src, n) - may not null-terminate!dest.assign(src, 0, n); - always valid
Concatenatestrcat(dest, src) - no size checkingdest += src; - automatic resizing
Lengthstrlen(s) - O(n) traversals.size() - O(1)
Comparestrcmp(s1, s2) - returns ints1 == s2 - returns bool
SubstringManual pointer arithmetics.substr(pos, len)
Findstrstr(haystack, needle) - returns pointers.find(str) - returns size_t position

Examples:

// C - error prone:
char dest[10];
strncpy(dest, "Hello, World!", 10);  // WARNING: Not null-terminated!
dest[9] = '\0';  // Must manually add null terminator

// C++ - safe:
std::string dest = "Hello, World!";
dest = dest.substr(0, 10);  // OK: "Hello, Wor" - properly terminated

// C - confusing return values:
if (strcmp(s1, s2) == 0) { /* equal */ }  // 0 means equal? Confusing!

// C++ - intuitive:
if (s1 == s2) { /* equal */ }  // OK: Natural boolean comparison

// C - pointer arithmetic for substring:
char str[] = "Hello, World!";
char *world = str + 7;  // Points to "World!"
// WARNING: No bounds checking, lifetime tied to str

// C++ - safe substring:
std::string str = "Hello, World!";
std::string world = str.substr(7);  // "World!" - independent copy

Summary: What std::string Provides

#include <string>

std::string s;  // Empty string
s = "Hello";    // Assignment from C string
s += ", World"; // Concatenation with automatic memory management
s[0] = 'h';     // Mutable access: "hello, World"
s.size();       // O(1) length: 12
s.substr(0, 5); // Substring: "hello"
s.find("Wor");  // Find position: 7

if (s == "hello, World") {  // Natural comparison
    // ...
}

// OK: No manual memory management
// OK: No buffer overflow worries (with proper usage)
// OK: Clear ownership semantics
// OK: Consistent, intuitive API
// OK: Automatic cleanup (RAII)

The price you pay:

  • Small overhead: extra bytes for size/capacity
  • Potential heap allocations (though SSO mitigates this - more on that later!)
  • Need to understand value semantics (copies vs. references)

But the safety and convenience are usually worth it!


The Real Story: std::basic_string Template

Now we get to the heart of the matter—and this ties directly back to the Coogle problem mentioned at the beginning!

This section covers:

  1. Template nature: Why std::string is actually a typedef
  2. Character traits: Customizing string behavior
  3. Allocators: Memory management customization (including C++17 PMR)
  4. Practical implications: How this affects tools like Coogle

std::string is Actually a Typedef!

Here's the big reveal from the C++ standard library:

// From <string> header (simplified):
namespace std {
    template<
        class CharT,
        class Traits = char_traits<CharT>,
        class Allocator = allocator<CharT>
    >
    class basic_string;

    // std::string is just a typedef!
    typedef basic_string<char> string;

    // And there are others:
    typedef basic_string<wchar_t> wstring;      // Wide characters
    typedef basic_string<char16_t> u16string;   // UTF-16 (C++11)
    typedef basic_string<char32_t> u32string;   // UTF-32 (C++11)
    typedef basic_string<char8_t> u8string;     // UTF-8 (C++20)
}

This means:

std::string s = "Hello";
// Is actually:
std::basic_string<char, std::char_traits<char>, std::allocator<char>> s = "Hello";

Why Does This Matter? (Back to the Coogle Problem!)

Remember from the introduction:

// Searching for this signature:
std::string foo(std::string s);

// But looking for:
"std::string(std::string)"

// Might not match if the compiler sees it as:
"std::basic_string<char>(std::basic_string<char>)"

The compiler's type system sees the full template instantiation, not the typedef!

What Are Character Traits?

Character traits define how characters behave. They're a policy class that abstracts character operations:

template<class CharT>
struct char_traits {
    typedef CharT char_type;
    typedef /* implementation-defined */ int_type;
    typedef /* implementation-defined */ pos_type;
    typedef /* implementation-defined */ off_type;
    typedef /* implementation-defined */ state_type;

    // Character comparison
    static bool eq(char_type a, char_type b);
    static bool lt(char_type a, char_type b);

    // String operations
    static size_t length(const char_type* s);
    static int compare(const char_type* s1, const char_type* s2, size_t n);
    static char_type* copy(char_type* dest, const char_type* src, size_t n);
    static char_type* move(char_type* dest, const char_type* src, size_t n);

    // Character manipulation
    static char_type to_char_type(int_type c);
    static int_type to_int_type(char_type c);
    static bool eq_int_type(int_type c1, int_type c2);
    static int_type eof();
    static int_type not_eof(int_type c);

    // And more...
};

Why Separate Traits from the Character Type?

Design principle: Separate the data representation (char, wchar_t) from the behavior (comparison, copying, etc.)

Example - Case-Insensitive String:

// Custom traits for case-insensitive comparison
struct ci_char_traits : public std::char_traits<char> {
    static bool eq(char c1, char c2) {
        return std::toupper(c1) == std::toupper(c2);
    }

    static bool lt(char c1, char c2) {
        return std::toupper(c1) < std::toupper(c2);
    }

    static int compare(const char* s1, const char* s2, size_t n) {
        while (n-- != 0) {
            if (std::toupper(*s1) < std::toupper(*s2)) return -1;
            if (std::toupper(*s1) > std::toupper(*s2)) return 1;
            ++s1; ++s2;
        }
        return 0;
    }

    // ... implement other methods ...
};

// Now create a case-insensitive string type!
typedef std::basic_string<char, ci_char_traits> ci_string;

int main() {
    ci_string s1 = "Hello";
    ci_string s2 = "HELLO";

    if (s1 == s2) {
        std::cout << "Equal (case-insensitive)!\n";  // Prints!
    }

    std::string s3 = "Hello";
    std::string s4 = "HELLO";

    if (s3 == s4) {
        std::cout << "Equal\n";  // Doesn't print
    } else {
        std::cout << "Not equal (case-sensitive)\n";  // Prints!
    }
}

Memory Layout of std::basic_string

A typical implementation (simplified):

template<class CharT, class Traits, class Allocator>
class basic_string {
private:
    CharT* data_;           // Pointer to character buffer
    size_t size_;           // Number of characters (excluding null terminator)
    size_t capacity_;       // Allocated capacity
    Allocator allocator_;   // Allocator for memory management

    // OR with Small String Optimization (SSO):
    union {
        struct {
            CharT* ptr;      // Heap pointer
            size_t size;
            size_t capacity;
        } heap;
        struct {
            CharT buffer[16]; // Stack buffer (size varies by implementation)
            unsigned char size;
        } stack;
    } data_;

public:
    // Member types
    typedef Traits traits_type;
    typedef typename Traits::char_type value_type;
    typedef Allocator allocator_type;
    typedef size_t size_type;
    typedef ptrdiff_t difference_type;
    typedef value_type& reference;
    typedef const value_type& const_reference;
    typedef /* implementation */ iterator;
    typedef /* implementation */ const_iterator;

    // Constructors, operators, methods...
};

Visualizing std::string vs std::basic_string<char>

Type System View:

std::string
    ↓ (typedef expansion)
std::basic_string<char, std::char_traits<char>, std::allocator<char>>
    ↓ (template instantiation)
[Concrete class with all methods specialized for char]

Memory Layout (example with SSO):
sizeof(std::string) = 32 bytes on typical 64-bit system

┌───────────────────────────────────┐
│  Union (24 bytes)                 │
│  ┌─────────────────────────────┐  │
│  │ Option 1: Small (≤ 15 chars)│  │
│  │ buffer[16 chars]            │  │
│  │ size (1 byte)               │  │
│  ├─────────────────────────────┤  │
│  │ Option 2: Large (> 15 chars)│  │
│  │ ptr (8 bytes)               │  │
│  │ size (8 bytes)              │  │
│  │ capacity (8 bytes)          │  │
│  └─────────────────────────────┘  │
├───────────────────────────────────┤
│  Allocator (varies, often 0)      │
└───────────────────────────────────┘

Small String Optimization (SSO)

Notice the union in the memory layout above? That's the key to Small String Optimization (SSO), one of the most important performance optimizations in modern C++ implementations.

The Problem SSO Solves

Every heap allocation is expensive:

  • System call overhead (malloc/new)
  • Cache misses (heap data is far from the string object)
  • Memory fragmentation
  • Deallocation overhead (free/delete)

But most strings in real programs are short! Studies show:

  • ~80% of strings are 15 characters or less
  • Function names, variable names, error messages, JSON keys, etc.

Why allocate on the heap for "Hello"?

How SSO Works

Instead of always heap-allocating, std::string uses a clever trick:

std::string short_str = "Hello";      // SSO: stored inline, no heap allocation!
std::string long_str = "This is a much longer string that exceeds SSO limit";
                                      // Heap: allocated on heap

// Both are 32 bytes (or 24, or 16 depending on implementation)
sizeof(short_str) == sizeof(long_str)

// But behavior is different:
// short_str: data is INSIDE the object
// long_str: data is OUTSIDE the object (heap)
SSO Performance Benefits

Before SSO:

std::string s = "Hello";  // Allocate 6 bytes on heap
// - malloc() system call
// - Cache miss when accessing "Hello"
// - free() call on destruction

With SSO:

std::string s = "Hello";  // Store inline in the 32-byte object
// - No malloc()
// - Data in cache (next to object)
// - No free() needed

Benchmark impact:

  • 2-10x faster for short string operations
  • Better cache locality: String data is adjacent to the object
  • Reduced memory fragmentation: Fewer heap allocations
Implementation Variations

Different standard library implementations use different SSO buffer sizes:

ImplementationSSO SizeTotal sizeof(std::string)
libstdc++ (GCC)15 bytes32 bytes (64-bit)
libc++ (Clang)22 bytes24 bytes (64-bit)
MSVC STL15 bytes32 bytes (64-bit)

Why different sizes?

  • Trade-off between object size and inline storage
  • ABI (Application Binary Interface) stability concerns
  • Different optimization strategies
How to Detect SSO in Action
#include <iostream>
#include <string>

void print_address(const std::string& s) {
    const void* obj_addr = &s;
    const void* data_addr = s.data();

    std::cout << "Object at:  " << obj_addr << "\n";
    std::cout << "Data at:    " << data_addr << "\n";

    // Check if data is inside the object (SSO)
    const char* obj_bytes = reinterpret_cast<const char*>(obj_addr);
    const char* data_bytes = reinterpret_cast<const char*>(data_addr);

    if (data_bytes >= obj_bytes &&
        data_bytes < obj_bytes + sizeof(std::string)) {
        std::cout << "SSO: Data stored INLINE\n";
    } else {
        std::cout << "Heap: Data stored on HEAP\n";
    }
}

int main() {
    std::string short_str = "Hello";
    std::string long_str = "This is a very long string that will not fit in SSO buffer";

    print_address(short_str);  // SSO
    std::cout << "\n";
    print_address(long_str);   // Heap
}

// Typical output:
// Object at:  0x7ffc1234abc0
// Data at:    0x7ffc1234abc0  ← Same! Data is inside object
// SSO: Data stored INLINE
//
// Object at:  0x7ffc1234abe0
// Data at:    0x55a8d9e0f2c0  ← Different! Data is on heap
// Heap: Data stored on HEAP
When SSO Doesn't Apply

SSO is disabled when:

  1. String is too long: Exceeds the buffer size (typically 15-22 chars)
  2. Custom allocator used: Some allocators may not support SSO
  3. Shared ownership: If string shares data (rare, mostly removed in C++11)
// SSO applies
std::string s1 = "Hello";

// SSO does NOT apply - too long
std::string s2 = "This string is definitely too long for SSO";

// PMR with null resource - SSO is the ONLY option!
std::pmr::null_memory_resource null_mr;
std::pmr::string s3("Hi", &null_mr);  // OK: Fits in SSO buffer
std::pmr::string s4("This is too long", &null_mr);  // RUNTIME ERROR: Can't allocate!
Why This Matters
  1. Performance: Short strings are extremely common; SSO makes them fast
  2. Memory: Fewer heap allocations = less fragmentation
  3. Cache: Better locality = fewer cache misses
  4. Predictability: Short strings have deterministic performance (no malloc)

The "free lunch": Most modern C++ programs get SSO optimization automatically without any code changes!


Type Identity Problem for Compilers

Here's why this matters for your Coogle tool:

void foo(std::string s);
// Mangled name might be: _Z3fooNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

void bar(std::basic_string<char> s);
// Mangled name: _Z3barNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

// They're THE SAME type, but name lookup might differ!

For your search engine, you need to handle:

  1. Typedef aliases: std::string == std::basic_string<char>
  2. Default template arguments: std::basic_string<char> == std::basic_string<char, std::char_traits<char>, std::allocator<char>>
  3. Namespace qualifications: string vs std::string vs ::std::string

Template Instantiation Example

#include <string>

// When you write:
std::string s = "Hello";

// The compiler instantiates:
template class std::basic_string<
    char,                       // CharT
    std::char_traits<char>,     // Traits (default)
    std::allocator<char>        // Allocator (default)
>;

// And calls the constructor:
basic_string::basic_string(const char* s);

PART 3: Advanced Topics (PMR, Allocators, Encodings)

Memory Management: The Allocator Parameter

The third template parameter (Allocator) controls how std::basic_string allocates memory:

template<
    class CharT,
    class Traits = char_traits<CharT>,
    class Allocator = allocator<CharT>  // ← This one!
>
class basic_string;

This section covers two approaches to custom memory management:

  • Traditional allocators (C++98): Type-based, compile-time selection
  • Polymorphic allocators (C++17): Runtime selection with type compatibility

Why Customize Allocators?

  1. Performance: Custom allocation strategies for specific use cases
  2. Debugging: Track memory usage, detect leaks
  3. Embedded systems: Pre-allocated memory pools
  4. Memory locality: Keep related data together in cache

Traditional Allocators (C++98)

Example - Custom allocator:

#include <memory>
#include <string>

// Using a custom allocator
template<typename T>
class MyAllocator : public std::allocator<T> {
    // Custom allocation logic...
};

typedef std::basic_string<char, std::char_traits<char>, MyAllocator<char>> my_string;

my_string s = "Hello";  // Uses MyAllocator for memory management

The Problem with Traditional Allocators:

// These are DIFFERENT TYPES because of different allocators!
std::basic_string<char, std::char_traits<char>, std::allocator<char>> s1;
std::basic_string<char, std::char_traits<char>, MyAllocator<char>> s2;

s1 = s2;  // ERROR: COMPILE ERROR! Incompatible types!

Different allocators create different types, making code inflexible and hard to compose.


Polymorphic Memory Resources (C++17)

C++17 introduced std::pmr to solve the allocator type problem. This is a major improvement for flexible memory management.

The Solution:

#include <memory_resource>
#include <string>

namespace std {
    namespace pmr {
        // All pmr strings use the SAME TYPE but different memory resources!
        typedef basic_string<char, char_traits<char>, polymorphic_allocator<char>> string;
        typedef basic_string<wchar_t, char_traits<wchar_t>, polymorphic_allocator<wchar_t>> wstring;
        typedef basic_string<char16_t, char_traits<char16_t>, polymorphic_allocator<char16_t>> u16string;
        typedef basic_string<char32_t, char_traits<char32_t>, polymorphic_allocator<char32_t>> u32string;
    }
}
Key Benefits of PMR
  1. Same type: All pmr::string objects have the same type regardless of memory resource
  2. Runtime selection: Choose memory resource at runtime, not compile time
  3. Interoperability: Can assign between strings using different resources
Basic Usage Example
#include <memory_resource>
#include <string>
#include <iostream>

int main() {
    // Different memory resources
    std::pmr::monotonic_buffer_resource pool1(1024);  // 1KB buffer
    std::pmr::monotonic_buffer_resource pool2(2048);  // 2KB buffer

    // Both are std::pmr::string type, but use different resources!
    std::pmr::string s1("Hello", &pool1);   // Uses pool1
    std::pmr::string s2("World", &pool2);   // Uses pool2

    // OK: This works! Same type, can assign despite different resources
    s1 = s2;  // OK - copies the string, s1 still uses pool1 for allocation

    std::cout << s1 << std::endl;  // Prints "World"
}
Available Memory Resources

C++17 provides several built-in memory resources for different use cases:

#include <memory_resource>

// 1. Default heap allocator
std::pmr::string s1 = "Hello";  // Uses new/delete (default)

// 2. Monotonic buffer - fast allocation, no individual deallocation
char buffer[1024];
std::pmr::monotonic_buffer_resource mbr(buffer, sizeof(buffer));
std::pmr::string s2("Hello", &mbr);  // Allocates from buffer

// 3. Unsynchronized pool - fast, single-threaded
std::pmr::unsynchronized_pool_resource pool;
std::pmr::string s3("Hello", &pool);

// 4. Synchronized pool - thread-safe
std::pmr::synchronized_pool_resource sync_pool;
std::pmr::string s4("Hello", &sync_pool);

// 5. Null memory resource - allocations fail (for testing)
std::pmr::null_memory_resource null_mr;
std::pmr::string s5("Hi", &null_mr);  // Only works if SSO applies!

Memory Resource Hierarchy:

std::pmr::memory_resource (abstract base class)
    ├── std::pmr::new_delete_resource() - default heap
    ├── std::pmr::null_memory_resource() - always fails
    ├── std::pmr::monotonic_buffer_resource - append-only, fast
    ├── std::pmr::unsynchronized_pool_resource - pooled, single-threaded
    └── std::pmr::synchronized_pool_resource - pooled, thread-safe
Real-World Example: Arena Allocation

A common pattern in high-performance code is arena allocation (also called region-based allocation). All memory for a request is allocated from a buffer and freed in one operation:

#include <memory_resource>
#include <vector>
#include <string>

void process_request() {
    // Create a memory arena for this request
    char buffer[4096];  // 4KB stack buffer
    std::pmr::monotonic_buffer_resource arena(buffer, sizeof(buffer));

    // All allocations come from the arena
    std::pmr::vector<std::pmr::string> messages(&arena);

    messages.push_back(std::pmr::string("Message 1", &arena));
    messages.push_back(std::pmr::string("Message 2", &arena));
    messages.push_back(std::pmr::string("Message 3", &arena));

    // Process messages...

}  // OK: Arena destroyed, all memory freed at once (super fast!)
   // No individual deallocations needed!

Why this is faster:

  • No individual delete calls
  • Better cache locality (data packed together)
  • Reduced memory fragmentation
  • Common in game engines, servers, compilers
Comparison: Traditional vs PMR Strings
// Traditional string - allocator is part of the type
std::string s1 = "Hello";
std::basic_string<char, std::char_traits<char>, MyAllocator<char>> s2 = "World";
// s1 and s2 are DIFFERENT TYPES - cannot assign!

// PMR string - allocator chosen at runtime
std::pmr::monotonic_buffer_resource pool1(1024);
std::pmr::monotonic_buffer_resource pool2(2048);

std::pmr::string pmr_s1("Hello", &pool1);
std::pmr::string pmr_s2("World", &pool2);
// pmr_s1 and pmr_s2 are the SAME TYPE - can assign!
pmr_s1 = pmr_s2;  // OK: Works!
When to Use PMR

Use std::pmr::string when:

  • You need to control memory allocation strategy
  • Working with embedded systems or real-time systems
  • Building high-performance servers (arena allocation)
  • Need containers with strings to share memory resources
  • Want runtime flexibility without type proliferation

Stick with std::string when:

  • Default heap allocation is fine
  • Code simplicity is priority
  • No special memory requirements
  • C++17 not available
PMR and Type Identity (Important for Coogle!)
// For Coogle, you need to handle:
"std::string""std::basic_string<char>"
"std::pmr::string""std::basic_string<char, std::char_traits<char>, std::pmr::polymorphic_allocator<char>>"

// They're DIFFERENT types despite similar names!
void foo(std::string s);      // Type 1
void bar(std::pmr::string s); // Type 2 - DIFFERENT!

// Cannot implicitly convert:
std::string s1 = "Hello";
std::pmr::string s2 = s1;  // ERROR: Compile error!

// Must explicitly construct:
std::pmr::string s3(s1.begin(), s1.end());  // OK

Different Character Encodings

Beyond char, std::basic_string supports multiple character types for different encodings:

// All these use the same basic_string template:

std::string         s8  = "Hello";           // char
std::wstring        ws  = L"Hello";          // wchar_t (2 or 4 bytes)
std::u16string      s16 = u"Hello";          // char16_t (2 bytes, C++11)
std::u32string      s32 = U"Hello";          // char32_t (4 bytes, C++11)
std::u8string       u8s = u8"Hello";         // char8_t (1 byte, C++20)

// Memory layout for "Hello":
// s8:  [H][e][l][l][o][\0]          6 bytes
// ws:  [H\0][e\0][l\0][l\0][o\0][\0\0]  12 bytes (UTF-16) or 24 (UTF-32)
// s16: [H\0][e\0][l\0][l\0][o\0][\0\0]  12 bytes
// s32: [H\0\0\0][e\0\0\0]...            24 bytes

// PMR versions (C++17):
std::pmr::string    pmr_s8  = "Hello";       // Uses polymorphic allocator
std::pmr::wstring   pmr_ws  = L"Hello";
std::pmr::u16string pmr_s16 = u"Hello";
std::pmr::u32string pmr_s32 = U"Hello";

Design Rationale and Trade-offs

Why This Template-Based Design?

Benefits:

  1. Code reuse: One implementation works for all character types
  2. Type safety: std::string and std::wstring are incompatible types
  3. Customization: Can provide custom traits or allocators
  4. Performance: Template specialization allows optimization
  5. Consistency: Same interface for all character types

Trade-offs:

  1. Compilation time: Templates increase compile time
  2. Code bloat: Each instantiation generates code
  3. Complex error messages: Template errors can be verbose
  4. Binary size: Multiple instantiations = larger binaries

Summary: The Template Nature of C++ Strings

┌─────────────────────────────────────────────────────────────┐
│    std::basic_string<CharT, Traits, Alloc> (Template)       │
└─────────────────────────────────────────────────────────────┘
          ┌──────────────────┼───────────────────┐
          │                  │                   │
          ▼                  ▼                   ▼
    ┌──────────┐       ┌───────────┐      ┌────────────┐
    │  string  │       │ wstring   │      │ u16string  │
    │ (char)   │       │ (wchar_t) │      │ (char16_t) │
    └──────────┘       └───────────┘      └────────────┘

    Same template, different character types!

Key takeaway: Understanding that std::string is a template instantiation (not a primitive type) is crucial for:

  • Building tools like Coogle that analyze C++ code
  • Understanding compilation errors
  • Knowing when conversions are allowed
  • Optimizing performance (e.g., move semantics)
  • Creating custom string types with different behaviors

Summary: String Evolution Timeline

C Era:

  • 1972: C introduced null-terminated strings (char* with \0)
  • 1989: ANSI C standardized string functions (strcpy, strlen, etc.)

C++ Evolution:

  • 1998 (C++98): std::string introduced with RAII and automatic memory management
  • 2011 (C++11):
    • Move semantics for efficient string transfers
    • UTF-16/UTF-32 strings (std::u16string, std::u32string)
    • Raw string literals (R"(text)")
    • User-defined literals support
    • std::to_string() for converting numbers to strings
    • std::stoi(), std::stol(), std::stof() for string-to-number conversions
    • Range-based for loops work with strings
    • shrink_to_fit() to reduce capacity to size
  • 2014 (C++14):
    • Standard user-defined string literals (""s operator)
    • Heterogeneous lookup for std::string in associative containers
  • 2017 (C++17):
    • std::string_view for non-owning string references
    • std::pmr::basic_string (polymorphic allocators for std::string, std::wstring, etc.)
    • String deduction guides
    • std::to_chars() and std::from_chars() for low-level, locale-independent conversions
    • Splicing string literals ("Hello" "World" concatenation at compile time)
  • 2020 (C++20):
    • std::format for type-safe string formatting
    • std::u8string for UTF-8 (with char8_t)
    • constexpr std::string support (limited - destructor not constexpr yet)
    • Compile-time format string checking
    • String prefix/suffix operations (starts_with, ends_with)
    • erase() and erase_if() for removing elements
  • 2023 (C++23):
    • std::print and std::println for simplified output
    • More constexpr string operations
    • std::string::contains() for substring checking
    • std::string::resize_and_overwrite() for efficient buffer manipulation
  • 2026 (C++26) (proposed/in progress):
    • Further constexpr improvements
    • Potential text encoding conversions in standard library

Key Milestones:

  • SSO (Small String Optimization): Widely adopted in implementations around 2005-2010
  • Copy-on-write removal: Most implementations removed COW after C++11 move semantics (2011-2015)
  • ABI stability issues: GCC's dual-ABI support for std::string (2015) to maintain compatibility
  • String view adoption: Widespread use after C++17 for avoiding unnecessary copies

Topics for Further Study

The following topics are worth exploring to deepen your understanding of C++ strings and the type system:

  • Stream operators: How the << operator works with strings and the iostream library
  • String literals in templates: Template deduction rules and how string literals behave in template contexts
  • C++ type system design: How templates shape the fundamental design of C++'s type system