Performance cost: std::allocator vs std::pmr::polymorphic_allocator

This blog post explores why memory management is a critical bottleneck in embedded systems and how C++17’s Polymorphic Memory Resources (PMR) compare to the default std::allocator in performance, determinism, and memory efficiency. (See correction notice - original results were flawed.)

Update (April 2026): I posted this to r/cpp and got roasted. Deservedly. The benchmark had two bugs that made every result below wrong: my PMR buffer was too small (so it quietly fell back to the heap) and I accidentally compiled at -O0. Fix both and PMR monotonic is actually ~1.5× faster, not 3-4× slower. I’ve kept the original numbers up (struck-through where needed) and added an errata at the bottom with the full story. Thanks to kammce, jwakely, SupermanLeRetour, Peddy699, SeanCline and everyone else on r/cpp who took the time.

Why Memory Management Matters in Embedded Systems

Embedded systems operate under constraints that desktop applications rarely face:

  • Limited RAM - Often measured in kilobytes, not gigabytes
  • Deterministic behavior - Real-time deadlines must be met
  • No virtual memory - Physical RAM is all you get
  • Fragmentation concerns - Memory leaks and fragmentation can’t be “fixed” by rebooting
  • Power constraints - Every allocation costs energy

The problem: new and delete aren’t just slow—they’re unpredictable. A single allocation might succeed in 10µs or fail after 1ms of heap searching.

Traditional dynamic memory allocation using malloc/free or new/delete introduces:

graph TD
    A[Dynamic Allocation Request] --> B{Heap Search}
    B -->|Best Fit| C[Search entire free list]
    B -->|First Fit| D[Search until fit found]
    C --> E[Non-deterministic timing]
    D --> E
    E --> F[Fragmentation over time]
    F --> G[Memory exhaustion]
    G --> H[System failure]
    
    style E fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style F fill:#ffe066,stroke:#f08c00
    style H fill:#ff6b6b,stroke:#c92a2a,color:#fff

Real-world impact:

  • Spacecraft software often bans dynamic allocation entirely
  • Automotive safety systems use static allocation only
  • Industrial controllers pre-allocate all memory at startup

Traditional Approaches

Embedded developers have historically used several strategies to avoid heap allocation:

1. Static Allocation

// Everything allocated at compile time
struct SensorBuffer {
    static constexpr size_t MAX_SAMPLES = 100;
    int samples[MAX_SAMPLES];
    size_t count = 0;
};

SensorBuffer buffer;  // No runtime allocation

Pros: Predictable, fast, no fragmentation
Cons: Wastes memory, inflexible, compile-time sizing

2. Fixed-Size Pools

// Pre-allocated pool of Message objects
template<typename T, size_t N>
class ObjectPool {
    alignas(T) char storage[N][sizeof(T)];
    bool used[N] = {};
    
public:
    T* allocate() {
        for (size_t i = 0; i < N; ++i) {
            if (!used[i]) {
                used[i] = true;
                return new (&storage[i]) T();
            }
        }
        return nullptr;
    }
    
    void deallocate(T* ptr) {
        ptr->~T();  // run destructor
        size_t i = static_cast<size_t>(
            reinterpret_cast<char*>(ptr) - &storage[0][0]) / sizeof(T);
        used[i] = false;  // mark slot as free
    }
};

Pros: Bounded allocation time, no fragmentation
Cons: Fixed capacity, manual management, type-specific

3. Arena/Region Allocators

class Arena {
    char* buffer;
    size_t size;
    size_t offset = 0;
    
public:
    Arena(char* buf, size_t n) : buffer(buf), size(n) {}
    
    // Note: ignores alignment for brevity
    void* allocate(size_t n) {
        if (offset + n > size) return nullptr;  // out of space
        void* ptr = buffer + offset;
        offset += n;                            // bump pointer
        return ptr;
    }
    
    void reset() { offset = 0; }  // Bulk deallocation
};

Pros: Fast allocation, bulk deallocation
Cons: No individual deallocation, manual lifecycle

The STL Provides C++'s Standard Containers, So Why Are Embedded Developers Skeptical of It?

The C++ Standard Library provides powerful containers like vector, map, unordered_map, etc. Yet embedded developers often avoid them:

// Desktop developer's natural approach
void process_data() {
    std::vector<SensorReading> readings;  // Uses heap
    std::map<int, Device> devices;        // Uses heap
    
    for (int i = 0; i < sensor_count; ++i) {
        readings.push_back(read_sensor());  // Hidden allocations
    }
}

The Problems

  1. Non-deterministic allocation: vector::push_back might allocate or might not
  2. Exceptions: Many embedded projects compile with -fno-exceptions
  3. Code bloat: Templates can increase binary size significantly
  4. Hidden costs: Iterator operations might be expensive
  5. Lack of control: Can’t specify where memory comes from

Dynamic Memory Management: The Hidden Costs

Let’s quantify what “hidden costs” actually means:

Allocation overhead on ARM Cortex-M4 (168 MHz):

| Operation | Best Case | Worst Case | Variance |
|---|---|---|---|
| malloc(32) | 150 cycles (~0.9µs) | 15,000 cycles (~89µs) | 100x |
| free() | 80 cycles (~0.5µs) | 8,000 cycles (~48µs) | 100x |
| vector::push_back (no resize) | 20 cycles | 20 cycles | 1x |
| vector::push_back (resize) | 200 cycles | 20,000 cycles | 100x |

The variance is what matters. A 100x timing difference is unacceptable when you have a 1ms deadline.

Fragmentation

sequenceDiagram
    participant App
    participant Heap
    
    Note over Heap: Initial: [4KB free block]
    
    App->>Heap: alloc 1KB
    Note over Heap: [1KB used][3KB free]
    
    App->>Heap: alloc 1KB
    Note over Heap: [1KB][1KB][2KB free]
    
    App->>Heap: free first block
    Note over Heap: [1KB free][1KB][2KB free]
    
    App->>Heap: alloc 2KB
    Note over Heap: Can't fit! Only 1KB contiguous
    
    Note over Heap: Total free: 3KB<br/>Max contiguous: 2KB<br/>Fragmentation: 33%

Real-world impact:

  • System with 128KB RAM might only have 64KB usable after fragmentation
  • Medical device recalled due to memory exhaustion after 72 hours of operation
  • Industrial controller required daily reboots to “clear memory”

Call Stack Example

std::vector<int> data;
data.reserve(100);  // One allocation

// Stack trace during reserve():
// vector::reserve()
//   └─ allocator::allocate()
//      └─ operator new()
//         └─ malloc()
//            └─ heap_search()     ← Non-deterministic
//               └─ find_free_block()
//                  └─ coalesce_blocks()

Understanding std::pmr (Polymorphic Memory Resources)

C++17 introduced Polymorphic Memory Resources to solve exactly this problem. The key insight:

Separate the what (container logic) from the where (memory source)

The Architecture

graph TD
    A[pmr::vector Container] --> B[memory_resource interface]
    B --> C[monotonic_buffer_resource]
    B --> D[pool_resource]
    B --> E[synchronized_pool_resource]
    B --> F[Custom allocator]
    
    C --> G[Stack buffer]
    C --> H[Arena]
    D --> I[Fixed pools]
    F --> J[DMA memory]
    F --> K[Shared memory]
    
    style A fill:#74c0fc,stroke:#1971c2
    style B fill:#ffe066,stroke:#f08c00
    style C fill:#51cf66,stroke:#2f9e44
    style D fill:#51cf66,stroke:#2f9e44
    style E fill:#51cf66,stroke:#2f9e44
    style F fill:#51cf66,stroke:#2f9e44

Key idea: The container doesn’t know or care where memory comes from—it just calls the memory_resource interface.

// Traditional: tied to global heap
std::vector<int> vec;

// PMR: you control the memory source
char buffer[4096];
std::pmr::monotonic_buffer_resource pool{buffer, sizeof(buffer)};
std::pmr::vector<int> vec{&pool};  // Uses our buffer, not heap!

Key pmr Components

1. memory_resource (Base Interface)

class memory_resource {
public:
    // Non-virtual public API; each call forwards to the matching private virtual
    void* allocate(size_t bytes, size_t alignment);
    void deallocate(void* p, size_t bytes, size_t alignment);
    bool is_equal(const memory_resource& other) const noexcept;
private:
    virtual void* do_allocate(size_t bytes, size_t alignment) = 0;
    virtual void do_deallocate(void* p, size_t bytes, size_t alignment) = 0;
    virtual bool do_is_equal(const memory_resource& other) const noexcept = 0;
};

Your guarantee: All pmr containers use only these three functions.
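To make the interface concrete, here is a minimal sketch of a resource that implements the three virtuals. `CountingResource` is my own illustrative name, not a standard type; it forwards every request to `new_delete_resource()` and counts calls:

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Illustrative tracing resource (not part of the standard library):
// forwards to new_delete_resource() and counts allocate() calls.
class CountingResource : public std::pmr::memory_resource {
public:
    std::size_t alloc_calls = 0;

private:
    std::pmr::memory_resource* upstream_ = std::pmr::new_delete_resource();

    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++alloc_calls;
        return upstream_->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        upstream_->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& o) const noexcept override {
        return this == &o;
    }
};
```

Attach it to any pmr container (`std::pmr::vector<int> v{&cr};`) and every reallocation shows up in `alloc_calls`.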

2. monotonic_buffer_resource

Best for: Bulk allocation with single reset

// Allocate from stack buffer
char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};

{
    std::pmr::vector<Message> messages{&mbr};
    std::pmr::string temp{&mbr};
    
    process_messages(messages);
    
}  // No individual deallocations!

mbr.release();  // Bulk reset - O(1)

sequenceDiagram
    participant Container
    participant MBR as monotonic_buffer
    participant Buffer as Stack Buffer
    
    Container->>MBR: allocate(64 bytes)
    MBR->>Buffer: bump pointer by 64
    Note over Buffer: [64 used][remaining free]
    
    Container->>MBR: allocate(128 bytes)
    MBR->>Buffer: bump pointer by 128
    Note over Buffer: [192 used][remaining free]
    
    Container->>MBR: deallocate(first block)
    Note over MBR: No-op! Can't reuse yet
    
    Note over Container: Scope ends
    Container->>MBR: release()
    MBR->>Buffer: Reset pointer to start
    Note over Buffer: [All free again]

Performance:

  • Allocation: O(1) - just increment a pointer
  • Deallocation: O(1) - no-op
  • Reset: O(1) - bulk cleanup

Trade-off: Individual deallocations are ignored until release()
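The no-op deallocate and O(1) reset can be observed directly. A small sketch (the monotonic semantics are per the standard; the upward pointer-bump layout is how libstdc++ and libc++ implement it):

```cpp
#include <cstddef>
#include <memory_resource>

// Watch the monotonic resource bump a pointer through a stack buffer.
// With null_memory_resource() upstream, overflow would throw instead of
// silently falling back to the heap.
inline bool monotonic_bump_demo() {
    alignas(alignof(std::max_align_t)) char buffer[256];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer),
                                            std::pmr::null_memory_resource()};

    char* p1 = static_cast<char*>(mbr.allocate(16, 8));
    char* p2 = static_cast<char*>(mbr.allocate(16, 8));

    bool in_buffer = p1 >= buffer && p2 + 16 <= buffer + sizeof(buffer);
    bool bumped    = p2 >= p1 + 16;  // second block sits after the first

    mbr.deallocate(p1, 16, 8);  // no-op: nothing is reclaimed yet
    mbr.release();              // O(1) bulk reset, buffer usable again

    return in_buffer && bumped;
}
```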

3. unsynchronized_pool_resource & synchronized_pool_resource

Best for: Frequent allocation/deallocation of similar sizes

std::pmr::unsynchronized_pool_resource pool;

{
    std::pmr::vector<Packet> packets{&pool};
    std::pmr::map<int, Device> devices{&pool};
    
    // Efficient allocation from pools
    for (int i = 0; i < 1000; ++i) {
        packets.push_back(Packet{});  // From pool
    }
}

// Memory returned to pools, ready for reuse

How it works:

graph TD
    A[pool_resource] --> B[Pool: 16-byte blocks]
    A --> C[Pool: 32-byte blocks]
    A --> D[Pool: 64-byte blocks]
    A --> E[Pool: 128-byte blocks]
    A --> F[Overflow to upstream]
    
    B --> B1[Free list]
    C --> C1[Free list]
    D --> D1[Free list]
    E --> E1[Free list]
    
    style A fill:#ffe066,stroke:#f08c00
    style B fill:#51cf66,stroke:#2f9e44
    style C fill:#51cf66,stroke:#2f9e44
    style D fill:#51cf66,stroke:#2f9e44
    style E fill:#51cf66,stroke:#2f9e44
    style F fill:#ff6b6b,stroke:#c92a2a,color:#fff

Performance:

  • Allocation: O(1) - pop from free list
  • Deallocation: O(1) - push to free list
  • No fragmentation within pools

Trade-off: More memory overhead for pool metadata
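Pool geometry can be tuned through `std::pmr::pool_options`, which is a real part of the API. A brief sketch:

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// Configuring an unsynchronized pool. Requests larger than
// largest_required_pool_block bypass the bins and go straight upstream.
inline std::size_t pool_demo() {
    std::pmr::pool_options opts;
    opts.max_blocks_per_chunk        = 64;   // blocks fetched per chunk refill
    opts.largest_required_pool_block = 256;  // bin size ceiling in bytes

    std::pmr::unsynchronized_pool_resource pool{opts};

    std::pmr::vector<int> v{&pool};
    for (int i = 0; i < 1000; ++i) v.push_back(i);
    // early (small) growths come from the bins; reallocations larger than
    // 256 bytes are forwarded to the upstream resource
    return v.size();
}
```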

4. Custom Allocators

You can implement your own memory_resource for specialized needs:

class DMAMemoryResource : public std::pmr::memory_resource {
    char* dma_region;    // start of a DMA-capable region
    size_t region_size;
    size_t offset = 0;
    
protected:
    void* do_allocate(size_t bytes, size_t align) override {
        // Round the bump offset up to the requested alignment
        size_t aligned = (offset + align - 1) & ~(align - 1);
        if (aligned + bytes > region_size) throw std::bad_alloc{};
        void* ptr = dma_region + aligned;
        offset = aligned + bytes;
        return ptr;
    }
    
    void do_deallocate(void*, size_t, size_t) override {
        // No-op for DMA buffers
    }
    
    bool do_is_equal(const memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Now use with any pmr container
DMAMemoryResource dma_mem;
std::pmr::vector<uint8_t> dma_buffer{&dma_mem};

Common custom allocators:

  • DMA-capable memory regions
  • Cache-aligned allocations
  • Shared memory segments
  • Non-volatile RAM
  • Custom memory protection schemes

Performance Cost of std::pmr vs std Containers & Trade-offs

Let’s measure real performance on actual hardware.

My Test Setup

I built a benchmark suite to measure these claims objectively:

Hardware: Snapdragon X Plus (12-core Oryon CPU), Adreno GPU, 16GB LPDDR5x RAM
Compiler: GCC 12.2, -O3 -std=c++17
Methodology: 100 iterations per test, measuring mean/variance/percentiles

Reproducible: All code and scripts are on GitHub. You can run these benchmarks yourself.

Benchmark 1: Vector Operations

Test: 1000 push_back operations of sensor data (80 bytes each)

// Test structure - typical embedded data
struct SensorData {
    int id;
    double temperature;
    double pressure;
    uint64_t timestamp;
    char label[32];
};

// What we're measuring
constexpr size_t N = 1000;

// Standard vector (heap allocations)
{
    std::vector<SensorData> vec;
    for (size_t i = 0; i < N; ++i) {
        vec.push_back(read_sensor(i));
    }
}

// PMR vector (stack buffer)
{
    char buffer[N * sizeof(SensorData) * 2];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    std::pmr::vector<SensorData> vec{&mbr};
    for (size_t i = 0; i < N; ++i) {
        vec.push_back(read_sensor(i));
    }
}

Results from my machine:

| Metric | std::vector | pmr::vector (monotonic) | pmr::vector (pool) | Comparison |
|---|---|---|---|---|
| Mean time | 16.90 µs | 57.44 µs | 68.86 µs | pmr 3.4-4.1x slower |
| Worst case | 33.60 µs | 264.20 µs | 91.70 µs | pmr worse P99 |
| Variance (CV) | 10.23% | 43.18% | 5.62% | pool most consistent |
| Best case | 16.40 µs | 51.00 µs | 64.40 µs | std fastest |
| Heap allocations | Many | 0 | 0 | pmr: no heap |

graph LR
    A[std::vector<br/>16.9µs ±10%<br/>Fast but variable] -->|Switch to PMR| B[pmr::vector pool<br/>68.9µs ±5.6%<br/>Slower but consistent]
    
    style A fill:#74c0fc,stroke:#1971c2
    style B fill:#51cf66,stroke:#2f9e44

PMR is slower for simple operations like vector push_back, but offers better consistency. The pool allocator has the best determinism (5.62% CV) making it ideal for real-time scenarios where predictability matters more than raw speed.

P95 means 95% of operations complete faster than this value, P99 means 99%. The narrower the gap between P95 and P99, the more predictable the performance—pmr_pool has the tightest spread, meaning fewer outliers and more reliable worst-case timing for real-time systems.
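For reference, percentiles like P95/P99 can be read off the raw timing samples with the nearest-rank method. This is a sketch, not the exact script from the benchmark repo:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile: P95 of 100 sorted samples is the 95th smallest.
inline double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p / 100.0 * static_cast<double>(samples.size())));
    if (rank > 0) --rank;  // convert 1-based rank to 0-based index
    return samples[rank];
}
```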

Benchmark 2: String Operations

Test: Build 100 strings from sensor labels

// 100 strings like "Temperature_Sensor_42"
std::array<const char*, 100> labels = { /* ... */ };

// Standard strings
{
    std::vector<std::string> names;
    for (const auto* label : labels) {
        names.emplace_back(label);
    }
}

// PMR strings
{
    char buffer[8192];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    std::pmr::vector<std::pmr::string> names{&mbr};
    for (const auto* label : labels) {
        names.emplace_back(label, &mbr);
    }
}

Results:

| Operation | std::string | pmr::string | Comparison |
|---|---|---|---|
| String concat (mean) | 307.77 µs | 360.17 µs | pmr 17% slower |
| String concat (CV) | 29.52% | 9.40% | pmr 3.1x more consistent |
| SSO strings (mean) | 120.57 µs | 121.53 µs | Nearly identical |
| SSO strings (CV) | 46.58% | 29.19% | pmr 1.6x more consistent |
| Heap allocations | Many | 0 | pmr: no heap |

For strings, PMR trades a small speed penalty for significantly better determinism. String concat CV drops from 29.52% to 9.40% - a 3x improvement in predictability.
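The near-identical SSO rows make sense: short strings live inside the string object itself and never touch the allocator. A quick sketch (this relies on small-string optimization, which every major implementation has but the standard does not mandate):

```cpp
#include <memory_resource>
#include <string>

// If "short" fits in the SSO buffer, the string never calls allocate(),
// so the null_memory_resource upstream is never hit and nothing throws.
inline bool sso_never_allocates() {
    std::pmr::monotonic_buffer_resource mbr{std::pmr::null_memory_resource()};
    std::pmr::string s{"short", &mbr};
    return s == "short";
}
```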

Benchmark 3: Map Operations

Test: Insert 500 device records into a map

Results:

| Operation | std::map | pmr::map | Comparison |
|---|---|---|---|
| Int keys (mean) | 206.57 µs | 256.41 µs | pmr 24% slower |
| Int keys (CV) | 18.43% | 22.93% | std more consistent |
| String keys (mean) | 487.28 µs | 578.90 µs | pmr 19% slower |
| String keys (CV) | 28.31% | 25.22% | pmr 11% more consistent |
| Heap allocations | Per node | 0 | pmr: no heap |

For maps, PMR doesn’t show the determinism advantage we saw with strings. The pool allocator adds overhead without significant consistency gains for this workload.

Why the Huge Variance in std::vector?

The standard allocator has unpredictable performance because it does complex memory management:

sequenceDiagram
    participant Vec as std::vector
    participant Heap as malloc/free
    
    Note over Vec: push_back #1
    Vec->>Heap: malloc(16 bytes)
    Heap->>Heap: Search free list
    Heap->>Heap: Check if split needed
    Heap-->>Vec: Fast path (0.8µs)
    
    Note over Vec: push_back #512
    Vec->>Heap: Need more space!
    Heap->>Heap: Search free list
    Heap->>Heap: No suitable block
    Heap->>Heap: Coalesce adjacent blocks
    Note right of Heap: Combine fragmented<br/>free blocks into<br/>larger blocks
    Heap->>Heap: Still not enough?
    Heap->>Heap: Request from OS (sbrk/mmap)
    Heap-->>Vec: Slow path (45µs)
    
    Note over Heap: 56x variance!

What’s “coalescing”? When memory is freed, adjacent free blocks are merged together:

Before dealloc:  [Used A][Used B][Used C]
After free B:    [Used A][Free B][Used C]  ← Fragmented
After free C:    [Used A][Free B+C merged]  ← Coalesced!

This is expensive - the allocator must scan neighbors, update metadata, and maintain sorted free lists.

PMR eliminates this complexity:

sequenceDiagram
    participant Vec as pmr::vector
    participant MBR as monotonic_buffer
    participant Stack as Stack Buffer
    
    Note over Vec: push_back #1
    Vec->>MBR: allocate(16)
    MBR->>Stack: offset += 16
    Note right of Stack: Just bump pointer!<br/>No search,<br/>no coalesce,<br/>no metadata
    Stack-->>Vec: 0.5µs
    
    Note over Vec: push_back #512
    Vec->>MBR: allocate(16)
    MBR->>Stack: offset += 16
    Stack-->>Vec: 0.5µs
    
    Note over Vec: deallocate(ptr)
    Vec->>MBR: deallocate()
    MBR->>MBR: No-op!
    Note right of MBR: Doesn't track<br/>individual frees.<br/>Bulk reset later.
    
    Note over Stack: Consistent every time!

The Memory Management Trade-off

In brief: the three allocator options and their core pros (+) and cons (-).

flowchart TB
    A[Memory Management Options]
    A --> B[std::allocator<br/>+ reuses frees<br/>- search & coalesce<br/>- fragmentation]
    A --> C[pmr::monotonic_buffer<br/>+ O1 alloc, zero frag<br/>- reset to reclaim]
    A --> D[pmr::pool_resource<br/>+ O1 alloc/free in bins<br/>- bins do not merge]

And how the three allocators handle determinism and fragmentation:

flowchart TB
    E[Determinism & Fragmentation]
    E --> B2[std::allocator:<br/>low determinism,<br/>can fragment]
    E --> C2[pmr::monotonic:<br/>high determinism,<br/>no fragmentation]
    E --> D2[pmr::pool:<br/>high determinism,<br/>limited to bins]

Why this matters for embedded systems:

| Operation | std::allocator | pmr::monotonic | pmr::pool |
|---|---|---|---|
| Allocate | O(n) search + coalesce | O(1) pointer bump | O(1) pop from bin |
| Deallocate | O(1) but triggers coalesce | No-op | O(1) push to bin |
| Reuse freed memory | ✅ Yes | ❌ No (until reset) | ✅ Within same bin |
| Fragmentation | ❌ Yes - requires coalesce | ✅ None | ⚠️ Limited to bins |
| Determinism | ❌ Poor | ✅ Excellent | ✅ Very good |

Example: Why coalescing matters

// Scenario: Allocate 1KB, 2KB, 1KB, then free middle
void* a = malloc(1024);  // [1KB used]
void* b = malloc(2048);  // [1KB used][2KB used]
void* c = malloc(1024);  // [1KB used][2KB used][1KB used]

free(b);                 // [1KB used][2KB free][1KB used]

// Now request 3KB - but the only free block is the 2KB gap
void* d = malloc(3072);  // ❌ Fails! Not enough contiguous space
                         //    (unless allocator coalesces with neighbors)

// With coalescing (expensive):
free(a); free(c);        // Allocator scans and merges:
                         // [4KB free] ← Now can fit 3KB
void* d = malloc(3072);  // ✅ Works, but took time to coalesce

PMR approach:

char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};

// Allocate whatever you need
void* a = mbr.allocate(1024, 8);
void* b = mbr.allocate(2048, 8);
void* c = mbr.allocate(1024, 8);

// Deallocate - no-op!
mbr.deallocate(a, 1024, 8);  // Does nothing
mbr.deallocate(b, 2048, 8);  // Does nothing

// Memory isn't reused...
void* d = mbr.allocate(3072, 8);  // Uses NEW space after 'c'

// But you get it all back at once:
mbr.release();  // Instant - resets pointer to start
// Now all 8192 bytes available again

// Or skip release(): when mbr goes out of scope, its destructor
// releases everything automatically

PMR trades flexibility (can’t reuse freed memory immediately) for speed and determinism (no complex bookkeeping). Good for short-lived, bulk operations.

The Trade-offs

flowchart TD
    A[Choose Memory Resource] --> B{Usage Pattern?}
    
    B -->|Short-lived,<br/>bulk operations| C[monotonic_buffer]
    B -->|Frequent alloc/dealloc,<br/>mixed sizes| D[pool_resource]
    B -->|Rare allocations| E[new_delete_resource]
    
    C --> C1[Fastest<br/>Zero fragmentation<br/>No individual dealloc]
    D --> D1[Efficient reuse<br/>Good for varied sizes<br/>Setup complexity]
    E --> E1[Familiar<br/>Non-deterministic<br/>Fragmentation]
    
    style C1 fill:#51cf66,stroke:#2f9e44
    style D1 fill:#74c0fc,stroke:#1971c2
    style E1 fill:#ffe066,stroke:#f08c00

Summary table:

| Aspect | std::vector | pmr::vector (monotonic) | pmr::vector (pool) |
|---|---|---|---|
| Raw speed | ⭐⭐⭐⭐⭐ (16.9µs) | ⭐⭐ (57.4µs) | ⭐⭐ (68.9µs) |
| Determinism | ⭐⭐⭐ (CV 10.2%) | ⭐⭐ (CV 43.2%) | ⭐⭐⭐⭐⭐ (CV 5.6%) |
| Memory efficiency | ⚠️ Heap overhead | ✅ No heap | ✅ No heap |
| Flexibility | ✅ Full | ⚠️ No individual free | ✅ Can free |
| Setup complexity | ✅ Zero | ⚠️ Buffer sizing | ⚠️ Pool tuning |

Verdict: For simple vector operations, std::vector is faster. PMR pool offers the best determinism at the cost of 4x slower speed. Choose PMR when you need predictable timing over raw performance.

Real-World Impact Example

Imagine a sensor system that processes 1000 readings every 10ms:

// With std::vector - fast but variable
void process_readings() {
    std::vector<Reading> data;
    for (int i = 0; i < 1000; ++i) {
        data.push_back(read_sensor(i));
    }
    analyze(data);
}
// Mean: 16.90µs, but CV=10.23%
// Worst case can spike 2x (33.60µs)
// For hard real-time: Variance is the problem ❌

// With pmr::pool - slower but predictable
void process_readings() {
    static std::pmr::unsynchronized_pool_resource pool;
    std::pmr::vector<Reading> data{&pool};
    
    for (int i = 0; i < 1000; ++i) {
        data.push_back(read_sensor(i));
    }
    analyze(data);
}
// Mean: 68.86µs, but CV=5.62% 
// Worst case bounded (91.70µs)
// For hard real-time: Predictability wins ✅

PMR is slower (4x in this case), but offers 2x better determinism. For real-time systems, you sacrifice speed for predictability. Choose based on your constraints.

How to Reproduce These Results

All benchmarks are on GitHub:

git clone https://github.com/saptarshi-max/pmr-benchmark
cd pmr-benchmark
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./benchmark_vector  # See results

The repository includes:

  • Complete source code
  • Automated benchmark scripts
  • Report generation (markdown + graphs)
  • Methodology documentation

Your results may vary based on CPU/compiler, but the relative improvements should be similar.

Best Practices & Recommendations

1. Choose the Right Memory Resource

// For short-lived, bulk operations (request handling)
void handle_request(const Request& req) {
    char buffer[4096];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    
    std::pmr::vector<Response> responses{&mbr};
    std::pmr::string temp{&mbr};
    
    process(req, responses, temp);
    
}  // Automatic cleanup, no fragmentation

// For long-lived, dynamic containers (device registry)
class DeviceManager {
    std::pmr::unsynchronized_pool_resource pool_;
    std::pmr::map<DeviceId, Device> devices_{&pool_};
    
public:
    void add_device(DeviceId id, Device dev) {
        devices_.emplace(id, std::move(dev));  // Efficient pooling
    }
};

// For extremely constrained systems (safety-critical)
alignas(Device) static char device_storage[MAX_DEVICES][sizeof(Device)];
// Use indices instead of containers

2. Stack Buffer Sizing Strategy

// Measure actual usage first
void tune_buffer_size() {
    struct : std::pmr::memory_resource {
        size_t peak = 0;
        size_t current = 0;
        
        void* do_allocate(size_t n, size_t) override {
            current += n;
            peak = std::max(peak, current);
            return ::operator new(n);
        }
        
        void do_deallocate(void* p, size_t n, size_t) override {
            current -= n;
            ::operator delete(p);
        }
        
        bool do_is_equal(const memory_resource& o) const noexcept override {
            return this == &o;
        }
    } tracker;
    
    // Run your typical workload
    {
        std::pmr::vector<Data> vec{&tracker};
        std::pmr::string str{&tracker};
        // ... typical operations ...
    }
    
    printf("Peak memory: %zu bytes\n", tracker.peak);
    // Now allocate buffer at peak + 20% margin
}

Rules of thumb:

  • Monotonic buffer: Workload peak + 20-30% margin
  • Pool resource: Expected concurrent allocations × avg size × 1.5
  • Always add overflow handling to upstream allocator

3. Error Handling

// A PMR resource throws std::bad_alloc when its memory is exhausted
std::pmr::vector<Data> vec{&limited_resource};

try {
    vec.push_back(data);
} catch (const std::bad_alloc&) {
    // Buffer exhausted
    log_error("Memory exhausted");
    // Graceful degradation
}

// Or check capacity proactively
if (vec.capacity() < vec.size() + 1) {
    // Handle before allocation fails
}

4. Combining Resources

// Hierarchy: fast path → larger fallback → fail
char slow_buffer[16384];
std::pmr::monotonic_buffer_resource slow{slow_buffer, sizeof(slow_buffer),
                                         std::pmr::null_memory_resource()};

char fast_buffer[1024];
std::pmr::monotonic_buffer_resource fast{fast_buffer, sizeof(fast_buffer), &slow};

std::pmr::vector<Packet> packets{&fast};
// Uses fast buffer first, overflows to slow, then throws std::bad_alloc

5. Move from std to pmr Gradually

// Phase 1: Identify hot paths with profiling
// Phase 2: Replace one container at a time
// Phase 3: Measure improvement

// Before
void process_messages(std::vector<Message>& messages) {
    for (auto& msg : messages) {
        std::string payload = decode(msg);  // Hidden allocation
        handle(payload);
    }
}

// After (step 1: function-local)
void process_messages(std::vector<Message>& messages) {
    char buffer[4096];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    
    for (auto& msg : messages) {
        std::pmr::string payload{&mbr};
        decode(msg, payload);  // Reuse buffer
        handle(payload);
    }
}

// After (step 2: propagate PMR through API)
void process_messages(std::pmr::vector<Message>& messages) {
    // Caller controls memory resource
}

6. Testing Strategy

// Create memory pressure in tests
class LimitedResource : public std::pmr::memory_resource {
    size_t limit_;
    size_t used_ = 0;
    std::pmr::memory_resource* upstream_ = std::pmr::new_delete_resource();
    
public:
    explicit LimitedResource(size_t limit) : limit_(limit) {}
    
protected:
    void* do_allocate(size_t n, size_t a) override {
        if (used_ + n > limit_) {
            throw std::bad_alloc();
        }
        used_ += n;
        return upstream_->allocate(n, a);
    }
    
    void do_deallocate(void* p, size_t n, size_t a) override {
        used_ -= n;
        upstream_->deallocate(p, n, a);
    }
    
    bool do_is_equal(const memory_resource& o) const noexcept override {
        return this == &o;
    }
};

TEST(SensorProcessor, HandlesMemoryExhaustion) {
    LimitedResource limited{1024};  // Only 1KB available
    
    std::pmr::vector<Sample> samples{&limited};
    
    // Verify graceful degradation
    EXPECT_NO_CRASH(fill_samples(samples, 10000));
}

Surprising Findings from Real Benchmarks

Heads up — these numbers are wrong. I’m leaving them here so you can see what I originally published, but skip to the errata for why.

After running comprehensive benchmarks, the results were different from expectations:

What I Expected vs. What I Found

ExpectationReality
PMR would be fasterPMR is 3-4x slower for simple operations
PMR would reduce variance significantlyMixed results: Great for strings (3x better), worse for vectors
Pool allocator would be fasteststd::allocator is fastest for basic operations
PMR always wins on determinismNot always: Map operations showed minimal improvement

The Real Trade-off

PMR is NOT a performance win in all cases. The actual benefit is:

  • When you need: Predictable timing, no heap usage, bounded worst-case
  • What you sacrifice: 3-4x slower execution for basic operations
  • Best use case: Real-time systems where 68µs±5.6% beats 16µs±10%

Benchmark Highlights

Vector operations (1000 push_back):

  • std: 16.90µs (fast, 10% variance)
  • pmr_pool: 68.86µs (slow, 5.6% variance) ← Best determinism
  • Winner: Depends on your constraint (speed vs. predictability)

String concatenation:

  • std: 307.77µs (CV: 29.52%)
  • pmr: 360.17µs (CV: 9.40%) ← 3x more consistent
  • Winner: PMR if you need determinism

Sensor data collection (realistic embedded workload):

  • std: 2197.62µs (CV: 14.15%)
  • pmr: 2655.63µs (CV: 12.72%) ← Slight improvement
  • Winner: Depends on whether 20% speed loss is acceptable for 10% better determinism

Message queue (mixed workload):

  • std: 171.90µs (CV: 8.93%)
  • pmr: 272.57µs (CV: 28.10%) ← Worse!
  • Winner: std (better on both metrics)

PMR isn’t universally better. It trades raw speed for predictability. Profile your specific workload before switching.

Conclusions

These conclusions were based on broken benchmarks. See what I actually learned below.

Key Takeaways (wrong)

  • PMR offers predictability, not speed: expect 3-4x slower but more consistent timing
  • Use PMR for determinism: when low variance (68µs±5%) matters more than speed (16µs±10%) in real-time systems
  • std::allocator is still fast: for non-critical paths, standard containers perform well
  • Not a silver bullet: profile first; PMR helps specific cases (strings, bounded allocations)

When to Use PMR

✅ Use PMR when:

  • Hard real-time systems where consistent timing matters more than average speed (predictability over performance)
  • Embedded systems where heap allocation must be avoided
  • Safety-critical software requiring deterministic behavior
  • Short-lived operations with known memory bounds

✅ Use std::allocator when:

  • Raw performance is the priority (3-4x faster for basic ops)
  • Workload has low natural variance (like our message queue example)
  • Long-lived containers with unpredictable growth
  • Development speed matters more than optimization

⚠️ Benchmark first:

  • Results vary by workload (vector vs. string vs. map)
  • Your hardware/compiler may show different ratios
  • Profile to see if determinism improvement justifies speed loss

The Mental Model Shift

graph LR
    A[Traditional:<br/>'Container owns memory'] --> B[PMR:<br/>'Container borrows memory']
    
    B --> C[You control:<br/>• Where<br/>• When<br/>• How much]
    
    C --> D[Better:<br/>• Determinism<br/>• Performance<br/>• Control]
    
    style A fill:#ffe066,stroke:#f08c00
    style B fill:#74c0fc,stroke:#1971c2
    style D fill:#51cf66,stroke:#2f9e44

PMR isn’t about replacing std containers—it’s about giving you control when you need it. In embedded systems, that control often means the difference between “works most of the time” and “proven reliable.”

Quick Decision Matrix

Your SituationRecommendationRationale
Need minimum latency variancepmr::pool_resourceCV as low as 5.6%
Need maximum raw speedstd::allocator3-4x faster for vectors
String-heavy workloadpmr::string3x better determinism
Mixed container operationsProfile firstResults vary (std won our message queue test)
Embedded, hard real-timepmr::monotonic_bufferZero heap, bounded timing
General-purpose applicationStick with std::Faster and simpler

Errata — What I Got Wrong

I posted this to r/cpp expecting some discussion about embedded trade-offs. Instead, the top comments were pointing out that my benchmarks were fundamentally broken. 44 comments, and the majority were some variation of “your data is wrong.” They were right.

Two bugs. That’s all it took to completely invert the results. And a third mistake in how I analysed variance made the comparison misleading on top of that.

Bug 1: My PMR buffer was too small

Here’s what I had:

constexpr size_t BUFFER_SIZE = NUM_ELEMENTS * sizeof(int) * 4; // 16,000 bytes

4× the raw data. Seemed like plenty of headroom. It wasn’t.

The thing I forgot: std::vector grows geometrically. When it reallocates, the old and new buffers both exist at the same time (elements get copied over, then the old one is freed). And monotonic_buffer_resource doesn’t actually free anything — deallocations are no-ops. So every old allocation is still sitting there eating space. 16KB ran out fast.

Here’s the part that burned me: when monotonic_buffer_resource runs out of its initial buffer, it doesn’t throw. It doesn’t warn. It just quietly starts allocating from its upstream resource, which by default is… new_delete_resource(). The global heap. The exact thing I was trying to avoid.

So what was I actually benchmarking?

  • std::allocator → heap
  • pmr::monotonic → heap + PMR overhead on top

Of course PMR looked slower. It was doing strictly more work.

kammce (WG21 member) proved this by swapping in null_memory_resource() as the upstream. With a too-small buffer, the program immediately throws std::bad_alloc:

std::pmr::monotonic_buffer_resource mbr(
    buffer, BUFFER_SIZE, std::pmr::null_memory_resource());
// Crashes with original 16KB buffer. The buffer was exhausted.

jwakely (libstdc++ maintainer, LWG chair) confirmed independently.

The silent fallback is arguably a good design for production code — you don’t want your program to crash because a buffer was slightly undersized. But for benchmarking, it’s a trap. You get numbers that look plausible but mean nothing. Lesson: when benchmarking PMR, always set the upstream to null_memory_resource(). You want it to blow up if the buffer isn’t big enough.

Bug 2: I compiled at -O0

My CMakeLists.txt set the right flags:

set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -DNDEBUG")

And my build script ran:

cmake --build . --config Release

The problem: on Linux with Makefiles or Ninja (single-config generators), --config Release does absolutely nothing. It’s silently ignored. That flag only matters for multi-config generators like Visual Studio or Xcode. For Makefiles, the build type has to be set at configure time:

cmake .. -DCMAKE_BUILD_TYPE=Release   # <-- this is what actually matters
cmake --build .

kammce ran VERBOSE=1 and confirmed: no -O3 anywhere. I was benchmarking debug code. All of it.

If you develop on Windows with VS and then move to Linux CI with Makefiles, you will hit this. There’s no warning. The fix:

if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
    message(STATUS "No build type selected — defaulting to Release")
    set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
endif()

And always check VERBOSE=1 output to confirm -O3 is actually there.

Bug 3: I compared the wrong kind of variance

I looked at CV (coefficient of variation) and concluded PMR was more deterministic:

| Allocator | Mean | CV | Absolute jitter |
|---|---|---|---|
| std::allocator | 17 µs | 10% | 1.7 µs |
| pmr::pool | 69 µs | 6% | 4.14 µs |

6% < 10%, so PMR wins on determinism, right? No. If you have a 100µs deadline, you care about the actual microseconds of jitter, not the percentage. PMR’s absolute jitter was 4.14µs vs 1.7µs — 2.4× worse. Peddy699 and SeanCline both called this out (with 43 and 35 upvotes respectively). They were right.

So what are the actual results?

kammce re-ran with correct buffer sizes and -O3:

  • Monotonic PMR: ~1.52× faster, ~5.10× better determinism
  • Mixed workload: PMR ~1.26× faster, ~2.84× better determinism

My original claim that PMR was 3-4× slower was exactly backwards. A pointer bump should beat a heap search. I should have been suspicious the moment my numbers said otherwise.

Other stuff people caught

I wasn’t calling vector::reserve(), so the benchmark was mostly measuring reallocation behaviour, not allocator performance (IfreetBalkan). kamrann_ pointed out that on MSVC’s STL, polymorphic_allocator blocks memmove optimisation for trivially-copyable types — so part of what I measured was element-move overhead, not allocator overhead (see P1144R7 §4.2). m-in said I should have looked at the codegen on Godbolt before drawing any conclusions. ald_loop (63 upvotes) pointed out that Google Benchmark exists for a reason.

Also, my title said “std vs pmr” — jwakely and SupermanLeRetour pointed out that std is a namespace, not an allocator. The comparison is std::allocator vs std::pmr::polymorphic_allocator. And custom allocators have been a thing since C++98 via the second template parameter. PMR isn’t the only alternative.

Things people suggested I look into

  • Compile-time custom allocators (hk19921992) — no virtual dispatch, full inlining potential
  • boost::static_vector (SuperV1234) — fixed capacity, no heap
  • ETL (lxbrtn) — Embedded Template Library
  • Static allocation only (RogerLeigh) — what safety-critical codebases actually do

What I Actually Learned

I published confidently wrong results and 44 strangers on the internet caught what I missed. That stung, but honestly I learned more from the Reddit thread than from writing the benchmarks.

The short version:

  • PMR monotonic is faster when the buffer is sized right. Pointer bump beats heap search. If your numbers say otherwise, your benchmark is broken.
  • monotonic_buffer_resource will silently fall back to the heap when it runs out. Use null_memory_resource() as upstream during dev/benchmarking so you get a crash instead of garbage data.
  • --config Release is a no-op on Makefiles/Ninja. Set CMAKE_BUILD_TYPE at configure time. Check VERBOSE=1.
  • CV% is misleading when comparing things with different means. For real-time, report absolute jitter.
  • Posting publicly is worth the embarrassment. kammce and jwakely didn’t just say “you’re wrong” — they showed exactly why, with code. That’s worth more than any textbook.
  • If a pointer bump benchmarks slower than malloc, your benchmark is wrong. Trust the theory enough to question surprising results.

Corrected Benchmark Results

I haven’t re-run on my hardware yet. Once I do, I’ll drop the numbers here. Based on kammce’s re-run, expect monotonic PMR to be ~1.5× faster with ~5× better determinism, and the mixed workload to be ~1.3× faster with ~2.8× better determinism.

TODO: re-run with 10× buffer, null_memory_resource() upstream, -DCMAKE_BUILD_TYPE=Release verified via VERBOSE=1, with and without reserve().


References

  1. C++17 Standard: Memory Resources
  2. P0220R1: Adopt Library Fundamentals V1 TS Components for C++17
  3. Pablo Halpern: “Allocators@C++Now 2017”
  4. Bjarne Stroustrup: “A Tour of C++” (2nd Edition), Section 13.6
  5. MISRA C++:2008 Guidelines - Rule 18-4-1: Dynamic heap memory allocation shall not be used
  6. JSF AV C++ Coding Standards - AV Rule 206: Dynamic memory allocation shall not be used
  7. Embedded Artistry: “Practical Guide to Bare Metal C++”
  8. ARM Cortex-M7 Technical Reference Manual - Memory system performance characteristics
  9. A. Mahmutbegović, C++ in Embedded Systems: A practical transition from C to modern C++, Packt Publishing, 2024.
  10. Reddit r/cpp discussion - Community feedback that caught the benchmark bugs
  11. P1144R7 — Object relocation - Trivial relocatability and PMR concerns