Performance cost std::pmr vs std containers

This blog post explores why memory management is a critical bottleneck in embedded systems and how C++17’s Polymorphic Memory Resources (PMR) compare with traditional std containers on speed, determinism, and memory efficiency.

Why Memory Management Matters in Embedded Systems

Embedded systems operate under constraints that desktop applications rarely face:

  • Limited RAM - Often measured in kilobytes, not gigabytes
  • Deterministic behavior - Real-time deadlines must be met
  • No virtual memory - Physical RAM is all you get
  • Fragmentation concerns - Memory leaks and fragmentation can’t be “fixed” by rebooting
  • Power constraints - Every allocation costs energy

The problem: new and delete aren’t just slow—they’re unpredictable. A single allocation might succeed in 10µs or fail after 1ms of heap searching.

Traditional dynamic memory allocation using malloc/free or new/delete introduces:

graph TD
    A[Dynamic Allocation Request] --> B{Heap Search}
    B -->|Best Fit| C[Search entire free list]
    B -->|First Fit| D[Search until fit found]
    C --> E[Non-deterministic timing]
    D --> E
    E --> F[Fragmentation over time]
    F --> G[Memory exhaustion]
    G --> H[System failure]
    
    style E fill:#ff6b6b,stroke:#c92a2a,color:#fff
    style F fill:#ffe066,stroke:#f08c00
    style H fill:#ff6b6b,stroke:#c92a2a,color:#fff

Real-world impact:

  • Spacecraft software often bans dynamic allocation entirely
  • Automotive safety systems use static allocation only
  • Industrial controllers pre-allocate all memory at startup

Traditional Approaches

Embedded developers have historically used several strategies to avoid heap allocation:

1. Static Allocation

// Everything allocated at compile time
struct SensorBuffer {
    static constexpr size_t MAX_SAMPLES = 100;
    int samples[MAX_SAMPLES];
    size_t count = 0;
};

SensorBuffer buffer;  // No runtime allocation

Pros: Predictable, fast, no fragmentation
Cons: Wastes memory, inflexible, compile-time sizing

2. Fixed-Size Pools

// Pre-allocated pool of Message objects
template<typename T, size_t N>
class ObjectPool {
    alignas(T) char storage[N][sizeof(T)];
    bool used[N] = {};
    
public:
    T* allocate() {
        for (size_t i = 0; i < N; ++i) {
            if (!used[i]) {
                used[i] = true;
                return new (&storage[i]) T();
            }
        }
        return nullptr;
    }
    
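    void deallocate(T* ptr) {
        if (!ptr) return;
        // Recover the slot index from the pointer's offset into storage
        size_t i = (reinterpret_cast<char*>(ptr) - &storage[0][0]) / sizeof(T);
        ptr->~T();        // Run the destructor; the storage stays in the pool
        used[i] = false;  // Mark slot as free
    }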
};

Pros: Bounded allocation time, no fragmentation
Cons: Fixed capacity, manual management, type-specific

3. Arena/Region Allocators

class Arena {
    char* buffer;
    size_t size;
    size_t offset = 0;
    
public:
    void* allocate(size_t n) {
        if (offset + n > size) return nullptr;
        void* ptr = buffer + offset;
        offset += n;
        return ptr;
    }
    
    void reset() { offset = 0; }  // Bulk deallocation
};

Pros: Fast allocation, bulk deallocation
Cons: No individual deallocation, manual lifecycle

Standard C++ Containers, and Why Embedded Developers Are Skeptical of the STL

The C++ Standard Library provides powerful containers like vector, map, unordered_map, etc. Yet embedded developers often avoid them:

// Desktop developer's natural approach
void process_data() {
    std::vector<SensorReading> readings;  // Uses heap
    std::map<int, Device> devices;        // Uses heap
    
    for (int i = 0; i < sensor_count; ++i) {
        readings.push_back(read_sensor());  // Hidden allocations
    }
}

The Problems

  1. Non-deterministic allocation: vector::push_back might allocate or might not
  2. Exceptions: Many embedded projects compile with -fno-exceptions
  3. Code bloat: Templates can increase binary size significantly
  4. Hidden costs: Iterator operations might be expensive
  5. Lack of control: Can’t specify where memory comes from
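
To make the first point concrete, here is a minimal sketch that counts how often `push_back` actually reallocates. The helper name `count_reallocations` is made up for illustration, and the exact count depends on the standard library's growth factor, so no specific number is claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times the vector's capacity changes (that is, how many
// reallocations happen) while appending n elements.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);  // one up-front allocation
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {  // this push_back reallocated
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

Calling `count_reallocations(1000, false)` shows that only a handful of the 1000 calls allocate, but which ones is invisible at the call site. With `reserve_first` set, the loop itself never allocates.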

Dynamic Memory Management: The Hidden Costs

Let’s quantify what “hidden costs” actually means:

Allocation overhead on ARM Cortex-M4 (168 MHz):

| Operation | Best Case | Worst Case | Variance |
| --- | --- | --- | --- |
| malloc(32) | 150 cycles (~0.9µs) | 15,000 cycles (~89µs) | 100x |
| free() | 80 cycles (~0.5µs) | 8,000 cycles (~48µs) | 100x |
| vector::push_back (no resize) | 20 cycles | 20 cycles | 1x |
| vector::push_back (resize) | 200 cycles | 20,000 cycles | 100x |

The variance is what matters. A 100x timing difference is unacceptable when you have a 1ms deadline.
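
The cycle-to-time conversions in the table follow directly from the clock rate. A small sketch of that arithmetic, with helper names (`cycles_to_us`, `deadline_share_percent`) that are my own:

```cpp
#include <cassert>
#include <cmath>

// Converts a cycle count to microseconds for a core clocked at clock_mhz.
// A clock of f MHz executes f cycles per microsecond.
constexpr double cycles_to_us(double cycles, double clock_mhz) {
    return cycles / clock_mhz;
}

// Share of a deadline (in percent) consumed by one operation.
constexpr double deadline_share_percent(double op_us, double deadline_us) {
    return op_us / deadline_us * 100.0;
}
```

With the table's worst-case figure, `cycles_to_us(15000, 168)` is roughly 89.3 µs, so a single slow `malloc(32)` would use close to 9% of a 1 ms budget before any real work is done.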

Fragmentation

sequenceDiagram
    participant App
    participant Heap
    
    Note over Heap: Initial: [4KB free block]
    
    App->>Heap: alloc 1KB
    Note over Heap: [1KB used][3KB free]
    
    App->>Heap: alloc 1KB
    Note over Heap: [1KB][1KB][2KB free]
    
    App->>Heap: free first block
    Note over Heap: [1KB free][1KB][2KB free]
    
    App->>Heap: alloc 3KB
    Note over Heap: Can't fit! Largest contiguous block is 2KB
    
    Note over Heap: Total free: 3KB<br/>Max contiguous: 2KB<br/>Fragmentation: 33%
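
The 33% figure in the diagram can be reproduced with one common definition of external fragmentation: the share of free memory that lies outside the largest contiguous free block. The post does not say which formula it uses, so treat this definition as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// External fragmentation as 1 - largest_free_block / total_free.
// 0.0 means all free memory is one contiguous block.
double fragmentation(const std::vector<std::size_t>& free_blocks) {
    const std::size_t total =
        std::accumulate(free_blocks.begin(), free_blocks.end(), std::size_t{0});
    if (total == 0) return 0.0;  // nothing free, nothing fragmented
    const std::size_t largest =
        *std::max_element(free_blocks.begin(), free_blocks.end());
    return 1.0 - static_cast<double>(largest) / static_cast<double>(total);
}
```

For the final heap state in the diagram, `fragmentation({1024, 2048})` gives about 0.33: 3KB is free in total, but only 2KB of it is usable for a single request.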

Real-world impact:

  • System with 128KB RAM might only have 64KB usable after fragmentation
  • Medical device recalled due to memory exhaustion after 72 hours of operation
  • Industrial controller required daily reboots to “clear memory”

Call Stack Example

std::vector<int> data;
data.reserve(100);  // One allocation

// Illustrative call chain during reserve() (allocator internals vary by libc):
// vector::reserve()
//   └─ allocator::allocate()
//      └─ operator new()
//         └─ malloc()
//            └─ heap_search()     ← Non-deterministic
//               └─ find_free_block()
//                  └─ coalesce_blocks()

Understanding std::pmr (Polymorphic Memory Resources)

C++17 introduced Polymorphic Memory Resources to solve exactly this problem. The key insight:

Separate the what (container logic) from the where (memory source)

The Architecture

graph TD
    A[pmr::vector Container] --> B[memory_resource interface]
    B --> C[monotonic_buffer_resource]
    B --> D[unsynchronized_pool_resource]
    B --> E[synchronized_pool_resource]
    B --> F[Custom allocator]
    
    C --> G[Stack buffer]
    C --> H[Arena]
    D --> I[Fixed pools]
    F --> J[DMA memory]
    F --> K[Shared memory]
    
    style A fill:#74c0fc,stroke:#1971c2
    style B fill:#ffe066,stroke:#f08c00
    style C fill:#51cf66,stroke:#2f9e44
    style D fill:#51cf66,stroke:#2f9e44
    style E fill:#51cf66,stroke:#2f9e44
    style F fill:#51cf66,stroke:#2f9e44

Key idea: The container doesn’t know or care where memory comes from—it just calls the memory_resource interface.

// Traditional: tied to global heap
std::vector<int> vec;

// PMR: you control the memory source
char buffer[4096];
std::pmr::monotonic_buffer_resource pool{buffer, sizeof(buffer)};
std::pmr::vector<int> vec{&pool};  // Uses our buffer, not heap!

Key pmr Components

1. memory_resource (Base Interface)

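class memory_resource {
public:
    virtual ~memory_resource();
    // Public, non-virtual entry points used by pmr containers
    void* allocate(size_t bytes, size_t alignment = alignof(max_align_t));
    void deallocate(void* p, size_t bytes, size_t alignment = alignof(max_align_t));
    bool is_equal(const memory_resource& other) const noexcept;
private:
    // Customization points that derived resources override
    virtual void* do_allocate(size_t bytes, size_t alignment) = 0;
    virtual void do_deallocate(void* p, size_t bytes, size_t alignment) = 0;
    virtual bool do_is_equal(const memory_resource& other) const noexcept = 0;
};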

Your guarantee: All pmr containers use only these three functions.
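
A short sketch of what "the container just calls the interface" means in practice: `std::pmr::vector<T>` is an alias for `std::vector` with a `polymorphic_allocator`, and that allocator holds a pointer to whichever resource you pass in. The function name `vector_uses_resource` is illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <type_traits>
#include <vector>

// std::pmr::vector<T> is std::vector<T> with a polymorphic_allocator<T>.
static_assert(std::is_same_v<std::pmr::vector<int>,
                             std::vector<int, std::pmr::polymorphic_allocator<int>>>);

// Returns true if the vector reports the resource it was constructed with.
bool vector_uses_resource() {
    std::byte buffer[1024];
    std::pmr::monotonic_buffer_resource pool{buffer, sizeof(buffer)};
    std::pmr::vector<int> vec{&pool};
    vec.push_back(42);  // storage for the element comes from `pool`
    return vec.get_allocator().resource() == &pool;
}
```

Because the resource is a run-time pointer rather than a template parameter, two `pmr::vector<int>` objects with different memory sources still have the same type.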

2. monotonic_buffer_resource

Best for: Bulk allocation with single reset

// Allocate from stack buffer
char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};

{
    std::pmr::vector<Message> messages{&mbr};
    std::pmr::string temp{&mbr};
    
    process_messages(messages);
    
}  // No individual deallocations!

mbr.release();  // Bulk reset - O(1)

sequenceDiagram
    participant Container
    participant MBR as monotonic_buffer
    participant Buffer as Stack Buffer
    
    Container->>MBR: allocate(64 bytes)
    MBR->>Buffer: bump pointer by 64
    Note over Buffer: [64 used][remaining free]
    
    Container->>MBR: allocate(128 bytes)
    MBR->>Buffer: bump pointer by 128
    Note over Buffer: [192 used][remaining free]
    
    Container->>MBR: deallocate(first block)
    Note over MBR: No-op! Can't reuse yet
    
    Note over Container: Scope ends
    Container->>MBR: release()
    MBR->>Buffer: Reset pointer to start
    Note over Buffer: [All free again]

Performance:

  • Allocation: O(1) - just increment a pointer
  • Deallocation: O(1) - no-op
  • Reset: O(1) - bulk cleanup

Trade-off: Individual deallocations are ignored until release()
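
The following sketch shows the trade-off directly. It assumes `std::pmr::null_memory_resource()` as the upstream so that running out of buffer throws instead of quietly falling back to the default (heap) resource. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <new>

struct MonotonicDemo {
    bool reused_after_deallocate;
    bool reused_after_release;
    bool threw_when_exhausted;
};

MonotonicDemo run_monotonic_demo() {
    alignas(std::max_align_t) std::byte buffer[256];
    // With a null upstream, exhaustion throws std::bad_alloc.
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer),
                                            std::pmr::null_memory_resource()};
    MonotonicDemo result{};

    void* first = mbr.allocate(64, 8);
    mbr.deallocate(first, 64, 8);        // no-op for a monotonic resource
    void* second = mbr.allocate(64, 8);  // takes fresh space instead
    result.reused_after_deallocate = (second == first);

    mbr.release();                       // bulk reset to the initial state
    void* third = mbr.allocate(64, 8);
    result.reused_after_release = (third == first);

    try {
        mbr.allocate(1024, 8);           // larger than the whole buffer
    } catch (const std::bad_alloc&) {
        result.threw_when_exhausted = true;
    }
    return result;
}
```

The space handed back by `deallocate` is not reused until `release()`, so a long-running loop that never calls `release()` will eventually exhaust the buffer no matter how small each allocation is.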

3. unsynchronized_pool_resource & synchronized_pool_resource

Best for: Frequent allocation/deallocation of similar sizes

std::pmr::unsynchronized_pool_resource pool;

{
    std::pmr::vector<Packet> packets{&pool};
    std::pmr::map<int, Device> devices{&pool};
    
    // Efficient allocation from pools
    for (int i = 0; i < 1000; ++i) {
        packets.push_back(Packet{});  // From pool
    }
}

// Memory returned to pools, ready for reuse

How it works:

graph TD
    A[pool_resource] --> B[Pool: 16-byte blocks]
    A --> C[Pool: 32-byte blocks]
    A --> D[Pool: 64-byte blocks]
    A --> E[Pool: 128-byte blocks]
    A --> F[Overflow to upstream]
    
    B --> B1[Free list]
    C --> C1[Free list]
    D --> D1[Free list]
    E --> E1[Free list]
    
    style A fill:#ffe066,stroke:#f08c00
    style B fill:#51cf66,stroke:#2f9e44
    style C fill:#51cf66,stroke:#2f9e44
    style D fill:#51cf66,stroke:#2f9e44
    style E fill:#51cf66,stroke:#2f9e44
    style F fill:#ff6b6b,stroke:#c92a2a,color:#fff

Performance:

  • Allocation: O(1) - pop from free list
  • Deallocation: O(1) - push to free list
  • No fragmentation within pools

Trade-off: More memory overhead for pool metadata
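
One way to see the free-list reuse is to put a counting resource upstream of the pool and check that repeated allocate/deallocate cycles stop reaching it. `CountingResource` is a hypothetical helper written for this sketch, and the `pool_options` values are arbitrary examples, not recommendations.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Hypothetical helper: forwards to an upstream resource and counts requests.
class CountingResource : public std::pmr::memory_resource {
    std::pmr::memory_resource* upstream_;
    std::size_t allocations_ = 0;
public:
    explicit CountingResource(
        std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
        : upstream_(upstream) {}
    std::size_t allocations() const { return allocations_; }
private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        ++allocations_;
        return upstream_->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        upstream_->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& o) const noexcept override {
        return this == &o;
    }
};

// Upstream requests caused by 1000 alloc/free cycles of one block size,
// counted after the pool has already served that size once.
std::size_t extra_upstream_allocations() {
    CountingResource upstream;
    std::pmr::pool_options options;
    options.max_blocks_per_chunk = 32;          // cap on chunk growth
    options.largest_required_pool_block = 256;  // larger requests bypass the pools
    std::pmr::unsynchronized_pool_resource pool{options, &upstream};

    void* p = pool.allocate(64);
    pool.deallocate(p, 64);
    const std::size_t after_first = upstream.allocations();

    for (int i = 0; i < 1000; ++i) {
        void* q = pool.allocate(64);
        pool.deallocate(q, 64);  // block goes back to the pool, not upstream
    }
    return upstream.allocations() - after_first;
}
```

Note that a default-constructed pool resource uses the default resource as its upstream, which is normally the heap. Passing an explicit upstream, as above, is what lets you control or measure that.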

4. Custom Allocators

You can implement your own memory_resource for specialized needs:

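class DMAMemoryResource : public std::pmr::memory_resource {
    std::byte* dma_region;  // Start of the DMA-capable region
    size_t size;            // Region size in bytes
    size_t offset = 0;

public:
    DMAMemoryResource(void* region, size_t bytes)
        : dma_region(static_cast<std::byte*>(region)), size(bytes) {}

protected:
    void* do_allocate(size_t bytes, size_t align) override {
        // Allocate from DMA-capable memory region, honoring alignment
        void* ptr = dma_region + offset;
        size_t space = size - offset;
        if (!std::align(align, bytes, ptr, space)) throw std::bad_alloc();
        offset = static_cast<size_t>(static_cast<std::byte*>(ptr) - dma_region) + bytes;
        return ptr;
    }
    
    void do_deallocate(void*, size_t, size_t) override {
        // No-op for DMA buffers
    }
    
    bool do_is_equal(const memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Now use with any pmr container
// (placeholder region; a real DMA region would come from the linker script)
alignas(32) static unsigned char dma_region[4096];
DMAMemoryResource dma_mem{dma_region, sizeof(dma_region)};
std::pmr::vector<uint8_t> dma_buffer{&dma_mem};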

Common custom allocators:

  • DMA-capable memory regions
  • Cache-aligned allocations
  • Shared memory segments
  • Non-volatile RAM
  • Custom memory protection schemes

Performance Cost of std::pmr vs std Containers & Trade-offs

Let’s measure real performance on actual hardware.

My Test Setup

I built a benchmark suite to measure these claims objectively:

Hardware: Snapdragon X Plus (12‑core Oryon CPU), Adreno GPU, 16GB LPDDR5x RAM
Compiler: GCC 12.2, -O3 -std=c++17
Methodology: 100 iterations per test, measuring mean/variance/percentiles

Reproducible: All code and scripts are on GitHub. You can run these benchmarks yourself.
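
For readers who want to recompute the statistics reported below, here is a minimal sketch of the coefficient of variation (CV) and a percentile. It assumes the population standard deviation and the nearest-rank percentile method; the benchmark scripts may use different conventions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

double mean(const std::vector<double>& xs) {
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Coefficient of variation in percent: stddev / mean * 100.
// Uses the population standard deviation (divide by N); this is an assumption.
double cv_percent(const std::vector<double>& xs) {
    const double m = mean(xs);
    double sq = 0.0;
    for (double x : xs) sq += (x - m) * (x - m);
    return std::sqrt(sq / static_cast<double>(xs.size())) / m * 100.0;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
double percentile(std::vector<double> xs, double p) {
    std::sort(xs.begin(), xs.end());
    const auto rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(xs.size()) / 100.0));
    return xs[rank == 0 ? 0 : rank - 1];
}
```

With only 100 iterations per test, P99 under this method is simply the second-largest sample, so a single outlier can move it noticeably.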

Benchmark 1: Vector Operations

Test: 1000 push_back operations of sensor data (64 bytes each for the struct below on a typical 64-bit ABI)

// Test structure - typical embedded data
struct SensorData {
    int id;
    double temperature;
    double pressure;
    uint64_t timestamp;
    char label[32];
};

// What we're measuring
constexpr size_t N = 1000;

// Standard vector (heap allocations)
{
    std::vector<SensorData> vec;
    for (size_t i = 0; i < N; ++i) {
        vec.push_back(read_sensor(i));
    }
}

// PMR vector (stack buffer)
{
    char buffer[N * sizeof(SensorData) * 2];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    std::pmr::vector<SensorData> vec{&mbr};
    for (size_t i = 0; i < N; ++i) {
        vec.push_back(read_sensor(i));
    }
}

Results from my machine:

| Metric | std::vector | pmr::vector (monotonic) | pmr::vector (pool) | Comparison |
| --- | --- | --- | --- | --- |
| Mean time | 16.90 µs | 57.44 µs | 68.86 µs | pmr 3.4-4.1x slower |
| Worst case | 33.60 µs | 264.20 µs | 91.70 µs | pmr worse P99 |
| Variance (CV) | 10.23% | 43.18% | 5.62% | pool most consistent |
| Best case | 16.40 µs | 51.00 µs | 64.40 µs | std fastest |
| Heap allocations | Many | 0 | 0 | pmr: no heap |

graph LR
    A[std::vector<br/>16.9µs ±10%<br/>Fast but variable] -->|Switch to PMR| B[pmr::vector pool<br/>68.9µs ±5.6%<br/>Slower but consistent]
    
    style A fill:#74c0fc,stroke:#1971c2
    style B fill:#51cf66,stroke:#2f9e44

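PMR is slower for simple operations like vector push_back, and only the pool resource was more consistent than std::vector in this run; the monotonic resource showed the highest variance (43.18% CV). The pool resource has the best determinism (5.62% CV), which makes it the better fit for real-time scenarios where predictability matters more than raw speed.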

Figure 1 (Vector Push Back Performance Comparison): Mean execution time comparison - std::vector is 3-4x faster than PMR variants

Figure 2 (Vector Push Back Percentile Distribution): P95/P99 latency distribution - pmr_pool shows the most consistent behavior

P95 means 95% of operations complete faster than this value, P99 means 99%. The narrower the gap between P95 and P99, the more predictable the performance—pmr_pool has the tightest spread, meaning fewer outliers and more reliable worst-case timing for real-time systems.
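
The "Heap allocations: 0" row only holds if the vector never outgrows the stack buffer. A monotonic resource does not reuse the smaller blocks a growing vector abandons, so the total consumed can be well above the final size. One way to check is to swap the upstream for `std::pmr::null_memory_resource()` and see whether the workload still completes. The sketch below assumes the same buffer size as the benchmark; `fits_in_buffer` is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <new>
#include <vector>

struct SensorData {
    int id;
    double temperature;
    double pressure;
    std::uint64_t timestamp;
    char label[32];
};

constexpr std::size_t N = 1000;

// Returns true if N push_backs complete without ever asking the upstream
// resource for memory (the null upstream throws on any request).
bool fits_in_buffer(bool reserve_first) {
    static std::byte buffer[N * sizeof(SensorData) * 2];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer),
                                            std::pmr::null_memory_resource()};
    try {
        std::pmr::vector<SensorData> vec{&mbr};
        if (reserve_first) vec.reserve(N);  // one block, no abandoned copies
        for (std::size_t i = 0; i < N; ++i) vec.push_back(SensorData{});
        return true;
    } catch (const std::bad_alloc&) {
        return false;  // the buffer alone was not enough
    }
}
```

`fits_in_buffer(true)` succeeds because a single reserved block needs only half the buffer. Whether `fits_in_buffer(false)` succeeds depends on the library's growth factor, so it is worth running on your toolchain rather than assuming.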

Benchmark 2: String Operations

Test: Build 100 strings from sensor labels

// 100 strings like "Temperature_Sensor_42"
std::array<const char*, 100> labels = { /* ... */ };

// Standard strings
{
    std::vector<std::string> names;
    for (const auto* label : labels) {
        names.emplace_back(label);
    }
}

// PMR strings
{
    char buffer[8192];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    std::pmr::vector<std::pmr::string> names{&mbr};
    for (const auto* label : labels) {
        names.emplace_back(label);  // The vector passes its resource to each element
    }
}

Results:

| Operation | std::string | pmr::string | Comparison |
| --- | --- | --- | --- |
| String concat (mean) | 307.77 µs | 360.17 µs | pmr 17% slower |
| String concat (CV) | 29.52% | 9.40% | pmr 3.1x more consistent |
| SSO strings (mean) | 120.57 µs | 121.53 µs | Nearly identical |
| SSO strings (CV) | 46.58% | 29.19% | pmr 1.6x more consistent |
| Heap allocations | Many | 0 | pmr: no heap |

For strings, PMR trades a small speed penalty for significantly better determinism. String concat CV drops from 29.52% to 9.40% - a 3x improvement in predictability.
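
A detail that matters for nested PMR containers: elements of a `pmr::vector` are built with uses-allocator construction, so a `pmr::string` emplaced into it picks up the vector's resource automatically. A small sketch (the function name and label text are illustrative):

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Returns true if a string emplaced into a pmr::vector ends up using the
// vector's memory resource without it being passed explicitly.
bool element_inherits_resource() {
    alignas(std::max_align_t) std::byte buffer[4096];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer),
                                            std::pmr::null_memory_resource()};
    std::pmr::vector<std::pmr::string> names{&mbr};
    // Long enough that small-string optimization cannot hold it inline.
    names.emplace_back("Temperature_Sensor_42_with_a_long_descriptive_suffix");
    return names.back().get_allocator().resource() == &mbr;
}
```

Strings short enough for the small-string optimization never call the resource at all, which is consistent with the near-identical SSO means in the table.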

Figure 3 (String Concatenation Performance): String concatenation - PMR is slightly slower but much more consistent

Figure 4 (String SSO Performance): Small String Optimization - PMR matches std performance with better determinism

Benchmark 3: Map Operations

Test: Insert 500 device records into a map

Results:

| Operation | std::map | pmr::map | Comparison |
| --- | --- | --- | --- |
| Int keys (mean) | 206.57 µs | 256.41 µs | pmr 24% slower |
| Int keys (CV) | 18.43% | 22.93% | std more consistent |
| String keys (mean) | 487.28 µs | 578.90 µs | pmr 19% slower |
| String keys (CV) | 28.31% | 25.22% | pmr 11% more consistent |
| Heap allocations | Per node | 0 | pmr: no heap |

For maps, PMR doesn’t show the determinism advantage we saw with strings. The pool allocator adds overhead without significant consistency gains for this workload.

Why the Huge Variance in std::vector?

The standard allocator has unpredictable performance because it does complex memory management:

sequenceDiagram
    participant Vec as std::vector
    participant Heap as malloc/free
    
    Note over Vec: push_back #1
    Vec->>Heap: malloc(16 bytes)
    Heap->>Heap: Search free list
    Heap->>Heap: Check if split needed
    Heap-->>Vec: Fast path (0.8µs)
    
    Note over Vec: push_back #512
    Vec->>Heap: Need more space!
    Heap->>Heap: Search free list
    Heap->>Heap: No suitable block
    Heap->>Heap: Coalesce adjacent blocks
    Note right of Heap: Combine fragmented<br/>free blocks into<br/>larger blocks
    Heap->>Heap: Still not enough?
    Heap->>Heap: Request from OS (sbrk/mmap)
    Heap-->>Vec: Slow path (45µs)
    
    Note over Heap: 56x variance!

What’s “coalescing”? When memory is freed, adjacent free blocks are merged together:

Before dealloc:  [Used A][Used B][Used C]
After free B:    [Used A][Free B][Used C]  ← Fragmented
After free C:    [Used A][Free B+C merged]  ← Coalesced!

This is expensive - the allocator must scan neighbors, update metadata, and maintain sorted free lists.
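
As a rough illustration of that bookkeeping, here is a toy free list that merges a released block with its neighbors. It is a teaching model only: real allocators keep this metadata inside the heap itself, whereas this sketch uses `std::map` for brevity, and the `FreeList` class is my own invention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Toy free list: maps the offset of each free block to its size.
class FreeList {
    std::map<std::size_t, std::size_t> blocks_;
public:
    // Marks [offset, offset + size) as free and merges adjacent free blocks.
    void release(std::size_t offset, std::size_t size) {
        auto next = blocks_.lower_bound(offset);
        // Merge with the following block if it starts where this one ends.
        if (next != blocks_.end() && offset + size == next->first) {
            size += next->second;
            next = blocks_.erase(next);
        }
        // Merge with the preceding block if it ends where this one starts.
        if (next != blocks_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) {
                prev->second += size;
                return;
            }
        }
        blocks_.emplace(offset, size);
    }
    std::size_t block_count() const { return blocks_.size(); }
    std::size_t largest() const {
        std::size_t best = 0;
        for (const auto& entry : blocks_) best = std::max(best, entry.second);
        return best;
    }
};
```

Every release involves an ordered lookup plus up to two merges, which is the per-free work a monotonic resource skips entirely.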

PMR eliminates this complexity:

sequenceDiagram
    participant Vec as pmr::vector
    participant MBR as monotonic_buffer
    participant Stack as Stack Buffer
    
    Note over Vec: push_back #1
    Vec->>MBR: allocate(16)
    MBR->>Stack: offset += 16
    Note right of Stack: Just bump pointer!<br/>No search,<br/>no coalesce,<br/>no metadata
    Stack-->>Vec: 0.5µs
    
    Note over Vec: push_back #512
    Vec->>MBR: allocate(16)
    MBR->>Stack: offset += 16
    Stack-->>Vec: 0.5µs
    
    Note over Vec: deallocate(ptr)
    Vec->>MBR: deallocate()
    MBR->>MBR: No-op!
    Note right of MBR: Doesn't track<br/>individual frees.<br/>Bulk reset later.
    
    Note over Stack: Consistent every time!

The Memory Management Trade-off

In brief, here are the three allocator options and their core pros (+) and cons (-):

flowchart TB
    A[Memory Management Options]
    A --> B[std::allocator<br/>+ reuses frees<br/>- search & coalesce<br/>- fragmentation]
    A --> C[pmr::monotonic_buffer<br/>+ O1 alloc, zero frag<br/>- reset to reclaim]
    A --> D[pmr::pool_resource<br/>+ O1 alloc/free in bins<br/>- bins do not merge]

And here is how the three allocators handle determinism and fragmentation:

flowchart TB
    E[Determinism & Fragmentation]
    E --> B2[std::allocator:<br/>low determinism,<br/>can fragment]
    E --> C2[pmr::monotonic:<br/>high determinism,<br/>no fragmentation]
    E --> D2[pmr::pool:<br/>high determinism,<br/>limited to bins]

Why this matters for embedded systems:

| Operation | std::allocator | pmr::monotonic | pmr::pool |
| --- | --- | --- | --- |
| Allocate | O(n) search + coalesce | O(1) pointer bump | O(1) pop from bin |
| Deallocate | O(1) but triggers coalesce | No-op | O(1) push to bin |
| Reuse freed memory | ✅ Yes | ❌ No (until reset) | ✅ Within same bin |
| Fragmentation | ❌ Yes - requires coalesce | ✅ None | ⚠️ Limited to bins |
| Determinism | ❌ Poor | ✅ Excellent | ✅ Very good |

Example: Why coalescing matters

// Scenario: a 4KB heap. Allocate 1KB, 2KB, 1KB, then free the middle block
void* a = malloc(1024);  // [1KB used]
void* b = malloc(2048);  // [1KB used][2KB used]
void* c = malloc(1024);  // [1KB used][2KB used][1KB used]

free(b);                 // [1KB used][2KB free][1KB used]

// Now request 3KB - the largest free block is only 2KB
void* d = malloc(3072);  // ❌ Fails! Not enough contiguous space

// With coalescing (expensive):
free(a); free(c);        // Allocator scans and merges the three free blocks:
                         // [4KB free] ← Now can fit 3KB
d = malloc(3072);        // ✅ Works, but took time to coalesce

PMR approach:

char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};

// Allocate whatever you need
void* a = mbr.allocate(1024, 8);
void* b = mbr.allocate(2048, 8);
void* c = mbr.allocate(1024, 8);

// Deallocate - no-op!
mbr.deallocate(a, 1024, 8);  // Does nothing
mbr.deallocate(b, 2048, 8);  // Does nothing

// Memory isn't reused...
void* d = mbr.allocate(3072, 8);  // Uses NEW space after 'c'

// But you get it all back at once:
mbr.release();  // Instant - resets pointer to start
// Now all 8192 bytes available again

// Or let the destructor handle it: when mbr goes out of scope, cleanup is automatic

PMR trades flexibility (can’t reuse freed memory immediately) for speed and determinism (no complex bookkeeping). Good for short-lived, bulk operations.

The Trade-offs

flowchart TD
    A[Choose Memory Resource] --> B{Usage Pattern?}
    
    B -->|Short-lived,<br/>bulk operations| C[monotonic_buffer]
    B -->|Frequent alloc/dealloc,<br/>mixed sizes| D[pool_resource]
    B -->|Rare allocations| E[new_delete_resource]
    
    C --> C1[Fastest<br/>Zero fragmentation<br/>No individual dealloc]
    D --> D1[Efficient reuse<br/>Good for varied sizes<br/>Setup complexity]
    E --> E1[Familiar<br/>Non-deterministic<br/>Fragmentation]
    
    style C1 fill:#51cf66,stroke:#2f9e44
    style D1 fill:#74c0fc,stroke:#1971c2
    style E1 fill:#ffe066,stroke:#f08c00

Summary table:

| Aspect | std::vector | pmr::vector (monotonic) | pmr::vector (pool) |
| --- | --- | --- | --- |
| Raw speed | ⭐⭐⭐⭐⭐ (16.9µs) | ⭐⭐ (57.4µs) | ⭐⭐ (68.9µs) |
| Determinism | ⭐⭐⭐ (CV 10.2%) | ⭐⭐ (CV 43.2%) | ⭐⭐⭐⭐⭐ (CV 5.6%) |
| Memory efficiency | ⚠️ Heap overhead | ✅ No heap | ✅ No heap |
| Flexibility | ✅ Full | ⚠️ No individual free | ✅ Can free |
| Setup complexity | ✅ Zero | ⚠️ Buffer sizing | ⚠️ Pool tuning |

Verdict: For simple vector operations, std::vector is faster. PMR pool offers the best determinism at the cost of 4x slower speed. Choose PMR when you need predictable timing over raw performance.

Real-World Impact Example

Imagine a sensor system that processes 1000 readings every 10ms:

// With std::vector - fast but variable
void process_readings() {
    std::vector<Reading> data;
    for (int i = 0; i < 1000; ++i) {
        data.push_back(read_sensor(i));
    }
    analyze(data);
}
// Mean: 16.90µs, but CV=10.23%
// Worst case can spike 2x (33.60µs)
// For hard real-time: Variance is the problem ❌

// With pmr::pool - slower but predictable
void process_readings() {
    static std::pmr::unsynchronized_pool_resource pool;
    std::pmr::vector<Reading> data{&pool};
    
    for (int i = 0; i < 1000; ++i) {
        data.push_back(read_sensor(i));
    }
    analyze(data);
}
// Mean: 68.86µs, but CV=5.62% 
// Worst case bounded (91.70µs)
// For hard real-time: Predictability wins ✅

PMR is slower (4x in this case), but offers 2x better determinism. For real-time systems, you sacrifice speed for predictability. Choose based on your constraints.

How to Reproduce These Results

All benchmarks are on GitHub:

git clone https://github.com/saptarshi-max/pmr-benchmark
cd pmr-benchmark
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./benchmark_vector  # See results

The repository includes:

  • Complete source code
  • Automated benchmark scripts
  • Report generation (markdown + graphs)
  • Methodology documentation

Your results may vary based on CPU/compiler, but the relative differences should be similar.

Best Practices & Recommendations

1. Choose the Right Memory Resource

// For short-lived, bulk operations (request handling)
void handle_request(const Request& req) {
    char buffer[4096];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    
    std::pmr::vector<Response> responses{&mbr};
    std::pmr::string temp{&mbr};
    
    process(req, responses, temp);
    
}  // Automatic cleanup, no fragmentation

// For long-lived, dynamic containers (device registry)
class DeviceManager {
    std::pmr::unsynchronized_pool_resource pool_;
    std::pmr::map<DeviceId, Device> devices_{&pool_};
    
public:
    void add_device(DeviceId id, Device dev) {
        devices_.emplace(id, std::move(dev));  // Efficient pooling
    }
};

// For extremely constrained systems (safety-critical)
alignas(Device) static char device_storage[MAX_DEVICES][sizeof(Device)];
// Use indices instead of containers

2. Stack Buffer Sizing Strategy

// Measure actual usage first
void tune_buffer_size() {
    struct : std::pmr::memory_resource {
        size_t peak = 0;
        size_t current = 0;
        
        void* do_allocate(size_t n, size_t align) override {
            current += n;
            peak = std::max(peak, current);
            return ::operator new(n, std::align_val_t{align});  // Honor the requested alignment
        }
        
        void do_deallocate(void* p, size_t n, size_t align) override {
            current -= n;
            ::operator delete(p, n, std::align_val_t{align});
        }
        
        bool do_is_equal(const memory_resource& o) const noexcept override {
            return this == &o;
        }
    } tracker;
    
    // Run your typical workload
    {
        std::pmr::vector<Data> vec{&tracker};
        std::pmr::string str{&tracker};
        // ... typical operations ...
    }
    
    printf("Peak memory: %zu bytes\n", tracker.peak);
    // Now allocate buffer at peak + 20% margin
}

Rules of thumb:

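  • Monotonic buffer: Total bytes allocated between resets (not the live peak, because freed blocks are not reused) + 20-30% margin
  • Pool resource: Expected concurrent allocations × avg size × 1.5
  • Always choose the upstream resource explicitly: the default upstream is the heap, while std::pmr::null_memory_resource() turns overflow into std::bad_alloc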

3. Error Handling

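// PMR resources report exhaustion by throwing std::bad_alloc, not by returning nullptr.
// (A monotonic or pool resource only throws if its upstream does, e.g. with
// std::pmr::null_memory_resource(); otherwise it falls back to the heap.)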
std::pmr::vector<Data> vec{&limited_resource};

try {
    vec.push_back(data);
} catch (const std::bad_alloc&) {
    // Buffer exhausted
    log_error("Memory exhausted");
    // Graceful degradation
}

// Or check capacity proactively
if (vec.capacity() < vec.size() + 1) {
    // Handle before allocation fails
}
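
A minimal sketch of the try/catch pattern above wrapped in a reusable helper, assuming exceptions are enabled and the buffer has a null upstream. `try_push` and `fill_until_exhausted` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <new>
#include <vector>

// Appends a value and reports failure instead of letting bad_alloc escape.
template <typename T>
bool try_push(std::pmr::vector<T>& vec, const T& value) {
    try {
        vec.push_back(value);
        return true;
    } catch (const std::bad_alloc&) {
        return false;  // push_back left the vector unchanged
    }
}

// Fills a vector backed by a 256-byte buffer until the buffer runs out
// and returns how many elements made it in.
std::size_t fill_until_exhausted() {
    alignas(std::max_align_t) std::byte buffer[256];
    std::pmr::monotonic_buffer_resource limited{buffer, sizeof(buffer),
                                                std::pmr::null_memory_resource()};
    std::pmr::vector<int> vec{&limited};
    while (try_push(vec, 0)) {
        // keep appending until the resource is exhausted
    }
    return vec.size();
}
```

The element count returned is smaller than the 64 ints the buffer could hold in one block, because each reallocation abandons the previous block. In builds with `-fno-exceptions` this pattern is not available and exhaustion typically terminates the program, so proactive capacity checks matter more there.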

4. Combining Resources

// Hierarchy: fast path → slow path → error
char slow_buffer[16384];
std::pmr::monotonic_buffer_resource slow{slow_buffer, sizeof(slow_buffer),
                                         std::pmr::null_memory_resource()};

char fast_buffer[1024];
std::pmr::monotonic_buffer_resource fast{fast_buffer, sizeof(fast_buffer), &slow};

std::pmr::vector<Packet> packets{&fast};
// Uses fast buffer first, overflows to slow (its upstream), then throws std::bad_alloc
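
To confirm which buffer serves which request in a chain like this, you can check where the returned pointers land. The sketch below assumes the front resource is the small buffer and its upstream is the large one, ending in `std::pmr::null_memory_resource()`. All names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory_resource>
#include <new>

// True if p points into [buf, buf + n).
inline bool within(const void* p, const std::byte* buf, std::size_t n) {
    const auto* c = static_cast<const std::byte*>(p);
    return std::greater_equal<>{}(c, buf) && std::less<>{}(c, buf + n);
}

struct ChainResult {
    bool small_in_fast;
    bool large_in_slow;
    bool throws_at_end;
};

ChainResult run_chain_demo() {
    alignas(std::max_align_t) static std::byte slow_buffer[16384];
    alignas(std::max_align_t) static std::byte fast_buffer[1024];
    // The last resource in the chain decides what happens on overflow.
    std::pmr::monotonic_buffer_resource slow{slow_buffer, sizeof(slow_buffer),
                                             std::pmr::null_memory_resource()};
    std::pmr::monotonic_buffer_resource fast{fast_buffer, sizeof(fast_buffer), &slow};

    ChainResult r{};
    void* a = fast.allocate(256, 8);    // fits in the fast buffer
    r.small_in_fast = within(a, fast_buffer, sizeof(fast_buffer));

    void* b = fast.allocate(2048, 8);   // too big: fast asks its upstream
    r.large_in_slow = within(b, slow_buffer, sizeof(slow_buffer));

    try {
        fast.allocate(64 * 1024, 8);    // exceeds both buffers
    } catch (const std::bad_alloc&) {
        r.throws_at_end = true;
    }
    return r;
}
```

The resource you hand to the container is the first one tried; the `upstream` constructor argument names where it goes next, not where it starts.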

5. Move from std to pmr Gradually

// Phase 1: Identify hot paths with profiling
// Phase 2: Replace one container at a time
// Phase 3: Measure improvement

// Before
void process_messages(std::vector<Message>& messages) {
    for (auto& msg : messages) {
        std::string payload = decode(msg);  // Hidden allocation
        handle(payload);
    }
}

// After (step 1: function-local)
void process_messages(std::vector<Message>& messages) {
    char buffer[4096];
    std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
    
    for (auto& msg : messages) {
        {
            std::pmr::string payload{&mbr};
            decode(msg, payload);  // Allocates from the stack buffer
            handle(payload);
        }
        mbr.release();  // Reclaim the buffer for the next message
    }
}

// After (step 2: propagate PMR through API)
void process_messages(std::pmr::vector<Message>& messages) {
    // Caller controls memory resource
}

6. Testing Strategy

// Create memory pressure in tests
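class LimitedResource : public std::pmr::memory_resource {
    size_t limit_;
    size_t used_ = 0;
    std::pmr::memory_resource* upstream_;

public:
    explicit LimitedResource(size_t limit,
            std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
        : limit_(limit), upstream_(upstream) {}

protected:
    void* do_allocate(size_t n, size_t a) override {
        if (used_ + n > limit_) {
            throw std::bad_alloc();
        }
        used_ += n;
        return upstream_->allocate(n, a);
    }

    void do_deallocate(void* p, size_t n, size_t a) override {
        used_ -= n;
        upstream_->deallocate(p, n, a);
    }

    bool do_is_equal(const memory_resource& other) const noexcept override {
        return this == &other;
    }
};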

TEST(SensorProcessor, HandlesMemoryExhaustion) {
    LimitedResource limited{1024};  // Only 1KB available
    
    std::pmr::vector<Sample> samples{&limited};
    
    // Verify graceful degradation: fill_samples is expected to handle
    // std::bad_alloc internally rather than let it escape
    EXPECT_NO_THROW(fill_samples(samples, 10000));
}

Surprising Findings from Real Benchmarks

After running comprehensive benchmarks, the results were different from expectations:

What I Expected vs. What I Found

| Expectation | Reality |
| --- | --- |
| PMR would be faster | PMR is 3-4x slower for simple operations |
| PMR would reduce variance significantly | Mixed results: Great for strings (3x better), worse for vectors with the monotonic resource |
| Pool allocator would be fastest | std::allocator is fastest for basic operations |
| PMR always wins on determinism | Not always: Map operations showed minimal improvement |

The Real Trade-off

PMR is NOT a performance win in all cases. The actual benefit is:

  • When you need: Predictable timing, no heap usage, bounded worst-case
  • What you sacrifice: 3-4x slower execution for basic operations
  • Best use case: Real-time systems where 68µs±5.6% beats 16µs±10%

Benchmark Highlights

Vector operations (1000 push_back):

  • std: 16.90µs (fast, 10% variance)
  • pmr_pool: 68.86µs (slow, 5.6% variance) ← Best determinism
  • Winner: Depends on your constraint (speed vs. predictability)

String concatenation:

  • std: 307.77µs (CV: 29.52%)
  • pmr: 360.17µs (CV: 9.40%) ← 3x more consistent
  • Winner: PMR if you need determinism

Figure (String Concatenation Percentiles): String concatenation P95/P99 - PMR shows tighter distribution

Sensor data collection (realistic embedded workload):

  • std: 2197.62µs (CV: 14.15%)
  • pmr: 2655.63µs (CV: 12.72%) ← Slight improvement
  • Winner: Depends on whether 20% speed loss is acceptable for 10% better determinism

Figure 8 (Sensor Collection Benchmark): Sensor data collection - PMR trades 20% speed for slightly better consistency

Message queue (mixed workload):

  • std: 171.90µs (CV: 8.93%)
  • pmr: 272.57µs (CV: 28.10%) ← Worse!
  • Winner: std (better on both metrics)

Figure 7 (Message Queue Performance): Message queue benchmark - std::allocator wins on both speed AND consistency

PMR isn’t universally better. It trades raw speed for predictability. Profile your specific workload before switching.

Conclusions

Key Takeaways

  • PMR offers predictability, not speed: In these runs it was 17-24% slower for strings and maps and 3-4x slower for vector push_back, with more consistent timing in some workloads
  • Use PMR for determinism: When low variance (68µs±5.6%) matters more than speed (16µs±10%) in real-time systems
  • std::allocator is still fast: For non-critical paths, standard containers perform well
  • Not a silver bullet: Profile first, PMR helps specific cases (strings, bounded allocations)

When to Use PMR

✅ Use PMR when:

  • Hard real-time systems where consistent timing matters more than average speed (predictability over performance)
  • Embedded systems where heap allocation must be avoided
  • Safety-critical software requiring deterministic behavior
  • Short-lived operations with known memory bounds

✅ Use std::allocator when:

  • Raw performance is the priority (3-4x faster for basic ops)
  • Workload has low natural variance (like our message queue example)
  • Long-lived containers with unpredictable growth
  • Development speed matters more than optimization

⚠️ Benchmark first:

  • Results vary by workload (vector vs. string vs. map)
  • Your hardware/compiler may show different ratios
  • Profile to see if determinism improvement justifies speed loss

The Mental Model Shift

graph LR
    A[Traditional:<br/>'Container owns memory'] --> B[PMR:<br/>'Container borrows memory']
    
    B --> C[You control:<br/>• Where<br/>• When<br/>• How much]
    
    C --> D[Better:<br/>• Determinism<br/>• Memory placement<br/>• Control]
    
    style A fill:#ffe066,stroke:#f08c00
    style B fill:#74c0fc,stroke:#1971c2
    style D fill:#51cf66,stroke:#2f9e44

PMR isn’t about replacing std containers—it’s about giving you control when you need it. In embedded systems, that control often means the difference between “works most of the time” and “proven reliable.”

Quick Decision Matrix

| Your Situation | Recommendation | Rationale |
| --- | --- | --- |
| Need minimum latency variance | pmr::unsynchronized_pool_resource | CV as low as 5.6% |
| Need maximum raw speed | std::allocator | 3-4x faster for vectors |
| String-heavy workload | pmr::string | 3x better determinism |
| Mixed container operations | Profile first | Results vary (std won our message queue test) |
| Embedded, hard real-time | pmr::monotonic_buffer_resource | Zero heap, bounded timing |
| General-purpose application | Stick with std containers | Faster and simpler |
