Performance cost: std::allocator vs std::pmr::polymorphic_allocator

This blog post explores why memory management is a critical bottleneck in embedded systems and how C++17’s Polymorphic Memory Resources (PMR) compare to the default std::allocator in performance, determinism, and memory efficiency. (See correction notice - original results were flawed.)
- Why Memory Management Matters in Embedded Systems
- Traditional Approaches
- STL Containers Are the C++ Standard, So Why Are Embedded Developers Skeptical of Them?
- Understanding std::pmr (Polymorphic Memory Resources)
- Key pmr Components
- Performance Cost of std::pmr vs std Containers & Trade-offs
- Best Practices & Recommendations
- Surprising Findings from Real Benchmarks
- Conclusions
- Errata — What I Got Wrong
- What I Actually Learned
- Corrected Benchmark Results
- References
Why Memory Management Matters in Embedded Systems
Embedded systems operate under constraints that desktop applications rarely face:
- Limited RAM - Often measured in kilobytes, not gigabytes
- Deterministic behavior - Real-time deadlines must be met
- No virtual memory - Physical RAM is all you get
- Fragmentation concerns - Memory leaks and fragmentation can’t be “fixed” by rebooting
- Power constraints - Every allocation costs energy
The problem: new and delete aren’t just slow—they’re unpredictable. A single allocation might succeed in 10µs or fail after 1ms of heap searching.
Traditional dynamic memory allocation using malloc/free or new/delete introduces:
graph TD
A[Dynamic Allocation Request] --> B{Heap Search}
B -->|Best Fit| C[Search entire free list]
B -->|First Fit| D[Search until fit found]
C --> E[Non-deterministic timing]
D --> E
E --> F[Fragmentation over time]
F --> G[Memory exhaustion]
G --> H[System failure]
style E fill:#ff6b6b,stroke:#c92a2a,color:#fff
style F fill:#ffe066,stroke:#f08c00
style H fill:#ff6b6b,stroke:#c92a2a,color:#fff
Real-world impact:
- Spacecraft software often bans dynamic allocation entirely
- Automotive safety systems use static allocation only
- Industrial controllers pre-allocate all memory at startup
Traditional Approaches
Embedded developers have historically used several strategies to avoid heap allocation:
1. Static Allocation
// Everything allocated at compile time
struct SensorBuffer {
static constexpr size_t MAX_SAMPLES = 100;
int samples[MAX_SAMPLES];
size_t count = 0;
};
SensorBuffer buffer; // No runtime allocation
Pros: Predictable, fast, no fragmentation
Cons: Wastes memory, inflexible, compile-time sizing
2. Fixed-Size Pools
// Pre-allocated pool of Message objects
template<typename T, size_t N>
class ObjectPool {
alignas(T) char storage[N][sizeof(T)];
bool used[N] = {};
public:
T* allocate() {
for (size_t i = 0; i < N; ++i) {
if (!used[i]) {
used[i] = true;
return new (&storage[i]) T();
}
}
return nullptr;
}
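void deallocate(T* ptr) {
// Destroy the object, then mark its slot as free
ptr->~T();
const size_t i = (reinterpret_cast<char*>(ptr) - &storage[0][0]) / sizeof(T);
used[i] = false;
}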
};
Pros: Bounded allocation time, no fragmentation
Cons: Fixed capacity, manual management, type-specific
3. Arena/Region Allocators
class Arena {
char* buffer;
size_t size;
size_t offset = 0;
public:
void* allocate(size_t n) {
if (offset + n > size) return nullptr;
void* ptr = buffer + offset;
offset += n;
return ptr;
}
void reset() { offset = 0; } // Bulk deallocation
};
Pros: Fast allocation, bulk deallocation
Cons: No individual deallocation, manual lifecycle
STL Containers Are the C++ Standard, So Why Are Embedded Developers Skeptical of Them?
The C++ Standard Library provides powerful containers like vector, map, unordered_map, etc. Yet embedded developers often avoid them:
// Desktop developer's natural approach
void process_data() {
std::vector<SensorReading> readings; // Uses heap
std::map<int, Device> devices; // Uses heap
for (int i = 0; i < sensor_count; ++i) {
readings.push_back(read_sensor()); // Hidden allocations
}
}
The Problems
- Non-deterministic allocation: vector::push_back might allocate or might not
- Exceptions: Many embedded projects compile with -fno-exceptions
- Code bloat: Templates can increase binary size significantly
- Hidden costs: Iterator operations might be expensive
- Lack of control: Can’t specify where memory comes from
Dynamic Memory Management: The Hidden Costs
Let’s quantify what “hidden costs” actually means:
Allocation overhead on ARM Cortex-M4 (168 MHz):
| Operation | Best Case | Worst Case | Variance |
|---|---|---|---|
| malloc(32) | 150 cycles (~0.9µs) | 15,000 cycles (~89µs) | 100x |
| free() | 80 cycles (~0.5µs) | 8,000 cycles (~48µs) | 100x |
| vector::push_back (no resize) | 20 cycles | 20 cycles | 1x |
| vector::push_back (resize) | 200 cycles | 20,000 cycles | 100x |
The variance is what matters. A 100x timing difference is unacceptable when you have a 1ms deadline.
Fragmentation
sequenceDiagram
participant App
participant Heap
Note over Heap: Initial: [4KB free block]
App->>Heap: alloc 1KB
Note over Heap: [1KB used][3KB free]
App->>Heap: alloc 1KB
Note over Heap: [1KB][1KB][2KB free]
App->>Heap: free first block
Note over Heap: [1KB free][1KB][2KB free]
App->>Heap: alloc 3KB
Note over Heap: Can't fit! Only 2KB contiguous
Note over Heap: Total free: 3KB<br/>Max contiguous: 2KB<br/>Fragmentation: 33%
Real-world impact:
- System with 128KB RAM might only have 64KB usable after fragmentation
- Medical device recalled due to memory exhaustion after 72 hours of operation
- Industrial controller required daily reboots to “clear memory”
Call Stack Example
std::vector<int> data;
data.reserve(100); // One allocation
// Stack trace during reserve():
// vector::reserve()
// └─ allocator::allocate()
// └─ operator new()
// └─ malloc()
// └─ heap_search() ← Non-deterministic
// └─ find_free_block()
// └─ coalesce_blocks()
Understanding std::pmr (Polymorphic Memory Resources)
C++17 introduced Polymorphic Memory Resources to solve exactly this problem. The key insight:
Separate the what (container logic) from the where (memory source)
The Architecture
graph TD
A[pmr::vector Container] --> B[memory_resource interface]
B --> C[monotonic_buffer_resource]
B --> D[pool_resource]
B --> E[synchronized_pool_resource]
B --> F[Custom allocator]
C --> G[Stack buffer]
C --> H[Arena]
D --> I[Fixed pools]
F --> J[DMA memory]
F --> K[Shared memory]
style A fill:#74c0fc,stroke:#1971c2
style B fill:#ffe066,stroke:#f08c00
style C fill:#51cf66,stroke:#2f9e44
style D fill:#51cf66,stroke:#2f9e44
style E fill:#51cf66,stroke:#2f9e44
style F fill:#51cf66,stroke:#2f9e44
Key idea: The container doesn’t know or care where memory comes from—it just calls the memory_resource interface.
// Traditional: tied to global heap
std::vector<int> vec;
// PMR: you control the memory source
char buffer[4096];
std::pmr::monotonic_buffer_resource pool{buffer, sizeof(buffer)};
std::pmr::vector<int> vec{&pool}; // Uses our buffer, not heap!
Key pmr Components
1. memory_resource (Base Interface)
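class memory_resource {
public:
// Public, non-virtual entry points that containers call
void* allocate(size_t bytes, size_t alignment = alignof(max_align_t));
void deallocate(void* p, size_t bytes, size_t alignment = alignof(max_align_t));
bool is_equal(const memory_resource& other) const noexcept;
private:
// Customization points that each resource overrides
virtual void* do_allocate(size_t bytes, size_t alignment) = 0;
virtual void do_deallocate(void* p, size_t bytes, size_t alignment) = 0;
virtual bool do_is_equal(const memory_resource& other) const noexcept = 0;
};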
Your guarantee: All pmr containers use only these three functions.
2. monotonic_buffer_resource
Best for: Bulk allocation with single reset
// Allocate from stack buffer
char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
{
std::pmr::vector<Message> messages{&mbr};
std::pmr::string temp{&mbr};
process_messages(messages);
} // No individual deallocations!
mbr.release(); // Bulk reset - O(1)
sequenceDiagram
participant Container
participant MBR as monotonic_buffer
participant Buffer as Stack Buffer
Container->>MBR: allocate(64 bytes)
MBR->>Buffer: bump pointer by 64
Note over Buffer: [64 used][remaining free]
Container->>MBR: allocate(128 bytes)
MBR->>Buffer: bump pointer by 128
Note over Buffer: [192 used][remaining free]
Container->>MBR: deallocate(first block)
Note over MBR: No-op! Can't reuse yet
Note over Container: Scope ends
Container->>MBR: release()
MBR->>Buffer: Reset pointer to start
Note over Buffer: [All free again]
Performance:
- Allocation: O(1) - just increment a pointer
- Deallocation: O(1) - no-op
- Reset: O(1) - bulk cleanup
Trade-off: Individual deallocations are ignored until release()
3. unsynchronized_pool_resource & synchronized_pool_resource
Best for: Frequent allocation/deallocation of similar sizes
std::pmr::unsynchronized_pool_resource pool;
{
std::pmr::vector<Packet> packets{&pool};
std::pmr::map<int, Device> devices{&pool};
// Efficient allocation from pools
for (int i = 0; i < 1000; ++i) {
packets.push_back(Packet{}); // From pool
}
}
// Memory returned to pools, ready for reuse
How it works:
graph TD
A[pool_resource] --> B[Pool: 16-byte blocks]
A --> C[Pool: 32-byte blocks]
A --> D[Pool: 64-byte blocks]
A --> E[Pool: 128-byte blocks]
A --> F[Overflow to upstream]
B --> B1[Free list]
C --> C1[Free list]
D --> D1[Free list]
E --> E1[Free list]
style A fill:#ffe066,stroke:#f08c00
style B fill:#51cf66,stroke:#2f9e44
style C fill:#51cf66,stroke:#2f9e44
style D fill:#51cf66,stroke:#2f9e44
style E fill:#51cf66,stroke:#2f9e44
style F fill:#ff6b6b,stroke:#c92a2a,color:#fff
Performance:
- Allocation: O(1) - pop from free list
- Deallocation: O(1) - push to free list
- No fragmentation within pools
Trade-off: More memory overhead for pool metadata
4. Custom Allocators
You can implement your own memory_resource for specialized needs:
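// Requires <memory> (std::align), <memory_resource> and <new>
class DMAMemoryResource : public std::pmr::memory_resource {
char* dma_region; // Base address of the DMA-capable region
size_t capacity; // Size of the region in bytes
size_t offset = 0;
public:
DMAMemoryResource(void* region, size_t bytes)
: dma_region(static_cast<char*>(region)), capacity(bytes) {}
protected:
void* do_allocate(size_t bytes, size_t align) override {
// Bump-allocate from the DMA region, honoring the requested alignment
void* ptr = dma_region + offset;
size_t space = capacity - offset;
if (!std::align(align, bytes, ptr, space)) throw std::bad_alloc();
offset = static_cast<size_t>(static_cast<char*>(ptr) - dma_region) + bytes;
return ptr;
}
void do_deallocate(void*, size_t, size_t) override {
// No-op for DMA buffers
}
bool do_is_equal(const memory_resource& other) const noexcept override {
return this == &other;
}
};
// Now use with any pmr container
// dma_base and dma_size are placeholders for your platform's DMA region
DMAMemoryResource dma_mem{dma_base, dma_size};
std::pmr::vector<uint8_t> dma_buffer{&dma_mem};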
Common custom allocators:
- DMA-capable memory regions
- Cache-aligned allocations
- Shared memory segments
- Non-volatile RAM
- Custom memory protection schemes
Performance Cost of std::pmr vs std Containers & Trade-offs
Let’s measure real performance on actual hardware.
My Test Setup
I built a benchmark suite to measure these claims objectively:
Hardware: [Snapdragon X Plus (12‑core Oryon CPU), Adreno GPU, 16GB LPDDR5x RAM]
Compiler: GCC 12.2, -O3 -std=c++17
Methodology: 100 iterations per test, measuring mean/variance/percentiles
Reproducible: All code and scripts are on GitHub. You can run these benchmarks yourself.
Benchmark 1: Vector Operations
Test: 1000 push_back operations of sensor data (80 bytes each)
// Test structure - typical embedded data
struct SensorData {
int id;
double temperature;
double pressure;
uint64_t timestamp;
char label[32];
};
// What we're measuring
constexpr size_t N = 1000;
// Standard vector (heap allocations)
{
std::vector<SensorData> vec;
for (size_t i = 0; i < N; ++i) {
vec.push_back(read_sensor(i));
}
}
// PMR vector (stack buffer)
{
char buffer[N * sizeof(SensorData) * 2];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
std::pmr::vector<SensorData> vec{&mbr};
for (size_t i = 0; i < N; ++i) {
vec.push_back(read_sensor(i));
}
}
Results from my machine:
| Metric | std::vector | pmr::vector (monotonic) | pmr::vector (pool) | Comparison |
|---|---|---|---|---|
| Mean time | 16.90 µs | 57.44 µs | 68.86 µs | pmr 3.4-4.1x slower |
| Worst case | 33.60 µs | 264.20 µs | 91.70 µs | pmr worse P99 |
| Variance (CV) | 10.23% | 43.18% | 5.62% | pool most consistent |
| Best case | 16.40 µs | 51.00 µs | 64.40 µs | std fastest |
| Heap allocations | Many | 0 | 0 | pmr: no heap |
graph LR
A[std::vector<br/>16.9µs ±10%<br/>Fast but variable] -->|Switch to PMR| B[pmr::vector pool<br/>68.9µs ±5.6%<br/>Slower but consistent]
style A fill:#74c0fc,stroke:#1971c2
style B fill:#51cf66,stroke:#2f9e44
PMR is slower for simple operations like vector push_back, but offers better consistency. The pool allocator has the best determinism (5.62% CV) making it ideal for real-time scenarios where predictability matters more than raw speed.
P95 means 95% of operations complete faster than this value, P99 means 99%. The narrower the gap between P95 and P99, the more predictable the performance—pmr_pool has the tightest spread, meaning fewer outliers and more reliable worst-case timing for real-time systems.
Benchmark 2: String Operations
Test: Build 100 strings from sensor labels
// 100 strings like "Temperature_Sensor_42"
std::array<const char*, 100> labels = { /* ... */ };
// Standard strings
{
std::vector<std::string> names;
for (const auto* label : labels) {
names.emplace_back(label);
}
}
// PMR strings
{
char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
std::pmr::vector<std::pmr::string> names{&mbr};
for (const auto* label : labels) {
names.emplace_back(label); // pmr::vector passes its allocator to each new string
}
}
Results:
| Operation | std::string | pmr::string | Comparison |
|---|---|---|---|
| String concat (mean) | 307.77 µs | 360.17 µs | pmr 17% slower |
| String concat (CV) | 29.52% | 9.40% | pmr 3.1x more consistent |
| SSO strings (mean) | 120.57 µs | 121.53 µs | Nearly identical |
| SSO strings (CV) | 46.58% | 29.19% | pmr 1.6x more consistent |
| Heap allocations | Many | 0 | pmr: no heap |
For strings, PMR trades a small speed penalty for significantly better determinism. String concat CV drops from 29.52% to 9.40% - a 3x improvement in predictability.
Benchmark 3: Map Operations
Test: Insert 500 device records into a map
Results:
| Operation | std::map | pmr::map | Comparison |
|---|---|---|---|
| Int keys (mean) | 206.57 µs | 256.41 µs | pmr 24% slower |
| Int keys (CV) | 18.43% | 22.93% | std more consistent |
| String keys (mean) | 487.28 µs | 578.90 µs | pmr 19% slower |
| String keys (CV) | 28.31% | 25.22% | pmr 11% more consistent |
| Heap allocations | Per node | 0 | pmr: no heap |
For maps, PMR doesn’t show the determinism advantage we saw with strings. The pool allocator adds overhead without significant consistency gains for this workload.
Why the Huge Variance in std::vector?
The standard allocator has unpredictable performance because it does complex memory management:
sequenceDiagram
participant Vec as std::vector
participant Heap as malloc/free
Note over Vec: push_back #1
Vec->>Heap: malloc(16 bytes)
Heap->>Heap: Search free list
Heap->>Heap: Check if split needed
Heap-->>Vec: Fast path (0.8µs)
Note over Vec: push_back #512
Vec->>Heap: Need more space!
Heap->>Heap: Search free list
Heap->>Heap: No suitable block
Heap->>Heap: Coalesce adjacent blocks
Note right of Heap: Combine fragmented<br/>free blocks into<br/>larger blocks
Heap->>Heap: Still not enough?
Heap->>Heap: Request from OS (sbrk/mmap)
Heap-->>Vec: Slow path (45µs)
Note over Heap: 56x variance!
What’s “coalescing”? When memory is freed, adjacent free blocks are merged together:
Before dealloc: [Used A][Used B][Used C]
After free B: [Used A][Free B][Used C] ← Fragmented
After free C: [Used A][Free B+C merged] ← Coalesced!
This is expensive - the allocator must scan neighbors, update metadata, and maintain sorted free lists.
PMR eliminates this complexity:
sequenceDiagram
participant Vec as pmr::vector
participant MBR as monotonic_buffer
participant Stack as Stack Buffer
Note over Vec: push_back #1
Vec->>MBR: allocate(16)
MBR->>Stack: offset += 16
Note right of Stack: Just bump pointer!<br/>No search,<br/>no coalesce,<br/>no metadata
Stack-->>Vec: 0.5µs
Note over Vec: push_back #512
Vec->>MBR: allocate(16)
MBR->>Stack: offset += 16
Stack-->>Vec: 0.5µs
Note over Vec: deallocate(ptr)
Vec->>MBR: deallocate()
MBR->>MBR: No-op!
Note right of MBR: Doesn't track<br/>individual frees.<br/>Bulk reset later.
Note over Stack: Consistent every time!
The Memory Management Trade-off
In brief, here are the three allocator options with their core pros (+) and cons (-):
flowchart TB
A[Memory Management Options]
A --> B[std::allocator<br/>+ reuses frees<br/>- search & coalesce<br/>- fragmentation]
A --> C[pmr::monotonic_buffer<br/>+ O1 alloc, zero frag<br/>- reset to reclaim]
A --> D[pmr::pool_resource<br/>+ O1 alloc/free in bins<br/>- bins do not merge]
And here is how the three allocators handle determinism and fragmentation:
flowchart TB
E[Determinism & Fragmentation]
E --> B2[std::allocator:<br/>low determinism,<br/>can fragment]
E --> C2[pmr::monotonic:<br/>high determinism,<br/>no fragmentation]
E --> D2[pmr::pool:<br/>high determinism,<br/>limited to bins]
Why this matters for embedded systems:
| Operation | std::allocator | pmr::monotonic | pmr::pool |
|---|---|---|---|
| Allocate | O(n) search + coalesce | O(1) pointer bump | O(1) pop from bin |
| Deallocate | O(1) but triggers coalesce | No-op | O(1) push to bin |
| Reuse freed memory | ✅ Yes | ❌ No (until reset) | ✅ Within same bin |
| Fragmentation | ❌ Yes - requires coalesce | ✅ None | ⚠️ Limited to bins |
| Determinism | ❌ Poor | ✅ Excellent | ✅ Very good |
Example: Why coalescing matters
// Scenario: a 4KB heap. Allocate 1KB, 2KB, 1KB, then free the middle block
void* a = malloc(1024); // [1KB used]
void* b = malloc(2048); // [1KB used][2KB used]
void* c = malloc(1024); // [1KB used][2KB used][1KB used]
free(b); // [1KB used][2KB free][1KB used]
// Now request 3KB - the only free block is the 2KB hole
void* d = malloc(3072); // ❌ Fails! Not enough contiguous space
// (unless the allocator can merge the hole with free neighbors)
// With coalescing (expensive):
free(a); free(c); // Allocator scans and merges:
// [4KB free] ← Now can fit 3KB
d = malloc(3072); // ✅ Works, but took time to coalesce
PMR approach:
{
char buffer[8192];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
// Allocate whatever you need
void* a = mbr.allocate(1024, 8);
void* b = mbr.allocate(2048, 8);
void* c = mbr.allocate(1024, 8);
// Deallocate - no-op!
mbr.deallocate(a, 1024, 8); // Does nothing
mbr.deallocate(b, 2048, 8); // Does nothing
// Memory isn't reused...
void* d = mbr.allocate(3072, 8); // Uses NEW space, not the freed blocks
// But you get it all back at once:
mbr.release(); // Instant - resets pointer to start
// Now all 8192 bytes available again
// Or let destructor handle it
} // mbr destroyed - automatic cleanup
PMR trades flexibility (can’t reuse freed memory immediately) for speed and determinism (no complex bookkeeping). Good for short-lived, bulk operations.
The Trade-offs
flowchart TD
A[Choose Memory Resource] --> B{Usage Pattern?}
B -->|Short-lived,<br/>bulk operations| C[monotonic_buffer]
B -->|Frequent alloc/dealloc,<br/>mixed sizes| D[pool_resource]
B -->|Rare allocations| E[new_delete_resource]
C --> C1[Fastest<br/>Zero fragmentation<br/>No individual dealloc]
D --> D1[Efficient reuse<br/>Good for varied sizes<br/>Setup complexity]
E --> E1[Familiar<br/>Non-deterministic<br/>Fragmentation]
style C1 fill:#51cf66,stroke:#2f9e44
style D1 fill:#74c0fc,stroke:#1971c2
style E1 fill:#ffe066,stroke:#f08c00
Summary table:
| Aspect | std::vector | pmr::vector (monotonic) | pmr::vector (pool) |
|---|---|---|---|
| Raw speed | ⭐⭐⭐⭐⭐ (16.9µs) | ⭐⭐ (57.4µs) | ⭐⭐ (68.9µs) |
| Determinism | ⭐⭐⭐ (CV 10.2%) | ⭐⭐ (CV 43.2%) | ⭐⭐⭐⭐⭐ (CV 5.6%) |
| Memory efficiency | ⚠️ Heap overhead | ✅ No heap | ✅ No heap |
| Flexibility | ✅ Full | ⚠️ No individual free | ✅ Can free |
| Setup complexity | ✅ Zero | ⚠️ Buffer sizing | ⚠️ Pool tuning |
Verdict: For simple vector operations, std::vector is faster. PMR pool offers the best determinism at the cost of 4x slower speed. Choose PMR when you need predictable timing over raw performance.
Real-World Impact Example
Imagine a sensor system that processes 1000 readings every 10ms:
// With std::vector - fast but variable
void process_readings() {
std::vector<Reading> data;
for (int i = 0; i < 1000; ++i) {
data.push_back(read_sensor(i));
}
analyze(data);
}
// Mean: 16.90µs, but CV=10.23%
// Worst case can spike 2x (33.60µs)
// For hard real-time: Variance is the problem ❌
// With pmr::pool - slower but predictable
void process_readings() {
static std::pmr::unsynchronized_pool_resource pool;
std::pmr::vector<Reading> data{&pool};
for (int i = 0; i < 1000; ++i) {
data.push_back(read_sensor(i));
}
analyze(data);
}
// Mean: 68.86µs, but CV=5.62%
// Worst case bounded (91.70µs)
// For hard real-time: Predictability wins ✅
PMR is slower (4x in this case), but offers 2x better determinism. For real-time systems, you sacrifice speed for predictability. Choose based on your constraints.
How to Reproduce These Results
All benchmarks are on GitHub:
git clone https://github.com/saptarshi-max/pmr-benchmark
cd pmr-benchmark
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
./benchmark_vector # See results
The repository includes:
- Complete source code
- Automated benchmark scripts
- Report generation (markdown + graphs)
- Methodology documentation
Your results may vary based on CPU/compiler, but the relative improvements should be similar.
Best Practices & Recommendations
1. Choose the Right Memory Resource
// For short-lived, bulk operations (request handling)
void handle_request(const Request& req) {
char buffer[4096];
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
std::pmr::vector<Response> responses{&mbr};
std::pmr::string temp{&mbr};
process(req, responses, temp);
} // Automatic cleanup, no fragmentation
// For long-lived, dynamic containers (device registry)
class DeviceManager {
std::pmr::unsynchronized_pool_resource pool_;
std::pmr::map<DeviceId, Device> devices_{&pool_};
public:
void add_device(DeviceId id, Device dev) {
devices_.emplace(id, std::move(dev)); // Efficient pooling
}
};
// For extremely constrained systems (safety-critical)
alignas(Device) static char device_storage[MAX_DEVICES][sizeof(Device)];
// Use indices instead of containers
2. Stack Buffer Sizing Strategy
// Measure actual usage first
void tune_buffer_size() {
struct : std::pmr::memory_resource {
size_t peak = 0;
size_t current = 0;
void* do_allocate(size_t n, size_t) override {
current += n;
peak = std::max(peak, current);
return ::operator new(n);
}
void do_deallocate(void* p, size_t n, size_t) override {
current -= n;
::operator delete(p);
}
bool do_is_equal(const memory_resource& o) const noexcept override {
return this == &o;
}
} tracker;
// Run your typical workload
{
std::pmr::vector<Data> vec{&tracker};
std::pmr::string str{&tracker};
// ... typical operations ...
}
printf("Peak memory: %zu bytes\n", tracker.peak);
// Now allocate buffer at peak + 20% margin
}
Rules of thumb:
- Monotonic buffer: Workload peak + 20-30% margin
- Pool resource: Expected concurrent allocations × avg size × 1.5
- Always add overflow handling to upstream allocator
3. Error Handling
// PMR resources report exhaustion by throwing std::bad_alloc, not by returning nullptr
std::pmr::vector<Data> vec{&limited_resource};
try {
vec.push_back(data);
} catch (const std::bad_alloc&) {
// Buffer exhausted
log_error("Memory exhausted");
// Graceful degradation
}
// Or check capacity proactively
if (vec.capacity() < vec.size() + 1) {
// Handle before allocation fails
}
4. Combining Resources
// Hierarchy: fast path → slow path → error
char slow_buffer[16384];
std::pmr::monotonic_buffer_resource slow{slow_buffer, sizeof(slow_buffer), std::pmr::null_memory_resource()};
char fast_buffer[1024];
// The upstream argument is where a resource goes when its own buffer is full
std::pmr::monotonic_buffer_resource fast{fast_buffer, sizeof(fast_buffer), &slow};
std::pmr::vector<Packet> packets{&fast};
// Uses fast buffer first, overflows to slow, then throws std::bad_alloc
5. Move from std to pmr Gradually
// Phase 1: Identify hot paths with profiling
// Phase 2: Replace one container at a time
// Phase 3: Measure improvement
// Before
void process_messages(std::vector<Message>& messages) {
for (auto& msg : messages) {
std::string payload = decode(msg); // Hidden allocation
handle(payload);
}
}
// After (step 1: function-local)
void process_messages(std::vector<Message>& messages) {
char buffer[4096];
for (auto& msg : messages) {
// Fresh arena per message: a monotonic resource never reuses freed space
std::pmr::monotonic_buffer_resource mbr{buffer, sizeof(buffer)};
std::pmr::string payload{&mbr};
decode(msg, payload); // Writes into the stack buffer
handle(payload);
}
}
// After (step 2: propagate PMR through API)
void process_messages(std::pmr::vector<Message>& messages) {
// Caller controls memory resource
}
6. Testing Strategy
// Create memory pressure in tests
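class LimitedResource : public std::pmr::memory_resource {
size_t limit_;
size_t used_ = 0;
std::pmr::memory_resource* upstream_;
public:
explicit LimitedResource(size_t limit, std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
: limit_(limit), upstream_(upstream) {}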
protected:
void* do_allocate(size_t n, size_t a) override {
if (used_ + n > limit_) {
throw std::bad_alloc();
}
used_ += n;
return upstream_->allocate(n, a);
}
// ...
};
TEST(SensorProcessor, HandlesMemoryExhaustion) {
LimitedResource limited{1024}; // Only 1KB available
std::pmr::vector<Sample> samples{&limited};
// Verify graceful degradation
EXPECT_NO_THROW(fill_samples(samples, 10000)); // fill_samples must handle std::bad_alloc itself
}
Surprising Findings from Real Benchmarks
After running comprehensive benchmarks, the results were different from expectations:
What I Expected vs. What I Found
| Expectation | Reality |
|---|---|
| PMR would be faster | PMR is 3-4x slower for simple operations |
| PMR would reduce variance significantly | Mixed results: Great for strings (3x better), worse for vectors |
| Pool allocator would be fastest | std::allocator is fastest for basic operations |
| PMR always wins on determinism | Not always: Map operations showed minimal improvement |
The Real Trade-off
PMR is NOT a performance win in all cases. The actual benefit is:
- When you need: Predictable timing, no heap usage, bounded worst-case
- What you sacrifice: 3-4x slower execution for basic operations
- Best use case: Real-time systems where 68µs±5.6% beats 16µs±10%
Benchmark Highlights
Vector operations (1000 push_back):
- std: 16.90µs (fast, 10% variance)
- pmr_pool: 68.86µs (slow, 5.6% variance) ← Best determinism
- Winner: Depends on your constraint (speed vs. predictability)
String concatenation:
- std: 307.77µs (CV: 29.52%)
- pmr: 360.17µs (CV: 9.40%) ← 3x more consistent
- Winner: PMR if you need determinism
Sensor data collection (realistic embedded workload):
- std: 2197.62µs (CV: 14.15%)
- pmr: 2655.63µs (CV: 12.72%) ← Slight improvement
- Winner: Depends on whether 20% speed loss is acceptable for 10% better determinism
Message queue (mixed workload):
- std: 171.90µs (CV: 8.93%)
- pmr: 272.57µs (CV: 28.10%) ← Worse!
- Winner: std (better on both metrics)
PMR isn’t universally better. It trades raw speed for predictability. Profile your specific workload before switching.
Conclusions
Key Takeaways (wrong)
- PMR offers predictability, not speed: Expect 3-4x slower but more consistent timing
- Use PMR for determinism: When low variance (68µs±5%) matters more than speed (16µs±10%) in real-time systems
- std::allocator is still fast: For non-critical paths, standard containers perform well
- Not a silver bullet: Profile first, PMR helps specific cases (strings, bounded allocations)
When to Use PMR
✅ Use PMR when:
- Hard real-time systems where consistent timing matters more than average speed (predictability over performance)
- Embedded systems where heap allocation must be avoided
- Safety-critical software requiring deterministic behavior
- Short-lived operations with known memory bounds
✅ Use std::allocator when:
- Raw performance is the priority (3-4x faster for basic ops)
- Workload has low natural variance (like our message queue example)
- Long-lived containers with unpredictable growth
- Development speed matters more than optimization
⚠️ Benchmark first:
- Results vary by workload (vector vs. string vs. map)
- Your hardware/compiler may show different ratios
- Profile to see if determinism improvement justifies speed loss
The Mental Model Shift
graph LR
A[Traditional:<br/>'Container owns memory'] --> B[PMR:<br/>'Container borrows memory']
B --> C[You control:<br/>• Where<br/>• When<br/>• How much]
C --> D[Better:<br/>• Determinism<br/>• Performance<br/>• Control]
style A fill:#ffe066,stroke:#f08c00
style B fill:#74c0fc,stroke:#1971c2
style D fill:#51cf66,stroke:#2f9e44
PMR isn’t about replacing std containers—it’s about giving you control when you need it. In embedded systems, that control often means the difference between “works most of the time” and “proven reliable.”
Quick Decision Matrix
| Your Situation | Recommendation | Rationale |
|---|---|---|
| Need minimum latency variance | pmr::pool_resource | CV as low as 5.6% |
| Need maximum raw speed | std::allocator | 3-4x faster for vectors |
| String-heavy workload | pmr::string | 3x better determinism |
| Mixed container operations | Profile first | Results vary (std won our message queue test) |
| Embedded, hard real-time | pmr::monotonic_buffer | Zero heap, bounded timing |
| General-purpose application | Stick with std:: | Faster and simpler |
Errata — What I Got Wrong
I posted this to r/cpp expecting some discussion about embedded trade-offs. Instead, the top comments were pointing out that my benchmarks were fundamentally broken. 44 comments, and the majority were some variation of “your data is wrong.” They were right.
Two bugs. That’s all it took to completely invert the results. And a third mistake in how I analysed variance made the comparison misleading on top of that.
Bug 1: My PMR buffer was too small
Here’s what I had:
constexpr size_t BUFFER_SIZE = NUM_ELEMENTS * sizeof(int) * 4; // 16,000 bytes
4× the raw data. Seemed like plenty of headroom. It wasn’t.
The thing I forgot: std::vector grows geometrically. When it reallocates, the old and new buffers both exist at the same time (elements get copied over, then the old one is freed). And monotonic_buffer_resource doesn’t actually free anything — deallocations are no-ops. So every old allocation is still sitting there eating space. 16KB ran out fast.
Here’s the part that burned me: when monotonic_buffer_resource runs out of its initial buffer, it doesn’t throw. It doesn’t warn. It just quietly starts allocating from its upstream resource, which by default is… new_delete_resource(). The global heap. The exact thing I was trying to avoid.
So what was I actually benchmarking?
- std::allocator → heap
- pmr::monotonic → heap + PMR overhead on top
Of course PMR looked slower. It was doing strictly more work.
kammce (WG21 member) proved this by swapping in null_memory_resource() as the upstream. With a too-small buffer, the program immediately throws std::bad_alloc:
std::pmr::monotonic_buffer_resource mbr(
buffer, BUFFER_SIZE, std::pmr::null_memory_resource());
// Crashes with original 16KB buffer. The buffer was exhausted.
jwakely (libstdc++ maintainer, LWG chair) confirmed independently.
The silent fallback is arguably a good design for production code — you don’t want your program to crash because a buffer was slightly undersized. But for benchmarking, it’s a trap. You get numbers that look plausible but mean nothing. Lesson: when benchmarking PMR, always set the upstream to null_memory_resource(). You want it to blow up if the buffer isn’t big enough.
Bug 2: I compiled at -O0
My CMakeLists.txt set the right flags:
set(CMAKE_CXX_FLAGS_RELEASE "-O3 -march=native -DNDEBUG")
And my build script ran:
cmake --build . --config Release
The problem: on Linux with Makefiles or Ninja (single-config generators), --config Release does absolutely nothing. It’s silently ignored. That flag only matters for multi-config generators like Visual Studio or Xcode. For Makefiles, the build type has to be set at configure time:
cmake .. -DCMAKE_BUILD_TYPE=Release # <-- this is what actually matters
cmake --build .
kammce ran VERBOSE=1 and confirmed: no -O3 anywhere. I was benchmarking debug code. All of it.
If you develop on Windows with VS and then move to Linux CI with Makefiles, you will hit this. There’s no warning. The fix:
if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
message(STATUS "No build type selected — defaulting to Release")
set(CMAKE_BUILD_TYPE Release CACHE STRING "Build type" FORCE)
endif()
And always check VERBOSE=1 output to confirm -O3 is actually there.
Bug 3: I compared the wrong kind of variance
I looked at CV (coefficient of variation) and concluded PMR was more deterministic:
| Allocator | Mean | CV | Absolute jitter |
|---|---|---|---|
| std::allocator | 17 µs | 10% | 1.7 µs |
| pmr::pool | 69 µs | 6% | 4.14 µs |
6% < 10%, so PMR wins on determinism, right? No. If you have a 100µs deadline, you care about the actual microseconds of jitter, not the percentage. PMR’s absolute jitter was 4.14µs vs 1.7µs — 2.4× worse. Peddy699 and SeanCline both called this out (with 43 and 35 upvotes respectively). They were right.
So what are the actual results?
kammce re-ran with correct buffer sizes and -O3:
- Monotonic PMR: ~1.52× faster, ~5.10× better determinism
- Mixed workload: PMR ~1.26× faster, ~2.84× better determinism
My original claim that PMR was 3-4× slower was exactly backwards. A pointer bump should beat a heap search. I should have been suspicious the moment my numbers said otherwise.
Other stuff people caught
I wasn’t calling vector::reserve(), so the benchmark was mostly measuring reallocation behaviour, not allocator performance (IfreetBalkan). kamrann_ pointed out that on MSVC’s STL, polymorphic_allocator blocks memmove optimisation for trivially-copyable types — so part of what I measured was element-move overhead, not allocator overhead (see P1144R7 §4.2). m-in said I should have looked at the codegen on Godbolt before drawing any conclusions. ald_loop (63 upvotes) pointed out that Google Benchmark exists for a reason.
Also, my title said “std vs pmr” — jwakely and SupermanLeRetour pointed out that std is a namespace, not an allocator. The comparison is std::allocator vs std::pmr::polymorphic_allocator. And custom allocators have been a thing since C++98 via the second template parameter. PMR isn’t the only alternative.
Things people suggested I look into
- Compile-time custom allocators (hk19921992): no virtual dispatch, full inlining potential
- boost::static_vector (SuperV1234): fixed capacity, no heap
- ETL (lxbrtn): Embedded Template Library
- Static allocation only (RogerLeigh): what safety-critical codebases actually do
What I Actually Learned
I published confidently wrong results and 44 strangers on the internet caught what I missed. That stung, but honestly I learned more from the Reddit thread than from writing the benchmarks.
The short version:
- PMR monotonic is faster when the buffer is sized right. Pointer bump beats heap search. If your numbers say otherwise, your benchmark is broken.
- monotonic_buffer_resource will silently fall back to the heap when it runs out. Use null_memory_resource() as upstream during dev/benchmarking so you get a crash instead of garbage data.
- --config Release is a no-op on Makefiles/Ninja. Set CMAKE_BUILD_TYPE at configure time. Check VERBOSE=1.
- CV% is misleading when comparing things with different means. For real-time, report absolute jitter.
- Posting publicly is worth the embarrassment. kammce and jwakely didn’t just say “you’re wrong” — they showed exactly why, with code. That’s worth more than any textbook.
- If a pointer bump benchmarks slower than malloc, your benchmark is wrong. Trust the theory enough to question surprising results.
Corrected Benchmark Results
I haven’t re-run on my hardware yet. Once I do, I’ll drop the numbers here. Based on kammce’s re-run, expect monotonic PMR to be ~1.5× faster with ~5× better determinism, and the mixed workload to be ~1.3× faster with ~2.8× better determinism.
TODO: re-run with 10× buffer, null_memory_resource() upstream, -DCMAKE_BUILD_TYPE=Release verified via VERBOSE=1, with and without reserve().
References
- C++17 Standard: Memory Resources
- P0220R1: Adopt Library Fundamentals V1 TS Components for C++17
- Pablo Halpern: “Allocators@C++Now 2017”
- Bjarne Stroustrup: “A Tour of C++” (2nd Edition), Section 13.6
- MISRA C++:2008 Guidelines - Rule 18-4-1: Dynamic heap memory allocation shall not be used
- JSF AV C++ Coding Standards - AV Rule 206: Dynamic memory allocation shall not be used
- Embedded Artistry: “Practical Guide to Bare Metal C++”
- ARM Cortex-M7 Technical Reference Manual - Memory system performance characteristics
- A. Mahmutbegović, C++ in Embedded Systems: A practical transition from C to modern C++, Packt Publishing, 2024.
- Reddit r/cpp discussion - Community feedback that caught the benchmark bugs
- P1144R7 — Object relocation - Trivial relocatability and PMR concerns