Parallel Flame Graph Rendering

How we improved our Flame Graph rendering by moving computation to the backend.

September 3, 2025

In Parca and the Polar Signals Cloud product, one of the ways you can understand your profiling data is by using the Flame Graph visualisation.

In case you didn’t know, Flame Graphs are a visualisation tool used for displaying hierarchical profiling data, where each horizontal rectangle represents a function call and the width corresponds to the time spent or samples collected in that function. They are usually meant to help developers quickly identify performance bottlenecks by making the most resource-intensive code paths visually prominent, i.e., the wider the rectangle, the more time spent in that function and its children.

The Problem

While Flame Graphs are excellent for analysis, rendering them efficiently became a challenge for us. Up until some months ago, our frontend Flame Graph was being rendered by recursive SVG components that didn't scale well, mostly because the coordinate calculation for each frame in the Flame Graph was done by tree traversals.

We chose SVG for our Flame Graph implementation because it gives us granular control over rendering, styling, and interactivity, allowing us to implement features like precise hover states, custom tooltips, and expansion interactions that would be more challenging with canvas-based rendering.

This meant that for every render of the Flame Graph, we were doing expensive tree traversals to draw the SVG components on the frontend. While SVG's flexibility was perfect for our interactive features, the recursive component structure was creating performance bottlenecks that scaled poorly with larger Flame Graphs.

// Complex nested component structure
<FlameGraphNodes childRows={data} level={0}>
  <FlameNode />
  <FlameGraphNodes childRows={children} level={1}> {/* Recursion! */}
    <FlameNode />
    <FlameGraphNodes level={2}> {/* More recursion! */}
      {/* ... potentially hundreds of levels deep */}
    </FlameGraphNodes>
  </FlameGraphNodes>
</FlameGraphNodes>

The nested component approach meant the Flame Graph had to be rendered in a specific sequence, parents before children, which suggested there was room for improvement. So we went back to the drawing board and asked the classic performance optimization question:

What if we move some of the heavy computation to the backend?

In this case, the heavy computation is the coordinate calculation: if we can compute positions without tree traversal, each frame in the Flame Graph knows where it's supposed to be rendered without knowing anything about its parent or children.

So we came up with a design to pre-calculate rendering positions in the backend.

Our Solution

To pre-calculate rendering positions in the backend, we needed to do the following:

Depth Calculation: During the Flame Graph tree construction, we calculate and store the depth for each node, representing its distance from the root of the call stack. This removes the need for runtime tree traversal to determine the Y-coordinates, as each frame can directly compute its vertical position using y = depth * frame_height.

Value Offset Pre-computation: We introduce a new value_offset field that represents the cumulative value from the left edge of the Flame Graph to the start of each frame’s horizontal position. This is also calculated during the Flame Graph tree construction by accumulating the widths of all preceding sibling frames at each level. This pre-computed offset ensures that the frontend can determine where each frame should be positioned horizontally without traversing the tree structure and without any prior knowledge of its neighbors or ancestors.

Direct Coordinate Calculation: With both the depth and value_offset pre-calculated, the frontend can then easily determine any frame’s exact screen coordinates using formulas like x = (value_offset / total) * width and y = depth * frame_height. This approach eliminates all dependency on tree traversal, parent lookups, or cumulative value calculations during rendering. Each frame is now completely self-contained with respect to positioning, enabling truly independent and parallel rendering.

Parent Reference Preservation: While we eliminated the need for parent lookups during rendering, we maintain explicit parent references in the data structure so that we can still easily walk up the stack. Each frame stores the row index of its parent frame, where -1 indicates a root frame and 0 indicates a direct child of the root (since the root sits at row 0). The sketch below shows how these pre-computed fields can be used together.
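
To make this concrete, here's a minimal TypeScript sketch, with illustrative field and variable names rather than the exact Parca API, of how a single frame can compute its own coordinates and walk up the stack using only the pre-computed fields:

// Illustrative sketch: positioning a frame and walking up the stack using only
// the pre-computed depth, value_offset, and parent fields (names are assumed).
interface FrameRow {
  depth: number;       // distance from the root
  valueOffset: number; // cumulative value to the left edge of this frame
  cumulative: number;  // width of the frame in profile values
  parent: number;      // row index of the parent frame, -1 for root frames
}

const FRAME_HEIGHT = 16; // assumed pixel height per row

// Compute screen coordinates without looking at any other frame.
function frameCoordinates(frame: FrameRow, total: number, canvasWidth: number) {
  const x = (frame.valueOffset / total) * canvasWidth;
  const y = frame.depth * FRAME_HEIGHT;
  const width = (frame.cumulative / total) * canvasWidth;
  return {x, y, width};
}

// Walk up the stack using only the stored parent indices.
function ancestors(rows: FrameRow[], row: number): number[] {
  const path: number[] = [];
  for (let p = rows[row].parent; p !== -1; p = rows[p].parent) {
    path.push(p);
  }
  return path;
}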

Depth-First Ordering

In PR #5643, a depth field was added to the Flame Graph builder to track how deep each frame is in the stack trace, starting from 0 for the root frame. We modified the flamegraphBuilder to include depth tracking during tree construction:

type flamegraphBuilder struct {
    // ... existing fields ...
    builderDepth *array.Uint32Builder // Builder for depth values
    maxHeight    int32                // Track max depth for optimization
    height       int32                // Current stack height during building
}

During stack trace processing, each frame gets assigned its depth value as we traverse deeper into the call stack:

// During stack trace processing
fb.height = 1 // Start with root + label row
fb.builderDepth.Append(uint32(fb.height)) // Store depth for each frame
fb.maxHeight = max(fb.maxHeight, fb.height) // Track maximum depth

Value Offset

In PR #5651, the logic for the value offset computation was implemented by adding a new valueOffset field to the Flame Graph builder.

type flamegraphBuilder struct {
    // ... existing fields ...
    valueOffset array.Builder
}

The offset for each frame is then computed by accumulating the widths of its preceding sibling frames. What does that mean?

valueOffset := te.valueOffset
for _, cr := range fb.childrenList[te.row] {
    if v := fb.builderCumulative.Value(cr); v > int64(cumThreshold) {
        trimmingQueue.push(trimmingElement{
            row:         cr,
            parent:      row,
            valueOffset: valueOffset,
        })
        valueOffset += fb.builderCumulative.Value(cr)
    } else {
        fb.trimmed += v
    }
}

It essentially means that each frame begins with the valueOffset passed down from its parent (te.valueOffset); each child of the frame we're currently looking at gets assigned the current valueOffset, after which the child's cumulative value is added to the valueOffset for the next sibling.

Visually, you can also think of it like this: If you have a parent frame with three children:

  • Child A (width: 100): gets valueOffset = 0 (assuming parent frame started at 0)
  • Child B (width: 200): gets valueOffset = 100 (0 + 100)
  • Child C (width: 150): gets valueOffset = 300 (100 + 200)
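
Stripping away the trimming logic, the accumulation reduces to something like this sketch (written in TypeScript for readability; the real code lives in the Go builder):

// Simplified sketch of the sibling offset accumulation (trimming omitted).
// Each child starts where the previous sibling ended.
function assignOffsets(parentOffset: number, childWidths: number[]): number[] {
  const offsets: number[] = [];
  let valueOffset = parentOffset;
  for (const width of childWidths) {
    offsets.push(valueOffset); // this child's left edge
    valueOffset += width;      // the next sibling starts after this child
  }
  return offsets;
}

// The example above: three children with widths 100, 200, and 150
// under a parent that starts at 0.
assignOffsets(0, [100, 200, 150]); // => [0, 100, 300]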

Pre-calculated tree structure

In PR #5644, the parent field was also added to the Flame Graph builder.

type flamegraphBuilder struct {
    // ... existing fields ...
    builderParent *array.Int32Builder // Parent indices (-1 for root)
    parent        parent              // Current parent tracker
}

Parent relationships are then established during the tree construction with semantic meaning:

// Root frame has no parent
fb.builderParent.Append(-1)
// Child frames reference their parent
fb.builderParent.Append(int32(fb.parent.Get()))

Putting it all together, each node now has metadata that roughly looks like the one below.

// Values ready for display
cumulative: Int64, // Width of the frame
flat: Int64, // Self-time/value
diff: Int64, // For differential Flame Graphs
// Visual properties
depth: UInt32, // Y-coordinate
value_offset: Int64, // X-coordinate (left edge)
// Tree structure
parent: Int32, // Parent frame index (-1 for root, 0 for root children)

With all of the above now in place on the backend, we could do something like the below on the frontend: we just iterate over the rows, and all the nodes can be drawn in parallel because each one already knows where it should be rendered.

// Flat, parallel rendering
const nodes = useMemo(() => {
  const result = [];
  for (let row = 0; row < numRows; row++) {
    // Each node calculates its position independently from the pre-computed columns
    const valueOffset = Number(valueOffsetColumn.get(row));
    const depth = depthColumn.get(row);
    const cumulative = Number(cumulativeColumn.get(row));
    const x = ((valueOffset - selectionOffset) / total) * width;
    const y = depth * ROW_HEIGHT;
    const nodeWidth = (cumulative / total) * width;
    result.push({row, x, y, width: nodeWidth});
  }
  return result;
}, [table]);

return (
  <svg>
    {nodes.map(({row, x, y, width}) => (
      <FlameNode key={row} x={x} y={y} width={width} />
    ))}
  </svg>
);
An illustration of the recursive tree rendering vs parallel rendering

This was already a huge performance optimization over the tree traversal method, and it performed nicely until we noticed another bottleneck.

When working with large profiling data, which happens to be the case for most of our customers, rendering the Flame Graph was a memory-intensive task and would sometimes cause the browser to momentarily freeze up.

So we did some investigation, and the culprit, as seen in the Flame Graph below, was the FlameNodeNoMemo component, which is responsible for iterating over the rows and drawing the frames on the screen. In this case, the component took about 8 seconds to render all the frames on the screen.

Viewport Culling

To fix this, we introduced viewport culling, a.k.a. don’t render off-screen frames.

Viewport culling is an optimization technique borrowed from 3D graphics, where only objects within the visible screen region are rendered. In 3D graphics, this involves calculating a viewing frustum, which is the pyramidal region of 3D space visible to the camera, and culling objects that fall completely outside this volume. By skipping the rendering of invisible objects, the graphics pipeline avoids wasting computational resources on elements that would never appear on screen.

We decided to implement something similar by culling frames that aren’t visible yet in the viewport.

Since we now know the depth of each frame, we can cull vertically and efficiently by grouping the rows into buckets by their depth level. Because these buckets are pre-computed, we don't need to scan all rows, which results in much faster lookups.

Horizontal culling is done by checking horizontal bounds and only iterating through rows visible horizontally in the viewport.

Vertical Culling

Vertical culling uses the depth field to eliminate frames outside the vertical viewport bounds. Instead of scanning through all rows to find frames at visible depths, we use a depth bucketing strategy:

// Pre-compute depth buckets during initialization
const buckets: number[][] = Array.from({length: maxDepth + 1}, () => []);

// Group rows by depth level
for (let row = 0; row < table.numRows; row++) {
  const depth = depthColumn.get(row) ?? 0;
  buckets[depth].push(row);
}

To cull frames in the viewport vertically, we calculate which depth levels are visible based on scroll position:

// Calculate visible depth range with buffer for smooth scrolling
const startDepth = Math.max(0, Math.floor(scrollTop / RowHeight) - 5);
const endDepth = Math.min(
  effectiveDepth,
  Math.ceil((scrollTop + containerHeight) / RowHeight) + 5
);

// Only iterate through visible depth buckets
for (let depth = startDepth; depth <= endDepth && depth < depthBuckets.length; depth++) {
  const rowsAtDepth = depthBuckets[depth];
  // Process only rows at this visible depth level
}

For a Flame Graph with 50,000 rows spanning 200 depth levels, if only 10 depth levels are visible, we examine ~2,500 rows instead of all 50,000.

Horizontal Culling

Horizontal culling uses the value_offset field to eliminate frames outside the horizontal viewport bounds. Each frame's horizontal position is determined by its relationship to the currently selected frame:

// Get selection bounds for horizontal culling
const selectionOffset = BigInt(valueOffsetColumn?.get(selectedRow) ?? 0);
const selectionCumulative = BigInt(cumulativeColumn?.get(selectedRow) ?? 0);
const selectionOffsetNumber = Number(selectionOffset);
const selectionCumulativeNumber = Number(selectionCumulative);

// Check if frame is within horizontal viewport
const frameOffset = Number(valueOffsetColumn?.get(row) ?? 0);
const frameCumulative = Number(cumulativeColumn?.get(row) ?? 0);

// Skip frames completely outside selection bounds
if (
  frameOffset + frameCumulative <= selectionOffsetNumber ||
  frameOffset >= selectionOffsetNumber + selectionCumulativeNumber
) {
  continue; // Frame is horizontally outside viewport
}

The horizontal culling logic works by:

1. Using value_offset to determine each frame's X position relative to the selected frame.
2. Comparing frame boundaries against the visible selection range.
3. Skipping frames that fall completely outside the horizontal viewport.

Additionally, we perform size culling as part of horizontal optimization:

// Skip frames too small to be visually meaningful (< 1px width)
const computedWidth = (frameCumulative / totalNumber) * width;
if (computedWidth <= 1) {
  continue; // Frame would be invisible anyway
}
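
Putting the vertical, horizontal, and size checks together, the full culling pass looks roughly like the following sketch (simplified, with illustrative variable names rather than the exact implementation):

// Simplified sketch of the combined culling pass.
const visibleRows: number[] = [];
for (let depth = startDepth; depth <= endDepth && depth < depthBuckets.length; depth++) {
  for (const row of depthBuckets[depth]) {
    const frameOffset = Number(valueOffsetColumn?.get(row) ?? 0);
    const frameCumulative = Number(cumulativeColumn?.get(row) ?? 0);

    // Horizontal culling: skip frames outside the selected range.
    if (
      frameOffset + frameCumulative <= selectionOffsetNumber ||
      frameOffset >= selectionOffsetNumber + selectionCumulativeNumber
    ) {
      continue;
    }

    // Size culling: skip frames narrower than a pixel.
    if ((frameCumulative / totalNumber) * width <= 1) {
      continue;
    }

    visibleRows.push(row);
  }
}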

As a result of the changes above, the Flame Graph rendering time was reduced to 91ms.

Conclusion

By pre-calculating depth, value offsets, and parent relationships, we eliminated the recursive tree traversal that was limiting our rendering performance, and instead introduced parallel rendering of the Flame Graph.

Each frame now contains all the metadata needed to position itself independently, enabling parallel rendering where thousands of frames can be processed simultaneously rather than sequentially.

We've also considered using Canvas to render the Flame Graph, as it could potentially offer better performance by rendering frames as pixels directly on screen as opposed to SVG's individual DOM elements. However, we're unsure how much improvement this would provide given our current viewport culling optimizations, and we'd lose SVG's built-in interactivity features like hover states while needing to implement hit-testing manually.

We have some ideas to further move computations to the backend, like creating depth buckets on the backend instead of the frontend. On the frontend, we currently iterate over all frames to determine the ones that fall within the visible depth range for viewport culling.

This can be improved by pre-grouping frames by their depth in the backend and also using Run-End Encoding (REE) for the depth column. This would give us an O(1) lookup to know exactly which range of rows belongs to each depth level, eliminating the need to scan through all frames when culling on the frontend.
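
As a rough sketch of what that could look like on the frontend, assuming the backend sorts rows by depth so that the run-end encoded depth column has exactly one run per depth level (names here are illustrative):

// Illustrative sketch: with rows pre-sorted by depth and the depth column
// run-end encoded, each depth level maps to one contiguous row range.
interface DepthRun {
  depth: number;  // depth value of this run
  runEnd: number; // exclusive end row index of this run
}

// Example: row 0 is depth 0, rows 1..4 are depth 1, rows 5..11 are depth 2.
const runs: DepthRun[] = [
  {depth: 0, runEnd: 1},
  {depth: 1, runEnd: 5},
  {depth: 2, runEnd: 12},
];

// O(1) lookup of the contiguous row range for a given depth level.
function rowRangeForDepth(runs: DepthRun[], depth: number): [number, number] {
  const start = depth === 0 ? 0 : runs[depth - 1].runEnd;
  return [start, runs[depth].runEnd];
}

// Vertical culling could then slice the visible rows directly,
// without scanning or bucketing on the frontend.
rowRangeForDepth(runs, 1); // => [1, 5]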

Try out the faster parallel Flame Graph rendering yourself at cloud.polarsignals.com
