Troubleshooting Apache Pulsar 4.1 Bucket-Based Delayed Queue Performance Issues


Introduction

Hey guys, today we're diving deep into a performance issue encountered while using the bucket-based delayed queue in Apache Pulsar 4.1. This is a crucial topic for anyone leveraging Pulsar's delayed messaging capabilities, so let’s get started! We'll break down the problem, the troubleshooting steps, and potential solutions in a way that’s super easy to understand. We're focusing on making sure your Pulsar setup runs smoothly, especially when dealing with high message throughput.

Background on Apache Pulsar Delayed Queues

First off, let's quickly touch on what delayed queues are in Apache Pulsar. Imagine you need to send a message, but not right away. Maybe you want it delivered in a few minutes, hours, or even days. That’s where delayed queues come in. They hold messages until a specified time, then release them for consumption. Pulsar's bucket-based delayed queue is designed to handle this efficiently, but like any system, it can run into snags under heavy load. Understanding the architecture and how messages are managed within these queues is key to troubleshooting performance bottlenecks. We’ll explore how Pulsar’s internal mechanisms work, including how messages are indexed and stored, to give you a solid foundation for tackling any issues that arise.
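To ground that, here's a minimal sketch of scheduling a delayed message with the Pulsar Java client; the service URL is a placeholder and the topic simply reuses the one from the test below. Note that delayed delivery only takes effect for consumers on Shared subscriptions, other subscription types receive the message right away.

    import java.util.concurrent.TimeUnit;
    import org.apache.pulsar.client.api.Producer;
    import org.apache.pulsar.client.api.PulsarClient;
    import org.apache.pulsar.client.api.Schema;

    public class DelayedProducerExample {
        public static void main(String[] args) throws Exception {
            // Placeholder broker URL; adjust for your cluster (and add auth as needed).
            PulsarClient client = PulsarClient.builder()
                    .serviceUrl("pulsar://localhost:6650")
                    .build();
            Producer<String> producer = client.newProducer(Schema.STRING)
                    .topic("persistent://qlm-tn/delay/topic-delay-1")
                    .create();

            // Ask the broker to hold this message for 60 seconds before
            // dispatching it to consumers on a Shared subscription.
            producer.newMessage()
                    .value("hello-delayed")
                    .deliverAfter(60, TimeUnit.SECONDS)
                    .send();

            producer.close();
            client.close();
        }
    }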

The Issue: High Latency with 100,000 TPS

The core of the issue? When testing delayed messages at a throughput of 100,000 messages per second (TPS), significant consumption delays were observed. That's a lot of messages, and any hiccup in the system can quickly turn into a major bottleneck. The goal here is to pinpoint why these delays occur and how to mitigate them. This high TPS scenario puts a lot of stress on the system, revealing potential weaknesses in the delayed queue implementation. We'll dig into the specifics of the test setup and the environment to understand the context of this performance issue fully.

Environment and Setup

User Environment Details

Let's get into the nitty-gritty of the environment where this issue was observed. Here’s the setup:

  • JDK Version: 1.8
  • Pulsar Version: 2.9 (with delayed queue cherry-picked from 4.1)

Using an older JDK and a cherry-picked feature from a newer Pulsar version adds complexity. It’s essential to consider compatibility and potential conflicts. The choice of JDK version can impact performance due to differences in garbage collection and other optimizations. Cherry-picking features, while sometimes necessary, can introduce instability if not done carefully. We'll discuss the implications of this setup and how it might contribute to the performance issues.

Issue Reproduction Steps

To reproduce the issue, the following pulsar-perf command was used:

nohup bin/pulsar-perf produce --auth_plugin org.apache.pulsar.client.impl.auth.AuthenticationToken --auth-params xxxxxxxxxxxxxx -threads 5 -u pulsar://xxxx:6650  -n 4 -s 2048 -r 100000 -dr 1,180 persistent://qlm-tn/delay/topic-delay-1  > producer.log 2>&1 &

Let’s break this down:

  • pulsar-perf produce: This is the command-line tool for performance testing in Pulsar.
  • --auth_plugin: Specifies the authentication plugin.
  • --auth-params: Authentication parameters.
  • -threads 5: Uses 5 threads to produce messages.
  • -u pulsar://xxxx:6650: Pulsar broker URL.
  • -n 4: Runs 4 producers per topic (this is the --num-producers option, not a message count, so it is not a typo).
  • -s 2048: Message size of 2048 bytes.
  • -r 100000: Sends messages at a rate of 100,000 per second.
  • -dr 1,180: Assigns each message a random delivery delay between 1 and 180 seconds.
  • persistent://qlm-tn/delay/topic-delay-1: The topic to which messages are sent.
  • > producer.log 2>&1 &: Redirects output and errors to a log file and runs the command in the background.

The key here is the high message rate (-r 100000) and the delayed delivery (-dr 1,180). This command simulates a scenario where a large number of messages are being produced with varying delays, which can heavily tax the delayed queue. We’ll analyze how this command stresses the system and what aspects of the delayed queue are most likely to be affected.
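To observe the consumption delay from the other side, a pulsar-perf consumer on the same topic reports end-to-end latency percentiles. The command below is a sketch reusing the same placeholders as the produce command; the -ss (subscription name) and -st (subscription type) flags can differ slightly between versions, so check bin/pulsar-perf consume --help on your build.

nohup bin/pulsar-perf consume --auth_plugin org.apache.pulsar.client.impl.auth.AuthenticationToken --auth-params xxxxxxxxxxxxxx -u pulsar://xxxx:6650 -ss delay-sub -st Shared persistent://qlm-tn/delay/topic-delay-1 > consumer.log 2>&1 &

Keep in mind that pulsar-perf measures end-to-end latency from the publish timestamp, so for delayed messages the reported latency includes the intentional delay itself; what signals a problem is latency well beyond the configured 1 to 180 second range.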

Arthas Analysis: Pinpointing the Bottleneck

Arthas: The Detective Tool

Arthas is a powerful open-source diagnostic tool for Java applications. It allows you to peek inside a running JVM, inspect method execution times, and much more. In this case, Arthas was used to pinpoint the method causing the delay. Using Arthas is like having a super-detailed log of what your application is doing internally. It allows you to see exactly where time is being spent and identify bottlenecks that might not be obvious from standard logging or monitoring.
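For context, a session like the one behind this report looks roughly as follows: attach Arthas to the broker JVM, then trace the suspect method. The fully qualified method name is the one identified below as the culprit; the '#cost > 100' filter (cost in milliseconds) and the -n 5 invocation limit are just example settings to keep the output manageable.

# Attach to the running broker JVM (select the broker process when prompted).
java -jar arthas-boot.jar

# Inside the Arthas console: trace the method and print a per-call breakdown,
# keeping only invocations slower than 100 ms, for the next 5 matches.
trace org.apache.pulsar.broker.delayed.bucket.MutableBucket createImmutableBucketAndAsyncPersistent '#cost > 100' -n 5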

The Culprit: MutableBucket.createImmutableBucketAndAsyncPersistent()

The analysis pointed to org.apache.pulsar.broker.delayed.bucket.MutableBucket:createImmutableBucketAndAsyncPersistent() as the method causing the most significant delay, taking a whopping 1525.23451ms to execute. This is where things get interesting! This method is crucial for creating immutable buckets, which are fundamental to how Pulsar manages delayed messages. Understanding what this method does and why it’s slow is key to solving the performance issue. We’ll dissect the method's functionality and examine the operations it performs to identify the root cause of the delay.

Deep Dive into Method Execution

Here’s a breakdown of the time spent within createImmutableBucketAndAsyncPersistent():

  • org.apache.pulsar.broker.delayed.bucket.DelayedIndexQueue:isEmpty(): Called multiple times, contributing a significant amount of time (3.14% + 3.18%).
  • org.apache.pulsar.broker.delayed.bucket.DelayedIndexQueue:peekTimestamp(): Also called multiple times (3.31% + 3.36%).
  • org.apache.pulsar.broker.delayed.proto.SnapshotSegment:addIndexe(): 3.93% of the time.
  • org.apache.pulsar.broker.delayed.bucket.DelayedIndexQueue:popToObject(): A major time consumer, accounting for 18.37% of the execution time.
  • org.apache.pulsar.broker.delayed.proto.DelayedIndex:getLedgerId() and getEntryId(): 3.13% and 3.38% respectively.
  • org.apache.pulsar.broker.delayed.bucket.MutableBucket:removeIndexBit(): 4.17% of the time.
  • com.google.common.base.Preconditions:checkArgument(): 3.01% of the time.
  • org.apache.pulsar.common.util.collections.TripleLongPriorityQueue:add(): 0.24% of the time.
  • org.roaringbitmap.RoaringBitmap:add(): 4.19% of the time.
  • Various Protobuf operations and RoaringBitmap operations also contribute to the delay.

This detailed breakdown is invaluable. It shows us that several operations within the method are contributing to the latency. The repeated calls to isEmpty() and peekTimestamp(), the time spent in popToObject(), and the operations involving RoaringBitmap are all potential areas for optimization. We’ll explore each of these in detail to understand their impact and how they can be improved.
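To get a feel for why this per-entry pattern adds up at 100,000 messages per second, here is a small, self-contained analogy in plain Java (this is not Pulsar's code, and it only needs the org.roaringbitmap:RoaringBitmap dependency): it drains a priority queue one element at a time, checking emptiness and updating a bitmap on every iteration, which mirrors the isEmpty()/peekTimestamp()/popToObject()/RoaringBitmap.add() shape of the trace above. The queue contents and sizes are made up purely for illustration.

    import java.util.PriorityQueue;
    import org.roaringbitmap.RoaringBitmap;

    public class DrainLoopAnalogy {
        public static void main(String[] args) {
            // Stand-in for the delayed-index queue: [timestamp, ledgerId, entryId] per message.
            PriorityQueue<long[]> queue = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
            long now = System.currentTimeMillis();
            for (int i = 0; i < 1_000_000; i++) {
                queue.add(new long[] {now + (i % 180_000), 1L, i});
            }

            RoaringBitmap entryIds = new RoaringBitmap();
            long start = System.nanoTime();

            // Per-entry drain: emptiness check, peek, pop, then a bitmap update on
            // every single iteration -- the same shape of work the trace points at.
            while (!queue.isEmpty() && queue.peek()[0] <= now + 180_000) {
                long[] index = queue.poll();
                entryIds.add((int) index[2]);
            }

            System.out.printf("drained %d entries in %.1f ms%n",
                    entryIds.getLongCardinality(), (System.nanoTime() - start) / 1e6);
        }
    }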

Identifying the Root Cause and Potential Solutions

Analyzing the Bottlenecks

Based on the Arthas analysis, we can pinpoint several key areas that are contributing to the performance bottleneck:

  1. DelayedIndexQueue Operations: The repeated calls to isEmpty() and peekTimestamp() suggest that the queue management within the MutableBucket is a significant overhead. These operations are likely being performed frequently as part of the process of creating the immutable bucket. The efficiency of these queue operations directly impacts the overall performance.
  2. popToObject(): This method, consuming 18.37% of the time, is a major concern. It likely involves removing elements from the queue, which can be an expensive operation, especially if it involves re-indexing or re-organizing the queue. We need to understand what popToObject() is doing internally and why it’s taking so long.
  3. RoaringBitmap Operations: The add() operations on RoaringBitmap indicate that bitmap manipulations are also contributing to the delay. RoaringBitmap is used for efficient storage of sparse sets of integers, which in this case, likely represent the indexes of delayed messages. While RoaringBitmap is generally efficient, frequent additions can still be costly, especially at high throughput.
  4. Protobuf Serialization: Operations involving Protobuf, such as building and clearing snapshot metadata, also add overhead. Protobuf is used for serializing data, and while it's generally efficient, serialization and deserialization can become bottlenecks under heavy load.

Potential Solutions and Optimizations

Now that we’ve identified the bottlenecks, let’s brainstorm some potential solutions:

  1. Optimize DelayedIndexQueue Operations: We can look into optimizing the isEmpty() and peekTimestamp() methods. Caching results, reducing the frequency of calls, or using more efficient data structures could help. Reducing the overhead of queue management can significantly improve performance.
  2. Improve popToObject() Efficiency: Understanding the internal workings of popToObject() is crucial. If it involves expensive re-indexing, we can explore alternative algorithms or data structures that minimize this overhead. Reducing the complexity of the removal operation can have a big impact.
  3. Batch RoaringBitmap Updates: Instead of adding indexes one by one, we can batch the updates to RoaringBitmap (see the sketch after this list). This can reduce the number of operations and improve overall performance. Batching operations is a common technique for reducing overhead in high-throughput systems.
  4. Reduce Protobuf Overhead: We can explore ways to reduce the overhead of Protobuf serialization. This might involve optimizing the data structures being serialized, using more efficient serialization methods, or caching serialized data. Reducing serialization costs can improve performance, especially when dealing with large volumes of data.
  5. Upgrade Pulsar Version: Since the delayed queue implementation was cherry-picked from version 4.1, upgrading to a stable version of 4.1 or later might include performance improvements and bug fixes that address these issues. Upgrading to a stable version ensures you're benefiting from the latest optimizations and fixes.
  6. Review Configuration: Check Pulsar’s configuration related to delayed messages (an example broker.conf snippet follows below). Parameters like the number of buckets, bucket size, and the frequency of snapshot creation can impact performance. Tuning these parameters based on your specific workload can optimize the system.
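To make point 3 concrete, here is a small, self-contained sketch (again, not Pulsar code; it only needs the org.roaringbitmap dependency) that contrasts per-element RoaringBitmap.add() calls with a single batched addN() call over the same entry IDs. addN() is assumed to be available in the RoaringBitmap version on your classpath; if it isn't, RoaringBitmap.bitmapOf() is an alternative.

    import org.roaringbitmap.RoaringBitmap;

    public class BitmapBatchingSketch {
        public static void main(String[] args) {
            int n = 1_000_000;
            int[] entryIds = new int[n];
            for (int i = 0; i < n; i++) {
                entryIds[i] = i;
            }

            // Baseline: one add() call per delayed index.
            RoaringBitmap oneByOne = new RoaringBitmap();
            long t0 = System.nanoTime();
            for (int id : entryIds) {
                oneByOne.add(id);
            }
            long perElementNs = System.nanoTime() - t0;

            // Batched: hand the whole array to the bitmap in a single call.
            RoaringBitmap batched = new RoaringBitmap();
            long t1 = System.nanoTime();
            batched.addN(entryIds, 0, n);
            long batchedNs = System.nanoTime() - t1;

            System.out.printf("per-element: %.2f ms, batched: %.2f ms, same contents: %b%n",
                    perElementNs / 1e6, batchedNs / 1e6, oneByOne.equals(batched));
        }
    }

And for point 6, these are the broker.conf properties that typically shape bucket-based delayed delivery. The names follow the upstream documentation for the bucket tracker and the values shown are common defaults rather than recommendations; verify both against the build you are actually running, especially with a cherry-picked backport.

    # broker.conf -- bucket-based delayed delivery settings (verify names/defaults in your build)
    delayedDeliveryEnabled=true
    delayedDeliveryTrackerFactoryClassName=org.apache.pulsar.broker.delayed.BucketDelayedDeliveryTrackerFactory
    delayedDeliveryTickTimeMillis=1000
    delayedDeliveryMaxNumBuckets=50
    delayedDeliveryMinIndexCountPerBucket=50000
    delayedDeliveryMaxIndexesPerBucketSnapshotSegment=5000
    delayedDeliveryMaxTimeStepPerBucketSnapshotSegmentInMillis=300000

Roughly speaking, a larger delayedDeliveryMinIndexCountPerBucket means the mutable bucket is sealed less often (fewer calls to createImmutableBucketAndAsyncPersistent()), at the cost of more work per seal and more memory held in the mutable bucket.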

Conclusion and Next Steps

So, guys, we’ve taken a deep dive into a performance issue with bucket-based delayed queues in Apache Pulsar. We've identified the bottlenecks and explored potential solutions. The key takeaway here is that high message throughput can expose performance limitations in complex systems like Pulsar’s delayed queues.

Summary of Findings

We found that the createImmutableBucketAndAsyncPersistent() method is a major bottleneck, with significant time spent in DelayedIndexQueue operations, popToObject(), RoaringBitmap manipulations, and Protobuf serialization. These findings provide a clear roadmap for optimization efforts. Understanding the specific operations that are slow allows us to focus our efforts on the areas that will yield the biggest performance gains.

Next Steps

Here’s what I recommend as the next steps:

  1. Implement Optimizations: Try implementing the solutions discussed above, such as optimizing queue operations, batching RoaringBitmap updates, and reducing Protobuf overhead.
  2. Test Thoroughly: After implementing any changes, test the system thoroughly with realistic workloads to ensure that the optimizations are effective and don’t introduce new issues. Rigorous testing is crucial for validating performance improvements and ensuring stability.
  3. Consider Upgrading: Evaluate the possibility of upgrading to a stable version of Pulsar 4.1 or later to take advantage of potential performance improvements and bug fixes.
  4. Monitor Performance: Continuously monitor the performance of the delayed queues to identify any regressions or new bottlenecks that may arise. Monitoring is essential for maintaining optimal performance over time; a quick way to check the relevant broker metrics is shown below.
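On the monitoring point, brokers that ship the bucket-based tracker expose Prometheus metrics for it (bucket counts, snapshot operation counts and latencies, loaded index counts). A quick way to eyeball them is to scrape the broker's metrics endpoint; the metric prefix below follows the upstream naming, so confirm it against your build.

# Scrape the broker's Prometheus endpoint and filter for the delayed-index bucket metrics.
curl -s http://<broker-host>:8080/metrics/ | grep pulsar_delayed_message_index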

By addressing these bottlenecks and implementing the suggested optimizations, you can significantly improve the performance of Pulsar’s delayed queues and ensure smooth operation even under high message throughput. Keep experimenting and refining your setup—you've got this!