REDUCE LONG GC PAUSES

In this post, we share 9 tips that will help reduce your GC pause times:

  1. Start Tuning from Scratch
  2. Resize Heap
  3. Right GC Algorithm
  4. Adjust Internal Memory Regions Size
  5. Tune GC Algorithm Settings
  6. Address the Causes of GC Events
  7. Disable Explicit GC
  8. Allocate Sufficient System Capacity
  9. Reduce Object Creation Rate

Let's review these tips and their benefits in this post.

1. Start Tuning from Scratch

Have you looked at the JVM arguments configured for your application? When you do, the following questions will arise: What are they? What do they do? Are they even relevant? Be advised that there are 600+ JVM arguments related to JVM memory and garbage collection, and your application has probably accumulated its share of them over the years. Contrary to common belief, in several applications these unfamiliar arguments turn out to be counterproductive, degrading performance instead of enhancing it. They might have been relevant when your application first went live (5-10 years ago). Since then, your traffic volume has likely changed, some of the arguments may have been deprecated, and their default values may have changed. Carrying old JVM arguments forward can therefore result in poor GC behavior.

It's highly recommended to remove all old arguments and start tuning from scratch. The JVM itself has many heuristics and a good deal of intelligence for tuning itself. Start with the -Xmx (maximum heap size) argument, study the JVM’s performance behavior, and only then add new JVM arguments.
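
Before removing anything, it helps to see which arguments the running JVM actually received. The following is a minimal sketch, not a definitive audit tool: the class name JvmArgsAudit and the helper tuningFlags are made up for illustration, and filtering on the -X prefix is only an assumption about where memory and GC options usually live.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of the original post.
class JvmArgsAudit {

    // Keeps only the -X and -XX options, which is where heap and GC flags usually appear.
    static List<String> tuningFlags(List<String> args) {
        return args.stream()
                .filter(a -> a.startsWith("-X"))
                .collect(Collectors.toList());
    }

    public static void main(String[] unused) {
        // Arguments passed to the JVM itself (not the arguments of the main class).
        List<String> inputArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        tuningFlags(inputArgs).forEach(System.out::println);
    }
}
```

Running this inside the application (or calling the helper from a diagnostics endpoint) gives you the list of flags to question one by one.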

Here is a real case study from CloudBees, a parent company behind Jenkins, on how they started the GC tuning from scratch and improved GC performance by 3500%.




2. Resize Heap

In most applications, heap size is either under-allocated or over-allocated. When the heap is under-allocated, GC runs more frequently, which degrades the application’s performance.

Here is a real case study of an insurance application that was configured to run with an 8GB heap (-Xmx). This heap size wasn’t sufficient to handle the incoming traffic, so the garbage collector was running back-to-back. As we know, whenever a GC event runs, it pauses the application. Thus, with GC events running back-to-back, pause times stretched out and the application became unresponsive in the middle of the day. Upon observing this behavior, the team increased the heap size from 8GB to 12GB. This change reduced the frequency of GC events and significantly improved the application’s overall availability.
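
To get a rough feel for how close an application runs to its heap limit, you can sample heap usage from inside the JVM. This is only a sketch: the class HeapHeadroom and its helper are hypothetical names, and a single sample includes garbage that has not been collected yet, so GC logs remain the more reliable source.

```java
// Hypothetical helper class; not part of the original post.
class HeapHeadroom {

    // Percentage of the maximum heap that is currently occupied.
    static double usedPercent(long usedBytes, long maxBytes) {
        return 100.0 * usedBytes / maxBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                       // upper bound, roughly what -Xmx sets
        long used = rt.totalMemory() - rt.freeMemory();  // occupied heap, garbage included
        System.out.printf("heap used: %d MB of %d MB (%.1f%%)%n",
                used >> 20, max >> 20, usedPercent(used, max));
    }
}
```

If the occupied share stays high even right after GC events, the heap is likely under-allocated for the traffic it serves.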




3. Right GC Algorithm

The garbage collection algorithm plays a pivotal role in GC pause times. OpenJDK has shipped 7 GC algorithms: Serial GC, Parallel GC, CMS GC, G1 GC, Shenandoah GC, ZGC, and Epsilon (note that CMS was deprecated in JDK 9 and removed in JDK 14). This raises the question: 'How do I choose the right GC algorithm for my application?'

Fig: Flow chart for choosing a GC algorithm

The above flow chart will help you to identify the right GC algorithm for your application. You may also refer to this detailed post which highlights the capabilities, advantages and disadvantages of each GC algorithm.

Here is a real-world case study of an application that was used in warehouses to control the robots handling shipments. This application was running with the CMS GC algorithm and suffered from long GC pause times of up to 5 minutes. Yes, you read that correctly, it’s 5 minutes, not 5 seconds 😊. During this 5-minute window, the robots weren’t receiving instructions from the application, which caused a lot of chaos. When the GC algorithm was switched from CMS GC to G1 GC, the pause time instantly dropped from 5 minutes to 2 seconds. This GC algorithm change made a big difference to the warehouse's delivery.
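
If you are not sure which collector your application is running with today, the JVM can tell you. The sketch below uses the standard java.lang.management API; the class name GcInventory is hypothetical. Note the assumption in the comments: the reported time is accumulated collection time, which for concurrent collectors is not necessarily the same as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; not part of the original post.
class GcInventory {

    // Sum of accumulated collection time across all collectors, in milliseconds.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // -1 means the value is undefined
        }
        return total;
    }

    public static void main(String[] args) {
        // The bean names reveal the active algorithm; they differ from collector to collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: %d collections, %d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalGcTimeMillis());
    }
}
```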




4. Adjust Internal Memory Regions Size

JVM memory has the following internal memory regions:

  1. Young Generation
  2. Old Generation
  3. Metaspace
  4. Others

You can visit this video post to learn about the different JVM memory regions. Changing the sizes of these internal regions can also improve GC pause times. Here is a real case study of an application that was suffering from an average GC pause time of 12.5 seconds. The application’s Young Generation was configured at 14.65GB, and its Old Generation was configured at the same 14.65GB. After the Young Generation was reduced to 1GB, the average GC pause time dropped to 138 ms, a 98.9% improvement.
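
To see how these regions are currently sized in your own JVM, you can list its memory pools. This is a small sketch with a hypothetical class name (MemoryRegions); the pool names it prints depend on the GC algorithm in use, so treat the output as something to inspect rather than something to parse.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the original post.
class MemoryRegions {

    // Names of all memory pools the JVM exposes (young, old, Metaspace and others).
    static List<String> poolNames() {
        List<String> names = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            names.add(pool.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (u == null) {
                continue; // the pool is no longer valid
            }
            // getMax() is -1 when the pool has no defined upper bound.
            System.out.printf("%-35s %-16s used=%d KB max=%d bytes%n",
                    pool.getName(), pool.getType(), u.getUsed() / 1024, u.getMax());
        }
    }
}
```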




5. Tune GC Algorithm Settings

GC pause time is also influenced by the specific JVM arguments you configure. As mentioned in 'Tip #1 Start Tuning from Scratch', there are 600+ JVM arguments related to memory and GC settings. Choosing the right ones from this lengthy, poorly documented list is tedious. Thus, we have curated a small set of JVM arguments for each GC algorithm, listed below. Use the arguments pertaining to your GC algorithm to optimize the GC pause time.

  1. Serial GC Tuning Parameters
  2. Parallel GC Tuning Parameters
  3. CMS GC Tuning Parameters
  4. G1 GC Tuning Parameters
  5. Shenandoah Tuning Parameters
  6. ZGC Tuning Parameters



6. Address the Causes of GC Events

GC events are triggered by various causes, such as Allocation Failure, Promotion Failure, and Evacuation Failure. The cause of each GC event is reported in the GC log file. When you analyze the GC log file using tools like GCeasy, they present a consolidated summary of the causes, as shown in the figure below.

Fig: GC Causes Summary reported by GCeasy

By studying these reasons, you can tune the GC settings accordingly. In fact, the GCeasy tool provides recommendations for GC settings that need to be adjusted based on these GC causes. You can try implementing those recommended settings.

Fig: GC recommendation reported by GCeasy

Here is a real case study of an application that was suffering from long GC pauses due to Allocation Failures. Allocation Failures occur when there isn’t sufficient memory in the young generation to create new objects. The team therefore adjusted the young generation size and saw a dramatic reduction in GC pause time.
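
GC causes can also be observed from inside a running JVM, without waiting for a log file. The following is a sketch under stated assumptions: it relies on the HotSpot-specific com.sun.management.GarbageCollectionNotificationInfo API, the class name GcCauseTally is hypothetical, and the System.gc() call in main is there only to provoke an event for the demo.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.management.NotificationEmitter;
import javax.management.openmbean.CompositeData;

// Hypothetical helper class; not part of the original post.
class GcCauseTally {
    private static final Map<String, Integer> COUNTS = new ConcurrentHashMap<>();

    // Increments the counter for one GC cause.
    static void record(String cause) {
        COUNTS.merge(cause, 1, Integer::sum);
    }

    static Map<String, Integer> counts() {
        return COUNTS;
    }

    // Registers a listener on every collector that can emit notifications.
    static void install() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (!(gc instanceof NotificationEmitter)) {
                continue;
            }
            ((NotificationEmitter) gc).addNotificationListener((n, handback) -> {
                if (GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                        .equals(n.getType())) {
                    GarbageCollectionNotificationInfo info =
                            GarbageCollectionNotificationInfo.from((CompositeData) n.getUserData());
                    record(info.getGcCause());
                }
            }, null, null);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        install();
        System.gc();        // demo only: provoke at least one GC event
        Thread.sleep(500);  // notifications are delivered asynchronously
        System.out.println(counts());
    }
}
```

A tally like this is a coarse version of the summary a log analysis tool produces; it tells you which cause dominates, which is the starting point for deciding what to tune.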




7. Disable Explicit GC

Your own application code or third-party libraries/frameworks that are running in your application can invoke the System.gc() API call. When this API is invoked, a Full GC event is triggered in your application. Such explicit GC calls are not advisable as they add needless overhead to the application.

Consider this scenario: say your memory filled up and the JVM triggered a GC event, and right after that your application code triggers a System.gc() call. Now two GC events run back-to-back. The second GC event (triggered by the application code) reclaims few objects, if any, because the first event has already reclaimed the unreferenced ones. Yet the second GC event still pauses your application unnecessarily.

To prevent this unnecessary overhead, you can silence explicit System.gc() calls by passing either one of the following JVM arguments:

a. -XX:+DisableExplicitGC: This JVM argument will silence all System.gc() calls invoked anywhere in your application stack.

b. -XX:+ExplicitGCInvokesConcurrent: This JVM argument allows a System.gc() call to trigger GC; however, the GC event runs concurrently with the application threads, minimizing the impact on GC pause times and maintaining application responsiveness. It only has an effect with collectors that support concurrent cycles, such as CMS and G1.
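
One way to check whether explicit GC calls are honored in a given JVM is to compare collection counts before and after a System.gc() call. This is a rough sketch with a hypothetical class name (ExplicitGcCheck); a GC event triggered for another reason at the same moment would also be counted, so read the result as an indication only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper class; not part of the original post.
class ExplicitGcCheck {

    // Total number of collections reported by all collectors so far.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount()); // -1 means the value is undefined
        }
        return total;
    }

    // Number of collections observed across a single System.gc() call.
    static long collectionsTriggeredByExplicitGc() {
        long before = totalCollections();
        System.gc();
        return totalCollections() - before;
    }

    public static void main(String[] args) {
        long delta = collectionsTriggeredByExplicitGc();
        System.out.println(delta == 0
                ? "System.gc() appears to be ignored"
                : "System.gc() triggered " + delta + " collection(s)");
    }
}
```

Running it once without any flag and once with -XX:+DisableExplicitGC should show the difference.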




8. Allocate Sufficient System Capacity

Garbage Collection performance can sometimes suffer due to insufficient system-level resources such as threads, CPU, and I/O. GC log analysis tools like GCeasy identify these limitations by examining the following two patterns in your GC log files:

a. Sys time > User Time: This pattern indicates that the GC event is spending more time on kernel-level operations (system time) compared to executing user-level code. This could be a sign that your application is facing high contention for system resources, which can hinder GC performance. For more details, you can refer to this article.

b. Real Time > Sys Time + User Time: This pattern means that the elapsed wall-clock time exceeds the combined CPU time (system time plus user time), so the GC threads spent part of the event waiting rather than running. It indicates that the system is overburdened, possibly due to insufficient CPU resources or heavy I/O activity. (The reverse, CPU time exceeding wall-clock time, is expected whenever several GC threads work in parallel.) You can find more information about this pattern.
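
The user, sys and real figures used in these two comparisons can be pulled out of a GC log line with a few lines of code. The sketch below is an assumption-laden illustration: the class GcTimes is hypothetical, and the regular expression assumes the figures appear as user=, sys= and real= values in the style of HotSpot GC logs. It only extracts and compares the numbers; interpreting them is left to you or to a log analysis tool.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original post.
class GcTimes {
    final double user;
    final double sys;
    final double real;

    GcTimes(double user, double sys, double real) {
        this.user = user;
        this.sys = sys;
        this.real = real;
    }

    // Matches e.g. "user=0.09 sys=0.00, real=0.03 secs" or "User=0.02s Sys=0.00s Real=0.01s".
    private static final Pattern TIMES = Pattern.compile(
            "user=([0-9.]+)s?\\s+sys=([0-9.]+)s?,?\\s+real=([0-9.]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns null when the line carries no user/sys/real figures.
    static GcTimes parse(String line) {
        Matcher m = TIMES.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new GcTimes(Double.parseDouble(m.group(1)),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3)));
    }

    // True when more time was spent in the kernel than in user-level code.
    boolean sysExceedsUser() {
        return sys > user;
    }

    // Combined CPU time (user plus sys), to be compared against the real time.
    double cpuTime() {
        return user + sys;
    }

    public static void main(String[] args) {
        GcTimes t = parse("[Times: user=0.20 sys=0.50, real=1.00 secs]");
        System.out.println("sys > user: " + t.sysExceedsUser());
        System.out.println("cpu time: " + t.cpuTime() + " s, real time: " + t.real + " s");
    }
}
```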

To address these system-level limitations, consider taking one or more of the following actions:

a. Increase GC Threads: Allocate more GC threads to your application by adjusting the relevant JVM parameters, for example -XX:ParallelGCThreads and -XX:ConcGCThreads.

b. Add CPU Resources: If your application is running on a machine with limited CPU capacity, consider scaling up by adding more CPU cores. This can provide the additional processing power needed to handle GC operations more efficiently.

c. Improve I/O Bandwidth: Ensure that your application’s I/O operations are optimized and not creating bottlenecks. Poor I/O performance can lead to increased system time, negatively impacting GC performance.
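
Before adding GC threads or CPU cores, it is worth checking how many processors the JVM believes it has, since the default GC thread counts are derived from that number. A minimal sketch follows; the class name CpuCapacity is hypothetical. In containers with CPU limits, the value can be lower than the core count of the host.

```java
// Hypothetical helper class; not part of the original post.
class CpuCapacity {

    // Number of processors the JVM considers available to it.
    static int availableCpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("CPUs visible to the JVM: " + availableCpus());
    }
}
```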




9. Reduce Object Creation Rate

A famous saying attributed to 'The Art of War' goes: ‘The greatest victory is that which requires no battle’. Similarly, instead of focusing on tuning GC events, it is more efficient to prevent them from running in the first place. The time spent in garbage collection is closely tied to the number of objects the application creates: the more objects it creates, the more frequently GC events are triggered; the fewer it creates, the fewer GC events are triggered.

By profiling your application’s memory using tools like HeapHero, you can identify the memory bottlenecks and fix them. Reducing memory consumption will, in turn, reduce the GC impact on your application. Reducing the object creation rate is a tedious and time-consuming process, as it involves studying your application, identifying the bottlenecks, refactoring the code and thoroughly testing it. Even so, it's well worth the effort in the long run, as it leads to significant improvements in application performance and more efficient resource usage.
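
As a small illustration of what reducing object creation can look like in code, the sketch below computes the same sum with a boxed accumulator and with a primitive one. The class AllocationDemo is hypothetical, and the allocation measurement relies on the HotSpot-specific com.sun.management.ThreadMXBean API. The exact byte counts depend on the JVM and on JIT optimizations, so none are claimed here.

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper class; not part of the original post.
class AllocationDemo {

    // Boxed accumulator: each addition may allocate a new Long object.
    static long sumBoxed(int n) {
        Long total = 0L;
        for (long i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    // Primitive accumulator: no object is created per iteration.
    static long sumPrimitive(int n) {
        long total = 0L;
        for (long i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    // Bytes allocated so far by the current thread (HotSpot-specific API).
    static long allocatedBytes() {
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        return bean.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        int n = 5_000_000;
        long a0 = allocatedBytes();
        long boxed = sumBoxed(n);
        long a1 = allocatedBytes();
        long primitive = sumPrimitive(n);
        long a2 = allocatedBytes();
        System.out.println("same result: " + (boxed == primitive));
        System.out.println("boxed loop allocated about " + (a1 - a0) + " bytes");
        System.out.println("primitive loop allocated about " + (a2 - a1) + " bytes");
    }
}
```

Both methods return the same value; the difference lies in how much garbage each one leaves behind for the collector.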




Conclusion

Tuning GC performance often provides more significant rewards than tuning other aspects of your application. It’s one of the most lightweight, non-intrusive ways to improve your application’s performance. Hopefully the tips shared in this post are helpful to you.