VMware Virtual Machine CPU Resource Usage Anomaly

Background: The Windows build of the business system, deployed locally, consumes approximately 5% of CPU resources. The Linux build, deployed on CentOS 8 inside a VMware virtual machine, exhibits abnormal resource consumption.

Problem Description

Host Machine: Windows 10 Enterprise
VMware: 17.5
Virtual Machine: CentOS 8, allocated 4 cores and 8 GB of memory (4C8GB)

The business system runs on Linux inside the virtual machine. Inside the guest, the top command shows that CPU utilization is not high, yet Task Manager on the Windows host shows very high CPU consumption. Examining the host’s processes reveals that the VMware process accounts for most of it.

+---------------------------+
|  Windows                  |
|                           |
|   +--------------------+  |
|   |       VMware       |  |
|   |       Program      |  |
|   +--------------------+  |
|                           |
+---------------------------+

Key Concepts

Troubleshooting this issue wasn’t smooth, because the root cause was not the business system itself but the virtual machine. The investigation had to shift from the usual suspect, business code, to system load, then from the abnormal load data to soft interrupts, and finally to the critical question: what affects how efficiently VMware handles soft interrupts? This article first introduces the relevant concepts and then presents the solution.

Hyper-V

Virtualization on Windows has changed significantly over time. When Microsoft initially released WSL, enabling the Hyper-V service prevented VMware virtual machines from running at the same time. Only in later versions could VMware coexist with the Hyper-V service.

System Load

In Linux systems, “load” refers to the number of processes that are running, waiting to run, or blocked in uninterruptible sleep (usually waiting on I/O). It is reported as three load averages covering the past 1, 5, and 15 minutes, which can be viewed with the uptime command or the top command.

Specifically, these three numbers represent:

1-minute load: The average number of such processes over the past 1 minute.
5-minute load: The average number of such processes over the past 5 minutes.
15-minute load: The average number of such processes over the past 15 minutes.

If the load exceeds the system’s logical CPU count, more processes want CPU time than the system can serve at once, so some of them have to wait. This can make the system slow or unresponsive, depending on the severity of the load and on the configuration and performance of the system.

Ideally, the load stays at or below the logical CPU count. If it consistently exceeds the CPU count, analyze the processes in the system to identify the cause, then adjust resource allocation or optimize how the processes run.
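The comparison between load and logical CPU count can be sketched in a few lines of Python. This is only an illustration: the helper names and the threshold of 1.0 per CPU are assumptions made here, not part of the original investigation.

```python
import os


def load_per_cpu(load_averages, cpu_count):
    """Divide the 1/5/15-minute load averages by the logical CPU count."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(load / cpu_count, 2) for load in load_averages)


def is_overloaded(load_averages, cpu_count, threshold=1.0):
    """True when both the 5- and 15-minute averages exceed threshold per CPU.

    The 1-minute value is ignored so that a short spike is not reported
    as sustained overload. The threshold is an assumed rule of thumb.
    """
    _, five, fifteen = load_per_cpu(load_averages, cpu_count)
    return five > threshold and fifteen > threshold


def current_load_per_cpu():
    """Read the live values; os.getloadavg() is only available on Unix."""
    return load_per_cpu(os.getloadavg(), os.cpu_count() or 1)
```

On the 4-core guest described above, a load average of 8.0 corresponds to 2.0 per CPU, meaning roughly half of the demand is waiting at any moment.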

Analyzing Load with mpstat

The mpstat command, part of the sysstat package, reports statistics for one or more processors: the share of CPU time spent in user mode, in system mode, waiting for I/O, servicing hardware interrupts (%irq), servicing soft interrupts (%soft), and idle. It does not report load averages itself, but it shows where CPU time goes when the load looks abnormal. Here’s how to use it:

Install sysstat: If sysstat isn’t installed on your system, use your system’s package manager to install it.
Run mpstat: Without arguments, mpstat prints a single report of averages since boot. Pass an interval in seconds to get continuous output. For example, mpstat -P ALL 2 reports every processor once every 2 seconds. In the output, %irq is the share of CPU time spent servicing hardware interrupts, and %soft is the share spent on soft interrupts.
Analyze Output: The output shows the utilization of each processor, broken down by category. Compare the per-CPU figures with the load averages from uptime. If one category, such as %irq, is unusually high, investigate what is causing it and whether it is a performance bottleneck.
Combine with Other Tools: In addition to mpstat, you can use tools like sar, pidstat, and iostat to comprehensively analyze system performance. By combining the outputs of multiple tools, you can gain a more complete understanding of the system’s load conditions and identify the root causes of performance issues.
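To make the %irq and %soft columns less opaque, here is a rough sketch of how such percentages can be derived from two snapshots of /proc/stat. It is a simplification and not mpstat’s actual implementation; the function names are made up for this example, and it only uses the first eight counters of each per-CPU line (guest time is already included in user time, so it is left out of the total).

```python
import time

# First eight counters on each "cpuN" line of /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text):
    """Return {cpu_name: {field: ticks}} for the per-CPU lines of /proc/stat."""
    cpus = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 1 + len(FIELDS):
            continue
        name = parts[0]
        # Keep "cpu0", "cpu1", ... and skip the aggregate "cpu" line.
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        values = [int(value) for value in parts[1:1 + len(FIELDS)]]
        cpus[name] = dict(zip(FIELDS, values))
    return cpus


def cpu_percentages(before, after):
    """Percent of elapsed CPU time per field between two snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    result = {}
    for cpu, new in second.items():
        old = first.get(cpu)
        if old is None:
            continue
        deltas = {field: new[field] - old[field] for field in FIELDS}
        total = sum(deltas.values())
        if total <= 0:
            continue
        result[cpu] = {f: round(100.0 * d / total, 1) for f, d in deltas.items()}
    return result


def sample(interval=2.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(interval)
    with open(path) as handle:
        after = handle.read()
    return cpu_percentages(before, after)
```

Calling sample(2.0) inside the guest gives roughly the same view as mpstat -P ALL 2 for a single interval: the "irq" and "softirq" entries correspond to the %irq and %soft columns.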

Interrupt

This section doesn’t go into much detail. Recommended reading: System Guide for Application Developers - CPU Part - Soft Interrupt. Frequently triggered soft interrupts are also reflected in system load.
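One way to see which soft interrupts fire most often is to compare two snapshots of /proc/softirqs, which lists one row per softirq type and one column per CPU. The sketch below is illustrative only; the helper names are invented for this example and it was not part of the original troubleshooting.

```python
def parse_softirqs(text):
    """Return {softirq_type: count summed over all CPUs} from /proc/softirqs."""
    totals = {}
    for line in text.splitlines():
        name, separator, counts = line.partition(":")
        if not separator:
            continue  # header row listing CPU0, CPU1, ...
        totals[name.strip()] = sum(int(value) for value in counts.split())
    return totals


def busiest_softirqs(before, after, top=3):
    """Softirq types with the largest increase between two snapshots."""
    first, second = parse_softirqs(before), parse_softirqs(after)
    growth = {name: count - first.get(name, 0) for name, count in second.items()}
    ranked = sorted(growth.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Reading /proc/softirqs twice a few seconds apart and passing both texts to busiest_softirqs shows which types (for example TIMER or NET_RX) are growing fastest.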

Troubleshooting

Since analysis from the CPU perspective alone couldn’t pinpoint the issue, we began to suspect that the system itself had become abnormal. Perhaps excessive load inside the Linux guest was causing VMware to consume an unusually high amount of CPU on the host. Running mpstat in the local virtual machine showed that irq utilization was abnormally high, approaching 25% per core; under normal circumstances, with the business processes idle, irq should account for approximately 5%.

A colleague in the group ran CentOS 7 on VMware in his development environment, and resource usage there was normal. The Shanghai development environment also ran on VMware, but we couldn’t directly observe the host machine’s CPU usage. At this point, we faced several variables: the VMware virtual machine, the Linux operating system, and the GCC version.

Shifting our focus to the test environments, the Shenzhen test environment ran CentOS 8 on a physical machine, with services compiled by an older GCC version. There, irq utilization was normal.

To rule out the GCC version, we deployed a build compiled with a newer GCC to the Shenzhen environment. irq utilization was still normal.

The problem seemed to become clearer, and we began to suspect the operating system. After all, CentOS 8 is no longer officially supported. However, the problem persisted even on clean installations of CentOS 7 and CentOS 8.

That left one remaining suspect: the VMware virtual machine software itself. Then it occurred to us: had we enabled Hyper-V at some point without ever fully disabling it? After all, a guest’s interrupts are implemented by the virtualization software. Could different virtualization technologies interfere with each other or expose bugs when they coexist? These questions deserved a closer look.

Conclusion

Following the official Microsoft documentation, we completely disabled the Hyper-V service on the host, after which VMware’s CPU usage returned to normal. This finally resolved the issue. As stated at the beginning, the process was convoluted and arduous, and it required comprehensive analysis and judgment. It was also our first time tracing a problem down to the virtual machine level.

# Run in an elevated PowerShell session, then restart Windows
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-Hypervisor
bcdedit /set hypervisorlaunchtype off
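After the restart, it can help to confirm that the setting took effect. The sketch below parses the text printed by bcdedit /enum and looks for the hypervisorlaunchtype entry. It assumes the output lists each setting as a name followed by whitespace and a value, and that the entry may be missing if it was never set; both the helper names and that layout are assumptions to verify on your own machine.

```python
import subprocess


def hypervisor_launch_type(bcdedit_output):
    """Return the hypervisorlaunchtype value from bcdedit text, or None if absent."""
    for line in bcdedit_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].lower() == "hypervisorlaunchtype":
            return parts[1]
    return None


def hyperv_disabled_at_boot(bcdedit_output):
    """True only when the entry is present and explicitly set to Off."""
    value = hypervisor_launch_type(bcdedit_output)
    return value is not None and value.lower() == "off"


def read_bcdedit():
    """Run bcdedit on the Windows host; requires an elevated prompt."""
    completed = subprocess.run(
        ["bcdedit", "/enum", "{current}"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout
```

On the host, hyperv_disabled_at_boot(read_bcdedit()) should return True once the commands above have been applied and Windows has been restarted.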

https://learn.microsoft.com/zh-cn/troubleshoot/windows-client/application-management/virtualization-apps-not-work-with-hyper-v