Background: the Windows build of our business system, deployed locally, uses about 5% of the CPU. The Linux build, deployed in a CentOS 8 virtual machine running under VMware, shows abnormal resource usage.
Problem description
- Host machine: Windows 10 Enterprise Edition
- VMware: 17.5
- Virtual machine: CentOS 8
The virtual machine is allocated 4 cores and 8 GB of memory (4C8GB), and the business system is deployed and started inside it. Inside the virtual machine, the top command shows that CPU utilization is not high. On the Windows host, however, Task Manager shows very high CPU utilization, and the process list reveals that the VMware process is consuming most of it.
+---------------------------+
|  Windows                  |
|                           |
|  +--------------------+   |
|  |  VMware            |   |
|  |  Program           |   |
|  +--------------------+   |
|                           |
+---------------------------+
Key points
Troubleshooting this issue was not smooth, because the root cause was not in the business system itself but in the virtual machine. The difficulty lay in shifting focus from routine business code to system load, then from anomalies in the load data to interrupts, and finally in pinpointing the key question: what could affect the efficiency of interrupt handling under VMware? This article first explains the relevant background and then presents the solution.
Hyper-V
Virtualization technology for Windows operating systems has undergone a significant change. When Microsoft first released WSL, enabling the Hyper-V service would prevent the simultaneous use of VMware virtual machines. It wasn’t until subsequent versions that VMware became compatible with the Hyper-V service.
System load
In a Linux system, “load” refers to the number of processes that are running or waiting to run; Linux also counts processes in uninterruptible sleep, which are typically waiting on I/O. Load is reported as three averaged numbers, which can be viewed by running the “uptime” command or the “top” command.
Specifically, these three numbers represent:
- Average number of processes in the run queue over the past 1 minute
- Average number of processes in the run queue over the past 5 minutes
- Average number of processes in the run queue over the past 15 minutes
Load therefore measures the demand for processor time. If it exceeds the logical CPU count of the system, the system load is high: more processes want to run than there are CPUs to run them. This can make the system slow or unresponsive, depending on how far the load exceeds the CPU count and on the system’s configuration and performance.
Ideally, the load should be maintained within the logical CPU count of the system to optimize performance. If the load consistently exceeds the number of CPUs, further analysis of processes in the system may be necessary to identify the cause of the high load and take appropriate measures to adjust system resource allocation or optimize process execution methods.
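As a rough illustration of the rule of thumb above, the following Python sketch compares the load averages with the logical CPU count. It assumes a Unix-like system where os.getloadavg() is available. The helper names load_per_cpu and is_overloaded are made up for this example, and the threshold of 1.0 per CPU is only the heuristic described above, not a hard limit.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalize a load average by the number of logical CPUs."""
    return load_avg / cpu_count


def is_overloaded(load_avg, cpu_count):
    """Heuristic from the text: load above the logical CPU count is 'high'."""
    return load_per_cpu(load_avg, cpu_count) > 1.0


if __name__ == "__main__":
    # os.getloadavg() does not exist on Windows, so check before calling it.
    if hasattr(os, "getloadavg"):
        cpus = os.cpu_count() or 1
        for label, value in zip(("1 min", "5 min", "15 min"), os.getloadavg()):
            state = "high" if is_overloaded(value, cpus) else "ok"
            print(f"{label}: load={value:.2f} per-cpu={load_per_cpu(value, cpus):.2f} ({state})")
    else:
        print("Load averages are not available on this platform.")
```

On the 4-core virtual machine described earlier, a load of 2.0 would give 0.5 per CPU, which this heuristic treats as fine.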
Analyzing load with mpstat
The mpstat command reports per-processor statistics: CPU utilization broken down by category (user, system, I/O wait, hardware interrupts, soft interrupts, idle, and so on) and, with the -I option, interrupt counts. It is part of the sysstat package and is useful for seeing where CPU time goes when the load looks abnormal. The following are the steps for using mpstat to perform load analysis:
- Install sysstat. If sysstat is not installed on your system, install it with the package manager for your distribution.
- Run mpstat. Use the mpstat command to view per-CPU usage. Without arguments, mpstat prints a single report of averages since boot; specify an interval in seconds to sample repeatedly. For example, mpstat -P ALL 2 reports every CPU once every 2 seconds. In the output, %irq is the percentage of CPU time spent servicing hardware interrupts, and %soft is the percentage spent servicing soft interrupts.

    01:32:33 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
    01:32:35 PM  all    0.00    0.00    0.26    0.00    3.73    0.26    0.00    0.00    0.00   95.76
    01:32:35 PM    0    0.00    0.00    0.51    0.00    3.57    0.00    0.00    0.00    0.00   95.92
    01:32:35 PM    1    0.00    0.00    0.00    0.00    3.59    0.51    0.00    0.00    0.00   95.90
    01:32:35 PM    2    0.00    0.00    0.00    0.00    4.15    0.00    0.00    0.00    0.00   95.85
    01:32:35 PM    3    0.00    0.00    0.52    0.00    3.61    0.52    0.00    0.00    0.00   95.36

- Analyze the output. The output of mpstat shows the utilization of each CPU and the average across all CPUs. Compare the per-CPU values with the average and look at which category (for example %irq, %soft, or %iowait) accounts for the time. If one category is unexpectedly high, further analysis can identify which processes or devices are causing it and whether there is a performance bottleneck.
- Combine with other tools. In addition to mpstat, tools such as sar, pidstat, and iostat can be used for comprehensive system performance analysis. By combining the output of multiple tools, you can gain a more complete understanding of system load and identify the root causes of performance issues.
Interrupts
I won’t elaborate on interrupts here. Recommended reading: System Guide for Application Developers - CPU Part on Soft Interrupts.
Frequent triggering of soft interrupts is also reflected in the system load.
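To see how often soft interrupts fire, one option on Linux is to sample /proc/softirqs twice and compare the counters. The following Python sketch does that under a few assumptions: the file has the usual layout (a header row of CPU names followed by one "NAME: count count ..." row per soft interrupt type), and the function names parse_softirqs and softirq_rates are invented for this example.

```python
import os
import time


def parse_softirqs(text):
    """Return {softirq_name: total count summed over all CPUs}."""
    totals = {}
    for line in text.splitlines():
        name, _, counts = line.partition(":")
        if not counts.strip():
            continue  # header row (no colon) or empty line
        totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals


def softirq_rates(before, after, seconds):
    """Per-second increase of each counter between two snapshots."""
    return {name: (count - before.get(name, 0)) / seconds
            for name, count in after.items()}


if __name__ == "__main__":
    path = "/proc/softirqs"
    if os.path.exists(path):
        with open(path) as f:
            first = parse_softirqs(f.read())
        time.sleep(1)
        with open(path) as f:
            second = parse_softirqs(f.read())
        rates = softirq_rates(first, second, 1.0)
        # Print the busiest soft interrupt types first.
        for name, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
            print(f"{name:>10}: {rate:.0f}/s")
    else:
        print("/proc/softirqs is not available on this system.")
```

A type whose rate is far above the others is a reasonable place to start looking.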
Troubleshooting
Looking at CPU utilization alone could not pinpoint the issue, so we began to suspect an anomaly at the system level: perhaps the Linux guest had an excessively high load, causing VMware to consume excessive CPU on the host. Using mpstat on the local virtual machine, we found abnormal irq usage, approaching 25% on a single core. Under normal circumstances, with the business processes running but idle, irq should be around 5%.
In another team’s development environment, CentOS 7 is deployed on VMware and resource usage appears normal. The Shanghai development environment also runs on VMware, but there we cannot directly observe the CPU usage of the host machine. That left us with multiple variables: the VMware virtual machine, the Linux operating system, and the GCC version.
We then turned to the test environment. Shenzhen’s test environment is deployed on physical machines, runs CentOS 8, and runs a build of the service compiled with an older version of GCC. Interestingly, irq usage in the Shenzhen environment is normal.
To rule out issues introduced by the GCC version, we deployed a build compiled with a newer version of GCC to the Shenzhen environment, and the results were also normal.
The picture seemed to be getting clearer, and we started to suspect the operating system. After all, CentOS 8 is no longer officially supported. But even after clean installations of both CentOS 7 and CentOS 8 in the local virtual machine, the problem persisted.
At this point, we turned to the only remaining variable: the VMware virtualization software. Then an idea struck us: Hyper-V. Could Hyper-V have been enabled previously but never completely turned off, leading to this issue? After all, interrupts delivered to a guest are handled by the virtualization software. Could different virtualization technologies interfere with each other here? These questions warranted further investigation.
Conclusion
Following Microsoft’s official documentation, we completely disabled Hyper-V on the local machine (commands below), and the CPU usage of VMware on the host returned to normal. With this, the problem was finally resolved. As mentioned earlier, the process was tortuous and required comprehensive analysis and judgment. It was also the first time we had traced an issue down to the virtual machine level.
Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-Hypervisor
bcdedit /set hypervisorlaunchtype off
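Both commands take effect only after Windows is restarted. To confirm the setting afterwards, one possible check is to read the hypervisorlaunchtype entry back from bcdedit. The Python sketch below assumes it is run from an elevated prompt on the Windows host and that bcdedit /enum {current} prints one "name value" pair per line; the function name hypervisor_launch_type is made up for this example.

```python
import subprocess
import sys


def hypervisor_launch_type(bcdedit_output):
    """Return the hypervisorlaunchtype value from bcdedit output, or None if absent."""
    for line in bcdedit_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() == "hypervisorlaunchtype":
            return parts[1].strip()
    return None  # the entry was never set explicitly


if __name__ == "__main__":
    if sys.platform == "win32":
        # Requires an elevated (administrator) prompt; otherwise stdout is empty.
        result = subprocess.run(
            ["bcdedit", "/enum", "{current}"],
            capture_output=True, text=True, check=False,
        )
        print("hypervisorlaunchtype:", hypervisor_launch_type(result.stdout))
    else:
        print("bcdedit is only available on Windows; nothing to check here.")
```

After the change described above, the value should read Off.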