While adding a new standby node to the existing DG configuration, the following was noticed in the new standby database's alert log.
Starting background process VKTM
2021-03-30T00:08:42.781044+10:00
Errors in file /opt/app/oracle/diag/rdbms/dbx12/dbx12/trace/dbx12_vktm_32410.trc (incident=41):
ORA-00800: soft external error, arguments: [Set Priority Failed], [VKTM], [Check traces and OS configuration], [Check Oracle document and MOS notes], []
Incident details in: /opt/app/oracle/diag/rdbms/dbx12/dbx12/incident/incdir_41/dbx12_vktm_32410_i41.trc
2021-03-30T00:08:42.782979+10:00
Error attempting to elevate VKTM's priority: no further priority changes will be attempted for this process
VKTM started with pid=5, OS id=32410
MOS Doc 2718971.1 gives a workaround for this issue (it seems this doc has since been made internal).
The problem was related to the cgroup setup. In a database setup where everything is working fine, the cgroup for VKTM would be /. For example, if the VKTM PID is 5207, the following can be used to find out the cgroup setting.
cat /proc/5207/cgroup | grep cpu
10:cpu,cpuacct:/
6:cpuset:/
Any other setting would mean VKTM runs into the above issue.
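The lookup above can be wrapped in a small helper for repeated checks. This is a sketch, not from the original post; the function name is invented here, and the cgroup v1 file format shown above is assumed.

```shell
# Hypothetical helper: print the cpu,cpuacct cgroup path recorded in a
# /proc/<pid>/cgroup file (cgroup v1 format assumed).
cpu_cgroup() {
  grep ':cpu,cpuacct:' "$1" | cut -d: -f3
}

# Against a live VKTM process this would be used as:
#   cpu_cgroup /proc/5207/cgroup
# Any output other than "/" means VKTM is not in the root cpu cgroup.
```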
One of the suggested solutions is to set the hidden parameter _high_priority_processes='VKTM'. But this was already in place, so it was not going to be the solution.
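To confirm what the hidden parameter is currently set to, a commonly used query against the X$ parameter views can be run as SYSDBA (shown as a sketch; hidden parameters are unsupported and should only be changed under Oracle Support's direction):

```sql
-- Show the current value of the hidden parameter _high_priority_processes.
SELECT a.ksppinm  AS parameter,
       b.ksppstvl AS value
FROM   x$ksppi a, x$ksppsv b
WHERE  a.indx = b.indx
AND    a.ksppinm = '_high_priority_processes';
```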
The problem server had the following cgroup setting.
# ps -eaf | grep -i vktm | grep -v grep
oracle 3315 1 0 17:53 ? 00:00:00 ora_vktm_dbx12
oracle 3357 1 0 14:19 ? 00:01:40 asm_vktm_+ASM
# cat /proc/3315/cgroup | grep cpu
11:cpuset:/
6:cpu,cpuacct:/user.slice
So the next workaround was to set the kernel parameters as below.
# echo 0 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us
# echo 950000 > /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.rt_runtime_us
This seemed to resolve the situation, and it was possible to stop and start the standby instance without the above error message appearing in the alert log.
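These echo commands work because VKTM requests a realtime scheduling class, and under cgroup v1 the realtime CPU budget (cpu.rt_runtime_us) defaults to 0 for non-root cgroups: a process confined to user.slice cannot be given realtime priority until that slice has a budget, and the children's budgets may not exceed the parent's. The values can be read back as a sanity check; the paths below assume the cgroup v1 layout shown in this post.

```shell
# Sketch: read back the realtime budgets after the change (cgroup v1 paths assumed).
cat /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us               # root budget (kernel default 950000)
cat /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us  # 0 after the echo above
cat /sys/fs/cgroup/cpu,cpuacct/user.slice/cpu.rt_runtime_us    # 950000 after the echo above
```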
However, the question remains: why did the cgroup change? This was the first time this error had been encountered. Since this was adding a standby database server to an existing setup, there was a reference point to check against for any changes. First, the OS: the "good" servers had OEL 7.9 while the "problem" server had OEL 7.7. Both servers are Azure VMs.
Then it was decided to check inside the cgroup settings. The good server had the following (no user.slice):
ls -l /sys/fs/cgroup/cpu,cpuacct/
drwxr-xr-x. 3 root root 0 Apr 7 09:32 WALinuxAgent
-rw-r--r--. 1 root root 0 Apr 7 09:32 tasks
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.procs
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_period_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr 7 09:32 cpu.stat
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_sys
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.shares
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_quota_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_user
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 notify_on_release
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 release_agent
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.clone_children
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_user
While the problem server had the following:
drwxr-xr-x. 2 root root 0 Apr 7 09:29 auoms
drwxr-xr-x. 2 root root 0 Apr 7 09:29 auomscollect
-rw-r--r--. 1 root root 0 Apr 7 09:29 cgroup.clone_children
-rw-r--r--. 1 root root 0 Apr 7 09:29 cgroup.procs
-r--r--r--. 1 root root 0 Apr 7 09:29 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_percpu_sys
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_percpu_user
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr 7 09:29 cpuacct.usage_user
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.cfs_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.cfs_quota_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr 7 09:29 cpu.shares
-r--r--r--. 1 root root 0 Apr 7 09:29 cpu.stat
-rw-r--r--. 1 root root 0 Apr 7 09:29 notify_on_release
-rw-r--r--. 1 root root 0 Apr 7 09:29 release_agent
drwxr-xr-x. 69 root root 0 Apr 7 09:29 system.slice
-rw-r--r--. 1 root root 0 Apr 6 14:35 tasks
drwxr-xr-x. 2 root root 0 Apr 7 09:29 user.slice
drwxr-xr-x. 2 root root 0 Apr 7 09:29 WALinuxAgent
Besides user.slice, the auoms* directories seem to be the difference between the two. Auoms is an Azure management agent plugin. Could this be the reason why the cgroup has a user.slice? To test this, auoms was disabled and the server restarted.
# systemctl stop auoms.service
# systemctl disable auoms.service
Removed symlink /etc/systemd/system/multi-user.target.wants/auoms.service.
Removed symlink /etc/systemd/system/auoms.service.
# /sbin/reboot
When the server restarted, the cgroup didn't have a user.slice.
drwxr-xr-x. 3 root root 0 Apr 7 09:32 WALinuxAgent
-rw-r--r--. 1 root root 0 Apr 7 09:32 tasks
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.procs
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_period_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cgroup.sane_behavior
-r--r--r--. 1 root root 0 Apr 7 09:32 cpu.stat
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_sys
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.shares
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.stat
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.cfs_quota_us
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_sys
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_all
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_percpu_user
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_runtime_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 notify_on_release
-rw-r--r--. 1 root root 0 Apr 7 09:32 cpu.rt_period_us
-rw-r--r--. 1 root root 0 Apr 7 09:32 release_agent
-rw-r--r--. 1 root root 0 Apr 7 09:32 cgroup.clone_children
-r--r--r--. 1 root root 0 Apr 7 09:32 cpuacct.usage_user
The DB started without the VKTM-related error in the alert log (the kernel parameters set earlier had not been made persistent). VKTM had / for its cgroup.
ps ax | grep vktm
3314 ? Ss 0:03 asm_vktm_+ASM
3506 ? Ss 0:03 ora_vktm_dbx12
4664 pts/0 S+ 0:00 grep --color=auto vktm
# cat /proc/3506/cgroup | grep cpu
6:cpu,cpuacct:/
4:cpuset:/
There were several instances to be added to the DG configuration, and on other servers the OS was upgraded from 7.7 to 7.9. This upgrade seems to have removed auoms, and there were no cgroup issues on those servers.
If facing similar issues, first check what has caused the cgroup change before attempting hidden parameter or kernel parameter workarounds.
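That check can be scripted as a pre-flight step. The helper below is a hypothetical sketch (the function name is invented here), assuming the cgroup v1 cpu controller is mounted under /sys/fs/cgroup/cpu,cpuacct as shown in this post.

```shell
# Hypothetical pre-flight check for the cpu controller hierarchy:
# flags a user.slice directory, which on this setup indicated auoms was active.
has_user_slice() {
  [ -d "$1/user.slice" ]
}

# Example usage on a live server:
#   if has_user_slice /sys/fs/cgroup/cpu,cpuacct; then
#     echo "user.slice present - check for auoms before touching rt_runtime_us"
#   fi
```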