VMware NTP: is yours working?
Like me, millions of people have built ESX clusters, vCenter environments, vRealize*, Photon-based VMware platforms, and the list goes on with VMware offerings. Have you ever taken the time to actually check whether your NTP configuration works?
NTP is a critical component of the VMware ecosystem, and not just from a logging timestamp perspective: services such as FT, vSAN, HA, Virtual Machine Monitoring and VCF SDDC Manager, to name a small subset, all rely on accurately synchronized time to function correctly and avoid customer outages.
A lot of engineers sync their infrastructure's time either to an external source such as au.pool.ntp.org (an Aussie reference) or to their internal Microsoft PDCs.
So let's take a look at NTP syncing to these two examples from an ESX host's perspective. For transparency, the ESX server I am using is 7.0U2: VMkernel esxuat3.local 7.0.2 #1 SMP Release build-17630552 Feb 17 2021 15:16:00 x86_64 x86_64 x86_64 ESXi.
ntpq> peer
remote refid st t when poll reach delay offset jitter
==============================================================================
bitburger.simon .GPS. 1 u 5 64 1 37.192 -0.846 0.000
pve01.as24220.n 216.218.254.202 2 u 9 64 1 32.151 +2.006 0.000
ntpq>
ntpq> peer
remote refid st t when poll reach delay offset jitter
==============================================================================
ad.local .LOCL. 1 u 2 64 1 0.293 +0.714 0.000
ntpq>
Looks good! We have a peering with au.pool.ntp.org in the first example and with a 2019 Microsoft Domain Controller in the second. Let's take a closer look at these peerings and examine the associations.
ntpq> assoc
ind assid status conf reach auth condition last_event cnt
===========================================================
1 19499 9014 yes yes none reject reachable 1
2 19500 9014 yes yes none reject reachable 1
ntpq> assoc
ind assid status conf reach auth condition last_event cnt
===========================================================
1 42369 9014 yes yes none reject reachable 1
ntpq>
You will notice that even though NTP is peered with the NTP servers, the associations report the condition "reject", which means they are not syncing time; that can result in time drift and issues in your environment.
To start investigating why this might be happening, I took a PCAP on the ESX host, copied the file off the host, and imported it into Wireshark to analyze.
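To catch this condition without eyeballing the output, you can feed `ntpq`'s association listing through a small check. A minimal sketch — `check_ntp_sync` is a hypothetical helper of my own, not a VMware or ntpq tool; it just looks for a `sys.peer` association:

```shell
# Exits with a message depending on whether any association reached "sys.peer".
# Intended usage on the host: ntpq -c associations | check_ntp_sync
check_ntp_sync() {
  # Reads ntpq association output on stdin; "sys.peer" means a selected,
  # successfully syncing source, "reject" means the peer was discarded.
  if grep -q 'sys\.peer'; then
    echo "NTP is syncing"
  else
    echo "WARNING: no sys.peer association - clock is not syncing"
  fi
}
```

A cron job or monitoring agent wrapping this is a cheap way to alert on silent NTP failure before drift bites.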
[root@esxuat3:~] pktcap-uw --vmk vmk0 -o /tmp/test.pcap -G 30
Source PCAP
Examined Packets
As you can see, the source is my host at 192.168.1.153, the destination is au.pool.ntp.org, and the NTP version is v4, which is the default in VMware.
ntpq> version
ntpq 4.2.8p15+vmware@1.3728-o Tue Jun 30 17:18:49 UTC 2020 (1)
au.pool.ntp.org offers NTP v3, which is why it is in a rejected state on the host.
Here is another example of a PCAP dump on a 2019 Domain Controller receiving NTP requests from the same ESX host.
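If you want to confirm the version without clicking through every packet, note that the NTP version lives in bits 3–5 of the first header byte (the LI|VN|Mode octet). A quick sketch — the function and example byte values are mine, not from the capture above:

```shell
# Extract the NTP version number from the first header byte of a packet,
# e.g. as copied out of Wireshark's hex pane. Layout: LI (2 bits),
# VN (3 bits), Mode (3 bits), so version = (byte >> 3) & 0x7.
ntp_version() {
  echo $(( ($1 >> 3) & 0x7 ))
}
```

For example, 0x23 decodes as LI 0, version 4, mode 3 (client), while 0x1b is a version 3 client packet. In Wireshark itself, the display filter field `ntp.flags.vn` shows the same value directly.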
PCAP MS 2019 Domain Controller
Detail of Packet Sequence
You can see here that the Domain Controller is handing back NTP version 3, and, as the earlier snippets showed, it is being rejected by the host.
So how do we get around this dilemma? In my case, I just run a Linux distro NTP server running chronyd, whose default is NTP v4. The ESX host's NTP association then shows the condition "sys.peer", which is a successful sync.
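For reference, serving your lab from chronyd only takes a couple of directives. A minimal sketch of the config — the upstream pool and the allowed subnet are assumptions for illustration, so substitute your own:

```
# /etc/chrony.conf (minimal sketch - pool and subnet are examples)
pool au.pool.ntp.org iburst    # upstream time source for this server
allow 192.168.0.0/16           # permit NTP clients from the lab subnet
```

Point the ESX hosts at this box and chronyd answers their NTPv4 requests.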
[root@dns1 ~]# ntpd -d
ntpd 4.2.6p5@1.2349-o Mon Jan 25 14:08:27 UTC 2016 (1)
1 Jun 15:04:43 ntpd[11113]: proto: precision = 0.042 usec
1 Jun 15:04:43 ntpd[11113]: 0.0.0.0 c01d 0d kern kernel time sync enabled
event at 0 0.0.0.0 c01d 0d kern kernel time sync enabled
Finished Parsing!!
ntpq> associations
ind assid status conf reach auth condition last_event cnt
===========================================================
1 7717 961a yes yes none sys.peer sys_peer 1
ntpq>
Another option is to modify the /etc/ntp.conf file on the infrastructure to include the version after each listed NTP server, e.g. server x.x.x.x version 3.
I would not suggest doing this if you run something like a VCF stack, as you're modifying the default configuration outside of the SDDC Manager Postgres database. If you want to change your NTP settings inside VCF SDDC Manager, you can use the GUI, the API, or a curl call from a remote command line or Postman:
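On a standalone host that would look something like this — the server address below is an example, not one from my environment:

```
# /etc/ntp.conf (fragment) - force NTPv3 requests to this server
server 192.168.0.254 version 3 iburst
```

Remember to restart the NTP daemon afterwards (on ESXi, `/etc/init.d/ntpd restart`) for the change to take effect.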
$ curl 'https://sddcmgmt.local/v1/system/ntp-configuration/validations' -i -X POST \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer etYWRta....' \
-d '{
"ntpServers" : [ {
"ipAddress" : "192.168.0.254"
} ]
}'
So check out your NTP and make sure it is having a good time... ahhh, Dad joke.