Infrastructure and Cloud for Enthusiasts

[blog 008]# git commit

Runecast Predictive Analytics

With many MSPs branching out into multi-cloud solutions to provide a plethora of customer services, it is important to be able to monitor your infrastructure to maintain uptime and availability for your customers. This challenge becomes exponentially more difficult when you have workloads and infrastructure across services such as AWS, Azure, on-premises vSphere stacks across multiple supported versions and validated vendors, and Kubernetes, to name a few.

This challenge goes beyond the normal SLAs of uptime and availability, though. MSPs must ensure all their platforms and services are built to best practices, patched against known CVEs, and compliant with the security standards used in Australia.

A breakdown of the typical security standards relevant to Australian workloads is:

  1. Essential Eight – a Government Cyber Security mitigation strategy[1].
  2. HIPAA – Health Information Privacy[2].
  3. ISO/IEC 27001 – a specification for information security management systems (ISMS)[3].
  4. PCI DSS – security policies for financial institutions and payment processing solutions[4].

So, being able to monitor, review, remediate, and report on all these requirements is going to be a challenge in both time and human cost.

I have been fortunate to be able to evaluate a product called Runecast Analyzer[5] in my lab. This allows proactive audits across all your environments to provide visibility on Vendor KBs, Best Practices, Vulnerabilities, Security Compliance and Hardware Compatibility.

Even though I am running this in a lab, I do try to stick to best practices as much as possible with the limited infrastructure I have. I was absolutely blown away (and a little shocked) at what the analysis found.

For the testing I was analyzing vCenter vSphere version 7.0.2.00100, NSX-T 3.1.1.0.0.1748.185, VMware Cloud Director 10.2.2.17855680 and Rancher Kubernetes 1.19.10. Frankly, it appears all is not well in my lab.

Main Dashboard Compliance

Main Dashboard Configurations

Inventory View

So, let’s break down what we are seeing here in slightly more detail, starting with Config KBs discovered.

Config KBs Discovered

Each KB is broken down by severity, with the ability to expand the severity to provide more detail such as the impacted infrastructure, a detailed description of the issue, and a reference link to the VMware KB to resolve it. It is important to note that while the detail of the analysis is impressive, applying the KBs to your infrastructure is dependent on your platform. An example is VMware VCF, which has stringent requirements around its deployment; applying KBs without consulting the vendor is not recommended, and the changes would generally be overwritten by SDDC drift packages anyway.

Let us move on to best practices.

Best Practices

Best Practices are ordered by severity and the component which has been analysed, and in this example there are recommendations on vSphere, Kubernetes, VCD and NSX-T. Expanding each of the severities provides detailed information on the best practice and a URL link to the appropriate knowledge base article depending on the product. In Best Practices you will also note that Security, Availability, Manageability, and Recoverability are all analysed on a per-product basis.

Now for Vulnerabilities … and I am looking a lot better-ish with some green Pass Results! (I know that “better-ish” is not a word, but it is my word).

Vulnerabilities

This is a very similar layout to KBs, where you can see the related severity, issue ID and what product it applies to. Also noted is the relevant CVE and advisory range, which is important when MSP SLAs are involved. Personally I like this component as I usually rely on Qualys updates for this type of information, and in this situation I don't have to trawl through findings that are not applicable to my environment, or, since I am a middle-aged gentleman, simply miss them in the particular report due to the astigmatism of the eyeball.

Third Floor: Men’s Apparel and Security Compliance.

Security Compliance

I will not go through each of the sections in Security Compliance, as the analysed report has the same layout in all of them, and to be honest nobody wants to see around 100 Security Compliance failures against Essential Eight, HIPAA, ISO, etc. because SSH is enabled on my infrastructure. I can feel the judgement already. An important thing to note is that with PCI DSS Security Compliance, virtual machines are also analysed.

For transparency, the Security Compliance standards I have enabled in this lab are not the complete set, only what I deem applicable for Australian workloads. I could have included NIST, as it covers the US[6] and Australia[7], however the specifics are beyond the scope of this article.

Other Security Compliance standards available include DISA STIG[8], BSI IT-Grundschutz[9] and GDPR[10].

Overall, I am quite impressed with Runecast's ability to completely analyse not just on-premises VMware and Kubernetes environments, but also tenancies in AWS, Azure and Horizon, while making many Architects / Engineers cry at what they thought were secure, compliant platforms.

Once the crying is over, these analytics can also provide a baseline from which MSPs can leverage automation to deploy consistent infrastructure that meets hardware compatibility, infrastructure best practices, and security compliance across multiple platforms. Unfortunately, vulnerabilities are a constantly moving goal post, however with Runecast you can schedule daily analytics reporting of your multi-cloud world, allowing you to be on the front foot and proactive with your customers.

From an MSP operational perspective, staying on top of multiple platforms is not an easy feat, and when you throw multi-cloud and a diverse customer base into the mix you need every bit of assistance you can get. This can at times mean multiple applications and reporting sets to get visibility of this data, and I think Runecast ticks the box as a single reporting point.

I would like to thank Andre Carpenter at Runecast for the opportunity to test their product and providing me with a trial license. You can follow Andre at https://www.linkedin.com/in/andrecarpenter/ or @andrecarpenter on Twitter, and Runecast at https://www.linkedin.com/company/runecast/ .


[1] https://www.cyber.gov.au/acsc/view-all-content/publications/essential-eight-explained

[2] https://compliancy-group.com/hipaa-australia-the-privacy-act-1988/

[3] https://www.iso.org/isoiec-27001-information-security.html

[4] https://www.pcisecuritystandards.org/pci_security

[5] https://www.runecast.com/

[6] https://www.nist.gov/about-nist

[7] https://www.cyber.gov.au/acsc/view-all-content/referral-organisations/national-institute-standards-and-technology-nist

[8] https://public.cyber.mil/stigs/

[9] https://www.bsi.bund.de/EN/Topics/ITGrundschutz/itgrundschutz_node.html

[10] https://gdpr.eu/data-protection-officer/

[blog 007]# git commit

VMware NTP, is yours working?

Like me, over the years millions of people have built ESX clusters, vCenter environments, vRealize*, Photon-based VMware platforms, and the list goes on with VMware offerings. Have you ever thought about, or taken the time to, actually checking whether your NTP configuration works?

NTP is a critical component of the VMware ecosystem, and not just from a logging date-stamp perspective: services such as FT, vSAN, HA, Virtual Machine Monitoring and VCF SDDC Manager, to name a small subset, all rely on accurately synchronized NTP to function correctly and not cause customer outages.

A lot of engineers will sync the infrastructure time to either an external source such as au.pool.ntp.org (Aussie reference) or their internal Microsoft PDCs.

So let's take a look from an ESX host perspective at NTP syncing to these two examples. For transparency, the ESX server I am using is 7.0U2: VMkernel esxuat3.local 7.0.2 #1 SMP Release build-17630552 Feb 17 2021 15:16:00 x86_64 x86_64 x86_64 ESXi.

ntpq> peer
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 bitburger.simon .GPS.            1 u    5   64    1   37.192   -0.846   0.000
 pve01.as24220.n 216.218.254.202  2 u    9   64    1   32.151   +2.006   0.000
ntpq>
ntpq> peer
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 ad.local .LOCL.           1 u    2   64    1    0.293   +0.714   0.000
ntpq>

Looks good! We have a peering to au.pool.ntp.org in the first example and a 2019 Microsoft Domain Controller in the second example. Let's take a closer look at these NTP peerings and look at the associations.

ntpq> assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 19499  9014   yes   yes  none    reject   reachable  1
  2 19500  9014   yes   yes  none    reject   reachable  1
ntpq> assoc
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1 42369  9014   yes   yes  none    reject   reachable  1
ntpq>

You will notice that even though NTP is peered to the NTP servers, they are showing condition "reject", which means they are not syncing time. This can result in time drift and issues in your environment.

So, to start looking at why this might be happening, I took a PCAP on the ESX host, copied the file off the host, and imported it into Wireshark to analyze.

[root@esxuat3:~] pktcap-uw --vmk vmk0 -o /tmp/test.pcap -G 30
Source PCAP
Examined Packets

As you can see, the source is my host at 192.168.1.153, the destination is au.pool.ntp.org, and the NTP version is v4, which is the default in VMware.

ntpq> version
ntpq 4.2.8p15+vmware@1.3728-o Tue Jun 30 17:18:49 UTC 2020 (1)

au.pool.ntp.org offers NTP in v3, which is why it is in a rejected state on the host.
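
If you want a quick check on the wire without pulling a full PCAP into Wireshark, ESXi also ships tcpdump-uw. A minimal sketch, assuming vmk0 is the VMkernel interface carrying your NTP traffic:

[root@esxuat3:~] tcpdump-uw -i vmk0 -n -v udp port 123

With -v, tcpdump prints the NTP version field of each request and reply, so a v3/v4 mismatch shows up straight away.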

Here is another example: a PCAP dump on a 2019 Domain Controller which is receiving NTP requests from the same ESX host.

PCAP MS 2019 Domain Controller
Detail of Packet Sequence

You can see here that the Domain Controller is handing back version 3 NTP, and once again, per the previous snippets, it is getting rejected by the host.

So how do we get around this dilemma? Well, in my case I just run a Linux NTP server with chronyd, which defaults to NTP v4. The condition in the ESX host NTP associations output then becomes "sys.peer", which is a successful sync.

[root@dns1 ~]# ntpd -d
ntpd 4.2.6p5@1.2349-o Mon Jan 25 14:08:27 UTC 2016 (1)
 1 Jun 15:04:43 ntpd[11113]: proto: precision = 0.042 usec
 1 Jun 15:04:43 ntpd[11113]: 0.0.0.0 c01d 0d kern kernel time sync enabled
event at 0 0.0.0.0 c01d 0d kern kernel time sync enabled
Finished Parsing!!
ntpq> associations
ind assid status  conf reach auth condition  last_event cnt
===========================================================
  1  7717  961a   yes   yes  none  sys.peer    sys_peer  1
ntpq>
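
For reference, the chronyd side of my lab is nothing fancy. A minimal /etc/chrony.conf sketch, assuming a 192.168.1.0/24 lab network and au.pool.ntp.org as the upstream source (adjust both for your environment):

# minimal lab sketch, not a production template
pool au.pool.ntp.org iburst    # upstream time source
allow 192.168.1.0/24           # permit the ESX hosts to query this server
local stratum 10               # keep serving time if the upstream is unreachable

Restart chronyd, point the ESX hosts at it, and chrony will answer the hosts' NTPv4 client requests by default.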

Another option is to modify the /etc/ntp.conf file on the infrastructure to include the version after the listed NTP servers, e.g. server x.x.x.x version 3.

I would not suggest doing this if you run something like a VCF stack, as you are modifying the default configuration outside of the SDDC Postgres database. If you want to change your NTP settings inside VCF SDDC Manager you can use the built-in API Explorer GUI, or do a curl from a remote command line or Postman.

$ curl 'https://sddcmgmt.local/v1/system/ntp-configuration/validations' -i -X POST \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -H 'Authorization: Bearer etYWRta....' \
    -d '{
  "ntpServers" : [ {
    "ipAddress" : "192.168.0.254"
  } ]
}'
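
Once the validation passes, my understanding from the VCF API reference is that the actual change is applied with a PUT against the same endpoint minus the /validations suffix; verify the exact verb and payload against the API guide for your VCF version. A hedged sketch:

$ curl 'https://sddcmgmt.local/v1/system/ntp-configuration' -i -X PUT \
    -H 'Content-Type: application/json' \
    -H 'Accept: application/json' \
    -H 'Authorization: Bearer etYWRta....' \
    -d '{
  "ntpServers" : [ {
    "ipAddress" : "192.168.0.254"
  } ]
}'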

So check out your NTP and make sure it is having a good time ... ahhh, Dad joke.

[blog 006]# git commit

VCF Stretched VSAN Cluster and APIs

In my previous blog I talked about the similarities between a 2 node VSAN cluster and a stretched VSAN cluster, so I thought I would continue the same theme and write about stretched VSAN cluster management in VMware Cloud Foundation [1] using APIs.

In VCF, VSAN is a requirement for the Management Workload Domain and is deployed during instantiation of the environment by Cloud Builder; however, Cloud Builder will only deploy a single availability zone VSAN cluster. If you want your management plane to be highly available, you can stretch your Management Workload Domain across two availability zones and stretch the VSAN.

The advantage of having a stretched Management Workload Domain is that it provides high availability for your management virtual workloads in the event of, for example, a data center failure. Virtual machines will be brought up in the secondary availability zone, and in my experience this generally takes under 10 minutes.

One thing that is not generally known about VCF is that any changes to the environment are generally done via API, either within the SDDC Manager web interface or through other common methods of pushing out API requests, e.g. Postman or curl. This applies to any VSAN operation, including stretching / un-stretching VSAN clusters and adding or removing hosts from the VSAN cluster. There is a publicly available VCF API Reference Guide [2].

There are prerequisites for stretched VSAN which must be addressed before any API calls can be carried out. These include: available hosts must be added to the SDDC Manager and available for use, a VSAN Witness node must exist at a tertiary location, and a layer 3 network must be in place for the VSAN VMkernel at the secondary availability zone you are stretching the VSAN to.

Once this is done you can retrieve the host ids of the newly assigned hosts and the cluster id which you wish to stretch via API.

To get the hosts run the following curl command or use the SDDC API Explorer.

 $ curl 'https://sddc-manager.yourdomain.local/v1/hosts' -i -u 'admin:VMwareInfra@1' -X GET -H 'Accept: application/json'

You will get a JSON response like the extract below, which provides the host IDs of the new hosts.

"id": "62771c25-8ef0-430c-b69b-a297e42c0ce1",
"esxiVersion": "7.0.1-17551050",
"fqdn": "esxuat10.techaspire.com.au",
"hardwareVendor": "SomeVendor",
"hardwareModel": "Some Model",

To get the cluster ID, run the following curl command or use the SDDC API Explorer.

$ curl 'https://sddc-manager.yourdomain.local/v1/clusters' -i -u 'admin:VMwareInfra@1' -X GET

You will get a JSON response like the extract below, which provides the cluster ID of the existing VSAN cluster you wish to stretch.

"id": "c385a274-352f-4359-81c0-409efed3a012",
"name": "mgmt-cluster",
"primaryDatastoreName": "mgmt-vsan01",
"primaryDatastoreType": "VSAN"

As a note, the minimum number of Management hosts required to stretch a VSAN cluster between availability zones is 8 (4x hosts in each AZ ). Standard Workload Domain clusters can have a minimum of 3x hosts per availability zone.

Once all the prerequisites are completed you can stretch the VSAN cluster to the secondary AZ, which requires two API calls to the SDDC Manager: the first validates the cluster stretch spec, and the second performs the stretch.

An example validation call and cluster stretch spec would be similar to the following JSON, albeit a cut-down version.

curl 'https://sddc-manager.yourdomain.local/v1/clusters/c385a274-352f-4359-81c0-409efed3a012/validations' -i -u 'admin:VMwareInfra@1' -X POST -H 'Content-Type: application/json' -d '
{
    "clusterStretchSpec": {
        "hostSpecs": [ {
            "id": "62771c25-8ef0-430c-b69b-a297e42c0ce1",
            "licenseKey": "THISI-SNOTT-HEKEY-YOURA-FTER1",
            "hostNetworkSpec":{ 
               "vmNics":[ 
                  { 
                     "id":"vmnic0",
                     "vdsName":"data"
                  },
                  { 
                     "id":"vmnic1",
                     "vdsName":"data"
                  },
                  { 
                     "id":"vmnic2",
                     "vdsName":"vsan"
                  },
                  { 
                     "id":"vmnic3",
                     "vdsName":"vsan"
                  }
               ]
            }
        } ],
        "secondaryAzOverlayVlanId": 1234,
        "isEdgeClusterConfiguredForMultiAZ":true,
        "witnessSpec": {
            "fqdn": "dawitnessnode.yourdomain.local",
            "vsanCidr": "192.20.2.64/27",		
            "vsanIp": "192.20.2.20"
        }
    }
}
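
Assuming the validation comes back clean, the second call performs the stretch itself. Per my reading of the API Reference Guide [2] this is a PATCH against the cluster with the same clusterStretchSpec body, so a hedged sketch (with the spec above saved to stretch-spec.json) would look like:

curl 'https://sddc-manager.yourdomain.local/v1/clusters/c385a274-352f-4359-81c0-409efed3a012' -i \
    -u 'admin:VMwareInfra@1' -X PATCH -H 'Content-Type: application/json' -d @stretch-spec.json

The call kicks off a task which you can watch from the SDDC Manager UI.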

Once this task has completed you will have a stretched VSAN cluster across two availability zones.

As per my previous blog, the VSAN cluster will be broken up into a primary and secondary fault domain, with virtual machines residing in the primary fault domain unless they are manually vMotioned to the secondary fault domain, or a disaster takes place which brings the virtual machines up in the secondary AZ.

An important note is that once you have expanded your VSAN cluster you will have to modify your Cluster HA Admission Control settings to reflect the number of hosts in a single VSAN fault domain.

Figure 1 – Example HA Configuration

To add or remove hosts from the cluster through the normal lifecycle of infrastructure and capacity requirements, APIs are used once again with the same basic inputs of host and cluster IDs, and it is a very quick process, especially if you use the SDDC API Explorer tool built into SDDC Manager. A hedged example of an expansion call follows Figure 2.

Figure 2 – Example SDDC Manager API Explorer
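
For completeness, a host expansion follows the same pattern. Per the API Reference Guide [2] it is again a PATCH to /v1/clusters/{id}, this time carrying a clusterExpansionSpec (validated beforehand via the same /validations endpoint); treat the exact field names below as an approximation and check the guide for your VCF version.

curl 'https://sddc-manager.yourdomain.local/v1/clusters/c385a274-352f-4359-81c0-409efed3a012' -i \
    -u 'admin:VMwareInfra@1' -X PATCH -H 'Content-Type: application/json' -d '
{
    "clusterExpansionSpec": {
        "hostSpecs": [ {
            "id": "62771c25-8ef0-430c-b69b-a297e42c0ce1",
            "licenseKey": "THISI-SNOTT-HEKEY-YOURA-FTER1",
            "hostNetworkSpec": {
               "vmNics": [
                  { "id": "vmnic0", "vdsName": "data" },
                  { "id": "vmnic1", "vdsName": "data" }
               ]
            }
        } ]
    }
}'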

If you want to get your head around VCF and APIs there is always VMware's Hands-on Labs, and VMUG Advantage gives you access to VCF licensing so you can deploy a nested lab environment following this guide: https://blogs.vmware.com/cloud-foundation/2020/01/31/deep-dive-into-vmware-cloud-foundation-part-1-building-a-nested-lab/ .

[1] https://www.vmware.com/au/products/cloud-foundation.html

[2] https://vdc-download.vmware.com/vmwb-repository/dcr-public/60ff5385-d6ee-41d3-9ccf-b719a59f7971/b8ba0641-d8ae-4243-acf2-3639ca248783/index.html

[blog 005]# git commit

Two Node vSAN vs Stretched vSAN

So recently I was building a personal lab with VMware Tanzu and decided to use a 2x node vSAN cluster with a Witness node to provide storage to the cluster, something I had never done before with a small vSAN cluster. The cluster was based on VMware's vSAN Two-Node Architecture for Remote Offices: https://www.vmware.com/files/pdf/products/vsan/vmware-vsan-robo-solution-overview.pdf

In my work life I have built vSAN and, most recently, stretched vSAN clusters across Availability Zones, and the latter is where I realized that stretched vSAN and ROBO vSAN are the same thing! Minus the whole L3 vSAN kernel across AZs with a Witness node in another AZ, big dollars on infrastructure, 9000 MTU, etc. I could go on about the differences, but this is a quick blog around the fundamental similarities.

Robo vSAN and Stretched vSAN can be broken down into the same fundamental underlying VMware Storage Technology.

  • Validated Host Infrastructure with localized disk (This could be Hybrid vs All Flash)
  • vSAN disk groups
  • vSAN Fault Domains
  • Preferred and Secondary Fault Domains
  • Requirements of a vSAN Witness node

vSAN Disk Groups

The concepts around vSAN Failures to Tolerate (FTT) and host cluster availability still apply regardless of Two Node or Stretched vSAN.

vSAN Fault Domains

The major differences, beyond the availability zones where your vSAN hosts reside, are firstly the number of hosts you have in a Fault Domain; for example, the minimum in a VCF deployment is 4x nodes per Fault Domain. Secondly, you should not stretch Layer 2 networks between vSAN fault domains; the VMkernels should be routed.

The vSAN Witness node in both cases will have a management network and a secondary network on the same L3 subnet that the vSAN VMkernels reside in, for vSAN object tracking. In both cases the Witness node must be in a separate Availability Zone (except in my lab ... at least it is on another host that is not in the same cluster).
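
If you want to poke at these pieces from the host side, esxcli exposes most of it. A quick sketch (command namespaces as I recall them on 7.x; double-check on your build):

esxcli vsan cluster get                        # sub-cluster membership and roles
esxcli vsan network list                       # which VMkernel carries vSAN traffic
esxcli vsan cluster preferredfaultdomain get   # preferred fault domain (two node / stretched)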

Ultimately, Two Node and Stretched vSAN address two completely different architectural requirements; however, VMware is maintaining consistency with the overall underlying technology, which I suppose is not a surprise with the rise of VCF and Life Cycle Management.

As a Storage Engineer at heart, I am a big fan of vSAN, converged storage infrastructure, or any type of storage that relies on object, block or metadata replication, although it is only as good as the underlying network redundancy below it. The reason I am a big fan is that the scale and resiliency become endless, from cold storage to high-I/O workloads fronted by NVMe controllers which cache and reduce write I/O amplification to SSD-based media.

So next time you are playing with storage, take a closer look, as you might be surprised at what you find and, in my case, technological consistency.

[blog 004]# git commit

MinIO S3 Gateway on Kobol NAS

I was talking with a colleague of mine who is a well-known storage and data protection boffin and who has been in the salt mines, technically and blogging, well before I knew what hexadecimal LUN IDs were.

"You want to play with a Kobol NAS?" .. "Sure .. hold my beer". This guy knows how to tweak my inner love for storage. So, what is a Kobol NAS? Well, it is an open-source NAS running on the ARM-based Helios64, and the best bit is you get to build it yourself: https://kobol.io/ . So cool, just insert disks here, after building it of course.

This article is not about the Kobol NAS specifically, as I am still yet to do that as promised to said storage guy, but about how it can run Docker natively along with other applications such as Open Media Vault. The Helios64 operating system is quite accommodating, and anybody familiar with Debian will feel right at home.

I have been a fan of MinIO for quite a long time and have used it for testing S3 extents for products like Veeam and Linux S3 FUSE file systems. For the purposes of this blog, MinIO is essentially an open-source object storage application that supports unstructured data, utilizing S3-compliant API calls such as PUTs and GETs, and uses a bucket construct for file placement.

There is a product within the MinIO suite called the MinIO S3 Gateway for NAS, https://docs.min.io/docs/minio-gateway-for-nas.html , and as Kobol can run Docker, well hello, we can now ingest S3 objects into our NAS, with a web front end to boot and API support.

Once you have your Kobol up and running and have installed the Docker engine through the TUI, the process is quite simple as MinIO has its own image repository: https://hub.docker.com/r/minio/minio/

docker run --name minio -p 9000:9000 -v /srv/dev-disk-by-label-kobolxfs/minio:/data -e "MINIO_ACCESS_KEY=enteryourkeyhere" -e "MINIO_SECRET_KEY=enteryourkeyhere" --restart unless-stopped minio/minio server /data

To break down the components of the Docker container instantiation:

-v /srv/dev-disk-by-label-kobolxfs/minio:/data mounts the /data folder in the Docker container to /srv/dev-disk-by-label-kobolxfs/minio on the host. This can also be seen in the deployed Docker container as /dev/md127, which is the RAID array on the Kobol NAS.

The access and secret keys are stipulated during the container instantiation ("enteryourkeyhere"), "--restart unless-stopped" is there so that when the NAS is rebooted the Docker container restarts automagically, and "minio/minio" stipulates the MinIO image you want to run as a Docker container; you can also pin a specific release tag rather than pulling the default latest.

As with any other Docker container, you can start, stop and kill it.
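
For completeness, day-to-day handling of the container looks like this (container name as per the run command above):

docker stop minio      # stop the gateway cleanly
docker start minio     # bring it back up
docker logs -f minio   # tail the MinIO logs if something looks off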

The Kobol open-source NAS provides the perfect platform to run this on, as you are not locked into vendor code, and it gives you the freedom to do what you wish while using the NAS for other services such as general file storage, media services, DNS, proxy services, etc.

Could you extend the idea into an enterprise environment? I have seen many cases where system engineers need to archive logs, DB backups, FTP backup data, etc., and the cost of vendor-based solutions is out of commercial reach. MinIO provides an alternative with low capital expenditure and can still easily be backed up using snapshot technology by traditional enterprise backup systems such as Rubrik, Veeam, Cohesity <enter vendor of choice here>, as the transactional I/O is low if you were to virtualize the solution.

There are plenty of options for ingesting S3 objects: for Linux there is s3fs (FUSE), which can be mounted via fstab, and Windows has the typical suite from S3 Browser to CloudBerry that can mount a bucket as a drive.
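
As a quick illustration of the Linux side, here is a hedged s3fs-fuse sketch pointing at the MinIO endpoint on the NAS; the bucket name, mount point and hostname are placeholders for whatever you have configured:

echo 'enteryourkeyhere:enteryourkeyhere' > ~/.passwd-s3fs && chmod 600 ~/.passwd-s3fs
s3fs mybucket /mnt/minio -o passwd_file=~/.passwd-s3fs -o url=http://kobol.local:9000 -o use_path_request_style
# or the /etc/fstab equivalent:
# mybucket /mnt/minio fuse.s3fs _netdev,passwd_file=/root/.passwd-s3fs,url=http://kobol.local:9000,use_path_request_style 0 0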