Infrastructure and Cloud for Enthusiasts

[blog 019]# git commit

Cloud Director Tenancy Container Applications


Deploying container based workloads in AWS, GCP and Azure is definitely not a new thing in the world of hyperscaler deployments, and the majority of the planet does it for good reason, but is there another option out there from your sovereign VMware Cloud Service Provider to deliver the same service? (We can discuss the benefits of sovereign cloud providers another time.)

This blog is going to be the beginning of a series on Cloud Director Tenancy based container workloads, moving from a high level overview to implementation and then on to troubleshooting.

So what are the benefits and efficiencies of deploying container based workloads within a Cloud Director Tenancy?

  • Multi-tenancy – Tenancy isolation means that one tenant cannot access or impact another tenant’s container based workloads.
  • Resource based allocation – Tenants can have their own dedicated resources, policies, and controls, ensuring fair resource distribution and preventing resource contention.
  • Simplified Management – Tenants can manage networks, storage, and containers all from the same pane of glass where their VM workloads reside.
  • Scalability – Cloud Director supports scaling container workloads up and down based on the requirements of the tenant and the tenant workloads.
  • Persistent Storage – Container applications have persistent storage backed by either principal vSAN storage or 3rd party storage vendors.
  • Automated Ingress Deployment – Application ingress and load balancing are deployed automatically with application instantiation.
  • kubectl access – Cloud Director workload clusters allow for the use of traditional kubectl commands from the command line, using the downloaded Kubernetes config for the cluster, for the management of applications.
  • Security – Cloud Director features such as network segmentation, identity management and role based access control extend into container workloads.
  • Availability – Cloud Director leverages underlying vSphere resources for high availability of workload clusters, while other platform components such as NSX Advanced Load Balancer provide failover and load balancing services. Kubernetes Workload Clusters have the ability to “auto repair” when errored, with consistent node health checking.

Within VMware Cloud Director a tenant can deploy helm charts from the VMware Marketplace (provided by the service provider and presented to the tenant’s Content Hub), from public helm repositories (either provided by the service provider or added to the tenant’s own content library), or from a Harbor repository. These deployed applications can have automated ingress deployments and can leverage CI/CD pipelines with full management from Tanzu Mission Control. (We will cover TMC in another article.)
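For readers who prefer the command line, the same style of deployment can be sketched with standard helm commands against the workload cluster. The chart, namespace and kubeconfig file names below are illustrative only; in Cloud Director the equivalent deployment is normally driven from the Content Hub UI.

```shell
# Add a public helm repository and deploy a chart to the workload cluster.
# All names below are illustrative; the kubeconfig is the one downloaded
# from the Cloud Director workload cluster page.
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update

helm install grafana bitnami/grafana \
  --namespace monitoring --create-namespace \
  --kubeconfig ./kubeconfig-monitoring.txt
```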

Figure 1 – Service Provider Helm Chart Repository.

Before being able to deploy helm charts into a tenancy, quite a few tenancy based components are required.

The first requirements are provided by the service provider and published to the tenancy.

  • The service provider must deploy an Edge Gateway to provide L3 networking, security and NAT functionality.
  • The service provider must provide a “network service specification” address range for container workload ingress and NAT addressing (based on customer requirements).
  • The service provider must enable load balancing services for the tenant. This can be either shared or dedicated NSX ALB Service Engines.
  • A published role within the tenancy to be able to deploy Kubernetes container clusters. This is typically the “Kubernetes Cluster Author” role, however a service provider may wish to create custom roles and provide them to the tenancy.
  • A published role (which can be the same Cluster Author role) to be able to manage applications within a deployed workload cluster.

The second requirements are implemented by the tenant.

  • A configured routed network deployed to the Edge Gateway.
  • A tenancy user assigned to the Kubernetes Cluster Author role to allow for creation of Kubernetes Workload Clusters and to have API rights to deployed applications.
  • A published helm chart repository.
  • A deployed Kubernetes Workload Cluster.
  • Rights to deploy the Cloud Director Kubernetes Operator to deployed workload clusters.

Figure 2 – Example Deployed Workload Cluster.

Before being able to deploy a container application, the Cloud Director Kubernetes Operator must be installed on the workload cluster. The VMware Cloud Director Kubernetes Operator is a component designed to facilitate the management and orchestration of Kubernetes clusters within VMware Cloud Director environments. Kubernetes Operators in general provide a mechanism to manage applications on workload clusters through custom resource definitions (CRDs), providing automation, lifecycle management, self-healing of applications and consistency of deployments, while allowing tenant users to define custom configurations.

The VMware Cloud Director Kubernetes Operator is downloaded from a VMware public registry (or a private registry where there is a requirement for air-gapped solutions).

Figure 3 – Example of the Deployed Kubernetes Operator.
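The same check can be done from the command line once the operator is installed. This is a hedged sketch: the exact namespace, pod labels and CRD names the operator registers vary between versions, so treat the grep patterns and kubeconfig file name below as placeholders to verify in your own environment.

```shell
# Look for the operator's pods across all namespaces
# (pod/namespace names are version dependent - adjust the filter as needed).
kubectl get pods -A --kubeconfig ./kubeconfig-monitoring.txt | grep -i operator

# List the CRDs registered on the workload cluster
kubectl get crds --kubeconfig ./kubeconfig-monitoring.txt
```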

When deploying applications, the manifest of the application helm chart can be modified based on the requirements of the tenant’s DevOps team, for items such as changing from a ClusterIP to a LoadBalancer to allow for automated deployment of ingress services, providing valid application certificates, or defining how many replicas are required for the application, just to name a few.
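As an alternative to hand-editing the manifest, the same overrides can be passed at install time. This is only a sketch: the value paths (service.type, replicaCount) are common helm chart conventions but differ per chart, so check the chart’s own values.yaml before relying on them.

```shell
# Override chart values at install time instead of editing the manifest.
# service.type / replicaCount are assumptions - confirm against the chart's
# values.yaml; namespace and kubeconfig mirror the example later in this post.
helm install grafana bitnami/grafana \
  --set service.type=LoadBalancer \
  --set replicaCount=2 \
  --namespace vcd-contenthub-workloads \
  --kubeconfig ./kubeconfig-grafana.txt
```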

The advantage of deploying VMware Marketplace application workloads, for example from Bitnami, is that the applications are constantly security hardened and updated, provided the service provider publishes the updates. Pre-packaged applications also reduce the requirement to maintain specific systems engineering skills within an organization, allowing businesses to focus on the application and outcome, and not the engineering behind it.

Below is an example of deploying a container application.

Figure 4 – Example of Deploying a Container Application.

Below is an example of modifying the helm chart manifest to deploy a LoadBalancer instead of a ClusterIP. Just as a note, you do not have to deploy a load balanced service during the instantiation of the application; you could, for example, run “kubectl expose deployment grafana --type=LoadBalancer --port=8080 --target-port=3000 -n vcd-contenthub-workloads --name=grafana-ingress --kubeconfig C:\temp\kubeconfig-grafana.txt” to create the ingress service from the command line, which in turn, via the Cloud Director Kubernetes Operator, deploys the ingress service in the tenancy.

Figure 5 – Example of Modifying the ClusterIP to type LoadBalancer.

Below is an example of deployed applications on a Kubernetes Workload Cluster called “monitoring”.

Figure 6 – Example of Deployed Container Applications.

Below is an example of the automatically provisioned ingress, created when launching the application, which allows access to the deployed application.

Figure 7 – Example of Ingress Load Balancing for Container Workloads.

While this has been a very high level overview of Cloud Director Tenancy Container Applications, it provides insight into the capability of the platform to deploy your applications on a multi-tenanted cloud environment, with automated features that allow applications to be deployed quickly and seamlessly while maintaining security and availability.

I’m excited to see how far I can push the platform, with the intent to deploy GPU based workload clusters and try open source AI technologies such as DeepSpeed and PyTorch, as I believe there is a place for sovereign multi-tenanted private AI outside of the hyperscalers, so stay tuned for that. I have already deployed SOLR machine learning within a tenancy using the public Git repository, but have yet to fully understand the application.

I will go into a deeper technical breakdown of components and how to deploy the infrastructure in further blogs, but I feel this is a great place to start to allow people to understand the capabilities. Until next time.


Tony Williamson

[blog 018]# git commit

Cloud Director – Tenant Kubernetes Troubleshooting

Recently in my professional life I have been working through the product development process to allow customers to deploy Kubernetes Container Clusters within their Cloud Director tenancies leveraging the Tanzu Container Service Extension.

As a quick recap, the Container Service Extension automates the instantiation of the ephemeral vApp in the customer’s tenancy, which in turn deploys the Kubernetes control plane and worker nodes, along with ingress and NAT services from NSX ALB.

The requirements for deploying container workloads within a tenancy are the following:

  • A customer tenancy with a Provider Gateway, NSX Edge Gateway and overlay networks.
  • NSX ALB with a shared or dedicated Service Engine for the customer tenancy.
  • The T1 gateway must have load balancing services enabled.
  • Customer overlay networks must be able to route to and resolve the Cloud Director endpoint.
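The last requirement is worth verifying up front from a VM on the tenant overlay network. A minimal sketch, assuming placeholder names (the VCD endpoint hostname here is illustrative):

```shell
# Check that the Cloud Director endpoint resolves and is reachable from the
# overlay network. "vcd.provider.example.com" is a placeholder hostname.
nslookup vcd.provider.example.com

# A successful response here confirms routing and TLS reachability
# (-k skips certificate validation for a quick connectivity test only).
curl -ks https://vcd.provider.example.com/api/versions | head -5
```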

The following diagram illustrates the high level networking requirements for Container Clusters.

Figure 1 – Example Customer Tenancy High Level Networking

In this blog I would like to cover troubleshooting that may be required within the customer’s tenancy if a Kubernetes Container Cluster won’t deploy.

Troubleshooting with the Ephemeral vApp

NSX DFW and Datacenter Groups

The purpose of the ephemeral vApp is essentially to deploy the control plane and worker nodes for the container cluster, and the automated process is similar to deploying a TKGm cluster. The main difference is that instead of having a VM with all the required deployment packages already installed, the ephemeral vApp downloads items such as capvcd, clusterctl, kubectl and docker from Github repositories.

If you have a customer tenancy that has datacenter groups, the DFW enabled and NSX 4.x, there is a chance that the “DefaultMaliciousIpGroup” in NSX will block access to sites such as Github and Microsoft services.

On the ephemeral vApp, if you tail /var/log/cloud-final.err you will see errors where the vApp cannot download the YAML components from Github and the required URL is not reachable. This is due to NSX blocking the URL as being a malicious IP. Allow the IP within NSX as an exception.

Figure 2 – Example of URL blocking by NSX
Figure 3 – Example Malicious IP Blocked.
Figure 4 – Adding Exception to IP.
Checking for failed deployment of Container Clusters

If you are not seeing any errors within the /var/log/cloud-final.err logs, check the deployment of the control plane for any errors by checking the availability of the bootstrap cluster.

This can be done using the following command on the ephemeral vApp. Note: to get the root password, check the guest customization of the vApp VM.

root@EPHEMERAL-TEMP-VM:/.kube# kubectl get po,deploy,cluster,kubeadmcontrolplane,machine,machinedeployment -A --kubeconfig /.kube/config

If there are any failed pods you can check the logs of the bootstrap pods for errors.

For example: kubectl --namespace kube-system logs etcd-kind-control-plane --kubeconfig /.kube/config

Figure 5 – Checking Bootstrap Cluster.

If the bootstrap cluster is OK, check the following:

  • The Ephemeral VM can reach and resolve the public VIP of the Cloud Director cluster.
  • Check NSX ALB for the following:
    • The tenancy has a Service Engine assigned and load balancing is enabled.
    • Check that the transit network was created in NSX with DHCP enabled.
    • Check in NSX ALB that an NSX VRF context has been created for the customer tenancy T1 gateway and the associated transit network has DHCP enabled.
    • There is enough licensing resource to deploy Service Engines.

While this is not an exhaustive list of troubleshooting, I have found that these are typically the reasons why container clusters will not deploy. To be honest it took time to troubleshoot each of these scenarios, and I will add to this blog over time if I come across any more issues.

This will be the last blog for 2023 so stay safe and I will see you all again in 2024.


Tony Williamson

[blog 017]# git commit

VMware NSX, Unable to add a Host Transport Node

So in this blog I am going to be talking about the benefits of NSX sub-clusters, how they work and why they are cool for Cloud Service Providers. At least that was the intention, till one of my lab hosts decided to grenade itself in style. So what, you ask, don’t be soft, just rebuild it and put it back in the cluster. And that is exactly what I did, till I could not prepare the rebuilt host for NSX due to the following error.

Figure 1 – Validation Error

Not a problem. Let’s move the host out of the vSphere cluster, and validate whether the host still exists using the API.

Once we have validated it just remove it with the API.

GET https://<NSX Manager>/api/v1/transport-nodes/

So my ESX host was not returned via the GET API call, which means the problem is within the Corfu database. Now, I have previously done a blog on removing stuck Edge Transport Nodes from the Corfu database, so I have sort of been down this path before. Once again, if you find yourself in this scenario in a production environment, don’t go in guns blazing without the good folk at GSS leading the way.
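A quick way to run that check without reading the raw JSON by eye is to filter the API response, sketched below. The NSX Manager address and credentials are placeholders, and jq is assumed to be available on the workstation making the call.

```shell
# List transport node display names and look for the rebuilt host.
# <nsx-manager>, the credentials and the host name are placeholders.
curl -ks -u 'admin:password' \
  "https://<nsx-manager>/api/v1/transport-nodes/" \
  | jq -r '.results[].display_name' \
  | grep -i esxuat3
```

No output from the grep confirms the host is absent from the API view, pointing you at the Corfu database as described above.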

For me I like to get my hands dirty and break things in my labs since I am an idiot who forgets sometimes he has a family and other commitments. Right, so on to the good stuff.

So with the ESX host moved out of the vSphere cluster, it is time to run some queries on the Corfu database, exporting the following tables to file to search for either the host name or host UUID and extract the relevant fields associated with the ESX host.

  • HostModelMsg
  • GenericPolicyRealizedResource
  • HostTransportNode
root@nsx:/# /opt/vmware/bin/ --tool corfu-editor -n nsx -o showTable -t HostModelMsg > /tmp/file1.txt
root@nsx:/# /opt/vmware/bin/ --tool corfu-editor -n nsx -o showTable -t GenericPolicyRealizedResource > /tmp/file2.txt
root@nsx:/# /opt/vmware/bin/ --tool corfu-editor -n nsx -o showTable -t HostTransportNode > /tmp/file3.txt

Once the exports of the tables have been done, we need to verify that the stuck ESX host exists within the JSON, and extract the “stringId” and the “uuid” key for the ESX host from each exported table.

The following is an example of what I am looking for,

"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"

  "uuid": {
    "left": "10489905160825487935",
    "right": "11102041072723471023"

You will notice the uuid at the end of “esxuat3” in the stringId matches the uuid from the previous screenshot.
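Rather than scrolling through the export files by hand, the two fields can be pulled out with grep. A minimal sketch using a recreated fragment of the record above (the host name “esxuat3” and the uuid values are from this lab):

```shell
# Recreate a fragment of an exported table for illustration - in practice you
# would run these greps against /tmp/file1.txt .. /tmp/file3.txt directly.
cat > /tmp/file3-sample.txt <<'EOF'
"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"
  "uuid": {
    "left": "10489905160825487935",
    "right": "11102041072723471023"
EOF

# Pull out the stringId for the stuck host
grep -o '"stringId": "[^"]*esxuat3[^"]*"' /tmp/file3-sample.txt

# Pull out the two halves of the uuid key (long runs of digits in quotes)
grep -A2 '"uuid"' /tmp/file3-sample.txt | grep -o '"[0-9]\{10,\}"'
```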

Now that we have the required information we can shut down the NSX proton service and clean up the Corfu database.

root@nsxt:/# service proton stop; service corfu-server stop
root@nsxt:/# service corfu-server start

The first command removes the ESX host stringId key from the “GenericPolicyRealizedResource” table. When the command has completed, we need to ensure that “1 records deleted successfully” appears in the output, otherwise we have not deleted anything, as shown in the output below.

 --tool corfu-browser -o deleteRecord -n nsx -t GenericPolicyRealizedResource --keyToDelete '{"stringId": "/infra/realized-state/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"}'

Deleting 1 records in table GenericPolicyRealizedResource and namespace nsx.  Stream Id dabf8af4-9eb6-3374-9a18-d273ed7132e9
Namespace: nsx
TableName: GenericPolicyRealizedResource
2023-08-22T00:11:46.361Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$GenericPolicyRealizedResource id dabf8af4-9eb6-3374-9a18-d273ed7132e9
2023-08-22T00:11:46.362Z | INFO  |                           main |     o.c.runtime.view.SMRObject | Added SMRObject [dabf8af4-9eb6-3374-9a18-d273ed7132e9, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"stringId": "/infra/realized-state/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"}

 1 records deleted successfully.

The next command removes the ESX host uuid key from the “HostModelMsg” table. Again, we need to ensure that “1 records deleted successfully” appears in the output, otherwise we have not deleted anything, as shown in the output below.

 --tool corfu-browser -o deleteRecord -n nsx -t HostModelMsg --keyToDelete '{"uuid": {"left": "10489905160825487935", "right": "11102041072723471023"} }'

Deleting 1 records in table HostModelMsg and namespace nsx.  Stream Id d8120129-1f35-34c2-a309-e5cf6dbe5487
Namespace: nsx
TableName: HostModelMsg
2023-08-22T00:12:20.049Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$HostModelMsg id d8120129-1f35-34c2-a309-e5cf6dbe5487
2023-08-22T00:12:20.050Z | INFO  |                           main |     o.c.runtime.view.SMRObject | Added SMRObject [d8120129-1f35-34c2-a309-e5cf6dbe5487, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"uuid": {"left": "10489905160825487935", "right": "11102041072723471023"} }

 1 records deleted successfully.

The final command removes the ESX host stringId key from the “HostTransportNode” table. Once more, ensure that “1 records deleted successfully” appears in the output, otherwise we have not deleted anything, as shown in the output below.

 --tool corfu-browser -o deleteRecord -n nsx -t HostTransportNode --keyToDelete '{"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021" }'

Deleting 1 records in table HostTransportNode and namespace nsx.  Stream Id a720622f-68c3-3359-9114-12231645d94e
Namespace: nsx
TableName: HostTransportNode
2023-08-22T00:12:49.897Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$HostTransportNode id a720622f-68c3-3359-9114-12231645d94e
2023-08-22T00:12:49.898Z | INFO  |                           main |     o.c.runtime.view.SMRObject | Added SMRObject [a720622f-68c3-3359-9114-12231645d94e, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021" }

 1 records deleted successfully.
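Since the whole cleanup hinges on that “1 records deleted successfully” line appearing after each delete, the check is worth scripting rather than eyeballing. This helper is a sketch of my own, not part of the NSX tooling; it simply scans whatever output is piped into it.

```shell
# Guard function: succeed only if the tool output confirms exactly one delete.
check_deleted() {
  if grep -q "1 records deleted successfully"; then
    echo "delete confirmed"
  else
    echo "record NOT deleted - investigate before continuing" >&2
    return 1
  fi
}

# Example: feeding it a captured success message
printf ' 1 records deleted successfully.\n' | check_deleted   # prints "delete confirmed"
```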

Now that the Corfu database entries have been removed, we can restart the proton and Corfu database services and check their status to ensure they have started correctly.

root@nsxt:/# service proton restart; service corfu-server restart
root@nsxt:/# service proton status; service corfu-server status

Once all the services have restarted, we can move the ESX host back into the vSphere cluster and allow NSX to prepare the host.

All things going well, the host preparation should be successful and the host back in action for NSX goodness.

Figure 2 – Host Preparation.
Figure 3 – Completed NSX Installation.

Well, that’s a wrap for this blog after getting a little sidetracked down this rabbit hole. I hope this will help if you get stuck in your labs, or at least help you understand the path GSS would go down in a production situation. Stay tuned for the next one (touch wood) on NSX sub-clusters and the benefits for Cloud Service Providers. As always, keep on NSXing!


Tony Williamson

[blog 016]# git commit

VMware Cloud Provider Lifecycle Manager

In my professional career I spend quite a bit of time designing cloud solutions and products. I am always looking at ways to improve the deployment and day 2 operations of products to make operational teams more efficient, provide product consistency and remove the “human factor” which can lead to undesired results in deployments.

VMware Cloud Director is part of my bread and butter, so I was looking at how I could follow my cloud solution mantra with VMware Cloud Provider Lifecycle Manager (VCPLCM).

So what is VCPLCM and how does it provide deployment consistency and day 2 operations? I am glad you asked.

VCPLCM allows for the creation of Cloud Director environments and the lifecycle of the platform for day 2 operations, including software patching and certificates. Existing Cloud Director environments can also be imported to take over what would normally have been manual operations. VCPLCM has three components: Environments, Datacenters and Tasks.

Let’s start with Datacenters. While not mandatory for deploying a VMware Cloud Provider environment, VCPLCM provides integrations into the deployment of the platform, including vCenter, NSX, NSX ALB, and vROPs. These integrations allow VCPLCM to check interoperability when Cloud Director is lifecycled, and flag an issue if one is identified.

The integrations are deployed as you would normally, and then registered within VCPLCM. Below is an example of integrations imported into VCPLCM.

Registered Datacenters

Moving on to Environments. This is the deployment of VMware Cloud Director, VMware Chargeback, vCloud Usage Meter and RabbitMQ. These are all add-on applications for Cloud Provider platforms, however they are not a requirement, as a provider may not provide the application extensions or may choose another method of lifecycle management.

Environments which can be deployed.

The application environments can be added into a VMware Cloud Director deployment and also have interoperability checked and flagged if not compliant. The difference with these application environments is that they can be lifecycled to later software versions, scaled (for example by adding more VCD cells) or have certificates updated by VCPLCM, ensuring interoperability before updating a Cloud Director environment.

Below is an example of what actions are available for a deployed Cloud Director environment.

Available actions for a VCD environment.

Below is an example of updating RabbitMQ.

Update or redeploy.

There is one caveat to all this deployment and lifecycle goodness, and that is that all the OVAs used for deploying and updating applications need to be stored on VCPLCM, which can add up to quite a bit of capacity. The location path on the appliance for the OVAs is as follows.

vcplcm@vcplm [ /cplcmrepo ]$ pwd
/cplcmrepo

vcplcm@vcplm [ /cplcmrepo ]$ ls -lr
total 4
drwxrwxrwx 16 root root 189 May 7 13:10 vropsta
drwxrwxrwx 28 root root 4096 May 7 13:10 vcd
drwxrwxrwx 6 root root 60 May 7 13:10 usage
drwxrwxrwx 4 root root 33 May 7 13:10 rmq

So depending on your cost of storage you may want to move the path off to NFS storage.
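One way to do that relocation is a straight NFS mount over the repository path. A sketch only: the NFS server name and export path below are placeholders, and you would stop VCPLCM services and copy existing content before switching the mount.

```shell
# Mount an NFS export over the OVA repository path.
# "nfs01.lab.local:/export/cplcmrepo" is a placeholder for your NFS server.
mount -t nfs nfs01.lab.local:/export/cplcmrepo /cplcmrepo

# Persist the mount across reboots
echo "nfs01.lab.local:/export/cplcmrepo /cplcmrepo nfs defaults 0 0" >> /etc/fstab
```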

For example, a df of the repository in my lab shows: 47G total, 7.1G used, 40G available (16%) on /cplcmrepo.

Well, that’s the end of this blog, so I hope I have enlightened you to go take a look and see if it is suitable for your existing or net new deployments. The best part is that if you delete an Environment or Datacenter from VCPLCM it does not delete your production systems, it just deletes it from the VCPLCM database.

I am hoping to do more around Cloud Director as it is a great multi-tenancy platform, so stay tuned for more blogs on the subject.

[blog 015]# git commit

VMware Cloud Director Tenancy Load Balancing with NSX Advanced Load Balancer

I have spent quite a bit of time recently implementing Cloud Director Tenancy Load Balancing with NSX Advanced Load Balancer, and also talking to quite a few people about it. The latest was at the Sydney VMUG UserCon as part of my “Real World Service Provider Networking Load Balancing” presentation, which I will upload in the next blog. So after all the presenting and talking, I thought I should do a blog on the implementation and what happens behind the scenes.

Let’s start with what has changed between NSX for vSphere, early implementations of NSX-T, and where we are at now with load balancing.

So in NSX for vSphere, load balancing was carried out on Edge Services Gateways and had simple functionality around virtual server protocols, port ranges and the backend pools for connectivity. There were also simple service monitors for TCP, HTTP and HTTPS, to name a few.

Customers with a tenancy inside VMware Cloud Director could create simple load balancing services on demand based on their available IP resourcing assigned.

When NSX-T 2.4 came out it had similar functionality, however load balancing was assigned to a T1 gateway and an Edge Cluster of at least medium size. While this could be done in NSX-T, there was no supported functionality within Cloud Director.

Enter NSX Advanced Load Balancer with NSX 3.0 and integration with Cloud Director. Now generally I am a big fan of NSX Advanced Load Balancer, and the integration into Cloud Director brings new functionality such as “HTTP Cookie Persistence” and “Transparency Mode”, which allows the preservation of client IPs, along with shared / dedicated Service Engines for customers.

The following diagram shows the construct of how load balancing services are provided. An NSX-T Cloud Service Engine Group is assigned to a tenant. I prefer not to share Service Engines between customers, as “sharing” can make them nervous, even though in reality they would be separated via their own VRF routing domain on the Service Engine.

It also allows for simpler billing for customers, as they can consume as many virtual services as they require, depending on how many available IPs they have.

The Cloud Director API creates a transit overlay network between the T1 and the Service Engine, and a static route is applied on the T1 for the Virtual Service that is hosted on the Service Engine.

Route advertisement is updated on the T1 via API from vCD to NSX to enable LB VIP routes and static routes, allowing advertisement to the T0 and into a customer’s VRF or their public facing network, so that VM workloads on NSX overlay networks can access the VIP service.

The management plane of the Service Engine is connected via a “Cloud Management” network, which is your typical T1 / T0 design.

Logical Design for NSX Advanced Load Balancer.

I have created the following video that shows the creation of Load Balanced Services and what is required / takes place from a NSX and NSX ALB perspective.

Deployment of NSX Advanced Load Balancer with Cloud Director.

Having an understanding of what happens behind the scenes is, in my mind, the most important aspect of any design and implementation, as it will help with troubleshooting deployments and existing environments when things don’t go as planned. I like to know the mystery behind the magic.

See you all in the next #git commit.