Infrastructure and Cloud for Enthusiasts

[blog 018]# git commit

Cloud Director – Tenant Kubernetes Troubleshooting

Recently in my professional life I have been working through the product development process to allow customers to deploy Kubernetes Container Clusters within their Cloud Director tenancies leveraging the Tanzu Container Service Extension.

As a quick recap, the Container Service Extension automates the instantiation of the ephemeral vApp in the customer's tenancy, which in turn deploys the Kubernetes control plane and worker nodes, along with ingress and NAT services from NSX ALB.

The requirements for deploying container workloads within a tenancy are as follows:

  • A customer tenancy with a Provider Gateway, NSX Edge Gateway and Overlay Networks,
  • NSX ALB with a shared or dedicated Service Engine for the customer tenancy,
  • The T1 gateway must have load balancing services enabled,
  • Customer overlay networks must be able to route and resolve the Cloud Director endpoint.

The following diagram illustrates the high level networking requirements for Container Clusters.

Figure 1 – Example Customer Tenancy High Level Networking

In this blog I would like to cover the troubleshooting that may be required within the customer's tenancy if a Kubernetes Container Cluster won't deploy.

Troubleshooting with the Ephemeral vApp

NSX DFW and Datacenter Groups

The purpose of the Ephemeral vApp is essentially to deploy the control plane and worker nodes for the container cluster, and the automated process is similar to deploying a TKGm cluster. The main difference is that instead of having a VM with all the required deployment packages already installed, the Ephemeral vApp downloads items such as capvcd, clusterctl, kubectl and docker from GitHub repositories.

If you have a customer tenancy that has Datacenter Groups, the DFW enabled and NSX 4.x, there is a chance that the "DefaultMaliciousIpGroup" in NSX will block access to sites such as GitHub and Microsoft services.

On the Ephemeral vApp, if you tail /var/log/cloud-final.err you will see errors where the vApp cannot download the YAML components from GitHub and "raw.githubusercontent.com" is not reachable. This is due to NSX blocking the URL as a malicious IP. Allow the IP within NSX as an exception.
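A quick way to confirm both symptoms from the Ephemeral vApp itself is to watch the log while testing connectivity directly; a minimal sketch (the log path is the one above, the curl target is just an example raw GitHub URL):

root@EPHEMERAL-TEMP-VM:/# tail -f /var/log/cloud-final.err
root@EPHEMERAL-TEMP-VM:/# nslookup raw.githubusercontent.com
root@EPHEMERAL-TEMP-VM:/# curl -kv --max-time 10 https://raw.githubusercontent.com -o /dev/null

If DNS resolves but the curl hangs or times out, that points towards the DFW dropping the traffic rather than a routing or DNS issue.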

Figure 2 – Example of URL blocking by NSX
Figure 3 – Example Malicious IP Blocked.
Figure 4 – Adding Exception to IP.
Checking for Failed Container Cluster Deployments

If you are not seeing any errors in /var/log/cloud-final.err, verify the deployment control plane by checking the availability of the bootstrap cluster.

This can be done using the following command on the ephemeral vApp. Note: to get the root password, check the guest customization of the vApp VM.

root@EPHEMERAL-TEMP-VM:/.kube# kubectl get po,deploy,cluster,kubeadmcontrolplane,machine,machinedeployment -A --kubeconfig /.kube/config

If there are any failed pods you can check the logs of the bootstrap pods for errors.

For example: kubectl --namespace kube-system logs etcd-kind-control-plane --kubeconfig /.kube/config

Figure 5 – Checking Bootstrap Cluster.

If the bootstrap cluster is OK, check the following:

  • The Ephemeral VM can reach and resolve the public VIP of the Cloud Director cluster (a quick check is sketched after this list).
  • Check NSX ALB for the following:
    • The tenancy has a Service Engine assigned and load balancing is enabled.
    • Check that the transit network was created in NSX with DHCP enabled.
    • Check in NSX ALB that an NSX VRF context has been created for the customer tenancy T1 gateway and that the associated transit network has DHCP enabled.
    • There is enough licensing capacity to deploy Service Engines.
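For the first item in the list above, a quick check from the Ephemeral VM is to resolve and hit the unauthenticated Cloud Director API versions endpoint; a minimal sketch, assuming a hypothetical endpoint of vcd.example.com:

root@EPHEMERAL-TEMP-VM:/# nslookup vcd.example.com
root@EPHEMERAL-TEMP-VM:/# curl -kv https://vcd.example.com/api/versions -o /dev/null

A successful response confirms the VM can both resolve and reach the Cloud Director endpoint.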

While this is not an exhaustive list of troubleshooting steps, I have found that these are typically the reasons why container clusters will not deploy. To be honest it took time to troubleshoot each of these scenarios, and I will add to this blog over time if I come across any more issues.

This will be the last blog for 2023 so stay safe and I will see you all again in 2024.

Cheers

Tony Williamson

[blog 017]# git commit

VMware NSX, Unable to add a Host Transport Node

So in this blog I am going to be talking about the benefits of NSX sub-clusters in version 4.1.1.0.0, how they work and why they are cool for Cloud Service Providers. At least that was the intention, until one of my lab hosts decided to grenade itself in style. So what, you ask, don't be soft, just rebuild it and put it back in the cluster, and that is exactly what I did until I could not prepare the rebuilt host for NSX due to the following error.

Figure 1 – Validation Error

Not a problem. Let's move the host out of the vSphere cluster and validate whether the host still exists using the API.

Once we have validated that it is there, we can simply remove it with the API.

GET https://<NSX Manager>/api/v1/transport-nodes/
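If the host had still been returned, removing it is a DELETE against the same endpoint using the transport node ID from the GET response; a sketch only, with a placeholder ID:

DELETE https://<NSX Manager>/api/v1/transport-nodes/<transport-node-id>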

In my case the ESX host was not returned via the GET API call, which means the problem is within the Corfu database. Now, I have previously done a blog on removing stuck Edge Transport Nodes from the Corfu database, so I have sort of been down this path before. Once again, if you find yourself in this scenario in a production environment, don't go in guns blazing without the good folk at GSS leading the way.

For me, I like to get my hands dirty and break things in my labs, since I am an idiot who sometimes forgets he has a family and other commitments. Right, so on to the good stuff.

So with the ESX host moved out of the vSphere cluster, it is time to run some queries on the Corfu database. We will export the following tables to file, search for either the host name or host UUID, and extract the relevant fields associated with the ESX host.

  • HostModelMsg
  • GenericPolicyRealizedResource
  • HostTransportNode
root@nsx:/# /opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -o showTable -t HostModelMsg > /tmp/file1.txt
root@nsx:/# /opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -o showTable -t GenericPolicyRealizedResource > /tmp/file2.txt
root@nsx:/# /opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -o showTable -t HostTransportNode > /tmp/file3.txt

Once the tables have been exported, we need to confirm that the stuck ESX host exists within the JSON and extract the "stringId" and the "Key" for the ESX host from each exported table.
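One quick way to locate these in the exports is to grep for the host name with some surrounding context (the context line counts are just a guess and may need adjusting):

root@nsx:/# grep -n -B2 -A8 "esxuat3" /tmp/file1.txt /tmp/file2.txt /tmp/file3.txt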

The following is an example of what I am looking for:

"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"

Key:
{
  "uuid": {
    "left": "10489905160825487935",
    "right": "11102041072723471023"
  }
}

You will notice the UUID at the end of "esxuat3" in the stringId matches the UUID from the previous screenshot.

Now that we have the required information we can shut down the NSX proton service and clean up the Corfu database.

root@nsxt:/# service proton stop; service corfu-server stop
root@nsxt:/# service corfu-server start

The first command removes the ESX host stringId key from the "GenericPolicyRealizedResource" table. When the command has completed, make sure "1 records deleted successfully" appears at the end of the output, otherwise nothing has been deleted.

corfu_tool_runner.py --tool corfu-browser -o deleteRecord -n nsx -t GenericPolicyRealizedResource --keyToDelete '{"stringId": "/infra/realized-state/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"}'


Deleting 1 records in table GenericPolicyRealizedResource and namespace nsx.  Stream Id dabf8af4-9eb6-3374-9a18-d273ed7132e9
Namespace: nsx
TableName: GenericPolicyRealizedResource
2023-08-22T00:11:46.361Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$GenericPolicyRealizedResource id dabf8af4-9eb6-3374-9a18-d273ed7132e9
2023-08-22T00:11:46.362Z | INFO  |                           main |     o.c.runtime.view.SMRObject | Added SMRObject [dabf8af4-9eb6-3374-9a18-d273ed7132e9, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"stringId": "/infra/realized-state/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"}

 1 records deleted successfully.

The next command removes the ESX host uuid key from the "HostModelMsg" table. Again, when the command has completed, make sure "1 records deleted successfully" appears at the end of the output, otherwise nothing has been deleted.

corfu_tool_runner.py --tool corfu-browser -o deleteRecord -n nsx -t HostModelMsg --keyToDelete '{"uuid": {"left": "10489905160825487935", "right": "11102041072723471023"} }'


Deleting 1 records in table HostModelMsg and namespace nsx.  Stream Id d8120129-1f35-34c2-a309-e5cf6dbe5487
Namespace: nsx
TableName: HostModelMsg
2023-08-22T00:12:20.049Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$HostModelMsg id d8120129-1f35-34c2-a309-e5cf6dbe5487
2023-08-22T00:12:20.050Z | INFO  |                           main |     o.c.runtime.view.SMRObject | Added SMRObject [d8120129-1f35-34c2-a309-e5cf6dbe5487, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"uuid": {"left": "10489905160825487935", "right": "11102041072723471023"} }

 1 records deleted successfully.

The final command removes the ESX host stringId key from the "HostTransportNode" table. Once more, confirm that "1 records deleted successfully" appears at the end of the output, otherwise nothing has been deleted.

corfu_tool_runner.py --tool corfu-browser -o deleteRecord -n nsx -t HostTransportNode --keyToDelete '{"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021" }'


Deleting 1 records in table HostTransportNode and namespace nsx.  Stream Id a720622f-68c3-3359-9114-12231645d94e
Namespace: nsx
TableName: HostTransportNode
2023-08-22T00:12:49.897Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$HostTransportNode id a720622f-68c3-3359-9114-12231645d94e
2023-08-22T00:12:49.898Z | INFO  |                           main |     o.c.runtime.view.SMRObject | Added SMRObject [a720622f-68c3-3359-9114-12231645d94e, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021" }

 1 records deleted successfully.

Now that the Corfu database entries have been removed, we can restart the proton and Corfu database services and check their status to ensure they have started correctly.

root@nsxt:/# service proton restart; service corfu-server restart
root@nsxt:/# service proton status; service corfu-server status

Once all the services have restarted, we can move the ESX host back into the vSphere cluster and allow NSX to prepare the host.

All things going well, the host preparation should be successful and the host will be back in action for NSX goodness.

Figure 2 – Host Preparation.
Figure 3 – Completed NSX Installation.

Well that's a wrap for this blog after getting a little sidetracked down this rabbit hole. I hope this will help if you get stuck in your labs, or at least helps you understand the path GSS would go down in a production situation. Stay tuned for the next one (touch wood) on NSX sub-clusters and the benefits for Cloud Service Providers. As always, keep on NSXing!

Cheers

Tony Williamson

[blog 016]# git commit

VMware Cloud Provider Lifecycle Manager

In my professional career I spend quite a bit of time designing cloud solutions and products. I am always looking at ways to improve the deployment and day 2 operations of products to make operational teams more efficient, provide product consistency and remove the “human factor” which can lead to undesired results in deployments.

VMware Cloud Director https://www.vmware.com/au/products/cloud-director.html is part of my bread and butter, so I was looking at how I could follow my cloud solution mantra with VMware Cloud Provider Lifecycle Manager (VCPLCM) https://docs.vmware.com/en/VMware-Cloud-Provider-Lifecycle-Manager/index.html .

So what is VCPLCM and how does it provide deployment consistency and day 2 operations? I am glad you asked.

VCPLCM allows for the creation of Cloud Director environments and the lifecycle of the platform for day 2 operations, including software patching and certificates. Existing Cloud Director environments can also be imported to take over what would have normally been manual operations. VCPLCM has three components: Environments, Datacenters and Tasks.

Let's start with Datacenters. While not mandatory to deploy a VMware Cloud Provider environment, VCPLCM provides integrations into the deployment of the platform, which include vCenter, NSX, NSX ALB, and vROps. The integrations allow VCPLCM to check interoperability when Cloud Director is lifecycled, and it will flag an issue if one is identified.

The integrations are deployed as you would normally and then registered within VCPLCM. Below is an example of imported integrations into VCPLCM.

Registered Datacenters

Moving on to Environments. This is the deployment of VMware Cloud Director, VMware Chargeback, vCloud Usage Meter and RabbitMQ. These are all add-on applications for Cloud Provider platforms, however they are not a requirement, as a provider may not provide the application extensions or may choose another lifecycle method.

Environments which can be deployed.

The application environments can be added into a VMware Cloud Director deployment and also have interoperability checked and flagged if not compliant. The difference with these application environments is that they can be lifecycled to later software versions, scaled (for example, adding more VCD cells) or have certificates updated by VCPLCM to ensure interoperability before updating a Cloud Director environment.

Below is an example of the actions available for a deployed Cloud Director environment.

Available actions for a VCD environment.

Below is an example of updating RabbitMQ.

Update or redeploy.

There is one caveat to all this deployment and lifecycle goodness, and that is that all the OVAs used for deploying and updating applications need to be stored on VCPLCM, which can add up to quite a bit of capacity. The location path on the appliance for the OVAs is as follows.

vcplcm@vcplm [ /cplcmrepo ]$ pwd
/cplcmrepo

vcplcm@vcplm [ /cplcmrepo ]$ ls -lr
total 4
drwxrwxrwx 16 root root 189 May 7 13:10 vropsta
drwxrwxrwx 28 root root 4096 May 7 13:10 vcd
drwxrwxrwx 6 root root 60 May 7 13:10 usage
drwxrwxrwx 4 root root 33 May 7 13:10 rmq

So depending on your cost of storage you may want to move the path off to NFS storage.

For example, with the repository mounted from an NFS export:

192.168.1.125:/nfs/vcplm/cplcmrepo   47G  7.1G   40G  16% /cplcmrepo
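A sketch of what relocating the repository might look like, assuming the appliance has an NFS client available and using the hypothetical export shown above; the fstab entry makes the mount persistent across reboots, and any OVAs already in /cplcmrepo would need to be copied to the export first:

root@vcplm [ / ]# mount -t nfs 192.168.1.125:/nfs/vcplm/cplcmrepo /cplcmrepo
root@vcplm [ / ]# echo "192.168.1.125:/nfs/vcplm/cplcmrepo /cplcmrepo nfs defaults 0 0" >> /etc/fstab
root@vcplm [ / ]# df -h /cplcmrepo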

Well, that's the end of this blog, so I hope I have enlightened you to go take a look and see if it is suitable for your existing or net new deployments. The best part is that if you delete an Environment or Datacenter from VCPLCM it does not delete your production systems, it just removes it from the VCPLCM database.

I am hoping to do more around Cloud Director, as it is a great platform for multi-tenancy, so stay tuned for more blogs on the subject.

[blog 015]# git commit

VMware Cloud Director Tenancy Load Balancing with NSX Advanced Load Balancer

I have spent quite a bit of time recently implementing Cloud Director tenancy load balancing with NSX Advanced Load Balancer, and also talking to quite a few people about it. The latest was at the Sydney VMUG UserCon as part of my "Real World Service Provider Networking Load Balancing" presentation, which I will upload in the next blog. So after all the presenting and talking, I thought I should do a blog on the implementation and what happens behind the scenes.

Let's start with what has changed between NSX for vSphere, early implementations of NSX-T, and where we are now with load balancing.

In NSX for vSphere, load balancing was carried out on Edge Services Gateways and had simple functionality around virtual server protocols, port ranges and the backend pools for connectivity. There were also simple service monitors for TCP, HTTP and HTTPS, to name a few.

Customers with a tenancy inside VMware Cloud Director could create simple load balancing services on demand, based on the available IP resources assigned to them.

When NSX-T 2.4 came out it had similar functionality, however the load balancer was assigned to a T1 gateway and an Edge Cluster of at least medium size. While this could be done in NSX-T, there was no supported functionality within Cloud Director.

Enter NSX Advanced Load Balancer with NSX 3.0 and integration with Cloud Director. Now, generally I am a big fan of NSX Advanced Load Balancer, and with the integration into Cloud Director it brings new functionality such as HTTP cookie persistence, transparency mode (which allows the preservation of client IPs) and shared or dedicated Service Engines for customers.

The following diagram shows the construct of how load balancing services are provided. An NSX-T Cloud Service Engine Group is assigned to a tenant. I prefer not to share Service Engines between customers, as "sharing" can make them nervous, even though in reality they would be separated via their own routing domain (VRF) on the Service Engine.

It also allows for simpler billing for customers, as they can consume as many virtual services as they require depending on how many available IPs they have.

The Cloud Director API creates a transit overlay network between the T1 and the Service Engine, and a static route is applied on the T1 for the Virtual Service that is hosted on the Service Engine.

Route advertisement is updated on the T1 via API from vCD to NSX to enable LB VIP routes and static routes, allowing advertisement to the T0 and into a customer's VRF or their public-facing network, so that VM workloads on NSX overlay networks can access the VIP service.
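One way to confirm those advertisement settings after a tenant virtual service has been created is to pull the tenant T1 from the NSX Policy API and check the route_advertisement_types field; a sketch, with a hypothetical Tier-1 ID:

GET https://<NSX Manager>/policy/api/v1/infra/tier-1s/<tenant-t1-id>

"route_advertisement_types": [
  "TIER1_CONNECTED",
  "TIER1_STATIC_ROUTES",
  "TIER1_LB_VIP"
]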

The management plane of the Service Engine is connected via a "Cloud Management" network, which is your typical T1 / T0 design.

Logical Design for NSX Advanced Load Balancer.


I have created the following video that shows the creation of load balanced services and what is required / takes place from an NSX and NSX ALB perspective.

Deployment of NSX Advanced Load Balancer with Cloud Director.

Having an understanding of what happens behind the scenes is, in my mind, the most important aspect of any design and implementation, as it will help with troubleshooting deployments and existing environments when things don't go as planned, and I like to know the mystery behind the magic.

See you all in the next #git commit.

[blog 014]# git commit

NSX Edge Transport Nodes With Failed Deletion

In my lab I am constantly adding and deleting virtual infrastructure depending on what projects I am working on, or testing for customers, or it could be just the fact my mind works like a squirrel collecting nuts while listening to Punk.

One thing I have come across is the NSX Manager failing to delete an Edge Transport Node and getting itself into a balked state: the Edge Node has been deleted from the virtual infrastructure, however it is stuck in a "Deletion in Progress" state within the NSX Manager. Even though this is a lab and it does not affect anything, I cannot stand having errors (kind of like my obsession with certificates).

Balked Deletion in Progress

Now, this issue is not new, and the process is to either delete Edge Nodes via the API (if they still exist via the API) or delete the entries from the Corfu database; however, the DB process has changed from 3.2 onwards, which this blog will cover, and for transparency this version of NSX is 4.0.0.1.0. For an in-depth method prior to 3.2 you can check Shank Mohan's article here: https://www.lab2prod.com.au/2021/11/nsx-t-edge-deletion-failed.html

Before continuing, make sure you have a backup of NSX in case things don't go as planned (we all do backups anyway, don't we… don't we!!), and it is best to have a GSS case logged with VMware before proceeding, as this blog provides zero warranty.

The following process needs to be carried out as "root" on each of the NSX Managers in the environment.

From the root login we are going to use the internal Corfu database tool to run queries, updates and deletions to remove the stale entries.

First of all, we are going to look for any Edge Nodes that are marked for deletion; in the JSON payload that is returned we need the "stringId" of the Edge Node.

root@nsxtuat:/opt/vmware/bin# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t ReplacementInfo

Key:
{
  "stringId": "/infra/sites/default/enforcement-points/default/edge-transport-node/787bc347-d015-43e8-8399-115e45c27f1d"
}

Payload:
{
  "abstractPolicyResource": {
    "managedResource": {
      "displayName": "transport-edge-05",
      "tagsArray": {
      }
    },
    "markedForDelete": true,
    "deleteWithParent": false,
    "locked": false,
    "isOnboarded": false,
    "internalKey": {
      "left": "8681747419887911912",
      "right": "9482629587000262429"
    },

Next we need to stop the proton service and the Corfu database, then start just the Corfu database so we can modify the tables. As a habit, I always check to make sure the Corfu database has started.

root@nsxtuat:/opt/vmware/bin# service proton stop; service corfu-server stop

root@nsxtuat:/opt/vmware/bin# service corfu-server start

root@nsxtuat:/opt/vmware/bin# service corfu-server status
* corfu-server.service - Corfu Infrastructure Server
   Loaded: loaded (/etc/init.d/corfu-server; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2022-08-30 03:26:48 UTC; 3s ago
     Docs: https://github.com/corfudb/corfudb
  Process: 2522 ExecStopPost=/etc/init.d/corfu-server poststop (code=exited, status=0/SUCCESS)
  Process: 2372 ExecStop=/etc/init.d/corfu-server stop (code=exited, status=0/SUCCESS)
  Process: 2838 ExecStart=/etc/init.d/corfu-server start (code=exited, status=0/SUCCESS)
  Process: 2807 ExecStartPre=/etc/init.d/corfu-server prestart (code=exited, status=0/SUCCESS)
    Tasks: 63 (limit: 4915)
   CGroup: /system.slice/corfu-server.service

The next step is to back up all the relevant tables in the database in case we need to restore them. I save them in the /tmp directory as I don't intend to keep them after the NSX Manager reboots down the track.

root@nsxtuat:/# cd /tmp
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t ReplacementInfo > ReplacementInfo.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeNodeExternalConfig > EdgeNodeExternalConfig.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeNodeInstallInfo > EdgeNodeInstallInfo.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeNodeConfigInfo > EdgeNodeConfigInfo.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t GenericPolicyRealizedResource > GenericPolicyRealizedResource.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeTransportNode > EdgeTransportNode.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t DeletedVm > DeletedVm.txt


root@nsxtuat:/tmp# ls -lhra
-rw-r--r--  1 root           root           2.3M Aug 30 23:48 GenericPolicyRealizedResource.txt
-rw-r--r--  1 root           root            34K Aug 30 23:49 EdgeTransportNode.txt
-rw-r--r--  1 root           root           7.5K Aug 30 23:47 EdgeNodeInstallInfo.txt
-rw-r--r--  1 root           root            14K Aug 30 23:47 EdgeNodeExternalConfig.txt
-rw-r--r--  1 root           root           7.6K Aug 30 23:47 EdgeNodeConfigInfo.txt
-rw-r--r--  1 root           root           4.4K Aug 30 23:49 DeletedVm.txt

The next step is to take the "stringId" we captured earlier, which in this case contains "787bc347-d015-43e8-8399-115e45c27f1d", and delete the associated stringId keys from each of the database tables (a sketch of the pattern follows).
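The deletions use the same deleteRecord operation shown further down for the Client table; as a sketch only, against the ReplacementInfo table with the key captured above (repeat for the other tables where the key exists, adjusting the stringId path to match what that table's export shows):

root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t ReplacementInfo --keyToDelete '{"stringId": "/infra/sites/default/enforcement-points/default/edge-transport-node/787bc347-d015-43e8-8399-115e45c27f1d"}'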

If you get a response that includes "not found in nsx<table_name>" it is not the end of the world, it just means that NSX has already cleaned up the key in that particular table.

The next step is to clean up any stale records in the "Client" RPC messaging table, so we need to search for our saved "stringId" again. The "stringId" will help us identify the "left" and "right" UUIDs which are required to remove the stale records. The snippet below shows the relevant part of the output.

root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t Client


Key:
{
  "uuid": {
    "left": "8681747419887911912",
    "right": "9482629587000262429"
  }
}

Payload:
{
  "clientType": "cvn-edge",
  "clientToken": "787bc347-d015-43e8-8399-115e45c27f1d",
  "masterClusterNode": {
    "left": "8679300982090774583",
    "right": "16873472161445019477"
  },

With the "left" and "right" UUIDs obtained, we can now delete the stale keys out of the Client, EdgeMsgClientInfo, and EdgeSystemInfo tables. Note the UUIDs in the keyToDelete arguments below.

root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t Client --keyToDelete '{"uuid":{"left":8681747419887911912,"right":9482629587000262429}}'

Namespace: nsx
TableName: Client
2022-08-31T05:09:57.553Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$Client id 55943778-4eff-34a9-bdd0-6a3bd274dc58
Deleting record with Key {"uuid":{"left":8681747419887911912,"right":9482629587000262429}} in table Client and namespace nsx.  Stream Id 55943778-4eff-34a9-bdd0-6a3bd274dc58


root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t EdgeMsgClientInfo --keyToDelete '{"uuid":{"left":8681747419887911912,"right": 9482629587000262429}}'

Namespace: nsx
TableName: EdgeMsgClientInfo
2022-08-31T05:12:00.531Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$EdgeMsgClientInfo id 954ff3fb-d058-32de-a41b-452ad521950e
Deleting record with Key {"uuid":{"left":8681747419887911912,"right": 9482629587000262429}} in table EdgeMsgClientInfo and namespace nsx.  Stream Id 954ff3fb-d058-32de-a41b-452ad521950e


root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t EdgeSystemInfo --keyToDelete '{"uuid":{"left":8681747419887911912,"right": 9482629587000262429}}'

Namespace: nsx
TableName: EdgeSystemInfo
2022-08-31T05:12:16.629Z | INFO  |                           main |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$EdgeSystemInfo id 31c0178f-fedd-3ddf-9b06-6ffc8307ffcf
Deleting record with Key {"uuid":{"left":8681747419887911912,"right": 9482629587000262429}} in table EdgeSystemInfo and namespace nsx.  Stream Id 31c0178f-fedd-3ddf-9b06-6ffc8307ffcf

Now that we have cleaned up all the relevant tables, to validate that the Edge Node has been removed we can view the EdgeTransportNode table, which should show only valid Edge Nodes. I won't show the output as it is quite a lot of JSON, however you can just search for the name of your Edge Nodes to confirm.

root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeTransportNode
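If you would rather not scroll through the JSON, piping the output through grep for the deleted node's name is one option; no matches means the entry is gone (a sketch using the edge name from this example):

root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeTransportNode | grep -i "transport-edge-05"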

Now that everything is clean, restart the proton service, log into the NSX Manager and you will see that the Edge Node has been deleted. Note that this process has to be done on all NSX Manager nodes in the cluster.
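For reference, since proton was stopped earlier and corfu-server was left running, bringing proton back up and checking it might look like the following; a sketch only:

root@nsxtuat:/tmp# service proton start
root@nsxtuat:/tmp# service proton status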

Finally we can ssh back into the NSX Manager as admin and run “start search resync manager” to sync up all the Edge Nodes.

As you can see below “transport-edge-05” has now been removed.

Edge Node Deleted

So, all in all, this is quite a complex process and it took me quite a while to work through, so I hope you find it useful. However, as I stated earlier in the blog, if it is production this should only be attempted with the assistance of GSS, and backups are mandatory.

Keep on NSXing, peeps!