NSX Edge Transport Nodes With Failed Deletion
In my lab I am constantly adding and deleting virtual infrastructure depending on what projects I am working on or testing for customers, or it could just be that my mind works like a squirrel collecting nuts while listening to Punk.
One thing I have come across is the NSX Manager failing to delete an Edge Transport Node and getting itself into a balked state: the Edge Node has been deleted from the virtual infrastructure, however it is stuck in a “Deletion in Progress” state within the NSX Manager. Even though this is a lab and it does not affect anything, I cannot stand having errors (kind of like my obsession with certificates).
Balked Deletion in Progress
Now this issue is not new, and the process is to either delete the Edge Nodes via API (if they still exist via API) or delete the entries from the Corfu Database. However, the process for the DB has changed from 3.2 onwards, which is what this blog will cover; for transparency, the version of NSX here is 4.0.0.1.0. For an in-depth method prior to 3.2 you can check Shank Mohan’s article here: https://www.lab2prod.com.au/2021/11/nsx-t-edge-deletion-failed.html
Before continuing, make sure you have a backup of NSX in case things don’t go as planned (we all do backups anyway, don’t we ….. don’t we!!), and it is best to have a GSS case logged with VMware before proceeding as this blog provides zero warranty.
The following process needs to be carried out as “root” on each of the NSX Managers in the environment.
From the root login we are going to use the internal Corfu Database Tool (corfu_tool_runner.py) to run the queries, updates, and deletions needed to remove the stale entries.
First of all we are going to look for any Edge Nodes that are marked for deletion; from the JSON payload that is returned we need the “stringId” of the Edge Node (the output below is a snippet).
root@nsxtuat:/opt/vmware/bin# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t ReplacementInfo
Key:
{
"stringId": "/infra/sites/default/enforcement-points/default/edge-transport-node/787bc347-d015-43e8-8399-115e45c27f1d"
}
Payload:
{
"abstractPolicyResource": {
"managedResource": {
"displayName": "transport-edge-05",
"tagsArray": {
}
},
"markedForDelete": true,
"deleteWithParent": false,
"locked": false,
"isOnboarded": false,
"internalKey": {
"left": "8681747419887911912",
"right": "9482629587000262429"
},
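The full table output can be long, so one rough way to pull out just the marked-for-delete stringIds (a grep filter of my own; you may need to adjust the number of context lines to match your output) is something like:
root@nsxtuat:/opt/vmware/bin# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t ReplacementInfo | grep -B 20 '"markedForDelete": true' | grep '"stringId"'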
Next we need to stop the Proton service and the Corfu Database, then start just the Corfu Database so we can modify the tables. As a habit I always check to make sure the Corfu Database has started.
root@nsxtuat:/opt/vmware/bin# service proton stop; service corfu-server stop
root@nsxtuat:/opt/vmware/bin# service corfu-server start
root@nsxtuat:/opt/vmware/bin# service corfu-server status
* corfu-server.service - Corfu Infrastructure Server
Loaded: loaded (/etc/init.d/corfu-server; enabled; vendor preset: enabled)
Active: active (running) since Tue 2022-08-30 03:26:48 UTC; 3s ago
Docs: https://github.com/corfudb/corfudb
Process: 2522 ExecStopPost=/etc/init.d/corfu-server poststop (code=exited, status=0/SUCCESS)
Process: 2372 ExecStop=/etc/init.d/corfu-server stop (code=exited, status=0/SUCCESS)
Process: 2838 ExecStart=/etc/init.d/corfu-server start (code=exited, status=0/SUCCESS)
Process: 2807 ExecStartPre=/etc/init.d/corfu-server prestart (code=exited, status=0/SUCCESS)
Tasks: 63 (limit: 4915)
CGroup: /system.slice/corfu-server.service
The next step is to back up all the relevant tables in the database in case we need to restore them. I save them in the /tmp directory as I don’t intend to keep them after the NSX Manager reboots down the track.
root@nsxtuat:/# cd /tmp
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t ReplacementInfo > ReplacementInfo.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeNodeExternalConfig > EdgeNodeExternalConfig.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeNodeInstallInfo > EdgeNodeInstallInfo.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeNodeConfigInfo > EdgeNodeConfigInfo.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t GenericPolicyRealizedResource > GenericPolicyRealizedResource.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeTransportNode > EdgeTransportNode.txt
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t DeletedVm > DeletedVm.txt
root@nsxtuat:/tmp# ls -lhra
-rw-r--r-- 1 root root 2.3M Aug 30 23:48 GenericPolicyRealizedResource.txt
-rw-r--r-- 1 root root 34K Aug 30 23:49 EdgeTransportNode.txt
-rw-r--r-- 1 root root 7.5K Aug 30 23:47 EdgeNodeInstallInfo.txt
-rw-r--r-- 1 root root 14K Aug 30 23:47 EdgeNodeExternalConfig.txt
-rw-r--r-- 1 root root 7.6K Aug 30 23:47 EdgeNodeConfigInfo.txt
-rw-r--r-- 1 root root 4.4K Aug 30 23:49 DeletedVm.txt
The next step is to take the “stringId” we captured earlier, in this case “787bc347-d015-43e8-8399-115e45c27f1d”, and delete the associated stringId keys from each of the database tables we backed up, as shown below.
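The pattern is the same for each table: run the deleteRecord operation and pass the Key JSON exactly as it appears in that table’s backup file (the key format differs between tables, so copy it from your own output rather than from this example). As a sketch, for the ReplacementInfo entry shown above it would look something like:
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t ReplacementInfo --keyToDelete '{"stringId": "/infra/sites/default/enforcement-points/default/edge-transport-node/787bc347-d015-43e8-8399-115e45c27f1d"}'
Then repeat against EdgeNodeExternalConfig, EdgeNodeInstallInfo, EdgeNodeConfigInfo, GenericPolicyRealizedResource, EdgeTransportNode, and DeletedVm, substituting the table name and the key from the matching backup file:
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t <table_name> --keyToDelete '<Key JSON from the matching backup file>'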
If you get a response that includes “not found in nsx<table_name>” it is not the end of the world, it just means that NSX has already cleaned up the key in that particular table.
The next step is to clean up any stale records in the “Client RPC Messaging Table”, so we need to search for our saved “stringId” again. The “stringId” will help us identify the “left” and “right” uuids which will be required to remove the stale records. These are highlighted in bold below, and this is just a snippet of the output.
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t Client
Key:
{
"uuid": {
"left": "8681747419887911912",
"right": "9482629587000262429"
}
}
Payload:
{
"clientType": "cvn-edge",
"clientToken": "787bc347-d015-43e8-8399-115e45c27f1d",
"masterClusterNode": {
"left": "8679300982090774583",
"right": "16873472161445019477"
},
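If the Client table output is long, a quick way to find the matching record and its “left” and “right” uuids (again, a rough grep filter of my own; adjust the context lines to suit your output) is something like:
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t Client | grep -B 12 '787bc347-d015-43e8-8399-115e45c27f1d'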
With the “left” and “right” uuids obtained we can now delete the stale keys out of the Client, EdgeMsgClientInfo, and EdgeSystemInfo tables. Note the uuids in bold below.
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t Client --keyToDelete '{"uuid":{"left":8681747419887911912,"right":9482629587000262429}}'
Namespace: nsx
TableName: Client
2022-08-31T05:09:57.553Z | INFO | main | o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$Client id 55943778-4eff-34a9-bdd0-6a3bd274dc58
Deleting record with Key {"uuid":{"left":8681747419887911912,"right":9482629587000262429}} in table Client and namespace nsx. Stream Id 55943778-4eff-34a9-bdd0-6a3bd274dc58
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t EdgeMsgClientInfo --keyToDelete '{"uuid":{"left":8681747419887911912,"right": 9482629587000262429}}'
Namespace: nsx
TableName: EdgeMsgClientInfo
2022-08-31T05:12:00.531Z | INFO | main | o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$EdgeMsgClientInfo id 954ff3fb-d058-32de-a41b-452ad521950e
Deleting record with Key {"uuid":{"left":8681747419887911912,"right": 9482629587000262429}} in table EdgeMsgClientInfo and namespace nsx. Stream Id 954ff3fb-d058-32de-a41b-452ad521950e
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o deleteRecord -n nsx -t EdgeSystemInfo --keyToDelete '{"uuid":{"left":8681747419887911912,"right": 9482629587000262429}}'
Namespace: nsx
TableName: EdgeSystemInfo
2022-08-31T05:12:16.629Z | INFO | main | o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$EdgeSystemInfo id 31c0178f-fedd-3ddf-9b06-6ffc8307ffcf
Deleting record with Key {"uuid":{"left":8681747419887911912,"right": 9482629587000262429}} in table EdgeSystemInfo and namespace nsx. Stream Id 31c0178f-fedd-3ddf-9b06-6ffc8307ffcf
Now that we have cleaned up all the relevant tables, to validate that the Edge Node has been removed we can view the EdgeTransportNode table, which should now show only valid Edge Nodes. I won’t show the output as it is quite a lot of JSON, however you can just search for the name of your Edge Nodes to confirm.
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeTransportNode
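For example, a quick grep for the deleted node’s name should now return nothing:
root@nsxtuat:/tmp# /opt/vmware/bin/corfu_tool_runner.py -o showTable -n nsx -t EdgeTransportNode | grep transport-edge-05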
Now that everything is clean, restart the Proton service, then log into the NSX Manager and you will see that the Edge Node has been deleted. Note that this process has to be done on all NSX Manager nodes in the cluster.
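To bring Proton back up (it was stopped at the start of the process), something like:
root@nsxtuat:/tmp# service proton start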
Finally, we can SSH back into the NSX Manager as admin and run “start search resync manager” to sync up all the Edge Nodes.
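From the admin CLI it looks something like this (the prompt will reflect your own manager’s hostname):
nsxtuat> start search resync manager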
As you can see below “transport-edge-05” has now been removed.
Edge Node Deleted
So all in all this is quite a complex process, and it took me quite a while to work through, so I hope you find it useful. However, as I said earlier in the blog, if this is production it should only be attempted with the assistance of GSS, and backups are mandatory.
Keep on NSXing peeps !