August 24, 2023
VMware NSX, Unable to add a Host Transport Node
So in this blog I am going to be talking about the benfits of NSX sub-cluster in version 4.1.1.0.0., how they work and why they are cool for Cloud Service Providers. At least that was the intention, till one of my lab hosts decided to grenade itself in style. So what you ask, don’t be soft,just rebuild it put it back in the cluster, and that is exactly what I did till I could not prepare the rebuilt host for NSX due to the following error.
Figure 1 – Validation Error
Not a problem. Lets move the host out of the vSphere cluster, and validate that the host still exists using the API.
Once we have validated it just remove it with the API.
GET https://<NSX Manager>/api/v1/transport-nodes/
So my ESX host was not returned via the GET API call which means the problem is within in the Corfu database. Now, I have done a blog previously on removing stuck Edge Transport Nodes from the Corfu database so I have sort of been down this path before. Once again, if you find yourself in this scenario in a production environment don’t go in guns blazing without the good folk at GSS leading the way.
For me I like to get my hands dirty and break things in my labs since I am an idiot who forgets sometimes he has a family and other commitments. Right, so on to the good stuff.
So with the ESX host moved out of the vSphere cluster it is time to run some queries on the Corfu database, and export to file the following tables to search for either the host name or host uuid, and extract the relevant fields associated to the ESX host.
HostModelMsg
GenericPolicyRealizedResource
HostTransportNode
root@nsx:/# /opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -o showTable -t HostModelMsg > /tmp/file1.txt
3. root@nsx:/# /opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -o showTable -t GenericPolicyRealizedResource > /tmp/file2.txt
4. root@nsx:/# /opt/vmware/bin/corfu_tool_runner.py --tool corfu-editor -n nsx -o showTable -t HostTransportNode > /tmp/file3.txt
Once the exports of the tables have been done we need to identify that the stuck ESX host within the json exists and extract the “StringID” and the “Key” for the ESX host for each table exported.
The following is an example of what I am looking for,
"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"
Key:
{
"uuid": {
"left": "10489905160825487935",
"right": "11102041072723471023"
}
}
You will notice the uuid at the end of “esxuat3″in the stringId matches the uuid from the previous screen shot.
Now that we have the required information we can shut down the NSX proton service and clean up the Corfu database.
root@nsxt:/# service proton stop; service corfu-server stop
root@nsxt:/# service corfu-server start
The first command removes the ESX host stringid key from “GenericPolicyRealizedResource” table. When the command has completed we need to ensure that “1 records deleted successfully” in the output otherwise we have not deleted anything. This is noted in bold within code snippet.
corfu_tool_runner.py --tool corfu-browser -o deleteRecord -n nsx -t GenericPolicyRealizedResource --keyToDelete '{"stringId": "/infra/realized-state/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"}'
Deleting 1 records in table GenericPolicyRealizedResource and namespace nsx. Stream Id dabf8af4-9eb6-3374-9a18-d273ed7132e9
Namespace: nsx
TableName: GenericPolicyRealizedResource
2023-08-22T00:11:46.361Z | INFO | main | o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$GenericPolicyRealizedResource id dabf8af4-9eb6-3374-9a18-d273ed7132e9
2023-08-22T00:11:46.362Z | INFO | main | o.c.runtime.view.SMRObject | Added SMRObject [dabf8af4-9eb6-3374-9a18-d273ed7132e9, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"stringId": "/infra/realized-state/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021"}
1 records deleted successfully.
The next command removes the ESX host key from “HostModelMsg” table. When the command has completed we need to ensure that “1 records deleted successfully” in the output otherwise we have not deleted anything. This is noted in bold within code snippet.
corfu_tool_runner.py --tool corfu-browser -o deleteRecord -n nsx -t HostModelMsg --keyToDelete '{"uuid": {"left": "10489905160825487935", "right": "11102041072723471023"} }'
Deleting 1 records in table HostModelMsg and namespace nsx. Stream Id d8120129-1f35-34c2-a309-e5cf6dbe5487
Namespace: nsx
TableName: HostModelMsg
2023-08-22T00:12:20.049Z | INFO | main | o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$HostModelMsg id d8120129-1f35-34c2-a309-e5cf6dbe5487
2023-08-22T00:12:20.050Z | INFO | main | o.c.runtime.view.SMRObject | Added SMRObject [d8120129-1f35-34c2-a309-e5cf6dbe5487, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"uuid": {"left": "10489905160825487935", "right": "11102041072723471023"} }
1 records deleted successfully.
The final command removes the ESX host stringid key from “HostTransportNode” table. When the command has completed we need to ensure that “1 records deleted successfully” in the output otherwise we have not deleted anything. This is noted in bold within code snippet.
corfu_tool_runner.py --tool corfu-browser -o deleteRecord -n nsx -t HostTransportNode --keyToDelete '{"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021" }'
Deleting 1 records in table HostTransportNode and namespace nsx. Stream Id a720622f-68c3-3359-9114-12231645d94e
Namespace: nsx
TableName: HostTransportNode
2023-08-22T00:12:49.897Z | INFO | main | o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$HostTransportNode id a720622f-68c3-3359-9114-12231645d94e
2023-08-22T00:12:49.898Z | INFO | main | o.c.runtime.view.SMRObject | Added SMRObject [a720622f-68c3-3359-9114-12231645d94e, PersistentCorfuTable] to objectCache
0: Deleting record with Key {"stringId": "/infra/sites/default/enforcement-points/default/host-transport-nodes/esxuat3-30b975c9-60a0-4fb7-bfef-96770ee5f240host-10021" }
1 records deleted successfully.
Now that the corfu database entries have been removed we can now restart the proton and corfu database services and check the status to ensure they have started correctly.
root@nsxt:/# service proton restart; service corfu-server restart
root@nsxt:/# service proton status; service corfu-server status
Once all the services have restarted, we can move the ESX host back into the vSphere cluster and allow NSX to prepare the host.
All things going well the host preparation should be successful and back into action for NSX goodness.
Figure 2 – Host Preparation.
Figure 3 – Completed NSX Installation.
Well that’s a wrap for this blog after getting a little side tracked down this rabbit hole. I hope this will help if you get stuck in your labs or at least understand the path GSS would go down in a production situation. Stay tuned for the next one (touch wood) on NSX sub-clusters, and the benefits for Cloud Service Providers. As always keep on NSXing !
Cheers
Tony Williamson