An unknown/unhealthy cluster status is almost always due to the k8s collector being blocked from accessing the required endpoints.
1. The first and easiest check is to confirm that the network can reach the containers-api.edge.cloudhealthtech.com/v1/containers/ and api.cloudhealthtech.com endpoints on port 443. Both use HTTPS and do not use WebSocket.
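As a quick sanity check from any machine on the affected network, you can verify that a TLS connection on port 443 can be opened to both hosts (a minimal sketch assuming openssl is installed on that machine):
# Confirm a TCP/TLS handshake succeeds against each endpoint on port 443
openssl s_client -connect containers-api.edge.cloudhealthtech.com:443 < /dev/null
openssl s_client -connect api.cloudhealthtech.com:443 < /dev/null
# A successful handshake prints the server certificate chain; a firewall block
# typically hangs or returns "connection refused"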
2. The second check is to run two different curls from the deployed Kubernetes collector pod to confirm connectivity to our endpoints. These MUST be run from inside the collector pod, not from elsewhere in the cluster.
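If you do not know the collector pod's name, you can list the pods first (a sketch assuming the collector runs in the default namespace and its pod name contains "cloudhealth"):
# List pods and look for the collector pod name
kubectl get pods --namespace default | grep -i cloudhealth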
Enter the pod with:
kubectl exec -i --tty <pod-name> -- sh
Run the first curl to our health endpoint:
curl -v -X GET https://containers-api.edge.cloudhealthtech.com/api/v1/health
The expected response:
{"status":"healthy","time":"Fri, 29 Jan 2021 22:48:10 GMT"}
Run the second curl to mock the exact request made by the Kubernetes collector (except without any k8s data cache payload). Replace the auth_token and the cluster_id as necessary:
- The auth_token is our platform's auto-generated token you used to deploy the collector.
- The cluster_id is the name you set for the cluster.
curl --header "Content-Type: application/json" --request POST --data '{"auth_token":"INSERT_AUTHENTICATION_TOKEN_HERE","cluster_id":"INSERT_CLUSTER_ID_HERE"}' https://containers-api.edge.cloudhealthtech.com/v1/containers/kubernetes/state
The expected response (since we sent no payload):
{"result":201}
If either of these curls does not give the expected response, refer back to check 1 to ensure the required endpoints are whitelisted on your network.
3. While we do not officially support proxy configuration, defining a JAVA_OPTS environment variable on the collector container has worked as well:
env:
  - name: JAVA_OPTS
    value: >-
      -Dhttp.proxyHost=<PROXY> -Dhttp.proxyPort=8989
      -Dhttp.nonProxyHosts=kubernetes.default.svc -Dhttps.proxyHost=<PROXY>
      -Dhttps.proxyPort=8989 -Dhttps.nonProxyHosts=kubernetes.default.svc
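One way to apply this without editing the manifest by hand is kubectl set env (a sketch assuming the collector runs as a Deployment named cloudhealth-collector in the default namespace):
# Add the proxy settings to the collector Deployment; the pod restarts with the new env
kubectl set env deployment/cloudhealth-collector --namespace default \
  JAVA_OPTS='-Dhttp.proxyHost=<PROXY> -Dhttp.proxyPort=8989 -Dhttp.nonProxyHosts=kubernetes.default.svc -Dhttps.proxyHost=<PROXY> -Dhttps.proxyPort=8989 -Dhttps.nonProxyHosts=kubernetes.default.svc'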
4. If the customer is still having trouble, you can request the pod logs with the following command (this does not require you to be inside the pod):
kubectl logs --namespace default <pod-name>
Some important things to look for in the logs are the defined containers API endpoint, the agent version, and the cluster UID (a quick way to surface these lines is sketched below).
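To narrow down a long log to those items, you can grep for them (a sketch; the exact log wording varies by collector version, so these search terms are assumptions):
# Pull only the lines likely to mention the API endpoint, agent version, and cluster UID
kubectl logs --namespace default <pod-name> | grep -i -E 'endpoint|version|uid'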
Containers API endpoint: If you change the CHT_REGION variable (setting variables is the first step in the deployment instructions, so you can refer back to that guide) to anything other than us-east-1, the collector will fail to connect. For example, one customer changed the variable to us-west-2, and the logs show the error where the collector cannot connect to the us-west-2 endpoint (the Containers team will eventually add support for other regions).
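To confirm the value actually set on the running pod, you can print the variable directly (a sketch assuming CHT_REGION is exposed as a plain environment variable on the collector container):
# Print the region the collector was configured with; it should be us-east-1
kubectl exec --namespace default <pod-name> -- sh -c 'echo $CHT_REGION'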
The agent version: Be sure to check the agent version. We always recommend the most recent one, which you can confirm at https://hub.docker.com/r/cloudhealth/container-collector/tags.
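To see which image tag the cluster is actually running, you can read it off the pod spec and compare it against the tags published on Docker Hub (a sketch assuming the default namespace):
# Show the image (including tag) used by the collector pod
kubectl get pod <pod-name> --namespace default -o jsonpath='{.spec.containers[*].image}'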
The cluster UID: If you don’t see the cluster UID defined, then have the customer run the upgrade command (this is for Helm-deployed collectors; manually deployed collectors should refer to the manual install guide):
helm upgrade cloudhealth-collector cloudhealth/cloudhealth-collector
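Before upgrading, it can help to refresh the local chart index so the latest chart version is picked up, and to confirm the release afterwards (a sketch assuming the chart repo was added under the cloudhealth alias, as the chart reference above implies):
# Refresh the local chart index, run the upgrade, then confirm the release state
helm repo update
helm upgrade cloudhealth-collector cloudhealth/cloudhealth-collector
helm status cloudhealth-collector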