You have followed the steps in Deployment, but something has gone wrong. You’re not sure what went wrong, how to fix it, or what information to collect to raise an issue. Welcome to the Submariner troubleshooting guide, where we will help you get your deployment working again.
Basic familiarity with the Submariner components and architecture will be helpful when troubleshooting, so please review the Architecture section.
The guide has been broken into different sections for easy navigation.
Before we begin troubleshooting, run subctl version to find out which version of the Submariner components you are running.
Run kubectl get services -n <service-namespace> | grep <service-name> to get information about the service you’re trying to access. This will provide you with the Service Name, Namespace and ServiceIP. If Globalnet is enabled, you will also need the globalIp of the service, which you can get by running
kubectl get service <service-name> -n <service-namespace> -o jsonpath='{.metadata.annotations.submariner\.io/globalIp}'
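The two lookups above can be wrapped into one helper so the details are ready to paste into an issue. This is only a sketch: the function name service_info is hypothetical, and it assumes kubectl is already pointed at the cluster that hosts the Service.

```shell
# Hypothetical helper: print the Service details this guide asks for.
# Assumes kubectl's current context is the cluster hosting the Service.
service_info() {
  local namespace="$1" name="$2" cluster_ip global_ip
  # The ServiceIP is the cluster IP of the Service.
  cluster_ip="$(kubectl get service -n "$namespace" "$name" \
    -o jsonpath='{.spec.clusterIP}')"
  # The globalIp annotation is only expected when Globalnet is enabled.
  global_ip="$(kubectl get service -n "$namespace" "$name" \
    -o jsonpath='{.metadata.annotations.submariner\.io/globalIp}')"
  printf 'service=%s namespace=%s serviceIP=%s globalIp=%s\n' \
    "$name" "$namespace" "$cluster_ip" "${global_ip:-none}"
}

# Usage: service_info <service-namespace> <service-name>
```

When the annotation is absent, the helper prints globalIp=none rather than an empty field.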
Submariner deployment completed successfully but Services/Pods on one cluster are unable to connect to Services on another cluster. This can be due to multiple factors outlined below.
If you are unable to connect to a remote cluster, check its connection status in the Gateway resource.
kubectl describe Gateway -n submariner-operator
Sample output:
- endpoint:
    backend: libreswan
    cable_name: submariner-cable-cluster1-172-17-0-7
    cluster_id: cluster1
    healthCheckIP: 10.1.128.0
    hostname: cluster1-worker
    nat_enabled: false
    private_ip: 172.17.0.7
    public_ip: ""
    subnets:
    - 100.1.0.0/16
    - 10.1.0.0/16
  latencyRTT:
    average: 447.358µs
    last: 281.577µs
    max: 5.80437ms
    min: 158.725µs
    stdDev: 364.154µs
  status: connected
  statusMessage: Connected to 172.17.0.7:4500 - encryption alg=AES_GCM_16, keysize=128
    rekey-time=13444
This feature is only available for non-Globalnet deployments at the moment.
The Gateway Engine uses the Health Check IP of the endpoint to verify connectivity. The connection Status will be marked as error if it cannot reach this IP, and the Status Message will provide more information about the possible failure reason. It also provides the statistics for the connection.
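With several remote clusters, it can help to reduce the Gateway output to one line per connection. The sketch below assumes the YAML field names shown in the sample output (cluster_id and status); the function name gateway_connection_summary is hypothetical.

```shell
# Hypothetical helper: read Gateway YAML on stdin and print one
# "<cluster_id> <status>" line per connection.
# Assumes the field names shown in the sample output (cluster_id, status).
gateway_connection_summary() {
  awk '
    $1 == "cluster_id:" { cluster = $2 }
    # NF == 2 skips a bare "status:" key that has no inline value.
    $1 == "status:" && NF == 2 && cluster != "" { print cluster, $2; cluster = "" }
  '
}

# Usage (assumes the YAML output uses the same field names as the sample):
#   kubectl get Gateway -n submariner-operator -o yaml | gateway_connection_summary
```

Any connection whose status is not connected is a candidate for the Status Message inspection described above.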
If you are able to connect to a remote service by using its ServiceIP or globalIp, but not by its service name, it is a Service Discovery issue.
This is a good time to familiarize yourself with the Service Discovery Architecture if you haven’t already.
For a Service to be accessible across clusters, you must first export the Service via subctl, which creates a ServiceExport resource.
Ensure the ServiceExport resource exists and check if its status condition indicates Exported. Otherwise, its status condition will indicate the reason it wasn’t exported.
kubectl describe serviceexport -n <service-namespace> <service-name>
Note that you can also use the shorthand svcex for serviceexport and svcim for serviceimport.
Sample output:
Name:         nginx-demo
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  multicluster.x-k8s.io/v1alpha1
Kind:         ServiceExport
Metadata:
  Creation Timestamp:  2020-11-25T06:21:01Z
  Generation:          1
  Resource Version:    5254
  Self Link:           /apis/multicluster.x-k8s.io/v1alpha1/namespaces/default/serviceexports/nginx-demo
  UID:                 77509e43-8fd1-4173-805c-e03c4581ebbf
Status:
  Conditions:
    Last Transition Time:  2020-11-25T06:21:01Z
    Message:               Awaiting sync of the ServiceImport to the broker
    Reason:                AwaitingSync
    Status:                False
    Type:                  Valid
    Last Transition Time:  2020-11-25T06:21:01Z
    Message:               Service was successfully synced to the broker
    Reason:
    Status:                True
    Type:                  Valid
Events:                    <none>
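To read only the latest condition instead of the full describe output, a JSONPath query can be used. This is a sketch: the function name serviceexport_last_condition is hypothetical, and it assumes the newest condition is the last entry in the list, as in the sample output.

```shell
# Hypothetical helper: print the most recent ServiceExport condition as
# "<type><TAB><status><TAB><message>".
# Assumes the newest condition is the last entry in .status.conditions.
serviceexport_last_condition() {
  local namespace="$1" name="$2"
  kubectl get serviceexport -n "$namespace" "$name" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}' \
    | tail -n 1
}

# Usage: serviceexport_last_condition <service-namespace> <service-name>
```

A status of False in the printed line means the message column carries the reason the Service has not been exported yet.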
All cross-cluster service queries are handled by the Lighthouse CoreDNS server. First we check if the Lighthouse CoreDNS Service is running properly.
kubectl -n submariner-operator get service submariner-lighthouse-coredns
If it is running fine, note down the ServiceIP for the next steps. If not, check the logs for an error.
If the error is due to a wrong image, run kubectl -n submariner-operator get deployment submariner-lighthouse-coredns and make sure Image is set to quay.io/submariner/lighthouse-coredns:<version> and refers to the correct version.
For any other errors, capture the information and raise a new issue.
If there’s no error, then check if the Lighthouse CoreDNS server is configured correctly. Run kubectl -n submariner-operator describe configmap submariner-lighthouse-coredns and make sure it has the following configuration:
clusterset.local:53 {
    lighthouse
    errors
    health
    ready
}
Submariner requires the CoreDNS deployment to forward requests for the domain clusterset.local to the Lighthouse CoreDNS server in the cluster making the query. Ensure this configuration exists and is correct.
First we check the CoreDNS configuration in the cluster making the query.
kubectl -n kube-system describe configmap coredns
In the output look for something like this:
clusterset.local:53 {
    forward . <lighthouse-coredns-serviceip>    ======> ServiceIP of lighthouse-coredns service as noted in previous section
}
If the entries shown above are missing or the ServiceIP is incorrect, it means CoreDNS wasn’t configured correctly. It can be fixed by running kubectl -n kube-system edit configmap coredns and making the changes manually. You may need to repeat this step on every cluster.
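The comparison between the configured forward target and the Lighthouse CoreDNS ServiceIP can also be scripted. The sketch below uses a hypothetical helper, clusterset_forward_ip, that extracts the forward target from a Corefile; the usage comments assume the CoreDNS ConfigMap keeps its configuration under the Corefile key, which may not hold on every distribution.

```shell
# Hypothetical helper: read a Corefile on stdin and print the forward target
# configured for clusterset.local (prints nothing if the block is missing).
clusterset_forward_ip() {
  awk '
    $1 == "clusterset.local:53" { in_block = 1; next }
    in_block && $1 == "forward" { print $3; exit }
    in_block && $1 == "}"       { in_block = 0 }
  '
}

# Usage (assumes the ConfigMap stores its configuration under "Corefile"):
#   configured="$(kubectl -n kube-system get configmap coredns \
#     -o jsonpath='{.data.Corefile}' | clusterset_forward_ip)"
#   expected="$(kubectl -n submariner-operator get service \
#     submariner-lighthouse-coredns -o jsonpath='{.spec.clusterIP}')"
#   [ "$configured" = "$expected" ] && echo "forward OK" || echo "forward mismatch"
```

The helper only looks inside the clusterset.local block, so the default forward entry of the main server block is ignored.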
Next we check if the submariner-lighthouse-agent is running properly. Run kubectl -n submariner-operator get pods | grep submariner-lighthouse-agent and check the status of the Pods.
If the status indicates the ImagePullBackOff error, run kubectl -n submariner-operator describe deployment submariner-lighthouse-agent and check if Image is set correctly to quay.io/submariner/lighthouse-agent:<version>. If it is and the same error still occurs, raise an issue or ping us on the community Slack channel.
If the status indicates any other error, run kubectl -n submariner-operator get pods to get the name of the lighthouse-agent Pod. Then run kubectl -n submariner-operator logs <lighthouse-agent-pod-name> to get the logs. See if there are any errors in the log. If there are, raise an issue with the log contents, or continue reading through this guide to troubleshoot further.
If there are no errors, grep the log for the service name that you’re trying to query as we may need the log entries later for raising an issue.
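Finding the Pod and filtering its log can be combined into one step. This is a sketch under assumptions: the function name agent_log_for_service is hypothetical, and it assumes a single agent Pod whose name contains lighthouse-agent.

```shell
# Hypothetical helper: print the Lighthouse Agent log lines that mention a
# given service name.
# Assumes one agent Pod whose name contains "lighthouse-agent".
agent_log_for_service() {
  local service="$1" pod
  # "-o name" prints entries such as pod/<pod-name>, which "logs" accepts.
  pod="$(kubectl -n submariner-operator get pods -o name \
    | grep lighthouse-agent | head -n 1)"
  [ -n "$pod" ] || { echo "no lighthouse-agent Pod found" >&2; return 1; }
  # -F treats the service name as a fixed string, not a pattern.
  kubectl -n submariner-operator logs "$pod" | grep -F -- "$service"
}

# Usage: agent_log_for_service <service-name> > lighthouse-agent.log
```

Redirecting the output to a file keeps the matching entries at hand for a later issue report.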
If the steps above did not indicate an issue, next we check if the ServiceImport resources were properly created for the service you’re trying to access. The format of a ServiceImport resource’s name is as follows:
<service-name>-<service-namespace>-<cluster-id>
Run kubectl get serviceimports --all-namespaces | grep <your-service-name> on the Broker cluster to check if a resource was created for your service. If not, check the Lighthouse Agent logs on the cluster where the service was created and look for any error or warning messages indicating a failure to create the ServiceImport resource for your service. The most common error is Forbidden, which occurs if the RBAC wasn’t configured correctly. Depending on the deployment method used, subctl or helm, it should have been configured for you, so create an issue with the relevant log entries.
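The naming format makes it possible to look for the exact ServiceImport rather than grepping for a partial match. A minimal sketch, with hypothetical function names, assuming kubectl is pointed at the Broker cluster:

```shell
# Build the ServiceImport name from the format given above:
# <service-name>-<service-namespace>-<cluster-id>
serviceimport_name() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: succeed only if a ServiceImport with exactly the
# expected name is listed. With --all-namespaces the NAME is column 2.
serviceimport_exists() {
  local expected
  expected="$(serviceimport_name "$1" "$2" "$3")"
  kubectl get serviceimports --all-namespaces \
    | awk -v name="$expected" '$2 == name { found = 1 } END { exit !found }'
}

# Usage:
#   serviceimport_exists nginx-demo default cluster2 && echo "found" || echo "missing"
```

The same check can be repeated on the cluster that consumes the service by switching the kubectl context.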
If the ServiceImport resource was created correctly on the Broker cluster, the next step is to check if it exists on the cluster where you’re trying to access the service. Follow the same steps as earlier to get the list of the ServiceImport resources and check if the ServiceImport for your service exists. If not, check the logs of the Lighthouse Agent on the cluster where you are trying to access the service. As described earlier, it will most commonly be an issue with RBAC. Otherwise, create an issue with the relevant log entries.
If the ServiceImport resource was created properly on the cluster, run
kubectl -n submariner-operator describe serviceimport <your-serviceimport-name>
and check if it has the correct ClusterID and ServiceIP:
Name:         nginx-demo-default-cluster2
Namespace:    submariner-operator
Labels:       lighthouse.submariner.io/sourceCluster=cluster2
              lighthouse.submariner.io/sourceName=nginx-demo
              lighthouse.submariner.io/sourceNamespace=default
              submariner-io/clusterID=cluster2
Annotations:  cluster-ip: 100.2.33.171
              origin-name: nginx-demo
              origin-namespace: default
API Version:  multicluster.x-k8s.io/v1alpha1
Kind:         ServiceImport
Metadata:
  Creation Timestamp:  2020-11-25T06:21:02Z
  Generation:          1
  Resource Version:    5312
  Self Link:           /apis/multicluster.x-k8s.io/v1alpha1/namespaces/submariner-operator/serviceimports/nginx-demo-default-cluster2
  UID:                 a4c4abe0-1c84-4118-ae09-760b26f7fe3c
Spec:
  Ips:
    100.2.33.171
  Ports:
  Session Affinity Config:
  Type:  ClusterSetIP
Events:  <none>
For a headless Service, you need to check the EndpointSlice resource instead.
If the data is not correct, you can manually edit the ServiceImport resource to set the correct IP as a workaround and create an issue with relevant information.
If the ServiceImport Ips are correct but still not being returned from DNS queries, check the connectivity to the cluster using subctl show endpoint. The Lighthouse CoreDNS Server only returns IPs from connected clusters.
For a headless Service, next we check if the EndpointSlice resources were properly created for the service you’re trying to access. EndpointSlice resources are created in the same namespace as the source Service. The format of an EndpointSlice resource’s name is as follows:
<service-name>-<cluster-id>
Run kubectl get endpointslices --all-namespaces | grep <your-service-name> on the Broker cluster to check if a resource was created for your Service. If not, check the Lighthouse Agent logs on the cluster where the Service was created and look for any error or warning messages indicating a failure to create the EndpointSlice resource for your Service. The most common error is Forbidden, which occurs if the RBAC wasn’t configured correctly. This is supposed to be done automatically during deployment, so please file an issue with the relevant log entries.
If the EndpointSlice resource was created correctly on the Broker cluster, the next step is to check if it exists on the cluster where you’re trying to access the Service. Follow the same steps as earlier to get the list of the EndpointSlice resources and check if the EndpointSlice for the Service exists. If not, check the logs of the Lighthouse Agent on the cluster where you are trying to access the Service. As described earlier, it will most commonly be an issue with RBAC, so create an issue with the relevant log entries.
If the EndpointSlice resource was created properly on the cluster, run
kubectl -n <your-service-namespace> describe endpointslice <your-endpointslice-name>
and check if it has the correct endpoint addresses:
Name:         nginx-ss-cluster2
Namespace:    default
Labels:       endpointslice.kubernetes.io/managed-by=lighthouse-agent.submariner.io
              lighthouse.submariner.io/sourceCluster=cluster2
              lighthouse.submariner.io/sourceName=nginx-ss
              lighthouse.submariner.io/sourceNamespace=default
              multicluster.kubernetes.io/service-name=nginx-ss-default-cluster2
Annotations:  <none>
AddressType:  IPv4
Ports:
  Name  Port  Protocol
  ----  ----  --------
  web   80    TCP
Endpoints:
  - Addresses:  10.242.0.5      -----> Pod IP
    Conditions:
      Ready:    true
    Hostname:   web-0           -----> Pod hostname
    Topology:   kubernetes.io/hostname=cluster2-worker2
  - Addresses:  10.242.224.4
    Conditions:
      Ready:    true
    Hostname:   web-1
    Topology:   kubernetes.io/hostname=cluster2-worker
Events:         <none>
If the Addresses are correct but still not being returned from DNS queries, try querying IPs in a specific cluster by prefixing the query with <cluster-id>. If that returns the IPs correctly, then check the connectivity to the cluster using subctl show endpoint. The Lighthouse CoreDNS Server only returns IPs from connected clusters.
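To make the cluster-prefixed query concrete, the sketch below builds the name to look up. It assumes exported Services resolve as <service>.<namespace>.svc.clusterset.local, which this guide does not state explicitly; the function name clusterset_dns_name is hypothetical.

```shell
# Hypothetical helper: build the DNS name to query for an exported Service.
# Assumes the layout <service>.<namespace>.svc.clusterset.local, with an
# optional <cluster-id> prefix that targets a single cluster.
clusterset_dns_name() {
  local service="$1" namespace="$2" cluster_id="${3:-}"
  if [ -n "$cluster_id" ]; then
    printf '%s.%s.%s.svc.clusterset.local\n' "$cluster_id" "$service" "$namespace"
  else
    printf '%s.%s.svc.clusterset.local\n' "$service" "$namespace"
  fi
}

# Usage, from a Pod that has a DNS lookup tool available:
#   nslookup "$(clusterset_dns_name nginx-ss default cluster2)"
```

If the prefixed name resolves while the plain name does not, the connection to that cluster is the next thing to check.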
For errors querying specific Pods of a StatefulSet, check that the Hostname is correct for the endpoint.
If it is still not working, file an issue with the relevant log entries.