Troubleshooting

Overview

You have followed the steps in Deployment, but something has gone wrong. You’re not sure what went wrong, how to fix it, or what information to collect when raising an issue. Welcome to the Submariner troubleshooting guide, which will help you get your deployment working again.

Basic familiarity with the Submariner components and architecture is helpful when troubleshooting, so please review the Architecture section first.

The guide has been broken into different sections for easy navigation.

Automated Troubleshooting

Use the subctl utility to automate troubleshooting and collecting debugging information.

Install subctl:

curl -Ls https://get.submariner.io | VERSION=<your Submariner version> bash

Set KUBECONFIG to point at your clusters:

export KUBECONFIG=<kubeconfig0 path>:<kubeconfig1 path>

Show an overview of, and diagnose issues with, each cluster:

subctl show all
subctl diagnose all

Diagnose common firewall issues between a pair of clusters:

subctl diagnose firewall inter-cluster --context <localcontext> --remotecontext <remotecontext>

Collect details about an issue you’d like help with:

subctl gather
tar cfz submariner-<timestamp>.tar.gz submariner-<timestamp>

When reporting an issue, it may also help to include the information in the bug-report.md template.
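
The sequence above can be wrapped in a small shell function so that every step runs even when an earlier one fails. This is only a sketch: run_subctl_checks is a hypothetical helper name, it assumes subctl is on the PATH and KUBECONFIG is already exported, and it uses only the three subctl invocations shown above.

```shell
# Hypothetical helper: run the overview, diagnosis and gather steps in order.
# Assumes subctl is installed and KUBECONFIG is already set.
run_subctl_checks() {
  failed=0
  for step in "show all" "diagnose all" "gather"; do
    echo "== subctl $step =="
    # $step is left unquoted on purpose so "show all" becomes two arguments.
    subctl $step || failed=$((failed + 1))
  done
  echo "steps failed: $failed"
  # Return success only when every step succeeded.
  [ "$failed" -eq 0 ]
}
```

Calling run_subctl_checks and redirecting its output to a file gives a single transcript that can be attached to an issue alongside the gathered archive.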

Manual Troubleshooting

Pre-requisite

Before you begin troubleshooting, run subctl version to find out which version of the Submariner components you are running.

Run kubectl get services -n <service-namespace> | grep <service-name> to get information about the service you’re trying to access. This provides the Service name, namespace, and ServiceIP. If Globalnet is enabled, you also need the service’s globalIp, which you can get by running:

kubectl get globalingressip -n <service-namespace> <service-name>

Connectivity Issues

Submariner deployment completed successfully but Services/Pods on one cluster are unable to connect to Services on another cluster. This can be due to multiple factors outlined below.

Check the Connection Statistics

If you are unable to connect to a remote cluster, check its connection status in the Gateway resource.

kubectl describe Gateway -n submariner-operator

Sample output:

   - endpoint:
        backend: libreswan
        cable_name: submariner-cable-cluster1-172-17-0-7
        cluster_id: cluster1
        healthCheckIP: 10.1.128.0
        hostname: cluster1-worker
        nat_enabled: false
        private_ip: 172.17.0.7
        public_ip: ""
        subnets:
        - 100.1.0.0/16
        - 10.1.0.0/16
      latencyRTT:
        average: 447.358µs
        last: 281.577µs
        max: 5.80437ms
        min: 158.725µs
        stdDev: 364.154µs
      status: connected
      statusMessage: Connected to 172.17.0.7:4500 - encryption alg=AES_GCM_16, keysize=128
        rekey-time=13444

The Gateway Engine uses the Health Check IP of the endpoint to verify connectivity. If it cannot reach this IP, the connection status is marked as error and the status message provides more information about the possible cause of the failure. The Gateway resource also reports latency statistics for each connection.
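
When there are many connections, it can help to reduce the output to one line per cable. The following is a minimal sketch, assuming the input consists of key: value lines shaped like the sample output above; summarize_connections is a hypothetical helper, not part of Submariner.

```shell
# Hypothetical helper: print "<cable_name> <status>" for each connection.
# Reads text shaped like the sample Gateway output on stdin.
summarize_connections() {
  awk '
    # Remember the most recent cable name.
    $1 == "cable_name:"        { cable = $2 }
    # A status line with a value closes out one connection entry;
    # a bare "status:" heading (no value) is ignored.
    $1 == "status:" && NF == 2 { print cable, $2 }
  '
}
```

For example, save the Gateway output to a file and run summarize_connections < gateway-output.txt (the file name is illustrative). Any cable whose status is not connected is the one to investigate.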

Service Discovery Issues

If you are able to connect to a remote service by using its ServiceIP or globalIp, but not by its service name, it is a Service Discovery issue.

Service Discovery not working

This is a good time to familiarize yourself with the Service Discovery Architecture if you haven’t already.

Check ServiceExport for your Service

For a Service to be accessible across clusters, you must first export the Service via subctl, which creates a ServiceExport resource. Ensure the ServiceExport resource exists and check if its status condition indicates ‘Exported’. Otherwise, its status condition will indicate the reason it wasn’t exported.

kubectl describe serviceexport -n <service-namespace> <service-name>

Note that you can also use the shorthand svcex for serviceexport and svcim for serviceimport.

Sample output:

Name:         nginx-demo
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  multicluster.x-k8s.io/v1alpha1
Kind:         ServiceExport
Metadata:
  Creation Timestamp:  2020-11-25T06:21:01Z
  Generation:          1
  Resource Version:    5254
  Self Link:           /apis/multicluster.x-k8s.io/v1alpha1/namespaces/default/serviceexports/nginx-demo
  UID:                 77509e43-8fd1-4173-805c-e03c4581ebbf
Status:
  Conditions:
    Last Transition Time:  2020-11-25T06:21:01Z
    Message:
    Reason:
    Status:                True
    Type:                  Valid
    Last Transition Time:  2020-11-25T06:21:01Z
    Message:               Service was successfully synced to the broker
    Reason:
    Status:                True
    Type:                  Synced
Events:                    <none>
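
To scan the conditions quickly, the describe output can be reduced to one Type=Status pair per condition. This sketch assumes the describe layout shown in the sample above, where each condition lists Status before Type; export_conditions is a hypothetical helper name.

```shell
# Hypothetical helper: print "<Type>=<Status>" for each condition.
# Reads "kubectl describe serviceexport" output on stdin.
export_conditions() {
  awk '
    # Condition status lines carry a value; the top-level "Status:" heading does not.
    $1 == "Status:" && NF == 2 { status = $2 }
    # The Type line follows Status within each condition block.
    $1 == "Type:"   && NF == 2 { print $2 "=" status; status = "" }
  '
}
```

Used as kubectl describe serviceexport -n <service-namespace> <service-name> | export_conditions, any line that does not end in =True points at the condition whose Message and Reason are worth reading in full.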

Check Lighthouse CoreDNS Service

All cross-cluster service queries are handled by the Lighthouse CoreDNS server. First, check if the Lighthouse CoreDNS Service is running properly.

kubectl -n submariner-operator get service submariner-lighthouse-coredns

If it is running fine, note down the ServiceIP for the next steps. If not, check the logs for an error.

If the error is due to a wrong image, run kubectl -n submariner-operator get deployment submariner-lighthouse-coredns and make sure Image is set to quay.io/submariner/lighthouse-coredns:<version> and refers to the correct version.

For any other errors, capture the information and raise a new issue.

If there’s no error, then check if the Lighthouse CoreDNS server is configured correctly. Run kubectl -n submariner-operator describe configmap submariner-lighthouse-coredns and make sure it has the following configuration:

    clusterset.local:53 {
        lighthouse
        errors
        health
        ready
    }

To enable debug logs in the Lighthouse CoreDNS pods, replace errors with debug in the above ConfigMap.
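
The check above can be scripted by listing the plugins configured in the clusterset.local:53 server block. This is a rough sketch: zone_plugins is a hypothetical helper, and it assumes a flat server block like the one shown, without nested braces inside plugin directives.

```shell
# Hypothetical helper: list the plugin names inside one Corefile server block.
# $1 = server block header, e.g. "clusterset.local:53"; Corefile text on stdin.
zone_plugins() {
  awk -v zone="$1" '
    # Opening line of the requested block.
    $1 == zone && $2 == "{" { inside = 1; next }
    # Closing brace ends the block.
    inside && $1 == "}"     { inside = 0 }
    # Every non-empty line inside the block starts with a plugin name.
    inside && NF > 0        { print $1 }
  '
}
```

For example, kubectl -n submariner-operator describe configmap submariner-lighthouse-coredns | zone_plugins clusterset.local:53 should list lighthouse among the plugins; seeing debug instead of errors confirms that debug logging is switched on.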

Check CoreDNS Configuration

Submariner requires the CoreDNS deployment to forward requests for the domain clusterset.local to the Lighthouse CoreDNS server in the cluster making the query. Ensure this configuration exists and is correct.

Start by inspecting the CoreDNS ConfigMap in the cluster making the query:

kubectl -n kube-system describe configmap coredns

In the output look for something like this:

    clusterset.local:53 {
        forward . <lighthouse-coredns-serviceip> ======> ServiceIP of lighthouse-coredns service as noted in previous section
    }

If the entry shown above is missing or the ServiceIP is incorrect, CoreDNS wasn’t configured correctly. You can fix it by running kubectl -n kube-system edit configmap coredns and making the changes manually. You may need to repeat this step on every cluster.
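
Comparing the forward target with the Lighthouse ServiceIP by eye is error-prone, so the comparison can be scripted. The helpers below are a hypothetical sketch (forward_target and check_forward are illustrative names) and assume a flat clusterset.local:53 block like the one shown above.

```shell
# Hypothetical helper: print the forward target configured in one server block.
# $1 = server block header; Corefile text on stdin.
forward_target() {
  awk -v zone="$1" '
    $1 == zone && $2 == "{"   { inside = 1; next }
    inside && $1 == "}"       { inside = 0 }
    # "forward . <ip>": the third field is the upstream address.
    inside && $1 == "forward" { print $3 }
  '
}

# Hypothetical helper: compare the clusterset.local forward target with the
# expected Lighthouse ServiceIP ($1). Corefile text on stdin.
check_forward() {
  expected=$1
  actual=$(forward_target clusterset.local:53)
  if [ "$actual" = "$expected" ]; then
    echo "ok: clusterset.local forwards to $actual"
  else
    echo "mismatch: forward target is '${actual:-<none>}', expected '$expected'"
    return 1
  fi
}
```

One way to use it is to fetch the expected address with kubectl -n submariner-operator get service submariner-lighthouse-coredns -o jsonpath='{.spec.clusterIP}' and then pipe kubectl -n kube-system describe configmap coredns into check_forward with that address as its argument.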

Check `submariner-lighthouse-agent`

Next, check if the submariner-lighthouse-agent is running properly. Run kubectl -n submariner-operator get pods submariner-lighthouse-agent and check the status of the Pods.

If the status indicates the ImagePullBackOff error, run kubectl -n submariner-operator describe deployment submariner-lighthouse-agent and check if Image is set correctly to quay.io/submariner/lighthouse-agent:<version>. If it is and the same error still occurs, raise an issue or ping us on the community Slack channel.

If the status indicates any other error, run kubectl -n submariner-operator get pods to get the name of the lighthouse-agent Pod. Then run kubectl -n submariner-operator logs <lighthouse-agent-pod-name> to get the logs. See if there are any errors in the log. If yes, raise an issue with the log contents, or you can continue reading through this guide to troubleshoot further.

If there are no errors, grep the log for the name of the service that you’re trying to query, as the log entries may be needed later when raising an issue.

Check ServiceImport resources

If the steps above did not indicate an issue, next check if the ServiceImport resources were properly created for the service you’re trying to access.

Run kubectl get serviceimports --all-namespaces | grep <your-service-name> on the Broker cluster to check if a resource was created for your service. If not, check the Lighthouse Agent logs on the cluster where the service was created and look for any error or warning messages indicating a failure to create the ServiceImport resource for your service. The most common error is Forbidden, which occurs when RBAC wasn’t configured correctly. RBAC should have been configured for you by the deployment method used, whether subctl or Helm, so create an issue with the relevant log entries.

If the ServiceImport resource was created correctly on the Broker cluster, the next step is to check if it exists on the cluster where you’re trying to access the service. The ServiceImport should exist in the service’s namespace with the same name as the service. If it doesn’t exist, check the logs of the Lighthouse Agent on the cluster where you are trying to access the service. As described earlier, this is most commonly an RBAC issue, so create an issue with the relevant log entries.
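
The same lookup has to be repeated on the Broker cluster and on the consuming cluster, so it can be convenient to wrap it in a function that takes a kubeconfig context. This is a sketch under assumptions: serviceimport_present is a hypothetical helper, and the context names in the usage note are placeholders for your own.

```shell
# Hypothetical helper: report whether any ServiceImport mentions a service.
# $1 = kubeconfig context to query, $2 = service name to look for.
serviceimport_present() {
  if kubectl --context "$1" get serviceimports --all-namespaces | grep -F -q -- "$2"; then
    echo "$1: ServiceImport for $2 found"
  else
    echo "$1: ServiceImport for $2 MISSING"
    return 1
  fi
}
```

For example, running serviceimport_present <broker-context> <your-service-name> followed by serviceimport_present <consumer-context> <your-service-name> shows at which hop the resource stops propagating, which tells you which cluster’s Lighthouse Agent logs to read first.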

Check EndpointSlice resources

If the ServiceImport resources are correct, next we check if the EndpointSlice resources were properly created for the service you’re trying to access. Run kubectl get endpointslices --all-namespaces -l multicluster.kubernetes.io/service-name=<your-service-name> on the Broker cluster to check if a resource was created for your Service. If not, then check the Lighthouse Agent logs on the cluster where the Service was created and look for any error or warning messages indicating a failure to create the EndpointSlice resource for your Service. The most common error is Forbidden if the RBAC wasn’t configured correctly. This is supposed to be done automatically during deployment so please file an issue with the relevant log entries.

If the EndpointSlice resource was created correctly on the Broker cluster, the next step is to check if it exists on the cluster where you’re trying to access the Service. The EndpointSlice should exist in the service’s namespace. If it doesn’t exist, check the logs of the Lighthouse Agent on the cluster where you are trying to access the Service. As described earlier, this is most commonly an RBAC issue, so create an issue with the relevant log entries.

If the EndpointSlice resource was created properly on the cluster, run kubectl -n <your-service-namespace> describe endpointslice <your-endpointslice-name> and check that it has the correct endpoint addresses and that the Ready condition of each endpoint is true:

Name:         nginx-ss-cluster2
Namespace:    default
Labels:       endpointslice.kubernetes.io/managed-by=lighthouse-agent.submariner.io
              lighthouse.submariner.io/sourceCluster=cluster2
              lighthouse.submariner.io/sourceName=nginx-ss
              lighthouse.submariner.io/sourceNamespace=default
              multicluster.kubernetes.io/service-name=nginx-ss-default-cluster2
Annotations:  <none>
AddressType:  IPv4
Ports:
  Name  Port  Protocol
  ----  ----  --------
  web   80    TCP
Endpoints:
  - Addresses:  10.242.0.5  -----> Pod IP
    Conditions:
      Ready:    true
    Hostname:   web-0   -----> Pod hostname
    Topology:   kubernetes.io/hostname=cluster2-worker2
  - Addresses:  10.242.224.4
    Conditions:
      Ready:   true
    Hostname:  web-1
    Topology:  kubernetes.io/hostname=cluster2-worker
Events:        <none>

For a non-headless service, the EndpointSlice will contain a single endpoint referencing the service’s cluster IP address.

If the Addresses are correct but are still not being returned from DNS queries, try querying the IPs in a specific cluster by prefixing the query with <cluster-id>. If that returns the IPs correctly, check the connectivity to that cluster using subctl show endpoint. The Lighthouse CoreDNS server only returns IPs from connected clusters.
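
The cluster-prefixed query can be easier to get right with a small helper that builds the DNS name. This sketch assumes the usual Lighthouse naming scheme of <service>.<namespace>.svc.clusterset.local, with the cluster ID prepended for a cluster-specific query; clusterset_name is a hypothetical helper name.

```shell
# Hypothetical helper: build a clusterset.local DNS name.
# $1 = service name, $2 = namespace, $3 = optional cluster ID prefix.
clusterset_name() {
  if [ -n "${3:-}" ]; then
    printf '%s.%s.%s.svc.clusterset.local\n' "$3" "$1" "$2"
  else
    printf '%s.%s.svc.clusterset.local\n' "$1" "$2"
  fi
}
```

From a Pod in the querying cluster, comparing a lookup of the name printed by clusterset_name nginx-ss default with the one printed by clusterset_name nginx-ss default cluster2 shows whether only the unprefixed query is failing.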

For errors querying specific Pods of a StatefulSet, check that the Hostname is correct for the endpoint.

If it is still not working, file an issue with the relevant log entries.

Configuring TTL

The default TTL for Lighthouse DNS requests is 5 seconds. This can be customized via a dns.ttl setting in a ConfigMap named submariner-lighthouse-coredns in the submariner-operator namespace. This ConfigMap does not exist by default and must be manually created:

apiVersion: v1
kind: ConfigMap
metadata:
  name: submariner-lighthouse-coredns
  namespace: submariner-operator
data:
  dns.ttl: "2" # Specify value in seconds