Investigation: live cluster cronjob connectivity issues 



## Background
For some time we have seen large numbers of cronjobs failing on our live cluster nodes because of DNS or network issues. The pattern is that jobs which need to reach an endpoint (external, or inside the cluster as in the example below) fail on connection or name-resolution timeouts. DNS seems the most likely candidate, but we can't rule out wider network issues. We've also noticed that this seems to happen on newly created nodes: jobs start and fail in the early hours of the morning on newly recycled nodes. This might be unrelated (for example, jobs may simply get scheduled quickly onto new nodes), but it needs looking into.

Example of failed job log:


```
hmpps-audit-dev/queue-housekeeping-cronjob-28545640-zjd2r:housekeeping

curl: (7) Failed to connect to hmpps-audit-api port 80 after 5022 ms: Couldn't connect to server   
```
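The example above carries curl exit code 7 (failed to connect), while the Kibana searches later in this issue look for `Could not resolve host`, which curl reports as exit code 6. To see which failure type dominates, exported log lines could be tallied by exit code. The following is a minimal sketch, assuming log lines are available as plain strings; the function names and category labels are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Subset of curl's documented exit codes that matter for this investigation.
# The category labels on the right are our own shorthand, not curl terminology.
CURL_EXIT_CODES = {
    6: "dns",       # could not resolve host
    7: "connect",   # failed to connect to host
    28: "timeout",  # operation timed out
}

# Matches the "curl: (N)" prefix that curl prints with its error message.
CURL_ERROR_RE = re.compile(r"curl: \((\d+)\)")


def classify_curl_error(line: str) -> Optional[str]:
    """Return a failure category for a log line, or None if it has no curl error."""
    match = CURL_ERROR_RE.search(line)
    if match is None:
        return None
    return CURL_EXIT_CODES.get(int(match.group(1)), "other")


def summarise(lines: Iterable[str]) -> Counter:
    """Count curl failures per category across many log lines."""
    categories = (classify_curl_error(line) for line in lines)
    return Counter(c for c in categories if c is not None)
```

Running `summarise` over the log lines of the affected namespaces would give a rough split between resolution failures and connection failures, which may help decide whether DNS really is the main suspect.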

There's a chance that this points to a deeper, more serious issue that we need to investigate and understand. Users have reported network-related job failures like [this one here](https://mojdt.slack.com/archives/C57UPMZLY/p1712654735504779).

It's also possible that these issues are user config problems.

## What we've done so far:

- We have tested recycling a node and observed jobs failing in this way in real time.

- We have set up a debug job that runs every minute and executes a verbose `curl` command (showing DNS lookup details). It is called `test-dns-*` and runs in namespace `jaskaran-dev`. This job is also failing on connection timeouts, as can be seen via this [kibana query](https://kibana.cloud-platform.service.justice.gov.uk/_plugin/kibana/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-7d,to:now))&_a=(columns:!(_source),filters:!(),index:'167701b0-f8c0-11ec-b95c-1d65c3682287',interval:auto,query:(language:kuery,query:'kubernetes.namespace_name:%20%22jaskaran-dev%22%20and%20kubernetes.container_name:%20%22test-dns%22%20and%20log:%20%22Could%20not%20resolve%20host%22'),sort:!()))

- Across the whole cluster, the error `Could not resolve host` appears to be exclusive to Job spec containers; see [kibana here](https://kibana.cloud-platform.service.justice.gov.uk/_plugin/kibana/app/discover#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30d,to:now))&_a=(columns:!(kubernetes.container_name),filters:!(),index:'167701b0-f8c0-11ec-b95c-1d65c3682287',interval:auto,query:(language:kuery,query:'%20%22Could%20not%20resolve%20host%22'),sort:!())).

This view also shows a pattern: these job connection issues happen mostly between 00:00 and 06:00 daily. This overlaps with the node recycle window (00:00-03:00), but it is not yet clear why failures continue until 06:00.
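To put numbers on the time-of-day pattern, failure timestamps exported from Kibana could be bucketed by hour and compared against the recycle window. This is only a sketch: it assumes the timestamps are ISO 8601 strings with an explicit UTC offset, and the sample values in the usage section are invented, not real log data.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, List


def failures_by_hour(timestamps: Iterable[str]) -> List[int]:
    """Return a 24-element list: number of failures seen in each hour of the day."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return [hours.get(hour, 0) for hour in range(24)]


def share_in_window(counts: List[int], start_hour: int, end_hour: int) -> float:
    """Fraction of all failures with start_hour <= hour < end_hour."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[start_hour:end_hour]) / total


if __name__ == "__main__":
    # Invented timestamps, for illustration only.
    sample = [
        "2024-04-09T01:15:00+00:00",
        "2024-04-09T02:40:00+00:00",
        "2024-04-09T05:05:00+00:00",
        "2024-04-09T14:00:00+00:00",
    ]
    counts = failures_by_hour(sample)
    print("share in recycle window 00:00-03:00:", share_in_window(counts, 0, 3))
    print("share in 00:00-06:00:", share_in_window(counts, 0, 6))
```

Comparing the two shares on real data would show how much of the 00:00-06:00 cluster falls outside the recycle window and therefore still needs explaining.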

## What else we should do:

- Add more debugging to the existing test job, e.g. `traceroute`, `dig`, `nslookup`.

- Create another test job that curls an internal cluster service endpoint

- The set of namespaces for which we see Error jobs is not a huge one; we should reach out to some of the owners of these environments to understand what exactly their jobs are doing and whether they already have an idea of what the underlying cause might be.
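For the extra debugging and the internal-endpoint test job, one option is a probe that times name resolution and the TCP connect separately, so a failure can be attributed to DNS or to the network. The sketch below uses only the Python standard library; the `probe` function and its result fields are hypothetical, and the target `hmpps-audit-api:80` is taken from the failed job log above. Note that `socket.getaddrinfo` has no timeout parameter, so the DNS step is bounded only by the resolver configuration.

```python
import socket
import time


def probe(host: str, port: int, timeout: float = 5.0) -> dict:
    """Resolve host, then open a TCP connection, timing each stage separately."""
    result = {"host": host, "port": port, "stage": "ok",
              "dns_ms": None, "connect_ms": None, "error": None}

    # Stage 1: name resolution only.
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["dns_ms"] = round((time.monotonic() - start) * 1000, 1)
        result.update(stage="dns", error=str(exc))
        return result
    result["dns_ms"] = round((time.monotonic() - start) * 1000, 1)

    # Stage 2: TCP connect to the first resolved address.
    family, socktype, proto, _canonname, sockaddr = infos[0]
    start = time.monotonic()
    try:
        with socket.socket(family, socktype, proto) as sock:
            sock.settimeout(timeout)
            sock.connect(sockaddr)
    except OSError as exc:
        result.update(stage="connect", error=str(exc))
    result["connect_ms"] = round((time.monotonic() - start) * 1000, 1)
    return result


if __name__ == "__main__":
    # Service name and port as seen in the failed job log.
    print(probe("hmpps-audit-api", 80))
```

A result with `stage` set to `dns` would point at resolution, while `connect` with a small `dns_ms` would suggest the name resolved and the connection itself failed. Logging this once a minute from a job on a freshly recycled node should make the two cases easy to separate in Kibana.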

## Helpful links
K8s DNS Debugging:
https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/





## Reference

[How to write good user stories](https://www.gov.uk/service-manual/agile-delivery/writing-user-stories)

