Losing jobs due to network interruptions #393

@natefoo

I had a couple of jobs get "lost" (stuck in the running state) this way. In the case of 66751002, the job was submitted and ran, but a network error occurred during postprocessing and no state files were left behind in the {manager}-*-jobs dirs (#354 is relevant here as well). The last message logged for this job was:

2025-04-15 15:23:23,518 INFO  [pulsar.client.staging.down][[manager=vgp_jetstream2]-[action=postprocess]-[job=66751002]] collecting output database.dmnd with action FileAction[path=/corral4/main/objects/6/4/0/dataset_640edda5-f2c0-4209-bb9c-3f801701a638.dat,action_type=rem>

In the case of 66751004, the job did not finish preprocessing, and there was a {manager}-preprocessing-jobs file for the job. This was the last message:

2025-04-15 15:22:51,993 DEBUG [pulsar.managers.staging.pre][[manager=vgp_jetstream2]-[action=preprocess]-[job=66751004]] Staging input 'dataset_98dadf00-5182-4fbb-a07e-9f9ca6210985.dat' via FileAction[path=/corral4/main/objects/9/8/d/dataset_98dadf00-5182-4fbb-a07e-9f9ca62>

Although nothing else was logged and no exceptions were raised for either of these jobs, clear network issues were recorded for other jobs within the next few minutes:

2025-04-15 15:26:17,569 INFO  [pulsar.managers.util.retry][[manager=vgp_jetstream2]-[action=postprocess]-[job=66783218]] Failed to execute staging out file /jetstream2/scratch/main/jobs-vgp/66783218/outputs/dataset_94265413-e9a8-4ff1-b12a-776d0948c293.dat via FileAction[pa>
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/sentry_sdk/integrations/stdlib.py", line 128, in getresponse
    rv = real_getresponse(self, *args, **kwargs)
  File "/usr/lib64/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/lib64/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.9/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/request.py", line 82, in perform
    resp = requests.patch(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/api.py", line 145, in patch
    return request("patch", url, data=data, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/util/retry.py", line 93, in _retry_over_time
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/managers/staging/post.py", line 82, in <lambda>
    self.action_executor.execute(lambda: action.write_from_path(pulsar_path), description)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/action_mapper.py", line 513, in write_from_path
    tus_upload_file(self.url, pulsar_path)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/pulsar/client/transport/tus.py", line 32, in tus_upload_file
    uploader.upload()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 45, in upload
    self.upload_chunk()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 59, in upload_chunk
    self._do_request()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 88, in _do_request
    self._retry_or_cry(error)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 102, in _retry_or_cry
    raise error
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/uploader/uploader.py", line 85, in _do_request
    self.request.perform()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/tusclient/request.py", line 92, in perform
    raise TusUploadFailed(error)
tusclient.exceptions.TusUploadFailed: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

And a publisher error:

2025-04-15 15:27:19,837 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] Acknowledging UUID 1ec6cd9c-1a0e-11f0-99d0-005056bc743e on queue setup_ack
2025-04-15 15:27:19,841 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Begin publishing to key pulsar_vgp_jetstream2__setup_ack
2025-04-15 15:27:19,842 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Have producer for publishing to key pulsar_vgp_jetstream2__setup_ack
2025-04-15 15:27:19,844 ERROR [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Connection error while publishing: TimeoutError(110, 'Connection timed out')
Traceback (most recent call last):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/kombu/connection.py", line 556, in _ensured
    return fun(*args, **kwargs)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/kombu/messaging.py", line 208, in _publish
    return channel.basic_publish(
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/channel.py", line 1791, in _basic_publish
    self.connection.drain_events(timeout=0)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/connection.py", line 526, in drain_events
    while not self.blocking_read(timeout):
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/connection.py", line 531, in blocking_read
    frame = self.transport.read_frame()
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/transport.py", line 294, in read_frame
    frame_header = read(7, True)
  File "/srv/pulsar/main/venv/lib64/python3.9/site-packages/amqp/transport.py", line 574, in _read
    s = recv(n - len(rbuf))  # see note above
  File "/usr/lib64/python3.9/ssl.py", line 1135, in read
    return self._sslobj.read(len)
TimeoutError: [Errno 110] Connection timed out
2025-04-15 15:27:19,856 INFO  [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Retrying in 0 seconds
2025-04-15 15:27:20,150 DEBUG [pulsar.client.amqp_exchange][consume-setup-amqp://main_pulsar:********@mq.galaxyproject.org:5671//main_pulsar?ssl=1] [publish:1ee11152-1a0e-11f0-a95b-fa163ed650e8] Published to key pulsar_vgp_jetstream2__setup_ack

I have to assume that these network errors are what caused the two jobs to be lost.
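
For anyone trying to find other jobs that went silent in the same way, here is a rough sketch of a log scan that records the last message seen for each job ID. It only assumes the line format shown in the excerpts above (timestamp, level, logger, then a `[manager=...]-[action=...]-[job=...]` context). The function names and the `max_idle_seconds` threshold are hypothetical and not part of Pulsar. A job that completed normally also stops logging, so the result is only a list of candidates to cross-check against the jobs Galaxy still considers running.

```python
import re
from datetime import datetime

# Matches job-scoped lines as they appear in the excerpts above, e.g.
# "2025-04-15 15:23:23,518 INFO  [logger][[manager=x]-[action=postprocess]-[job=123]] ..."
# Assumption: every job-scoped line carries this bracketed context.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} "
    r"(?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\]"
    r".*?\[action=(?P<action>\w+)\]-\[job=(?P<job>\d+)\]\]"
)


def last_message_per_job(lines):
    """Return {job_id: (timestamp, action, logger)} for the last line seen per job."""
    last = {}
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            # Traceback lines and non-job lines (e.g. AMQP messages) are skipped.
            continue
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S")
        last[m["job"]] = (ts, m["action"], m["logger"])
    return last


def silent_jobs(lines, now, max_idle_seconds=1800):
    """Return sorted job IDs whose last log line is older than max_idle_seconds."""
    last = last_message_per_job(lines)
    return sorted(
        job
        for job, (ts, _action, _logger) in last.items()
        if (now - ts).total_seconds() > max_idle_seconds
    )
```

Calling `silent_jobs(open(path), datetime.now())` on a Pulsar log file would list the candidate job IDs, and `last_message_per_job` shows whether each one stopped in preprocess or postprocess.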
