Create Inactive Mailbox Status Using Heartbeats #404

Open
jz1909 wants to merge 3 commits into academy-agents:main from jz1909:issue-366-pt2

@jz1909 jz1909 commented May 1, 2026

Contributor

Summary

Adds a new mailbox status, Inactive, so that we can distinguish mailboxes that are alive and actively listening from mailboxes that have not been terminated but are idle.

A mailbox is inactive when its last heartbeat was more than 2 minutes ago.

Related Issues

Relates to Issue #366

Changes

  • Breaking (backwards incompatible changes to public interfaces)
  • Bug fix (non-breaking change which fixes an issue)
  • [!] Enhancement (non-breaking change or feature addition)
  • Refactor (internal code or design clean up)
  • Documentation (no changes to the code)
  • [!] Test (changes or additions to testing)
  • Build (change to CI workflows or build processes)
  • Package (changes to package metadata or dependency versions)

Testing

transport_test.py, backend_test.py

Pull Request Checklist

Please confirm the PR meets the following requirements.

  • [!] Relevant tags are added based on the types of changes.
  • [!] Code changes pass pre-commit (e.g., ruff, mypy, etc.).
  • [!] Tests have been added to show the fix is effective or that the new feature works.
  • [!] New and existing unit tests pass locally with the changes.
  • [!] Docs have been updated and reviewed if relevant.

Comment thread academy/exchange/transport.py Outdated
from academy.identifier import AgentT

# Mailbox is inactive after no recorded activity for 2 minutes.
HEARTBEAT_STALE_THRESHOLD: float = 120

Contributor

Can we make this configurable, i.e. a parameter to most of the exchange factories? For the HttpExchangeFactory and the GlobusExchangeFactory I don't think that will work; I think we'll have to put it in exchange/cloud/config.py.

I also think this is probably better expressed as a number of heartbeats. That leaves less room for configuration errors like "I set my heartbeat at 60s and my threshold at 30s".
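
A minimal sketch of the heartbeat-count idea, assuming the threshold is derived as the heartbeat interval times a number of periods. The function name and the validation rules are hypothetical and not taken from the PR:

```python
def stale_threshold_s(heartbeat_interval_s: float, stale_periods: int) -> float:
    """Derive the staleness threshold from a count of missed heartbeats.

    Hypothetical helper: because the threshold is a multiple of the
    interval, it can never be shorter than one heartbeat period.
    """
    if heartbeat_interval_s <= 0:
        raise ValueError('heartbeat_interval_s must be positive')
    if stale_periods < 1:
        raise ValueError('stale_periods must be at least 1')
    return heartbeat_interval_s * stale_periods
```

With a 60s heartbeat and 4 periods this yields a 240s threshold, and a threshold shorter than the heartbeat interval can no longer be configured.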

@AK2000 AK2000 changed the title Issue 366 pt2 Create Inactive Status Based on Missed Heartbeats May 1, 2026
@AK2000 AK2000 changed the title Create Inactive Status Based on Missed Heartbeats Create Inactive Mailbox Status Using Heartbeats May 1, 2026
@AK2000 AK2000 added the enhancement New features or improvements to existing functionality label May 1, 2026

AK2000 commented May 1, 2026

Contributor

Close #366

@jz1909 jz1909 force-pushed the issue-366-pt2 branch 2 times, most recently from 934ce84 to a047e2c Compare May 10, 2026 08:48
@jz1909 jz1909 requested a review from AK2000 May 10, 2026 08:55
Comment thread academy/exchange/cloud/client.py Outdated
host=host,
port=port,
logger=LogConfig(level=level),
heartbeat_stale_threshold_s=heartbeat_stale_threshold_s,

Contributor

Is this used anywhere? If we want it to be configurable per client, we could have it be passed as a parameter to the status endpoint of the app?

Contributor Author

Fixed in the latest commit: it is now a parameter of the status endpoint, and the value from the config is kept as a fallback.

Contributor Author

Regarding the question about why it's used: I think this would be helpful if someone wanted to change the threshold globally via the config. create_app calls backend = config.backend.get_backend(heartbeat_stale_threshold_s=config.heartbeat_stale_threshold_s), which would then propagate that change to the transport.

However, if they decide not to set it, the default value in def __init__(self, message_size_limit_kb: int = 1024, heartbeat_stale_threshold_s: float = DEFAULT_THRESHOLD_S) will take over.
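
The fallback order described above (status endpoint parameter, then config, then default) could be sketched as follows. The helper name resolve_threshold_s and the value assigned to DEFAULT_THRESHOLD_S are assumptions for illustration; only the constant's name appears in the thread:

```python
from typing import Optional

# Assumed value; the thread names the constant but does not show its value.
DEFAULT_THRESHOLD_S = 120.0


def resolve_threshold_s(
    request_value: Optional[float],
    config_value: Optional[float],
) -> float:
    """Pick the staleness threshold, most specific source first."""
    if request_value is not None:
        # Per-request override passed to the status endpoint.
        return request_value
    if config_value is not None:
        # Global value set in the server config.
        return config_value
    # Neither was set, so the constructor default takes over.
    return DEFAULT_THRESHOLD_S
```

Comparing against None rather than relying on truthiness keeps an explicit value of 0 from silently falling through to the next source.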

Comment thread academy/exchange/client.py Outdated
async def _heartbeat_loop(self) -> None:
    heartbeat_interval: int = 60
    # Only runs for local exchange
    if not hasattr(self._transport, 'heartbeat_interval_s'):

Contributor

It looks like the local exchange has this property, so what is this trying to catch?

Contributor Author

I meant that this check skips cloud exchanges, so heartbeat_interval = self._transport.heartbeat_interval_s only runs for non-cloud exchanges.
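
One way to read the guard being discussed, sketched with getattr instead of hasattr. The helper name and both stand-in transport classes are hypothetical, and returning None to mean "skip the heartbeat loop" is an assumption about the intended behavior:

```python
from typing import Optional


class LocalTransportSketch:
    """Stand-in for a non-cloud transport that exposes the interval."""

    heartbeat_interval_s = 60.0


class CloudTransportSketch:
    """Stand-in for a cloud transport without the attribute."""


def heartbeat_interval_for(transport: object) -> Optional[float]:
    """Return the transport's interval, or None if it does not define one."""
    interval = getattr(transport, 'heartbeat_interval_s', None)
    return None if interval is None else float(interval)
```

A caller would return early from the loop when the result is None, which matches the described intent of reading the interval only for non-cloud exchanges.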

Comment thread academy/exchange/redis.py Outdated
return MailboxStatus.TERMINATED
else:
return MailboxStatus.ACTIVE
last_heartbeat = await self.heartbeat_status(uid)

Contributor

Hmm, I'm trying to think about how we minimize the repeated code. It seems every transport follows the same pattern, and we create heartbeat_interval_s inside every transport. What if heartbeat_stale_periods instead became an attribute of the client? Inside the client's status method, which used to just route the call to the transport, we would do a status check and then a conditional heartbeat check. That way we write this logic just once.

Contributor Author

Hmm, I thought about this for a bit. While it is true that we should minimize repeated code, I don't think resolving INACTIVE directly in the client's status makes the most sense.

My concern is that someone, now or in the future, might call a transport's status directly and act on the return value (not an issue yet, since nothing depends on INACTIVE). If we advertise each transport's status as returning the full range of MailboxStatus, which is how it appears, that would be misleading, because the transport's status would never actually return all four values.

Maybe we could write a helper function in transport.py and use it in each transport to minimize the resolution logic, while keeping the guarantee that each transport's status returns what it's supposed to?

Contributor

What if we moved the entire "status" abstraction out of the transport? Then the transport would only be responsible for implementing heartbeat_status, and the client would turn that time or error into "MISSING", "ACTIVE", "INACTIVE" or "TERMINATED". This would avoid a redundant call to the transport and minimize the implementation of each transport.
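
A rough sketch of that split, under stated assumptions: the transport exposes only heartbeat_status, which either returns the last heartbeat timestamp or raises, and the client maps the outcome to one of the four statuses. The exception classes, FakeTransport, resolve_status, and the 240s default are all hypothetical stand-ins, not the project's actual API:

```python
import asyncio
import enum
from typing import Dict, Iterable


class MailboxStatus(enum.Enum):
    MISSING = 'MISSING'
    ACTIVE = 'ACTIVE'
    INACTIVE = 'INACTIVE'
    TERMINATED = 'TERMINATED'


class MissingMailbox(Exception):
    """Stand-in error: the mailbox was never registered."""


class TerminatedMailbox(Exception):
    """Stand-in error: the mailbox was terminated."""


class FakeTransport:
    """In-memory transport that implements only heartbeat_status."""

    def __init__(self, heartbeats: Dict[str, float], terminated: Iterable[str] = ()) -> None:
        self._heartbeats = dict(heartbeats)
        self._terminated = set(terminated)

    async def heartbeat_status(self, uid: str) -> float:
        if uid in self._terminated:
            raise TerminatedMailbox(uid)
        if uid not in self._heartbeats:
            raise MissingMailbox(uid)
        return self._heartbeats[uid]


async def resolve_status(
    transport: FakeTransport,
    uid: str,
    *,
    now: float,
    stale_after_s: float = 240.0,
) -> MailboxStatus:
    """Client-side mapping from a heartbeat time or error to a status."""
    try:
        last_heartbeat = await transport.heartbeat_status(uid)
    except MissingMailbox:
        return MailboxStatus.MISSING
    except TerminatedMailbox:
        return MailboxStatus.TERMINATED
    if now - last_heartbeat > stale_after_s:
        return MailboxStatus.INACTIVE
    return MailboxStatus.ACTIVE
```

Each status check then costs a single transport call, and a new transport only has to provide heartbeat_status.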

@AK2000 AK2000 left a comment

Contributor

This looks good! I think the main thing is moving some fields into exchange/client.py instead of each transport, but the logic is there and I do like the way this change has improved status checking.

Comment thread academy/exchange/hybrid.py Outdated
status = await self._redis_client.get(self._status_key(uid))
if status is None:
    raise BadEntityIdError(uid)
elif status.decode() == _MailboxState.INACTIVE.value:

Contributor

Wait, I don't think this is right. Even if a mailbox is inactive, we should be able to terminate that mailbox (i.e. stop new messages from being sent to it)

Contributor Author

What happened is that the _MailboxState enum was named confusingly from the start: for example, terminate() changes the transport's private status to INACTIVE. So while this looks semantically wrong, the logic is actually right, because we raise TerminatedError on this INACTIVE value, which is actually supposed to mean that the mailbox was terminated.

I'm going to change _MailboxState to just ACTIVE and TERMINATED, since its purpose as a private status is only to distinguish between those two. It should now make sense semantically.
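
The proposed rename could look roughly like this. The enum string values, the check_mailbox helper, and the two exception classes are local stand-ins written for illustration (the exception names are borrowed from the thread, not imported from the project):

```python
import enum
from typing import Optional


class _MailboxState(enum.Enum):
    """Private stored state; only distinguishes live from terminated."""

    ACTIVE = 'ACTIVE'
    TERMINATED = 'TERMINATED'


class BadEntityIdError(Exception):
    """Local stand-in for the error raised on an unknown mailbox."""


class TerminatedError(Exception):
    """Local stand-in for the error raised on a terminated mailbox."""


def check_mailbox(raw_status: Optional[bytes], uid: str) -> None:
    """Raise if the stored status says the mailbox is unknown or terminated."""
    if raw_status is None:
        raise BadEntityIdError(uid)
    if raw_status.decode() == _MailboxState.TERMINATED.value:
        raise TerminatedError(uid)
```

With this naming, the branch that raises TerminatedError compares against TERMINATED, so the private state no longer collides with the public INACTIVE status.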

Comment thread academy/exchange/hybrid.py Outdated
redis_host: str,
redis_port: int,
*,
heartbeat_stale_periods: int = 4,

Contributor

Now that "status" is implemented in the client, I think heartbeat_stale_periods should be an attribute of the client too.

Comment thread academy/exchange/hybrid.py Outdated
redis_port: int,
*,
heartbeat_stale_periods: int = 4,
heartbeat_interval_s: float = 60,

Contributor

And since the heartbeat loop is a method in client, I think this attribute should be in client as well
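
A small sketch of what keeping both settings on the client might look like, with the heartbeat loop reading the interval from the client itself. SketchClient, RecordingTransport, record_heartbeat, and the stop event are hypothetical names, and the defaults simply mirror the values shown in the diff:

```python
import asyncio


class RecordingTransport:
    """Hypothetical stand-in transport that counts heartbeats."""

    def __init__(self) -> None:
        self.beats = 0

    async def record_heartbeat(self) -> None:
        self.beats += 1


class SketchClient:
    """Client that owns the heartbeat settings instead of the transport."""

    def __init__(
        self,
        transport: RecordingTransport,
        *,
        heartbeat_interval_s: float = 60.0,
        heartbeat_stale_periods: int = 4,
    ) -> None:
        self._transport = transport
        self.heartbeat_interval_s = heartbeat_interval_s
        self.heartbeat_stale_periods = heartbeat_stale_periods

    async def _heartbeat_loop(self, stop: asyncio.Event) -> None:
        while not stop.is_set():
            await self._transport.record_heartbeat()
            try:
                # Sleep one interval, waking early if asked to stop.
                await asyncio.wait_for(stop.wait(), timeout=self.heartbeat_interval_s)
            except asyncio.TimeoutError:
                pass


async def demo() -> int:
    transport = RecordingTransport()
    client = SketchClient(transport, heartbeat_interval_s=0.01)
    stop = asyncio.Event()
    task = asyncio.create_task(client._heartbeat_loop(stop))
    await asyncio.sleep(0.05)
    stop.set()
    await task
    return transport.beats
```

Because the transport no longer carries heartbeat_interval_s, the loop does not need to inspect the transport to decide how often to beat.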

@jz1909 jz1909 force-pushed the issue-366-pt2 branch 2 times, most recently from d969900 to 7e3e9b5 Compare June 10, 2026 02:49
…alls, made update_heartbeat in client no longer a noop
