Skip to content

fix: JSON-RPC protocol compliance + protocol tests#988

Merged
ping-ke merged 5 commits into
upgrade/py313-baselinefrom
claude-review-3
Jun 12, 2026
Merged

fix: JSON-RPC protocol compliance + protocol tests#988
ping-ke merged 5 commits into
upgrade/py313-baselinefrom
claude-review-3

Conversation

@ping-ke

@ping-ke ping-ke commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

Brings the HTTP and WebSocket JSON-RPC servers into spec compliance for malformed input,
notifications, and batch requests, and adds a protocol test suite. Also hardens the JSON-RPC client
to validate response ids.

Related Issue

#985

Changes

  • jsonrpc.py (HTTP/WS __handle): return -32700 Parse error on malformed JSON instead of
    swallowing the error and dispatching an empty dict. Drop the dead d = dict() fallback and guard
    non-dict payloads with isinstance.
  • jsonrpc_server.py (dispatch): suppress responses for notifications (requests with no
    "id"), even on the error paths. Batch (JSON array) requests are intentionally still unsupported
    but now rejected cleanly with -32600 instead of crashing with an AttributeError / HTTP 500.
  • jsonrpc_client.py: JsonRpcClient / AsyncJsonRpcClient now raise JsonRpcError(-32600)
    when the response id doesn't match the request id; also fixes a missing trailing newline.
  • tests/test_jsonrpc.py: add TestJSONRPCProtocol — parse error, batch rejection,
    notification-no-response, error-uses-HTTP-200, and a success baseline (standalone master-free
    server on port 38391, raw POST for malformed/array bodies).

Files

quarkchain/cluster/jsonrpc.py, quarkchain/cluster/jsonrpc_server.py,
quarkchain/jsonrpc_client.py, quarkchain/cluster/tests/test_jsonrpc.py

- HTTP/WS __handle: return -32700 Parse error on malformed JSON instead
  of swallowing it and dispatching an empty dict; drop the now-dead
  d=dict() fallback and guard non-dict payloads with isinstance
- dispatch: suppress responses for notifications (no "id"), even on error
- batch (JSON array) left intentionally unsupported: rejected cleanly with
  -32600 rather than crashing with an AttributeError/HTTP 500
- add TestJSONRPCProtocol: parse error, batch rejection, notification
  no-response, error-uses-HTTP-200, and success baseline (standalone
  master-free server on 38391, raw POST for malformed/array bodies)
Comment thread quarkchain/jsonrpc_client.py Outdated
Comment thread quarkchain/cluster/jsonrpc_server.py Outdated
Comment thread quarkchain/cluster/tests/test_jsonrpc.py
@ping-ke ping-ke requested a review from qzhodl June 11, 2026 07:25
Comment thread quarkchain/jsonrpc_client.py
@ping-ke ping-ke requested a review from qzhodl June 11, 2026 08:18

@blockchaindevsh blockchaindevsh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Fable:

Review: PR #988 — fix: JSON-RPC protocol compliance + protocol tests

Verdict up front: Approve. All fixes are spec-correct, all prior reviewer comments are resolved at head 2afc221, and CI is green.

Overview

This PR (4 commits, base upgrade/py313-baseline) hardens the home-grown JSON-RPC layer that replaced the old jsonrpcserver 3.5.4 dependency. It fixes three classes of spec violations and adds a protocol-level test suite:

  1. Malformed JSON — the HTTP server previously swallowed the json.loads failure (d = dict() fallback) and dispatched an empty dict, producing a misleading -32600 Invalid Request; the WS server raised InvalidParams, which escaped the async for
    loop and killed the connection, silently dropping every subscription on it (via the finally cleanup at jsonrpc.py:1527). Both now return a proper -32700 Parse error.
  2. JSON arrays / scalars over HTTP — the base code called d.get("method", ...) on whatever json.loads returned, so a batch array crashed with AttributeError → HTTP 500. Now guarded with isinstance and rejected cleanly with -32600.
  3. Notifications — dispatch() suppressed responses only on the success path; error paths still returned bodies. Per JSON-RPC 2.0, errors are now suppressed too.
  4. Client hardening — JsonRpcClient/AsyncJsonRpcClient now validate that the response id matches the request id.

Correctness analysis

Each change was verified against the base (7f695c9) and head; the local working tree is byte-identical to the PR head for all four files.

  • jsonrpc.py:523-533 (HTTP __handle) — parse error returns {"jsonrpc":"2.0","error":{-32700},"id":null} with HTTP 200. Correct per spec (id must be null when it can't be determined). The isinstance(d, dict) guard at line 535 fixes the real
    AttributeError/500 on arrays; non-dict payloads flow into dispatch(), whose pre-existing check (jsonrpc_server.py:95-96) yields a clean -32600.
  • jsonrpc.py:1488-1496 (WS __handle) — sends -32700 and continues instead of raising, preserving the connection and its sub_ids state — the most user-visible fix. The isinstance guards on method/msg_id (lines 1497, 1503) prevent the same
    AttributeError for array/scalar frames over WS.
  • jsonrpc_server.py:94-101, 141-156 (dispatch) — is_notification is hoisted above validation and both except branches return None for notifications. Two subtleties, both fine: non-dict payloads raise InvalidRequest before is_notification is
    set, so the client still gets the -32600 body; and an object lacking "id" with a bad jsonrpc version gets no response — spec-ambiguous, but matches what jsonrpcserver did. logger.exception(...) is correctly placed before the notification
    early-return (line 152), so crashing fire-and-forget handlers still leave a server-side trace.
  • jsonrpc_client.py:31-34, 63-66 — the id check sits after the "error" check (the duplicate flagged in an earlier review comment was removed), so server errors carrying "id": null are no longer masked by a bogus "id mismatch". Verified in
    both sync and async clients.
  • No caller regressions: WS dispatch callers already handle None responses, and the HTTP handler's asyncio.shield wrapping is untouched.

All four prior review comments from qzhodl (error-before-id ordering ×2, logging before notification return, missing WS test) are addressed at head.

Code quality

Minor nits, none blocking:

  • logger.exception("... %s", locals().get("method", "")) works but is unidiomatic — pre-initializing method = "" before the try would be cleaner.
  • The triple isinstance checks in WS __handle are repetitive; an early non-dict rejection would read better.
  • Arrays are valid JSON-RPC (batches), so a "Batch requests not supported" message would be friendlier than the generic "Request must be object". The no-batch deviation itself is intentional and documented.

Test coverage

Good. TestJSONRPCProtocol covers parse error, batch rejection (no 500), notification-no-response, error-with-HTTP-200, and a success baseline using a standalone master-free server (the func.get binding trick in JSONRPCHttpServer.init
was verified to work). The WS test test_malformed_json_keeps_connection_alive verifies both the -32700 reply and that a subsequent subscribe on the same socket succeeds — exactly the regression that matters. Acceptable gaps:
array/notification frames over WS, explicit "id": null requests. Tests couldn't be run locally (no Python env with project deps), but CI runs them: all 4 checks green on head.

Performance / security

Neutral-to-positive: malformed input no longer produces unhandled exceptions, HTTP 500s, or torn-down WS subscription state — removes a trivial remote nuisance vector. Side effect: unparseable requests no longer increment self.counters, which
is fine since counters only meaningfully tracked parsed methods.

Optional follow-ups (non-blocking): pre-initialized method variable instead of locals().get, a "Batch requests not supported" message for arrays, and possibly HTTP 204 instead of an empty 200 body for notifications.

- Pre-initialize method = "" instead of locals().get() in the
  internal-error log path
- Reject JSON arrays with a clear "Batch requests not supported"
  message rather than the generic "Request must be object"

@blockchaindevsh blockchaindevsh left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Fable:

PR #988 incremental review (2afc2215548bac)

Changes: Only one new commit ("Improve JSON-RPC error messages and logging"), touching only jsonrpc_server.py — a direct response to the two non-blocking suggestions from the previous review:

  1. Pre-initialized method variable (jsonrpc_server.py:95): dispatch now initializes method = "" at the top, and the internal-error log uses it directly, replacing the previously flagged locals().get("method", "") pattern. Verified
    that in that branch method can only be "" or an already-validated string — it will never log None or a stale value.
  2. Explicit batch-request rejection (jsonrpc_server.py:98-99): JSON arrays now return "Batch requests not supported" (error code still -32600, id: null), conforming to what JSON-RPC 2.0 prescribes for an invalid batch. The WS path goes
    through the same dispatch, so it benefits automatically.

Status of prior feedback: Both suggestions have been addressed; the optional "HTTP 204 for notifications" suggestion was not adopted (existing tests pin the 200-with-empty-body behavior, which is acceptable); all of qzhodl's comments are
resolved at head.

CI: All 4 checks green.

Verdict: Approve. Only two purely cosmetic points remain (the log shows an empty-string method when extraction fails; the batch test could also assert the new error message) — neither is blocking.

@ping-ke ping-ke merged commit 69fb04d into upgrade/py313-baseline Jun 12, 2026
4 checks passed
@ping-ke ping-ke deleted the claude-review-3 branch June 12, 2026 01:55
ping-ke added a commit that referenced this pull request Jun 15, 2026
…ding dependencies, asyncio, ethash, jsonrpc, and UPnP. (#978)

* add bench for hashimoto

* resolve comment

* Part 3 - pow: remove pyethash C extension, always use pure Python ethash (#973)

* remove pyethash C extension, use pure Python ethash always

pyethash is a C++ extension that is not compatible with Python 3.13.
The pure Python implementation in ethereum.pow.ethash is sufficient.
Remove the conditional import and always use the Python path, adding
@lru_cache to get_cache_slow for the same performance benefit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* add bench for hashimoto

* add comment for pyethash removal in ethpow.py

Add comment explaining why pyethash C++ acceleration was removed
(not supported on Python 3.13) with link to #976

* resolve comment

* resolve upgrade/ethash comment

* optimize ethash pow verification: struct-based hashing and numpy fnv arrays

ethash_utils.py:
- replace hex-based encode_int/decode_int with struct.pack/unpack for
  serialize_hash and deserialize_hash (~30x faster per call)
- inline ethash_sha3_512/256 to skip intermediate list conversion (~5x faster
  on list input)
- add ethash_sha3_512_np and ethash_sha3_256_np: numpy ndarray variants that
  accept bytes or ndarray and return uint32 ndarray, eliminating tolist()/
  np.array() round-trips in the hot path
- consolidate keccak implementation here; ethash.py no longer duplicates it

ethash.py:
- store cache as 2D numpy uint32 ndarray (shape n x 16) via _get_cache
- use ethash_sha3_512_np/256_np throughout to keep data in ndarray form
- vectorize the 16-element mix update in calc_dataset_item and hashimoto
  inner loop using numpy arithmetic instead of list(map(fnv, ...))
- scalar fnv for cache_index uses plain Python int to avoid numpy scalar overhead

test_ethash.py:
- add TestEthashUtils covering serialize_hash, deserialize_hash, fnv,
  ethash_sha3_512/256 directly against reference implementations

Benchmark: hashimoto_light ~23% faster end-to-end vs pure Python baseline;
serialize_hash/deserialize_hash ~30x faster individually

* clean up ethereum/pow: remove unused code, simplify ethpow.py

ethash_utils.py:
- remove struct, _FMT_16I, _FMT_8I (only served deleted serialize/deserialize_hash)
- remove fnv (only used in tests, not in production path)
- remove ethash_sha3_512 list variant (replaced by numpy ndarray variant)
- remove serialize_hash, deserialize_hash, hash_words, xor, serialize_cache,
  deserialize_cache and related aliases (all replaced by ndarray.tobytes/frombuffer)

ethash.py:
- mkcache: use ethash_sha3_256_np(...).tobytes() directly, drop serialize_hash
- drop serialize_hash import (no longer needed)

ethpow.py:
- remove pyethash C-extension dead code paths (get_cache/hashimoto were always
  equal to get_cache_slow/hashimoto_slow after pyethash removal)
- keep get_cache_slow/hashimoto_slow structure as fallback for future Cython ext

test_ethash.py:
- remove test cases for deleted functions (serialize_hash, deserialize_hash, fnv,
  ethash_sha3_256, ethash_sha3_512 list variant)
- cache/dataset hex comparison uses ndarray.tobytes().hex() directly

bench_before_after.py, bench_hashimoto_compare.py:
- add old/mid/new three-way comparison
- old implementations kept inline for regression reference
- new side imports from current ethash module directly

* rename ethash_sha3_512/256_np -> ethash_sha3_512/256; refactor bench to old/R1/R2 format

* bench_before_after: rename mid -> R1 consistently

* add Cython inner loop for calc_dataset_item (R3): ~20x speedup over numpy

ethash_cy.pyx: typed C loop replacing the 256-iteration FNV parent mixing
in calc_dataset_item. ethash.py auto-imports when built, falls back to pure
Python otherwise. bench_hashimoto_compare.py extended with R3 column.

* gitignore: add Cython generated .c and .pyd files

* add Cython to requirements, update README install docs, add Cython vs fallback test

* add R2/R3/R4 ethash implementations and refactor bench/test

- ethash.py: rewrite with numpy uint32 arrays (R2); add ETHASH_LIB env var
  to select python/cython/auto at runtime
- ethash_cy.pyx: add mix_parents (R3), cy_calc_dataset_item and
  cy_hashimoto_light with C keccak (R4)
- keccak_tiny.c/h: portable C Keccak implementation for Cython R4
- ethpow.py: use ETHASH_LIB-aware hashimoto_light; simplify check_pow/mine
- setup.py: build Cython extension with keccak_tiny.c
- old_ethash.py: extract original hex-based implementation as reference baseline
- bench_hashimoto_compare.py: merge bench_before_after.py; add R3/R4 sections;
  import old impl from old_ethash.py
- test_ethash.py: use old_ethash as baseline for cython correctness test
- remove bench_before_after.py

* add Rust ethash extension and refactor ethash.py dispatch

- Add ethereum/pow/ethash_rs: full Rust implementation of ethash
  (mkcache, hashimoto_light, mix_parents) using PyO3 0.22 + tiny-keccak
- Integrate setuptools-rust into setup.py so build_ext --inplace
  places ethash_rs.so in the source tree alongside ethash_cy.so
- Refactor ethash.py: ETHASH_LIB branches only import functions;
  _get_cache and hashimoto_light defined once with None-check fallback
- Auto-detection order: ethash_rs -> ethash_cy -> pure Python
- Update Dockerfile to install Rust toolchain and build both extensions
- Update README and requirements.txt for Rust/maturin/setuptools-rust
- Add test_rust_matches_python_fallback to verify Rust output

* add R5 (Rust) to bench_hashimoto_compare and fix ethash_rs auto-detect

- bench_hashimoto_compare.py: add R5 section covering rs_mkcache,
  rs_calc_dataset_item, and rs_hashimoto_light; skipped gracefully when
  the Rust extension is not available (same pattern as R3/R4)
- ethash.py: fix auto-detect to check for the expected symbol
  (rs_hashimoto_light / cy_hashimoto_light) after import, preventing the
  ethash_rs/ Cargo source directory (a namespace package) from being
  mistaken for a built extension

* ethash: scope np.errstate, fix little-endian dtype, restore lru_cache(10)

- Replace global np.seterr(over='ignore') with np.errstate(over='ignore')
  scoped to the two FNV multiply sites in calc_dataset_item and hashimoto,
  avoiding unintended suppression of overflow warnings elsewhere
- Use dtype='<u4' (explicit little-endian) in ethash_sha3_512/256 instead
  of native-endian np.uint32, matching Ethereum spec on big-endian hosts
- Restore lru_cache(10): 8 shards can span ~6-7 different epochs
  simultaneously, so 2-3 slots would cause frequent cache eviction
- test_cython_matches_python_fallback and test_rust_matches_python_fallback
  now fail on ImportError instead of skipping

* fix: cast cmix and ndarray to uint32 before tobytes to avoid dtype mismatch

* resolve comments

---------

Co-authored-by: QuarkChain Dev <dev@quarkchain.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Part 4 - asyncio: removes all deprecated patterns and adopts modern asyncio idioms. (#972)

* modernize asyncio usage for Python 3.10+ compatibility

Python 3.10 deprecated the loop= parameter in most asyncio APIs,
and Python 3.12 removed it entirely. This removes all deprecated
patterns across the codebase and adopts modern asyncio idioms.

Changes:
- Remove loop= parameter from asyncio.Event, asyncio.wait,
  asyncio.ensure_future, asyncio.start_server, open_connection, etc.
- Replace asyncio.ensure_future() with asyncio.create_task()
- Replace get_event_loop().run_until_complete() with asyncio.run()
- Replace asyncio.get_event_loop() with asyncio.get_running_loop()
  where inside a running event loop
- Add _get_or_create_event_loop() helper in utils.py for call sites
  that run before the loop starts (SlaveServer, SlaveConnectionManager)
- Replace asyncio.Future used as a one-shot signal with asyncio.Event
  (active_future → active_event, ping_received_future → ping_received_event)
- Make start()/shutdown() methods async where they previously called
  loop.run_until_complete() internally
- Update eth_hash import: BasePreImage → PreImageAPI (eth-hash >=0.5 API)
- Update cryptography API: EllipticCurvePublicNumbers.from_encoded_point
  → EllipticCurvePublicKey.from_encoded_point
- Add conftest.py to cancel pending asyncio tasks between tests
- Replace assertRaisesRegexp (removed in Py 3.12) with assertRaisesRegex

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix bug

* fix asyncio task leaks causing test timeouts

Track and cancel all fire-and-forget asyncio tasks that were leaking
across tests, causing resource exhaustion and timeout failures.

- AbstractConnection: add _loop_task and _handler_tasks tracking;
  use try/finally in active_and_loop_forever to ensure cleanup on
  cancellation; cancel _loop_task in close()
- master.py: track SlaveConnection loop task and __init_cluster task;
  cancel _init_task on shutdown
- slave.py: track MasterConnection, SlaveConnection, PeerShardConnection
  loop tasks and __start_server task
- shard.py: track PeerShardConnection loop task
- miner.py: track and cancel mining task in disable()
- simple_network.py: track Peer loop task and connect_seed task
- test_utils.py: restructure shutdown_clusters with try/finally to
  guarantee task cleanup; await server.wait_closed() for slave servers
- conftest.py: multi-round task cancellation; reset aborted_rpc_count

* change CRLF to LF

* move p2p_server change to update/nat branch

* resolve comment

* move async related changes from update/jsonrpc branch

* resolve comment

* convert sync tests to async using IsolatedAsyncioTestCase

- convert test_cluster.py and test_jsonrpc.py to unittest.IsolatedAsyncioTestCase
- make create_test_clusters/shutdown_clusters async, ClusterContext async context manager
- replace call_async() with await, assert_true_with_timeout with async_assert_true_with_timeout
- replace _get_or_create_event_loop() with asyncio.get_running_loop() in master/slave
- fix mock_pay_native_token_as_gas to support both sync and async wrapped functions
- remove obsolete _get_or_create_event_loop, call_async, assert_true_with_timeout helpers
- fix conftest to restore event loop after IsolatedAsyncioTestCase closes it

* remove quarkchain.jsonrpc_client related change

* resolve comment

* revert 98621ea

* convert sync tests to async using IsolatedAsyncioTestCase

Migrate test classes to IsolatedAsyncioTestCase so that asyncio.create_task
calls in production code (shard_state, root_state) have a running event loop.
- Replace asyncio.ensure_future with asyncio.create_task in production code
- Replace _get_or_create_event_loop with asyncio.get_running_loop in master/slave
- Convert ClusterContext to async context manager
- Convert create_test_clusters / shutdown_clusters to async
- Add fire_and_forget and async_assert_true_with_timeout utilities
- Simplify conftest fixture to only restore event loop after IsolatedAsyncioTestCase teardown

* remove useless code

* resolve comments

---------

Co-authored-by: QuarkChain Dev <dev@quarkchain.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Part 5 - jsonrpc: replace jsonrpcserver/jsonrpcclient with custom implementation (#971)

* replace jsonrpcserver/jsonrpcclient packages with custom implementation

The jsonrpcserver (3.x) and jsonrpcclient (2.x) packages are
incompatible with Python 3.13 and are no longer maintained.

Changes:
- Add quarkchain/cluster/jsonrpcserver.py: lightweight custom
  JSON-RPC 2.0 server built on aiohttp (RpcMethods, JsonRpcError
  hierarchy, request dispatch with positional/named params)
- Add quarkchain/jsonrpc_client.py: synchronous JsonRpcClient
  (uses httpx) and asynchronous AsyncJsonRpcClient (uses aiohttp)
  replacing jsonrpcclient.HTTPClient / aiohttpClient
- Update quarkchain/cluster/jsonrpc.py to use the new RpcMethods
  class; make start_*_server() classmethod async; use get_running_loop()
- Replace jsonrpcclient calls in all tools with JsonRpcClient.call()
- Rename quarkchain/tools/stats → quarkchain/tools/stats.py
- Remove jsonrpcserver config log suppression (no longer needed)
- Update test_jsonrpc.py to use AsyncJsonRpcClient

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix remaining jsonrpc_async usages and import issues in tools

- adjust_difficulty.py: replace bare `import monitoring` with
  `from quarkchain.tools import monitoring` (breaks when run from
  outside the tools directory)
- adjust_difficulty.py: replace jsonrpc_async.Server with
  AsyncJsonRpcClient and use .call() method
- monitoring.py: replace jsonrpc_async.Server with AsyncJsonRpcClient
  and use .call() method; close session via .close()
- jsonrpc_client.py: fix AsyncJsonRpcClient.call signature to use
  *params (variadic) to match JsonRpcClient.call, fixing callers
  that pass positional arguments

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix jsonrpc test failures: params passing and websocket server shutdown

- Fix send_request in test_jsonrpc.py to unpack list params correctly
  instead of double-wrapping them (e.g. [["0x..."]] → ["0x..."])
- Add call_with_dict_params to AsyncJsonRpcClient for named params
- Implement JSONRPCWebsocketServer.shutdown() to actually close the
  server, fixing test isolation hangs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert "fix jsonrpc test failures: params passing and websocket server shutdown"

This reverts commit 9f603e6.

* fix jsonrpc test failures: params passing and websocket server shutdown

* move async related changes to update/asyncio branch

* add rationale comment for httpx dependency choice in jsonrpc_client

* rename jsonrpcserver to jsonrpc_server and fix related imports and issues

- Rename jsonrpcserver.py to jsonrpc_server.py for consistent naming
- Replace armor with asyncio.shield for cancellation protection
- Fix get_running_loop to get_event_loop for compatibility
- Fix subscription.py import from old jsonrpcserver package
- Fix indentation in test_jsonrpc.py and method name in stats.py

* fix tools compatibility with new JsonRpcClient

- Fix snake_case field names to camelCase for JSON-RPC responses
  (network_id->networkId, shard_size->chainSize, block_height->blockHeight,
   contract_address->contractAddress)
- Fix resp.data.result access pattern to direct dict access
- Add exception handling and cli.close() to prevent connection leaks
- Add 0x prefix for address in balance_watcher query_balance

* add unit test for jsonrpc

* remove call_with_dict_params, unify call() for both param types, log internal RPC errors

* resolve comments

---------

Co-authored-by: QuarkChain Dev <dev@quarkchain.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Part 6 - p2p: replace upnpclient with async-upnp-client for NAT port mapping (#970)

* replace upnpclient with async-upnp-client for NAT port mapping

upnpclient uses synchronous I/O and netifaces for interface discovery,
both incompatible or unmaintained under Python 3.13.

Switch to async-upnp-client which provides asyncio-native UPnP/IGD
support. Rewrite UPnPService.discover() to use async_search() for
device discovery and AiohttpSessionRequester for requests, removing
the blocking ThreadPoolExecutor path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix bugs in UPnPService nat.py

- Fix AttributeError: replace undefined self.external_port/self.internal_port/self.protocol with self.port
- Fix _discover timeout: wrap async_search with asyncio.wait_for instead of waiting after it completes
- Fix _run: guard _add_port_mapping with self._service check to avoid AttributeError
- Fix _delete_port_mapping: delete both TCP and UDP mappings to match AddPortMapping
- Remove dead code: _refresh_task and _running fields that were never used

* add __main__ entry point to nat.py for manual UPnP testing

* change CRLF to LF

* add p2p_server change

* resolve upgrade/p2p-nat comment

* replace upnpclient with async-upnp-client for NAT port mapping

upnpclient uses synchronous I/O and netifaces for interface discovery,
both incompatible or unmaintained under Python 3.13.

Switch to async-upnp-client which provides asyncio-native UPnP/IGD
support. Rewrite UPnPService.discover() to use async_search() for
device discovery and AiohttpSessionRequester for requests, removing
the blocking ThreadPoolExecutor path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix bugs in UPnPService nat.py

- Fix AttributeError: replace undefined self.external_port/self.internal_port/self.protocol with self.port
- Fix _discover timeout: wrap async_search with asyncio.wait_for instead of waiting after it completes
- Fix _run: guard _add_port_mapping with self._service check to avoid AttributeError
- Fix _delete_port_mapping: delete both TCP and UDP mappings to match AddPortMapping
- Remove dead code: _refresh_task and _running fields that were never used

* add __main__ entry point to nat.py for manual UPnP testing

* change CRLF to LF

* add p2p_server change

* resolve upgrade/p2p-nat comment

* add unit tests for UPnPService in nat.py

* refactor and expand UPnP NAT tests

- extract shared socket/aiohttp mocks into fixtures
- extract fake_wait helper to reduce duplication
- add tests for edge cases: exception handling, missing service, session cleanup
- fix coroutine never awaited warning in test_run_exits_on_cancel

* add full lifecycle test and extract test constants

* fix UPnP lifecycle bugs: add _cleanup, session safety, and conflict handling

- Override _cleanup() so stop() is called on service shutdown, preventing
  port mapping and aiohttp session leaks when the node exits
- Close any existing session before creating a new one in discover() to
  avoid orphaned sessions on repeated calls
- Handle SOAP error 718 (ConflictInMappingEntry) as success, matching the
  previous best-effort behavior from master
- Roll back partial port mappings if AddPortMapping fails mid-loop
- Import UpnpActionResponseError for typed SOAP error handling

* nat: serialise concurrent discover() calls with asyncio.Lock

Add _discover_lock so that concurrent callers cannot race on _session
and _service state. The lock wraps the entire discover() body, ensuring
each call fully completes (including session cleanup in the finally block)
before the next begins.

* nat: fix async_upnp_client compatibility — traverse embedded devices, fix bool type, deduplicate

- Guard on_response with early return if _service already found (SSDP
  sends one response per device sub-type, so the callback fires ~9x)
- Extract location from CaseInsensitiveDict headers (newer
  async_upnp_client passes headers dict, not a response object)
- Recursively traverse embedded_devices to find WANIPConnection service
  (root device only exposes Layer3Forwarding; WANIPConnection lives in
  the InternetGatewayDevice → WANDevice → WANConnectionDevice subtree)
- Pass NewEnabled=True (bool) instead of 1; async_upnp_client validates
  the UPnP boolean type strictly

* nat: cancel async_search early once WANIPConn service is found

Use asyncio.Event + wait_for so _discover returns as soon as the first
WANIPConn service is found instead of waiting the full 30-second timeout.
Add post-await re-check to guard against concurrent on_response tasks
that all pass the initial self._service guard before any of them completes.

* nat: derive internal IP from router address, not default route

Connect the UDP socket to the UPnP router's IP (parsed from the SSDP
location URL and stored in _router_host) instead of 8.8.8.8, so the OS
routing table selects the interface that actually reaches the router.
Matches the behaviour of the old netifaces-based
find_internal_ip_on_device_network() without the extra dependency.

Tests updated accordingly:
- test_get_internal_ip sets _router_host and asserts connect target
- Add test_get_internal_ip_no_router_host: returns None before discover()
- Add test_add_port_mapping_no_internal_ip: raises RuntimeError when
  _router_host is None
- Fix fake_search to pass dict response so response.get('location') works
- Set fake_device.embedded_devices={} to prevent MagicMock recursion
- Move MOCK_ROUTER_HOST to top constants with clarifying comments
- Remove trivial guard tests (no_service, already_none, exits_on_cancel)

* resolve comment

* resolve comment

---------

Co-authored-by: QuarkChain Dev <dev@quarkchain.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* Part 1 - upgrade: update all dependencies for Python 3.13 compatibility (#974)

* upgrade dependencies for Python 3.13 compatibility

- Bump all packages to versions supporting Python 3.13
- Replace python-rocksdb with rocksdict
- Replace jsonrpcserver + jsonrpcclient with httpx and custom client
- Replace upnpclient + netifaces with async-upnp-client
- Drop pyethash C extension (use pure Python ethash)
- Require Python >=3.13 in setup.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* bring in py3.13-slim Dockerfile from db-rocksdict branch

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* update test docker

* add back pyethash

* remove pyethash

* resolve comment

* add setuptools to requirements, build Cython extension in Dockerfile

* cherry-pick related files from caef64a

* fix C++20 template-id-cdtor warning and keep build tools in Docker image

qkchash_llrb.h: remove <T> from move constructor name — template-id on
constructors is ill-formed in C++20 (was a warning with -std=c++17)

Dockerfile: remove post-build cleanup of gcc/Rust/build-essential so the
test image retains the toolchain needed to rebuild extensions in CI

* update Dockerfile and action build-and-test.yml

* resolve comments

* fix assert

* merge Dockerfile-tiny into Dockerfile, use PREBUILT arg to switch modes

* refactor: use multi-stage build to control PREBUILT mode reliably

* update

* Dockerfile: support arm64 builds

Build libqkchash.so from source in PREBUILT=false so the image works on
any architecture. PREBUILT=true now rejects non-amd64 TARGETARCH since
the S3 artifacts are x86_64-only, and pulls libqkchash.so from S3 next
to the ethash extensions for consistency.

---------

Co-authored-by: QuarkChain Dev <dev@quarkchain.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix missing import

* update bench_hashimoto*

* small change

* bug fix

* Reduce master logging  (#980)

* Reduce logging

* resolve comments

* Add pyethash C extension backend with zero-copy cache and cross-impl tests (#979)

* perf: make pyethash cache arr a zero-copy view to halve memory usage

np.frombuffer without .copy() makes arr a read-only view into raw, so
both share the same ~16MB buffer instead of allocating ~32MB per epoch.
Safe because calc_dataset_item always copies a row before mutating it.
Add comments explaining the design: why no .copy(), and why hashimoto_light
calls _get_pyethash_cache a second time (O(1) LRU hit, not a recomputation).

* bug fix

* test

* add test

* use random epoch instead of epoch 42

* resolve comments

* refactor: replace ethash_cy/ethash_rs with pyethash C extension

- Remove Cython (ethash_cy) and Rust (ethash_rs) ethash backends,
  including source files, build config, and all references
- Add pyethash C extension as the sole accelerated backend
  (ETHASH_LIB=pyethash or auto-detected)
- Add set_ethash_lib() for runtime switching between pyethash and python
- Halve per-epoch cache memory via zero-copy np.frombuffer view in
  _get_pyethash_cache (shares ~16 MB buffer instead of allocating ~32 MB)
- Simplify Dockerfile: remove Rust toolchain build, remove S3 downloads
  for ethash_cy/ethash_rs; pyethash built via pip install (needs gcc)
- Drop Cython, maturin, setuptools-rust from requirements.txt and setup.py
- Rewrite bench_hashimoto_compare: compare pyethash vs pure-Python across
  30 epochs (mkcache + hashimoto_light + check_pow per epoch)
- Update tests: remove ethash_cy/ethash_rs cross-impl tests; simplify
  TestEthashCrossImplEpochs to verify pyethash at large epochs

* perf: cache get_full_size/get_cache_size and pre-bind pyethash fn

get_full_size and get_cache_size use isprime trial division up to
sqrt(~8M) ≈ 2896 iterations per call. Without caching, hashimoto_light
paid ~250us on every call for get_full_size alone (100 calls/epoch ×
250us = 25ms/epoch overhead).

- Add @lru_cache(256) to get_full_size and get_cache_size in ethash_utils
- Pre-bind _pyethash_call = _pyethash_mod.hashimoto_light at module load
  to avoid per-call attribute lookup
- Replace ETHASH_LIB string comparison with _pyethash_call is not None
  (pointer check) in mkcache and hashimoto_light hot paths
- Keep _pyethash_call in sync inside set_ethash_lib

Expected improvement: hashimoto_light ~0.9ms -> ~0.7ms per call

* refactor: split ethash into per-impl functions, fix set_ethash_lib

- Replace single mkcache/hashimoto_light with impl-specific variants:
  mkcache_pyethash / mkcache_python and
  hashimoto_light_pyethash / hashimoto_light_python
- Public mkcache / hashimoto_light become stable wrappers delegating to
  _mkcache_impl / _hashimoto_light_impl; callers using
  "from ethash import mkcache" continue to work after set_ethash_lib()
- set_ethash_lib: add _pyethash_mod and _pyethash_fn to globals so
  switching to pyethash works even when pyethash was absent at startup
- Pre-bind _pyethash_fn = _pyethash_mod.hashimoto_light to avoid
  per-call attribute lookup in hashimoto_light_pyethash
- set_ethash_lib clears only the relevant LRU cache on each switch

* fix: fall back to Python path for non-canonical cache/dataset sizes

mkcache_pyethash and hashimoto_light_pyethash always used pyethash
regardless of the requested size. pyethash ignores the cache_size /
full_size arguments and always returns canonical epoch sizes, so
passing a test size (1 KB cache, 32 KB dataset) silently produced
wrong results and broke all correctness tests.

Add size guards that fall back to the Python path when the requested
size does not match the canonical epoch size (e.g. is_test=True paths).
get_cache_size and get_full_size are @lru_cache'd so the check costs
~100 ns on the hot path.

* fix: restore canonical size guards and harden epoch-520 regression test

ethash.py:
- Add get_cache_size / get_full_size back to imports (lru_cache'd)
- mkcache_pyethash: fall back to Python path for non-canonical cache
  sizes (e.g. is_test=True uses 1 KB); pyethash ignores cache_size and
  always returns canonical epoch size, causing test failures otherwise
- hashimoto_light_pyethash: same guard for non-canonical full_size

test_ethash.py:
- TestEthashCrossImplEpochs: replace structural-only check for epoch 520
  with comparison against known-good pyethash output (mix digest + result)
- Add generation mechanism: set _EPOCH_520_MIX/RESULT to None to
  regenerate expected values
- Fill in epoch-520 known-good values from pyethash run

* change ethash_lib when is_Test

* Reduce logging

* refactor: change get_cache_size/get_full_size/mkcache to take epoch instead of block_number

Callers now pass block_number // EPOCH_LENGTH explicitly so the epoch
boundary is the canonical unit at call sites. Updated all callers in
ethash.py, ethpow.py, bench_hashimoto_compare.py, and test_ethash.py.

* update README.md

* chore: pin pyethash to commit 907b7d8

* resolve comment

* Revert "Reduce logging"

This reverts commit b899f4d.

* resolve comments

* fix: apply PoSW difficulty divider in external miner and fix Ctrl+C hang (#983)

* apply posw parms

* add apt install

* resolve comments

* fix(P3-8): skip dead connections when adding peers after active_event (#987)

* fix(P3-8): skip dead connections when adding peers after active_event

active_and_loop_forever's finally block now sets active_event even on early
close, so callers that await active_event.wait() can no longer distinguish
a live connection from one that died before becoming active.

Guard both call sites with conn.is_active() before calling add_peer():
- shard.py __init_connections: peer shard connections from master
- slave.py handle_create_cluster_peer_connection: cluster peer connections

* resolve comments

* fix: JSON-RPC protocol compliance + protocol tests (#988)

* fix: JSON-RPC protocol compliance + protocol tests

- HTTP/WS __handle: return -32700 Parse error on malformed JSON instead
  of swallowing it and dispatching an empty dict; drop the now-dead
  d=dict() fallback and guard non-dict payloads with isinstance
- dispatch: suppress responses for notifications (no "id"), even on error
- batch (JSON array) left intentionally unsupported: rejected cleanly with
  -32600 rather than crashing with an AttributeError/HTTP 500
- add TestJSONRPCProtocol: parse error, batch rejection, notification
  no-response, error-uses-HTTP-200, and success baseline (standalone
  master-free server on 38391, raw POST for malformed/array bodies)

* add

* resolve comments

* fix

* Improve JSON-RPC error messages and logging

- Pre-initialize method = "" instead of locals().get() in the
  internal-error log path
- Reject JSON arrays with a clear "Batch requests not supported"
  message rather than the generic "Request must be object"

* fix: Make `Shard.add_block` cancellation-safe; stop cancelling in-flight handlers (#989)

* fix: make Shard.add_block cancellation-safe; stop cancelling in-flight handlers

Per-connection finally cancelled in-flight message handlers, but
Shard.add_block is a multi-step commit (awaits between steps) with no
failure cleanup. A cancellation between awaits left add_block_futures[hash]
permanently pending, so waiters and SyncTask hung until process restart.

- protocol.py: active_and_loop_forever no longer cancels _handler_tasks
  (restores run-to-completion). Keeps the fd-leak fix (writer.close on
  cancel, reader.feed_eof) and the rpc_future_map abort that unblocks any
  handler awaiting a dead connection's RPC.
- shard.py: wrap the commit protocol in add_block / add_block_list_for_sync
  with try/except BaseException -> pop future + set_exception(RuntimeError)
  on any interruption (incl. CancelledError), then re-raise. Also fixes the
  pre-existing exception-path leak (broadcast to a dead slave). Sync now
  fails-and-retries on interrupted in-flight blocks instead of hanging.
- test_cluster.py: regression tests (cancelled add_block cleans up its
  future; a waiter returns False instead of hanging forever).
- slave.py: remove dead SlaveServer.add_block_futures (the real one lives
  on Shard).

* fix: clean up leaked add_block futures on sync early-return; suppress future-exception log noise

Address review findings on the cancellation-safety PR:

- add_block_list_for_sync created per-block futures inside the loop but only
  protected the broadcast/commit section. A validation failure (early
  return False) or an error building prev_root_height left futures from
  earlier blocks permanently pending; on retry those blocks read as
  BLOCK_COMMITTING and gather() hung forever -- the same hang class the PR
  fixes. Wrap the whole loop+commit in one try and extract a shared
  _abort_uncommitted_add_block_futures helper; the validation-failure path
  calls it explicitly since return bypasses except.
- Call fut.exception() after set_exception in both add_block and the helper
  so asyncio does not log 'Future exception was never retrieved' when no
  waiter consumes the future (common single-caller cancellation case).
- test_cluster.py: drop unused AsyncMock import.

* fix: Small correctness and cleanup fixes surfaced during the code review of PR #978. (#986)

* fix: apply P2/P4 review fixes from PR #978 code review

- balance_watcher: lstrip("0x") → removeprefix("0x"), fixes address
  corruption for addresses with leading zeros (~1/16 of all addresses)
- ethash: set_ethash_lib() was clearing the active cache instead of the
  inactive one; fix to clear the now-inactive side on each switch
- service.py: remove threadsafe_cancel() — get_running_loop() grabs the
  caller's loop, not the service loop; no callers exist in the repo
- cancel_token: remove unused loop= param from CancelToken.__init__
- test_jsonrpc: drop deprecated asyncio.get_event_loop() in setUp
- ethpow.py: remove trailing whitespace
- .gitignore: remove stale ethash_cy.c entry (no .pyx source exists)

* fix P1-1

* mv stats.py to stats

* resolve comments

* fix: Fix miner disable→restart lifecycle; bound test event-loop fd leak (#990)

* fix: correct miner disable/restart lifecycle and bound test event-loop fd leak

miner.py disable(): when a subprocess is running, signal termination via the
input_q sentinel only; do not also cancel the handler task. Cancelling left
the subprocess's terminating None unconsumed in output_q, so the next start()
read a stale None and exited immediately. The task is now cancelled only when
no subprocess exists yet.

miner.py handle_mined_block(): on the terminating None, join() the subprocess
and reset self.process to None. Previously self.process was never cleared, so
after a disable() the next start() posted work to the already-exited
subprocess and mining silently never resumed; join() also reaps the zombie.

conftest.py: track the gap-filling event loop the fixture creates and close it
once replaced by an IsolatedAsyncioTestCase framework loop, bounding the fd
leak to at most one live loop instead of one per async/sync-test boundary.

test_miner.py: regression tests for both disable() branches and a full
terminate->restart cycle that resumes mining and clears self.process.

* fix: use coro_join in async handle_mined_block to avoid blocking the event loop

handle_mined_block is an async coroutine, but reaped the finished mining
subprocess with the blocking Process.join(), stalling the event loop until
the OS finished tearing the child down. Use AioProcess.coro_join(), which
runs join() in an executor and yields to the loop.

* fix(miner): re-check enabled after block creation to prevent rogue subprocess

An in-flight mine_new_block task suspended at await create_block_async_func
could resume after disable() had already reaped the subprocess and reset
self.process to None. With self.process None, the spawn guard no longer
short-circuits, so the task would spawn a fresh subprocess that disable()
has finished tearing down and can no longer stop.

Re-check self.enabled right after the await and bail out before spawning.
Add a regression test reproducing the disable-during-block-creation race.

* Fix async task reference leaks and unclosed HTTP clients (#984)

* Fix async task reference and client resource leaks

- p2p_manager: save create_task() return value to prevent the server
  task from being garbage-collected mid-execution
- AsyncJsonRpcClient: add __aenter__/__aexit__ for async context use
- close AsyncJsonRpcClient at all leaking call sites (test_jsonrpc,
  monitoring, adjust_difficulty, fund_testnet, batch_deploy_contract)
  via async with or try/finally

* Fix async task GC leaks across cluster and tools

- master/shard Synchronizer: save ensure_future() return to
  self._run_task so the sync loop cannot be GC'd mid-execution,
  which would leave self.running stuck True (deadlock)
- master _main_async: save check_db task reference; await
  start_mining() directly (short init, no need for a separate task)
- tx_generator: save create_task() to self._gen_task; add
  self._gen_task = None in __init__
- simple_network connect_seed: replace per-peer create_task() with
  asyncio.gather(..., return_exceptions=True) to anchor all connect
  coroutines until they finish
- miner: anchor the recursive mine_new_block task spawned inside
  handle_mined_block via self._pending_block_tasks set with
  add_done_callback discard; avoids overwriting self._mining_task
  which is used by disable() for cancellation
- tools/monitoring fetch_peers_async: move server.close() into
  finally so it runs even on CancelledError (BaseException)
- tools/stats: wrap main loop in try/finally to close both
  JsonRpcClient (httpx.Client) instances on exit

* resolve comments

* resolve comment

* resolve comment

* fix: anchor remaining fire-and-forget notify/monitoring tasks (GC leak)

Completes the create_task GC-safety batch flagged in PR #978 review
comment #9. CPython's event loop only weakly references tasks, so an
unreferenced create_task/ensure_future can be GC'd mid-execution.

- shard_state.py: add _background_tasks set + _spawn_background helper
  (7 call sites); route all subscription notify_* and kafka monitoring
  spawns through it
- root_state.py: hold the single kafka monitoring spawn in a
  _background_tasks set (inlined, only one call site)
- subscription.py: notify_sync now collects sends and awaits gather,
  matching its notify_new_heads/notify_log siblings (no fire-and-forget)
- shard.py: anchor the notify_sync callers on the long-lived Synchronizer
  and on shard_state (not the transient SyncTask)

* refactor: use AsyncExitStack for client cleanup in adjust_difficulty

Replace nested try/finally close loops with a single AsyncExitStack that
registers every AsyncJsonRpcClient up-front via push_async_callback.

- adjust_imbalanced_hashpower: rich+poor clients registered before any
  gather runs, so a failure in either gather still closes all clients
  (previously a rich-gather failure leaked the poor clients)
- AsyncExitStack invokes every cleanup callback even if one close() raises,
  unlike the prior loop which aborted on the first failure

* resolve comment

---------

Co-authored-by: QuarkChain Dev <dev@quarkchain.io>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants