fix: Fix miner disable→restart lifecycle; bound test event-loop fd leak by ping-ke · Pull Request #990 · QuarkChain/pyquarkchain

ping-ke · 2026-06-11T04:38:24Z

Summary

The miner's disable()/start() cycle was broken in two ways that left mining silently dead after a restart, and the test event-loop fixture leaked one fd per async/sync-test boundary.

Related Issue

#985

Changes

miner.py disable() (stale sentinel): when a subprocess is running, signal termination through the input_q sentinel only — do not also cancel the handler task. Cancelling left the subprocess's terminating None unconsumed in output_q, so the next start() read a stale None and exited immediately. The task is now cancelled only in the elif branch where no subprocess exists yet.
miner.py handle_mined_block() (dead-process restart): on the terminating None, join() the subprocess and reset self.process = None. Previously self.process was never cleared, so a subsequent start() posted work to the already-exited subprocess and mining silently never resumed. join() also reaps the zombie. Resetting here avoids racing the subprocess's async exit.
conftest.py: track and close the gap-filling loop once replaced by an IsolatedAsyncioTestCase loop, bounding the fd leak to one live loop.
tests/test_miner.py: regression tests for both disable() branches and the terminate→restart cycle.

…p fd leak miner.py disable(): when a subprocess is running, signal termination via the input_q sentinel only; do not also cancel the handler task. Cancelling left the subprocess's terminating None unconsumed in output_q, so the next start() read a stale None and exited immediately. The task is now cancelled only when no subprocess exists yet. miner.py handle_mined_block(): on the terminating None, join() the subprocess and reset self.process to None. Previously self.process was never cleared, so after a disable() the next start() posted work to the already-exited subprocess and mining silently never resumed; join() also reaps the zombie. conftest.py: track the gap-filling event loop the fixture creates and close it once replaced by an IsolatedAsyncioTestCase framework loop, bounding the fd leak to at most one live loop instead of one per async/sync-test boundary. test_miner.py: regression tests for both disable() branches and a full terminate->restart cycle that resumes mining and clears self.process.

…event loop handle_mined_block is an async coroutine, but reaped the finished mining subprocess with the blocking Process.join(), stalling the event loop until the OS finished tearing the child down. Use AioProcess.coro_join(), which runs join() in an executor and yields to the loop.

ping-ke · 2026-06-12T06:15:28Z

Comments from Alan:

…bprocess An in-flight mine_new_block task suspended at await create_block_async_func could resume after disable() had already reaped the subprocess and reset self.process to None. With self.process None, the spawn guard no longer short-circuits, so the task would spawn a fresh subprocess that disable() has finished tearing down and can no longer stop. Re-check self.enabled right after the await and bail out before spawning. Add a regression test reproducing the disable-during-block-creation race.

…ding dependencies, asyncio, ethash, jsonrpc, and UPnP. (#978) * add bench for hashimoto * resolve comment * Part 3 - pow: remove pyethash C extension, always use pure Python ethash (#973) * remove pyethash C extension, use pure Python ethash always pyethash is a C++ extension that is not compatible with Python 3.13. The pure Python implementation in ethereum.pow.ethash is sufficient. Remove the conditional import and always use the Python path, adding @lru_cache to get_cache_slow for the same performance benefit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * add bench for hashimoto * add comment for pyethash removal in ethpow.py Add comment explaining why pyethash C++ acceleration was removed (not supported on Python 3.13) with link to #976 * resolve comment * resolve upgrade/ethash comment * optimize ethash pow verification: struct-based hashing and numpy fnv arrays ethash_utils.py: - replace hex-based encode_int/decode_int with struct.pack/unpack for serialize_hash and deserialize_hash (~30x faster per call) - inline ethash_sha3_512/256 to skip intermediate list conversion (~5x faster on list input) - add ethash_sha3_512_np and ethash_sha3_256_np: numpy ndarray variants that accept bytes or ndarray and return uint32 ndarray, eliminating tolist()/ np.array() round-trips in the hot path - consolidate keccak implementation here; ethash.py no longer duplicates it ethash.py: - store cache as 2D numpy uint32 ndarray (shape n x 16) via _get_cache - use ethash_sha3_512_np/256_np throughout to keep data in ndarray form - vectorize the 16-element mix update in calc_dataset_item and hashimoto inner loop using numpy arithmetic instead of list(map(fnv, ...)) - scalar fnv for cache_index uses plain Python int to avoid numpy scalar overhead test_ethash.py: - add TestEthashUtils covering serialize_hash, deserialize_hash, fnv, ethash_sha3_512/256 directly against reference implementations Benchmark: hashimoto_light ~23% faster end-to-end vs pure Python baseline; serialize_hash/deserialize_hash ~30x faster individually * clean up ethereum/pow: remove unused code, simplify ethpow.py ethash_utils.py: - remove struct, _FMT_16I, _FMT_8I (only served deleted serialize/deserialize_hash) - remove fnv (only used in tests, not in production path) - remove ethash_sha3_512 list variant (replaced by numpy ndarray variant) - remove serialize_hash, deserialize_hash, hash_words, xor, serialize_cache, deserialize_cache and related aliases (all replaced by ndarray.tobytes/frombuffer) ethash.py: - mkcache: use ethash_sha3_256_np(...).tobytes() directly, drop serialize_hash - drop serialize_hash import (no longer needed) ethpow.py: - remove pyethash C-extension dead code paths (get_cache/hashimoto were always equal to get_cache_slow/hashimoto_slow after pyethash removal) - keep get_cache_slow/hashimoto_slow structure as fallback for future Cython ext test_ethash.py: - remove test cases for deleted functions (serialize_hash, deserialize_hash, fnv, ethash_sha3_256, ethash_sha3_512 list variant) - cache/dataset hex comparison uses ndarray.tobytes().hex() directly bench_before_after.py, bench_hashimoto_compare.py: - add old/mid/new three-way comparison - old implementations kept inline for regression reference - new side imports from current ethash module directly * rename ethash_sha3_512/256_np -> ethash_sha3_512/256; refactor bench to old/R1/R2 format * bench_before_after: rename mid -> R1 consistently * add Cython inner loop for calc_dataset_item (R3): ~20x speedup over numpy ethash_cy.pyx: typed C loop replacing the 256-iteration FNV parent mixing in calc_dataset_item. ethash.py auto-imports when built, falls back to pure Python otherwise. bench_hashimoto_compare.py extended with R3 column. * gitignore: add Cython generated .c and .pyd files * add Cython to requirements, update README install docs, add Cython vs fallback test * add R2/R3/R4 ethash implementations and refactor bench/test - ethash.py: rewrite with numpy uint32 arrays (R2); add ETHASH_LIB env var to select python/cython/auto at runtime - ethash_cy.pyx: add mix_parents (R3), cy_calc_dataset_item and cy_hashimoto_light with C keccak (R4) - keccak_tiny.c/h: portable C Keccak implementation for Cython R4 - ethpow.py: use ETHASH_LIB-aware hashimoto_light; simplify check_pow/mine - setup.py: build Cython extension with keccak_tiny.c - old_ethash.py: extract original hex-based implementation as reference baseline - bench_hashimoto_compare.py: merge bench_before_after.py; add R3/R4 sections; import old impl from old_ethash.py - test_ethash.py: use old_ethash as baseline for cython correctness test - remove bench_before_after.py * add Rust ethash extension and refactor ethash.py dispatch - Add ethereum/pow/ethash_rs: full Rust implementation of ethash (mkcache, hashimoto_light, mix_parents) using PyO3 0.22 + tiny-keccak - Integrate setuptools-rust into setup.py so build_ext --inplace places ethash_rs.so in the source tree alongside ethash_cy.so - Refactor ethash.py: ETHASH_LIB branches only import functions; _get_cache and hashimoto_light defined once with None-check fallback - Auto-detection order: ethash_rs -> ethash_cy -> pure Python - Update Dockerfile to install Rust toolchain and build both extensions - Update README and requirements.txt for Rust/maturin/setuptools-rust - Add test_rust_matches_python_fallback to verify Rust output * add R5 (Rust) to bench_hashimoto_compare and fix ethash_rs auto-detect - bench_hashimoto_compare.py: add R5 section covering rs_mkcache, rs_calc_dataset_item, and rs_hashimoto_light; skipped gracefully when the Rust extension is not available (same pattern as R3/R4) - ethash.py: fix auto-detect to check for the expected symbol (rs_hashimoto_light / cy_hashimoto_light) after import, preventing the ethash_rs/ Cargo source directory (a namespace package) from being mistaken for a built extension * ethash: scope np.errstate, fix little-endian dtype, restore lru_cache(10) - Replace global np.seterr(over='ignore') with np.errstate(over='ignore') scoped to the two FNV multiply sites in calc_dataset_item and hashimoto, avoiding unintended suppression of overflow warnings elsewhere - Use dtype='<u4' (explicit little-endian) in ethash_sha3_512/256 instead of native-endian np.uint32, matching Ethereum spec on big-endian hosts - Restore lru_cache(10): 8 shards can span ~6-7 different epochs simultaneously, so 2-3 slots would cause frequent cache eviction - test_cython_matches_python_fallback and test_rust_matches_python_fallback now fail on ImportError instead of skipping * fix: cast cmix and ndarray to uint32 before tobytes to avoid dtype mismatch * resolve comments --------- Co-authored-by: QuarkChain Dev <dev@quarkchain.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Part 4 - asyncio: removes all deprecated patterns and adopts modern asyncio idioms. (#972) * modernize asyncio usage for Python 3.10+ compatibility Python 3.10 deprecated the loop= parameter in most asyncio APIs, and Python 3.12 removed it entirely. This removes all deprecated patterns across the codebase and adopts modern asyncio idioms. Changes: - Remove loop= parameter from asyncio.Event, asyncio.wait, asyncio.ensure_future, asyncio.start_server, open_connection, etc. - Replace asyncio.ensure_future() with asyncio.create_task() - Replace get_event_loop().run_until_complete() with asyncio.run() - Replace asyncio.get_event_loop() with asyncio.get_running_loop() where inside a running event loop - Add _get_or_create_event_loop() helper in utils.py for call sites that run before the loop starts (SlaveServer, SlaveConnectionManager) - Replace asyncio.Future used as a one-shot signal with asyncio.Event (active_future → active_event, ping_received_future → ping_received_event) - Make start()/shutdown() methods async where they previously called loop.run_until_complete() internally - Update eth_hash import: BasePreImage → PreImageAPI (eth-hash >=0.5 API) - Update cryptography API: EllipticCurvePublicNumbers.from_encoded_point → EllipticCurvePublicKey.from_encoded_point - Add conftest.py to cancel pending asyncio tasks between tests - Replace assertRaisesRegexp (removed in Py 3.12) with assertRaisesRegex Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix bug * fix asyncio task leaks causing test timeouts Track and cancel all fire-and-forget asyncio tasks that were leaking across tests, causing resource exhaustion and timeout failures. - AbstractConnection: add _loop_task and _handler_tasks tracking; use try/finally in active_and_loop_forever to ensure cleanup on cancellation; cancel _loop_task in close() - master.py: track SlaveConnection loop task and __init_cluster task; cancel _init_task on shutdown - slave.py: track MasterConnection, SlaveConnection, PeerShardConnection loop tasks and __start_server task - shard.py: track PeerShardConnection loop task - miner.py: track and cancel mining task in disable() - simple_network.py: track Peer loop task and connect_seed task - test_utils.py: restructure shutdown_clusters with try/finally to guarantee task cleanup; await server.wait_closed() for slave servers - conftest.py: multi-round task cancellation; reset aborted_rpc_count * change CRLF to LF * move p2p_server change to update/nat branch * resolve comment * move async related changes from update/jsonrpc branch * resolve comment * convert sync tests to async using IsolatedAsyncioTestCase - convert test_cluster.py and test_jsonrpc.py to unittest.IsolatedAsyncioTestCase - make create_test_clusters/shutdown_clusters async, ClusterContext async context manager - replace call_async() with await, assert_true_with_timeout with async_assert_true_with_timeout - replace _get_or_create_event_loop() with asyncio.get_running_loop() in master/slave - fix mock_pay_native_token_as_gas to support both sync and async wrapped functions - remove obsolete _get_or_create_event_loop, call_async, assert_true_with_timeout helpers - fix conftest to restore event loop after IsolatedAsyncioTestCase closes it * remove quarkchain.jsonrpc_client related change * resolve comment * revert 98621ea * convert sync tests to async using IsolatedAsyncioTestCase Migrate test classes to IsolatedAsyncioTestCase so that asyncio.create_task calls in production code (shard_state, root_state) have a running event loop. - Replace asyncio.ensure_future with asyncio.create_task in production code - Replace _get_or_create_event_loop with asyncio.get_running_loop in master/slave - Convert ClusterContext to async context manager - Convert create_test_clusters / shutdown_clusters to async - Add fire_and_forget and async_assert_true_with_timeout utilities - Simplify conftest fixture to only restore event loop after IsolatedAsyncioTestCase teardown * remove useless code * resolve comments --------- Co-authored-by: QuarkChain Dev <dev@quarkchain.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Part 5 - jsonrpc: replace jsonrpcserver/jsonrpcclient with custom implementation (#971) * replace jsonrpcserver/jsonrpcclient packages with custom implementation The jsonrpcserver (3.x) and jsonrpcclient (2.x) packages are incompatible with Python 3.13 and are no longer maintained. Changes: - Add quarkchain/cluster/jsonrpcserver.py: lightweight custom JSON-RPC 2.0 server built on aiohttp (RpcMethods, JsonRpcError hierarchy, request dispatch with positional/named params) - Add quarkchain/jsonrpc_client.py: synchronous JsonRpcClient (uses httpx) and asynchronous AsyncJsonRpcClient (uses aiohttp) replacing jsonrpcclient.HTTPClient / aiohttpClient - Update quarkchain/cluster/jsonrpc.py to use the new RpcMethods class; make start_*_server() classmethod async; use get_running_loop() - Replace jsonrpcclient calls in all tools with JsonRpcClient.call() - Rename quarkchain/tools/stats → quarkchain/tools/stats.py - Remove jsonrpcserver config log suppression (no longer needed) - Update test_jsonrpc.py to use AsyncJsonRpcClient Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix remaining jsonrpc_async usages and import issues in tools - adjust_difficulty.py: replace bare `import monitoring` with `from quarkchain.tools import monitoring` (breaks when run from outside the tools directory) - adjust_difficulty.py: replace jsonrpc_async.Server with AsyncJsonRpcClient and use .call() method - monitoring.py: replace jsonrpc_async.Server with AsyncJsonRpcClient and use .call() method; close session via .close() - jsonrpc_client.py: fix AsyncJsonRpcClient.call signature to use *params (variadic) to match JsonRpcClient.call, fixing callers that pass positional arguments Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix jsonrpc test failures: params passing and websocket server shutdown - Fix send_request in test_jsonrpc.py to unpack list params correctly instead of double-wrapping them (e.g. [["0x..."]] → ["0x..."]) - Add call_with_dict_params to AsyncJsonRpcClient for named params - Implement JSONRPCWebsocketServer.shutdown() to actually close the server, fixing test isolation hangs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Revert "fix jsonrpc test failures: params passing and websocket server shutdown" This reverts commit 9f603e6. * fix jsonrpc test failures: params passing and websocket server shutdown * move async related changes to update/asyncio branch * add rationale comment for httpx dependency choice in jsonrpc_client * rename jsonrpcserver to jsonrpc_server and fix related imports and issues - Rename jsonrpcserver.py to jsonrpc_server.py for consistent naming - Replace armor with asyncio.shield for cancellation protection - Fix get_running_loop to get_event_loop for compatibility - Fix subscription.py import from old jsonrpcserver package - Fix indentation in test_jsonrpc.py and method name in stats.py * fix tools compatibility with new JsonRpcClient - Fix snake_case field names to camelCase for JSON-RPC responses (network_id->networkId, shard_size->chainSize, block_height->blockHeight, contract_address->contractAddress) - Fix resp.data.result access pattern to direct dict access - Add exception handling and cli.close() to prevent connection leaks - Add 0x prefix for address in balance_watcher query_balance * add unit test for jsonrpc * remove call_with_dict_params, unify call() for both param types, log internal RPC errors * resolve comments --------- Co-authored-by: QuarkChain Dev <dev@quarkchain.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Part 6 - p2p: replace upnpclient with async-upnp-client for NAT port mapping (#970) * replace upnpclient with async-upnp-client for NAT port mapping upnpclient uses synchronous I/O and netifaces for interface discovery, both incompatible or unmaintained under Python 3.13. Switch to async-upnp-client which provides asyncio-native UPnP/IGD support. Rewrite UPnPService.discover() to use async_search() for device discovery and AiohttpSessionRequester for requests, removing the blocking ThreadPoolExecutor path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix bugs in UPnPService nat.py - Fix AttributeError: replace undefined self.external_port/self.internal_port/self.protocol with self.port - Fix _discover timeout: wrap async_search with asyncio.wait_for instead of waiting after it completes - Fix _run: guard _add_port_mapping with self._service check to avoid AttributeError - Fix _delete_port_mapping: delete both TCP and UDP mappings to match AddPortMapping - Remove dead code: _refresh_task and _running fields that were never used * add __main__ entry point to nat.py for manual UPnP testing * change CRLF to LF * add p2p_server change * resolve upgrade/p2p-nat comment * replace upnpclient with async-upnp-client for NAT port mapping upnpclient uses synchronous I/O and netifaces for interface discovery, both incompatible or unmaintained under Python 3.13. Switch to async-upnp-client which provides asyncio-native UPnP/IGD support. Rewrite UPnPService.discover() to use async_search() for device discovery and AiohttpSessionRequester for requests, removing the blocking ThreadPoolExecutor path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix bugs in UPnPService nat.py - Fix AttributeError: replace undefined self.external_port/self.internal_port/self.protocol with self.port - Fix _discover timeout: wrap async_search with asyncio.wait_for instead of waiting after it completes - Fix _run: guard _add_port_mapping with self._service check to avoid AttributeError - Fix _delete_port_mapping: delete both TCP and UDP mappings to match AddPortMapping - Remove dead code: _refresh_task and _running fields that were never used * add __main__ entry point to nat.py for manual UPnP testing * change CRLF to LF * add p2p_server change * resolve upgrade/p2p-nat comment * add unit tests for UPnPService in nat.py * refactor and expand UPnP NAT tests - extract shared socket/aiohttp mocks into fixtures - extract fake_wait helper to reduce duplication - add tests for edge cases: exception handling, missing service, session cleanup - fix coroutine never awaited warning in test_run_exits_on_cancel * add full lifecycle test and extract test constants * fix UPnP lifecycle bugs: add _cleanup, session safety, and conflict handling - Override _cleanup() so stop() is called on service shutdown, preventing port mapping and aiohttp session leaks when the node exits - Close any existing session before creating a new one in discover() to avoid orphaned sessions on repeated calls - Handle SOAP error 718 (ConflictInMappingEntry) as success, matching the previous best-effort behavior from master - Roll back partial port mappings if AddPortMapping fails mid-loop - Import UpnpActionResponseError for typed SOAP error handling * nat: serialise concurrent discover() calls with asyncio.Lock Add _discover_lock so that concurrent callers cannot race on _session and _service state. The lock wraps the entire discover() body, ensuring each call fully completes (including session cleanup in the finally block) before the next begins. * nat: fix async_upnp_client compatibility — traverse embedded devices, fix bool type, deduplicate - Guard on_response with early return if _service already found (SSDP sends one response per device sub-type, so the callback fires ~9x) - Extract location from CaseInsensitiveDict headers (newer async_upnp_client passes headers dict, not a response object) - Recursively traverse embedded_devices to find WANIPConnection service (root device only exposes Layer3Forwarding; WANIPConnection lives in the InternetGatewayDevice → WANDevice → WANConnectionDevice subtree) - Pass NewEnabled=True (bool) instead of 1; async_upnp_client validates the UPnP boolean type strictly * nat: cancel async_search early once WANIPConn service is found Use asyncio.Event + wait_for so _discover returns as soon as the first WANIPConn service is found instead of waiting the full 30-second timeout. Add post-await re-check to guard against concurrent on_response tasks that all pass the initial self._service guard before any of them completes. * nat: derive internal IP from router address, not default route Connect the UDP socket to the UPnP router's IP (parsed from the SSDP location URL and stored in _router_host) instead of 8.8.8.8, so the OS routing table selects the interface that actually reaches the router. Matches the behaviour of the old netifaces-based find_internal_ip_on_device_network() without the extra dependency. Tests updated accordingly: - test_get_internal_ip sets _router_host and asserts connect target - Add test_get_internal_ip_no_router_host: returns None before discover() - Add test_add_port_mapping_no_internal_ip: raises RuntimeError when _router_host is None - Fix fake_search to pass dict response so response.get('location') works - Set fake_device.embedded_devices={} to prevent MagicMock recursion - Move MOCK_ROUTER_HOST to top constants with clarifying comments - Remove trivial guard tests (no_service, already_none, exits_on_cancel) * resolve comment * resolve comment --------- Co-authored-by: QuarkChain Dev <dev@quarkchain.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * Part 1 - upgrade: update all dependencies for Python 3.13 compatibility (#974) * upgrade dependencies for Python 3.13 compatibility - Bump all packages to versions supporting Python 3.13 - Replace python-rocksdb with rocksdict - Replace jsonrpcserver + jsonrpcclient with httpx and custom client - Replace upnpclient + netifaces with async-upnp-client - Drop pyethash C extension (use pure Python ethash) - Require Python >=3.13 in setup.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * bring in py3.13-slim Dockerfile from db-rocksdict branch Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * update test docker * add back pyethash * remove pyethash * resolve comment * add setuptools to requirements, build Cython extension in Dockerfile * cherry-pick related files from caef64a * fix C++20 template-id-cdtor warning and keep build tools in Docker image qkchash_llrb.h: remove <T> from move constructor name — template-id on constructors is ill-formed in C++20 (was a warning with -std=c++17) Dockerfile: remove post-build cleanup of gcc/Rust/build-essential so the test image retains the toolchain needed to rebuild extensions in CI * update Dockerfile and action build-and-test.yml * resolve comments * fix assert * merge Dockerfile-tiny into Dockerfile, use PREBUILT arg to switch modes * refactor: use multi-stage build to control PREBUILT mode reliably * update * Dockerfile: support arm64 builds Build libqkchash.so from source in PREBUILT=false so the image works on any architecture. PREBUILT=true now rejects non-amd64 TARGETARCH since the S3 artifacts are x86_64-only, and pulls libqkchash.so from S3 next to the ethash extensions for consistency. --------- Co-authored-by: QuarkChain Dev <dev@quarkchain.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * fix missing import * update bench_hashimoto* * small change * bug fix * Reduce master logging (#980) * Reduce logging * resolve comments * Add pyethash C extension backend with zero-copy cache and cross-impl tests (#979) * perf: make pyethash cache arr a zero-copy view to halve memory usage np.frombuffer without .copy() makes arr a read-only view into raw, so both share the same ~16MB buffer instead of allocating ~32MB per epoch. Safe because calc_dataset_item always copies a row before mutating it. Add comments explaining the design: why no .copy(), and why hashimoto_light calls _get_pyethash_cache a second time (O(1) LRU hit, not a recomputation). * bug fix * test * add test * use random epoch instead of epoch 42 * resolve comments * refactor: replace ethash_cy/ethash_rs with pyethash C extension - Remove Cython (ethash_cy) and Rust (ethash_rs) ethash backends, including source files, build config, and all references - Add pyethash C extension as the sole accelerated backend (ETHASH_LIB=pyethash or auto-detected) - Add set_ethash_lib() for runtime switching between pyethash and python - Halve per-epoch cache memory via zero-copy np.frombuffer view in _get_pyethash_cache (shares ~16 MB buffer instead of allocating ~32 MB) - Simplify Dockerfile: remove Rust toolchain build, remove S3 downloads for ethash_cy/ethash_rs; pyethash built via pip install (needs gcc) - Drop Cython, maturin, setuptools-rust from requirements.txt and setup.py - Rewrite bench_hashimoto_compare: compare pyethash vs pure-Python across 30 epochs (mkcache + hashimoto_light + check_pow per epoch) - Update tests: remove ethash_cy/ethash_rs cross-impl tests; simplify TestEthashCrossImplEpochs to verify pyethash at large epochs * perf: cache get_full_size/get_cache_size and pre-bind pyethash fn get_full_size and get_cache_size use isprime trial division up to sqrt(~8M) ≈ 2896 iterations per call. Without caching, hashimoto_light paid ~250us on every call for get_full_size alone (100 calls/epoch × 250us = 25ms/epoch overhead). - Add @lru_cache(256) to get_full_size and get_cache_size in ethash_utils - Pre-bind _pyethash_call = _pyethash_mod.hashimoto_light at module load to avoid per-call attribute lookup - Replace ETHASH_LIB string comparison with _pyethash_call is not None (pointer check) in mkcache and hashimoto_light hot paths - Keep _pyethash_call in sync inside set_ethash_lib Expected improvement: hashimoto_light ~0.9ms -> ~0.7ms per call * refactor: split ethash into per-impl functions, fix set_ethash_lib - Replace single mkcache/hashimoto_light with impl-specific variants: mkcache_pyethash / mkcache_python and hashimoto_light_pyethash / hashimoto_light_python - Public mkcache / hashimoto_light become stable wrappers delegating to _mkcache_impl / _hashimoto_light_impl; callers using "from ethash import mkcache" continue to work after set_ethash_lib() - set_ethash_lib: add _pyethash_mod and _pyethash_fn to globals so switching to pyethash works even when pyethash was absent at startup - Pre-bind _pyethash_fn = _pyethash_mod.hashimoto_light to avoid per-call attribute lookup in hashimoto_light_pyethash - set_ethash_lib clears only the relevant LRU cache on each switch * fix: fall back to Python path for non-canonical cache/dataset sizes mkcache_pyethash and hashimoto_light_pyethash always used pyethash regardless of the requested size. pyethash ignores the cache_size / full_size arguments and always returns canonical epoch sizes, so passing a test size (1 KB cache, 32 KB dataset) silently produced wrong results and broke all correctness tests. Add size guards that fall back to the Python path when the requested size does not match the canonical epoch size (e.g. is_test=True paths). get_cache_size and get_full_size are @lru_cache'd so the check costs ~100 ns on the hot path. * fix: restore canonical size guards and harden epoch-520 regression test ethash.py: - Add get_cache_size / get_full_size back to imports (lru_cache'd) - mkcache_pyethash: fall back to Python path for non-canonical cache sizes (e.g. is_test=True uses 1 KB); pyethash ignores cache_size and always returns canonical epoch size, causing test failures otherwise - hashimoto_light_pyethash: same guard for non-canonical full_size test_ethash.py: - TestEthashCrossImplEpochs: replace structural-only check for epoch 520 with comparison against known-good pyethash output (mix digest + result) - Add generation mechanism: set _EPOCH_520_MIX/RESULT to None to regenerate expected values - Fill in epoch-520 known-good values from pyethash run * change ethash_lib when is_Test * Reduce logging * refactor: change get_cache_size/get_full_size/mkcache to take epoch instead of block_number Callers now pass block_number // EPOCH_LENGTH explicitly so the epoch boundary is the canonical unit at call sites. Updated all callers in ethash.py, ethpow.py, bench_hashimoto_compare.py, and test_ethash.py. * update README.md * chore: pin pyethash to commit 907b7d8 * resolve comment * Revert "Reduce logging" This reverts commit b899f4d. * resolve comments * fix: apply PoSW difficulty divider in external miner and fix Ctrl+C hang (#983) * apply posw parms * add apt install * resolve comments * fix(P3-8): skip dead connections when adding peers after active_event (#987) * fix(P3-8): skip dead connections when adding peers after active_event active_and_loop_forever's finally block now sets active_event even on early close, so callers that await active_event.wait() can no longer distinguish a live connection from one that died before becoming active. Guard both call sites with conn.is_active() before calling add_peer(): - shard.py __init_connections: peer shard connections from master - slave.py handle_create_cluster_peer_connection: cluster peer connections * resolve comments * fix: JSON-RPC protocol compliance + protocol tests (#988) * fix: JSON-RPC protocol compliance + protocol tests - HTTP/WS __handle: return -32700 Parse error on malformed JSON instead of swallowing it and dispatching an empty dict; drop the now-dead d=dict() fallback and guard non-dict payloads with isinstance - dispatch: suppress responses for notifications (no "id"), even on error - batch (JSON array) left intentionally unsupported: rejected cleanly with -32600 rather than crashing with an AttributeError/HTTP 500 - add TestJSONRPCProtocol: parse error, batch rejection, notification no-response, error-uses-HTTP-200, and success baseline (standalone master-free server on 38391, raw POST for malformed/array bodies) * add * resolve comments * fix * Improve JSON-RPC error messages and logging - Pre-initialize method = "" instead of locals().get() in the internal-error log path - Reject JSON arrays with a clear "Batch requests not supported" message rather than the generic "Request must be object" * fix: Make `Shard.add_block` cancellation-safe; stop cancelling in-flight handlers (#989) * fix: make Shard.add_block cancellation-safe; stop cancelling in-flight handlers Per-connection finally cancelled in-flight message handlers, but Shard.add_block is a multi-step commit (awaits between steps) with no failure cleanup. A cancellation between awaits left add_block_futures[hash] permanently pending, so waiters and SyncTask hung until process restart. - protocol.py: active_and_loop_forever no longer cancels _handler_tasks (restores run-to-completion). Keeps the fd-leak fix (writer.close on cancel, reader.feed_eof) and the rpc_future_map abort that unblocks any handler awaiting a dead connection's RPC. - shard.py: wrap the commit protocol in add_block / add_block_list_for_sync with try/except BaseException -> pop future + set_exception(RuntimeError) on any interruption (incl. CancelledError), then re-raise. Also fixes the pre-existing exception-path leak (broadcast to a dead slave). Sync now fails-and-retries on interrupted in-flight blocks instead of hanging. - test_cluster.py: regression tests (cancelled add_block cleans up its future; a waiter returns False instead of hanging forever). - slave.py: remove dead SlaveServer.add_block_futures (the real one lives on Shard). * fix: clean up leaked add_block futures on sync early-return; suppress future-exception log noise Address review findings on the cancellation-safety PR: - add_block_list_for_sync created per-block futures inside the loop but only protected the broadcast/commit section. A validation failure (early return False) or an error building prev_root_height left futures from earlier blocks permanently pending; on retry those blocks read as BLOCK_COMMITTING and gather() hung forever -- the same hang class the PR fixes. Wrap the whole loop+commit in one try and extract a shared _abort_uncommitted_add_block_futures helper; the validation-failure path calls it explicitly since return bypasses except. - Call fut.exception() after set_exception in both add_block and the helper so asyncio does not log 'Future exception was never retrieved' when no waiter consumes the future (common single-caller cancellation case). - test_cluster.py: drop unused AsyncMock import. * fix: Small correctness and cleanup fixes surfaced during the code review of PR #978. (#986) * fix: apply P2/P4 review fixes from PR #978 code review - balance_watcher: lstrip("0x") → removeprefix("0x"), fixes address corruption for addresses with leading zeros (~1/16 of all addresses) - ethash: set_ethash_lib() was clearing the active cache instead of the inactive one; fix to clear the now-inactive side on each switch - service.py: remove threadsafe_cancel() — get_running_loop() grabs the caller's loop, not the service loop; no callers exist in the repo - cancel_token: remove unused loop= param from CancelToken.__init__ - test_jsonrpc: drop deprecated asyncio.get_event_loop() in setUp - ethpow.py: remove trailing whitespace - .gitignore: remove stale ethash_cy.c entry (no .pyx source exists) * fix P1-1 * mv stats.py to stats * resolve comments * fix: Fix miner disable→restart lifecycle; bound test event-loop fd leak (#990) * fix: correct miner disable/restart lifecycle and bound test event-loop fd leak miner.py disable(): when a subprocess is running, signal termination via the input_q sentinel only; do not also cancel the handler task. Cancelling left the subprocess's terminating None unconsumed in output_q, so the next start() read a stale None and exited immediately. The task is now cancelled only when no subprocess exists yet. miner.py handle_mined_block(): on the terminating None, join() the subprocess and reset self.process to None. Previously self.process was never cleared, so after a disable() the next start() posted work to the already-exited subprocess and mining silently never resumed; join() also reaps the zombie. conftest.py: track the gap-filling event loop the fixture creates and close it once replaced by an IsolatedAsyncioTestCase framework loop, bounding the fd leak to at most one live loop instead of one per async/sync-test boundary. test_miner.py: regression tests for both disable() branches and a full terminate->restart cycle that resumes mining and clears self.process. * fix: use coro_join in async handle_mined_block to avoid blocking the event loop handle_mined_block is an async coroutine, but reaped the finished mining subprocess with the blocking Process.join(), stalling the event loop until the OS finished tearing the child down. Use AioProcess.coro_join(), which runs join() in an executor and yields to the loop. * fix(miner): re-check enabled after block creation to prevent rogue subprocess An in-flight mine_new_block task suspended at await create_block_async_func could resume after disable() had already reaped the subprocess and reset self.process to None. With self.process None, the spawn guard no longer short-circuits, so the task would spawn a fresh subprocess that disable() has finished tearing down and can no longer stop. Re-check self.enabled right after the await and bail out before spawning. Add a regression test reproducing the disable-during-block-creation race. * Fix async task reference leaks and unclosed HTTP clients (#984) * Fix async task reference and client resource leaks - p2p_manager: save create_task() return value to prevent the server task from being garbage-collected mid-execution - AsyncJsonRpcClient: add __aenter__/__aexit__ for async context use - close AsyncJsonRpcClient at all leaking call sites (test_jsonrpc, monitoring, adjust_difficulty, fund_testnet, batch_deploy_contract) via async with or try/finally * Fix async task GC leaks across cluster and tools - master/shard Synchronizer: save ensure_future() return to self._run_task so the sync loop cannot be GC'd mid-execution, which would leave self.running stuck True (deadlock) - master _main_async: save check_db task reference; await start_mining() directly (short init, no need for a separate task) - tx_generator: save create_task() to self._gen_task; add self._gen_task = None in __init__ - simple_network connect_seed: replace per-peer create_task() with asyncio.gather(..., return_exceptions=True) to anchor all connect coroutines until they finish - miner: anchor the recursive mine_new_block task spawned inside handle_mined_block via self._pending_block_tasks set with add_done_callback discard; avoids overwriting self._mining_task which is used by disable() for cancellation - tools/monitoring fetch_peers_async: move server.close() into finally so it runs even on CancelledError (BaseException) - tools/stats: wrap main loop in try/finally to close both JsonRpcClient (httpx.Client) instances on exit * resolve comments * resolve comment * resolve comment * fix: anchor remaining fire-and-forget notify/monitoring tasks (GC leak) Completes the create_task GC-safety batch flagged in PR #978 review comment #9. CPython's event loop only weakly references tasks, so an unreferenced create_task/ensure_future can be GC'd mid-execution. - shard_state.py: add _background_tasks set + _spawn_background helper (7 call sites); route all subscription notify_* and kafka monitoring spawns through it - root_state.py: hold the single kafka monitoring spawn in a _background_tasks set (inlined, only one call site) - subscription.py: notify_sync now collects sends and awaits gather, matching its notify_new_heads/notify_log siblings (no fire-and-forget) - shard.py: anchor the notify_sync callers on the long-lived Synchronizer and on shard_state (not the transient SyncTask) * refactor: use AsyncExitStack for client cleanup in adjust_difficulty Replace nested try/finally close loops with a single AsyncExitStack that registers every AsyncJsonRpcClient up-front via push_async_callback. - adjust_imbalanced_hashpower: rich+poor clients registered before any gather runs, so a failure in either gather still closes all clients (previously a rich-gather failure leaked the poor clients) - AsyncExitStack invokes every cleanup callback even if one close() raises, unlike the prior loop which aborted on the first failure * resolve comment --------- Co-authored-by: QuarkChain Dev <dev@quarkchain.io> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

ping-ke requested review from blockchaindevsh, iteyelmp, qizhou, qzhodl and syntrust June 11, 2026 04:38

qzhodl approved these changes Jun 11, 2026

View reviewed changes

ping-ke mentioned this pull request Jun 11, 2026

# PR #978 Claude Code Review：Upgrade to Python 3.13 #985

Closed

19 tasks

blockchaindevsh reviewed Jun 11, 2026

View reviewed changes

Comment thread quarkchain/cluster/miner.py Outdated

ping-ke requested a review from blockchaindevsh June 11, 2026 16:45

ping-ke mentioned this pull request Jun 12, 2026

QKC/OP Devs Meeting #99 QuarkChain/pm#147

Closed

blockchaindevsh approved these changes Jun 12, 2026

View reviewed changes

ping-ke merged commit 62d17a5 into upgrade/py313-baseline Jun 12, 2026
4 checks passed

ping-ke deleted the claude-review-4 branch June 12, 2026 07:06

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: Fix miner disable→restart lifecycle; bound test event-loop fd leak#990

fix: Fix miner disable→restart lifecycle; bound test event-loop fd leak#990
ping-ke merged 3 commits into
upgrade/py313-baselinefrom
claude-review-4

ping-ke commented Jun 11, 2026 •

edited

Loading

Uh oh!

Uh oh!

ping-ke commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

ping-ke commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Uh oh!

Uh oh!

ping-ke commented Jun 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ping-ke commented Jun 11, 2026 •

edited

Loading