When uploading a file part to a peer, we could add an option to check the hash of that part before sending it.
We could use the SHA hash tree for this: it hashes every 180 KB part, which aligns better with the part-reading algorithm in the upload thread. The feature could be toggled via a new preference option, defaulting to Off to match the current behavior and avoid increasing CPU usage for the average user.
The performance cost breakdown would be:
- The file part needs to be read from disk for the upload anyway, so the main performance cost is already paid.
- We need to retrieve the SHA hashes for that part. I believe that during the memory leak analysis for the 3.0.0 release, it was said that the SHA hash tree is kept in memory, but I cannot find that thread now. If that is true, there is no extra I/O or RAM cost. If not, there is a small extra cost to retrieve that information from the file.
- The remaining cost is CPU time to calculate the hash. A modern CPU with SHA instructions is very fast (> 1 GB/s); users with older CPUs could simply leave the option off if they don't want to pay the cost.
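To make the cost breakdown concrete, here is a minimal sketch of the check, not the actual upload-thread code. It assumes the hash tree stores one SHA-1 digest per 180 KiB (184320-byte) block and that those digests are already in memory; `BLOCK_SIZE`, `read_block_for_upload`, and `expected_hashes` are hypothetical names used only for illustration.

```python
import hashlib

# Assumption: the hash tree covers the file in 180 KiB blocks.
BLOCK_SIZE = 180 * 1024


def read_block_for_upload(f, index, expected_hashes, verify=False):
    """Read one block and optionally verify it before it is sent.

    Returns the block bytes, or None if verification is enabled and
    the block does not match its expected digest.
    """
    f.seek(index * BLOCK_SIZE)
    data = f.read(BLOCK_SIZE)  # this read is needed for the upload anyway

    if verify:
        # The only extra work: one hash over data already in memory.
        # Assumption: the per-block digests are SHA-1.
        if hashlib.sha1(data).digest() != expected_hashes[index]:
            return None
    return data
```

With `verify=False` the function behaves like a plain read, which matches the proposed default of keeping the option off.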
The benefits of this feature are:
- Detect silent bit-rot in files. For part files, this also allows early detection, so the part can be re-requested instead of waiting for the completion hash.
- Do not upload corrupted data to peers, and avoid being banned because of this.
When a hash failure is detected:
- At a minimum, add a log line to inform the user about the error.
- For part files, mark that part as incomplete so it is requested again.
- For completed files, I'm unsure about the best course of action. At a minimum, do not send that part to the peer, and maybe stop sharing the whole file? Maybe add the file to a corrupted-files list for the user to act upon... Something else?
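The handling above could be sketched roughly as follows. This only illustrates one possible policy: the `Action` names and `actions_on_hash_failure` are hypothetical, and the choice for completed files (adding the file to a corrupted-files list) is just one of the open options, not a settled decision.

```python
from enum import Enum, auto


class Action(Enum):
    LOG_ERROR = auto()              # always tell the user
    SKIP_UPLOAD = auto()            # never send the corrupted part
    MARK_PART_INCOMPLETE = auto()   # part files: request the part again
    ADD_TO_CORRUPTED_LIST = auto()  # completed files: one option under discussion


def actions_on_hash_failure(is_part_file: bool) -> list[Action]:
    """Return the actions to take when a part fails its hash check."""
    actions = [Action.LOG_ERROR, Action.SKIP_UPLOAD]
    if is_part_file:
        actions.append(Action.MARK_PART_INCOMPLETE)
    else:
        # Assumption: flag the file for the user instead of unsharing it.
        actions.append(Action.ADD_TO_CORRUPTED_LIST)
    return actions
```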
Opinions?