Reduce the memory highwatermark in DistributedClosestPoint::computeClosestPoints#1889
MrBurmark wants to merge 10 commits into
In DistributedClosestPoint::computeClosestPoints, clean up Conduit nodes as soon as they are no longer needed instead of waiting until the end of the routine. Also refactor storage to use unique_ptr instead of shared_ptr.
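To illustrate why releasing nodes early lowers the high watermark, here is a minimal sketch. It is not the Axom code: `FakeNode`, `MemoryTracker`, and `peakBytes` are hypothetical names, and the byte counts are modeled by hand rather than measured from an allocator. It assumes each processing step produces a result buffer the same size as the node it consumes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a conduit::Node that owns a payload.
struct FakeNode
{
  std::vector<char> payload;
  explicit FakeNode(std::size_t n) : payload(n, 0) { }
};

// Hand-rolled accounting of live bytes and their peak (an assumption of
// this sketch; it does not hook into any real allocator).
struct MemoryTracker
{
  std::size_t live = 0;
  std::size_t peak = 0;
  void add(std::size_t n)
  {
    live += n;
    if(live > peak) peak = live;
  }
  void remove(std::size_t n)
  {
    assert(n <= live);
    live -= n;
  }
};

// Builds nodeCount nodes, then processes them one at a time. Each step keeps
// a result buffer as large as the node. With releaseEarly, the node is freed
// right after it is consumed; otherwise all nodes are freed at the end.
std::size_t peakBytes(std::size_t nodeCount, std::size_t nodeBytes, bool releaseEarly)
{
  MemoryTracker tracker;
  std::vector<std::unique_ptr<FakeNode>> nodes;
  for(std::size_t i = 0; i < nodeCount; ++i)
  {
    nodes.push_back(std::make_unique<FakeNode>(nodeBytes));
    tracker.add(nodeBytes);
  }

  std::vector<std::vector<char>> results;
  for(auto& node : nodes)
  {
    const std::size_t bytes = node->payload.size();
    results.emplace_back(bytes);  // result of "processing" this node
    tracker.add(bytes);
    if(releaseEarly)
    {
      node.reset();  // unique_ptr frees the node as soon as it is consumed
      tracker.remove(bytes);
    }
  }

  // Late cleanup: everything still alive is released at the end.
  for(auto& node : nodes)
  {
    if(node)
    {
      tracker.remove(node->payload.size());
      node.reset();
    }
  }
  return tracker.peak;
}
```

With four nodes of 100 bytes each, the modeled peak is 800 bytes when cleanup waits until the end and 500 bytes when each node is reset right after use. Because unique_ptr has a single owner, `reset()` is guaranteed to free the node at that point, which shared_ptr cannot promise while other copies exist.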
…ndCompletionCleanUp
@MrBurmark Thanks for this. I ran clang-format on your branch so the CI checks will run.
  std::vector<MPI_Request> reqs;
- for(auto& isr : isendRequests)
+ reqs.reserve(isendRequests.size());
+ for(auto const& isr : isendRequests)
  {
-   reqs.push_back(isr.m_request);
+   reqs.push_back(isr.first.m_request);
  }
Allocating and freeing a vector of requests every time this function is called seems suboptimal, but I didn't want to change too much at once.
The Request object holds onto the packed data so we don't need to hold onto the original node while the isends are processing.
…eanUp' of github.com:llnl/axom into bugfix/burmark1/DistributedClosestPointSendCompletionCleanUp
I had Codex take a look, and it pointed out that the requests own their own packed buffers, so I am now removing the nodes even earlier.
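The ownership point above can be sketched as follows. This is an illustration only, under the assumption that a request copies the node's data into its own buffer when the send is started; `FakeNode`, `FakeSendRequest`, and `startSend` are hypothetical names, not the Axom or Conduit API, and no MPI call is made.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a conduit::Node.
struct FakeNode
{
  std::string data;
};

// Hypothetical request wrapper that owns a packed copy of the node's data,
// so the in-flight send never refers back to the original node.
class FakeSendRequest
{
public:
  explicit FakeSendRequest(const FakeNode& node)
    : m_packed(node.data.begin(), node.data.end())
  { }

  const std::vector<char>& packed() const { return m_packed; }

private:
  std::vector<char> m_packed;  // lives as long as the request does
};

// Packs the node into a request, then releases the node immediately
// instead of holding it until the send completes.
FakeSendRequest startSend(std::unique_ptr<FakeNode>& node)
{
  FakeSendRequest request(*node);
  node.reset();  // safe: the request holds its own copy of the bytes
  return request;
}
```

After `startSend` returns, the caller's pointer is null while the request still exposes the packed bytes, so the node's memory is reclaimed before the send finishes.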
edponce left a comment
These are some general observations on how check_send_requests() is implemented and used, which can be discussed in another PR. Ideally, the argument isendRequests would be a std::vector that can be passed directly to MPI_[Wait|Test]some(). With these MPI functions, there is no need to resize the requests array, and inCount can remain invariant, although both can change if needed. Completed requests are nullified in the input requests array and ignored when the array is used again. Another observation: the MPI standard defines MPI_Request as an opaque handle, and treating it as a copyable datatype is not recommended, even though many MPI implementations implement it as an integer.
Also, when waiting to complete all remaining non-blocking sends, these can be handled with MPI_Waitall() instead of using a while loop and invoking check_send_requests() multiple times.
@MrBurmark - are the host-side memory allocations that aren't made via Umpire the primary concern/target of these code changes?
@publixsubfan Yes, the memory in the conduit nodes and requests is my main concern here. In this PR I reduce the lifetime of the nodes, but did not change the lifetime of the requests.
Given the request abstraction, I'm not sure that will be an easy thing to do in general. Perhaps if there was an abstraction for collections of requests?
@publixsubfan I have been working through different approaches to making the Axom host allocation interface more consistent and flexible. You may recall that I put up a couple of PRs recently looking for comments and feedback. After discussions with several folks, I closed those and went back to the drawing board. I am starting to work through a new PR where all host allocations will require an explicit allocation mechanism to be provided (e.g., Axom malloc, Umpire Host, or something else such as Umpire Pinned). Should we discuss this ASAP, or would it be better to talk about it when I have something close to done? I'm hoping to have a good draft by early next week.
@rhornung67 I would be interested in talking earlier rather than later.
I’ll be back around next week if you guys want to take this offline. That being said, presuming it’s just the changes here, I don’t see anything too controversial in this PR; long-term, though, a broader solution might be along the lines that @rhornung67 is proposing with a new host memory interface.
…ndCompletionCleanUp
@MrBurmark I merged Axom develop into your PR branch and ran clang-format on it. Now all Axom tests pass. We should probably figure out what we are not testing in your use case.
…ndCompletionCleanUp
Summary