Thanks for your great work. However, when I came across 08-Advance-multiStream, I ran into some questions. I see that:
_, input_h0 = cudart.cudaHostAlloc(n_bytes_input, cudart.cudaHostAllocWriteCombined)
_, input_h1 = cudart.cudaHostAlloc(n_bytes_input, cudart.cudaHostAllocWriteCombined)
_, output_h0 = cudart.cudaHostAlloc(n_bytes_output, cudart.cudaHostAllocWriteCombined)
_, output_h1 = cudart.cudaHostAlloc(n_bytes_output, cudart.cudaHostAllocWriteCombined)
_, input_d0 = cudart.cudaMallocAsync(n_bytes_input, stream0)
_, input_d1 = cudart.cudaMallocAsync(n_bytes_input, stream1)
_, output_d0 = cudart.cudaMallocAsync(n_bytes_output, stream0)
_, output_d1 = cudart.cudaMallocAsync(n_bytes_output, stream1)
You allocated the memory with the cudart API and also transfer it with the cudart API:
cudart.cudaMemcpyAsync(inputD, inputH, n_bytes_input, cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
But in fact, the memory actually used during execution should be handled like this:
cudart.cudaMemcpyAsync(tw.buffer[i][1], tw.buffer[i][0].ctypes.data, tw.buffer[i][2], cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)
tw.context.execute_async_v3(stream)
cudart.cudaMemcpyAsync(tw.buffer[o][0].ctypes.data, tw.buffer[o][1], tw.buffer[o][2], cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
Is the first version only for timing? What is the best approach for actual inference? I would also like to know whether creating the streams manually is the only option. Thank you. @wili-65535