@@ -27,15 +27,14 @@ class TrackingRecHitDevice : public PortableDeviceCollection<TrackingRecHitLayou
: PortableDeviceCollection<TrackingRecHitLayout<TrackerTraits>, TDev>(nHits, queue), offsetBPIX2_{offsetBPIX2} {
const auto device = alpaka::getDev(queue);

- auto start_h = cms::alpakatools::make_host_view(hitsModuleStart, TrackerTraits::numberOfModules + 1);
+ auto start_h = cms::alpakatools::make_device_view(device, hitsModuleStart, TrackerTraits::numberOfModules + 1);
auto start_d =
cms::alpakatools::make_device_view(device, view().hitsModuleStart().data(), TrackerTraits::numberOfModules + 1);
alpaka::memcpy(queue, start_d, start_h);

- auto off_h = cms::alpakatools::make_host_view(offsetBPIX2);
+ auto off_h = cms::alpakatools::make_host_view(offsetBPIX2_);
Contributor:

I'm afraid even this is not safe (unless the underlying GPU runtime+driver synchronize on the memcpy()). The TrackingRecHitDevice object, constructed in

TrackingRecHitsSoACollection<TrackerTraits> hits_d(queue, nHits, offsetBPIX2, clusters_d->clusModuleStart());

is moved around (the return from makeHitsAsync() may get elided, but in the end the framework moves the object into the edm::Wrapper<T>), so the address of offsetBPIX2_ is not stable.
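
To illustrate, a minimal stand-alone sketch (plain C++, no alpaka or framework types; the queue and class here are made up) of why recording the address of a member for a deferred copy breaks once the object is moved:

#include <cstdint>
#include <functional>
#include <iostream>
#include <utility>
#include <vector>

std::vector<std::function<void()>> queue;  // stand-in for an asynchronous device queue

struct HitsDevice {
  int32_t offsetBPIX2_;
  int32_t onDevice_ = 0;  // stand-in for the device-side destination

  explicit HitsDevice(int32_t offsetBPIX2) : offsetBPIX2_{offsetBPIX2} {
    // enqueue a copy that records addresses of members of *this* object;
    // it only executes later, when the queue is drained
    queue.push_back([src = &offsetBPIX2_, dst = &onDevice_] { *dst = *src; });
  }
};

int main() {
  HitsDevice a{42};
  HitsDevice b = std::move(a);       // like the framework moving the object into edm::Wrapper<T>
  for (auto& op : queue) op();       // the deferred copy still dereferences addresses inside a
  std::cout << b.onDevice_ << '\n';  // prints 0, not 42: the write went to the moved-from object
}

Here the moved-from object happens to still be alive, so the stale addresses are merely wrong; if it has been destroyed before the queue drains, the same pattern is a use-after-free.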

But is offsetBPIX2 really needed on the device? The only place I found was

alpaka::exec<Acc1D>(
queue,
cms::alpakatools::make_workdiv<Acc1D>(1, 1),
[] ALPAKA_FN_ACC(Acc1D const &acc,
OuterHitOfCell *isOuterHitOfCell,
OuterHitOfCellContainer *container,
int32_t const *offset) {
// this code runs on the device
isOuterHitOfCell->container = container;
isOuterHitOfCell->offset = *offset;
},
this->isOuterHitOfCell_.data(),
this->device_isOuterHitOfCell_.data(),
&hh.offsetBPIX2());

and everywhere else the offsetBPIX2 is used on the host. In this case it would be easy to pass it by value from host (as the calling function has it already and uses it for other purposes). Or are there future developments that would benefit from having offsetBPIX2 on the device as part of TrackingRecHitsDevice?

If offsetBPIX2 is really needed on the device, I think it should be copied through a host buffer.
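
For example, a sketch of staging through a host buffer, assuming the cms::alpakatools::make_host_buffer(queue) helpers used elsewhere in this PR, whose queue-associated allocation stays alive until the work submitted to the queue has completed:

auto off_h = cms::alpakatools::make_host_buffer<int32_t>(queue);
*off_h.data() = offsetBPIX2;  // the constructor argument, copied by value
auto off_d = cms::alpakatools::make_device_view(device, view().offsetBPIX2());
alpaka::memcpy(queue, off_d, off_h);
// no wait needed here: the source buffer's lifetime is tied to the queue, not to this object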

Contributor Author:

Mhm.

The CUDA version does not have a synchronisation, either:

explicit TrackingRecHitSoADevice(uint32_t nHits,
int32_t offsetBPIX2,
ParamsOnGPU const* cpeParams,
uint32_t const* hitsModuleStart,
cudaStream_t stream)
: cms::cuda::PortableDeviceCollection<TrackingRecHitLayout<TrackerTraits>>(nHits, stream),
offsetBPIX2_(offsetBPIX2) {
cudaCheck(cudaMemcpyAsync(&(view().nHits()), &nHits, sizeof(uint32_t), cudaMemcpyDefault, stream));
// hitsModuleStart is on Device
cudaCheck(cudaMemcpyAsync(view().hitsModuleStart().data(),
hitsModuleStart,
sizeof(uint32_t) * int(TrackerTraits::numberOfModules + 1),
cudaMemcpyDefault,
stream));
cudaCheck(cudaMemcpyAsync(&(view().offsetBPIX2()), &offsetBPIX2, sizeof(int32_t), cudaMemcpyDefault, stream));
// cpeParams argument is a pointer to device memory, copy
// its contents into the Layout.
cudaCheck(cudaMemcpyAsync(&(view().cpeParams()), cpeParams, int(sizeof(ParamsOnGPU)), cudaMemcpyDefault, stream));
}

Which is not to say that I disagree with your assessment, just that the async copy seems to work anyway?

(side note: for copying a small number of values, I'm wondering whether launching a kernel might be faster than calling a cudaMemcpy variant?)
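
For instance, a sketch of that alternative (hypothetical setOffset kernel): since kernel arguments are passed by value, the scalar travels inside the launch and no host address has to remain valid:

__global__ void setOffset(int32_t* dst, int32_t value) {
  *dst = value;  // a single thread writes the by-value argument into device memory
}

// in the constructor, replacing the cudaMemcpyAsync of offsetBPIX2:
setOffset<<<1, 1, 0, stream>>>(&(view().offsetBPIX2()), offsetBPIX2);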

As to whether it's needed or not, I agree it needs to be looked into.

For the time being, I'm more interested in recovering the performance of the CUDA implementation, but after that we can also take the opportunity to clean up the code.

Contributor:

I agree this PR is not worse than what is done in CUDA, so I'm not against merging it now (with the caveat that it is at least theoretically unsafe and should eventually be addressed).

> Which is not to say that I disagree with your assessment, just that the async copy seems to work anyway?

It could be that the copy has been quick enough to complete before anything touched the memory (really along the lines of "nothing else has overwritten the memory before the CUDA runtime+driver read from the memory location"). I'd expect the problem to manifest when the GPU is at least nearly full, and even then it is unclear whether it would crash the job or lead to an incorrect value being read (and whatever that would imply downstream).

Contributor Author:

> It could be that the copy has been quick enough to complete before anything touched the memory (really along the lines of "nothing else has overwritten the memory before the CUDA runtime+driver read from the memory location").

Actually, the CUDA documentation is not very clear about what "async memcpy" means:

  • for a synchronous copy it states that

    For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

  • for an asynchronous copy it states that

    If pageable memory must first be staged to pinned memory, the driver may synchronize with the stream and stage the copy into pinned memory.

So it's possible that the CUDA runtime may first perform a synchronous copy of offsetBPIX2_ to a staging area in pinned memory, and then an asynchronous copy from there to GPU memory.

Or not 🤷🏻‍♂️.
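
In any case, the staging can be made explicit rather than left to the driver, e.g. (a sketch with a hypothetical pinned scalar; the buffer must outlive the copy):

int32_t* offset_pinned;
cudaCheck(cudaMallocHost(&offset_pinned, sizeof(int32_t)));  // pinned host memory
*offset_pinned = offsetBPIX2;
cudaCheck(cudaMemcpyAsync(
    &(view().offsetBPIX2()), offset_pinned, sizeof(int32_t), cudaMemcpyHostToDevice, stream));
// offset_pinned must not be freed until the copy has completed,
// e.g. via a stream callback or after synchronizing the stream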

Contributor:

Right, the exact behavior is confusing, and I'd expect it to depend on the runtime and driver versions and/or the actual hardware. I'd personally assume the weakest guarantees and program around that.

(and with Alpaka we should worry about the behavior of other platforms too)

auto off_d = cms::alpakatools::make_device_view(device, view().offsetBPIX2());
alpaka::memcpy(queue, off_d, off_h);
alpaka::wait(queue);
}

uint32_t nHits() const { return view().metadata().size(); }
DataFormats/TrackingRecHitSoA/test/alpaka/Hits_test.cc (11 changes: 7 additions & 4 deletions)
@@ -34,12 +34,15 @@ int main() {
{
uint32_t nHits = 2000;
int32_t offset = 100;
- uint32_t moduleStart[pixelTopology::Phase1::numberOfModules + 1];

+ auto moduleStartH =
+     cms::alpakatools::make_host_buffer<uint32_t[]>(queue, pixelTopology::Phase1::numberOfModules + 1);
for (size_t i = 0; i < pixelTopology::Phase1::numberOfModules + 1; ++i) {
-   moduleStart[i] = i * 2;
+   moduleStartH[i] = i * 2;
}
- TrackingRecHitsSoACollection<pixelTopology::Phase1> tkhit(queue, nHits, offset, moduleStart);
+ auto moduleStartD =
+     cms::alpakatools::make_device_buffer<uint32_t[]>(queue, pixelTopology::Phase1::numberOfModules + 1);
+ alpaka::memcpy(queue, moduleStartD, moduleStartH);
+ TrackingRecHitsSoACollection<pixelTopology::Phase1> tkhit(queue, nHits, offset, moduleStartD.data());

testTrackingRecHitSoA::runKernels<pixelTopology::Phase1>(tkhit.view(), queue);
tkhit.updateFromDevice(queue);
@@ -257,7 +257,7 @@ namespace cms::alpakatools {
nOnes,
nblocks,
ppsws,
- alpaka::getWarpSizes(alpaka::getDev(queue))[0]);
+ alpaka::getPreferredWarpSize(alpaka::getDev(queue)));
} else {
h->finalize();
}
@@ -147,7 +147,7 @@ int main() {
for (auto const& device : devices) {
std::cout << "Test prefix scan on " << alpaka::getName(device) << '\n';
auto queue = Queue(device);
- const auto warpSize = alpaka::getWarpSizes(device)[0];
+ const auto warpSize = alpaka::getPreferredWarpSize(device);
// WARP PREFIXSCAN (OBVIOUSLY GPU-ONLY)
if constexpr (!requires_single_thread_per_block_v<Acc1D>) {
std::cout << "warp level" << std::endl;
@@ -137,12 +137,9 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {

template <typename TrackerTraits>
void SiPixelRawToCluster<TrackerTraits>::acquire(device::Event const& iEvent, device::EventSetup const& iSetup) {
- [[maybe_unused]] auto const& hMap = iSetup.getData(mapToken_);
+ auto const& hMap = iSetup.getData(mapToken_);
auto const& dGains = iSetup.getData(gainsToken_);
auto gains = SiPixelGainCalibrationForHLTDevice(1, iEvent.queue());
- auto modulesToUnpackRegional =
-     cms::alpakatools::make_device_buffer<unsigned char[]>(iEvent.queue(), ::pixelgpudetails::MAX_SIZE);
- const unsigned char* modulesToUnpack;

// initialize cabling map or update if necessary
if (recordWatcher_.check(iSetup)) {
// cabling map, which maps online address (fed->link->ROC->local pixel) to offline (DetId->global pixel)
@@ -151,6 +148,10 @@
cabling_ = cablingMap_->cablingTree();
LogDebug("map version:") << cablingMap_->version();
}

+ // if used, the buffer is guaranteed to stay alive until the execution of makePhase1ClustersAsync completes
+ std::optional<cms::alpakatools::device_buffer<Device, unsigned char[]>> modulesToUnpackRegional;
+ const unsigned char* modulesToUnpack;
if (regions_) {
regions_->run(iEvent, iSetup);
LogDebug("SiPixelRawToCluster") << "region2unpack #feds: " << regions_->nFEDs();
@@ -159,7 +160,7 @@

modulesToUnpackRegional = SiPixelMappingUtilities::getModToUnpRegionalAsync(
*(regions_->modulesToUnpack()), cabling_.get(), fedIds_, iEvent.queue());
- modulesToUnpack = modulesToUnpackRegional.data();
+ modulesToUnpack = modulesToUnpackRegional->data();
} else {
modulesToUnpack = hMap->modToUnpDefault();
}
@@ -235,7 +236,7 @@
return;

// copy the FED data to a single cpu buffer
- pixelDetails::WordFedAppender wordFedAppender(nDigis_);
+ pixelDetails::WordFedAppender wordFedAppender(iEvent.queue(), nDigis_);
for (uint32_t i = 0; i < fedIds_.size(); ++i) {
wordFedAppender.initializeWordFed(fedIds_[i], index[i], start[i], words[i]);
}
@@ -122,12 +122,9 @@

class WordFedAppender {
public:
- WordFedAppender();
~WordFedAppender() = default;

- WordFedAppender(uint32_t words)
-     : word_{cms::alpakatools::make_host_buffer<unsigned int[], Platform>(words)},
-       fedId_{cms::alpakatools::make_host_buffer<unsigned char[], Platform>(words)} {};
+ WordFedAppender(Queue& queue, uint32_t words)
+     : word_{cms::alpakatools::make_host_buffer<unsigned int[]>(queue, words)},
+       fedId_{cms::alpakatools::make_host_buffer<unsigned char[]>(queue, words)} {};

void initializeWordFed(int fedId, unsigned int wordCounterGPU, const uint32_t* src, unsigned int length) {
std::memcpy(word_.data() + wordCounterGPU, src, sizeof(uint32_t) * length);