Decouple GPU and CPU models #302

Merged
merged 77 commits into master from i292-simpler-gpu-allcuda
Nov 3, 2021

Conversation

@richfitz richfitz commented Nov 1, 2021

Another total redesign of the way that gpu models are included, reflecting how we actually use these now.

  • on compilation we might or might not support including gpu-like code. This is detected automatically or can be forced with the cpp11 pseudo-attribute dust::has_gpu_support (this supersedes the old has_gpu_support template, as this is useful to know ahead of time).
  • on initialisation we might point at a device or not; that instance will then run entirely on either gpu or cpu
  • all the run/simulate/compare_data/filter functions lose their old device argument

This PR will represent a mid-point along a series of smaller cleanups, as the current gpu code is still a bit redundant.

Some thoughts on future cleanups that we might move into separate PRs

  • longer term it would be nice to exclude all cuda code from being included if not used (e.g., in interface/dust.hpp do we really want to include cuda/filter.hpp?)
  • rework filter_state_type definition in Dust and DustDevice? (hard to do due to the disabled =/move constructor)
  • device_info should not be the only source of real_bits
  • redefine name of __nv_exec_check_disable__
  • move all cuda predefines somewhere into random
  • update vignette, which is now wrong (already has issue)
  • split the test-gpu.R test file into support and running

Fixes #292
Fixes #154 (or close enough anyway)
Fixes #254 (by preventing the problem for now)

@richfitz richfitz marked this pull request as ready for review November 2, 2021 17:37
@richfitz richfitz requested a review from johnlees November 2, 2021 17:37
const size_t n_time = step_end.size();
// The filter snapshot class can be used to store the indexed state
// (implements async copy, swap space, and deinterleaving)
// Filter trajctories not used as we don't need order here
Member

Suggested change
// Filter trajctories not used as we don't need order here
// Filter trajectories not used as we don't need order here

Comment on lines 320 to 322
// TODO: we should really do this via a kernel I think? Currently we
// grab the whole state back from the device to the host, then
// filter through it.
Member

This is the same as before, and definitely inefficient. Is this ever called? To do it via a kernel you'd want to copy index over, and probably generalise run_select() to take an index argument rather than assuming the one in device state. index is smaller now, so the copy is likely to be faster than this method (which wasn't previously the case: at one point index was the same size as state, but now we compute more of it).

Generally this reminds me that we should at some point probably also make a stride/destride kernel and move away from the CPU methods (I'll raise an issue for that). This would speed up index/state calls, and the history calls in the particle filter.
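
For readers unfamiliar with the stride/destride idea mentioned above, here is a minimal host-side sketch of the layout change involved. It assumes "particle-major" storage on the CPU (each particle's state contiguous) and "interleaved" storage on the device (each state variable contiguous across particles); the function names `interleave` and `deinterleave` are hypothetical and are not dust's API. A kernel version would perform the same index arithmetic, one thread per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers, not dust's API.
// Particle-major: element j of particle i lives at x[i * n_state + j].
// Interleaved:    element j of particle i lives at x[j * n_particles + i].
template <typename T>
std::vector<T> interleave(const std::vector<T>& x, std::size_t n_state,
                          std::size_t n_particles) {
  assert(x.size() == n_state * n_particles);
  std::vector<T> ret(x.size());
  for (std::size_t i = 0; i < n_particles; ++i) {
    for (std::size_t j = 0; j < n_state; ++j) {
      ret[j * n_particles + i] = x[i * n_state + j];
    }
  }
  return ret;
}

// Inverse of interleave(): back to particle-major order.
template <typename T>
std::vector<T> deinterleave(const std::vector<T>& x, std::size_t n_state,
                            std::size_t n_particles) {
  assert(x.size() == n_state * n_particles);
  std::vector<T> ret(x.size());
  for (std::size_t i = 0; i < n_particles; ++i) {
    for (std::size_t j = 0; j < n_state; ++j) {
      ret[i * n_state + j] = x[j * n_particles + i];
    }
  }
  return ret;
}
```

With 2 particles of 3 states each, the particle-major vector {1, 2, 3, 4, 5, 6} becomes {1, 4, 2, 5, 3, 6} when interleaved, and `deinterleave` restores the original order.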

Member Author

yeah, this is all stuff for later (and is the same as previous)

Member Author

#311 posted for this now

}

// NOTE: this is only used for debugging/testing, otherwise we would
// make device_weights a class member.
Member

Suggested change
// make device_weights a class member.
// make device_weights and scan class members.

Comment on lines +556 to +558
// delete move and copy to avoid accidentally using them
DustDevice(const DustDevice&) = delete;
DustDevice(DustDevice&&) = delete;
Member

We may actually be ok to use these (were you talking about a case where that'd be useful?) as I think all members observe the rule of five

Member Author

The profiler destructor does not though, and that causes issues as on move profiling stops!
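
A minimal sketch of how this kind of problem can arise, under stated assumptions: `ProfilerGuard`, `MovableOwner` and `PinnedOwner` are hypothetical stand-ins (not dust's classes), and the "stop profiling" action is modelled as incrementing a counter in the destructor. Because the guard's destructor runs unconditionally, destroying a moved-from owner fires the stop while the destination object is still alive. Deleting copy and move, as the diff does, turns that into a compile-time error instead.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical guard: the destructor "stops profiling" unconditionally,
// so it also fires for objects that have been moved from.
struct ProfilerGuard {
  int* stop_count;
  explicit ProfilerGuard(int* counter) : stop_count(counter) {
    assert(counter != nullptr);
  }
  ~ProfilerGuard() { ++(*stop_count); }
};

// Owner that relies on the implicitly generated copy/move members.
struct MovableOwner {
  ProfilerGuard guard;
  explicit MovableOwner(int* counter) : guard(counter) {}
};

// Owner with copy and move deleted, mirroring the approach in the diff.
struct PinnedOwner {
  ProfilerGuard guard;
  explicit PinnedOwner(int* counter) : guard(counter) {}
  PinnedOwner(const PinnedOwner&) = delete;
  PinnedOwner(PinnedOwner&&) = delete;
};

static_assert(!std::is_copy_constructible<PinnedOwner>::value,
              "copying is rejected at compile time");
static_assert(!std::is_move_constructible<PinnedOwner>::value,
              "moving is rejected at compile time");

// Number of "stop" calls observed while the moved-to owner is still alive.
int stops_seen_after_move() {
  int count = 0;
  auto a = std::make_unique<MovableOwner>(&count);
  MovableOwner b(std::move(*a));
  a.reset();     // destroys the moved-from owner; its guard fires "stop"
  return count;  // read before b is destroyed
}
```

An alternative would be to give the guard a move constructor that disarms the source object; the sketch only illustrates why deleting the operations is the simpler safeguard.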

Comment on lines 663 to 666
// TODO: This update function is wildly inefficient; we should
// probably support things like "copy one state to all the particles
// of that parameter index", possibly as a kernel.
void set_state_from_pars(const std::vector<pars_type>& pars) {
Member

Isn't the issue here that the initialisation of state from model can be stochastic? So even with the same pars you'd get different states?
Or do you mean making the model initialiser device compatible?

Member Author

It can be, yes, but I am not sure we cope correctly with that yet and we do not test it anywhere. I think this is wrong in the CPU code too; see #310 (added to the comment too)

Comment on lines +164 to +165
state_swap.get_array(this->state_.data() + value_offset(),
host_memory_stream_, true);
Member

What I meant above would change code here. Rather than just using get_array which does a D->D memcpy, this would call a kernel which both destrides and does the memcpy at the same time (for probably no/little extra cost)

Comment on lines +48 to +50
if (run_block_size_int % 32 != 0) {
cpp11::stop("'run_block_size' must be a multiple of 32 (but was %d)",
run_block_size_int);
Member

Sometimes setting block size and block count = 1 is useful for debugging so everything is serial

Member Author

This is unchanged from before - I'm inclined not to change it here atm as for debugging we can hack it in
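
For illustration only, here is a sketch of what a debugging escape hatch could look like if one were ever wanted. The function `validate_run_block_size` and its `allow_serial` flag are hypothetical and not part of dust; the PR itself keeps the existing check unchanged. The value 32 is the warp size on NVIDIA GPUs, and the rejection of non-positive values is an extra assumption beyond the check shown in the diff.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Warp size on NVIDIA GPUs.
constexpr int warp_size = 32;

// Hypothetical validator: requires a positive multiple of 32, unless
// allow_serial is set, in which case a block size of 1 is also accepted.
inline int validate_run_block_size(int run_block_size,
                                   bool allow_serial = false) {
  if (allow_serial && run_block_size == 1) {
    return run_block_size;
  }
  if (run_block_size <= 0 || run_block_size % warp_size != 0) {
    throw std::invalid_argument(
      "'run_block_size' must be a multiple of 32 (but was " +
      std::to_string(run_block_size) + ")");
  }
  return run_block_size;
}
```

By default a block size of 1 is still rejected, so normal use behaves like the existing check for positive inputs.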

cpp11::stop("Expected 'step' to be scalar or length %d",
obj->n_particles());
}
if (!std::is_same<T, Dust<typename T::model_type>>::value && len != 1) {
Member

yikes

Member Author

Why yikes here? this is less hairy than some of the template magic we had before 🙃
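
The pattern in the line under discussion can be sketched in isolation. The code below assumes two hypothetical container templates, `CpuContainer` and `DeviceContainer`, standing in for Dust and DustDevice, and uses `std::is_same` to accept a vector of steps only for the CPU container. The error messages and the helper names are assumptions, not dust's own.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <type_traits>

// Hypothetical stand-ins for the CPU and device container templates.
template <typename Model> struct CpuContainer { using model_type = Model; };
template <typename Model> struct DeviceContainer { using model_type = Model; };
struct ToyModel {};

// True only when T is the CPU container instantiated for its own model type.
template <typename T>
constexpr bool is_cpu_container() {
  return std::is_same<T, CpuContainer<typename T::model_type>>::value;
}

// A step vector must be scalar or one per particle, and only the CPU
// container accepts the per-particle form.
template <typename T>
void check_step_length(std::size_t len, std::size_t n_particles) {
  if (len != 1 && len != n_particles) {
    throw std::invalid_argument("Expected 'step' to be scalar or one per particle");
  }
  if (!is_cpu_container<T>() && len != 1) {
    throw std::invalid_argument("Expected 'step' to be scalar for this container");
  }
}
```

Since the type comparison is a compile-time constant, the compiler can drop the second branch entirely for the CPU container.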

@@ -0,0 +1,77 @@
/// IMPORTANT; changes here must be reflected in inst/template/dust_methods.hpp
Member

Can you update these files in the developer notes? I would probably forget to look here

Member Author

done

@richfitz richfitz requested a review from johnlees November 3, 2021 14:37
@johnlees johnlees merged commit 143f3a6 into master Nov 3, 2021
@johnlees johnlees deleted the i292-simpler-gpu-allcuda branch November 3, 2021 14:55