-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decouple GPU and CPU models #302
Conversation
const size_t n_time = step_end.size(); | ||
// The filter snapshot class can be used to store the indexed state | ||
// (implements async copy, swap space, and deinterleaving) | ||
// Filter trajctories not used as we don't need order here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Filter trajctories not used as we don't need order here | |
// Filter trajectories not used as we don't need order here |
// TODO: we should really do this via a kernel I think? Currently we | ||
// grab the whole state back from the device to the host, then | ||
// filter through it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same way as before, definitely inefficient. Is this ever called? To do via kernel you'd want to copy index over, and probably generalise run_select()
to take an index argument rather than assuming the one in device state. index is smaller now so the copy is likely to be faster than this method (which wasn't previously the case, at one point index was the same size as state, but now we compute more of it)
Generally this reminds me that we should at some point probably also make a stride/destride kernel and move away from the CPU methods (I'll raise an issue for that). This would speed up index/state calls, and the history calls in the particle filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is all stuff for later (and is the same as previous)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#311 posted for this now
} | ||
|
||
// NOTE: this is only used for debugging/testing, otherwise we would | ||
// make device_weights a class member. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// make device_weights a class member. | |
// make device_weights and scan class members. |
// delete move and copy to avoid accidentally using them | ||
DustDevice ( const DustDevice & ) = delete; | ||
DustDevice ( DustDevice && ) = delete; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may actually be ok to use these (were you talking about a case where that'd be useful) as I think all members observe the rule of five
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The profiler destructor does not though, and that causes issues as on move profiling stops!
// TODO: This update function is wildly inefficient; we should | ||
// probably support things like "copy one state to all the particles | ||
// of that parameter index", possibly as a kernel. | ||
void set_state_from_pars(const std::vector<pars_type>& pars) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the issue here that the initialisation of state from model can be stochastic? So even with the same pars you'd get different states?
Or do you mean making the model initialiser device compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be yes, but I am not sure we cope correctly with that yet and we do not test it anywhere - I think this is wrong in CPU code too #310 (added to comment too)
state_swap.get_array(this->state_.data() + value_offset(), | ||
host_memory_stream_, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant above would change code here. Rather than just using get_array
which does a D->D memcpy, this would call a kernel which both destrides and does the memcpy at the same time (for probably no/little extra cost)
if (run_block_size_int % 32 != 0) { | ||
cpp11::stop("'run_block_size' must be a multiple of 32 (but was %d)", | ||
run_block_size_int); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes setting block size and block count = 1 is useful for debugging so everything is serial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unchanged from before - I'm inclined not to change it here atm as for debugging we can hack it in
cpp11::stop("Expected 'step' to be scalar or length %d", | ||
obj->n_particles()); | ||
} | ||
if (!std::is_same<T, Dust<typename T::model_type>>::value && len != 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yikes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why yikes here? this is less hairy than some of the template magic we had before 🙃
@@ -0,0 +1,77 @@ | |||
/// IMPORTANT; changes here must be reflected in inst/template/dust_methods.hpp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update these files in the developer notes, I would probably forget to look here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Another total redesign of the way that gpu models are included, reflecting how we actually use these now.
dust::has_gpu_support
(this supercedes the oldhas_gpu_support
template as this is useful to know ahead of time).device
argumentThis PR will represent a mid-point along a series of smaller cleanups, as the current gpu code is still a bit redundant.
Some thoughts on future cleanups that we might move into separate PRs
filter_state_type
definition in Dust and DustDevice? (hard to due disabled=
/move constructor)real_bits
__nv_exec_check_disable__
Fixes #292
Fixes #154 (or close enough anyway)
Fixes #254 (by preventing the problem for now)