Decouple GPU and CPU models #302

Merged

johnlees merged 77 commits into master from i292-simpler-gpu-allcuda

Nov 3, 2021

Member

richfitz commented Nov 1, 2021

Another total redesign of the way that gpu models are included, reflecting how we actually use these now.

on compilation we might or might not support including gpu-like code. This is detected automatically or can be forced with the cpp11 pseudo-attribute dust::has_gpu_support (this supersedes the old has_gpu_support template, as this is useful to know ahead of time).
on initialisation we might point at a device or not; that instance will then run entirely on either gpu or cpu
all the run/simulate/compare_data/filter functions lose their old device argument
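
The split described above (support decided at compile time, backend chosen once at initialisation) can be sketched roughly as follows. This is a minimal illustration, not the dust implementation: `has_gpu_support` here is a plain constant keyed off nvcc's `__NVCC__` macro, and `backend`, `select_backend` and the "negative device id means CPU" convention are hypothetical names and assumptions introduced for the example.

```cpp
#include <stdexcept>

// Assumption: GPU support is known at compile time. Here it is keyed off
// __NVCC__ (defined when compiling with nvcc); the real package may detect
// or force this differently.
#ifdef __NVCC__
constexpr bool has_gpu_support = true;
#else
constexpr bool has_gpu_support = false;
#endif

// Hypothetical: an instance runs entirely on one backend for its lifetime.
enum class backend { cpu, gpu };

// Hypothetical convention: a negative device id means "no device, use the CPU".
inline backend select_backend(int device_id,
                              bool gpu_available = has_gpu_support) {
  if (device_id < 0) {
    return backend::cpu;
  }
  if (!gpu_available) {
    throw std::runtime_error("GPU requested but not compiled with GPU support");
  }
  return backend::gpu;
}
```

Because the backend is fixed when the object is created, later calls such as run or simulate need no per-call device argument.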

This PR will represent a mid-point along a series of smaller cleanups, as the current gpu code is still a bit redundant.

Some thoughts on future cleanups that we might move into separate PRs

longer term it would be nice to exclude all cuda code from being included if not used (e.g., in interface/dust.hpp do we really want to include cuda/filter.hpp?)
rework filter_state_type definition in Dust and DustDevice? (hard to do due to the disabled =/move constructor)
device_info should not be the only source of real_bits
redefine name of __nv_exec_check_disable__
move all cuda predefines somewhere into random
update vignette, which is now wrong (already has issue)
split the test-gpu.R test file into support and running

Fixes #292
Fixes #154 (or close enough anyway)
Fixes #254 (by preventing the problem for now)

richfitz and others added 30 commits

October 18, 2021 09:47


          Remove all gpu code from c++ models

6c3941f


          Much simpler gpu toggle for compiled code

445c7f2


          Skeleton for gpu version

219eeaa


          Construction of device object

6ef3104

WIP

c0eed41


          More WIP


          Try by removing CPU(DA) code

328c21a


          Put CPUDA code back

81f31ac


          Add device filter back

5d79057


          In dust_device obj, use set_x methods rather than initialisation

469cdc8


          Write remaining state and RNG interface fns

6ff6b90


          Remove tmp particle from constructor

ac63cb4


          Temporary dust interface for compiling

7e1cefc


          Add dustdevice header

d2ded66


          Fix some compile errors

36014ac


          Add namespace to filter state

583c465


          Remove unnamed const bools

ff757ef


          Go back to shared_ptr types


          Add host interface to device resample

58bcce1


          Change header order again

e1cf961


          Add filter state header

b22f716


          Interface compiler errors

e07a0d3


          Change filter interface calls to device version

a2e332f


          Start on new approach to compilation

b7276c5


          Return device info

b7bde85


          Better error messages

53d3c7d


          Partially working CPUDA code

9dd10c1


          Fix device select

05eb0e9


          Fix multiple parameter init

414cb76


          Conditionally enable gpu

3b89406

richfitz added 12 commits

November 2, 2021 14:53


          Test of parameter setting into multipar object

43e27b1


          Drop assertions

63e7f20


          Simplify rng state fetch

7d46093


          Adjust condition for error reset

ec26f09


          Simplify multiparameter setting

756fbdf


          Simplify device state construction

270b0a1


          Start removing redundant code

697236e


          Simplify further


          Simplify rng set

42e997d


          Make to/from device simpler

d69396f


          General tidyup

9ef7f1c


          Make function less mysterious

5a3f32b

richfitz mentioned this pull request

reset with vector of times on the gpu will be slow! #254

Closed

richfitz added 5 commits

November 2, 2021 17:05


          Combine similar package and dust code

cec8f94


          Small tidyups noticed in PR review

1636a0e


          Bump version number

0f2d439


          Regenerate/redocument

7b362c2


          Eliminate unused code

378bf9b

richfitz marked this pull request as ready for review

November 2, 2021 17:37

richfitz requested a review from johnlees

November 2, 2021 17:37

johnlees requested changes

inst/include/dust/cuda/dust_device.hpp Outdated

+                  const size_t n_time = step_end.size();
+                  // The filter snapshot class can be used to store the indexed state
+                  // (implements async copy, swap space, and deinterleaving)
+                  // Filter trajctories not used as we don't need order here

Member

johnlees Nov 3, 2021

Suggested change

      
                // Filter trajctories not used as we don't need order here
          
                // Filter trajectories not used as we don't need order here

inst/include/dust/cuda/dust_device.hpp Outdated

Comment on lines 320 to 322

+                // TODO: we should really do this via a kernel I think? Currently we
+                // grab the whole state back from the device to the host, then
+                // filter through it.

Member

johnlees Nov 3, 2021

This is the same way as before, definitely inefficient. Is this ever called? To do this via a kernel you'd want to copy index over, and probably generalise run_select() to take an index argument rather than assuming the one in device state. index is smaller now, so the copy is likely to be faster than this method (which wasn't previously the case: at one point index was the same size as state, but now we compute more of it).

Generally this reminds me that we should at some point probably also make a stride/destride kernel and move away from the CPU methods (I'll raise an issue for that). This would speed up index/state calls, and the history calls in the particle filter.

Member Author

richfitz Nov 3, 2021

yeah, this is all stuff for later (and is the same as previous)

Member Author

richfitz Nov 3, 2021

#311 posted for this now
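
For reference, the host-side approach discussed in this thread (fetch the whole state, then filter it by index) amounts to something like the sketch below. It assumes the state has already been copied back and laid out particle by particle; `select_state` and its parameters are hypothetical names, not functions from the package.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumption: `state` holds n_state values for particle 0, then n_state
// values for particle 1, and so on (already copied back to the host).
// Returns only the entries listed in `index`, in the same particle order.
inline std::vector<double> select_state(const std::vector<double>& state,
                                        std::size_t n_state,
                                        std::size_t n_particles,
                                        const std::vector<std::size_t>& index) {
  if (state.size() != n_state * n_particles) {
    throw std::invalid_argument("state has unexpected length");
  }
  std::vector<double> ret;
  ret.reserve(index.size() * n_particles);
  for (std::size_t p = 0; p < n_particles; ++p) {
    for (std::size_t i : index) {
      if (i >= n_state) {
        throw std::out_of_range("index entry exceeds n_state");
      }
      ret.push_back(state[p * n_state + i]);
    }
  }
  return ret;
}
```

The cost the review points at is that the full n_state * n_particles array crosses the device boundary even when index is short; a kernel-based version would copy only index to the device and bring back the selected entries.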

inst/include/dust/cuda/dust_device.hpp Outdated

+                }
+                // NOTE: this is only used for debugging/testing, otherwise we would
+                // make device_weights a class member.

Member

johnlees Nov 3, 2021

Suggested change

      
              // make device_weights a class member.
          
              // make device_weights and scan class members.

inst/include/dust/cuda/dust_device.hpp

Comment on lines +556 to +558

+                // delete move and copy to avoid accidentally using them
+                DustDevice ( const DustDevice & ) = delete;
+                DustDevice ( DustDevice && ) = delete;

Member

johnlees Nov 3, 2021

We may actually be ok to use these (were you talking about a case where that'd be useful?), as I think all members observe the rule of five.

Member Author

richfitz Nov 3, 2021

The profiler destructor does not though, and that causes issues as on move profiling stops!
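
The failure mode described here can be reproduced with a small stand-alone sketch. It assumes a guard whose destructor stops profiling unconditionally; `profiler_state`, `scoped_profile` and `running_after_move` are hypothetical stand-ins, not the profiler class used in the package.

```cpp
#include <memory>
#include <utility>

// Hypothetical stand-in for a global profiler switch.
struct profiler_state {
  bool running = false;
};

// RAII guard: starts profiling on construction, stops it on destruction.
class scoped_profile {
public:
  explicit scoped_profile(profiler_state& state) : state_(&state) {
    state_->running = true;
  }
  // Stops profiling unconditionally; the destructor cannot tell whether
  // this object has been moved from.
  ~scoped_profile() { state_->running = false; }
  // The defaulted move only copies the pointer, so the moved-from guard
  // still refers to the same profiler.
  scoped_profile(scoped_profile&&) = default;

private:
  profiler_state* state_;
};

inline bool running_after_move() {
  profiler_state state;
  auto first = std::make_unique<scoped_profile>(state);  // profiling starts
  scoped_profile kept(std::move(*first));  // "ownership" moves to kept
  first.reset();         // moved-from guard is destroyed: profiling stops
  return state.running;  // false, even though kept is still alive
}
```

Deleting the copy and move constructors on the owning class, as in the diff above, sidesteps this without having to make the guard move-aware.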

inst/include/dust/cuda/dust_device.hpp Outdated

Comment on lines 663 to 666

+                // TODO: This update function is wildly inefficient; we should
+                // probably support things like "copy one state to all the particles
+                // of that parameter index", possibly as a kernel.
+                void set_state_from_pars(const std::vector<pars_type>& pars) {

Member

johnlees Nov 3, 2021

Isn't the issue here that the initialisation of state from model can be stochastic? So even with the same pars you'd get different states?
Or do you mean making the model initialiser device compatible?

Member Author

richfitz Nov 3, 2021

It can be, yes, but I am not sure we cope correctly with that yet and we do not test it anywhere. I think this is wrong in the CPU code too: #310 (added to comment too)

inst/include/dust/cuda/filter_state.hpp

Comment on lines +164 to +165

		state_swap.get_array(this->state_.data() + value_offset(),
		host_memory_stream_, true);

Member

johnlees Nov 3, 2021

What I meant above would change code here. Rather than just using get_array which does a D->D memcpy, this would call a kernel which both destrides and does the memcpy at the same time (for probably no/little extra cost)
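
As a host-side illustration of what a destride step does, the sketch below converts an interleaved array into a per-particle layout. The layout is an assumption made for the example (variable i of particle p stored at i * n_particles + p), and `destride` is a hypothetical name; a device kernel would do the same index arithmetic with one thread per element.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumption: in the interleaved layout, variable i of particle p lives at
// interleaved[i * n_particles + p]. The output stores each particle's
// variables contiguously: out[p * n_state + i].
inline std::vector<double> destride(const std::vector<double>& interleaved,
                                    std::size_t n_state,
                                    std::size_t n_particles) {
  if (interleaved.size() != n_state * n_particles) {
    throw std::invalid_argument("input has unexpected length");
  }
  std::vector<double> out(interleaved.size());
  for (std::size_t i = 0; i < n_state; ++i) {
    for (std::size_t p = 0; p < n_particles; ++p) {
      out[p * n_state + i] = interleaved[i * n_particles + p];
    }
  }
  return out;
}
```

Folding this rearrangement into the copy itself is what would let the extra host-side pass be dropped.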

inst/include/dust/interface/cuda.hpp

Comment on lines +48 to +50

+                    if (run_block_size_int % 32 != 0) {
+                      cpp11::stop("'run_block_size' must be a multiple of 32 (but was %d)",
+                                  run_block_size_int);

Member

johnlees Nov 3, 2021

Sometimes setting block size and block count = 1 is useful for debugging so everything is serial

Member Author

richfitz Nov 3, 2021

This is unchanged from before - I'm inclined not to change it here atm as for debugging we can hack it in
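
One way the debugging case could be accommodated, if it were ever wanted, is an explicit opt-in. This is only a sketch: `valid_run_block_size` and its `allow_serial_debug` flag are hypothetical and not part of the package, and rejecting non-positive sizes is an assumption beyond the check quoted in the diff.

```cpp
// Returns true if `size` is an acceptable block size.
// Default rule: a positive multiple of 32 (the CUDA warp size).
// Hypothetical opt-in: with allow_serial_debug, a block size of 1 is also
// accepted so that everything runs serially while debugging.
inline bool valid_run_block_size(int size, bool allow_serial_debug = false) {
  if (allow_serial_debug && size == 1) {
    return true;
  }
  return size > 0 && size % 32 == 0;
}
```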

inst/include/dust/interface/dust.hpp

                     cpp11::stop("Expected 'step' to be scalar or length %d",
                                 obj->n_particles());
                   }
+                  if (!std::is_same<T, Dust<typename T::model_type>>::value && len != 1) {

Member

johnlees Nov 3, 2021

yikes

Member Author

richfitz Nov 3, 2021

Why yikes here? this is less hairy than some of the template magic we had before 🙃
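
The check in this diff is a compile-time type test used inside a function templated on the object type. A stripped-down sketch of the pattern follows; the `Dust` and `DustDevice` class templates here are empty stand-ins, and `toy_model` and `accepts_step_length` are hypothetical names for the example.

```cpp
#include <cstddef>
#include <type_traits>

// Empty stand-ins for the CPU and device classes; only model_type matters.
template <typename Model>
struct Dust {
  using model_type = Model;
};

template <typename Model>
struct DustDevice {
  using model_type = Model;
};

struct toy_model {};

// Returns true if a 'step' vector of length `len` is acceptable for T.
// Any T: length must be 1 or n_particles.
// Anything other than the CPU class Dust<model_type>: length must be 1.
template <typename T>
bool accepts_step_length(std::size_t len, std::size_t n_particles) {
  if (len != 1 && len != n_particles) {
    return false;
  }
  constexpr bool is_cpu =
      std::is_same<T, Dust<typename T::model_type>>::value;
  if (!is_cpu && len != 1) {
    return false;
  }
  return true;
}
```

Naming the trait result (`is_cpu` above) is one way to make the intent of the condition easier to read at the call site.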

inst/template/dust_methods.cpp

		@@ -0,0 +1,77 @@
		/// IMPORTANT; changes here must be reflected in inst/template/dust_methods.hpp

Member

johnlees Nov 3, 2021

Can you update these files in the developer notes, I would probably forget to look here

Member Author

richfitz Nov 3, 2021

done


          Comments from review

b819f7c

richfitz mentioned this pull request

More efficient GPU partial state method #311

Open


          Add another issue link

6d43bd3

richfitz requested a review from johnlees

November 3, 2021 14:37

johnlees approved these changes

johnlees merged commit 143f3a6 into master

johnlees deleted the i292-simpler-gpu-allcuda branch

November 3, 2021 14:55

This was referenced Nov 3, 2021

Update to work with dust 0.11.0 mrc-ide/odin.dust#93

Merged

Tidy cuda vignettes #318

Merged
