-
Notifications
You must be signed in to change notification settings - Fork 49
Enhanced syntax
Notes and ideas related to the enhanced C++ syntax that we'd like to have to write Grappa programs easier (and better). Most likely these will consist of uses of C++11 features such as lambdas, GCC (or C++11-style) attributes (on types or code blocks), and good old-fashioned compiler pragmas.
- Nail down a list of global analyses and try to decide which ones are implementable
- See if it looks like DragonEgg would carry attributes through to LLVM IR, and look into how Cilk constructs would appear in LLVM IR (probably have to do that ourselves?)
- Come up with syntax for if we did
grappa_spawnandgrappa_foras hijacked function calls
- Use GCC-style variable attributes to annotate pointers that should be considered global.
#define global __attribute__(grappa_global)
int global * xs;- Straightforward transformation of dereferences to delegate ops
int x = xs[i]; // => int x = SoftXMT_delegate_read(xs+i);
xs[i] = 1; // => SoftXMT_delegate_write(xs+i, 1);- Make compiler 'aware' of these global pointers and allow it to do caching around them as long as it knows it can hold onto the value.
- If this pass is done after the compiler has determined which loads/stores it can keep in registers, then this might just happen automatically
- LLVM passes can determine where things could be kept in registers to eliminate redundant loads, etc.
- Trick would be making sure it uses normal loads/stores to put temp values on the task's own stack (if not done right, could issue delegate ops for all loads/stores)
- For instance, compiler can issue a single delegate read each for
xs[j]&xs[i]in:int distance = (xs[j]-xs[i])*(xs[j]-xs[i]).
- Special Grappa-specific transformations
- Cache adjacent dereferences (special case of 'bundle' operation below?)
int vstart = xoff[2*v]; int vend = xoff[2*v+1]; // generates: int _buf[2]; Incoherent<int>::RO cxoff(xoff+2*v, 2, _buf); int vstart = cxoff[0]; int vend = cxoff[1];
- Bundle multiple remote reads and issue them together to allow the delegate ops to proceed in parallel
- Requires knowing dependences between reads and allowable reorderings
- Also handle multiple "allreduce"s, e.g.:
nfrontier_edges += allreduce_add(delta_frontier_edges); nremaining_edges += allreduce_add(delta_frontier_edges);
-
Delay consistency where possible
- if no possible reads-after-write in a loop, generate feed-forward delegate instead, sync on the outermost scope where no RAW still holds (and that will have a global join)
- would require the ability to tag which global joiner to use (use template parameter to pick a global joiner statically, do this at forall loop to sync on and in delegate call)
- Automatically generate 'update' delegate ops (simpler special case of automatic delegates)
- There may be cases where a variable is simply read and updated (simplest case is +=, but more complicated operations may be done)
- It should be safe to transform these from a delegate read and delegate write into a single delegate 'update'
- Use lambdas to get a closure which 'captures' the stuff we'd need to package with the task
- Need compiler support in order to:
- Handle pass-by-reference captures
- either disallow them for public/remote tasks
- or change them into appropriate delegate reads/writes
-
Cilk-like
spawnkeyword- Cilk uses a special keyword, we don't want to hack the frontend ourselves, so options are:
- Use Cilk-Plus GCC branch, with DragonEgg plugin to do LLVM optimization and codegen passes.
- Not GCC-style attributes can't go on statements (specifically function calls)
- C++11 attributes can go on statements, but neither GCC or Clang supports these yet
- Do we want Cilk semantics as well?
- Implicit
syncwhen thespawngoes out of scope - Example:
int pair[2]; int x = 13; grappa_spawn [=]() { // `pair` needs to be global if task is stolen pair[0] = compute1(x); }(); auto t2 = [=](int n) { pair[1] = compute2(n); }; grappa_spawn t2(n); // private/remote spawn if given remote location Node target = 2; grappa_spawn(target) [=](int value) { xs[mynode()] = value; }(13);
-
Special function call
- Avoid messing with compiler frontend
- Example:
Node target = 2; int pair[2]; int x = 13; grappa_spawn([=](int value) { pair[0] = compute1(x); }); auto t2 = [=](int n) { pair[1] = compute2(n); }; // args come after... grappa_spawn(t2, n); // private/remote spawn if given remote location as first argument (function overloading) grappa_spawn(target, [=](int value) { xs[mynode()] = value; }, 13);
- Implementation Challenges:
- Variable size tasks: We currently assume tasks are a fixed size, but the size/type of a lambda is determined by the implicit/explicit captures. We'd need to either:
- Support variable-sized tasks
- Implicitly cache global * to lambda on execution of task
- Hack the compiler to give us an efficient compromise (like see if it can fit it into a 4-arg task directly or whether it has to generate RO/RW cache calls)
- Determine what's local/remote
- maybe we can get away with changing all pointers to
global *(and let short-circuiting help the case where the task isn't stolen)
- maybe we can get away with changing all pointers to
- Do pass to find minimal set of captures (I doubt the default implementation does this), so if
xandyor only ever accessed asx+ythen we should be able to just passx+yin the capture. -
sync: what do we want it to do? We would need to implement a slightly smarter 'GlobalTaskJoiner' that implicitly waits at the end of the scope, but this should be doable.
- Variable size tasks: We currently assume tasks are a fixed size, but the size/type of a lambda is determined by the implicit/explicit captures. We'd need to either:
- Generic delegate operations can allow multiple things to be done on the remote side without multiple round-trips
- If a number of operations can be shown to be to the same node, it may be possible to automatically eliminate round-trips by packaging them up into a delegate operation
- Example automatic delegate:
int x = 13; int global* a, global* b; // both on node 2 *a = *b + x; // currently would be: int b_val = delegate_read(b); delegate_write(b_val + x); // with delegate instead: grappa_delegate(2, [=]() { // all local *a = *b + x; });
- Continuations...
- Rather than suspending on long delegate operations, with some compiler help we could turn the stuff after a delegate operation into a 'continuation' (lambda) that gets executed when the delegate finishes
- It isn't clear if this is better than tying up a stack because the data for the continuation has to live somewhere
- Might only be useful if the amount of state needed for the continuation is small, then it can just be sent along with the delegate op and optionally executed at remote location
- Example:
int x = 13; int global * a; // 2D-address to node 2 grappa_delegate(2, [=]() { *a = x; }); x = 14; // becomes... int x = 13; int global * a; auto cont = [&]() { x = 14; } Node origin = mynode(); grappa_delegate(2, [=](){ *a = x; grappa_spawn(cont); });
- If this was done everywhere, we could eliminate the need to suspend tasks entirely...
We've discussed using Cilk-style semantics in for loops. In particular, it would be useful to us to delay consistency on writes until after the loop is completed. This shouldn't affect syntax, though, so maybe we can worry about that later.
-
Traditional loops (Cilk-style)
- Too general? We can only directly call
async_parallel_forif it's doingi++and no funny business. -
cilk_forworks this way, and it's just a compilation error if they try to write things in the loop header that Cilk doesn't understand (i.e. only a single declaration, only fixed increments, etc.) - Example:
extern int global * xs; int total = 0; grappa_for ( int i=0; i < N; i++ ) { fetch_add(&total, xs[i]); }
- Too general? We can only directly call
-
New C++11 for-each loops.
- For-each syntax with 'range' construct
- Allows to restrict to only strict ranges rather than having to make sure their loop increment is good
- Example:
grappa_for (int i : range(0, N)) { fetch_add(&total, xs[i]); }
- Could also generate "forall_local" calls if iterating over Grappa array:
extern double * global base; grappa::array<double> xs(base, N); grappa_for (double& x : xs) { // run locally where 'v' is }
-
Lambda-based loops
- To avoid changing the frontend to include a
grappa_forkeyword, we might want to just hijack a special function call which would then take a lambda for the body (strictly speaking, it would take a 'task', however you want to express it). - Example:
grappa_for (0, N, [=](int i) { fetch_add(&total, xs[i]); });
- Similar to C++11-style loop,
forall_localcould be generated:
grappa_for (xs, [=](double& v) { fetch_add(&total, v); });
- To avoid changing the frontend to include a
-
Redeclaration
- We can explicitly say what should be cached in a for loop with another GCC-style varible attribute
-
cached([induction expr = <induction variable>][, num elements = 1])-
induction expressionlets it know which elements will be accessed in this iteration- (on second thought, what does this really give us beyond what can be implicitly determined?)
-
num elementsis the number of elements to cache for each iteration
-
grappa_for (int i = 0; i < N; i++) { // explicitly cache with offset int cached(i+offset) * cxs = xs; int cached(2*i,2) * cxoff = xoff; int n = cxoff(2*i+1) - cxoff(2*i); fetch_add(&cxs[i], n); }
-
Explicit as part of C++11-style for-each loop
- Use attributes to make a 'vector'-style declaration of the array being iterated over and how it should be cached
- I have to say, I'm not actually very happy with this approach--seems to force a particular way of programming (one that I think not many C/C++ programmers are used to). And it's not a general solution because it doesn't allow you to index into multiple caches.
- Example:
#define cache(addr, start, end) __attribute__(incoherent_cache(addr, start, end)); int total = 0; grappa_for (int& x : cache(xs, 0, N)) { int n = xoff[2*i+1] - xoff[2*i]; fetch_add(&xs[i], n); }
- How to extend this kind of for-each loop to support multiple caches? Something like Chapel's zipper loop syntax? I'd rather just give indexes and let the compiler figure it out, it at all possible, but it may be that the analysis is too hairy.
-
Pragmas before loop
- Familiar approach, similar to OpenMP-style pragmas
- Can provide an essentially arbitrary amount of extra information
- Syntax:
cache(<variable>[, <induction expression>][, <num elements>]) - Example:
#pragma grappa cache(xs) cache(xoff,2*i,2) grappa_for (int i=0; i<N; i++) { int n = xoff[2*i+1] - xoff[2*i]; fetch_add(&xs[i], n); }
- Common things that are RO (base ptrs)
- maybe also handled transparently by assuming RO shared args and using "ArgCache"
- Common things we want local access to (push buffer)
- maybe this shouldn't be exposed at all, allocate on a node, talk to it through a global pointer, and the runtime sorts out where copies are placed or things are buffered in order to run efficiently