Enhanced syntax

Grappa Syntax

Notes and ideas related to the enhanced C++ syntax that we'd like to have to write Grappa programs easier (and better). Most likely these will consist of uses of C++11 features such as lambdas, GCC (or C++11-style) attributes (on types or code blocks), and good old-fashioned compiler pragmas.

TODO

Nail down a list of global analyses and try to decide which ones are implementable
See if it looks like DragonEgg would carry attributes through to LLVM IR, and look into how Cilk constructs would appear in LLVM IR (probably have to do that ourselves?)
Come up with syntax for if we did grappa_spawn and grappa_for as hijacked function calls

Global pointers

Use GCC-style variable attributes to annotate pointers that should be considered global.

#define global __attribute__(grappa_global)
int global * xs;

Compiler passes

Straightforward transformation of dereferences to delegate ops

int x = xs[i]; // => int x = SoftXMT_delegate_read(xs+i);
xs[i] = 1; // => SoftXMT_delegate_write(xs+i, 1);

Make compiler 'aware' of these global pointers and allow it to do caching around them as long as it knows it can hold onto the value.
- If this pass is done after the compiler has determined which loads/stores it can keep in registers, then this might just happen automatically
- LLVM passes can determine where things could be kept in registers to eliminate redundant loads, etc.
  - Trick would be making sure it uses normal loads/stores to put temp values on the task's own stack (if not done right, could issue delegate ops for all loads/stores)
- For instance, compiler can issue a single delegate read each for xs[j] & xs[i] in: int distance = (xs[j]-xs[i])*(xs[j]-xs[i]).
Special Grappa-specific transformations
- Cache adjacent dereferences (special case of 'bundle' operation below?)
```
int vstart = xoff[2*v];
int vend = xoff[2*v+1];
// generates:
int _buf[2];
Incoherent<int>::RO cxoff(xoff+2*v, 2, _buf);
int vstart = cxoff[0];
int vend = cxoff[1];
```
- Bundle multiple remote reads and issue them together to allow the delegate ops to proceed in parallel
- Requires knowing dependences between reads and allowable reorderings
- Also handle multiple "allreduce"s, e.g.:
```
nfrontier_edges += allreduce_add(delta_frontier_edges);
nremaining_edges += allreduce_add(delta_frontier_edges);
```
- Delay consistency where possible
  - if no possible reads-after-write in a loop, generate feed-forward delegate instead, sync on the outermost scope where no RAW still holds (and that will have a global join)
  - would require the ability to tag which global joiner to use (use template parameter to pick a global joiner statically, do this at forall loop to sync on and in delegate call)
  - Automatically generate 'update' delegate ops (simpler special case of automatic delegates)
  - There may be cases where a variable is simply read and updated (simplest case is +=, but more complicated operations may be done)
  - It should be safe to transform these from a delegate read and delegate write into a single delegate 'update'

Tasks

Use lambdas to get a closure which 'captures' the stuff we'd need to package with the task
Need compiler support in order to:
- Handle pass-by-reference captures
- either disallow them for public/remote tasks
- or change them into appropriate delegate reads/writes

Possible syntaxes

Cilk-like spawn keyword
- Cilk uses a special keyword, we don't want to hack the frontend ourselves, so options are:
- Use Cilk-Plus GCC branch, with DragonEgg plugin to do LLVM optimization and codegen passes.
- Not GCC-style attributes can't go on statements (specifically function calls)
- C++11 attributes can go on statements, but neither GCC or Clang supports these yet
- Do we want Cilk semantics as well?
- Implicit sync when the spawn goes out of scope
- Example:
```
 int pair[2];
 int x = 13;
 grappa_spawn [=]() {
 	// `pair` needs to be global if task is stolen
     pair[0] = compute1(x);
 }();
 auto t2 = [=](int n) { pair[1] = compute2(n); };
 grappa_spawn t2(n);
 // private/remote spawn if given remote location
 Node target = 2;
 grappa_spawn(target) [=](int value) {
     xs[mynode()] = value;
 }(13);
```

Special function call

Avoid messing with compiler frontend
Example:

 Node target = 2;
 int pair[2];
 int x = 13;
 grappa_spawn([=](int value) {
 	pair[0] = compute1(x);
 });
 auto t2 = [=](int n) { pair[1] = compute2(n); };
 // args come after...
 grappa_spawn(t2, n);
 // private/remote spawn if given remote location as first argument (function overloading)
 grappa_spawn(target, [=](int value) {
     xs[mynode()] = value;
 }, 13);

Implementation Challenges:
- Variable size tasks: We currently assume tasks are a fixed size, but the size/type of a lambda is determined by the implicit/explicit captures. We'd need to either:
  - Support variable-sized tasks
  - Implicitly cache global * to lambda on execution of task
  - Hack the compiler to give us an efficient compromise (like see if it can fit it into a 4-arg task directly or whether it has to generate RO/RW cache calls)
- Determine what's local/remote
  - maybe we can get away with changing all pointers to global * (and let short-circuiting help the case where the task isn't stolen)
- Do pass to find minimal set of captures (I doubt the default implementation does this), so if x and y or only ever accessed as x+y then we should be able to just pass x+y in the capture.
- sync: what do we want it to do? We would need to implement a slightly smarter 'GlobalTaskJoiner' that implicitly waits at the end of the scope, but this should be doable.

Automatic delegates, CPS Transform

Generic delegate operations can allow multiple things to be done on the remote side without multiple round-trips

If a number of operations can be shown to be to the same node, it may be possible to automatically eliminate round-trips by packaging them up into a delegate operation

Example automatic delegate:

 int x = 13;
 int global* a, global* b; // both on node 2
 *a = *b + x;
 // currently would be:
 int b_val = delegate_read(b);
 delegate_write(b_val + x);
 // with delegate instead:
 grappa_delegate(2, [=]() {
 	// all local
 	*a = *b + x;
 });

Continuations...
- Rather than suspending on long delegate operations, with some compiler help we could turn the stuff after a delegate operation into a 'continuation' (lambda) that gets executed when the delegate finishes
- It isn't clear if this is better than tying up a stack because the data for the continuation has to live somewhere
- Might only be useful if the amount of state needed for the continuation is small, then it can just be sent along with the delegate op and optionally executed at remote location
- Example:
```
 int x = 13;
 int global * a; // 2D-address to node 2
 grappa_delegate(2, [=]() {
 	*a = x;
 });
 x = 14;
 // becomes...
 int x = 13;
 int global * a;
 auto cont = [&]() { x = 14; }
 Node origin = mynode();
 grappa_delegate(2, [=](){
 	*a = x;
 	grappa_spawn(cont);
 });
```
- If this was done everywhere, we could eliminate the need to suspend tasks entirely...

For loops

We've discussed using Cilk-style semantics in for loops. In particular, it would be useful to us to delay consistency on writes until after the loop is completed. This shouldn't affect syntax, though, so maybe we can worry about that later.

Loop styles

Traditional loops (Cilk-style)
- Too general? We can only directly call async_parallel_for if it's doing i++ and no funny business.
- cilk_for works this way, and it's just a compilation error if they try to write things in the loop header that Cilk doesn't understand (i.e. only a single declaration, only fixed increments, etc.)
- Example:
```
 extern int global * xs;
 int total = 0;
 grappa_for ( int i=0; i < N; i++ ) {
 	fetch_add(&total, xs[i]);
 }
```
New C++11 for-each loops.
- For-each syntax with 'range' construct
- Allows to restrict to only strict ranges rather than having to make sure their loop increment is good
- Example:
```
 grappa_for (int i : range(0, N)) {
     fetch_add(&total, xs[i]);
 }
```
- Could also generate "forall_local" calls if iterating over Grappa array:
```
extern double * global base;
grappa::array<double> xs(base, N);
grappa_for (double& x : xs) {
  // run locally where 'v' is
}
```
Lambda-based loops
- To avoid changing the frontend to include a grappa_for keyword, we might want to just hijack a special function call which would then take a lambda for the body (strictly speaking, it would take a 'task', however you want to express it).
- Example:
```
 grappa_for (0, N, [=](int i) {
 	fetch_add(&total, xs[i]);
 });
```
- Similar to C++11-style loop, forall_local could be generated:
```
grappa_for (xs, [=](double& v) {
  fetch_add(&total, v);
});
```

Caching in for loops

Redeclaration
- We can explicitly say what should be cached in a for loop with another GCC-style varible attribute
- cached([induction expr = <induction variable>][, num elements = 1])
  - induction expression lets it know which elements will be accessed in this iteration
    - (on second thought, what does this really give us beyond what can be implicitly determined?)
  - num elements is the number of elements to cache for each iteration
```
 grappa_for (int i = 0; i < N; i++) {
     // explicitly cache with offset
     int cached(i+offset) * cxs = xs;
     int cached(2*i,2) * cxoff = xoff;
 	int n = cxoff(2*i+1) - cxoff(2*i);
 	fetch_add(&cxs[i], n);
 }
```
Explicit as part of C++11-style for-each loop
- Use attributes to make a 'vector'-style declaration of the array being iterated over and how it should be cached
- I have to say, I'm not actually very happy with this approach--seems to force a particular way of programming (one that I think not many C/C++ programmers are used to). And it's not a general solution because it doesn't allow you to index into multiple caches.
- Example:
```
 #define cache(addr, start, end) __attribute__(incoherent_cache(addr, start, end));
 int total = 0;
 grappa_for (int& x : cache(xs, 0, N)) {
     int n = xoff[2*i+1] - xoff[2*i];
 	fetch_add(&xs[i], n);
 }
```
- How to extend this kind of for-each loop to support multiple caches? Something like Chapel's zipper loop syntax? I'd rather just give indexes and let the compiler figure it out, it at all possible, but it may be that the analysis is too hairy.
Pragmas before loop
- Familiar approach, similar to OpenMP-style pragmas
- Can provide an essentially arbitrary amount of extra information
- Syntax: cache(<variable>[, <induction expression>][, <num elements>])
- Example:
```
 #pragma grappa cache(xs) cache(xoff,2*i,2)
 grappa_for (int i=0; i<N; i++) {
 	int n = xoff[2*i+1] - xoff[2*i];
 	fetch_add(&xs[i], n);
 }
```

Replicated stuff

Common things that are RO (base ptrs)
- maybe also handled transparently by assuming RO shared args and using "ArgCache"
Common things we want local access to (push buffer)
- maybe this shouldn't be exposed at all, allocate on a node, talk to it through a global pointer, and the runtime sorts out where copies are placed or things are buffered in order to run efficiently

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Enhanced syntax

Grappa Syntax

TODO

Global pointers

Compiler passes

Tasks

Possible syntaxes

Automatic delegates, CPS Transform

For loops

Loop styles

Caching in for loops

Replicated stuff

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally