Skip to content
This repository was archived by the owner on Apr 28, 2023. It is now read-only.

[wip] Grid synchronizations and mapping to blocks. #370

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

[wip] Grid synchronizations and mapping to blocks. #370

wants to merge 16 commits into from

Conversation

math-fehr
Copy link
Contributor

This is a wip allowing grid synchronizations to be produced. It also changes the mapping algorithm,
to be able to map bands bands to blocks even if there is no outermost coincident band.

This wip is based on #316, so it includes warp synchronizations.

Instead of finding the outermost band at the base of the tree, the mapping will find all the outermost
coincident bands (bands that have a coincident dimension, and that has no ancestor band with a
coincident dimension) in the tree, and place some zero-dimension band if necessary. It will then map the
threads below the selected bands, and the blocks above the bands. It will also insert grid synchronizations where it is necessary to ensure the correctness of the compiled kernel.

To be able to use the grid synchronization, the kernel should be launched with a cooperative launch.
This requires the co-residency of all blocks and threads. This means that the number of threads and
blocks should be low enough. This is checked before the mapping is done, so the new mapping
algorithm is only used when possible. Also, grid synchronization is only available with CUDA 9, but this
isn't currently checked. Grid synchronization is only available when --grid_sync option is on (it is on
by default).

Grid synchronizations seem to find a better mapping than the original algorithm for the 4fcrelu kernel.
However, I didn't tried it with many kernels, so I don't know if there is other kernels that can have better
mapping thanks to grid synchronizations.

There is still some improvement to do, such as reducing the number of grid synchronizations inserted, or
having a better shared memory promotion (the shared memory promotion is called multiple times on
distinct schedule tree, which maximize the promotion in the firsts schedule trees, instead of promoting
all the trees at the same time).

Copy link
Contributor

@nicolasvasilache nicolasvasilache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit should add a test in test_basic_gpu.cc to test the new path and cuLink calls independently of TC.

Copy link
Contributor

@nicolasvasilache nicolasvasilache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be folded in first commit

@ftynse
Copy link
Contributor

ftynse commented Jun 6, 2018

@nicolasvasilache as a reminder, github does not show which commit you reviewed so "this commit should add" is not helpful.

Copy link
Contributor

@nicolasvasilache nicolasvasilache left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f3424d1 looks good

// Tile the outermost coincident bands. See the obtainOuterCoincidentBands
// function to see what these bands correspond to.
// Splits the bands into tile loop band and point loop band where point loops
// have fixed trop counts specified in "tiling", and returns a pointer to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: trip

// coincident bands is equal to the leafs. Also, these created bands are
// put at the highest possible level of the tree, which also minimize the
// number of bands inserted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spurious newline

return false;
}

// Return the outermost coincident band. These are the bands with at least
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coincident bands (plural)


// Return the outermost coincident band. These are the bands with at least
// one coincident dimension, that are permutable, and that have no such
// coincident band as ancestors. Some of these bands are created with zero
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the first condition but not the second.
Could you please rephrase and motivate?
My intuition is that these outer bands are meant to be used to tile and map to blocks ad that you want no child left without a block (even if it means mapping to a single block).

// in the tree.
bool hasAtLeastOneCoincidentLoop(detail::ScheduleTree* tree) {
if (auto band = tree->elemAs<ScheduleTreeElemBand>()) {
if (find(band->coincident_.begin(), band->coincident_.end(), true) !=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks more complex than necessary, just hoist the permutable_ condition and make a simple for loop?

if (auto band = tree->elemAs<ScheduleTreeElemBand>() && band->permutable_) {
   for (auto c : band->coincident_) {
      if (c) {
         return true;
      }
   }

auto nChildren = tree->numChildren();

// If there is no coincident loop, we create an outer band with no dimensions
if (not hasAtLeastOneCoincidentLoop(tree)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand why this is hasAtLeastOneCoincidentLoop on at "least one path" where I would expected it to be "along all paths" ?

if (n == 0) {
return false;
} else if (n == 1) {
return hasAtLeastOneCoincidentLoop(tree->child({0}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this function really wants to implement hasAtLeastOneCoincidentLoop along at least one path then by all means let's drop recursion and go a collect followed by a test (in which case, hoist the code above in a separate helper function).

However, see my comment in obtainOuterCoincidentBands I am wondering if you want an "along all paths" property?

// put at the highest possible level of the tree, which also minimize the
// number of bands inserted.

std::vector<detail::ScheduleTree*> obtainOuterCoincidentBands(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This functions looks significantly more complex than it should so let me ask 2 questions:

  1. my understanding is you want a set of bands with the permutability + coincident property that cover all children. You insert empty bands along paths that cannot be covered. Is this feature independently well-tested? Given that I see no standalone tests for it I would think that it isn't; yet it seems quite a complex behavior so you def want to test it in isolation.
  2. it seems you are implementing a quite complex behavior where you may pick the topmost bands along some paths and are left with injecting empty bands at various depths in various paths to children than cannot be covered initially. Is this complexity really worth it? (i.e. can you exhibit examples where this is needed and it clearly brings benefits?)

One of the most annoying things in polyhedral compilation is complexity debt and this function feels like it is adding a lot of debt to me.

Can we cut the complexity by taking the topmost band with the property you want and declare that this is the depth at which we insert zero bands along all paths?

More importantly, can we kill recursion with fire, I am 99% certain this function is generally incorrect even though I do not fully understand it. Even if you go for (what I understand is) the general behavior you want, it seems to me you should be able to:

  1. collectBFS and use an ordered map to remember covering nodes. For each new node if one of its ancestors is covering then you're done; othwerwise if it has the property mark it; otherwise skip;
  2. collectBFS again and filter for nodes such that no ancestor and no descendant are marked. These are the places where you want to insert 0-bands; mark them as you find them.
  3. for all nodes marked in 2, insert above.

Note that what I propose above assumes you really want to implement "all children are covered" which I don't think hasAtLeastOneCoincidentLoop allows you to achieve.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I think a lot of documentation with various tree shapes that explain what you want to return in various cases is required and the tests too.
Still I would ask again if we should just err on the side of the simplest pattern that helps (e.g. get topmost band and inject 0-bands at only that depth in the tree, if needed) achieve something.

Of course this is all contingent on the end to end PR actually helping in concrete cases which from a recent discussion with @ftynse didn't seem very clear to me.

auto domain = root->elemAs<ScheduleTreeElemDomain>();
CHECK(domain);
auto band = ScheduleTree::makeEmptyBand(root);
if (nChildren == 0 || root == tree) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am extremely skeptical that this doubly-nested conditional + recursion is correct..

std::vector<detail::ScheduleTree*> Scop::tileOuterCoincidentBands(
const TilingView& tileSizes) {
using namespace tc::polyhedral::detail;
auto bands = obtainOuterCoincidentBands(scheduleRoot(), scheduleRoot());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one useful sanity check here seems to be that no 2 bands you return are nested. It does not seem superfluous to add this check (and maybe also a check that all paths are covered if I understood properly).

math-fehr added 15 commits June 20, 2018 18:38
As for now, there is a flag to disable grid synchronization.
When the grid synchronization flag is enabled, tc will try to launch
the kernel with a cooperative launch. The cooperative launch requires
co-residency in the device, which is not currently checked. It will thus
fail when there is too much blocks or threads.
This is a work in progress, and it currently lacks synchronization between the
different parts.

Instead of looking for an unique outermost band, looking for multiple outermost
coincident bands can result in more degrees of parallelism. For instance, in the
Copy2 example in test_cuda_mapper, this change result in the use of blocks,
whereas it used only one block before.

These change are only legal if there is a way to synchronize the different
parts. This is possible thanks to grid synchronization (available only in
CUDA 9 however).

The coincident outer bands are the bands that are permutable and that have
a dimension that is coincident. The outermosts coincident outer bands are the
coincident outer bands that do not have a ancestor which is a coincident outer
band. Some outermost coincident outer bands are added in the most higher level
of the tree possible, to have another property: All leafs has an outermost
coincident outer bands (which is unique by definition). By adding these bands
in the most higher level of the tree possible, we add the minimum amount of
these bands. Adding these bands are not yet necessary, but will be when adding
synchronizations. They will be useful to reduce the number of grid
synchronizations made.
The grid synchronizations are added above the outermost coincident bands,
in every sequence, and below every sequential band.
Theses informations are needed to know if a cooperative kernel can be launched.
Indeed, cooperative launch requires co-residency of all threads and blocks.
Instead of applying it on every branch one by one greedily.
Bands should not be inserted between a sequence and a filter.
Also, depth wasn't computed correctly.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants