Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving performance of Algorithm Backtrack #2

Open
devedse opened this issue May 29, 2019 · 22 comments
Open

Improving performance of Algorithm Backtrack #2

devedse opened this issue May 29, 2019 · 22 comments

Comments

@devedse
Copy link
Owner

devedse commented May 29, 2019

I'd like to track progress being made to improve the performance of the Algorithm backtrack.

I've created a new class that implements generics + structs to remove logging statements at compile/JIT to improve performance but I'm definitely looking for more.

Class with work in progress:
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack2.cs

@redknightlois
Copy link

redknightlois commented May 30, 2019

If you need that to be fast, instead of having NoAction and HasAction bake-in the action itself in a struct and pass it by ref so you know who to call afterwards and inline the hell out of it ;). In that way you avoid the delegate invoke, I showed an example on the talk at DotNext I think.

Dont sweat, you have at least 6x or so to optimize on that routine :)

@devedse
Copy link
Owner Author

devedse commented May 30, 2019

@redknightlois , thanks a lot for this :), I don't think I quite understand everything you talked about but I'll try to figure out as much as I can. I'll come back with some questions for you probably soon :). Thanks in advance.

@redknightlois
Copy link

Which part you need more detail?

@devedse
Copy link
Owner Author

devedse commented May 31, 2019

Ill let you know soon once I have some time behind my code again :)).

@devedse
Copy link
Owner Author

devedse commented Jun 9, 2019

@redknightlois ,

Baseline for generating a Maze is about 3.81 seconds with the lowest being 3.72.
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack2.cs

Currently back in action again :). I played around with your advise a bit:

  • The targets can be stackalloc.

I changed:

MazePoint[] targets = new MazePoint[4];

to:

Span<MazePoint> targets = stackalloc MazePoint[4];

This actually resulted in Mazes being generated a littlebit slower (generation time 3.92 seconds)

  • Don't use the variable, use the constant. That way you are not stalling the processor with a Read-after-write hazard.

What exactly do you mean by this? Every time I add a target I need to increase that targetCount to put the next one in the next slot. How would you propose this? (Or am I missing what you mean 😄)

  • You are paying for the edge comparison every time as it is a normal condition. check on top if you are on the edge or not, and use an optimized routine where it is not (the most usual).

You'd have to check if you're on an edge in every step still right? Now I'll only check if I'm on the left edge if I want to add the left target. If I would do all this beforehand I'd just have to store the results in a bool and then still do an ifcheck if (leftEdge || rightEdge || ....) before I can add the other targets.

Or would you have a better idea?

  • Dont make target sequential, use it as a 'long' (4 bytes). Each check writes always in the same one, then with a single AVX operation you check for 0 ;)

Do you mean that I bitshift the x and y in a long? Or make a Union out of it?

In the MazePointPos I've indeed set the struct layout as Sequential. But not in the default MazePoint. The reason for MazePointPos to be sequential is to save memory when using it in a big array.

  • You can optimize the layout of the data itself, all X together... all Y together and do that with AVX instructions for ultimate l33t :P

I'll see if I can do that. Would you have to do something special to get AVG working?

@devedse
Copy link
Owner Author

devedse commented Jun 11, 2019

I did some more playing around and tried to store the x - 2 that happens 3 times in a variable. This however didn't improve the times but rather slowed them down by about 5%.

Before:

MazePoint cur = stackje.Peek();
x = cur.X;
y = cur.Y;

int targetCount = 0;
if (x - 2 > 0 && !map[x - 2, y])
{
    targets[targetCount].X = x - 2;
    targets[targetCount].Y = y;
    targetCount++;
}
if (x + 2 < width - 1 && !map[x + 2, y])
{
    targets[targetCount].X = x + 2;
    targets[targetCount].Y = y;
    targetCount++;
}
if (y - 2 > 0 && !map[x, y - 2])
{
    targets[targetCount].X = x;
    targets[targetCount].Y = y - 2;
    targetCount++;
}
if (y + 2 < height - 1 && !map[x, y + 2])
{
    targets[targetCount].X = x;
    targets[targetCount].Y = y + 2;
    targetCount++;
}

After:

MazePoint cur = stackje.Peek();
x = cur.X;
y = cur.Y;

int xLeft = x - 2;
int xRight = x + 2;
int yUp = y - 2;
int yDown = y + 2;

int targetCount = 0;
if (xLeft > 0 && !map[xLeft, y])
{
    targets[targetCount].X = xLeft;
    targets[targetCount].Y = y;
    targetCount++;
}
if (xRight < width - 1 && !map[xRight, y])
{
    targets[targetCount].X = xRight;
    targets[targetCount].Y = y;
    targetCount++;
}
if (yUp > 0 && !map[x, yUp])
{
    targets[targetCount].X = x;
    targets[targetCount].Y = yUp;
    targetCount++;
}
if (yDown < height - 1 && !map[x, yDown])
{
    targets[targetCount].X = x;
    targets[targetCount].Y = yDown;
    targetCount++;
}

Anyone has any idea why this doesn't improve generation times?

@redknightlois
Copy link

redknightlois commented Jun 11, 2019

Because you are using 4 extra register, probably the JIT is doing a poorer allocation of registers and end up paying 2 times for the stack store and retrieve. Post both assembler listings and I can tell you for sure.

@devedse
Copy link
Owner Author

devedse commented Jun 11, 2019

@redknightlois , wow that's some fast response time ^^. Thanks. If you do have some time, just above here I also posted some additional questions to your first reply.

Let me see if I can get the assembler code.

@redknightlois
Copy link

What exactly do you mean by this? Every time I add a target I need to increase that targetCount to put the next one in the next slot. How would you propose this? (Or am I missing what you mean 😄)

You can get away without the targetCount at all. This is coupled with another of the comments. Always store the Left on index 0, Right on index 1, Up on index 2 and so on.

If you now use a byte array, you can check for 0 in an unsafe way just reading it as a long and comparing with 0.

@redknightlois
Copy link

redknightlois commented Jun 11, 2019

Try changing:

    targets[targetCount].X = xLeft;
    targets[targetCount].Y = y;

into

   ref t = targets[targetCount];
   t.X = xLeft
   t.Y = y

There shouldnt be any difference, but I am not entirely sure if the JIT of 2.0 is able to reuse the t memory location without using the ref.

EDIT: You can also do the same with the map statements... twice per get if you need to keep register use down (because you hit the threshold).

@redknightlois
Copy link

redknightlois commented Jun 11, 2019

If you need that to be fast, instead of having NoAction and HasAction bake-in the action itself in a struct and pass it by ref so you know who to call afterwards and inline the hell out of it ;). In that way you avoid the delegate invoke, I showed an example on the talk at DotNext I think.

This is doing exactly this:

    public readonly struct SomeObjectCallback : IAction
    {
        private readonly SomeObject _myObjectRef;

        public SomeObjectCallback(SomeObject objectRef)
        {
            this._myObjectRef = objectRef;
        }

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public void Invoke(int step, int total, long x, long y)
        {
            this._myuObjectRef.SomeMethodCallback(step, total, x, y);
        }
    }

@devedse
Copy link
Owner Author

devedse commented Jun 11, 2019

I'll wait with my further response untill I've checked out all your advise, I did however create a SharpLab entry if you would want to check out the assembler code emitted:
Link

Again, your help is really helpfull, I'd love to learn more about this :)

@redknightlois
Copy link

redknightlois commented Jun 11, 2019

About the ref, if you change the code you will see this:

With usage of ref

    L0174: lea rax, [r15+0x10]
    L0178: mov [rax], ecx
    L017a: mov [rax+0x4], r13d
    L017e: mov dword [rsp+0x54], 0x1

without ref:

    L01dc: movsxd r8, eax
    L01df: mov [r15+r8*8+0x10], ecx
    L01e4: movsxd rcx, eax
    L01e7: mov [r15+rcx*8+0x14], r13d
    L01ec: inc eax
    L01ee: mov [rsp+0x54], eax

See the difference? Pretty difficult to try to get all that information from SharpLab mainly because there is no marker on what part of the code is on each assembler line... hunting for the proper line is hard. This uses far less physical resources which if you saw the talk Scratched Metal is key to high-performance tight loops.

Yet again, that would depend on the targetted JIT and you have biggest offenders, but you can help older JITs with the ref...

@devedse
Copy link
Owner Author

devedse commented Jun 12, 2019

Thanks for the elaborate explanation.

What exactly do you mean by this? Every time I add a target I need to increase that targetCount to put the next one in the next slot. How would you propose this? (Or am I missing what you mean 😄)

You can get away without the targetCount at all. This is coupled with another of the comments. Always store the Left on index 0, Right on index 1, Up on index 2 and so on.

If you now use a byte array, you can check for 0 in an unsafe way just reading it as a long and comparing with 0.

The reason I need the targetCount is because I need to randomly choose one of the possible valid directions:

MazePoint target = targets[random.Next(targetCount)];

If I don't know which directions are valid ones, I also can't randomly choose a valid direction.

The only thing I can think of is predetermining all valid directions with 4 bools and then simply coding everything out. E.g.:

if (leftValid && topValid && bottomValid && rightValid)
{
    var randomDirection = rnd.Next(4);
    if (randomDirection == 0)
    {
        stackje.Push(new Target(x - 2, y));
    }
    ....
}
else if (leftValid && topValid && !bottomValid && !rightValid)
{
    var randomDirection = rnd.Next(2);
    if (randomDirection == 0)
    {
        stackje.Push(new Target(x - 2, y));
    }
    ....
}
else if (.....)
.....

However this would become a humoungous loop because you'd have to do it for all random scenario's and for all different scenario's where Left and Top are valid for example, but Bottom and Right are not. Because then the rnd.Next(4) should be rnd.Next(2) instead.

Try changing:

    targets[targetCount].X = xLeft;
    targets[targetCount].Y = y;

into

   ref t = targets[targetCount];
   t.X = xLeft
   t.Y = y

There shouldnt be any difference, but I am not entirely sure if the JIT of 2.0 is able to reuse the t memory location without using the ref.

EDIT: You can also do the same with the map statements... twice per get if you need to keep register use down (because you hit the threshold).

I'll definitely look into implementing this. Map statements will probably be harder because the map is implemented using a BitArray.

If you need that to be fast, instead of having NoAction and HasAction bake-in the action itself in a struct and pass it by ref so you know who to call afterwards and inline the hell out of it ;). In that way you avoid the delegate invoke, I showed an example on the talk at DotNext I think.

This is doing exactly this:
....

Thanks for the advise. In my case I don't really need the performance when an action is provided though. But if I do for some other scenario I'd definitely implement it like this.

Thanks a lot for the help.

@devedse
Copy link
Owner Author

devedse commented Jun 13, 2019

This is currently the fastest I managed to get:
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack2.cs
Takes about 3.6 seconds

I did try to rewrite the whole algorithm to do less if checks:
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack3.cs
For some reason this code seems to be a tiny bit slower then AlgorithmBacktrack2 though. Probably because the compiled method is now becoming huge? (If I disable inlining for that SetMap method it actually seems to become a tiny bit faster. But still slower then AlgorithmBacktrack2
Takes about 3.7 seconds

@redknightlois
Copy link

switch statements are killing you the cost of those is high for tigh loops.

This is a quite clever use of signaling interfaces, good job there ;)
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack3.cs#L34

@devedse
Copy link
Owner Author

devedse commented Jun 14, 2019

switch statements are killing you the cost of those is high for tigh loops.

This is a quite clever use of signaling interfaces, good job there ;)
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack3.cs#L34

Ah I thought switch statements would be better. If I would change this to if checks, would that be better?

I could either write out all cases in an else if or write them with some bit wise checks. E.g. doing a a bitwise check if Left is valid, and if so, then check if also up is valid, etc.etc.

Which one would you suggest?

@redknightlois
Copy link

First remove the exceptions, creating exceptions (even if they are not in a try-catch) is a good way to disrupt the code layout, unless you hit one of the optimizations by the JIT for cold code. You will have to look at the assembler generated to know for sure if the code is being moved away.

Mhhh, definitely not switch, but I would reword the whole process instead... the amount of repeated code you are writing (even if cleverly encapsulated on the function) is giving you a hint that the flow is potentially at fault here.

Do you have an example I can run with a known (deterministic) output that I can check in order to avoid making mistakes, I want to try a few things myself which are probably too convoluted to explain from zero.

@devedse
Copy link
Owner Author

devedse commented Jun 15, 2019

@redknightlois , I'll try to see what removing the exceptions does for performance, thanks for the tip.

In regards to trying yourself, you can simply clone the application and run the included ConsoleApplication. This piece of code will be executed with a deterministic seed:

        public static void ActualBenchmark2()
        {
            int size = 16384;

            var alg = new AlgorithmBacktrack3();

            Console.WriteLine($"Generating mazes using {alg.GetType().Name}...");

            int seed = 1337;
            while (true)
            {
                var w = Stopwatch.StartNew();

                var innerMapFactory = new InnerMapFactory<BitArreintjeFastInnerMap>(size, size);
                var randomFactory = new RandomFactory<XorShiftRandom>(seed);

                var actionThing = new NoAction();

                var maze = alg.GoGenerate2(innerMapFactory, randomFactory, actionThing);
                w.Stop();
                Console.WriteLine($"Generation time: {w.Elapsed}");
                seed++;



                //var result = MazeVerifier.IsPerfectMaze(maze);
                //Console.WriteLine($"Perfect maze verification time: {w.Elapsed}");
                //Console.WriteLine($"Is our maze perfect?: {result}");
            }
        }

@devedse
Copy link
Owner Author

devedse commented Jun 15, 2019

@redknightlois , Removing the Exceptions didn't really seem to do much. However rewriting everything as an if else seemed to improve the speed somewhat:
https://github.com/devedse/DeveMazeGeneratorCore/blob/master/src/DeveMazeGenerator/Generators/AlgorithmBacktrack4.cs

@redknightlois
Copy link

Still the problem there is a too long method with lots of register spilling.

@devedse
Copy link
Owner Author

devedse commented Jun 16, 2019

Yeah I understand. V2 had the shorter method but uses more checks instead so that's what's causing slowness there.

@devedse devedse mentioned this issue Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants