
Update the local attention mask logic to work on MPS and CUDA in ModernBERT #561


Closed
wants to merge 5 commits

Conversation

kozistr
Contributor

@kozistr kozistr commented Apr 5, 2025

What does this PR do?

In the previous PR #459, local_attention could only run on CPU because of the abs() operation.

So, I've changed the logic to compute the window mask in pure Rust and then build the Tensor from it. This should let ModernBERT run on both MPS and CUDA as well.
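
For reference, here's a minimal sketch of the approach (the helper name get_window_mask, the shapes, and the dtype are illustrative, not copied from this PR; it only assumes candle's Tensor::from_vec):

```rust
use candle::{Device, Result, Tensor};

// Build the sliding-window mask entirely on the host in plain Rust, then hand
// the finished buffer to candle. No abs()/cmp kernels are needed on MPS/CUDA;
// the device only receives the ready-made tensor.
fn get_window_mask(seq_len: usize, window_size: usize, device: &Device) -> Result<Tensor> {
    let half = (window_size / 2) as i64;
    let mut data = vec![0f32; seq_len * seq_len];
    for i in 0..seq_len {
        // 1.0 where |i - j| <= window_size / 2, 0.0 elsewhere.
        let lo = (i as i64 - half).max(0) as usize;
        let hi = ((i as i64 + half) as usize).min(seq_len - 1);
        for j in lo..=hi {
            data[i * seq_len + j] = 1.0;
        }
    }
    Tensor::from_vec(data, (seq_len, seq_len), device)
}
```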

I've checked the output with this script and it seems identical to before.

tested devices

  • tested on CPU (local machine, WSL2)
  • tested on T4 GPU (Kaggle notebook, TEI server runs successfully)
  • MPS (not tested yet; expected to work)

performance (get_window_mask())

window size | seq len | latency (p50)
----------- | ------- | -------------
64          | 8192    | 20 ms
64          | 4096    | 6 ms
64          | 1024    | 160 µs

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@Narsil, @alvarobartt, @ivarflakstad

@kozistr kozistr changed the title Update the logic for local attetion mask in ModernBERT Update the local attention mask logic to work on MPS and CUDA in ModernBERT Apr 8, 2025
@Narsil
Collaborator

Narsil commented Apr 8, 2025

@kozistr I think everything already works on MPS; it's been checked by @ErikKaum in #562

During the flash fix I forced the mask creation onto CPU in candle (to enable non-flash CUDA execution), and that fixed Metal as well.
I think keeping everything in candle instead of pure Rust makes it slightly nicer, because we could switch to on-device execution simply by creating the missing kernels (which is relatively easy to do).

In any case, having the mask created only in the first part of the model is acceptable IMHO.
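
A rough sketch of what that candle-only variant could look like (function and variable names are illustrative, not taken from the repo; it assumes candle's arange/broadcast_sub/abs/le/to_device ops):

```rust
use candle::{Device, Result, Tensor};

// Compute the sliding-window mask with candle ops on the CPU device (where the
// required kernels exist), then move the finished tensor to the model device.
fn window_mask_on_cpu(seq_len: usize, window_size: usize, target: &Device) -> Result<Tensor> {
    let cpu = Device::Cpu;
    let positions = Tensor::arange(0i64, seq_len as i64, &cpu)?;
    // |i - j| for every pair of positions, computed entirely on CPU.
    let distance = positions
        .reshape((seq_len, 1))?
        .broadcast_sub(&positions.reshape((1, seq_len))?)?
        .abs()?;
    let limit = Tensor::full((window_size / 2) as i64, (seq_len, seq_len), &cpu)?;
    let mask = distance.le(&limit)?; // 1 inside the window, 0 outside
    // Only this upload touches CUDA/MPS; adding the missing on-device kernels
    // later would let the whole computation move off the CPU.
    mask.to_device(target)
}
```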

@Narsil
Collaborator

Narsil commented Apr 8, 2025

That being said, we could definitely clean up that code a little bit IMHO.

The HashMap<bool, XX> could be replaced with a simple struct Mask { local: Tensor, global: Tensor }, which should improve readability (and the overhead of a HashMap is not negligible, even though I doubt it's measurable in this particular instance).
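
Roughly, the suggested struct (field names follow the comment above; this is not actual repo code):

```rust
use candle::Tensor;

/// Attention masks for ModernBERT's alternating local/global attention layers.
struct Mask {
    local: Tensor,
    global: Tensor,
}

impl Mask {
    // Pick the right mask for a layer without a HashMap lookup.
    fn get(&self, uses_local_attention: bool) -> &Tensor {
        if uses_local_attention {
            &self.local
        } else {
            &self.global
        }
    }
}
```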

@kozistr
Contributor Author

kozistr commented Apr 8, 2025

@kozistr I think everything already works on MPS; it's been checked by @ErikKaum in #562

During the flash fix I forced the mask creation onto CPU in candle (to enable non-flash CUDA execution), and that fixed Metal as well. I think keeping everything in candle instead of pure Rust makes it slightly nicer, because we could switch to on-device execution simply by creating the missing kernels (which is relatively easy to do).

In any case, having the mask created only in the first part of the model is acceptable IMHO.

I thought there were still some issues on MPS, but I missed that PR. Thanks for checking on that.
And I agree with your point that keeping everything in Candle would be great!

That being said, we could definitely clean up that code a little bit IMHO.

The HashMap<bool, XX> could be replaced with a simple struct Mask { local: Tensor, global: Tensor }, which should improve readability (and the overhead of a HashMap is not negligible, even though I doubt it's measurable in this particular instance).

I second this. We could refactor the mask handling to use a struct instead of a HashMap to improve readability.

Then, I'm gonna close this PR and make another contribution later!

Thanks for taking the time to review this PR :)

@kozistr kozistr closed this Apr 8, 2025
@kozistr kozistr deleted the refactor/local-attention branch April 8, 2025 09:29
@Narsil
Collaborator

Narsil commented Apr 8, 2025

Thanks a lot to you.
