
Checks for inconsistent device types #272

Merged
6 commits merged into main on Mar 3, 2025

Conversation

jwallwork23
Collaborator

@jwallwork23 jwallwork23 commented Jan 30, 2025

Closes #269.

This PR adds checks that overloaded operators are only applied to tensors that have the same device type.

It's sufficient to add the checks in the assignment operator and the binary operators, as the unary operators go through the assignment anyway.

The PR also adds checks that tensors have been constructed before they are interrogated. There is a related fix in the autograd example: the output tensor Q wasn't set up, so it didn't have a device type yet.
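For illustration, here is a minimal sketch of the kind of guard the overloaded binary operators now perform before delegating to the Torch backend. The routine name, the get_device_type helper and the error handling below are hypothetical; the real FTorch implementation differs in detail.

  !> Hypothetical sketch only: device-type consistency check in an overloaded
  !> addition. Not the actual FTorch source.
  function tensor_add(a, b) result(c)
    type(torch_tensor), intent(in) :: a, b
    type(torch_tensor) :: c

    ! Refuse to combine tensors that live on different device types
    if (a%get_device_type() /= b%get_device_type()) then
      write(*,*) "Error :: cannot add tensors with different device types"
      stop 1
    end if

    ! ... otherwise perform the addition via the Torch backend ...
  end function tensor_add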

Questions:

  • Do we also need to check the device indices are consistent?
    • -> future PR
  • And what about data types?
    • -> future PR
  • Can we add pFUnit unit tests that check these errors are raised as expected?
    • -> Doesn't seem so (see below).

@jwallwork23 jwallwork23 added the labels bug (Something isn't working), autograd (Tasks towards the online training / automatic differentiation feature), testing (Related to FTorch testing) and gpu (Related to building and running on GPU) on Jan 30, 2025
@jwallwork23 jwallwork23 self-assigned this Jan 30, 2025
@jwallwork23 jwallwork23 changed the title from "269 tensor op different device error" to "Checks for inconsistent device types" Jan 30, 2025
@jatkinson1000 jatkinson1000 added this to the Version v1.0 milestone Feb 26, 2025
@jatkinson1000 jatkinson1000 force-pushed the 269_tensor-op-different-device-error branch from 50125ff to f02e334 on February 27, 2025 11:10
@jatkinson1000 jatkinson1000 marked this pull request as ready for review February 27, 2025 11:10
Member

@jatkinson1000 jatkinson1000 left a comment


Hey @jwallwork23 I just rebased this off of current main and then modified an example to check that, if we try to add two tensors with different device types, we get an error raised as expected.

I used the below:

! Assumed context from the rest of the example: the ftorch module is used
! (including assignment(=) and operator(+)), wp is the working-precision kind
! and tensor_layout is the usual integer layout array.
real(wp), dimension(5), target :: a, b, combine_data
type(torch_tensor) :: tensor_a, tensor_b, combine_tensor

a = [0.0_wp, 1.0_wp, 2.0_wp, 3.0_wp, 4.0_wp]
b = [0.0_wp, 1.0_wp, 2.0_wp, 3.0_wp, 4.0_wp]

! Construct one tensor on the CPU and one on the MPS (Apple GPU) backend
call torch_tensor_from_array(tensor_a, a, tensor_layout, torch_kCPU)
call torch_tensor_from_array(tensor_b, b, tensor_layout, torch_kMPS)

! Adding tensors on different device types should now raise an error
combine_tensor = tensor_a + tensor_b
call torch_tensor_print(combine_tensor)

When tensors a and b are both on the CPU this passes fine, but in the above state (b on MPS) we get:

 Error :: cannot add tensors with different device types
Note: The following floating-point exceptions are signalling: IEEE_OVERFLOW_FLAG
STOP 1

Since we looked at this together in person, I am happy to approve this now.

For posterity, we also discussed that this error message could be made friendlier (parent routine, line number, etc.), but this is not a simple task and not to be done here.

@jwallwork23
Collaborator Author

* Can we add pFUnit unit tests that check these errors are raised as expected?

I don't think pFUnit currently supports this: as far as I can tell, it can only mark certain tests as expected to fail, not interrogate error messages.

@jwallwork23 jwallwork23 force-pushed the 269_tensor-op-different-device-error branch from f02e334 to 4cfa4ab on February 27, 2025 11:49
@jwallwork23
Collaborator Author

[Rebased on top of main, fixing various conflicts]

* Drop torch_tensor_to_<array/blob>
* Remove torch_tensor_to_array from autograd example
* Use std::move for assignment
* Update constructors/destructors tests to avoid to_array
* Update summary of autograd example
* Update overloads tests to avoid to_array
* Fix previous API change note
@jwallwork23 jwallwork23 merged commit 6deab9a into main Mar 3, 2025
5 of 6 checks passed
@jwallwork23 jwallwork23 deleted the 269_tensor-op-different-device-error branch March 3, 2025 18:13
@jatkinson1000
Member

jatkinson1000 commented Mar 3, 2025

I think I hadn't re-approved this because of the issues I had seen when trying to print a tensor.
See where I was investigating on the mix_devices_temp branch.

Did we get to the bottom of that? I know it was mentioned in the meeting this morning.

Looking at the failing simplenet example that I uploaded, I wonder if it was due to not importing the overloaded =.
There are also a few things that have changed with the addition of #303, so I will try to re-check this at some point to be certain.

@jatkinson1000
Member

OK, got nerd sniped.

The segfault was resolved by ensuring use ftorch, only : assignment(=) was present.

So that is good news. The bad news, I guess, is that this can be a tricky bug/error to spot and resolve, as there isn't much to go on from the FTorch output.
Not sure there is an easy fix for this, however, but it is definitely something to be aware of and to make very clear to users when writing the docs @jwallwork23 .
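For reference, a minimal sketch of the kind of only clause that avoids the problem. The exact import list depends on the example; the point is that assignment(=) (and, presumably, any operators used, such as operator(+)) must be listed explicitly when restricting imports:

  ! Illustrative only: when using an only clause, the overloaded tensor
  ! assignment must be imported explicitly, otherwise assigning the result
  ! of a tensor expression can segfault with little diagnostic output.
  use ftorch, only : torch_tensor, torch_tensor_from_array, torch_tensor_print, &
                     torch_kCPU, assignment(=), operator(+)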

@jwallwork23
Copy link
Collaborator Author

The segfault was resolved by ensuring use ftorch, only : assignment(=) was present.

So that is good news. The bad news, I guess, is that this can be a tricky bug/error to spot and resolve, as there isn't much to go on from the FTorch output. Not sure there is an easy fix for this, however, but it is definitely something to be aware of and to make very clear to users when writing the docs @jwallwork23 .

Ah yes! Good work tracking it down. Opened #305 to address this.
