Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB Error #1060

Open
sherinei opened this issue Dec 7, 2024 · 12 comments
@sherinei

sherinei commented Dec 7, 2024

Hello, we are getting the following error when trying to use nl.add:

`Running correctness test for conv2d kernel with larger images...[GCA035] Instruction: I-21-0 with opcode: TensorTensor couldn't be allocated in SB
Memory Location Accessed:
res.48_i0: 888 Bytes per Partition and total of: 113664 Bytes in SB
position_out_i0: 98568 Bytes per Partition and total of: 12616704 Bytes in SB
position_out_i0: 98568 Bytes per Partition and total of: 12616704 Bytes in SB
Total Accessed Bytes per partition by instruction: 198024
Total SB Partition Size: 196608

  • Please open a support ticket at https://github.com/aws-neuron/aws-neuron-sdk/issues/new. You may also be able to obtain more information using the 'XLA_IR_DEBUG' and 'XLA_HLO_DEBUG' environment variables.
    Traceback (most recent call last):
    File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 196, in
    test_result = test_correctness_conv2d_kernel(conv2d, use_larger_images=True)
    File "/home/ubuntu/asst4-trainium/part2/test_harness.py", line 85, in test_correctness_conv2d_kernel
    out = kernel(*args, **kwargs)
    File "neuronxcc/nki/compile.py", line 92, in neuronxcc.nki.compile.GenericKernel.call
    File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 174, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.Kernel.call
    File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 422, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
    File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 425, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel.post_process_call
    File "neuronxcc/starfish/penguin/targets/nki/TraceKernel.py", line 508, in neuronxcc.starfish.penguin.targets.nki.TraceKernel.BaremetalKernel._compile
    RuntimeError: Compilation failed for fused_conv2d_maxpool with error Command '['neuronx-cc', 'compile', '--framework', 'XLA', 'penguin.py', '--internal-tensorizer-opt-level=nki', '--pipeline', 'compile', 'SaveTemps', '--target', 'trn1', '--disable-internal-io-dge', '--output=file.neff']' returned non-zero exit status 70.`

We're not sure what's causing this error. Any help would be appreciated. Thanks.

@aws-serina-tan

aws-serina-tan commented Dec 7, 2024

Each SBUF partition in NeuronCore-v2 has only 192 KiB (196,608 bytes) of physical memory, which matches the "Total SB Partition Size" in your log. When a TensorTensor instruction (triggered by a call to nl.add) executes, all of its input and output tensors must fit in SBUF at the same time. More info on SBUF: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/nki/trainium_inferentia2_arch.html#trainium-inferentia2-arch.

Here, you will need to reduce the tile size of your nl.add() calls. Use loops to iterate over different chunks of your original tensor.
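The chunked loop suggested here can be sketched in plain Python. This only illustrates the loop structure and is not NKI code: the lists stand in for the free dimension of one partition, and `chunk_ranges`, `chunked_add`, and `tile_width` are hypothetical names. In a real kernel, each chunk would be the tile that is loaded into SBUF and passed to nl.add.

```python
def chunk_ranges(total, tile_width):
    """Yield (start, end) index pairs that cover range(total) tile by tile."""
    for start in range(0, total, tile_width):
        yield start, min(start + tile_width, total)


def chunked_add(a, b, tile_width):
    """Elementwise a + b, processed one tile at a time.

    Only tile_width elements of each operand are touched per iteration,
    which is the property that bounds per-partition memory use.
    """
    assert len(a) == len(b)
    out = [0.0] * len(a)
    for start, end in chunk_ranges(len(a), tile_width):
        # In an NKI kernel, this slice would be the tile given to nl.add.
        out[start:end] = [x + y for x, y in zip(a[start:end], b[start:end])]
    return out


a = [float(i) for i in range(10)]
b = [1.0] * 10
result = chunked_add(a, b, tile_width=4)
```

The last chunk is allowed to be shorter than `tile_width`, so the total length does not need to be a multiple of the tile size.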

@sherinei
Author

sherinei commented Dec 7, 2024 via email

@AWSNB
Contributor

AWSNB commented Dec 7, 2024

@sherinei could you share your latest code via gist or github with @aws-serina-tan @AWSNB @aws-zhehongb @JonathanHenson @aws-qieqingy @EmilyWebber

Also, could you share the code you are using to add in place? Are you using a = a+b, a+=b, a[...] = a+b, or a[...] += b?

@sherinei
Author

sherinei commented Dec 7, 2024 via email

@sherinei
Author

sherinei commented Dec 7, 2024 via email

@AWSNB
Contributor

AWSNB commented Dec 7, 2024

@sherinei couple of other comments on the code:

Line 185: nl.add(row_out, bias_i) ==> you are not assigning the result of the add to any destination.
This should be: c = nl.add(a,b)

Lines 184-192: try adding the bias after copying the matmul result to SBUF, and use nl.add instead of +=.

                             
    row_out[...] = nl.matmul(w[:, in_i*c_in_pmax:in_i*c_in_pmax + c_in_pmax, i, j], x_row)
    # nl.add(row_out, bias_i) -- moved below so the add happens after the data is in sbuf
    # copy per-row output from psum into an sbuf tile
    row_out_sbuf = nl.ndarray(shape=row_out.shape, dtype=row_out.dtype, buffer=nl.sbuf)
    row_out_sbuf[...] = nl.copy(row_out, dtype=row_out.dtype)  # from psum to sbuf
    row_out_sbuf[...] = nl.add(row_out_sbuf, bias_i)  # sbuf-to-sbuf add
    # print(row_out_sbuf.shape, bias_i.shape)
    po_start_index = h * out_width
    po_end_index = po_start_index + out_width
    # changed from += to nl.add (could also be written as a[...] = a + b)
    position_out[:, po_start_index:po_end_index] = nl.add(position_out[:, po_start_index:po_end_index], row_out_sbuf)

@sherinei
Author

sherinei commented Dec 7, 2024 via email

@AWSNB
Contributor

AWSNB commented Dec 7, 2024 via email

@sherinei
Author

sherinei commented Dec 7, 2024 via email

@AWSNB
Contributor

AWSNB commented Dec 7, 2024

you are right on (2) and these specific dimensions are indeed broadcastable so the code is good

re (1), see hongbin's comments about indices inside loops in case that helps

@sherinei
Author

sherinei commented Dec 7, 2024

I don't think those comments help with our issue. position_out[:, po_start_index:po_end_index] += row_out_sbuf runs fine, but both position_out[:, po_start_index:po_end_index] = position_out[:, po_start_index:po_end_index] + row_out_sbuf and position_out[:, po_start_index:po_end_index] = nl.add(position_out[:, po_start_index:po_end_index], row_out_sbuf) return errors.

position_out has shape (c_out_pmax, tiled_out_height * out_width).

@aws-zhehongb

in

                res = nl.add(position_out, bias_i)

For the test "Running correctness test for conv2d kernel with larger images", both res and position_out have shape (128, 24642). For the addition to happen, both position_out and res must be in SBUF at the same time.

For the fp32 datatype, you need 24642 * 4 = 98568 bytes per partition for position_out and another 24642 * 4 = 98568 bytes per partition for res, i.e. 24642 * 4 * 2 = 197136 bytes in total. That exceeds the SBUF capacity of 192 KiB (196608 bytes) per partition, and some of that capacity is additionally reserved as overhead for the RT.

Consider splitting the computation into multiple smaller tiles to work around this problem.
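The arithmetic above can be checked against the numbers in the error log with a few lines of plain Python. The byte counts come from the log and the shape (128, 24642); `max_tile_width`, `num_tiles`, and the choice to spend only half of the partition on this add are assumptions made for illustration, not values from the Neuron documentation.

```python
FLOAT32_BYTES = 4
FREE_DIM = 24642               # elements per partition, from shape (128, 24642)
SBUF_PARTITION_BYTES = 196608  # "Total SB Partition Size" in the log (192 KiB)
SMALL_OPERAND_BYTES = 888      # res.48_i0 in the log

full_width_bytes = FREE_DIM * FLOAT32_BYTES            # one full-width fp32 tensor
two_tensors = 2 * full_width_bytes                     # both operands resident
total_accessed = two_tensors + SMALL_OPERAND_BYTES     # matches the log total


def max_tile_width(budget_bytes, live_tensors, itemsize):
    """Largest free-dimension width so live_tensors tiles fit in budget_bytes."""
    return budget_bytes // (live_tensors * itemsize)


def num_tiles(free_dim, tile_width):
    """Number of tiles needed to cover free_dim (ceiling division)."""
    return -(-free_dim // tile_width)


# Assumption: leave half of the partition free for other tensors and overhead.
budget = SBUF_PARTITION_BYTES // 2
width = max_tile_width(budget, live_tensors=2, itemsize=FLOAT32_BYTES)
tiles = num_tiles(FREE_DIM, width)

print(full_width_bytes, two_tensors, total_accessed, width, tiles)
```

With these inputs the two full-width tensors alone already exceed the partition size by 528 bytes, so even a modest split of the free dimension brings the instruction back under the limit.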
