Skip to content

SIMD vectorization of Array.sum<int>, etc #18509

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

Thorium
Copy link
Contributor

@Thorium Thorium commented Apr 26, 2025

Description

Specific overloads (float, float32, int, int64) of Seq.sum, Seq.average, Array.sum and Array.average to take advantage of vectorization in System.Linq.Enumerable module.

This is potentially a naive first try to solve #16230 by the spirit of @T-Gro comment #16230 (comment)

Checklist

  • Test cases added
  • Performance benchmarks added in case of performance changes
  • Release notes entry updated:

    Please make sure to add an entry with short succinct description of the change as well as link to this pull request to the respective release notes file, if applicable.

…ge, Array.sum and Array.average to take advantage of vectorization in System.Linq.Enumerable module.
Copy link
Contributor

github-actions bot commented Apr 26, 2025

❗ Release notes required


✅ Found changes and release notes in following paths:

Change path Release notes path Description
src/FSharp.Core docs/release-notes/.FSharp.Core/9.0.300.md

Comment on lines 1591 to 1605
[<CompiledName("Sum")>]
let inline sumFloat (array: float array) : float =
System.Linq.Enumerable.Sum array

[<CompiledName("Sum")>]
let inline sumFloat32 (array: float32 array) : float32 =
System.Linq.Enumerable.Sum array

[<CompiledName("Sum")>]
let inline sumInt (array: int array) : int =
System.Linq.Enumerable.Sum array

[<CompiledName("Sum")>]
let inline sumInt64 (array: int64 array) : int64 =
System.Linq.Enumerable.Sum array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that this would be a reasonable place to use static optimization syntax to specify which types should delegate to the LINQ method and which to the existing code, e.g.,

let inline sum (array: ^T array) : ^T =
    existingSumCode array
    when ^T : int   = System.Linq.Enumerable.Sum array
    when ^T : int64 = System.Linq.Enumerable.Sum array
    …

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I expect "static optimization conditionals" are a compile-time thing and not runtime? Because I can't check easily with sharplab.io, it says "error FS0817: Static optimization conditionals are only for use within the F# library"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the appropriate implementation would be chosen at compile-time once the type parameter was resolved.

You could add some IL tests under https://github.com/dotnet/fsharp/tree/main/tests/FSharp.Compiler.ComponentTests/EmittedIL if you wanted.

@T-Gro
Copy link
Member

T-Gro commented Apr 26, 2025

It looks like the error case is now different for average over an empty collection.
If it is the same exception type, and only a different message, we could justify the break.

(we should not evaluate a seq before calling into Average, since this would be breaking in a different way)

@Thorium
Copy link
Contributor Author

Thorium commented Apr 26, 2025

I did some initial tests to see if this makes sense at all or not. I used the current benchmarks\CompiledCodeBenchmarks\MicroPerf\MicroPerf.fsproj with adding following tests:

    [<Benchmark>]
    member x.ArraySum() = // Array sum
        array
        |> Array.sum 
        |> ignore

    [<Benchmark>]
    member x.ArrayAverage() = // There has an extra map, because average needs float
        array
        |> Array.map float
        |> Array.average 
        |> ignore

    [<Benchmark>]
    member x.ArraySeqSum() = // Seq sum by using array as base data
        array
        |> Seq.sum 
        |> ignore

    [<Benchmark>]
    member x.ListSeqSum() = // Seq sum by using list as base data
        list
        |> Seq.sum 
        |> ignore

And here are the results with current main:


BenchmarkDotNet v0.13.10, Windows 11 (10.0.26100.3915)
13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores
.NET SDK 9.0.203
  [Host]     : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2 DEBUG
  Job-ZVNJYC : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
  Job-NFLFBL : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2
  Job-ABSOZP : .NET 9.0.4 (9.0.425.16305), X64 RyuJIT AVX2

Server=True  

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 206.4 ns 1.13 ns 1.06 ns 1.00 0.00 - - NA
ArraySum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 205.9 ns 0.50 ns 0.39 ns 1.00 0.01 - - NA
ArraySum Job-ABSOZP Default 2 2 1 1000 207.8 ns 9.28 ns 1.44 ns 1.01 0.01 - - NA
ArrayAverage Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,538.6 ns 35.62 ns 105.04 ns 1.00 0.00 0.1736 8024 B 1.00
ArrayAverage Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,471.2 ns 29.23 ns 53.45 ns 0.94 0.07 0.1736 8024 B 1.00
ArrayAverage Job-ABSOZP Default 2 2 1 1000 1,495.0 ns 759.33 ns 117.51 ns 0.92 0.07 0.1736 8024 B 1.00
ArraySeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 500.6 ns 3.69 ns 3.46 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 496.4 ns 1.80 ns 1.40 ns 0.99 0.01 - 32 B 1.00
ArraySeqSum Job-ABSOZP Default 2 2 1 1000 498.9 ns 34.99 ns 5.41 ns 1.00 0.01 - 32 B 1.00
ListSeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,234.6 ns 3.12 ns 2.91 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,243.7 ns 19.42 ns 17.21 ns 1.01 0.01 - 40 B 1.00
ListSeqSum Job-ABSOZP Default 2 2 1 1000 1,251.6 ns 55.34 ns 8.56 ns 1.01 0.01 0.0019 40 B 1.00
ArraySum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,949.4 ns 8.32 ns 6.95 ns 1.00 0.00 - - NA
ArraySum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,956.5 ns 24.32 ns 22.75 ns 1.00 0.01 - - NA
ArraySum Job-ABSOZP Default 2 2 1 10000 1,966.3 ns 128.42 ns 19.87 ns 1.01 0.01 - - NA
ArrayAverage Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,892.3 ns 415.41 ns 1,211.78 ns 1.00 0.00 1.8616 80024 B 1.00
ArrayAverage Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 15,129.6 ns 242.87 ns 227.18 ns 1.09 0.13 1.9379 80024 B 1.00
ArrayAverage Job-ABSOZP Default 2 2 1 10000 12,991.8 ns 8,355.23 ns 1,292.98 ns 1.00 0.26 1.8616 80024 B 1.00
ArraySeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,889.1 ns 48.55 ns 43.04 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,828.3 ns 45.71 ns 38.17 ns 0.99 0.01 - 32 B 1.00
ArraySeqSum Job-ABSOZP Default 2 2 1 10000 4,833.0 ns 122.61 ns 18.97 ns 0.98 0.00 - 32 B 1.00
ListSeqSum Job-ZVNJYC /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,042.3 ns 280.58 ns 483.98 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-NFLFBL /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,767.9 ns 240.03 ns 420.40 ns 0.98 0.04 - 40 B 1.00
ListSeqSum Job-ABSOZP Default 2 2 1 10000 13,537.7 ns 4,046.83 ns 626.25 ns 0.95 0.07 - 40 B 1.00

Here are the results with this PR:

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 60.66 ns 0.489 ns 0.434 ns 60.59 ns 1.00 0.00 - - NA
ArraySum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 207.89 ns 3.431 ns 3.209 ns 206.96 ns 3.43 0.05 - - NA
ArraySum Job-BMQAWT Default 2 2 1 1000 60.18 ns 3.505 ns 0.542 ns 60.19 ns 0.99 0.02 - - NA
ArrayAverage Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,693.99 ns 29.205 ns 28.683 ns 1,694.44 ns 1.00 0.00 0.1736 8024 B 1.00
ArrayAverage Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,443.74 ns 47.701 ns 140.649 ns 1,446.84 ns 0.87 0.04 0.1736 8024 B 1.00
ArrayAverage Job-BMQAWT Default 2 2 1 1000 1,486.78 ns 226.790 ns 35.096 ns 1,493.10 ns 0.89 0.02 0.1736 8024 B 1.00
ArraySeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 60.23 ns 0.426 ns 0.378 ns 60.11 ns 1.00 0.00 - - NA
ArraySeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 497.31 ns 4.195 ns 3.719 ns 496.69 ns 8.26 0.09 - 32 B NA
ArraySeqSum Job-BMQAWT Default 2 2 1 1000 61.30 ns 1.343 ns 0.208 ns 61.31 ns 1.02 0.00 - - NA
ListSeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,154.49 ns 9.687 ns 9.061 ns 1,155.70 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,228.62 ns 5.630 ns 4.395 ns 1,228.27 ns 1.06 0.01 - 40 B 1.00
ListSeqSum Job-BMQAWT Default 2 2 1 1000 1,164.41 ns 66.398 ns 10.275 ns 1,159.92 ns 1.01 0.02 0.0019 40 B 1.00
ArraySum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 546.81 ns 1.656 ns 1.549 ns 547.20 ns 1.00 0.00 - - NA
ArraySum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,961.44 ns 25.704 ns 24.044 ns 1,961.63 ns 3.59 0.05 - - NA
ArraySum Job-BMQAWT Default 2 2 1 10000 541.97 ns 33.331 ns 5.158 ns 541.34 ns 0.99 0.01 - - NA
ArrayAverage Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 16,012.49 ns 316.673 ns 721.226 ns 16,238.87 ns 1.00 0.00 1.8463 80024 B 1.00
ArrayAverage Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,651.54 ns 310.627 ns 865.903 ns 13,843.05 ns 0.84 0.08 1.8768 80024 B 1.00
ArrayAverage Job-BMQAWT Default 2 2 1 10000 17,271.21 ns 3,620.263 ns 560.240 ns 17,365.04 ns 1.10 0.04 1.8921 80024 B 1.00
ArraySeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 554.40 ns 3.815 ns 3.569 ns 554.73 ns 1.00 0.00 - - NA
ArraySeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,813.90 ns 39.532 ns 33.011 ns 4,801.26 ns 8.69 0.09 - 32 B NA
ArraySeqSum Job-BMQAWT Default 2 2 1 10000 551.93 ns 12.280 ns 1.900 ns 552.36 ns 1.00 0.00 - - NA
ListSeqSum Job-WQFBJH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,523.24 ns 768.122 ns 2,240.648 ns 13,893.55 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZOCDZ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 15,181.03 ns 458.748 ns 1,338.189 ns 14,985.06 ns 1.07 0.18 - 40 B 1.00
ListSeqSum Job-BMQAWT Default 2 2 1 10000 13,985.62 ns 10,680.237 ns 1,652.779 ns 13,530.39 ns 1.00 0.17 - 40 B 1.00

Edit: I used Net 9 but the FSharp.Core.dll netstandard2.0 version in both, which was probably a mistake because Spans are truely efficient on netstandard2.1 only (?)

@Thorium
Copy link
Contributor Author

Thorium commented Apr 27, 2025

The IL for Enumerable is very complicated, and is heavily relying on the JIT to simplify it. In Release mode, I observe Enumerable to take half the time; in Debug mode Enumerable takes about 4x the time.

Hmm, I'm thinking if default FSI is compiled with debug mode and people use F# as a scripting language, then performance optimizations like this could be marked as #if !DEBUG. However, it would be even weirder if an error message were different between debug and release.

I did run the same performance tests with FSharp.Core.dll netstandard2.1 version

Main branch:

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 207.2 ns 2.34 ns 2.19 ns 1.00 0.00 - - NA
ArraySum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 204.2 ns 3.47 ns 2.90 ns 0.99 0.02 - - NA
ArraySum Job-EYKYAG Default 2 2 1 1000 206.8 ns 22.17 ns 3.43 ns 1.01 0.02 - - NA
ArraySeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 490.6 ns 2.15 ns 1.80 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 486.8 ns 2.41 ns 2.13 ns 0.99 0.01 - 32 B 1.00
ArraySeqSum Job-EYKYAG Default 2 2 1 1000 488.8 ns 74.19 ns 11.48 ns 1.00 0.02 - 32 B 1.00
ListSeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,214.5 ns 2.83 ns 2.51 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,241.1 ns 19.15 ns 17.91 ns 1.02 0.01 - 40 B 1.00
ListSeqSum Job-EYKYAG Default 2 2 1 1000 1,216.6 ns 34.26 ns 5.30 ns 1.00 0.00 0.0019 40 B 1.00
ArraySum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,905.3 ns 8.41 ns 7.02 ns 1.00 0.00 - - NA
ArraySum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,924.9 ns 12.14 ns 11.36 ns 1.01 0.01 - - NA
ArraySum Job-EYKYAG Default 2 2 1 10000 1,933.3 ns 95.32 ns 14.75 ns 1.01 0.01 - - NA
ArraySeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,710.8 ns 18.93 ns 15.80 ns 1.00 0.00 - 32 B 1.00
ArraySeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,760.8 ns 17.52 ns 15.53 ns 1.01 0.00 - 32 B 1.00
ArraySeqSum Job-EYKYAG Default 2 2 1 10000 4,776.5 ns 345.62 ns 53.49 ns 1.01 0.01 - 32 B 1.00
ListSeqSum Job-WCJSKA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,616.1 ns 355.10 ns 1,030.22 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-ZZTGPQ /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,809.1 ns 270.69 ns 300.87 ns 0.94 0.08 - 40 B 1.00
ListSeqSum Job-EYKYAG Default 2 2 1 10000 15,396.3 ns 3,747.35 ns 579.91 ns 1.05 0.14 - 40 B 1.00

This PR

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 1000 59.19 ns 0.550 ns 0.514 ns 59.01 ns 1.00 0.00 - - NA
ArraySum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 206.10 ns 0.720 ns 0.674 ns 206.08 ns 3.48 0.03 - - NA
ArraySum Job-SEOIAS Default 2 2 1 1000 59.65 ns 4.451 ns 0.689 ns 59.64 ns 1.01 0.02 - - NA
ArraySeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 1000 59.88 ns 0.326 ns 0.305 ns 59.84 ns 1.00 0.00 - - NA
ArraySeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 496.87 ns 6.676 ns 6.245 ns 493.92 ns 8.30 0.10 - 32 B NA
ArraySeqSum Job-SEOIAS Default 2 2 1 1000 60.61 ns 3.319 ns 0.514 ns 60.63 ns 1.01 0.01 - - NA
ListSeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,163.60 ns 18.969 ns 17.743 ns 1,162.85 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,230.86 ns 6.546 ns 6.123 ns 1,233.38 ns 1.06 0.02 - 40 B 1.00
ListSeqSum Job-SEOIAS Default 2 2 1 1000 1,156.84 ns 66.957 ns 10.362 ns 1,153.00 ns 0.99 0.02 0.0019 40 B 1.00
ArraySum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 10000 553.52 ns 6.542 ns 6.119 ns 554.30 ns 1.00 0.00 - - NA
ArraySum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,944.70 ns 16.488 ns 15.423 ns 1,944.27 ns 3.51 0.05 - - NA
ArraySum Job-SEOIAS Default 2 2 1 10000 547.22 ns 53.119 ns 8.220 ns 543.16 ns 1.00 0.02 - - NA
ArraySeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 10000 549.69 ns 2.934 ns 2.600 ns 549.51 ns 1.00 0.00 - - NA
ArraySeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,843.34 ns 64.337 ns 60.181 ns 4,826.96 ns 8.82 0.12 - 32 B NA
ArraySeqSum Job-SEOIAS Default 2 2 1 10000 551.21 ns 42.723 ns 6.611 ns 550.96 ns 1.00 0.02 - - NA
ListSeqSum Job-YRWBVV /p:BUILDING_USING_DOTNET=true Default Default Default 10000 13,349.78 ns 349.754 ns 992.195 ns 12,981.18 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-YQEOCP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,501.57 ns 311.609 ns 894.065 ns 14,248.72 ns 1.09 0.10 - 40 B 1.00
ListSeqSum Job-SEOIAS Default 2 2 1 10000 12,715.10 ns 817.355 ns 126.487 ns 12,726.11 ns 0.99 0.03 - 40 B 1.00

By quick look of this, the Sum seems to be improved but the Average not really. Should I revert the Average change?

Edit: Average removed

@T-Gro
Copy link
Member

T-Gro commented Apr 28, 2025

1/As per your benchmark, the cost for average might be dominated by doing the casts and additional array allocations.
Better try without the mapping step for fully prove that.

2/The idea of using if !DEBUG - if this is placed in the added code, it would react on conditional defines of when we build FSharp.Core (.e.g when packing together a new .NET SDK), not based on user code. The shipped FSharp.Core is always RELEASE, even if user code is in DEBUG.

@vzarytovskii
Copy link
Member

What are the numbers when running on the full clr? 4.6-4.8? Enumerable probably has different implementation there.

@vzarytovskii
Copy link
Member

Average will now throw different exception, so it's a breaking change at this point.

@T-Gro
Copy link
Member

T-Gro commented Apr 28, 2025

Average will now throw different exception, so it's a breaking change at this point.

@Thorium:
This does not mean it is a showstopper.
If the change is good from all other perspective, such break can be accepted and documented (especially if it will be e.g. just a subtype from the same exception type).

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

Better try without the mapping step for fully prove that.

I did check this, and the results were worse for Enumerable.Sum
Main - current

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio RatioSD Allocated Alloc Ratio
ArrayAverage Job-XSEKSP /p:BUILDING_USING_DOTNET=true Default Default Default 1000 291.8 ns 5.86 ns 8.21 ns 1.00 0.00 - NA
ArrayAverage Job-BKIONA /p:BUILDING_USING_DOTNET=true Default Default Default 1000 276.6 ns 1.25 ns 1.11 ns 0.94 0.02 - NA
ArrayAverage Job-LYJOOZ Default 2 2 1 1000 277.2 ns 4.73 ns 0.73 ns 0.93 0.02 - NA
ArrayAverage Job-XSEKSP /p:BUILDING_USING_DOTNET=true Default Default Default 10000 2,669.5 ns 12.65 ns 11.21 ns 1.00 0.00 - NA
ArrayAverage Job-BKIONA /p:BUILDING_USING_DOTNET=true Default Default Default 10000 2,661.2 ns 4.14 ns 3.87 ns 1.00 0.00 - NA
ArrayAverage Job-LYJOOZ Default 2 2 1 10000 2,677.6 ns 139.82 ns 21.64 ns 1.00 0.01 - NA
------------- ----------- ------------------------------ --------------- ------------ ------------ ------- -----------: ----------: ---------: ------: --------: ----------: ------------:

This PR - with Enumerable.Sum

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Ratio Allocated Alloc Ratio
ArrayAverage Job-GSXKPB /p:BUILDING_USING_DOTNET=true Default Default Default 1000 503.6 ns 9.32 ns 8.71 ns 1.00 - NA
ArrayAverage Job-LFPVKH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 275.4 ns 0.45 ns 0.37 ns 0.55 - NA
ArrayAverage Job-SPCSFF Default 2 2 1 1000 494.2 ns 12.91 ns 2.00 ns 0.98 - NA
ArrayAverage Job-GSXKPB /p:BUILDING_USING_DOTNET=true Default Default Default 10000 5,277.4 ns 8.47 ns 7.07 ns 1.00 - NA
ArrayAverage Job-LFPVKH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 2,661.1 ns 4.92 ns 4.11 ns 0.50 - NA
ArrayAverage Job-SPCSFF Default 2 2 1 10000 5,280.3 ns 83.03 ns 12.85 ns 1.00 - NA

…nchmarks\CompiledCodeBenchmarks\MicroPerf\MicroPerf.fsproj
@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

Because this is just a forwarding, behaviour difference is easy to explore already in FSI:

[| 1.; Double.NaN |] |> Array.sum;;
//val it: float = nan
[| 1.; Double.NaN |] |> System.Linq.Enumerable.Sum;;
//val it: float = nan

Seems to be the same with edge-cases like Double.MaxValue and infinity.

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

What are the numbers when running on the full clr? 4.6-4.8? Enumerable probably has different implementation there.

I don't know how to easily run MicroPerf.fsproj in .NET, but I can run
array |> Array.sum vs array |> System.Linq.Enumerable.Sum in .NET 4.8.

Benchmarks.LinqEnumerableSum: .NET Framework 4.8(Runtime=.NET Framework 4.8) [Length=10000]
Runtime = .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256; GC = Concurrent Workstation
Mean = 29.934 us, StdErr = 0.069 us (0.23%), N = 15, StdDev = 0.266 us
Min = 29.483 us, Q1 = 29.784 us, Median = 29.985 us, Q3 = 30.094 us, Max = 30.387 us
IQR = 0.310 us, LowerFence = 29.320 us, UpperFence = 30.558 us
ConfidenceInterval = [29.649 us; 30.219 us] (CI 99.9%), Margin = 0.285 us (0.95% of Mean)
Skewness = -0.15, Kurtosis = 1.96, MValue = 2
-------------------- Histogram --------------------
[29.341 us ; 30.529 us) | @@@@@@@@@@@@@@@
---------------------------------------------------

// * Summary *

BenchmarkDotNet v0.13.12, Windows 11 (10.0.26100.3915)
13th Gen Intel Core i9-13900H, 1 CPU, 20 logical and 14 physical cores
[Host] : .NET Framework 4.8.1 (4.8.9310.0), X64 LegacyJIT VectorSize=256 DEBUG
.NET Framework 4.8 : .NET Framework 4.8.1 (4.8.9310.0), X64 RyuJIT VectorSize=256

Job=.NET Framework 4.8 Runtime=.NET Framework 4.8

Method Length Mean Error StdDev Ratio RatioSD Gen0 Allocated Alloc Ratio
FSharpArraySum 1000 456.7 ns 8.08 ns 7.56 ns 1.00 0.00 - - NA
LinqEnumerableSum 1000 3,033.3 ns 51.61 ns 48.28 ns 6.64 0.07 0.0038 32 B NA
FSharpArraySum 10000 4,480.0 ns 55.90 ns 52.29 ns 1.00 0.00 - - NA
LinqEnumerableSum 10000 29,933.7 ns 284.86 ns 266.46 ns 6.68 0.10 - 32 B NA

Not exactly the results I was hoping for.
Any ideas?

@T-Gro
Copy link
Member

T-Gro commented Apr 28, 2025

Framework-dependent runtime switch would solve this (and since it is static, I would hope JIT would eliminate such branching), but is not something we are doing in FSharp.Core.

On Desktop:

  • Only forward Seq calls to Enumerable, if anything (or maybe just keep as it is )

On Core:

  • Forward Seq.sum and Array.sum. Exclude average due to reasons above.

It would be great if we could make use of https://learn.microsoft.com/en-us/dotnet/api/system.numerics.tensors.tensorprimitives.sum?view=net-9.0-pp#system-numerics-tensors-tensorprimitives-sum-1(system-readonlyspan((-0))) for some of those implementations. Right now, any such support depending on Tensors will have to be done as a separate lib outside of FSharp.Core .

@vzarytovskii
Copy link
Member

Not exactly the results I was hoping for.
Any ideas?

Yeah, that's what I suspected will happen :( Bcl in full framework doesn't have those functions vectorised.

And the issue with fslib is that it ships as netstandard, so there's no way to tailor separate implementations for netfx and netcore.

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

The IEnumerable.Sum of NET 9 checks the support for vectors via Vector<T>.IsSupported and Vector.IsHardwareAccelerated.
But I assume there is not even a type of System.Numerics.Vector<T> in .NET Framework 4.8. / netstandard2.0.
Would there be a fast way to check something like Type.GetType("System.Numerics.Vector") <> null to check in static optimization conditionals?

On Desktop:

What do you mean by "On Desktop" and "On Core" ?

I can see that in FSharp.Core.fsproj there is <TargetFrameworks Condition="'$(Configuration)' != 'Proto'">netstandard2.0;netstandard2.1</TargetFrameworks> so there would be possible to do:
<DefineConstants Condition=" '$(TargetFramework)' == 'netstandard2.1' >$(DefineConstants);NETSTANDARD21</DefineConstants>
And then use that, however, this would add one more parallel code-path for long-term maintenance.

@vzarytovskii
Copy link
Member

vzarytovskii commented Apr 28, 2025

The IEnumerable.Sum of NET 9 checks the support for vectors via Vector<T>.IsSupported and Vector.IsHardwareAccelerated.

But I assume there is not even a type of System.Numerics.Vector<T> in .NET Framework 4.8. / netstandard2.0.

Would there be a fast way to check something like Type.GetType("System.Numerics.Vector") <> null to check in static optimization conditionals?

Reflection can likely neglect many optimizations (needs proving tho).

On Desktop:

What do you mean by "On Desktop" and "On Core" ?

Desktop is usually referred when full framework is used, and core is when coreclr runtime and bcl are used.

I can see that in FSharp.Core.fsproj there is <TargetFrameworks Condition="'$(Configuration)' != 'Proto'">netstandard2.0;netstandard2.1</TargetFrameworks> so there would be possible to do:

<DefineConstants Condition=" '$(TargetFramework)' == 'netstandard2.1' >$(DefineConstants);NETSTANDARD21</DefineConstants>

And then use that, however, this would add one more parallel code-path for long-term maintenance.

These should already be defined. However they will be defined on both net48 and net9.0

@T-Gro
Copy link
Member

T-Gro commented Apr 28, 2025

I meant a statically stored flag based on something like:

open System.Runtime.InteropServices
let runtimeDescription = RuntimeInformation.FrameworkDescription
let hasLinqAcceleration = runtimeDescription > "..."  // string-based logic here

And switching at runtime (so cannot be part of statically optimized when clauses in FSharp.Core, since those worked with inlining at compile-time.

e.g. top level function:
For unsupported types (anything but numerics), it would immediately call the existing code.
For statically optimized numeric primitives, it would then do the decision based on this static flag, and pick between existing code and Enumerable.Sum.

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

And switching at runtime

Runtime check added. Compared to this
#18509 (comment)
...this PR Array.sum still beats the current main over 50%:

This PR:

Method Job Arguments IterationCount LaunchCount WarmupCount Length Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
ArraySum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 1000 71.20 ns 0.395 ns 0.330 ns 71.28 ns 1.00 0.00 - - NA
ArraySum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 203.16 ns 1.536 ns 1.437 ns 202.73 ns 2.85 0.03 - - NA
ArraySum Job-RIJJDM Default 2 2 1 1000 69.23 ns 3.499 ns 0.541 ns 69.03 ns 0.97 0.00 - - NA
ArraySeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 1000 72.44 ns 1.457 ns 1.842 ns 71.38 ns 1.00 0.00 - - NA
ArraySeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 497.44 ns 6.606 ns 5.856 ns 495.07 ns 6.79 0.22 - 32 B NA
ArraySeqSum Job-RIJJDM Default 2 2 1 1000 70.06 ns 2.019 ns 0.312 ns 70.09 ns 0.94 0.01 - - NA
ListSeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,143.31 ns 7.123 ns 6.663 ns 1,140.53 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 1000 1,227.64 ns 13.709 ns 12.823 ns 1,222.18 ns 1.07 0.01 - 40 B 1.00
ListSeqSum Job-RIJJDM Default 2 2 1 1000 1,153.34 ns 41.507 ns 6.423 ns 1,155.69 ns 1.00 0.01 0.0019 40 B 1.00
ArraySum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 10000 559.23 ns 6.984 ns 6.191 ns 559.52 ns 1.00 0.00 - - NA
ArraySum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 1,939.43 ns 19.240 ns 17.997 ns 1,932.29 ns 3.47 0.04 - - NA
ArraySum Job-RIJJDM Default 2 2 1 10000 550.02 ns 50.072 ns 7.749 ns 549.63 ns 0.99 0.01 - - NA
ArraySeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 10000 568.95 ns 10.940 ns 13.836 ns 564.94 ns 1.00 0.00 - - NA
ArraySeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 4,784.67 ns 60.194 ns 56.306 ns 4,776.55 ns 8.41 0.23 - 32 B NA
ArraySeqSum Job-RIJJDM Default 2 2 1 10000 561.27 ns 85.016 ns 13.156 ns 557.93 ns 0.96 0.03 - - NA
ListSeqSum Job-HPWEPX /p:BUILDING_USING_DOTNET=true Default Default Default 10000 14,808.61 ns 704.739 ns 2,077.937 ns 14,333.68 ns 1.00 0.00 - 40 B 1.00
ListSeqSum Job-FURVPH /p:BUILDING_USING_DOTNET=true Default Default Default 10000 16,225.89 ns 609.042 ns 1,795.773 ns 16,355.29 ns 1.11 0.18 - 40 B 1.00
ListSeqSum Job-RIJJDM Default 2 2 1 10000 15,371.93 ns 14,341.671 ns 2,219.390 ns 15,512.85 ns 1.00 0.23 - 40 B 1.00

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

After Sum is solved, I guess it's time for SumBy to do the same, because SumBy is the more commonly used in F#. There is System.Linq.Enumerable.Sum overload that takes a tuple of IEnumerable and a projection.

let inline sum (array: ^T array) : ^T =
classicSum array
when ^T : float =
if System.Runtime.InteropServices.RuntimeInformation.FrameworkDescription.StartsWith ".NET Framework" then classicSum array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should work.
Can you store it in a static boolean flag somewhere, is it possible?

@@ -1578,8 +1578,7 @@ module Array =
checkNonNull "array" array
Microsoft.FSharp.Primitives.Basics.Array.permute indexMap array

[<CompiledName("Sum")>]
let inline sum (array: ^T array) : ^T =
let inline private classicSum (array: ^T array) : ^T =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would call it sumImpl to indicate it is the sum implemented here. Or maybe fsharpSumImpl.
Would rather avoid the term classic

@T-Gro
Copy link
Member

T-Gro commented Apr 28, 2025

After Sum is solved, I guess it's time for SumBy to do the same, because SumBy is the more commonly used in F#. There is System.Linq.Enumerable.Sum overload that takes a tuple of IEnumerable and a projection.

Is it vectorized as well?
Hard to imagine how - since with a projection, the data to sum by is not stored next to each other.

Is it doing copies first? I am curious.

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

Oh, you are right, it isn't. That's sad, makes this very marginal feature especially when tools like FSharpLint prefer sumBy over (map >> sum)

@Thorium
Copy link
Contributor Author

Thorium commented Apr 28, 2025

However they will be defined on both net48 and net9.0

.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check.
I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:

  • .NET Framework Full (4.62-...) with .NET Standard 2.0
  • dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)

@vzarytovskii
Copy link
Member

vzarytovskii commented Apr 28, 2025

However they will be defined on both net48 and net9.0

.NET Framework is not .NET Standard 2.1 compatible, thus .NET Standard 2.1 version shouldn't need runtime check.

I expect the "supported" frameworks (aka. where things should be as fast as possible) are only:

  • .NET Framework Full (4.62-...) with .NET Standard 2.0

  • dotnet 8/9 (both .NET Standard 2.0 and .NET Standard 2,1 should work here)

Will all netstandard2.1 implementations be vectorized? If yes, then it's safe enough (and also it should be defined already).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: New
Development

Successfully merging this pull request may close these issues.

4 participants