by Ricardo Fernández Serrata
By using loop unrolling and some kind of SIMD, the processing speed can be extremely high.
Because we're using "high-level SIMD", function calls are less frequent, which improves performance.
If you're 100% sure that the input array doesn't contain `null` you should remove the `|| out = null` check, for more speed. That's there to reduce latency when the array contains a `null` somewhere in the 1st half.
I decided to handle the remainder before the loop, because it uses less blocks and has same performance.
To handle the remainder after the loop, you must create a "fake length" variable, a var that stores the length of array floored to the nearest multiple of 8, by using any of these 3 expressions:
(the 1st is the fastest)
`#A & 0x7ffffff8`
`#A // 8 * 8`
`#A >> 3 << 3`
Generalization for any unrolling degree `e` (this flow uses e := 8):
`n & (0xffffffff << e)`
`floor((n | 0) / pow(2, e)) * pow(2, e)`
`n >>> e << e`
You would then iterate using the fake length, then do `#A - i` to calculate the remainder after the loop has ended.
I didn't include max(), because it's extremely easy to implement, there's no need to handle remainder, because max() ignores `null`
Disclaimer: The speed difference is only noticeable with bigger arrays. The example array is too small