As to [2] I am very skeptical of the value of all the NaNs, -0 and Infs floating...

jcranmer · on Jan 17, 2023

The value of -0 as distinguished from +0 has a few uses. The most obvious one is preserving sign in the case of overflow. A less obvious use case is handling branch cuts. There are uses in a few more cases: I've heard it's occasionally useful in things like coordinate systems, since something like "0°5'3" W" can be stored as (-0.0, 5, 3) after explosion and still display correctly. It's definitely niche, but it does have its uses.

Returning a distinct value that retains the fact that it overflowed is quite useful--if you get that result out of the computation, you know you overflowed the computation rather than silently getting a meaningless result. Note in particular that infinities end up being sticky values: once a value goes infinite, it tends to stay infinity, which isn't true for largest finite values. Distinguishing between various kinds of "invalid" values turns out to be moderately useful in practice--I've used infinities a couple of times in my own code.

NaNs are useful in representing a different kind of error than overflowed computation. Now there is a lot of room to criticize IEEE 754 here: "x != x" was quite frankly a mistake (basically the primary reason for it was the creators wanted to make testing for NaN easier than calling isnan(x)...). sNaNs are of course an abomination that just makes things worse. Multiple NaN payloads were originally intended (in part) to let developers debug the sources of NaNs, but this requires support that never really materialized. However, NaN payloads did find new use in making NaN-boxing a useful technique, and dedicating an entire exponent to special values simplifies several numerical analysis lemmas.
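A minimal Julia sketch of the behaviours described in this comment, assuming Float64 and only Base functions (the variable names are illustrative): overflow yields a sticky infinity, underflow keeps its sign as -0.0, and an invalid operation yields a NaN that is the only value with x != x.

```julia
# Overflow produces a signed infinity, and the infinity is "sticky".
largest = floatmax(Float64)       # largest finite Float64
over = largest * 2                # Inf
still_inf = over - largest        # Inf: subtracting a finite value does not bring it back

# Underflow keeps the sign: the result is -0.0, and 1 / -0.0 recovers -Inf.
tiny_neg = -1e-200 * 1e-200       # -0.0
recovered = 1 / tiny_neg          # -Inf

# An invalid operation gives NaN, which is the only value with x != x.
bad = Inf - Inf                   # NaN
println((over, still_inf, tiny_neg, recovered, bad, bad != bad, isnan(bad)))
```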

adgjlsfhk1 · on Jan 17, 2023

> handling branch cuts

I agree this sounds great in theory, but I don't think it works very well in practice. i.e. what about 1/(x+1)? Also branch cuts matter most for complex arithmetic, and there +-0 doesn't help since you don't know the phase of the zero. Also, realistically, floating point has finite precision so there are very few non-toy examples where you can do an actual computation and reliably end up on the correct branch. I'd rather have all the real numbers represented before we start adding hyper-reals to the number system.

> Returning a distinct value that retains the fact that it overflowed is quite useful

Agreed, and I think that NaR in Posits does a good job of that while not taking a ridiculous number of values.

jcranmer · on Jan 17, 2023

> I agree this sounds great in theory, but I don't think it works very well in practice.

I've actually done it once in practice myself. I forget the exact details, though. As I said, it is a niche use case, but it's useful to have when you are in that niche.

SideQuark · on Jan 17, 2023

>NaN breaks x==x which is a pretty fundamental relationship for numbers to have

NaN is not a number, so it should NOT satisfy "fundamental relationships for numbers to have".

>+-Inf sound useful in theory, but in practice they rarely give you a more useful resu

There are algorithms that are more performant using infs, and without having a way to denote overflow, you'd have to pre-check every operation to do serious numerical work, which basically cuts your performance in half.

>Once you've gotten rid of -Inf, it becomes clear that -0.0 is a mistake

>It breaks the identity 0+x==x and 0-x==-x.

No, you have some fundamental misunderstanding. IEEE explicitly guarantees these hold, even for -0.

> Furthermore, IEEE specifies sqrt(-0.0)==-0.0 and log(-0.0)==-Inf which are both nonsensical if you consider -0.0 as a limit from the negatives.

You're making up strawmen. -0 is not a "limit from the negatives" any more than +0 is a limit from the positives, which would break other made up requirements. That is why making up stuff that has zero bearing on what IEEE 754 specifies is arguing strawmen.

>Floats also have the unfortunate property that inv(x) can be infinite for finite x.

Integers have the same property: -(X) can not be the negative of X. So this is not a problem except in made up goofiness.

Every objection you post is a lack of understanding numerical analysis and the needs of actual scientific software.

So you're skeptical- do you write numerical software professionally? I do, and have, and will do it in the future. There are very, very good reasons for all of those pieces you don't see the need for.

There's a reason unums have not caught on with the field of numerical software or numerical analysis - they simply don't allow writing robust, performant software, they solve no real issues, and add significant problems.

adgjlsfhk1 · on Jan 17, 2023

>so it should NOT satisfy "fundamental relationships for numbers to have".

If you have a list with a NaN in it, how should you make sort terminate (and where should the NaN end up)? I understand that in theory it is kind of arguable that NaN should be different, but breaking the total order is a really dumb decision.

>you'd have to pre-check every operation to do serious numerical work, which basically cuts your performance in half.

Can you give an example? Saturating overflow tends to do the same thing.

>IEEE explicitly guarantees these hold

This is kind of true. -0.0+0.0==0.0 and 0.0-0.0==0.0. IEEE does define -0.0==0.0 so IEEE does technically make this hold, but only by redefining == so that two different numbers are ==
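A small Julia sketch of the distinction being argued here, assuming Float64 and default rounding: the identities hold under the floating-point `==`, while the bit patterns of the two zeros differ. `===` and `reinterpret` are used only to expose the bits.

```julia
x = -0.0
s = 0.0 + x                      # the sum is +0.0 under round-to-nearest

@show s == x                     # true: == treats the two zeros as equal
@show s === x                    # false: === on floats compares bit patterns
@show (reinterpret(UInt64, s), reinterpret(UInt64, x))
@show 0.0 - x == -x              # true
```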

> -0 is not a "limit from the negatives"

Then what is it? It's not a real number, and Kahan's justification of them comes from branch cuts of analytic functions, which only makes sense in the context of limits: https://homes.cs.washington.edu/~ztatlock/599z-17sp/papers/b...

> Integers have the same property

Yeah, and it sucks there too. In the fp case it makes it really annoying to do things like divide an array by a float quickly and accurately. You would want to take the inverse of the divisor and multiply by that, but doing so isn't safe if the divisor is subnormal.
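A minimal Julia sketch of that hazard, assuming Float64. The divisor `1e-310` is an illustrative choice: it is subnormal and small enough that its reciprocal is not representable. Subnormals closer to `floatmin(Float64)` still have a finite inverse.

```julia
d = 1e-310                 # subnormal: below floatmin(Float64)
xs = [d, 2d, 3d]           # small multiples of d, all exactly representable

direct = xs ./ d           # element-by-element division
r = inv(d)                 # 1e310 is not representable, so this overflows to Inf
viainv = xs .* r           # every element becomes Inf

@show issubnormal(d)
@show direct
@show viainv
```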

Yes. My day job is in solving Differential Algebraic equations, but I also have written a bunch of Julia's Libm.

SideQuark · on Jan 17, 2023

>If you have a list with a NaN in it, how should you make sort terminate

Do whatever you want. If you're sorting floats, sort them to the front. Every language I've ever used for developing numerical software has a trivial IsNaN equivalent. So that's not a complaint worthy of claiming NaNs are not useful. I've written lots of numerical software and not once has this been an issue for me.
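For reference, a short Julia sketch of both approaches, using only Base functions. Note that Julia's default `sort` uses `isless`, which places NaN at the end rather than the front; the input vector is illustrative.

```julia
v = [3.0, NaN, -0.0, 1.0, 0.0]

# The comparison operators are not a total order once NaN is present:
@show (NaN < 1.0, 1.0 < NaN, NaN == NaN)   # all false

# sort uses isless by default, which is a total order on floats:
# -0.0 sorts before 0.0 and NaN sorts after everything else.
sorted = sort(v)
@show sorted

# The "handle NaN yourself" approach: filter with isnan first.
cleaned = sort(filter(!isnan, v))
@show cleaned
```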

What value do you assign sqrt of a negative without some NaN type item? Or any of tons of other "not a number" results?

>so IEEE does technically make this hold, but only by redefining ==

There's no "redefining ==" here. You are upset that bit patterns are different, but == is not for bit patterns. You are confusing == for floats with == for bit patterns, which are not and need not be the same thing. I've never seen a language that gets these confused. If you want float ==, simply use language ==. If you want bitwise ==, then you usually have to do (often not portable) fiddling to convert to a bit pattern. It's like claiming reference == and structure field == should be the same, but both have uses. So languages have all sorts of ways to use the concept of equality, and they are all useful. Confusing them does not make the ones you don't like invalid or not extremely useful for people that do understand and use them.

>Yes. My day job is in solving Differential Algebraic equations, but I also have written a bunch of Julia's Libm.

Good. Then you should understand why, as an example, the C++ std lib has a large number of functions like fma, expm1, log1p, hypot, and many more. Sure you can simply write log(1+x) instead of using log1p, but log1p is vastly better in this case because properties of IEEE 754 allow more precision. Instead of hypot(x,y) you could write sqrt(x*x+y*y), but hypot is much better. These functions exist since IEEE provides tools to analyze these and make much better versions than the naive way to write them. Unums, with varying precision, make this vastly harder (and losing precision over the domain, making it hard to analyze anything).
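A quick Julia illustration of the two examples named above, assuming Float64. The inputs are chosen to make the naive forms fail outright: one loses the argument to rounding, the other overflows in the intermediate squares.

```julia
x = 1e-20
naive_log = log(1 + x)        # 1 + x rounds to 1.0, so the result is 0.0
good_log  = log1p(x)          # keeps the tiny argument, result is about 1e-20

a = b = 1e200
naive_hyp = sqrt(a*a + b*b)   # a*a overflows, so the result is Inf
good_hyp  = hypot(a, b)       # about 1.414e200, computed without overflow

@show naive_log good_log
@show naive_hyp good_hyp
```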

So unums, with varying precision, violate fundamental properties for scientific computing, namely, they lose precision in really messy ways. You cannot start with P digits of precision and do even simple math and get an answer with P digits of precision. IEEE does allow this.

For example, sqrt(x^2)=|x| in IEEE (for no under/overflow). This does not work in unums, since they lose precision. Square something and lose digits. Fundamental to lots of scientific computation is the requirement to maintain precision throughout a calculation. Unums fail this spectacularly, making it incredibly messy to do correct scientific work.

adgjlsfhk1 · on Jan 17, 2023

The posit standard has a NaR value that does everything I wish NaN and Inf did in IEEE. It is the result of 1/0, sqrt(-1), etc. There is only one of them, it compares equal with itself, and it is defined as less than all other posits. Real numbers have a total order, so it's silly that floating point doesn't. Furthermore, the posit ordering operations (bitwise) are the same as the signed integer ones, which makes your processor simpler and makes it easier to do things like write radix sorts for floats.

> You are confusing == for floats with == for bit patterns

The problem is that == for floats doesn't behave like an equality operation. x==x doesn't hold (reflexivity) and x==y => f(x)==f(y) doesn't hold. These are the two most important parts of what equality means.
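Both failures can be reproduced in a few lines of Julia, assuming Float64. The function `f` is an illustrative choice; `isequal` is Julia's separate reflexive comparison, shown here for contrast.

```julia
f(x) = 1 / x

# Substitution fails: the zeros compare equal but f separates them.
@show 0.0 == -0.0            # true
@show f(0.0) == f(-0.0)      # false: Inf versus -Inf

# Reflexivity fails for NaN under ==.
n = NaN
@show n == n                 # false

# isequal is reflexive and distinguishes the zeros.
@show isequal(n, n)          # true
@show isequal(0.0, -0.0)     # false
```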

To take your example of sqrt(x*x), for Float16, of the 65k values, 34k give exact answers (counting NaNs as exact otherwise subtract 2k), 16k overflow and 5k underflow. There are also 9k inexact answers of which 6k are within 2 ULPs, and the others are further off (since x*x loses precision due to subnormals). So in other words you get exact answers 1/2 of the time and close answers 60% of the time. With Posit16, you get 47k exact answers, and 18k inexact answers. How inexact are these inexact answers? 15k are within 2 ULP and only 2.9k aren't. (Of the 2.9k that aren't, Float16 would have overflowed or underflowed in all but 278 of the cases and these 278 cases are all accurate to less than 4 ULPs).

Posits do lose the ability to do error free transforms, but IMO for 32 bit and smaller math, this isn't a major loss as if you want more accuracy you can use more bits and it will usually be faster than the error free transform.

adgjlsfhk1 · on Jan 17, 2023

I've done a similar experiment with log1p(expm1(x)) and for that Float16 has 35k exact, 26k overflow, 1.3k within 4 ULP and 3k with more than 4 ULP error. Posits for comparison are 38k exact, 19k within 4 ULP, and 8k more than 4 ULP.

SideQuark · on Jan 18, 2023

>To take your example of sqrt(x*x),

Yes, for small floats posits do ok, but they fail for other sizes. For example, here's float32 vs posit32 for 100,000 random values in ranges 1e2,1e4,..,1e18.

Posit32 fails on (respectively) 21%, 71%, 91%, 97%, 99%, 99.9%, 99.97%, 99.99% of the cases. Float32 fails on 0 of them. Julia code at the bottom.

Posit even fails on simple integer multiplication so often that you'd be terribly pressed to know ahead of time when it happens. For example, take integers 1 to 40 for i and j, multiply as posit16 and as float16, and see how they do. Posit fails 1.75% of the values, float fails none.

This is simple multiplication of numbers well within range. The same problem happens in posit32,64,anysize, but not for the same sized floats.
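The float16 half of that experiment can be checked with Julia's built-in `Float16`; this is a sketch of the float side only, since the posit side needs a posit library that the post does not name. Products are widened to Float64 before comparing so the check is against the exact integer.

```julia
# Count products i*j (1 <= i, j <= 40) that Float16 does not compute exactly.
failures = count(Float64(Float16(i) * Float16(j)) != i * j for i in 1:40, j in 1:40)
@show failures

# Float16 represents every integer up to this bound exactly.
@show maxintfloat(Float16)

# The 27 * 39 case mentioned in this thread.
@show Float16(27) * Float16(39)
```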

>These are The two most important parts of what equality means.

As a PhD in math, this is not what equality means. You'll find nothing like that here for example https://en.wikipedia.org/wiki/Equality_(mathematics)

And if you're worried about equality, you might notice that in posit16, 27*39 gives 1052 instead of 1053, which is real (in)equality. You worry so much about made up concerns that you miss the crazy bad results scattered throughout posits.

Posits of all sizes make errors when multiplying by powers of two that floats do not make (due to their inability to keep digits). For example, in posit8, 2.0 * 1.03125 returns 2.0, 10 * 2 returns 16, and examples this bad can be found for any size posit.

To see this, take 1e6 random values in 0-100, mult by 2, then divide by 2, and see how many made it round trip. All float16 values do. 4% of posit16 values do not round trip. These are small numbers - the entire computation stays in the range 0-200, and this is even the base of the underlying number. Posit32 has the same failure rate for the same reason: posits lose precision even under small multiplications.
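A sketch of the float16 side of this round-trip test in Julia. As an assumption for illustration, it enumerates every Float16 in the range exhaustively instead of drawing 1e6 random values; the posit16 side is omitted because it needs a posit library.

```julia
# Every non-negative finite Float16, enumerated by bit pattern
# (0x7bff is the bit pattern of the largest finite Float16).
candidates = (reinterpret(Float16, b) for b in 0x0000:0x7bff)
inrange = [x for x in candidates if x <= Float16(100)]

# Multiply by 2, divide by 2, and count values that do not come back unchanged.
lost = count(x -> (x * Float16(2)) / Float16(2) != x, inrange)

@show length(inrange)
@show lost
```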

As a result, posits fail at x-y=0 means x=y, which is also pretty fundamental, is it not?

Want to compute a discriminant sqrt(b*b-4*a*c)? Good luck, nearby values for a,b,c don't give smooth results, and routinely give imaginary numbers when they should be real (due to the above screwiness around powers of 2).

There are so many failure cases, not even at the edge of the ranges, where posits fail and equivalent sized floats don't, that doing any simple computations is error prone.

Here's the Julia code for the sqrt failures. You can do similar error checks for a ton of computations and you'll find posits failing a significant amount of them.

     # count failures of Float32 and Posit32 in Julia
     # for sqrt(x*x) ==?= x
     using Random
     # Assumption: Posit32 is taken from the SoftPosit.jl package;
     # the original post does not say which posit library was used.
     using SoftPosit

     Random.seed!(1234) # make reproducible

     scale = 1.0f0 # multiplied by 100 on each pass of the outer loop
     for s in 1:9 # scales 1e2, 1e4, 1e6, ..., 1e18
         # `global` is needed when this runs as a script rather than in the REPL
         global scale *= 100.0f0

         badF,goodF = 0,0
         badP,goodP = 0,0
         for i in 1:10000
             f = rand()*scale

             # Float32 round trip: square, then take the square root
             f1::Float32 = f
             f2 = f1*f1
             f3 = sqrt(f2)

             @assert typeof(f1) == Float32
             @assert typeof(f2) == Float32
             @assert typeof(f3) == Float32

             # the same round trip in Posit32
             p1 = Posit32(f)
             p2 = p1*p1
             p3 = sqrt(p2)

             @assert typeof(p1) == Posit32
             @assert typeof(p2) == Posit32
             @assert typeof(p3) == Posit32

             if f1 != f3
                 badF+=1
             else
                 goodF+=1
             end

             if p1 != p3
                 badP+=1
             else
                 goodP+=1
             end
         end

         println("Scaling: $(scale)")
         println("float: $(goodF) good, $(badF) bad, $(100*badF/(goodF+badF)) % failed")
         println("posit: $(goodP) good, $(badP) bad, $(100*badP/(goodP+badP)) % failed")
     end

Dylan16807 · on Jan 19, 2023

> Posit32 fails on (respectively) 21%, 71%, 91%, 97%, 99%, 99.9%, 99.97%, 99.99% of the cases. Float32 fails on 0 of them. Julia code at the bottom.

A lot of those are only losing a bit or two of precision, though, and many of them are happening in a region where posit32 has more bits of precision than float32 to start with.

I'm not very fond of having only 2 exponent bits on the standard posit32. It makes the region with significant precision loss a lot bigger. But if you give a posit exactly two fewer exponent bits than a float, it shouldn't do worse than that float by more than one bit anywhere. I think that would be a better apples-to-apples comparison. Standard posit32 is tuned much more toward having bonus precision near 1.0, and other tests would show it beating float32 for many tasks.

> Posit even fails on simple integer multiplication so often that you'd be terribly pressed to know ahead of time when it happens. For example, take integers 1 to 40 for i and j, multiply as posit16 and as float16, and see how they do. Posit fails 1.75% of the values, float fails none.

Posit16 starts failing to represent odd numbers at 1024. Float16 starts failing to represent odd numbers at 2048.

Not ideal but not a big issue.

1052 vs. 1053 is definitely not a "crazy bad problem"!

And on the other hand, consider multiplying numbers from 1 to 400. Around the high end of the results, posit16 will store numbers below 65k with 8 bits of mantissa, and numbers above 65k with 7 bits of mantissa. Float16 will store numbers below 65k with 10 bits of mantissa, and everything above 65k becomes infinity.
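The float16 side of this comparison is easy to confirm with Julia's built-in `Float16`; the operands below are illustrative choices, and the posit16 side is not shown because it needs a posit library.

```julia
@show floatmax(Float16)                # largest finite Float16 (65504)

below = Float16(250) * Float16(250)    # 62500 is rounded to a nearby representable value
above = Float16(400) * Float16(400)    # 160000 exceeds the range and becomes Inf

@show below
@show above
```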

> Posits of all sizes make errors when multiplying by powers of two that floats do not make (due to their inability to keep digits). For example, in posit8, 2.0 * 1.03125 returns 2.0, 10 * 2 returns 16, and examples this bad can be found for any size posit.

What's your chosen IEEE competitor?

The example on wikipedia has 3 bits of mantissa. It also says 2.0 * 33/32 == 2.0

And while that format can represent 20, it only has a third of the dynamic range. Tradeoffs. No 8 bit format is going to be good.

What kind of float would pass the 1.03125 test, anyway? You'd need 5 bits of mantissa to represent that. In an 8 bit float, that means you have a... 2 bit exponent? No, that would be ridiculous, it would have to be a 3 bit exponent with no sign bit? That would imply the smallest normal value is 1/4 and the largest value is 15.75

If we face that against an 8 bit unsigned posit with es=1, then it would have 5 bits of mantissa in [1/4, 4), 4 bits of mantissa in [4, 16), and 3 bits of mantissa in [16, 64).

That means this posit would be able to represent both 2.0 * 1.03125 and 20, and it would have four times as much dynamic range.

I wasn't expecting such a blowout until I started doing the math. Wow, congrats posit.

> To see this, take 1e6 random values in 0-100, mult by 2, then divide by 2, and see how many made it round trip. All float16 values do. 4% of posit16 values do not round trip. These are small numbers - the entire computation stays in the range 0-200, and this is even the base of the underlying number. Posit32 has the same failure rate for the same reason: posits lose precision even under small multiplications.

Yes, they lose one bit of precision if you cross certain thresholds. That is a cost to keep in mind. But look at which values fail to round-trip. It should mostly be numbers between 8 and 16, I think? Those numbers started with 11 bits of mantissa, and they were reduced to 10. Float16 is always 10. This is not a problem. For posit32 it's even more stark. Those numbers start with 27 bits of mantissa, and get reduced to 26 bits. Float32 is always 23.

> As a result, posits fail at x-y=0 means x=y, which is also pretty fundamental, is it not?

I don't think that would happen. Do you have an example number?

What can happen is that k*x - k*y=0 even though x!=y. With normal floating point that can't happen when k is a power of 2, but it can happen with lots of values of k. 4/3 * 3 - 4/3 * 3.0000000000000004 = 0, with normal floating point. In fact every third pair of adjacent floats incorrectly returns 0.
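The 4/3 example can be reproduced directly in Julia with Float64; `nextfloat(3.0)` is the 3.0000000000000004 from the comment, and the power-of-two case is included for contrast.

```julia
x, y = 3.0, nextfloat(3.0)     # y is the next Float64 after 3.0
k = 4 / 3

@show x != y                   # true: the inputs differ
@show k * x - k * y            # 0.0: both products round to 4.0
@show 2 * x - 2 * y            # nonzero: scaling by a power of two is exact here
```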

SideQuark · on Jan 19, 2023

>A lot of those are only losing a bit or two of precision, though, and many of them are happening in a region where posit32 has more bits of precision than float32 to start with.

Most aren't in those regions. Posits only have more precision for small numbers compared to most of those in this test. Numbers in these ranges are common in computing - write a mapper, or any physics sim, or a CAD tool, or a video game, or nearly anything, and you'll soon find numbers needed over many orders of (decimal) magnitude. Posits simply don't handle any of these cases well.

And losing "only a bit or two" when doing two operations leads to massive loss in long calculations with even less stable operations. That these lose so much immediately is absolutely terrible for things that do actual numerical work. Things like BLAS, which underlie huge amounts of computing, have plenty of papers on analysis of what happens in actual practice, and posits will not handle much at all of it.

>1052 vs. 1053 is definitely not a "crazy bad problem"!

Really? When it shows up in spreadsheets in various forms I expect people will think otherwise. The Intel floating point fiasco and Excel bugs with vastly less error certainly caused major problems.

>And on the other hand, consider multiplying numbers from 1 to 400. Around the high end of the results, posit16 will store numbers below 65k with 8 bits of mantissa, and numbers above 65k with 7 bits of mantissa. Float16 will store numbers below 65k with 10 bits of mantissa, and everything above 65k becomes infinity.

Both overflow - float16 to inf (denotes overflow, correct), posit16 to NaR (not a real, incorrect - the result is a real number). float16 also then says 0 < inf, correct. Posit16 says NaR < 0, which is not correct.

>Standard posit32 is tuned much more toward having bonus precision near 1.0,

Congrats if you only have calcs that stay near 1.0. If that is your use case, then a format tuned specifically to that case will outperform posits (like lots of things being tried in ML).

>What kind of float would pass the 1.03125 test, anyway?

All of the IEEE formats - multiplying or dividing by two never loses precision until overflow or underflow since it's merely incrementing/decrementing the exponent. Posits, with their changing precision, lose significance over their entire range. A fundamental rule in scientific computing is to keep the same precision - if you know something with D digits (or bits) of accuracy, you should keep that precision, otherwise you simply get bad answers. This is taught in elementary schools if I recall.

Also why did you ignore the 10*2 = 16 posit case? And these happen for all size posits for reasonable ranges, but none of the IEEE formats. The values I gave were for examples. Run your own checks and you'll find them all over the map for any size posit.

>> As a result, posits fail at x-y=0 means x=y, which is also pretty fundamental, is it not?

>I don't think that would happen. Do you have an example number?

It's a fundamental property of IEEE 754 that the difference (or sum) of two representable values is itself representable in the same IEEE 754 format (and it's also true for mult and div, all of which allows multiprecision like double-double to exist). These properties are fundamental in proving theorems and are used by compilers to optimize code. It's also a theorem for posits that none of these fundamental properties hold, with the conclusion that there exist values violating it.

Examples where the error in addition is not representable: a = Posit16(0x0001) = 2^-114; a+a should be 2^-113, which is not representable as posit16; a+a as posit16 gives 2^-112, with an error from the correct value of 2^-113, which is not representable. For Posit32 this happens for example at Posit32(0x0...03). I gave plenty of examples above from which you can compute that posit errors are not posit representable, making plenty of algorithms impossible without significantly more computation.

IEEE 754 was ratified in 1985 and had already been incorporated into hardware from major vendors (using pre-draft). bfloat was invented around 2018 and has significant major hardware vendor support (Google, Intel (even in Xeon processors), FPGAs, ARMv8.6, AMD ROCm, and NVIDIA CUDA among others). Unums were invented in 2015, and what major hardware support do they have? None? There's even plenty of other float formats adopted in microcontrollers and other hardware used in practice. I've seen none that implement posits. Usually when something is really good it gets incorporated into hardware quite rapidly. Posits have not.

Why do you think that is?

Dylan16807 · on Jan 19, 2023

> Most aren't in those regions.

If you're only looking at places where posit is worse than float, and ignoring the places where it's better, you should use a posit with a longer exponent where none of the values lose more than a single bit of precision. (There will be low-precision values, but they will be numbers that are completely unrepresentable in the equivalent float)

> Really? When it shows up in spreadsheets in various forms I expect people will think otherwise. The Intel floating point fiasco and Excel bugs with vastly less error certainly caused major problems.

Surely 2257 vs. 2258 is roughly as bad? But float16 has the exact same problem with those numbers. It's not "crazy bad" that the threshold is in a slightly different spot. 16 bit numbers are not appropriate for spreadsheets no matter what format.

> Both overflow - float16 to inf (denotes overflow, correct), posit16 to NaR (not a real, incorrect - the result is a real number). float16 also then says 0 < inf, correct. Posit16 says NaR < 0, which is not correct.

posit16 most certainly does not overflow on 400x400. It doesn't overflow on 4000x4000 either, or 4 million x 4 million.

And if you treat "infinity" literally it's not correct at all. If you're going to say "denotes overflow, correct" then it's only fair to say the same thing about NaR. Pretend it's "not a result" maybe? It's just a name.

> All of the IEEE formats - multiplying or dividing by two never loses precision until overflow or underflow since it's merely incrementing/decrementing the exponent.

Let me rephrase. What 8 bit float can represent 1.03125 in the first place?

> Also why did you ignore the 10*2 = 16 posit case? And these happen for all size posits for reasonable ranges, but none of the IEEE formats. The values I gave were for examples. Run your own checks and you'll find them all over the map for any size posit.

There is no standard 8 bit IEEE format as far as I know.

I didn't ignore that case. I pointed out how the standard posit8 fails it. But if we're using some kind of weird custom float with no sign bit, I think it's valid to use a better-balanced posit. Posit8_1 is the best competitor to a custom float with 5 mantissa bits. If you also remove the sign bit, then it can do 10 * 2 = 20.

> It's a fundamental property of IEEE 754 that the difference (or sum) of two representable values is itself representable in the same IEEE 754 format

Except when you hit infinity.

> Examples where the error in addition is not representable are a= Posit16(0x0001)=2^-114, a+a should be 2^-113, not representable as posit16, a+a as posit 16 gives 2^-112, with error from correct of 2^-113, not representable.

1. Such aggressive rounding only happens in the last couple values next to 0. If you were using float16 you wouldn't be able to represent that value, it would just be 0.

2. Does that give you x - y = 0 for different x and y, the thing I asked about?

> Why do you think that is?

I have no idea how hard it is to implement, to be honest. But adding a few more bits is easy in comparison. And it's not that different from normal floating point, so who wants to bother?

bfloat is just truncating differently; I'm not surprised it was very quickly implemented.

adgjlsfhk1 · on Jan 17, 2023

Also to be clear, unums v1 and v2 were (mostly) dumb ideas that haven't gone anywhere. Unums v3 (aka posits) are a (IMO) really good idea for how to generate a better floating point standard (see https://posithub.org/docs/posit_standard-2.pdf)