<tnt>
Any clue what would make nextpnr runs inconsistent with each other when specifying --seed?
<tnt>
(same machine ... same binary ... same options ... run it twice and get different output)
<emilazy>
bugs, for one :(
<tnt>
yeah, unfortunately I can only reproduce this on my laptop and a build takes 40 min there so I can't realistically bisect 5 months of commits (previous build I was using that didn't exhibit the issue).
<gatecat>
I think it's to do with a bad reliance on unordered_map ordering
<gatecat>
(most C++ libraries use insertion ordering or something but some don't)
<gatecat>
really it needs to be replaced by another structure, ideally, like the Yosys hashlib dict
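The C++ standard leaves the iteration order of `std::unordered_map` unspecified, so any result that depends on that order can change between standard library builds. A minimal sketch of one common workaround, assuming the keys are comparable: copy the entries out and sort them before iterating. The helper names `sorted_items` and `join_cell_names` are hypothetical and are not taken from nextpnr.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: return the map's entries sorted by key, so callers
// iterate in an order that does not depend on the hash table's bucket layout.
template <typename K, typename V>
std::vector<std::pair<K, V>> sorted_items(const std::unordered_map<K, V> &m)
{
    std::vector<std::pair<K, V>> items(m.begin(), m.end());
    std::sort(items.begin(), items.end(),
              [](const std::pair<K, V> &a, const std::pair<K, V> &b) { return a.first < b.first; });
    assert(items.size() == m.size()); // every entry was copied exactly once
    return items;
}

// Hypothetical example: join cell names in a stable, key-sorted order.
std::string join_cell_names(const std::unordered_map<std::string, int> &cells)
{
    std::string out;
    for (const auto &kv : sorted_items(cells)) {
        if (!out.empty())
            out += ",";
        out += kv.first;
    }
    return out;
}
```

Two maps holding the same entries give the same joined string here, regardless of insertion order or bucket count. The sort costs O(n log n) per traversal, which is one reason a container with a defined iteration order may be the better long-term fix.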
<tnt>
mmm ... I'm using gcc 9.3.0 on both laptop/workstation and they have different behavior. (granted one is the gentoo variant, the other the ubuntu one so no clue what patches/options they ended up being built with).
<gatecat>
which one is nondeterministic? I have never seen this here on Arch, so it'd be useful to be able to reproduce
<tnt>
my laptop
<tnt>
(gentoo)
<tnt>
Trying to bisect anyway (even given long build) and first commit results in segfault, not a great start.
<tnt>
if you have the same boost version (1.72) I can send you the binary
<gatecat>
I'm not sure if a bisect will even help here as this is a bug that's been in there for a long time, if it's the one I'm thinking of
<gatecat>
is it possible that the compiler or C++ library has changed? (not that it's a bug on that side, just a bad assumption in nextpnr)
<tnt>
I never had this behavior before on my laptop. And no, same exact gcc version.
<gatecat>
huh, maybe it is something else then
<tnt>
The build from fpga-toolchain is consistent, but it exhibits poor QoR. 17 fails out of 32. (vs 3 for the same oss-cad-suite build, both nightly builds).
<gatecat>
is that with the same netlist?
<tnt>
yup, same netlist, same pcf, same options.
<tnt>
scanning seed from 0 to 31
<tnt>
From the bisect result, I suspect "timing: Use new engine for HeAP"
<tnt>
ebc2527368d920ea3c40a9ca83f73df242785044
<tnt>
(hard to be 100% sure because a lot of commits don't build or just plain segfault ...)
<gatecat>
OK, that's definitely something I should look into
<gatecat>
can you point me to the design? I can probably find a way of triggering the QoR issue here
<tnt>
gatecat: The PR fixes valgrind here as well. And seems to indeed make the results stable/consistent.
<tnt>
It also fixed the "bad" QoR AFAICT. Now on my laptop / self-built nextpnr, I get only 5 out of 63 fails and only very near misses.
<tnt>
like ... Max frequency for clock 'clk': 47.99 MHz (FAIL at 48.00 MHz)
<gatecat>
tnt: great, thanks for testing!
<tnt>
So I guess HeAP was operating with bad timing estimates causing it to do ... bad things.
<tnt>
I still seem to get different results on different machines, which is a bit strange. I'll rebuild from scratch to make sure it's not some residue from all the testing.
<tnt>
Nope, still different results depending on the machine where I built it ... meh, I guess there must be some subtle build flag or compiler differences or something. In any case, both exhibit statistically similar QoR so it's not an issue, just intriguing.
<q3k>
does nextpnr rely on floating point math? these can also be quite notorious reproducibility killers
<q3k>
s,reproducibility,determinism,
<tnt>
Yes it does.
<tnt>
And that's what I'd suspect ... some SIMD optimization difference could easily explain that since a lot of those are not strictly compliant with IEEE 754.
<q3k>
i mean, if you're going from source, then yeah, a lot of things can cause bit-to-bit differences
<q3k>
compilers can choose to emit different instructions for the same high-level code, that are okay according to the C/C++ standards but slightly differ on bit-to-bit results
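One concrete way the same source can give different bits: floating-point addition is not associative, so a reduction that is regrouped (for example into SIMD lanes) can round differently from a scalar loop. The sketch below only imitates lane-wise accumulation in plain C++. `sum_lanes` is a hypothetical helper, and nothing here is claimed to be what nextpnr or any particular compiler actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: accumulate xs into `lanes` independent partial sums
// (element i goes to lane i % lanes), then combine the partials left to right.
// lanes == 1 is an ordinary sequential sum; lanes == 2 mimics a 2-wide SIMD
// reduction, which adds the same values in a different grouping.
double sum_lanes(const std::vector<double> &xs, std::size_t lanes)
{
    assert(lanes > 0);
    std::vector<double> partial(lanes, 0.0);
    for (std::size_t i = 0; i < xs.size(); ++i)
        partial[i % lanes] += xs[i];
    double total = 0.0;
    for (double p : partial)
        total += p;
    return total;
}
```

For the input `{1e30, 1.0, -1e30, 1.0}`, built without fast-math style flags, the one-lane sum loses the first `1.0` when it is added to `1e30` and returns `1.0`. The two-lane sum cancels `1e30` against `-1e30` in one lane, keeps both `1.0` terms in the other, and returns `2.0`.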
<mwk>
yosys ended up with its own dict/set implementations, specifically to avoid C++ library differences between compiler causing different output
<q3k>
and even if you ship the same binary, it can still cause bit-to-bit differences if it eg. does per-cpu-capability dispatch, ie. uses SSE2 for sqrt if present, but defaults to an x87 impl otherwise
<mwk>
(with limited success, but eh)
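To illustrate why owning the container helps: if iteration walks a plain vector, the order is fixed by the program's own insertions, not by the standard library's hashing. The sketch below is a deliberately simple insertion-ordered map. `InsertionOrderedMap` is a hypothetical name, and this is not the Yosys hashlib implementation, which is organised differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative only: a map whose iteration order is the insertion order.
template <typename K, typename V>
class InsertionOrderedMap
{
  public:
    // Insert or overwrite. A new key is appended; an existing key keeps its slot.
    void set(const K &key, const V &value)
    {
        auto it = index_.find(key);
        if (it == index_.end()) {
            index_.emplace(key, entries_.size());
            entries_.emplace_back(key, value);
        } else {
            entries_[it->second].second = value;
        }
    }

    // Look up an existing key; the key must be present.
    const V &at(const K &key) const
    {
        auto it = index_.find(key);
        assert(it != index_.end() && "key not present");
        return entries_[it->second].second;
    }

    std::size_t size() const { return entries_.size(); }

    // Iteration walks the vector, so the order never depends on hashing.
    typename std::vector<std::pair<K, V>>::const_iterator begin() const { return entries_.begin(); }
    typename std::vector<std::pair<K, V>>::const_iterator end() const { return entries_.end(); }

  private:
    std::vector<std::pair<K, V>> entries_;
    std::unordered_map<K, std::size_t> index_; // lookup only, never iterated
};
```

The hash table is still used for lookups, but since it is never iterated, its internal ordering cannot leak into the output. Erasing entries is left out to keep the sketch short.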
<gatecat>
initialiser ordering is the other one that's bitten us before in nondeterminism between compilers/machines
<thaytan>
q3k, I remember hitting one floating point bug once that depended on whether a computation stayed in the FPU and was computed in full 80-bit precision the whole way, or was evicted to a floating point register and truncated to 64 bits mid-way
<q3k>
thaytan: yep, exactly
<mwk>
that's classic, though luckily mostly extinct along with 32-bit x86