ChanServ changed the topic of #yosys to: Yosys Open SYnthesis Suite: https://github.com/YosysHQ/yosys/ | Channel logs: https://libera.irclog.whitequark.org/yosys/
tpb has quit [Remote host closed the connection]
tpb has joined #yosys
vidbina has quit [Ping timeout: 268 seconds]
citypw has joined #yosys
bl0x_ has quit [Ping timeout: 240 seconds]
bl0x_ has joined #yosys
ec has quit [Ping timeout: 276 seconds]
tlwoerner has quit [Ping timeout: 240 seconds]
tlwoerner has joined #yosys
tlwoerner has quit [Ping timeout: 268 seconds]
tlwoerner has joined #yosys
FabM has joined #yosys
<ikskuh> hm, i figured out one hot path without reading the whole source code *grin*
<ikskuh> you shall not modulo arbitrary numbers
<ikskuh> in synthesis
vidbina has joined #yosys
sagar_acharya has joined #yosys
sagar_acharya has quit [Quit: Leaving]
<lofty> ikskuh: oh yeah, that's really something to avoid
<ikskuh> yeah
<lofty> It might be feasible to do sequentially
<ikskuh> still impressed that i can run 16 million divs per second
<lofty> But, definitely not combinationally
<ikskuh> when i removed it, the synth/pnr jumped to 112 mhz
<ikskuh> which is already way better
<ikskuh> but now i need to compile the nextpnr gui so i can figure out where the hotpath actually is
<ikskuh> because i feel like it should be faster
<ikskuh> my goal is stable 160 MHz CPU clock with at least 20 MHz margin
<tnt> is that on ecp5 ?
<ikskuh> yeah
<ikskuh> current design without CPU synthesizes to roughly 202 MHz
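lofty's point about doing the division sequentially instead of combinationally can be sketched outside of HDL: a restoring shift-subtract divider retires one quotient bit per clock, so a 32-bit div/mod becomes 32 short cycles instead of one enormous combinational cone. A rough Python model of that per-cycle loop (illustrative only, not the project's code):

```python
def restoring_divide(dividend, divisor, width=32):
    """Model of a sequential restoring divider: each loop iteration
    stands for one clock cycle doing only a shift, a compare, and a
    conditional subtract -- a short critical path, repeated `width` times."""
    assert divisor != 0
    remainder, quotient = 0, 0
    for i in range(width - 1, -1, -1):   # one iteration == one clock cycle
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        if remainder >= divisor:          # trial subtract succeeds
            remainder -= divisor
            quotient |= 1 << i
    return quotient, remainder

print(restoring_divide(100, 7))  # (14, 2): 100 // 7 and 100 % 7
```

In hardware the loop body would be a small datapath plus a counter, which is why it can close timing where a single-cycle divider cannot.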
Lord_Nightmare has quit [Ping timeout: 240 seconds]
sagar_acharya has joined #yosys
<ikskuh> the gui is really slow /o\
<ikskuh> are there requirements for the visualization? i might be able to improve the rendering a lot
<ikskuh> is there a way to show the critical paths?
<ikskuh> or reverse net/cell names back to their verilog lines?
<tnt> try to guess based on the name of the nets ...
<ikskuh> from and to are both "posedge $glbnet$clk"
sagar_acharya has quit [Quit: Leaving]
_whitelogger has joined #yosys
dnm_ is now known as dnm
oldtopman has joined #yosys
tux3 has joined #yosys
tux3 has joined #yosys
tux3 has quit [Changing host]
Kamilion|ZNC is now known as Kamilion
rektide_ has quit [Ping timeout: 256 seconds]
rektide has joined #yosys
gruetze_ is now known as gruetzkopf
trabucay1e is now known as trabucayre
Sarayan has joined #yosys
<tpb> Title: View paste RW3Q (at bpa.st)
<ikskuh> could someone take a quick peek at this? It's a human readable display of the critical paths
<ikskuh> every line with a "!" is where the delay is more than the budget, and the cells in the path are displayed afterwards
<ikskuh> to me it looks like the "cpu halted" is delaying everything?
<tnt> well kind of hard to say without seeing the sources ...
<ikskuh> oh, sure
<tpb> Title: ashet/mini - mini - Random Projects: Code for the masses (at git.random-projects.net)
<ikskuh> spu.v is the problematic file
<ikskuh> if i remove the instance of the spu_core module, i'm at ~200 MHz
cr1901_ is now known as cr1901
Lord_Nightmare has joined #yosys
ec has joined #yosys
citypw has quit [Ping timeout: 276 seconds]
cr1901 has quit [Ping timeout: 240 seconds]
Moe_Icenowy is now known as MoeIcenowy
<lofty> ikskuh: this state machine is pretty big
<lofty> Oh, is this the entire CPU as a state machine?
<ikskuh> yes
<ikskuh> "smol" cpu
<ikskuh> well, it actually is
<ikskuh> 32 instructions and only some transitions before/after
sagar_acharya has joined #yosys
sagar_acharya has quit [Client Quit]
<lofty> ikskuh: well, CPUs as state machines tend not to go so well
<lofty> There's a lot of things that Yosys has to merge together when it's probably not necessary
cr1901 has joined #yosys
<lofty> For example the ALU carry things
gsmecher has joined #yosys
ec has quit [Ping timeout: 276 seconds]
ec has joined #yosys
vidbina has quit [Ping timeout: 268 seconds]
peepsalot has quit [Quit: Connection reset by peep]
peepsalot has joined #yosys
sagar_acharya has joined #yosys
sagar_acharya has quit [Quit: Leaving]
<ikskuh> lofty: how do i do the CPU then instead?
<ikskuh> i only know the state machine way
<lofty> ikskuh: a 3-stage pipeline should be doable
<ikskuh> hm?
<lofty> pipelined CPU.
<ikskuh> doesn't work
<ikskuh> instructions are 100% dependent on each other
<lofty> do you think they weren't on older RISC CPUs?
<lambda> does your CPU not execute instructions in order?
<ikskuh> lofty, so how does it work then?
<ikskuh> lambda: i have a stack machine
<ikskuh> that means op 1 has to be 100% completed before op 2 can fetch data
<lofty> *or* that op 1 can feed its data into op 2
<ikskuh> right, but the memory write still has to happen
<lofty> And it will
<ikskuh> but then i don't understand what you mean
* ikskuh is still very much an FPGA noob
truc is now known as bjonnh
<lofty> Imagine, fetch/execute/writeback
<ikskuh> yeah, i have done that as a state machine right now
<lambda> ikskuh: I think for the vast majority of instructions you don't need to fully execute them in order to know that the next instruction will be at pc+1
<lambda> so you can already fetch the next instruction while the current one is being executed
<ikskuh> lambda: i can't, i'm already at 100% memory pressure
<lofty> I think on a 3-stage pipeline you can resolve the next PC immediately though, right?
<ikskuh> kinda.
<lofty> Either a) split instruction and data buses
<lofty> or b) caches
<lofty> Which I suppose is a way of achieving a
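lofty's option (a) counted out: on a single von Neumann bus, instruction fetches and data accesses serialise, while split (or separately cached) buses let them overlap. A toy count, assuming one access per cycle per bus (hypothetical numbers, just to show the shape of the win):

```python
def unified_bus_cycles(fetches, data_ops):
    # one shared bus: every fetch and every data access takes its turn
    return fetches + data_ops

def split_bus_cycles(fetches, data_ops):
    # separate instruction and data buses: the two streams overlap,
    # so the slower stream sets the pace
    return max(fetches, data_ops)

print(unified_bus_cycles(10, 6), split_bus_cycles(10, 6))  # 16 10
```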
<ikskuh> okay, so if i split them and have an instruction cache
<Sarayan> if it's a stack machine it should already be split shouldn't it?
<ikskuh> it isn't, it's a von neumann
<ikskuh> but i still don't understand what you mean with pipeline exactly
<ikskuh> do i chain stuff together with fifos?
<lofty> No
<lofty> Just flops
<ikskuh> hm, okay
<lofty> okay, let's temporarily put aside the stack machine-ness to look from an ivory tower
<lofty> You need an ALU - this ALU can do things like shift, add, subtract, etc
<ikskuh> right
<lofty> Then you need something to program this ALU to perform an operation
<lofty> This is your fetch stage
<ikskuh> is memory read/write performed by alu?
<Sarayan> no
<Sarayan> memory access is another logical block
<ikskuh> okay
<lofty> You're turning the 3 stage pipeline into a 4-stage pipeline, Sarayan :P
<Sarayan> the alu doesn't care where its inputs come from or where its output goes
<lofty> But maybe 4 stages is easy to explain
<Sarayan> lofty: then I fold it back, you'll see ;-)
<ikskuh> okay
<ikskuh> so i have a thing that fetches the instruction itself
<ikskuh> which stage/unit fetches the instruction inputs?
<Sarayan> the #1 trick for a fast stack machine being not to have a real stack in the first place, but independent registers for the top and filling/spilling as needed
<ikskuh> Sarayan: that's an optimization for future days
<lofty> unfortunately it isn't.
<ikskuh> which i have planned already, but i don't wanna do now for simplicity's sake
<lofty> "where do your instruction inputs come from" - from the register file
<lofty> not from memory
<lofty> You're at 100% memory utilisation, this is why :P
<ikskuh> well
<ikskuh> that is all not a problem atm
<ikskuh> i know that these are architectural/high level design problems
<ikskuh> but right now i can live with raw uncached memory access
<lofty> You came to the channel asking for help with this, did you not?
<lofty> "why is my CPU slow" because at a high level, it is a state machine
<lofty> And state machines do not fit FPGAs well because all the logic is happening at once
<ikskuh> yes, exactly. but i don't see how a "register file" (which i don't know what it is) can help
<lofty> This is why multiplication, division and modulo are slow
<Sarayan> well, at higher level your cpu is slow because it does way too many memory accesses and memory accesses are slow
<ikskuh> ↑ i know this part
<lofty> If you put the top of the stack in a small memory of its own
<lofty> a lot of the operations will use this
<lofty> Maybe even the top N of the stack
<lofty> And then you can avoid memory operations
<ikskuh> yes, but this will make the *implementation* itself slower
<lofty> And you can speed your CPU up by *already having the data*
<ikskuh> as i need more combinational logic
<lofty> No, you need *less*
<ikskuh> huh?
<Sarayan> the amount is not what sets the speed, it's the depth
<lofty> You can already load data from the stack, can you not?
<ikskuh> Sarayan: yes, exactly
<ikskuh> that's why i was thinking it's slower as i need more decisions
<Sarayan> so more logic is not necessarily slower if it's more parallel logic
<lofty> So, if you are spending less time fetching operands you need fewer cycles
<ikskuh> lofty: cycles aren't my problem
<lofty> And thus your CPU executes the same instruction stream faster
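The top-of-stack caching that lofty and Sarayan describe is easy to quantify. Assuming, hypothetically, that every binary ALU op on a purely memory-resident stack costs two operand reads plus one result write, while a machine holding the top two stack entries in registers only refills the second register from memory, the bus traffic drops threefold:

```python
def naive_stack_traffic(n_ops):
    # every binary op: read two operands from the in-memory stack,
    # write one result back -- three memory accesses per op
    return n_ops * 3

def tos_cached_traffic(n_ops):
    # top two stack entries live in registers: the op consumes both,
    # leaves its result in the top register, and only the second
    # register refills from the in-memory stack -- one access per op
    return n_ops * 1

print(naive_stack_traffic(100), tos_cached_traffic(100))  # 300 100
```

The exact ratio depends on the instruction mix, but any binary-op-heavy stream benefits.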
indy has quit [Quit: ZNC 1.8.2 - https://znc.in]
<Sarayan> but in any case if your cpi goes from 4 to 1.mumble, even if your cycle gets slower it doesn't matter
<Sarayan> (4 = read instruction, read two values, write one)
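Sarayan's arithmetic spelled out: with the 4-cycle pattern above, even a faster clock loses to a pipelined design with CPI near 1. The figures here are the ones from the discussion (4 CPI state machine, "1.mumble" CPI pipelined, taken as 1.3 for illustration), not measurements of the actual design:

```python
def mips(clock_mhz, cpi):
    # million instructions per second = clock / cycles-per-instruction
    return clock_mhz / cpi

state_machine = mips(200, 4)    # 4 cycles: read insn, two reads, one write
pipelined = mips(160, 1.3)      # assumed CPI at the 160 MHz target
print(state_machine, round(pipelined, 1))  # 50.0 123.1
```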
<lofty> "cycles aren't my problem" <-- then why come to us for help speeding up your CPU if it doesn't matter?
<ikskuh> i don't want "less cycles", but "faster cycles"
<ikskuh> i want to have a clk of 160 MHz
<lofty> That's why you have the pipeline!
<ikskuh> due to other parts in the design
<ikskuh> and a central memory bus
<Sarayan> ok
<ikskuh> so if my cpu accesses this bus, it needs to react in 1 clk
<Sarayan> how are your instructions encoded?
<ikskuh> 1 u16 for "all information", then up to two immediates or stack operations as input0 and input1
<lofty> You are talking to us about how the number of decisions causes your CPU to be slow, right?
<lofty> You are trying to decide everything at once per cycle
<lofty> The solution is therefore to not decide everything at once
<Sarayan> so you have 4 cycles per instruction pretty much always?
<ikskuh> Sarayan: kinda, yeah. except for memory ops (ld8, ld16, st8, st16) and mul/div/mod
<tpb> Title: SPU Mark II - Ashet Home Computer (at ashet.computer)
<lofty> You don't need a mod instruction, but anyway
<ikskuh> lofty: so one solution would be to have "more, but dumber steps" in the state machine?
<lofty> Having more steps means more decisions
<lofty> And having fewer steps means more decisions done per cycle
<Sarayan> well, the #1 question is do you have prefetching?
<ikskuh> no
<lofty> fundamentally, you cannot resolve your problem within the framework of a finite state machine
<lofty> and so you must leave it.
<ikskuh> lofty: i got that part, but i don't understand HOW to leave
<ikskuh> this is something i've never done
<lofty> Can you draw a graph of the steps of your state machine
<ikskuh> i'll try
<lofty> the goal is that a chain of steps within your state machine becomes a pipeline
<Sarayan> if you don't have prefetching you have an extremely hot path where, in a single cycle, the memory must return the instruction read result, you must decode the instruction, and push the next address onto the bus
<Sarayan> that's a lot of computation which starts with waiting for the memory to answer
<Sarayan> you really need to separate fetch from decode
<lofty> Here's the teaching example I normally go for: washing batches of laundry. You have a washing machine and a tumble dryer.
<lofty> What you are doing here is taking the time to get a batch of laundry, then standing around waiting for it to wash, then standing around waiting for it to dry, then hanging it up
<lofty> whereas you could be simultaneously getting the laundry while one batch washes and another batch dries
<lofty> That is, fundamentally, a pipeline
<lofty> Where you have a state machine that has the state for steps A, B, C, D and executes them in the order A -> B -> C -> D, then what Yosys will do is execute A, B, C, and D simultaneously and then decide the result
<lofty> Which is slower than doing A, B, C, and D simultaneously and unconditionally
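The laundry analogy, counted out: finishing each batch completely before starting the next takes batches × stages time slots, while a pipeline emits one finished batch per slot once it is full. A tiny sketch (slot counts only, no hardware semantics):

```python
def sequential_slots(batches, stages):
    # finish one batch completely before starting the next
    return batches * stages

def pipelined_slots(batches, stages):
    # the first batch takes `stages` slots to drain; after that,
    # one batch finishes every slot
    return stages + (batches - 1)

print(sequential_slots(3, 3), pipelined_slots(3, 3))  # 9 5
```

The same formula is why a 3-stage CPU pipeline approaches one instruction per cycle on long straight-line runs.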
<ikskuh> state machine
<lofty> I'm rambling, sorry.
<Sarayan> otoh, pipelines are rather hard with stack machines
<Sarayan> but prefetching isn't, and it's going to help a lot
<lofty> Sarayan: this state machine looks very pipeline-y to me
<ikskuh> lofty: but each part of the pipeline will only activate every nth cycle, right?
<lofty> For your case, yes
<Sarayan> something easy to try, instead of executing the instructions in the order "read instruction, read param1, read param2, write result" try "read param1, read param2, read next instruction, write result", with an initial read instruction at reset and on jumps of course
<Sarayan> if you manage it your hot path should be way less hot
<Sarayan> because 1/ you can compute the result to be written while you're reading the next instruction
<ikskuh> Sarayan: problem is: i left interrupt handling out /o\
<ikskuh> but let's ignore that for now
<Sarayan> 2/ you can decode the instruction while writing the result
<Sarayan> these are two hot paths fetch to decode and compute to write
<Sarayan> that way you split them over two cycles
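Sarayan's reordering written out as a per-cycle schedule (the labels are illustrative, not the project's signal names). The memory port stays busy every cycle either way; the gain is that compute and decode each get a full cycle of overlap instead of sharing a cycle with a memory turnaround:

```python
schedule = [
    # (cycle, memory port,            datapath)
    (1, "read param1",           "idle"),
    (2, "read param2",           "idle"),
    (3, "read next instruction", "compute result"),    # hot path 1 split
    (4, "write result",          "decode next insn"),  # hot path 2 split
]
for cycle, mem, datapath in schedule:
    print(f"cycle {cycle}: mem={mem:<22} datapath={datapath}")
```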
<ikskuh> what does "decoding" mean exactly?
<ikskuh> the instruction is just a bunch of bit fields
<ikskuh> determining what each part of the "pipeline" does
<Sarayan> decoding for instance is computing where to read param1
<ikskuh> ah, so "look at the instruction word"
<Sarayan> you have the whole "write result" cycle to compute that address rather than having to do it immediatly
<ikskuh> let me think about that
<ikskuh> and if i don't have "write result", i just stall then?
<Sarayan> yeah
Max-P has joined #yosys
<ikskuh> hm
<ikskuh> but only for one clk, right?
<Sarayan> yep
<ikskuh> and if i actually have to do memory access, i'll "stall" for "how long that memory access takes"
<Sarayan> your memory is slower than 160MHz?
<ikskuh> yes, i have an external 16 MB SPI flash and 16 MB SDRAM
<Sarayan> ouch, you really really really want some kind of cache for instructions and stack :-)
<ikskuh> yes, i know
<ikskuh> but those are secondary problems
<ikskuh> i wanna do everything step by step
<Sarayan> try prefetching, it should have an interesting impact
<ikskuh> and my current design allows me to tack caches for dram, flash, ... :)
<ikskuh> yeah, i think i got the rough idea now
<ikskuh> this will take a while to get right
<ikskuh> but one very good thing: i already have a behaviour test suite
<ikskuh> so i can see if i break things on the way
<Sarayan> that's useful
<ikskuh> or if everything still works as intended
<ikskuh> but i think i will try prefetching and "pipelining"
<ikskuh> also separate instruction and data bus for separate caches
<Sarayan> pipelining is hard to get right with a stack architecture
<ikskuh> yeah, but the general idea of "not doing a state machine"
<Sarayan> honestly it's all a state machine in the end
<ikskuh> right
<ikskuh> but trying to do more in parallel
<ikskuh> and not cramming everything into a single process
<Sarayan> yes, that's the important part
<ikskuh> this will take a day to see everything i can do in parallel
<Sarayan> not having long chains of dependencies in the same clock
<ikskuh> but the idea of running "add" in parallel to "sub" and "shift" is the right idea, yeah?
<Sarayan> if you have the data to run them with in parallel
<ikskuh> alu only has two inputs "input0" and "input1"
<ikskuh> and will output a single value which you can inspect for flags and/or push
<ikskuh> so i have one step "add input0, input1, store output", one "select the output from the alu", one "push the output" parallel to "set the flags", right? and while i compute the output of the alu, i can fetch and decode the next instruction
<ikskuh> did i get this right?
<Sarayan> you fetch it while you compute the output of the alu, you decode it while you write the output of the alu
<Sarayan> you don't want to do fetch and decode in the same cycle, it's too much
<ikskuh> okay
<ikskuh> so "way more registers" :D
<Sarayan> of course
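"Way more registers" in miniature: a pipeline is just one register per stage boundary, all shifting on every clock. This toy model (no hazards, no stalls, an idealisation rather than the spu.v design) shows a program retiring in stages + n − 1 cycles:

```python
def run_pipeline(program, n_stages=3):
    """Shift instructions through fetch / decode / execute-writeback.
    `pipe[k]` is the register feeding stage k; all shift every cycle."""
    pipe = [None] * n_stages
    retired, cycles, i = [], 0, 0
    while len(retired) < len(program):
        cycles += 1
        nxt = program[i] if i < len(program) else None  # next fetch, or bubble
        i += 1
        pipe = [nxt] + pipe[:-1]          # every stage register shifts
        if pipe[-1] is not None:
            retired.append(pipe[-1])      # last stage completes this cycle
    return cycles, retired

print(run_pipeline(["push", "push", "add"]))  # (5, ['push', 'push', 'add'])
```

Three instructions through three stages take 5 cycles instead of 9, matching the laundry arithmetic above; real stall and interrupt handling would add conditions to the shift.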
tlwoerner has quit [Remote host closed the connection]
tlwoerner has joined #yosys
FabM has quit [Ping timeout: 268 seconds]
ec has quit [Remote host closed the connection]
ec has joined #yosys
bluesceada has quit [Changing host]
bluesceada has joined #yosys
ec has quit [Ping timeout: 276 seconds]