<geist>
oh hey i think i figured out what was wrong with my wyze50 terminal
<geist>
it occasionally made sparking and pop sounds
<geist>
looks like a diode in the power supply, which for some reason was standing high off the motherboard, was bent over and juuuuust touching one of the windings on a transformer right next to it
<geist>
looks like there are some burnt spots there, so maybe it occasionally arced
darkstarx has quit [Remote host closed the connection]
darkstarx has joined #osdev
<klange>
ugh, race in my pipes... too many layers in those
gog has quit [Quit: byee]
Vercas has quit [Remote host closed the connection]
Vercas has joined #osdev
k0valski1 has joined #osdev
sortie has joined #osdev
<klange>
Well, I didn't fix the race but I did make pipes faster... I think I need to juggle a lock better here.
mahmutov has quit [Ping timeout: 244 seconds]
<sortie>
I'm this close to strerror(EPIPE) = "Ceci n'est pas une pipe"
* sortie
. o O (Is a broken pipe a pipe?)
<GeDaMo>
"Ceci n'est pas une pipe"
<GeDaMo>
Huh, I missed that you already said that :P
<sortie>
This is not a "Ceci n'est pas une pipe"
mahmutov has joined #osdev
Coldberg has quit [Ping timeout: 260 seconds]
ElectronApps has quit [Quit: Leaving]
ElectronApps has joined #osdev
vai has quit [Ping timeout: 260 seconds]
Vercas has quit [Remote host closed the connection]
GeDaMo has quit [Remote host closed the connection]
<geist>
vin: hmm, can you elaborate?
Oli has joined #osdev
mctpyt has joined #osdev
Oli_ has joined #osdev
Coldberg has joined #osdev
PapaFrog has joined #osdev
LostFrog has quit [Ping timeout: 252 seconds]
<vin>
geist: Or rather let me rephrase my question to be more generic. If you are interleaving pages/blocks across devices for larger throughput, where the granularity of a single operation to the client is the sum of all pages interleaved, can you do better than pages starting at the same offset (these pages form a logical block, and the block is the access granularity)?
<geist>
ah so you're talking about the SSD firmware level, and the devices in this case are nand banks?
<geist>
i *think* on most modern high end SSDs the translation is pretty much at the page level, so probably 4K, so logical to physical is pretty much perfectly swizzled
<geist>
the erase size is probably still much larger than a page though, so that's obviously the issue, but since the SSD is most likely writing things out in a journalled way it's generally just appending new writes somewhere else. also the whole SLC caching scheme that modern SSDs do really makes it more complicated too
<geist>
since they're essentially temporarily writing new stuff to a SLC cache which is then flattened out
<geist>
but since there are multiple devices i guess it keeps multiple outstanding journals, one per device? or i suppose it just always stripes across all devices, or at least a gang of them. unclear to me
<geist>
right? you could basically treat all the nand planes as a huge raid0 style stripe, or you could treat them pretty much entirely independently, but balance all the writes across all of them and independently wear level each plane
<geist>
i suppose they'd both have different performance characteristics, though presumably the latter would be much more complicated to track
<geist>
that being said, simple 'SSDs' like SD cards or MMC or whatnot probably do the former: just have a few planes and treat them in a striped way
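A minimal sketch of that striped arrangement, assuming a hypothetical 4-plane geometry (illustrative names; a real FTL layers wear leveling and bad-block handling on top of a mapping like this):

    /* Round-robin striping of logical pages across NAND planes. */
    #define NUM_PLANES 4

    struct phys_page {
        unsigned plane;   /* which NAND plane the page lives on */
        unsigned page;    /* page index within that plane */
    };

    static struct phys_page stripe_map(unsigned logical_page)
    {
        struct phys_page p;
        p.plane = logical_page % NUM_PLANES;   /* rotate across planes */
        p.page  = logical_page / NUM_PLANES;   /* advance within a plane */
        return p;
    }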
Oli_ has quit [Ping timeout: 244 seconds]
<geist>
not really answering your question, but it does have me thinking
Oli has quit [Ping timeout: 264 seconds]
<clever>
ive also found that trim commands can still be handled async and resumed after a power loss
<clever>
i issued a `blkdiscard` against an entire card, and ejected it immediately afterwards (on purpose)
<geist>
i do actually wonder precisely what modern SSD algorithms look like. my only experience with real shipping SSD firmware was in the form of firmware for <redacted's> SD card about 10 years ago
<clever>
and after re-inserting the card, it did show up as entirely blank
<clever>
i feel like it both nuked the translation tables, so all reads just return null and dont even read
<geist>
yah lots of SSDs just fiddle with the translation table and mark the page blank and dont erase immediately
<clever>
and also scheduled a full device erase in the background
<geist>
doesn't even have to do that, there's nothing in trim that says you have to erase the device, that's what the secure erase command is for (on ATA at least)
Oli has joined #osdev
<clever>
yeah, it could just keep say 2mb pre-erased
<clever>
and erase on-demand as that runs out
<geist>
it can just mark it for erase which still helps the 'find a new page' algorithm later on
<clever>
and the discarded blocks just give it a far better choice of what to erase & use next
<geist>
right
<clever>
yep
<geist>
but depending on the controller and firmware, some can take a lot longer to trim, so it's clearly a bit more complicated than that in some cases
<geist>
almost all of my ssds are samsung nowadays which seem to trim *fairly* quickly
<clever>
ive also read a research paper, where some sata ssd's corrupted the translation tables on simple power loss
<geist>
early on i was using a lot of sandforce SSDs which were early to the game. they use actual data compression to help
<clever>
and that then leads to total data loss
<geist>
and those take *ages* to trim, since presumably they have to look through and deal with the compressed block you're actually trimming
<clever>
ive also seen some repair guys on youtube, that dump the raw nand flash, recover the translation tables, and re-assemble the disk image
<geist>
yah. the fun one is all the SLC caching stuff that modern stuff really does
<geist>
the breakthrough in my brain is some blab about it on anandtech, because i was always wondering where the SLC cache came from. like is 10% of the device made differently?
<geist>
answer is no, you can take MLC and TLC and QLC flash and erase it as SLC and back again
<geist>
you just gang up a bunch of cells and treat them as the same thing
<geist>
trick is you can erase and use a TLC/etc cell as SLC and the erase cycle is much faster so it has the performance of SLC, but of course is not space efficient
<clever>
in this paper, they took 15 SSD's and 2 mechanical drives, and subjected them to a torture test
<geist>
so modern SSD controllers, in the last few years, now dynamically switch cells back and forth between SLC and higher stuff
<clever>
yanking the sata power in the middle of bulk write operations
<geist>
so the translation stuff is even more complicated
elastic_dog has quit [Ping timeout: 264 seconds]
<clever>
one drive, after only 8 power loss events, corrupted its translation tables
<clever>
any read past 256gig into the disk, failed with IO errors
<geist>
absolutely. do not power pull your device
<geist>
i dont trust SSDs and never will in that situation
<geist>
SD cards are also fairly easy to corrupt, but they're usually much simpler so they have less crap in flight
<clever>
ive also heard reports on the rpi forums, that SD cards are more likely to corrupt if your voltage rail sags
<geist>
it's almost certainly one of the big differences between enterprise and consumer SSDs: on board ram, supercaps to keep it going long enough to complete the transaction, etc
<geist>
absolutely
<clever>
and a seemingly large number of users are having cards die, or just getting fake cards
<clever>
so far, ive only murdered one card, it died in the middle of a gcc compile
<clever>
but it knows its on the deathbed, so it just ignores all writes
<geist>
yep. using SD cards as your root for your OS is fairly dangerous. back up and/or be ready to have to repave
<clever>
reads still work perfectly and it can still boot
<geist>
yep. 'good' SD card firmware goes into RO mode so you can at least get your shit off
<geist>
bad firmware just goes dead
<clever>
so things get really funky, when writes randomly revert (when the read cache expires), and then it just starts crashing
<clever>
yep
<geist>
i used to have a drawer at my work desk full of dead sd cards
<geist>
fairly easy to corrupt if you're working on a SD stack. lots of times they dont gracefully handle bad or corrupt commands or lots of power cycles as you load firmware on your board
<clever>
ive also been digging into volk and gnu-radio recently
<bslsk05>
github.com: volk/volk_32f_x2_dot_prod_32f.h at main · gnuradio/volk · GitHub
<clever>
this is a routine for neon based dot product with floats
<clever>
from that, i can see that neon appears to be based on float[4] vectors, and it has an `a = b + (c*d)` opcode, but no way to sum every element in a vector
elastic_dog has joined #osdev
<clever>
it also looks like there are some data dependency issues? where `a = b + (c*d); e = a + (f*g);` would stall out, waiting for the previous opcode
<geist>
gosh i kinda wish you'd finish up that pending stack for LK
<geist>
i'm about to just start pulling pieces out of it myself and finishing it off
<geist>
ext4 in particular
<clever>
ah, the ext4 stuff? yeah, i should just finish it off in qemu
<clever>
where would i find better docs on what arm neon can do exactly? the arm site is a bit tricky to navigate when you dont know what things are named
<geist>
do you mean arm32 or arm64?
<geist>
they basically renamed it to ASIMD in arm64
<geist>
may be why you're not finding the NEON docs
<geist>
also it's simply part of the ARMv8 ARM
<clever>
interested in both, but i can start at either
<clever>
let me check my armv8 docs...
<geist>
basically same thing, just different ISA to get to it, so the mnemonics are not the same
<geist>
also ARM64 redid how vector registers are mapped to lower level ones so it's far more straightforward
<geist>
since arm32 had a pretty dumb way of mapping registers
mahmutov has quit [Ping timeout: 252 seconds]
<geist>
ie, '[s0, s1] = d0' '[s2, s3] = d1'
<geist>
[d0, d1] = q0, etc
<geist>
arm64 does what you expect and s0 is the bottom of d0 is the bottom of v0
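A minimal sketch of the dot product routine clever linked earlier, assuming arm_neon.h and a length divisible by 4: vmlaq_f32 is the `a = b + (c*d)` opcode, and the arm32 branch shows the pairwise shuffle needed because only AArch64 has a one-instruction horizontal sum (vaddvq_f32):

    #include <arm_neon.h>

    float dot(const float *a, const float *b, int n)  /* assumes n % 4 == 0 */
    {
        float32x4_t acc = vdupq_n_f32(0.0f);
        for (int i = 0; i < n; i += 4)
            acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i)); /* acc += a*b */
    #if defined(__aarch64__)
        return vaddvq_f32(acc);             /* single-opcode horizontal add */
    #else
        float32x2_t s = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
        s = vpadd_f32(s, s);                /* pairwise add finishes the sum */
        return vget_lane_f32(s, 0);
    #endif
    }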
<vin>
geist | right? you could basically treat all the nand planes as a huge raid0 style stripe, or you could treat them pretty much entirely independently, but balance all the writes across all of them and independently wear level each plane. || Most modern SSDs use the planes in RAID0 and send P/E cycles to all blocks that are part of the array.
<vin>
Of course the writes to these blocks are actually made in parallel.
<vin>
But using greedy victim selection for GC forces good blocks to also be invalidated in the array.
<geist>
oh dont get me started. i sold my bitcoins at the absolute lowest it ever got to
<geist>
OTOH i'm basically okay with it because fuck bitcoins
<ZetItUp>
geist how much coins did you have? :P
<geist>
2.5
<ZetItUp>
oh :D
<vin>
Do SSD FTLs keep their address translation tables in host RAM?
<clever>
vin: ive heard that more expensive nvme drives have dedicated ram on the nvme module for that
<clever>
vin: while cheaper ones can steal some host ram, and just dma into it
<geist>
yep. very low end SSDs use a thing called... crap what is it
<clever>
vin: ive also heard that some ssd's, will not bother saving the translation table back to flash, and a supercap will then fuel a mad dash to commit things upon power-loss
<geist>
it's an nvme feature (the host memory buffer) that lets you (the host) bequeath a block of host ram to the card and it puts its translation table there
<geist>
but 'good' SSDs have 512MB-1GB+ onboard DRAM
<clever>
xhci has a similar feature, and xhci calls it scratch space
<geist>
mostly holding the translation table live
<geist>
for example the WD blue low end nvme i have does the nvme host thing
<vin>
Right clever, I wonder what the durability guarantees are for these tables during a crash. Let's say an update is made to the table (to remap a few blocks for fresh writes) in RAM and, while committing this to the drive, there is a power failure. How does the SSD recover from a faulty table?
<geist>
if the host doesn't partake you get reduced performance since the nvme controller is paging the translation table in and out of its own internal sram
<clever>
vin: i would assume that if using host ram, the drive wont claim the write is complete until it has also saved the new translation tables
<clever>
and the host ram is purely a read cache, to make lookups faster
<geist>
sure, it can treat it basically like a write through cache, and journal the updates to the on flash structure
<geist>
which is probably distributed around the part
<geist>
so it can be made safe
<vin>
hmm GC should update the table as well correct? Since the GC happens on drive cpus it needs to update the RAM copy of ATT (Address translation table)
<geist>
yep. think of it as just a WT cache of the translation table. speeds up reads immensely
<geist>
that's why nvmes that have the host memory feature still work if they dont get the ram, just with reduced performance
<geist>
anandtech did some performance reviews a while back on some newer devices with the feature
<vin>
I see
<geist>
and exactly as you expect the random read performance dropped immensely
<geist>
and you can actually kinda do the math
<geist>
if you have say a 2^40 sized device, split into 2^9 sized blocks, then that's 2^31 blocks. if you then have a translation table with 4 byte pointers, then that's 2^33 worth of table
<geist>
that's assuming you translate at 512 bytes
<geist>
if you use say 4K pages then the size goes down by 2^3
<vin>
Yup
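Spelling out that arithmetic (assuming the 4-byte entries above): a 2^40-byte device translated in 2^9-byte units needs 2^31 entries, so 2^33 bytes (8 GiB) of table; 4K units cut that by 2^3 to 1 GiB, which lines up with the onboard DRAM sizes mentioned earlier.

    /* Translation-table sizing from the numbers above. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long long device = 1ULL << 40;        /* 1 TiB device */
        unsigned long long entries512 = device >> 9;   /* 2^31 entries */
        unsigned long long entries4k  = device >> 12;  /* 2^28 entries */
        printf("512B units: %llu GiB of table\n", (entries512 * 4) >> 30); /* 8 */
        printf("4K units:   %llu GiB of table\n", (entries4k  * 4) >> 30); /* 1 */
        return 0;
    }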
dude12312414 has quit [Quit: THE RAM IS TOO DAMN HIGH]
<geist>
i fiddled with the WD blue before i put it to use and reformatted it as 4K (sadly you can't do this on samsung devices) and lo and behold the amount of stolen memory went down a lot
<geist>
and actually added up as i expected
<clever>
ive talked to a guy in #raspberrypi that re-formatted a mechanical drive from 4k to 512 mode
<clever>
turns out, the rpi boot firmware, cant deal with 4k drives
<geist>
so TL;DR it's worth your time to use 4K native block sizes especially if your nvme is doing stolen memory mode
<geist>
a SSD with dedicated 1GB ram or so probably benefits less because it already has the ram
<vin>
I didn't understand why having a 4K block size improves performance. Because of a smaller ATT? Then wouldn't 2MB do even better?
<geist>
tradeoffs but yeah
<geist>
actually the SD card whose firmware i got to see translated at 2MB iirc
<geist>
ie an erase zone size
<geist>
but it had a whole journalling system such that it didn't have to do a 2MB RMW on every write
<geist>
it would journal up N pages across M 2MB blocks, then GC those
<vin>
Yup, erases are always block sized; individual pages can be programmed specifically
<geist>
so the full translation was 2MB translation + all of the outstanding journals
<clever>
ive heard of filesystems in linux for raw nand flash, where it just treats the entire device like one big journaled ringbuffer
<clever>
every time you write, you just write something like inode#, block#, and data, appending to a ringbuffer
<clever>
and you gc the other end, to consolidate free space, and copy still in-use blocks to the new write pointer
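A minimal sketch of one such journal record (hypothetical layout; real log-structured flash filesystems also carry CRCs, sequence numbers, and per-erase-block headers):

    #include <stdint.h>

    /* One append-only record: writes append these at the head of the
     * ring, GC copies still-live records forward and erases the tail. */
    struct log_record {
        uint32_t inode;    /* which file the data belongs to */
        uint32_t block;    /* logical block number within that file */
        uint16_t len;      /* payload bytes that follow */
        uint8_t  data[];   /* the data itself */
    };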
<geist>
right, i suspect at some point pretty much anything that deals with nand has some sort of journal, if not just as part of a transaction, since the way the physical device works is very much amenable to that
<geist>
yep, the SD card thing i was talking about was basically exactly that
<geist>
except the whole thing wasn't a journal, the journal itself floated around the device
<geist>
a new block was picked as a journal, it was the current 'head' basically, and new page writes were scribbled down there. as the 2MB block filled up it picked a new one as the journal
<geist>
and then journal blocks were GCed in the background
<clever>
something else i was wondering about flash recently, which does more "damage", the erase cycle, or flipping a bit from the default to non-default state?
<geist>
so it was a combination of an ever increasing journal + 'cold storage' blocks translated at 2MB boundaries
<clever>
could you do less wear, by programming a given region multiple times, even the same bytes?
<geist>
erase cycle
<clever>
ah, so i could massively increase the lifetime, by having a sort of bitfield array, and only flip one bit at a time
<geist>
and yes you can program more than once, but usually can only make bits go to 0 or 1 (depending on how you interpret the cells being 'full' or 'empty' of charge)
<clever>
and once i exhaust every bit in the erase block, then i erase and repeat
<geist>
so there's some trickery there too where you can write an entry, and then go back later and overwrite it, as long as you're only flipping bits in one direction
<clever>
yep
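A minimal sketch of that overwrite trick, assuming flash where programming can only clear bits (1 -> 0) and using hypothetical flag and function names; a slot's state advances by clearing one bit at a time, and nothing is erased until the whole block is spent:

    #include <stdint.h>

    #define FLAG_WRITTEN (1u << 0)   /* cleared once the payload is programmed */
    #define FLAG_RETIRED (1u << 1)   /* cleared later when superseded */

    struct slot {
        uint8_t flags;               /* reads 0xFF right after an erase */
        uint8_t data[15];
    };

    static int slot_is_live(const struct slot *s)
    {
        return !(s->flags & FLAG_WRITTEN) && (s->flags & FLAG_RETIRED);
    }
    /* Retiring a slot is a second program of the same byte that only
     * clears bits: flash_program(&s->flags, s->flags & ~FLAG_RETIRED);
     * (flash_program stands in for the part's write routine.) The block
     * is erased only once every slot in it has been retired. */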
<vin>
clever: programming a particular region multiple times will for sure saturate the block's P/E cycles.
<clever>
the OTP in the rpi plays by the same rules (only 0->1 i believe)
<clever>
and the SPI flash i last read the datasheet on, is 1->0 only
<geist>
an erase cycle is the slow part since it involves basically heating up the cell i believe and 'filling' it with charge
<clever>
with erase returning it to 0xff
<geist>
yah can be 0 -> 1 or 1 -> 0 depending on how you or the controller interprets a full cell
<geist>
some controllers i've seen let you set that
<clever>
yep
<geist>
also the erase cycle is the thing they list as say only can handle 3000 or 1000 times
<clever>
that would basically just be xor'ing a config flag with every bit you read out
<geist>
as i was talking about before, the current hotness where they can take MLC/TLC/QLC flash and erase it as SLC, i think the reason it's faster is that it doesn't go through as deep of an erase cycle
<geist>
since it doesn't have to be as 'precise' about how much it fills the cells, since it's ganging up multiple cells into a single SLC cell
<clever>
do any flash types use analog levels?
<geist>
also why the multilevel cells (MLC/TLC/QLC) are slower, i believe
<geist>
yah i dunno how the sense logic works there, but i guess it's effectively analog
<clever>
i was reading this one earlier, and i believe it stated 100k erase cycle limit
<clever>
down on page 4
<geist>
yep. low density stuff like that can handle a lot more erase cycles
<geist>
the high density MLC/TLC/QLC stuff the number of cycles it handles has been going down somewhat exponentially
<clever>
and the smallest erase block is 4k, but it also has commands for 32k, 64k, and whole-chip erase
<geist>
i think 1k is the current QLC limit
<geist>
that SPI flash thing you've linked is potentially NOR flash, which also handles a lot of erase cycles
<vin>
To avoid redundant invalidation of good blocks when greedy victim selection is done in a superblock, maybe the superblock array should be a set of blocks at different offsets in each plane
<clever>
the protection bits are also surprisingly limited, it can protect none/64k/128k/256k/all, starting at either the top or the bottom of the chip
<clever>
and the protection itself, is configured via a dedicated config register, that is basically just another 8bits of flash memory
<clever>
and the physical write-protect pin, only protects that config register, and nothing else
<clever>
geist: another thing, is erasing flash via UV light
<clever>
ive seen a hackaday article, where somebody decapped an AVR MCU with acid, masked off the program flash with tape, and then used UV light to erase the "fuse bits"
<clever>
to remove the protections that stop you from reading the program memory
<clever>
would that work on all types of flash? would it even go to the same level as a proper erase?
<geist>
ah cute
<geist>
reminds me, i keep meaning to pick up a UV eraser from ebay
<geist>
though mostly deal with eeproms when i have them
<clever>
in the avr case, there is a metal layer over the fuses, for just this reason
<clever>
but if you fire the uv in diagonally, it will bounce between metal layers like a fiber-optic cable, and still hit the cell
<geist>
noice
<clever>
but that makes me wonder, how physically large might the OTP cells in an rpi soc be?
<clever>
if i blast one with UV, will it return to 0 or 1? (1's are permanent normally)
<vin>
What would be good workloads to evaluate FTL policies?
<clever>
i can see some security exploits if i can target a specific OTP register
<clever>
geist: the docs for the new CM4 secureboot were recently found on github, and there are OTP flags to disable developer keys, and disable vpu jtag
* Ameisen_
sees AVR discussion
<clever>
but the docs didnt mention anything about signing the bootcode.bin blob, so i can only assume that broadcoms RSA key is still part of the trust root
<Ameisen_>
Re: erasing the fuse bits: but why?
<clever>
Ameisen_: to dump a protected program
<Ameisen_>
ah
<Ameisen_>
I was trying to contextualize it as why would _I_ want to do that
<Ameisen_>
forgot that other people have different motivations
<clever>
Ameisen_: normally, the LOCK flag protects the code, and you must do a full chip-erase to unlock it, but by erasing fuses with UV light, you can unlock without a full erase
<Ameisen_>
I am 99.9% sure if I tried to do any of that, I would just break the chip permanently.
<clever>
Ameisen_: well, it does involve melting the top off the chip with acid, without breaking any bond wires....
<clever>
the problem, is putting a piece of tape over the 8kb flash array, but leaving the security fuse array exposed
<Ameisen_>
I decapped a Geforce 3 a long time ago. It was not intentional.
<Ameisen_>
back when the GeForce 3 was top-of-the-line.
<Ameisen_>
:(
<Ameisen_>
performance-wise in basically every single aspect other than power usage, couldn't you put a lightweight AVR emulator onto a Cortex-M chip and still... beat any AVR chip?
<Ameisen_>
I've been meaning to tinker with both AVR emulation on AVR and ARM
<clever>
Ameisen_: possibly, the new rp2040 from RPF has 264kb of sram, and has a dual core 133mhz cortex-m0+
<Ameisen_>
the 'on AVR' part being that I want to profile the performance of it emulating itself, to see how slow execution of AVR instructions from memory is.
<clever>
so you could definitely try and emulate avr there, but getting deterministic execution out may be a bit tricky
<Ameisen_>
this is true. AVR is entirely deterministic clock-wise
<Ameisen_>
so your emulation layer would have to try to take things like that into account, particularly in regards to inputs
<Ameisen_>
hrmm
<clever>
the rp2040's cortex-m0+ is also deterministic, but if 2 bus masters fight over a bus slave (like a ram bank), one of them will have to stall
<clever>
but you can set a priority, so a certain master always wins
gxt_ has quit [Remote host closed the connection]
<Ameisen_>
it'd still be an interesting project
<Ameisen_>
though I want to get an emulator running on the AVR itself, first
<Ameisen_>
as I want to execute AVR machine code from RAM
gxt_ has joined #osdev
<Ameisen_>
Mainly I'm curious what the performance would be like (awful is expected, but _how_ awful)
<Ameisen_>
plus it will be interesting to microoptimize such an emulator - it's easier to do that on AVR than, say, x86
<clever>
but platforms like the rp2040 dont need such hacks, since they can just run code from ram directly
pretty_dumm_guy has quit [Quit: WeeChat 3.3]
<Ameisen_>
Right.
<Ameisen_>
Such an emulator can just do a static recompile of the AVR program
<Ameisen_>
can't do that when running AVR on AVR though
<Ameisen_>
I'm basically thinking something like x86emu
<Ameisen_>
though I absolutely don't want to emulate x86 on AVR. Though that'd be an interesting experiment; but I don't think the emulator would fit in AVR's program memory... you'd have to first have an AVR emulator, then the x86 emulator running in the AVR emulator on the AVR...
<clever>
Ameisen_: oh, that reminds me, of another avr emulator project
<clever>
Ameisen_: https://spritesmods.com/?art=avrcpm a z80 emulator, complete with SD and DRAM bit-banging, and a CP/M bios, so it can boot full CP/M
<Ameisen_>
I want to add an AVR target to vemips, but I have zero idea how to handle program memory separation there
<Ameisen_>
I had a crazy idea to let vemips load binaries of basically any supported target and let them interact.
<Ameisen_>
but I'd have to figure out a way for it to know the difference between an address to 'program memory' and to normal memory
<Ameisen_>
particularly when addresses might get passed to functions that were originally from a target that had no such concept
<Ameisen_>
best I can think is some sort of prefix or suffix with the address
<clever>
Ameisen_: i think the avr-gcc toolchain, just uses some extra bits in the 32bit addr, to denote if its flash or ram
<Ameisen_>
Yeah, but I cannot rely on that for this.
<clever>
and its up in the range where those bits dont exist on real hw
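For what it's worth, the default avr-ld memory map places data memory at offset 0x800000 while flash starts at 0, so a 32-bit link-time address carries its space in the high bits; a loader could classify them roughly like this (a sketch only, real AVR hardware never sees those bits):

    #include <stdint.h>

    enum addr_space { SPACE_FLASH, SPACE_RAM };

    /* avr-ld places .data at 0x800000 in the link map; real AVR
     * addresses are 16 bits, so the tag exists only at link time. */
    static enum addr_space classify(uint32_t link_addr)
    {
        return (link_addr & 0x800000) ? SPACE_RAM : SPACE_FLASH;
    }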
<Ameisen_>
I'm saying that this would be able to load a MIPS32r6 library, an 8-bit AVR binary, and they could interact in the vemips environment
<Ameisen_>
but if the AVR binary passed an address to a function that originally came from MIPS, and the MIPS-side, say, called it
<Ameisen_>
the MIPS-side would have to know in the interpreter that it needs to point to a virtualized program memory space
<Ameisen_>
setting some flags in the internals of the interpreter for the value could work
<Ameisen_>
just tricky
<Ameisen_>
the values from the AVR-side wouldn't be universal pointers; they'd probably be normal bare 16-bit ones
<Ameisen_>
I'm just not sure how the interpreter would actually know that it's a program memory address if it's being passed as, say, an argument
<Ameisen_>
the idea breaks down at that point
<clever>
that gets into the printf vs printf_P stuff i think?
<clever>
avr-libc has variants of most functions, that expect the pointer to be pointing to flash instead of ram
<clever>
so you can do printf_P(PSTR("foo %d bar %d\n"), foo, bar);
<Ameisen_>
Yes; but as said, in this case I'm talking about an interpreter that can load both AVR and MIPS binaries and have them interact
<clever>
and it wont waste a dozen bytes of ram on the string
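A minimal self-contained example of the avr-libc flash-string variants being described (assumes stdout has been wired up elsewhere, as avr-libc requires):

    #include <avr/pgmspace.h>
    #include <stdio.h>

    /* PSTR() places the literal in program memory; printf_P reads the
     * format string through the flash address space, so no RAM copy
     * of the string is ever made. */
    static void report(int foo, int bar)
    {
        printf_P(PSTR("foo %d bar %d\n"), foo, bar);
    }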
<Ameisen_>
without having any real awareness of one another
<clever>
you would likely need a type code on each function you can pass to the interpreter
<Ameisen_>
the binaries shouldn't need to know about the interpreter ;P
<Ameisen_>
that's where it breaks down
<Ameisen_>
generic AVR binaries wouldn't provide enough information to the interpreter to resolve this
<Ameisen_>
they'd have to be ones specifically built for the purpose
<Ameisen_>
and that's sorta lame
Oli has quit [Read error: Connection reset by peer]