wingsorc has quit [Remote host closed the connection]
wingsorc has joined #ocaml
Haudegen has joined #ocaml
spip has quit [Quit: Konversation terminated!]
bgs has joined #ocaml
motherfsck has quit [Quit: quit]
Serpent7776 has joined #ocaml
bartholin has joined #ocaml
m5zs7k has quit [Ping timeout: 240 seconds]
m5zs7k has joined #ocaml
Serpent7776 has quit [Ping timeout: 240 seconds]
mro has joined #ocaml
wingsorc__ has joined #ocaml
wingsorc has quit [Ping timeout: 265 seconds]
bartholin has quit [Quit: Leaving]
Serpent7776 has joined #ocaml
olle has joined #ocaml
xd1le has joined #ocaml
<adrien>
for each byte of [0..200M] I might currently store int*int*int; I'm on a 64-bit machine, so that means at least 24 bytes per entry and therefore at least 10GB of data. This could be int32*int32*int16, which would use roughly half the space, except I don't know if that would work due to boxing
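As a rough check of the figures above, here is a small sketch of the arithmetic. It assumes the usual OCaml block layout (one header word plus one word per field) and ignores the map's own nodes; the function names are made up for illustration.

```ocaml
(* Back-of-the-envelope sizes; word_bytes is 8 on a 64-bit machine. *)
let word_bytes = Sys.word_size / 8

(* A boxed block with [fields] fields costs one header word plus its fields. *)
let block_bytes ~word_bytes ~fields = (1 + fields) * word_bytes

(* Total for [entries] records of [per_entry] bytes each. *)
let total_bytes ~entries ~per_entry = entries * per_entry

let () =
  let entries = 200_000_000 in
  let boxed = block_bytes ~word_bytes ~fields:3 in
  Printf.printf "boxed:  %d bytes/entry, %d bytes total\n"
    boxed (total_bytes ~entries ~per_entry:boxed);
  (* 4 + 4 + 2 bytes for a hand-packed int32 * int32 * int16 record. *)
  Printf.printf "packed: %d bytes/entry, %d bytes total\n"
    10 (total_bytes ~entries ~per_entry:10)
```

On a 64-bit machine this gives 32 bytes per boxed triple against 10 bytes per packed one, before any container overhead is counted.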
<octachron>
With boxing? Are you thinking of using Int32.t? This is not the right solution: you have to do the packing yourself. With Bytes and a custom operator that should not be too painful.
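A minimal sketch of what "packing yourself with Bytes and a custom operator" could look like. The 4+4+2 byte little-endian layout, the names `entry_size`, `get`, `set`, and the choice of the `.%{}` operator are all assumptions for illustration, not something stated in the discussion. It also assumes a 64-bit machine, so that `Int32.to_int` is lossless.

```ocaml
(* Hypothetical layout: 4 bytes a | 4 bytes b | 2 bytes c, little-endian. *)
let entry_size = 10

(* Write the triple at slot [i] of [buf]. [a] and [b] must fit in 32 bits
   (signed) and [c] in 16 bits (unsigned). *)
let set buf i (a, b, c) =
  let off = i * entry_size in
  Bytes.set_int32_le buf off (Int32.of_int a);
  Bytes.set_int32_le buf (off + 4) (Int32.of_int b);
  Bytes.set_uint16_le buf (off + 8) c

(* Read the triple back from slot [i]. *)
let get buf i =
  let off = i * entry_size in
  ( Int32.to_int (Bytes.get_int32_le buf off),
    Int32.to_int (Bytes.get_int32_le buf (off + 4)),
    Bytes.get_uint16_le buf (off + 8) )

(* Custom indexing operators, so call sites read like array accesses. *)
let ( .%{} ) = get
let ( .%{}<- ) = set

let () =
  let buf = Bytes.create (4 * entry_size) in
  buf.%{2} <- (1 lsl 30, 123_456, 273);
  let a, b, c = buf.%{2} in
  Printf.printf "%d %d %d\n" a b c
```

The `Int32.t` values here only exist transiently between the native int and the buffer; nothing boxed is stored, which is the point of packing by hand.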
<adrien>
I've also managed to bring the memory use down to 17GB or less (or 14GB but at a much higher CPU cost) by avoiding loading up all the data upfront; that data didn't use that much memory, but it had to be kept around during the most intensive operation
<adrien>
ok, thanks, that's what I thought; I'll probably get around to doing it at some point today
<adrien>
indeed, I think it's going to be doable because I only have a few places that need to be changed (but I need to clean the surrounding code first)
<adrien>
and a possible subsequent optimization would be to re-implement BatIMap on top of a fixed-size array
kakadu has joined #ocaml
azimut has quit [Ping timeout: 255 seconds]
waleee has quit [Quit: WeeChat 3.8]
waleee has joined #ocaml
<adrien>
my type is "type t = | Literal | Match of int * int * int | Padding"; shall I re-encode everything or only the int*int*int part?
<octachron>
How is the data stored?
<octachron>
A first change might be to replace `Match of int * int * int` by `Match of int * int`.
<adrien>
Batteries' BatIMap, which I think is an AVL-tree
<adrien>
I've packed the three ints into 10 bytes and it's currently running; unfortunately the first two ints can easily reach 2^30
<adrien>
the last one is <= 273
<adrien>
(this is LZMA's match finder btw)
mro has quit [Remote host closed the connection]
mro has joined #ocaml
<octachron>
Hm, it should be possible to compress further by storing native ints in the map to represent either Literal | Padding or an offset into a bytes array, which avoids the block header.
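One possible reading of this suggestion, as a sketch only: the map holds plain ints, where two reserved negative values stand for the constant cases and any non-negative value is a slot number in a side buffer of packed records. The sentinel values, the 10-byte slot layout, and the names `encode_match` and `decode` are assumptions made for this example.

```ocaml
(* The variant from the discussion, repeated here for comparison. *)
type t = Literal | Match of int * int * int | Padding

(* Hypothetical encoding: negative ints are the constant cases,
   non-negative ints are slot numbers in a buffer of 10-byte records. *)
let literal = -1
let padding = -2
let slot_size = 10

(* Store a match in slot [slot] of [buf] and return the int to keep in the map. *)
let encode_match buf slot (a, b, c) =
  let off = slot * slot_size in
  Bytes.set_int32_le buf off (Int32.of_int a);
  Bytes.set_int32_le buf (off + 4) (Int32.of_int b);
  Bytes.set_uint16_le buf (off + 8) c;
  slot

(* Rebuild the variant on demand; the Match block is short-lived. *)
let decode buf code =
  if code = literal then Literal
  else if code = padding then Padding
  else begin
    let off = code * slot_size in
    Match
      ( Int32.to_int (Bytes.get_int32_le buf off),
        Int32.to_int (Bytes.get_int32_le buf (off + 4)),
        Bytes.get_uint16_le buf (off + 8) )
  end

let () =
  let buf = Bytes.create (8 * slot_size) in
  let code = encode_match buf 3 (1 lsl 30, 42, 273) in
  match decode buf code with
  | Match (a, b, c) -> Printf.printf "Match (%d, %d, %d)\n" a b c
  | Literal -> print_endline "Literal"
  | Padding -> print_endline "Padding"
```

With this scheme every value stored in the map is an immediate int, so no per-entry block (header plus three fields) stays alive in the heap.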
<adrien>
I would only have the array after I've built the map I think
<adrien>
also, I messed up and I'm currently using int*int*int where all of them are offsets; I wanted to switch to what amounts to "pos, pos+len, other_pos" but I haven't done it yet and it requires a bit of care (LZMA offsets are not really 0-based)
<adrien>
and I have a concern with the array: I'm not sure I would be able to allocate one that is large enough
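Whether a single buffer can be that large can be checked against the runtime's own limits. The sketch below uses `Sys.max_string_length` (the cap on `Bytes.create`) and `Sys.max_array_length`; the helper name `fits_in_bytes` and the 200M x 10 bytes figures are assumptions taken from the numbers mentioned earlier in the conversation.

```ocaml
(* True when [entries] records of [entry_bytes] bytes fit in a single Bytes.t. *)
let fits_in_bytes ~entries ~entry_bytes =
  entries <= Sys.max_string_length / entry_bytes

let () =
  Printf.printf "max Bytes length: %d\n" Sys.max_string_length;
  Printf.printf "max array length: %d\n" Sys.max_array_length;
  Printf.printf "200M x 10 bytes fits: %b\n"
    (fits_in_bytes ~entries:200_000_000 ~entry_bytes:10)
```

On a 64-bit system both limits are far above 2GB, so the practical constraint is available memory rather than the runtime's maximum sizes.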
Haudegen has quit [Ping timeout: 250 seconds]
<adrien>
I tried with 12 bytes and I'm not getting a lower memory usage (but I'm getting assert failures :) )
<adrien>
I'll stop there for now
wingsorc__ has quit [Ping timeout: 255 seconds]
spip has joined #ocaml
mro has quit [Remote host closed the connection]
mro has joined #ocaml
emp_ has quit [Ping timeout: 252 seconds]
John_Ivan has quit [Remote host closed the connection]