barak has quit [Remote host closed the connection]
barak has joined #ocaml
pieguy128 has quit [Ping timeout: 252 seconds]
pieguy128 has joined #ocaml
barak has quit [Ping timeout: 248 seconds]
barak has joined #ocaml
szkl has quit [Quit: Connection closed for inactivity]
barak has quit [Ping timeout: 265 seconds]
spip has quit [Quit: Konversation terminated!]
barak has joined #ocaml
Haudegen has joined #ocaml
mbuf has joined #ocaml
motherfsck has joined #ocaml
barak has quit [Remote host closed the connection]
barak has joined #ocaml
barak_ has joined #ocaml
barak has quit [Ping timeout: 246 seconds]
bgs has joined #ocaml
Serpent7776 has joined #ocaml
wingsorc__ has quit [Quit: Leaving]
barak_ has quit [Remote host closed the connection]
barak_ has joined #ocaml
barak_ has quit [Remote host closed the connection]
barak_ has joined #ocaml
Techcable has quit [Ping timeout: 250 seconds]
waleee has quit [Quit: WeeChat 3.8]
barak__ has joined #ocaml
barak_ has quit [Ping timeout: 252 seconds]
bartholin has joined #ocaml
bgs has quit [Remote host closed the connection]
barak__ has quit [Ping timeout: 256 seconds]
barak has joined #ocaml
famubu has quit [Ping timeout: 248 seconds]
barak_ has joined #ocaml
barak has quit [Ping timeout: 250 seconds]
Techcable has joined #ocaml
szkl has joined #ocaml
mro has joined #ocaml
olle has joined #ocaml
azimut has quit [Ping timeout: 255 seconds]
bartholin has quit [Quit: Leaving]
barak_ has quit [Ping timeout: 240 seconds]
<adrien>
that's probably the epilogue of my memory optimization journey for this: I think the intervals in the interval tree were only 2 or 3 bytes long; that's pretty terrible and the interval tree added a huge cost for tiny savings
<adrien>
I switched to a bigarray that's the same length as the file and holds Int32; that makes the usage 4*input_size and the algorithm simpler (although I wouldn't have arrived at that algorithm without first using BatIMap)
<adrien>
and I populate a map of BatISet.t as before so that I can easily compute their intersection
<adrien>
I was kind of reluctant to actively share code for which the usage instruction started like "step 0: get 64GB of RAM"
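
A minimal sketch of the layout adrien describes above (a bigarray with one Int32 cell per input byte), not taken from the compsort code. The names `make_table` and `table_bytes` are hypothetical, and the log does not say what each cell holds, so the sketch only zero-fills the array. It assumes OCaml 4.07 or later, where Bigarray is part of the standard library.

```ocaml
(* Hypothetical sketch: one Int32 cell per byte of input, so the array's
   storage is 4 * input_size bytes, allocated outside the OCaml heap. *)
let make_table (input_size : int) =
  let open Bigarray in
  let t = Array1.create int32 c_layout input_size in
  (* Cell contents are an assumption: the log does not say what is stored. *)
  Array1.fill t 0l;
  t

(* Size of the table's storage in bytes. *)
let table_bytes t = Bigarray.Array1.size_in_bytes t
```

For a 1000-byte input, `table_bytes (make_table 1000)` is 4000, which is the 4*input_size figure quoted above.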
kakadu has joined #ocaml
spip has joined #ocaml
<olle>
adrien: What about moving data to database? Not possible?
<adrien>
well, for starters I don't need to optimize things further because usage is down to a few gigabytes at most (and I might have some low-hanging fruit which is now more noticeable)
<adrien>
but more importantly, I need to loop over the array in a quadratic fashion
waleee has joined #ocaml
<olle>
Pre-fetch the parts you need from db before you arrive at them? :)
<olle>
But yeah, few gigs is ok then
<adrien>
one possible improvement is to split that bigarray into several smaller ones (up to 6000 with my current test cases, but of uneven sizes)
<adrien>
which would allow moving 90% of that into cold storage
<adrien>
(cold-ish but it could be refined)
<olle>
Yeah, but doesn't sound necessary anymore
<adrien>
I'll probably be working on parallelization next
<adrien>
but not this month
Haudegen has quit [Ping timeout: 276 seconds]
<adrien>
but I'll add a progress indicator
<discocaml>
<darrenldl> adrien: i'm going to ask a silly question which you may have already answered long ago - is the repo online?
<companion_cube>
adrien: what was the max integer? Could bitvectors work?
mro has quit [Remote host closed the connection]
mro has joined #ocaml
waleee has quit [Ping timeout: 240 seconds]
waleee has joined #ocaml
waleee has quit [Quit: WeeChat 3.8]
szkl has quit [Quit: Connection closed for inactivity]
waleee has joined #ocaml
<adrien>
companion_cube: for intersection of BatISet.t? that's possible, in any case I'll probably do something for performance fairly soon because O(n²) is too painful on one of my tests and counting the number of elements in both sets can be much faster
<companion_cube>
bitvectors, only, work better for a large set of values
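
A rough sketch of what companion_cube's bitvector suggestion could look like for counting the elements common to two sets without building the intersection. This is an illustration only: it uses a plain `Bytes`-backed bitvector rather than BatISet or any particular bitvector library, and `create`, `add`, `popcount8` and `inter_count` are hypothetical names.

```ocaml
(* A plain bitvector over [0, n): bit i is set when i is in the set. *)
let create n = Bytes.make ((n + 7) / 8) '\000'

(* Add element i to the set (i must be within the range given to create). *)
let add s i =
  let b = Char.code (Bytes.get s (i lsr 3)) in
  Bytes.set s (i lsr 3) (Char.chr (b lor (1 lsl (i land 7))))

(* Number of set bits in a byte-sized integer. *)
let popcount8 b =
  let rec go b acc = if b = 0 then acc else go (b land (b - 1)) (acc + 1) in
  go b 0

(* Size of the intersection of two bitvectors of the same length. *)
let inter_count a b =
  let n = ref 0 in
  for k = 0 to Bytes.length a - 1 do
    let x = Char.code (Bytes.get a k) land Char.code (Bytes.get b k) in
    n := !n + popcount8 x
  done;
  !n
```

The cost is proportional to the size of the value range rather than to the number of elements, so whether it beats an interval set depends on how dense the sets are.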
<adrien>
and the code is at https://gitlab.com/adrien-n/compsort/ but it's completely lacking documentation, is ugly, contains various weird things/attempts to limit memory usage, and you can still see the first version was shell script
<adrien>
oh, I could look at them indeed; I remember seeing them a few months ago and thinking I definitely didn't have a use for them :P
<adrien>
but I might not have concerns about memory usage here
<adrien>
the first step is a modified run of "xz" which takes a lot of memory (10 * input_size) and if my program doesn't use more, it's not going to be a bottleneck
<adrien>
(a future change is to implement a match finder; this might reduce the memory usage, in which case it might make sense to revisit other steps)
waleee has quit [Ping timeout: 246 seconds]
Anarchos has joined #ocaml
<adrien>
and basically, the program is about optimizing the order of files in archives that are to be compressed; there can be some significant gains and for distributions, it is an operation that can be done very infrequently
barak has joined #ocaml
<adrien>
companion_cube: but roaring bitmaps could reduce memory usage by a lot; I'll have to compare the CPU time
<companion_cube>
reducing memory => reducing time, often
<adrien>
yup but in this case I think I need a bigarray of the same size as my first bigarray, then for a small-ish range of the first bigarray, store 1 in corresponding cells of the second bigarray, then for another small-ish range of the first bigarray, look for 1s in corresponding cells of the second bigarray and count how many matches there are
<adrien>
that's 2*k where k is the average size of files in the archive, which is fairly small
<adrien>
and overall that should be like going through one of these bigarrays twice only
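
The mark-and-count idea adrien outlines above can be sketched as follows. This is one reading of the description, not the compsort implementation: it assumes each Int32 cell of the first bigarray holds an index into a second bigarray of the same size, and `count_matches` with its `(start, length)` range arguments is a hypothetical name and signature.

```ocaml
open Bigarray

(* Assumption: every cell of [refs] holds an index in [0, dim marks). *)
let count_matches
    (refs : (int32, int32_elt, c_layout) Array1.t)
    (marks : (int, int8_unsigned_elt, c_layout) Array1.t)
    (a_start, a_len) (b_start, b_len) =
  (* Store 1 in the cell corresponding to each value of the first range. *)
  for i = a_start to a_start + a_len - 1 do
    marks.{Int32.to_int refs.{i}} <- 1
  done;
  (* Count cells of the second range whose corresponding cell holds 1. *)
  let n = ref 0 in
  for j = b_start to b_start + b_len - 1 do
    if marks.{Int32.to_int refs.{j}} = 1 then incr n
  done;
  (* Clear only the cells set above, so [marks] can be reused. *)
  for i = a_start to a_start + a_len - 1 do
    marks.{Int32.to_int refs.{i}} <- 0
  done;
  !n
```

Each call touches the first range twice and the second range once, so the work per pair of ranges is proportional to their lengths rather than to the size of the whole bigarray. `marks` must be zero-filled before the first call.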
mro has quit [Quit: Leaving...]
Myrl-saki has quit [Ping timeout: 265 seconds]
Anarchos has quit [Quit: Vision[]: i've been blurred!]
chrisz has quit [Ping timeout: 248 seconds]
barak has quit [Ping timeout: 246 seconds]
chrisz has joined #ocaml
motherfsck has quit [Ping timeout: 255 seconds]
motherfsck has joined #ocaml
Haudegen has joined #ocaml
olle has quit [Ping timeout: 248 seconds]
Anarchos has joined #ocaml
motherfsck has quit [Ping timeout: 248 seconds]
oriba has joined #ocaml
berberman_ has quit [Ping timeout: 256 seconds]
Exa has quit [Quit: see ya!]
joseemds has joined #ocaml
mbuf has quit [Quit: Leaving]
joseemds has quit [Client Quit]
Anarchos has quit [Quit: Vision[]: i've been blurred!]
Serpent7776 has quit [Quit: leaving]
bartholin has joined #ocaml
berberman has joined #ocaml
alexherbo2 has joined #ocaml
<discocaml>
<darrenldl> oh wew, getting started with eio takes quite a bit of elbow grease
Exa has joined #ocaml
szkl has joined #ocaml
waleee has joined #ocaml
alexherbo2 has quit [Ping timeout: 260 seconds]
alexherbo2 has joined #ocaml
olle has joined #ocaml
Stumpfenstiel has joined #ocaml
olle has quit [Ping timeout: 240 seconds]
barak has joined #ocaml
gentauro has quit [Read error: Connection reset by peer]