rgrinberg has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
Haudegen has joined #ocaml
<greenbagels>
unrelated: so taking substrings of strings always allocates and copies the substring, right? So say i'm trying to parse a TSV file but only care about fields with indices 2 and 17 out of 30 or something
<greenbagels>
if i have a lot of lines in my file, using String.split is pretty wasteful of time and memory if i only need the two fields, right? and similarly, writing a tokenizer that keeps taking substrings until the 17th field would also lead to a lot of new allocations and wouldn't be good either
<greenbagels>
so what is the best / most idiomatic solution for something like this? I feel like tokenizing / manipulating strings is one of my weakest skills in ocaml
<greenbagels>
I guess I could just scan the string with a for loop and an index but there's got to be a performant high level way to tackle it
mbuf has quit [Quit: Leaving]
sim642 has quit [Ping timeout: 260 seconds]
waleee has quit [Ping timeout: 256 seconds]
sim642 has joined #ocaml
bgs has joined #ocaml
rgrinberg has joined #ocaml
rgrinberg has quit [Ping timeout: 256 seconds]
rgrinberg has joined #ocaml
rgrinberg has quit [Ping timeout: 240 seconds]
mro has joined #ocaml
jao has quit [Ping timeout: 260 seconds]
bartholin has joined #ocaml
xd1le has joined #ocaml
troydm has quit [Ping timeout: 268 seconds]
sparogy has quit [Remote host closed the connection]
sparogy has joined #ocaml
sparogy has quit [Remote host closed the connection]
sparogy has joined #ocaml
Serpent7776 has joined #ocaml
Serpent7776 has quit [Ping timeout: 240 seconds]
Serpent7776 has joined #ocaml
cizra has left #ocaml [#ocaml]
<zozozo>
greenbagels: you could write some kind of iter function that would scan the string and call a given function for each field, while only allocating the substring if needed
<zozozo>
something like type range = int * int (* start + offset *), val fold_on_fields : string -> (int -> range -> 'acc -> 'acc) -> 'acc -> 'acc, and a function val extract_field : string -> range -> string
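A minimal sketch of the interface zozozo describes, assuming tab-separated input and reading `range` as (start, length). The optional `?sep` argument and the helper `fields_2_and_17` are illustrative additions, not part of the original suggestion:

```ocaml
(* A field is described by where it starts in the line and how long it is. *)
type range = int * int (* start, length *)

(* Fold over the fields of [s] without allocating any substring.
   [f] receives the field index, the field's range, and the accumulator. *)
let fold_on_fields ?(sep = '\t') (s : string)
    (f : int -> range -> 'acc -> 'acc) (init : 'acc) : 'acc =
  let n = String.length s in
  let rec go idx start acc =
    (* Position of the next separator, or end of string if there is none. *)
    let stop =
      match String.index_from_opt s start sep with
      | Some i -> i
      | None -> n
    in
    let acc = f idx (start, stop - start) acc in
    if stop >= n then acc else go (idx + 1) (stop + 1) acc
  in
  go 0 0 init

(* The only place a substring is actually allocated. *)
let extract_field (s : string) ((start, len) : range) : string =
  String.sub s start len

(* Hypothetical usage: keep only fields 2 and 17 of a line. *)
let fields_2_and_17 (line : string) : string option * string option =
  fold_on_fields line
    (fun idx r (a, b) ->
      if idx = 2 then (Some (extract_field line r), b)
      else if idx = 17 then (a, Some (extract_field line r))
      else (a, b))
    (None, None)
```

Note that this version still scans the whole line after field 17; stopping early would need an exception or a different interface.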
bartholin has quit [Quit: Leaving]
olle has joined #ocaml
mro has quit [Remote host closed the connection]
mro has joined #ocaml
mro has quit [Remote host closed the connection]
szkl has quit [Quit: Connection closed for inactivity]
mro has joined #ocaml
keyboard has joined #ocaml
troydm has joined #ocaml
mro has quit [Remote host closed the connection]
mro has joined #ocaml
szkl has joined #ocaml
bobo has quit [Quit: Konversation terminated!]
Techcable has quit [Ping timeout: 256 seconds]
mro has quit [Remote host closed the connection]
mro has joined #ocaml
John_Ivan_ has joined #ocaml
John_Ivan has quit [Ping timeout: 256 seconds]
mro has quit [Remote host closed the connection]
rgrinberg has joined #ocaml
spip has joined #ocaml
Haudegen has quit [Quit: Bin weg.]
pie_ has quit []
pie_ has joined #ocaml
rgrinberg has quit [Quit: My MacBook has gone to sleep. ZZZzzz…]
rgrinberg has joined #ocaml
Haudegen has joined #ocaml
Everything has joined #ocaml
spip has quit [Ping timeout: 240 seconds]
spip has joined #ocaml
troydm has quit [Ping timeout: 248 seconds]
mro has joined #ocaml
<companion_cube>
tbh greenbagels, if the strings are short lived, it's fine
<companion_cube>
they'll die in the next minor collection
jao has joined #ocaml
mro has quit [Remote host closed the connection]
gwizon has joined #ocaml
azimut_ has joined #ocaml
mro has joined #ocaml
azimut has quit [Ping timeout: 255 seconds]
gwizon has quit [Client Quit]
mro has quit [Remote host closed the connection]
mro has joined #ocaml
azimut_ has quit [Ping timeout: 255 seconds]
azimut has joined #ocaml
pqwy[m] has quit [Quit: You have been kicked for being idle]
milia has quit [Quit: leaving]
<greenbagels>
zozozo: yeah i guess i can just write my own strtok
<zozozo>
greenbagels: and as companion_cube said, you can first try a version that allocates all tokens, and see how it behaves
<zozozo>
and then, if needed/wanted, try a more optimised version (and compare it to the simple version)
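The "simple version" could look roughly like this: split every line with the standard library and pick the two fields out of the resulting list. The names `fields_simple` and `time` are made up for illustration, and `Sys.time` only gives coarse processor time, so treat any numbers as a rough comparison rather than a real benchmark:

```ocaml
(* Baseline: allocate every field of the line, then keep fields 2 and 17.
   The intermediate list and strings are short-lived. *)
let fields_simple (line : string) : string option * string option =
  let fields = String.split_on_char '\t' line in
  (List.nth_opt fields 2, List.nth_opt fields 17)

(* Crude timing helper: prints the processor time taken by [f ()]
   and returns its result. *)
let time (label : string) (f : unit -> 'a) : 'a =
  let t0 = Sys.time () in
  let r = f () in
  Printf.printf "%s: %.3fs\n" label (Sys.time () -. t0);
  r
```

Something like `time "simple" (fun () -> List.map fields_simple lines)` can then be compared against whatever optimised variant is being tried.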
<greenbagels>
yeah i think i previously had issues with String.split_on_char being slow, but that might be because it tries to keep all tokens in a single list at a given time
<greenbagels>
(and who knows it could very well have just been the slowest part of my program for normal reasons)
<zozozo>
indeed
<companion_cube>
yeah do it line by line
<companion_cube>
I mean, if you have a really large csv file.
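One way to do it line by line, as a sketch: fold over the channel with `input_line`, so only the current line is live at any time. The names `fold_lines` and `with_file` are illustrative, and `Fun.protect` assumes OCaml 4.08 or later:

```ocaml
(* Fold [f] over the lines of [ic], one line at a time.
   Tail-recursive, so the line count does not affect stack usage. *)
let fold_lines (ic : in_channel) (f : 'acc -> string -> 'acc) (init : 'acc) : 'acc =
  let rec go acc =
    match input_line ic with
    | line -> go (f acc line)
    | exception End_of_file -> acc
  in
  go init

(* Open [path], run [f] on the channel, and close it even if [f] raises. *)
let with_file (path : string) (f : in_channel -> 'a) : 'a =
  let ic = open_in path in
  Fun.protect ~finally:(fun () -> close_in_noerr ic) (fun () -> f ic)
```

For example, `with_file path (fun ic -> fold_lines ic (fun n _ -> n + 1) 0)` counts the lines without ever holding more than one of them.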
<greenbagels>
it's genomic data, so unfortunately small line counts are not in my future today :(
<greenbagels>
haha
<greenbagels>
but yeah i will play with it, thanks for the ideas
<zozozo>
greenbagels: i can sympathize, I have a project that implements a parser and typer for some languages where problems can be ridiculously huge (for instance, recently I was testing it on a 2 GB text file)
<greenbagels>
oh lord
<zozozo>
yeah, it brings about a whole new set of problems, XD
szkl has quit [Quit: Connection closed for inactivity]