verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
pbrunet has joined #mlpack
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#212 (master - 236d5bc : Marcus Edel): The build is still failing.
< pbrunet>
hi, I am looking for ML libs in C++ and found mlpack. I tried to compare it with dlib and shark but I didn't find any benchmarks. Did you bench these libs?
< naywhayare>
pbrunet: we don't have any benchmarking scripts for dlib or shark, unfortunately
< naywhayare>
the scripts tend to actually be pretty easy to write, so if you know what you want to compare, I can help write the script and integrate it into the benchmarking system
< pbrunet>
ok :-( Do you have a good reason for not comparing with them? Or do you just not have time for it?
< naywhayare>
I think when we originally wrote the benchmarking scripts we maybe weren't aware of dlib and shark implementations
< naywhayare>
but otherwise no particularly good reason, just lack of time :)
< naywhayare>
I met a shark developer at ICML some time back... but I can't remember who he was or anything else about the interaction
< naywhayare>
what specific algorithms are you interested in benchmarks for? I can look into it further
< pbrunet>
:-) thanks for this information. Indeed, I am looking for an ML lib with really good performance to handle 100,000,000 data points with 200 dimensions each. It looks like only mlpack, dlib and shark are able to do this, but I haven't benchmarked them yet
< pbrunet>
Basically, we will start with a simple kmeans, but it may change in the future.
< naywhayare>
yeah, I think mlpack can do kmeans on that large of a dataset
< naywhayare>
if you use the command-line interface, the "-a" option lets you select the algorithm. I'd try the "elkan" and "hamerly" algorithms for 200-dimensional data... the "naive" algorithm will be too slow for that much data, and the "blacklist" and "dualtree" algorithms may be slow in such high dimensions
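For reference, a full invocation along the lines discussed here might look like the following (the file names are placeholders; -i/-c/-o/-a are the options mentioned in this conversation):

    mlpack_kmeans -i dataset.csv -c 5 -a hamerly -o assignments.csv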
< naywhayare>
I'll look into dlib and shark implementations and see if I can easily add them to the benchmarking system (it may take me a while)
< pbrunet>
Thanks a lot for this help. I will also prepare some scripts here to test it on our data.
< pbrunet>
Just to know, what is the BLAS lib used for benchmarking this? And does it run in parallel? SIMD? (I guess yes...)
< naywhayare>
for the benchmarks posted on the website, ATLAS is used
< naywhayare>
to the best of my knowledge, ATLAS is singlethreaded
< naywhayare>
but you can use Armadillo with OpenBLAS, which is parallel, and that may give additional speedup
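A rough sketch of what building against OpenBLAS can look like, assuming a typical Linux install that ships libopenblas (the source file name is a placeholder; ARMA_DONT_USE_WRAPPER tells Armadillo to call BLAS/LAPACK directly instead of going through its runtime wrapper library):

    g++ my_program.cpp -o my_program -O2 -DARMA_DONT_USE_WRAPPER -lopenblas -llapack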
< naywhayare>
also, if you're willing to share any scripts you have to run dlib/shark kmeans, that'll make it a lot easier to add them to the benchmarking system :)
< pbrunet>
:-) I will share my scripts once they work. For information, ATLAS can be parallel, but on "big" computers ATLAS is pretty slow compared to OpenBLAS or the MKL. Do you add some extra SIMD/parallelism in the mlpack codebase?
< naywhayare>
at the moment, only the density estimation tree code has OpenMP
< naywhayare>
and there isn't yet any special code to exploit SIMD instructions
< naywhayare>
(I would like to add that; I just haven't had time)
< naywhayare>
the compiler *might* be auto-generating SIMD instructions through Armadillo, but I haven't checked
< pbrunet>
I am also a core dev on an open-source project and I understand your "I just haven't had time". Thanks for this information.
< naywhayare>
:)
< naywhayare>
the list of things I want to do is far longer than the list of things I have time to do :(
< pbrunet>
Also, I see you need column-major values. I guess it is because BLAS uses this representation? Couldn't using CBLAS solve this problem?
< naywhayare>
we use column-major matrices mostly because Armadillo was designed that way
< naywhayare>
the loading and saving functions in mlpack automatically transpose to the correct representation, so if your data is on disk as CSVs with one observation per row, it'll be loaded correctly
< naywhayare>
(assuming you use the mlpack::data::Load() function, or the command-line executables that mlpack comes with)
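A minimal sketch of that from C++ (the file name is a placeholder; this assumes the mlpack::data::Load() overload whose transpose option defaults to true):

    #include <mlpack/core.hpp>
    #include <iostream>

    int main()
    {
      arma::mat dataset;
      // Load() transposes by default, so a CSV with one observation per
      // row is loaded with one observation per column (column-major).
      mlpack::data::Load("dataset.csv", dataset, true); // fatal = true
      std::cout << dataset.n_rows << " dimensions, "
                << dataset.n_cols << " points." << std::endl;
    }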
< pbrunet>
ok, so for now, I have a file with many rows and each row contains a lot of values. I run kmeans with: mlpack_kmeans -c 5 -o HERE.csv -i CFC_1000_Accelerometers_9019119_Velocity_magnitude_mm_millisec_001.csv and it only uses the first value: Size is 1 x 6428. Is that normal?
< pbrunet>
ok, it was because of tabs instead of commas
< naywhayare>
yeah, the load support is supposed to be automatic based on the file extension
< naywhayare>
so if it was .tsv, then it would have loaded correctly
< naywhayare>
but the current code unfortunately doesn't output an error if it's a TSV instead of a CSV
< naywhayare>
yeah, I figured maybe it was forgotten or something like that, but I didn't know if you had further plans for it so I thought it was best to ask before doing anything :)
< pbrunet>
Last question before going home: is mlpack thread-safe? Can we run multiple kmeans runs on different data at the same time in our program?
< zoq>
naywhayare: Maybe I can find the commit, let's see
< naywhayare>
pbrunet: the KMeans code doesn't have any parallelism at the moment, so you can easily instantiate a few KMeans objects and run them on different data with no issue
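A minimal sketch of doing that with two threads (the file names, cluster counts, and the mlpack::kmeans namespace are assumptions for an mlpack 1.x/2.x-style build; each thread owns its own KMeans object and its own data, so nothing is shared between them):

    #include <mlpack/core.hpp>
    #include <mlpack/methods/kmeans/kmeans.hpp>
    #include <functional>
    #include <thread>

    // Run k-means on one dataset; each call uses its own KMeans instance.
    void RunKMeans(const arma::mat& data, const size_t clusters,
                   arma::Row<size_t>& assignments)
    {
      mlpack::kmeans::KMeans<> k;
      k.Cluster(data, clusters, assignments);
    }

    int main()
    {
      arma::mat dataA, dataB;
      mlpack::data::Load("a.csv", dataA, true);
      mlpack::data::Load("b.csv", dataB, true);

      arma::Row<size_t> assignA, assignB;
      std::thread t1(RunKMeans, std::cref(dataA), 5, std::ref(assignA));
      std::thread t2(RunKMeans, std::cref(dataB), 5, std::ref(assignB));
      t1.join();
      t2.join();
    }

Since each thread works on its own KMeans object and its own matrices, the same pattern extends to more than two datasets.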
pbrunet has quit [Ping timeout: 255 seconds]
< naywhayare>
okay, hopefully the latest commit will fix the travis failure, which was actually a test failure with SVDBatchLearning
< naywhayare>
I felt very much like an idiot when I found what was wrong...
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#213 (master - 86771a4 : Ryan Curtin): The build was fixed.