verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
pbrunet has joined #mlpack
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#212 (master - 236d5bc : Marcus Edel): The build is still failing.
travis-ci has left #mlpack []
< pbrunet> hi, I am looking for ML libs in C++ and found mlpack. I tried to compare it with dlib and shark, but I didn't find any benchmarks. Did you benchmark these libs?
< naywhayare> pbrunet: we don't have any benchmarking scripts for dlib or shark, unfortunately
< naywhayare> the scripts tend to actually be pretty easy to write, so if you know what you want to compare, I can help write the script and integrate it into the benchmarking system
< pbrunet> ok :-( Do you have a good reason for not comparing with them? Or do you just not have time for it?
< naywhayare> I think when we originally wrote the benchmarking scripts we maybe weren't aware of dlib and shark implementations
< naywhayare> but otherwise no particularly good reason, just lack of time :)
< naywhayare> I met a shark developer at ICML some time back... but I can't remember who he was or anything else about the interaction
< naywhayare> what specific algorithms are you interested in benchmarks for? I can look into it further
< pbrunet> :-) thanks for this information. Indeed, I am looking for an ML lib with really good performance to handle 100,000,000 data points with 200 dimensions each. It looks like only mlpack, dlib, and shark are able to do this, but I haven't benchmarked them yet
< pbrunet> Basically, we will start with a simple k-means but it may change in the future.
< naywhayare> yeah, I think mlpack can do kmeans on that large of a dataset
< naywhayare> if you use the command-line interface, the "-a" option lets you select the algorithm. I'd try the "elkan" and "hamerly" algorithms for 200-dimensional data... the "naive" algorithm will be too slow for that much data, and the "blacklist" and "dualtree" algorithms may be slow in such high dimensions
< naywhayare> I'll look into dlib and shark implementations and see if I can easily add them to the benchmarking system (it may take me a while)
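(A minimal sketch of the command-line invocation being discussed, assuming the mlpack_kmeans executable and the -i/-c/-o flags that appear later in this log; exact flags and algorithm names may differ between mlpack releases:)

    # Elkan's algorithm, suggested above for 200-dimensional data.
    mlpack_kmeans -i data.csv -c 5 -a elkan -o assignments.csv
    # Hamerly's algorithm, an alternative worth timing on the same data.
    mlpack_kmeans -i data.csv -c 5 -a hamerly -o assignments.csv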
< pbrunet> Thanks a lot for this help. I will also prepare some scripts here to test it on our data.
< pbrunet> Just to know, what is the BLAS lib used for benchmarking this? And does it run in parallel? SIMD? (I guess yes...)
< naywhayare> for the benchmarks posted on the website, ATLAS is used
< naywhayare> to the best of my knowledge, ATLAS is singlethreaded
< naywhayare> but you can use Armadillo with OpenBLAS, which is parallel, and that may give additional speedup
< naywhayare> also, if you're willing to share any scripts you have to run dlib/shark kmeans, that'll make it a lot easier to add them to the benchmarking system :)
< pbrunet> :-) I will share my scripts once they work. For information, ATLAS can be parallel, but on a "big" computer ATLAS is pretty slow compared to OpenBLAS or MKL. Do you add any extra SIMD/parallelism in the mlpack codebase?
< naywhayare> at the moment, only the density estimation tree code has OpenMP
< naywhayare> and there isn't yet any special code to exploit SIMD instructions
< naywhayare> (I would like to add that; I just haven't had time)
< naywhayare> the compiler *might* be auto-generating SIMD instructions through Armadillo, but I haven't checked
< pbrunet> I am also a core dev on an open-source project and I understand your "I just haven't had time". Thanks for this information.
< naywhayare> :)
< naywhayare> the list of things I want to do is far longer than the list of things I have time to do :(
< pbrunet> Also, I see you need column-major values. I guess it is because BLAS uses this representation? Couldn't using CBLAS solve this problem?
< naywhayare> we use column-major matrices mostly because Armadillo was designed that way
< naywhayare> the loading and saving functions in mlpack automatically transpose to the correct representation, so if your data is on disk as CSVs with one observation per row, it'll be loaded correctly
< naywhayare> (assuming you use the mlpack::data::Load() function, or the command-line executables that mlpack comes with)
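(A minimal C++ sketch of the loading behavior described above, assuming mlpack's data::Load() interface; the filename is a placeholder:)

    #include <mlpack/core.hpp>

    int main()
    {
      // Load() transposes by default, so a CSV with one observation per row
      // ends up in memory with one observation per column (column-major).
      arma::mat dataset;
      mlpack::data::Load("data.csv", dataset, true /* fatal if loading fails */);
      // dataset.n_rows == dimensionality, dataset.n_cols == number of points.
      return 0;
    }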
< pbrunet> ok, so for now I have a file with many rows and each contains a lot of values. I run k-means with: mlpack_kmeans -c 5 -o HERE.csv -i CFC_1000_Accelerometers_9019119_Velocity_magnitude_mm_millisec_001.csv and it only uses the first value: Size is 1 x 6428. Is that normal?
< pbrunet> ok, it was because of tabs instead of commas
< naywhayare> yeah, the load support is supposed to be automatic based on the file extension
< naywhayare> so if it was .tsv, then it would have loaded correctly
< naywhayare> but the current code unfortunately doesn't output an error if it's a TSV instead of a CSV
< naywhayare> I made a bug: https://github.com/mlpack/mlpack/issues/449 ; I'll fix that when I have a chance (currently tracking down some other issue)
< pbrunet> For now, I use the Debian package, so the TSV format is not supported.
< naywhayare> pbrunet: ah, okay; in that case, unfortunately, the fast k-means algorithms aren't implemented
< pbrunet> ok, I may switch to the dev version then
< naywhayare> zoq: I noticed this bug https://github.com/mlpack/mlpack/issues/448 -- I seem to remember a commit fixing the issue, maybe the bug should be updated?
< zoq> naywhayare: right, we can close the issue
< naywhayare> yeah, I figured maybe it was forgotten or something like that, but I didn't know if you had further plans for it so I thought it was best to ask before doing anything :)
< pbrunet> Last question before going home: is mlpack thread safe? Can we run multiple k-means on different data at the same time in our program?
< zoq> naywhayare: Maybe I can find the commit, let's see
< naywhayare> pbrunet: yeah, you can do that
< naywhayare> pbrunet: the KMeans code doesn't have any parallelism at the moment, so you can easily instantiate a few KMeans objects and run them on different data with no issue
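(A minimal C++ sketch of what naywhayare describes: independent KMeans objects clustering different datasets, here launched on two std::threads; the random matrices are placeholder data:)

    #include <mlpack/core.hpp>
    #include <mlpack/methods/kmeans/kmeans.hpp>
    #include <thread>

    using namespace mlpack::kmeans;

    int main()
    {
      // Two unrelated datasets: 200 dimensions, 10000 points each.
      arma::mat dataA(200, 10000, arma::fill::randu);
      arma::mat dataB(200, 10000, arma::fill::randu);
      arma::Col<size_t> assignA, assignB;

      // Each thread owns its own KMeans object and its own data, so the two
      // clustering runs share no state.
      std::thread t1([&]() { KMeans<> km; km.Cluster(dataA, 5, assignA); });
      std::thread t2([&]() { KMeans<> km; km.Cluster(dataB, 5, assignB); });
      t1.join();
      t2.join();

      return 0;
    }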
pbrunet has quit [Ping timeout: 255 seconds]
< naywhayare> okay, hopefully the latest commit will fix the travis failure, which was actually a test failure with SVDBatchLearning
< naywhayare> I felt very much like an idiot when I found what was wrong...
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#213 (master - 86771a4 : Ryan Curtin): The build was fixed.
travis-ci has left #mlpack []
benchmark has joined #mlpack
benchmark has quit [Client Quit]