#mlpack on 2016-05-02 — irc logs at libera.irclog.whitequark.org

2015-01-15 23:05 verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

03:59 Nilabhra has joined #mlpack

04:40 govg has joined #mlpack

05:18 Mathnerd314 has quit [Ping timeout: 268 seconds]

05:21 Mathnerd314 has joined #mlpack

05:31 Mathnerd314 has quit [Ping timeout: 268 seconds]

06:01 govg has quit [Quit: leaving]

06:01 govg has joined #mlpack

10:33 nilabhra_ has joined #mlpack

10:36 nilabhra_ has quit [Remote host closed the connection]

10:37 Nilabhra has quit [Ping timeout: 276 seconds]

10:53 tham has joined #mlpack

11:07 govg has quit [Ping timeout: 260 seconds]

11:19 govg has joined #mlpack

12:03 govg has quit [Quit: leaving]

13:11 Nilabhra has joined #mlpack

14:07 Mathnerd314 has joined #mlpack

15:02 < rcurtin> mentekid: I did some benchmarking of the LSH improvements in #623; here's what I found on a handful of datasets:

15:02 < rcurtin> (with default parameters)

15:03 < rcurtin> miniboone dataset (50x130064) -- master 43.155s, improved 32.944s

15:03 < rcurtin> pokerhand dataset (10x1025010) -- master 1684.4s, improved 1182.6s

15:04 < rcurtin> phy dataset (78x150000) -- master 81.3s, improved 104.9s

15:04 < rcurtin> corel dataset (32x37749) -- master 9.9s, improved 15.3s

15:05 < rcurtin> my best thought at the moment is, arma::unique() can be slower than the previous array approach for smaller datasets, but as the dataset gets larger unique() is a faster approach

15:06 < rcurtin> and regardless of dataset size I'm pretty certain the unique() approach will use less RAM

15:07 < rcurtin> I'm trying to think if maybe there is another solution than arma::unique()... the goal is to collect all the unique indices of LSH reference point candidates

15:08 < rcurtin> maybe, do you think it makes sense to use a hash table-like structure?

15:09 < mentekid> do you think it is the dataset size or the selectivity of LSH? Is there a log of how many indices are returned each time?

15:09 < rcurtin> I think maybe there is a bit simpler solution, but maybe a std::map<size_t, size_t> where the index is the reference point index, and the value held is 1 if the point should be searched

15:09 < rcurtin> hm, let's see if I have that information. I didn't tune the parameters at all

15:09 < rcurtin> yeah, here are the parameters that were selected:

15:10 < rcurtin> miniboone: hash width 24.0734, hash table size 99901x500, with an average of 1168 distinct indices per query

15:10 < rcurtin> pokerhand: hash width 3.32858, hash table size 99901x391, with an average of 9200 distinct indices per query

15:11 < rcurtin> phy: hash width 87.33, hash table size 97333x500, with an average of 1916 distinct indices per query

15:11 < rcurtin> and corel: hash width 0.718296, hash table size 50652x500, with an average of 3879 distinct indices per query

15:13 < mentekid> so it looks like the number of indices doesn't affect performance

15:13 < mentekid> because miniboone and phy have almost identical indices and different best solutions

15:15 < mentekid> regarding map, I tried using std::set as a way of avoiding unique(), but it ended up being worse (for corel). The pointer-chasing in the RB tree ended up wasting more time than it saved

15:15 < mentekid> i think map is implemented the same way but I'm not sure

15:15 < rcurtin> yeah, I'm running a quick simulation now on the miniboone dataset, and it is looking like there is not any speedup here

15:16 < rcurtin> yeah, 63.7s for std::map<> vs. 32.9s

15:16 < rcurtin> you're right that std::set is the better choice, but I doubt I would see much difference

15:17 zoq_ has joined #mlpack

15:17 < rcurtin> actually, let me try unordered_set, since ordering is not important...

15:17 < mentekid> ah I didn't know this was also implemented

15:18 < mentekid> I was thinking about trying something like a Bloom filter, of course then we have a second-order loss of recall since BFs are approximate

15:19 < mentekid> but it has better time and space behavior than set

15:21 zoq has quit [Ping timeout: 260 seconds]

15:22 Nilabhra has quit [Ping timeout: 260 seconds]

15:22 zoq_ is now known as zoq

15:22 < rcurtin> yeah, unordered_set only improves to 59.8s for the miniboone dataset
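
The hash-set idea tried above can be sketched as follows. This is a minimal illustration using only the C++ standard library, not the code that was benchmarked; the function name UniqueByHashSet is hypothetical, and the candidates are assumed to arrive as a plain vector of reference point indices.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Collect the distinct candidate indices with a hash set.
// The result keeps first-seen order, since ordering is not needed.
// (Hypothetical helper for illustration; not part of mlpack.)
std::vector<std::size_t> UniqueByHashSet(
    const std::vector<std::size_t>& candidates)
{
  std::unordered_set<std::size_t> seen;
  seen.reserve(candidates.size());

  std::vector<std::size_t> result;
  for (std::size_t c : candidates)
  {
    // insert() returns {iterator, bool}; the bool is true only for new keys.
    if (seen.insert(c).second)
      result.push_back(c);
  }
  return result;
}
```

Each insertion hashes the key and may allocate a node, which is one plausible reason this kind of structure did not beat the other approaches in the timings reported here.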

15:25 < mentekid> do you think it's the size of the reference set that affects performance?

15:25 Nilabhra has joined #mlpack

15:27 < rcurtin> I think the size of the reference set is definitely related, but I guess the exact number is that we'll be calling arma::unique() with a number of elements equal to, I think, the number of elements in each hash table we are searching

15:27 < rcurtin> but I think the behavior of unique() may be dependent on the data itself, so that might be harder to say. I'm not sure what implementation unique() is using

15:30 < mentekid> I think it uses quicksort because of the unguarded_pivot() function that I saw being called a lot when I profiled my code

15:30 < mentekid> and when I printed the input to unique() and plotted it, it looked like it was partially sorted

15:31 < rcurtin> yeah, that sounds about right

15:31 < mentekid> I mean it did "peaks" instead of being random

15:32 < mentekid> but I'm not sure how much these peaks harm performance - if they are selected as pivots then they might

15:34 tham has quit [Ping timeout: 250 seconds]

15:36 < rcurtin> I guess, the issue comes down to that there is a tradeoff:

15:37 < rcurtin> for large reference sets, the arma::find approach is slower because it takes more memory and the cost of scanning all the memory is high

15:37 < rcurtin> but for smaller reference sets, the find approach is faster than unique, since unique involves some amount of sorting
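
The tradeoff described above can be illustrated with two standard-library stand-ins. This is a sketch under assumptions, not the mlpack or Armadillo code: UniqueByScan mimics the find()-style approach with a mark array as large as the reference set, and UniqueBySort mimics a unique()-style approach by sorting only the candidates. Both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Mark-and-scan: cost grows with the reference set size, because the whole
// mark array is allocated and scanned regardless of how many candidates exist.
// Assumes every candidate index is < numReferencePoints.
std::vector<std::size_t> UniqueByScan(
    const std::vector<std::size_t>& candidates,
    std::size_t numReferencePoints)
{
  std::vector<char> marked(numReferencePoints, 0);
  for (std::size_t c : candidates)
    marked[c] = 1;

  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < numReferencePoints; ++i)
    if (marked[i])
      result.push_back(i);
  return result;
}

// Sort-and-deduplicate: cost grows with the number of candidates
// (O(k log k) for k candidates), independent of the reference set size.
std::vector<std::size_t> UniqueBySort(std::vector<std::size_t> candidates)
{
  std::sort(candidates.begin(), candidates.end());
  candidates.erase(std::unique(candidates.begin(), candidates.end()),
                   candidates.end());
  return candidates;
}
```

Both return the distinct indices in ascending order; they differ only in which quantity drives the running time and memory use.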

15:41 < rcurtin> do you think maybe a reasonable approach might be to figure out where this cutoff is, and then select automatically?

15:42 < mentekid> that could work, yes. But phy is larger than miniboone (150k vs 130k) and the unique() code performs better at miniboone and worse at phy. So maybe dataset size is an indicator but not the only one

15:42 < rcurtin> I think that maybe the cutoff can be expressed as some constraint on maxNumPoints / referenceSet.n_cols, but I haven't played around with it enough

15:43 < mentekid> But i think yeah, if we do a cutoff, some function of reference size is the best option

15:44 < rcurtin> I don't think it needs to be perfect at all; I'm not too concerned with differences of runtime that are like 1-5%

15:44 < rcurtin> but more like 25-50%, like I saw in the runs I did, those are more major and we should definitely be concerned that we pick the best option there

15:45 < mentekid> true, the point is the bigger differences, for example pokerhand and corel

15:45 < rcurtin> okay; so here's an interesting observation I've just run:

15:45 < rcurtin> on the phy dataset, maxNumPoints (the number of reference points that we'll be calling unique() with) is usually about 10-15k (i.e. 10% of the dataset size)

15:46 < rcurtin> but with miniboone, it's almost always about 1000 (more like 1% of the dataset size)

15:47 < rcurtin> that's only one data point, but it's maybe a direction towards a reasonable solution

15:48 < mentekid> so it might be more directly related to unique() input size and not reference size at all

15:48 < rcurtin> yeah, the unique() runtime is definitely directly related to maxNumPoints

15:48 < rcurtin> but the find() runtime will be related to the reference set size

15:49 < mentekid> that makes sense yeah

15:49 < rcurtin> so I think, maybe there will be some tradeoff value, where, e.g., if maxNumPoints > 0.05 * referenceSet.n_cols, then the find() approach is better

15:49 < rcurtin> there are going to be other weird effects too, like if the reference set gets really big, then there isn't enough RAM and it gets way slower

15:49 < rcurtin> but... we should try and keep it simple I think :)

15:49 < rcurtin> this is a problem with heuristics... there are so many factors involved, it often gets really messy

15:50 < mentekid> Sticking with something simple, I could do something like finding maxNumPoints (which is a small loop and shouldn't take much time), and then have some heuristic like the one you said

15:50 < mentekid> so if we're below that, go with the old code, otherwise go with the new code
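
A minimal sketch of the per-query selection discussed above. The 0.05 ratio is only the example value floated in the conversation, not a tuned constant, and the names DedupStrategy and ChooseStrategy are hypothetical.

```cpp
#include <cstddef>

// Which deduplication approach to use for one query point.
enum class DedupStrategy
{
  Scan,       // old code: find()-style scan over the whole reference set
  SortUnique  // new code: unique()-style sort of the candidates only
};

// Pick the scan when the candidate count is a large fraction of the
// reference set; otherwise sort the candidates.
// cutoffRatio = 0.05 is an untuned placeholder taken from the discussion.
DedupStrategy ChooseStrategy(std::size_t maxNumPoints,
                             std::size_t numReferencePoints,
                             double cutoffRatio = 0.05)
{
  const double threshold =
      cutoffRatio * static_cast<double>(numReferencePoints);
  return (static_cast<double>(maxNumPoints) > threshold)
      ? DedupStrategy::Scan
      : DedupStrategy::SortUnique;
}
```

With the rough figures quoted earlier, about 12000 candidates out of 150000 points (phy) would select the scan, while about 1000 out of 130064 (miniboone) would select the sort, which matches the direction of the benchmark results.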

15:51 < rcurtin> yeah; I'm playing with simulations now to see if this is reasonable intuition

15:51 < rcurtin> maxNumPoints is controlled by the number of hash tables, so, if I increase the number of hash tables, then the find() strategy should become faster as compared to unique()

15:52 < rcurtin> for miniboone, this is what I'm finding; with 30 tables, find() is about 50% slower... with 45, it's 22% slower... and with 60, it's 7.5% slower

15:53 < rcurtin> er, I got the table numbers wrong, that was for 30, 35, and 45 tables, not 30, 45, and 60

15:54 < mentekid> wait so find() becomes faster when we have larger number of tables? or does unique() become slower?

15:55 < rcurtin> it's difficult to say, because I am not timing find() individually (I guess I could)

15:55 < mentekid> no I mean the code with find

15:56 < rcurtin> ah, yeah, it is

15:56 < rcurtin> here are exact numbers:

15:56 < rcurtin> 30 tables: find() 43.155s, unique() 32.944s

15:56 < rcurtin> 35 tables: find() 54.911s, unique() 44.594s

15:57 < rcurtin> 45 tables: find() 72.077s, unique() 67.385s

15:57 < rcurtin> 60 tables: find() 116.796s, unique() 124.301s
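
To compare the four rows above on one scale, the relative difference can be computed directly from the quoted timings. This is just arithmetic on the numbers in the log; the helper name RelativeSlowdownPercent is hypothetical.

```cpp
// Percentage by which the find() run is slower than the unique() run.
// A negative result means the find() run was the faster of the two.
double RelativeSlowdownPercent(double findSeconds, double uniqueSeconds)
{
  return 100.0 * (findSeconds - uniqueSeconds) / uniqueSeconds;
}
```

Applied to the rows above, the value is positive for 30, 35, and 45 tables and turns negative at 60 tables, so the crossover for miniboone lies somewhere between 45 and 60 tables.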

15:57 < rcurtin> sorry that that was unclear, hopefully this is better

15:57 < mentekid> ah ok, both increase but unique() increases faster. I thought you meant increasing tables made the find() code faster, that would be weird

15:59 < rcurtin> yeah, sadly it does not work that way :(

15:59 < mentekid> so yeah I think this all boils down to maxNumPoints and how many numbers unique() has to sift through. Increasing the tables will increase maxNumPoints and make unique slower than the current code

15:59 < rcurtin> I think, but am not sure, that the performance of unique() is not going to vary that much based on the dataset, so I think that maybe we can find a single reasonable crossover point

16:00 < rcurtin> I didn't have code in place to check what the average value of maxNumPoints was though

16:00 < rcurtin> and actually, an interesting thing about this strategy is that maxNumPoints can be different for each query point, so the strategy of selecting either find() or unique() is likely to be faster than hard-coding either

16:00 < rcurtin> since the strategy used can be different for every query point

16:02 < mentekid> true, that's some useful added flexibility

16:02 < mentekid> I'll code it up and see how it goes

16:04 < mentekid> I will probably find some time in the coming days, not sure if I can do it today (I have family visiting since it's Easter break here)

16:05 Nilabhra has quit [Ping timeout: 240 seconds]

16:05 < rcurtin> sure, no hurry :)

16:05 < rcurtin> I am going to go grab some lunch now, I'll be back in a little while

16:07 < mentekid> cool talk to you later

17:33 Nilabhra has joined #mlpack

20:06 travis-ci has joined #mlpack

20:06 < travis-ci> mlpack/mlpack#800 (master - 4c8a8d1 : Ryan Curtin): The build passed.

20:06 < travis-ci> Change view : https://github.com/mlpack/mlpack/compare/7264d37128a9...4c8a8d1ccbc3

20:06 < travis-ci> Build details : https://travis-ci.org/mlpack/mlpack/builds/127316346

20:06 travis-ci has left #mlpack []

20:29 gtank has quit [Ping timeout: 268 seconds]

20:34 gtank has joined #mlpack

20:37 Nilabhra has quit [Remote host closed the connection]

21:23 Nilabhra has joined #mlpack

22:02 Nilabhra has quit [Remote host closed the connection]