verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
Nilabhra has joined #mlpack
govg has joined #mlpack
Mathnerd314 has quit [Ping timeout: 268 seconds]
Mathnerd314 has joined #mlpack
Mathnerd314 has quit [Ping timeout: 268 seconds]
govg has quit [Quit: leaving]
govg has joined #mlpack
nilabhra_ has joined #mlpack
nilabhra_ has quit [Remote host closed the connection]
Nilabhra has quit [Ping timeout: 276 seconds]
tham has joined #mlpack
govg has quit [Ping timeout: 260 seconds]
govg has joined #mlpack
govg has quit [Quit: leaving]
Nilabhra has joined #mlpack
Mathnerd314 has joined #mlpack
< rcurtin> mentekid: I did some benchmarking of the LSH improvements in #623; here's what I found on a handful of datasets:
< rcurtin> (with default parameters)
< rcurtin> miniboone dataset (50x130064) -- master 43.155s, improved 32.944s
< rcurtin> pokerhand dataset (10x1025010) -- master 1684.4s, improved 1182.6s
< rcurtin> phy dataset (78x150000) -- master 81.3s, imrpoved 104.9s
< rcurtin> corel dataset (32x37749) -- master 9.9s, improved 15.3s
< rcurtin> my best thought at the moment is, arma::unique() can be slower than the previous array approach for smaller datasets, but as the dataset gets larger unique() is a faster approach
< rcurtin> and regardless of dataset size I'm pretty certain the unique() approach will use less RAM
< rcurtin> I'm trying to think if maybe there is another solution than arma::unique()... the goal is to collect all the unique indices of LSH reference point candidates
< rcurtin> maybe, do you think it makes sense to use a hash table-like structure?
< mentekid> do you think it is the dataset size or the selectivity of LSH? Is there a log of how many indices are returned each time?
< rcurtin> I think maybe there is a bit simpler solution, but maybe a std::map<size_t, size_t> where the index is the reference point index, and the value held is 1 if the point should be searched
< rcurtin> hm, let's see if I have that information. I didn't tune the parameters at all
< rcurtin> yeah, here is the parameters that were selected:
< rcurtin> miniboone: hash width 24.0734, hash table size 99901x500, with an average of 1168 distinct indices per query
< rcurtin> pokerhand: hash width 3.32858, hash table size 99901x391, with an average of 9200 distinct indices per query
< rcurtin> phy: hash width 87.33, hash table size 97333x500, with an average of 1916 distinct indices per query
< rcurtin> and corel: hash width 0.718296, hash table size 50652x500, with an average of 3879 distinct indices per query
< mentekid> so it looks like number of indices doesn't affect
< mentekid> because miniboone and phy have almost identical indices and different best solutions
< mentekid> regarding map, I tried using std::set as a way of avoiding unique(), but it ended up being worse (for corel). The pointer-chasing in the RB tree ended up wasting more time than it saved
< mentekid> i think map is implemented the same way but I'm not sure
< rcurtin> yeah, I'm running a quick simulation now on the miniboone dataset, and it is looking like there is not any speedup here
< rcurtin> yeah, 63.7s for std::map<> vs. 32.9s
< rcurtin> you're right that std::set is the better choice, but I doubt I would see much difference
zoq_ has joined #mlpack
< rcurtin> actually, let me try unordered_set, since ordering is not important...
< mentekid> ah I didn't know this was also implemented
< mentekid> I was thinking about trying something like a Bloom filter, of course then we have a second-order loss of recall since BFs are approximate
< mentekid> but it has better time and space behavior than set
zoq has quit [Ping timeout: 260 seconds]
Nilabhra has quit [Ping timeout: 260 seconds]
zoq_ is now known as zoq
< rcurtin> yeah, unordered_set only improves to 59.8s for the miniboone dataset
< mentekid> do you think it's the size of the reference set that affects performance?
Nilabhra has joined #mlpack
< rcurtin> I think the size of the reference set is definitely related, but I guess the exact number is that we'll be calling arma::unique() with a number of elements equal to, I think, the number of elements in each hash table we are searching
< rcurtin> but I think the behavior of unique() may be dependent on the data itself, so that might be harder to say. I'm not sure what implementation unique() is using
< mentekid> I think it uses quicksort because of the unguarded_pivot() function that I saw being called a lot when I profiled my code
< mentekid> and when I printed the input to unique() and plotted it, it looked like it was partially sorted
< rcurtin> yeah, that sounds about right
< mentekid> I mean it did "peaks" instead of being random
< mentekid> but I'm not sure how much these peaks harm performance - if they are selected as pivots then they might
tham has quit [Ping timeout: 250 seconds]
< rcurtin> I guess, the issue comes down to that there is a tradeoff:
< rcurtin> for large reference sets, the arma::find approach is slower because it takes more memory and the cost of scanning all the memory is high
< rcurtin> but for smaller reference sets, the find approach is faster than unique, since unique involves some amount of sorting
< rcurtin> do you think maybe a reasonable approach might be to figure out where this cutoff is, and then select automatically?
< mentekid> that could work, yes. But phy is larger than miniboone (150k vs 130k) and the unique() code performs better at miniboone and worse at phy. So maybe dataset size is an indicator but not the only one
< rcurtin> I think that maybe the cutoff can be expressed as some constraint on maxNumPoints / referenceSet.n_cols, but I haven't played around with it enough
< mentekid> But i think yeah, if we do a cutoff, some function of reference size is the best option
< rcurtin> I don't think it needs to be perfect at all; I'm not too concerned with differences of runtime that are like 1-5%
< rcurtin> but more like 25-50%, like I saw in the runs I did, those are more major and we should definitely be concerned that we pick the best option there
< mentekid> true, the point is the bigger differences, for example pokerhand and corel
< rcurtin> okay; so here's an interesting observation I've just run:
< rcurtin> on the phy dataset, maxNumPoints (the number of reference points that we'll be calling unique() with) is usually about 10-15k (i.e. 10% of the dataset size)
< rcurtin> but with miniboone, it's almost always about 1000 (more like 1% of the dataset size)
< rcurtin> that's only one data point, but it's maybe a direction towards a reasonable solution
< mentekid> so it might be more directly related to unique() input size and not reference size at all
< rcurtin> yeah, the unique() runtime is definitely directly related to maxNumPoints
< rcurtin> but the find() runtime will be related to the reference set size
< mentekid> that makes sense yeah
< rcurtin> so I think, maybe there will be some tradeoff value, where, e.g., if maxNumPoints > 0.05 * referenceSet.n_cols, then the find() approach is better
< rcurtin> there are going to be other weird effects too, like if the reference set gets really big, then there isn't enough RAM and it gets way slower
< rcurtin> but... we should try and keep it simple I think :)
< rcurtin> this is a problem with heuristics... there are so many factors involved, it often gets really messy
< mentekid> Sticking with something simple, I could do something like finding maxNumPoints (which is a small loop and shouldn't take much time), and then have some heuristic like the one you said
< mentekid> so if we're below that, go with the old code, otherwise go with the new code
< rcurtin> yeah; I'm playing with simulations now to see if this is reasonable intuition
< rcurtin> maxNumPoints is controlled by the number of hash tables, so, if I increase the number of hash tables, then the find() strategy should become faster as compared to unique()
< rcurtin> for miniboone, this is what I'm finding; with 30 tables, find() is about 50% slower... with 45, it's 22% slower... and with 60, it's 7.5% slower
< rcurtin> er, I got the table numbers wrong, that was for 30, 35, and 45 tables, not 30, 45, and 60
< mentekid> wait so find() becomes faster when we have larger number of tables? or does unique() become slower?
< rcurtin> it's difficult to say, because I am not timing find() individually (I guess I could)
< mentekid> no I mean the code with find
< rcurtin> ah, yeah, it is
< rcurtin> here are exact numbers:
< rcurtin> 30 tables: find() 43.155s, unique() 32.944s
< rcurtin> 35 tables: find() 54.911s, unique() 44.594s
< rcurtin> 45 tables: find() 72.077s, unique() 67.385s
< rcurtin> 60 tables: find() 116.796s, unique() 124.301s
< rcurtin> sorry that that was unclear, hopefully this is better
< mentekid> ah ok, both increase but unique() increases faster. I though you meant increasing tables made the find() code faster, that would be weird
< rcurtin> yeah, sadly it does not work that way :(
< mentekid> so yeah I think this all boils down to maxNumPoints and how many numbers unique() has to sift through. Increasing the tables will increase maxNumPoints and make unique slower than the current code
< rcurtin> I think, but am not sure, that the performance of unique() is not going to vary that much based on the dataset, so I think that maybe we can find a single reasonable crossover point
< rcurtin> I didn't have code in place to check what the average value of maxNumPoints was though
< rcurtin> and actually, and interesting thing about this strategy, is that maxNumPoints can be different for each query point, so the strategy of selecting either find() or unique() is likely to be faster than hard-coding either
< rcurtin> since the strategy used can be different for every query point
< mentekid> true, that's some useful added flexibility
< mentekid> I'll code it up and see how it goes
< mentekid> I will probably find some time in the coming days, not sure if I can do it today (I have family visiting since it's Easter break here)
Nilabhra has quit [Ping timeout: 240 seconds]
< rcurtin> sure, no hurry :)
< rcurtin> I am going to go grab some lunch now, I'll be back in a little while
< mentekid> cool talk to you later
Nilabhra has joined #mlpack
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#800 (master - 4c8a8d1 : Ryan Curtin): The build passed.
travis-ci has left #mlpack []
gtank has quit [Ping timeout: 268 seconds]
gtank has joined #mlpack
Nilabhra has quit [Remote host closed the connection]
Nilabhra has joined #mlpack
Nilabhra has quit [Remote host closed the connection]