verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
curiousguy13 has quit [Ping timeout: 264 seconds]
trapz has quit [Quit: trapz]
anntzer has quit [Ping timeout: 264 seconds]
curiousguy13 has joined #mlpack
trapz has joined #mlpack
trapz has quit [Quit: trapz]
curiousguy13 has quit [Ping timeout: 265 seconds]
govg has quit [Quit: leaving]
kshitijk has joined #mlpack
kshitijk has quit [Remote host closed the connection]
trapz has joined #mlpack
govg has joined #mlpack
syrius has left #mlpack []
lezorich has joined #mlpack
edlinde has joined #mlpack
< edlinde> I need a Sparse LSH implementation in C++
< edlinde> am thinking of modifying the LSH code that is already available
< edlinde> would any of you guys be interested in helping out?
< edlinde> I don’t know about the current LSH implementation and how well it performs.. does anyone know?
< naywhayare> edlinde: I'm happy to help out
< naywhayare> I don't actually know how the current LSH implementation performs, though; I haven't personally benchmarked it
< edlinde> thing is I haven’t yet looked deeper into the LSH code and don’t know how it works
< edlinde> I was treating it as a black box really… I was using another LSH implementation in Python that supported sparse matrices
< edlinde> turned out that it was returning way too many objects and setting “k” had no bearing on the performance
< naywhayare> this is what the python implementation was doing, or what mlpack was doing?
< edlinde> the python one
< edlinde> I cannot use MLpack because the data is very high-dimensional .. thousands of dimensions, and in some datasets about 1.3 million
< naywhayare> ah, okay, good, so no serious bugs to solve for mlpack's implementation then :)
< edlinde> I am not finding any good LSH implementation that works with sparse data :(
< naywhayare> so I think the right approach with LSH will be this:
< naywhayare> we have an LSHSearch class (in src/mlpack/methods/lsh/)
< edlinde> I am tossing up between : http://lshkit.sourceforge.net/ and this MLpack implementation to be honest
< edlinde> naywhayare: ok listening..
< naywhayare> we can add a template parameter to this ("MatType"), and then rewrite the internals of the class to use MatType instead of arma::mat
< naywhayare> then, we can specify MatType = arma::sp_mat, and you can use Armadillo sparse matrices with your dataset
< naywhayare> let me dig through the code a little deeper to see if there will be any issues with that
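A minimal sketch of the MatType idea above, using a simplified stand-in class (not the real LSHSearch signature) so that either arma::mat or arma::sp_mat can be supplied:

```cpp
// Sketch: templatize the matrix type so dense and sparse data both work.
// Class and parameter names here are illustrative, not mlpack's actual header.
#include <armadillo>

template<typename MatType = arma::mat>
class SimpleLSH
{
 public:
  SimpleLSH(const MatType& referenceSet, const size_t numTables) :
      referenceSet(referenceSet), numTables(numTables) { }

 private:
  const MatType& referenceSet;  // previously this would be hard-coded as arma::mat
  size_t numTables;
};

int main()
{
  arma::sp_mat sparseData(1000000, 500);        // 1M dimensions, 500 points
  SimpleLSH<arma::sp_mat> lsh(sparseData, 10);  // MatType = arma::sp_mat
}
```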
< edlinde> just not that good with C++ templates.. it's been a while :)
< edlinde> naywhayare: is there some sort of a user-guide or tutorial that explains the code better?
< edlinde> I want to understand the code a bit more in detail.. because I need to make some more fundamental changes too
< edlinde> like having the option to attach “extra data” to each data point in the LSH
< edlinde> would like to attach the class label of each point
< naywhayare> okay, looking through the code, I think that idea will work fine
< naywhayare> the LSH code itself is fairly well commented
< edlinde> have you got a github link or something I can look at?
< naywhayare> for basic mlpack design guidelines, which is maybe what you're looking for, here is an old document from 2010: http://www.igglybob.com/mlpack_future.pdf
< edlinde> and the paper on which it's based?
< naywhayare> yeah, sure, let me get both those links for you
< edlinde> cheers
< naywhayare> https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/lsh contains the lsh code... lsh_search.hpp defines the class and lsh_search_impl.hpp contains the implementation
< edlinde> ok
< naywhayare> as for attaching extra data like labels, I think maybe the best way to do this is to hold the labels in a separate vector
< edlinde> so I am off for a week or so for Easter, after which I got to get cracking on this
< edlinde> yeah I guess that would work too
< naywhayare> do the LSH search to get the indices of the approximate nearest neighbors of each query point
< edlinde> just wondering how we go about collaborating on this code?
< naywhayare> and then use those indices to get the labels of the nearest neighbors
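A sketch of that separate-labels-vector idea: run the neighbor search, then map the returned reference indices back to labels. This assumes the k x numQueries index matrix shape that mlpack's neighbor searches produce:

```cpp
// Look up the label of each returned neighbor from a separate labels vector.
#include <armadillo>

arma::Mat<size_t> NeighborLabels(const arma::Mat<size_t>& neighbors,
                                 const arma::Row<size_t>& labels)
{
  arma::Mat<size_t> neighborLabels(neighbors.n_rows, neighbors.n_cols);
  for (size_t q = 0; q < neighbors.n_cols; ++q)    // each query point
    for (size_t i = 0; i < neighbors.n_rows; ++i)  // each of the k neighbors
      neighborLabels(i, q) = labels[neighbors(i, q)];
  return neighborLabels;
}
```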
< naywhayare> I'm happy to help through IRC or email (ryan@ratml.org), and we can use github's pull request interface too
< naywhayare> or, whatever works for you, I'm happy to help out
< edlinde> I don’t know if the code is already doing this.. but a major thing I need is that for small k, it shortens the computation time by skipping having to compute all neighbors, sort them all by distance, and return me the top k
< naywhayare> yeah, that's how LSH works... you hash the query point and search the nearby buckets for the top k nearest neighbors
< naywhayare> so you compute the distance between the query point and very few other points
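For reference, a single hash function of the usual p-stable LSH form, h(x) = floor((dot(a, x) + b) / w), so that points whose projections land in the same width-w slot share a bucket (a sketch of the standard construction, not mlpack's exact code):

```cpp
// One p-stable LSH hash: 'a' is a random Gaussian direction, 'b' a random
// offset in [0, w), and 'w' the hash (bucket) width.
#include <armadillo>
#include <cmath>

long long HashPoint(const arma::vec& x, const arma::vec& a, const double b,
                    const double w)
{
  return (long long) std::floor((arma::dot(a, x) + b) / w);
}
```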
< edlinde> hmm what I meant was that the runtime when I set k=2, should be much better than say k=100
< edlinde> I know that the other LSH implementation out there by Piotr et al. does some sort of an initial pass to gather some statistics about the data distribution
< edlinde> and then accordingly comes up with the length of the bit representations
< naywhayare> hm, small k should take much less time than large k... let me run a quick simulation and check
< edlinde> great thanks!
< naywhayare> okay, I used the winequality dataset (about 6k points)
< naywhayare> for k = 2, it takes 1.02 seconds
< naywhayare> k = 10, 1.38s
< naywhayare> k = 100, 1.65s
< naywhayare> k = 1000, 4.62s
< edlinde> hmm
< edlinde> 6k points with what dimensionality?
< edlinde> seems a bit slow to be honest.. I mean 1 sec on just 6k points
< naywhayare> 11
< naywhayare> I didn't tune any of the parameters
< edlinde> i see
< naywhayare> keep in mind that LSH chooses its bucket size and number of hash tables in order to guarantee a success probability
< edlinde> I am guessing these params are to be tuned manually yes?
< edlinde> does this code choose those automatically? or do you mean there is a default number of tables it will use?
< naywhayare> you can use the defaults and it will do okay, but realistically you should find your desired success probability and then choose your parameters according to that
< naywhayare> it uses a default number
< edlinde> yeah it's a lot of fiddling around :)
< naywhayare> it might be possible to get it to autoconfigure, but that's not currently what the code does
< naywhayare> the paper may have more details about how to choose the bucket size and number of tables in a smart way, but I can't give too much insight about that
< edlinde> maybe we could write something that does a few runs with different parameters
< edlinde> yeah I know how LSH works.. it's just that I haven’t implemented it myself, so I don't know all the implementation-related details
< edlinde> I guess you could help with that part more once we have something
< naywhayare> sure, I can help with that
< edlinde> also, I am thinking distance computations between sparse vectors of very high dimensionality will become an issue
< edlinde> might have to think of a neat hack to speed those up too
< naywhayare> you can do it faster than O(d), and it should not be hard to get the code to do that automatically based on what MatType is
< naywhayare> basically you iterate over nonzero values in each vector
< naywhayare> I think the idea in my head there is worst-case O(nnz_1 + nnz_2) where nnz_1 is the number of nonzeros in the first vector and nnz_2 is the number of nonzeros in the second
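A sketch of that merge-style walk over nonzeros, written against Armadillo's sparse iterators rather than mlpack's actual metric code:

```cpp
// O(nnz_1 + nnz_2) squared Euclidean distance between two sparse vectors.
#include <armadillo>

double SparseSquaredDistance(const arma::sp_vec& a, const arma::sp_vec& b)
{
  double dist = 0.0;
  arma::sp_vec::const_iterator ia = a.begin(), ib = b.begin();
  while (ia != a.end() && ib != b.end())
  {
    if (ia.row() == ib.row())        // nonzero in both vectors
    {
      const double d = (*ia) - (*ib);
      dist += d * d; ++ia; ++ib;
    }
    else if (ia.row() < ib.row())    // nonzero only in a
    {
      dist += (*ia) * (*ia); ++ia;
    }
    else                             // nonzero only in b
    {
      dist += (*ib) * (*ib); ++ib;
    }
  }
  for (; ia != a.end(); ++ia) dist += (*ia) * (*ia);  // remaining tail of a
  for (; ib != b.end(); ++ib) dist += (*ib) * (*ib);  // remaining tail of b
  return dist;
}
```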
< edlinde> yes
< edlinde> just wondering how that will work when there is a ton of data with very high dimensionality .. like the news20 dataset
< edlinde> or rcv1
< naywhayare> if those datasets are reasonably sparse, then iterating over nonzero elements is the best you can do
< naywhayare> if the datasets are fully dense, there's not much else you can do other than approximately projecting it into lower dimensions (but then you are approximating)
< naywhayare> one of the thoughts I've had for a long time is using partial distance calculations
< edlinde> yeah this won’t be for dense sets
< naywhayare> often when you calculate a distance, what you're really trying to find out is if the distance is greater than some value v
< naywhayare> so you can terminate early if the partial distance in some dimensions exceeds v
< naywhayare> that might be advantageous in high dimensions
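A sketch of the partial distance idea: accumulate the squared distance dimension by dimension and return early once it exceeds the bound v (for example, the current k-th best distance):

```cpp
// Early-exit squared distance: stop as soon as the running sum exceeds 'bound',
// since the caller only needs to know the point is worse than the bound.
#include <armadillo>

double PartialSquaredDistance(const arma::vec& a, const arma::vec& b,
                              const double bound)
{
  double dist = 0.0;
  for (size_t d = 0; d < a.n_elem; ++d)
  {
    const double diff = a[d] - b[d];
    dist += diff * diff;
    if (dist > bound)   // already worse than the bound; caller can prune
      return dist;
  }
  return dist;
}
```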
< edlinde> I was thinking along the lines of using compressed bit-vectors alongside the sparse vectors.. so the comparisons become very quick bit operations
< naywhayare> can you describe more of what you mean?
< edlinde> don’t know how much speedup that would give though compared to a linear scan of nnzs
< edlinde> you know about plwah?
< naywhayare> I don't, but I am reading about it now :)
< edlinde> sure np
< edlinde> so with these bitmaps I can do a quick AND or OR or XOR operation to know what values in the sparse vector have also got corresponding values in the other sparse vector.. to compute the distance.. while all rest can be left alone
< edlinde> anyway its just a thought.. maybe the linear scan is still good enough.. we can start with something simple and then do some profiling :)
< edlinde> I just need the sparseLSH to scale really well on some very large datasets too
< naywhayare> hm, but if you calculate the values in the sparse vector which have corresponding values in the other vector, this is still not everything you need for the distance
< naywhayare> because you still need to consider those nonzero values that don't have corresponding values in the other vector
< naywhayare> you can only skip when both are zero
< naywhayare> anyway, yeah, I agree -- I think templatizing MatType is a good first step and should get you most of the way to your goal
< edlinde> lets see :)
< edlinde> it sounds like you are pretty well-versed with the code .. is templatizing hard for you to do?
< edlinde> I mean would it take you long?
< naywhayare> I could probably do it in 3 or 4 hours, but I'm pretty busy with other stuff now, unfortunately
< edlinde> hmm
< edlinde> I will go through the code when I get back from the Easter break
< naywhayare> unfortunately my list of things to do for mlpack in my free time is already very long :(
< edlinde> I actually also have to write a reader for libSVM-formatted files and see how to put that into an Arma sparse mat
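A rough sketch of such a loader (a hypothetical helper, not something that exists in mlpack), storing one point per column and the labels in a separate vector:

```cpp
// Load a libSVM-format file into an Armadillo sparse matrix; no error handling.
#include <armadillo>
#include <algorithm>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

void LoadLibSVM(const std::string& filename, arma::sp_mat& data, arma::vec& labels)
{
  std::ifstream f(filename);
  std::vector<arma::uword> rows, cols;
  std::vector<double> vals, labelVec;
  arma::uword maxDim = 0, point = 0;
  std::string line, token;
  while (std::getline(f, line))
  {
    std::istringstream ss(line);
    double label;
    ss >> label;                 // first token on each line is the label
    labelVec.push_back(label);
    while (ss >> token)          // remaining tokens are "index:value", 1-based
    {
      const size_t colon = token.find(':');
      const arma::uword dim = std::stoull(token.substr(0, colon)) - 1;
      rows.push_back(dim);
      cols.push_back(point);
      vals.push_back(std::stod(token.substr(colon + 1)));
      maxDim = std::max(maxDim, dim + 1);
    }
    ++point;
  }
  // Assemble the sparse matrix in one batch from the collected triplets.
  arma::umat locations(2, rows.size());
  for (size_t i = 0; i < rows.size(); ++i)
  {
    locations(0, i) = rows[i];
    locations(1, i) = cols[i];
  }
  data = arma::sp_mat(locations, arma::vec(vals), maxDim, point);
  labels = arma::vec(labelVec);
}
```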
< naywhayare> when you get back, I am happy to help out :)
< edlinde> sure I understand
< edlinde> :)
< edlinde> our lives suck.. I get it :)
< edlinde> haha
< naywhayare> :)
< edlinde> maybe we should think about approximate distances
< edlinde> and not kill ourselves over exact distances between sparse vectors
< edlinde> might give us some real good speedup
< naywhayare> if you're going to do that, I think the better way to do it is to do random projections into log(d) dimensions and then you get guarantees on how approximate the distance is
< edlinde> will see if there is some work already in this direction
< naywhayare> but I could be wrong
< naywhayare> there may be some good approximate approaches in the direction of what you suggested; let me know what you find
< edlinde> yeah but when do you want to do the RP?
< edlinde> when looking at each sparse vector one at a time?
< edlinde> or do a RP right at the start when we load data into the Arma sparse matrix?
< naywhayare> no, before you start the LSH process at all
< naywhayare> yeah, that's when I would do it... project from sparse d dimensions to dense log(d) dimensions
< naywhayare> use the Johnson-Lindenstrauss lemma to guarantee approximation bounds
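A sketch of that preprocessing step with a plain Gaussian projection matrix (not the fast JL transform; the target dimension is left to the caller rather than derived from the JL bound):

```cpp
// Project sparse, high-dimensional points (one per column) down to a dense,
// low-dimensional representation before running dense LSH.
#include <armadillo>
#include <cmath>

arma::mat RandomProject(const arma::sp_mat& data, const size_t targetDim)
{
  arma::mat projection(targetDim, data.n_rows, arma::fill::randn);
  projection /= std::sqrt((double) targetDim);  // preserve norms in expectation
  return projection * data;                     // dense (targetDim x n) result
}
```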
< edlinde> yep
< edlinde> makes sense
< edlinde> there is also a fast JL transform available
< naywhayare> yeah, I saw that... that may be a good approach to implement in mlpack as a standalone preprocessing step for any dataset and any algorithm
< edlinde> but then if we project and it gets dense.. we don’t really need a sparse LSH then right?
< edlinde> we can still work with the dense LSH implementation after the JL transform
< naywhayare> yeah, that is true, but I think we get less rigorous approximation bounds :)
< naywhayare> it all depends on how important the bounds are to you :)
< edlinde> maybe we can try both approaches and see
< naywhayare> it would be interesting to compare them
< edlinde> yeh
< naywhayare> in terms of both runtime and approximate result quality
< edlinde> the accuracy/runtime tradeoff might be interesting
< edlinde> :)
< edlinde> cool.. so we seem to have some sort of a rough idea of how we might do this then
< edlinde> I must admit my C++ is a bit rusty.. so you might have to help cleanup the code a little
< edlinde> is that ok with you?
< naywhayare> yeah, of course
< edlinde> cool
< edlinde> thanks then
< naywhayare> sure, no problem :)
< edlinde> ok ttyl then
< edlinde> thanks for the chat
lezorich has quit [Ping timeout: 246 seconds]
< naywhayare> sounds good -- talk to you later
lezorich has joined #mlpack
lezorich has quit [Ping timeout: 246 seconds]
lezorich has joined #mlpack
< edlinde> naywhayare: if you are still around.. have you considered a parallel version?
< edlinde> maybe the code will have to be written in a thread-safe manner then
< edlinde> just comparing it to http://lshkit.sourceforge.net/
< naywhayare> edlinde: I've considered it, and I think OpenMP or Cilk is the easiest way to go for now, but I haven't implemented it
< naywhayare> like I said, too many ideas, too little time :)
< edlinde> yeah OpenMP might do the trick
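A sketch of the OpenMP-over-queries pattern, with placeholder per-query work standing in for the actual LSH lookup; compile with -fopenmp:

```cpp
// Each query column is processed independently, so the loop parallelizes
// cleanly with no shared writes.
#include <armadillo>

void ParallelQueryLoop(const arma::mat& queries, const arma::mat& references,
                       arma::vec& results)
{
  results.set_size(queries.n_cols);
  #pragma omp parallel for
  for (long long q = 0; q < (long long) queries.n_cols; ++q)
  {
    // Placeholder work: distance from query q to the first reference point;
    // a real implementation would run the LSH lookup for query q here.
    results[q] = arma::norm(queries.col(q) - references.col(0), 2);
  }
}
```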
< edlinde> thing is I got to do many queries with just a single query point at a time
< stephentu> hey, one interesting research evaluation question I've been wondering is
< stephentu> if you have a very high dimensional sparse dataset
< stephentu> is it better to do a random projection and work with a low-dimensional but dense representation
< stephentu> or to use an algorithm which takes advantage of sparsity
< stephentu> of course this is vague right now but
< stephentu> the answer is probably it depends on the dataset and problem
< edlinde> yeah we were just asking the same question too
< edlinde> just wondering if there is a nicer way to somehow avoid computing all the neighbors and stop the computation once we have “k” NNs
< edlinde> maybe keep some sort of a distance upper bound
< edlinde> I think most implementations are getting all neighbors, sorting them by distance and then reporting k-NNs
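One way to keep such a distance upper bound is a max-heap of the current k best candidates, so anything worse than the k-th best is discarded immediately instead of being sorted at the end; a sketch:

```cpp
// Maintain only the k best (distance, index) pairs; the heap top is the
// current worst of the k, i.e. the pruning bound.
#include <cstddef>
#include <queue>
#include <utility>

using Candidate = std::pair<double, size_t>;  // (distance, point index)

void InsertCandidate(std::priority_queue<Candidate>& topK, const size_t k,
                     const double dist, const size_t index)
{
  if (topK.size() < k)
    topK.emplace(dist, index);
  else if (dist < topK.top().first)  // better than the current worst
  {
    topK.pop();
    topK.emplace(dist, index);
  }
}
```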
< naywhayare> edlinde: I'm not sure about that... the whole idea behind trees and LSH is to return the k nearest neighbors without getting the distance to every possible neighbor
< edlinde> but what happens in LSH if you have way too many collisions ?
< naywhayare> LSH does calculate the distance between a query point and every point in the bin it hashes to (I think), but this is still far fewer computations than between the query point and all points
< naywhayare> I think, at least in our implementation, that you set the desired bucket size accordingly
< edlinde> yeah it does do it between the points in the buckets and not the entire dataset
< edlinde> maybe it's just trying to find the appropriate bucket size
< naywhayare> the implementation that we have now, unless I'm mistaken, uses a heuristic based on the input options to determine the right hash width to get the right number of points in each bucket... I think
< naywhayare> I'd need to refer to the paper to be sure
trapz has quit [Quit: trapz]
edlinde has quit [Quit: edlinde]
< zoq> naywhayare: I'm not sure if this is a problem at all, but maybe someone accidentally commits the modified backtrace.hpp file because they use 'git add .' or something like that. We could add it to the .gitignore file. In that case we can't easily commit changes to it... but I'm not sure we ever need to modify backtrace.hpp, so perhaps this is a good solution?
lezorich has quit [Ping timeout: 272 seconds]
< naywhayare> zoq: I was thinking about this problem too... I played with git update-index --assume-unchanged, but that's only local
< naywhayare> I don't think that it will need to be modified often, so I think .gitignore can work
< naywhayare> we can either do that, or we can remove backtrace.hpp from git entirely and make a comment in log.cpp before the include that mentions that backtrace.hpp is generated by CMake at configuration time and just includes <execinfo.h> and defines Backtrace_FOUND if it's found
< naywhayare> either's fine with me... what do you think?
< naywhayare> (or maybe some other solution entirely?)
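For reference, the comment-in-log.cpp option might look roughly like this (a sketch of the wording, not the actual file):

```cpp
// backtrace.hpp is generated by CMake at configuration time; it simply
// includes <execinfo.h> and defines Backtrace_FOUND if backtraces are
// available on this system.  Do not edit or commit the generated file.
#include "backtrace.hpp"
```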