verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
govg has quit [Ping timeout: 244 seconds]
govg has joined #mlpack
wiking has quit [Quit: leaving]
wiking has joined #mlpack
wiking has quit [Quit: leaving]
wiking has joined #mlpack
wiking has quit [Quit: ZNC 1.6.3 - http://znc.in]
wiking has joined #mlpack
govg has quit [Ping timeout: 252 seconds]
govg has joined #mlpack
mentekid has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]
< mentekid> rcurtin: Design question - I've implemented the LSHModel::Train() function up to the point where we train the predictor for k and N
< mentekid> next step is fitting gamma distributions to the parameters we've estimated
< mentekid> The thing is, Train() is dataset-dependent, whereas fitting the gammas isn't, strictly speaking: the user could potentially use a dataset to train the model and then repeatedly call Predict() with different parameters for k and N
< mentekid> so, if we put fitting the Gammas in Train(), train will have to be called every time a new k or N is specified
marcosirc has joined #mlpack
< mentekid> on the other hand, we could put fitting the Gammas in Predict(), and they will be fit only for the specific k and N... The disadvantage in this is we have to re-fit them every time we call Predict()
< mentekid> but I think that's not so much of a problem. The work that Train() does now is already quite heavy and we should avoid re-doing it if at all possible.
< mentekid> What do you think?
< mentekid> From what I remember from the Minka paper, and what I saw from my test runs of the GammaDistribution code, it's quite fast... it usually converges after 3-4 iterations
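(The gamma fitting mentioned above can be sketched as follows. This is an illustrative standalone function, not mlpack's actual GammaDistribution API: it computes Minka-style moment statistics, the arithmetic mean and the mean of the logs, and applies the closed-form shape estimate that the iterative fit starts from. The function name is hypothetical.)

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Closed-form initial estimate of the gamma shape parameter alpha, from
// the arithmetic mean and the mean of the logs of the data (i.e. the log
// of the geometric mean). This is the starting point that the fixed-point
// iteration then refines in a few steps.
double EstimateGammaShape(const std::vector<double>& x)
{
  double mean = 0.0, meanLog = 0.0;
  for (const double v : x)
  {
    mean += v;
    meanLog += std::log(v);
  }
  mean /= x.size();
  meanLog /= x.size();

  // s = log(arithmetic mean) - log(geometric mean); s > 0 by the AM-GM
  // inequality whenever the data are not all equal.
  const double s = std::log(mean) - meanLog;
  return (3.0 - s + std::sqrt((s - 3.0) * (s - 3.0) + 24.0 * s)) /
      (12.0 * s);
}
```

For data drawn from a gamma distribution this estimate lands close to the true shape even before any iterations, which is consistent with the fast convergence observed above.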
< rcurtin> right, so you are saying that if we put all of the fitting into Predict(),
< rcurtin> er... I mean, if we put all of the fitting into Train(),
< rcurtin> then a user could call Predict() with some other dataset and get invalid results
< mentekid> well, that too
< mentekid> but I thought more like
< mentekid> call Train() with a dataset and k = 30
< mentekid> if we do all the fitting in Train, then if you decide to go with k = 32, you have to re-do everything
< mentekid> on the other hand, you could use Train() to fit the parameters
< mentekid> and have Predict not even accept a dataset, only a dataset size and a k
< mentekid> So Predict will predict the statistics from the regressor we trained for this datasetSize and k and then run binary search
< mentekid> I meant then fit the gammas and after that run binary search
< rcurtin> hm. I am not sure I fully grasp the issue
< rcurtin> I thought that once you had fit the gamma distributions, you could predict recall for any k and N (given that the dataset did not change)
< mentekid> well, the gamma distribution for Xk is fit using the arithmetic and geometric mean of kNN distances... And these are obtained from a regressor for a given k and N
< mentekid> so if we fit it inside Train(), we'll have to repeat the whole process of computing distances etc, which is slow.
< rcurtin> I thought if you fit it all inside Train(), then you would only need to calculate the distances and fit once
< rcurtin> instead of doing that every time in Predict()
< mentekid> Train() calculates the distances and fits the regressor. Predict() runs binary search.
< mentekid> so Train() is dataset-specific, it only needs to run whenever the dataset changes, not N or k
< rcurtin> okay, I see
< rcurtin> I agree with that design
< mentekid> but, if gammas are also fit inside Train(), it stops being dataset-specific and depends on N and k
< rcurtin> ah I spoke too soon :)
< mentekid> so yeah I think best choice is to go with fitting Gammas inside Predict(), keeping Train() dataset-specific
< rcurtin> hm. so is it correct that if I want to run LSH with some large value of k, then I must calculate E_k and G_k for every k up to that large value?
< mentekid> hm...
< mentekid> I would imagine no
< mentekid> but you might get better results if you do
< rcurtin> ah but I thought the recall depended on f_k() for all values of k between 0 and k
< rcurtin> I know I am abusing notation there, sorry
< mentekid> yeah, it does depend on f_k()
< rcurtin> ah, I have to turn my phone off for a while... I'll be back in 15/20
< rcurtin> maybe a few minutes more depending
< rcurtin> sorry about that
< mentekid> it's ok, talk to you later :)
< rcurtin> ok, I'm back now... sorry about that interruption
< rcurtin> it went a little longer than I thought because I fell asleep :)
< mentekid> hahaha it's ok
< mentekid> So I was confused about what you said about f_k()
< mentekid> How I have it in mind is, I'll fit k+1 distributions, k for 1:k and one for the simple pairwise distances
< rcurtin> right, I agree with that
< mentekid> and then in the recall integral, I'll call f_k() with a different k each time
< rcurtin> but then if you did the fitting in Train(), then you would need to accept a k value at that time
< rcurtin> and if you tried to run Predict() with a larger k, it couldn't work, because you would not have fit the distributions
< rcurtin> but, if you tried to run Predict() with a smaller k, it would be okay
< rcurtin> is that correct?
< mentekid> that's true, I thought about accepting a k in both Train() and Predict(). in Train() the k will have the notion of "this is the largest k I might be interested in"
< mentekid> and in Predict() it will have the meaning of "to find k nearest neighbors, what parameters do I need?"
< mentekid> so maybe in Train() I should name the parameter something like maxKValue
< rcurtin> that's exactly what I was thinking
< rcurtin> if you pass k > maxKValue in Predict(), then you can call Train() again and call Log::Warn with a message like "larger k requested; retraining required" or something like that
< mentekid> Right, so we enforce it :)
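(The design agreed on above can be sketched as a minimal class. This is an illustrative skeleton, not mlpack's actual LSHModel: Train() stands in for the expensive dataset-specific work up to maxKValue, and Predict() forces a retrain when a larger k is requested, as discussed. All names and the TrainCalls() counter are hypothetical, added only to make the retrain behavior observable.)

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>

// Sketch of the Train()/Predict() split: Train() is dataset-specific and
// takes the largest k we might query; Predict() only needs a dataset size
// and a k, and retrains when k exceeds maxKValue.
class LSHModelSketch
{
 public:
  // Stand-in for the expensive work: computing kNN distances and fitting
  // the regressor for all k up to maxKValue.
  void Train(const size_t datasetSize, const size_t maxKValue)
  {
    datasetSize_ = datasetSize;
    maxKValue_ = maxKValue;
    ++trainCalls_;
  }

  // Fit the gammas for this (datasetSize, k) and run the binary search.
  // If k > maxKValue, warn (as Log::Warn would) and retrain first.
  void Predict(const size_t datasetSize, const size_t k)
  {
    if (k > maxKValue_)
    {
      std::cerr << "larger k requested; retraining required" << std::endl;
      Train(datasetSize, k);
    }
    // ... fit the k+1 gamma distributions, then binary-search the
    // parameters for the requested recall ...
  }

  // Exposed only so the retrain behavior can be checked.
  size_t TrainCalls() const { return trainCalls_; }

 private:
  size_t datasetSize_ = 0;
  size_t maxKValue_ = 0;
  size_t trainCalls_ = 0;
};
```

With this shape, calling Predict() with any k up to maxKValue reuses the trained model, and only a larger k (or a new dataset) pays the training cost again.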
< mentekid> is there a way when I call other mlpack objects to force them to be quiet? Like calling kNN from LSHModel and telling it not to write to Log::Info?
< mentekid> it's more of an aesthetics question, nothing practical
< rcurtin> hm, I guess you could do "Log::Info.ignoreInput = false" and then be sure to restore it to its original value afterwards
< mentekid> you mean = true?
< rcurtin> oh... yeah. true :)
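(The "set it and be sure to restore it" pattern above is a natural fit for a small RAII guard. This is a sketch, not part of mlpack: the bool reference stands in for Log::Info.ignoreInput, so the flag is flipped to true on entry and restored to its original value when the guard leaves scope, even if the silenced call throws.)

```cpp
#include <cassert>

// RAII helper: silence a stream's ignoreInput-style flag for the current
// scope and restore the previous value on exit.
class SilenceGuard
{
 public:
  explicit SilenceGuard(bool& ignoreInput) :
      flag_(ignoreInput),
      oldValue_(ignoreInput)
  {
    flag_ = true;  // silence the stream
  }

  ~SilenceGuard() { flag_ = oldValue_; }  // restore on scope exit

  // Non-copyable: two guards restoring the same flag would race.
  SilenceGuard(const SilenceGuard&) = delete;
  SilenceGuard& operator=(const SilenceGuard&) = delete;

 private:
  bool& flag_;
  bool oldValue_;
};
```

Usage would be to construct the guard just before the nested kNN call from LSHModel and let it restore the flag automatically, so no code path can forget the cleanup.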
govg has quit [Ping timeout: 250 seconds]
mentekid has quit [Ping timeout: 264 seconds]
govg has joined #mlpack
mentekid has joined #mlpack
nilay has joined #mlpack
mentekid has quit [Ping timeout: 250 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Client Quit]
govg has quit [Ping timeout: 258 seconds]
govg has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]