verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
< mentekid>
rcurtin: Design question - I've implemented the LSHModel::Train() function up to the point where we train the predictor for k and N
< mentekid>
next step is fitting gamma distributions to the parameters we've estimated
< mentekid>
The thing is, Train() is dataset-dependent, whereas fitting the gammas isn't, strictly speaking: the user could potentially use a dataset to train the model and then repeatedly call Predict() with different parameters for k and N
< mentekid>
so, if we put fitting the Gammas in Train(), train will have to be called every time a new k or N is specified
marcosirc has joined #mlpack
< mentekid>
on the other hand, we could put fitting the Gammas in Predict(), and they will be fit only for the specific k and N... The disadvantage in this is we have to re-fit them every time we call Predict()
< mentekid>
but I think that's not so much of a problem. The work that Train() does now is already quite heavy and we should avoid re-doing it if at all possible.
< mentekid>
What do you think?
< mentekid>
From what I remember from the Minka paper, and what I saw from my test runs of the GammaDistribution code, it's quite fast... it converges after 3-4 iterations usually
< rcurtin>
right, so you are saying that if we put all of the fitting into Predict(),
< rcurtin>
er... I mean, if we put all of the fitting into Train(),
< rcurtin>
then a user could call Predict() with some other dataset and get invalid results
< mentekid>
well, that too
< mentekid>
but I thought more like
< mentekid>
call Train() with a dataset and k = 30
< mentekid>
if we do all the fitting in Train, then if you decide to go with k = 32, you have to re-do everything
< mentekid>
on the other hand, you could use Train() to fit the parameters
< mentekid>
and have Predict not even accept a dataset, only a dataset size and a k
< mentekid>
So Predict will predict the statistics from the regressor we trained for this datasetSize and k and then run binary search
< mentekid>
I meant then fit the gammas and after that run binary search
< rcurtin>
hm. I am not sure I fully grasp the issue
< rcurtin>
I thought that once you had fit the gamma distributions, you could predict recall for any k and N (given that the dataset did not change)
< mentekid>
well, the gamma distribution for Xk is fit using the arithmetic and geometric mean of kNN distances... And these are obtained from a regressor for a given k and N
< mentekid>
so if we fit it inside Train(), we'll have to repeat the whole process of computing distances etc, which is slow.
< rcurtin>
I thought if you fit it all inside Train(), then you would only need to calculate the distances and fit once
< rcurtin>
instead of doing that every time in Predict()
< mentekid>
Train() calculates the distances and fits the regressor. Predict() runs binary search.
< mentekid>
so Train() is dataset-specific, it only needs to run whenever the dataset changes, not N or k
< rcurtin>
okay, I see
< rcurtin>
I agree with that design
< mentekid>
but, if gammas are also fit inside Train(), it stops being dataset-specific and depends on N and k
< rcurtin>
ah I spoke too soon :)
< mentekid>
so yeah I think best choice is to go with fitting Gammas inside Predict(), keeping Train() dataset-specific
< rcurtin>
hm. so is it correct that if I want to run LSH with some large value of k, then I must calculate E_k and G_k for every k up to that large value?
< mentekid>
hm...
< mentekid>
I would imagine no
< mentekid>
but you might get better results if you do
< rcurtin>
ah but I thought the recall depended on f_k() for all values of k between 0 and k
< rcurtin>
I know I am abusing notation there, sorry
< mentekid>
yeah, it does depend on f_k()
< rcurtin>
ah, I have to turn my phone off for a while... I'll be back in 15/20
< rcurtin>
maybe a few minutes more depending
< rcurtin>
sorry about that
< mentekid>
it's ok, talk to you later :)
< rcurtin>
ok, I'm back now... sorry about that interruption
< rcurtin>
it went a little longer than I thought because I fell asleep :)
< mentekid>
hahaha it's ok
< mentekid>
So I was confused about what you said about f_k()
< mentekid>
How I have it in mind is, I'll fit k+1 distributions, k for 1:k and one for the simple pairwise distances
< rcurtin>
right, I agree with that
< mentekid>
and then in the recall integral, I'll call f_k() with a different k each time
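One term of that integral can be evaluated numerically as an expectation under a fitted gamma density; the recall estimate then sums such terms over k' = 1..k, one density f_{k'} per neighbor rank. This is only a schematic trapezoidal sketch: `p(x)` is a stand-in for the collision probability, and the names and the exact combination of terms are assumptions, not the actual formula from the paper:

```cpp
#include <cmath>
#include <functional>

// Gamma pdf with shape a and rate b, computed in log space for stability.
double GammaPdf(double x, double a, double b)
{
  return std::exp(a * std::log(b) + (a - 1.0) * std::log(x) - b * x
                  - std::lgamma(a));
}

// Trapezoidal approximation of the integral of f(x; a, b) * p(x) over
// (0, hi] -- the kind of term the recall integral accumulates, with one
// fitted gamma density per neighbor rank.
double ExpectedUnderGamma(const std::function<double(double)>& p,
                          double a, double b, double hi, int steps)
{
  const double h = hi / steps;
  double sum = 0.0;
  for (int i = 0; i <= steps; ++i)
  {
    const double x = (i == 0) ? 1e-12 : i * h;  // avoid log(0) at x = 0
    const double w = (i == 0 || i == steps) ? 0.5 : 1.0;  // trapezoid ends
    sum += w * GammaPdf(x, a, b) * p(x);
  }
  return sum * h;
}
```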
< rcurtin>
but then if you did the fitting in Train(), then you would need to accept a k value at that time
< rcurtin>
and if you tried to run Predict() with a larger k, it couldn't work, because you would not have fit the distributions
< rcurtin>
but, if you tried to run Predict() with a smaller k, it would be okay
< rcurtin>
is that correct?
< mentekid>
that's true, I thought about accepting a k in both Train() and Predict(). In Train() the k will have the notion of "this is the largest k I might be interested in"
< mentekid>
and in Predict() it will have the meaning of "to find k nearest neighbors, what parameters do I need?"
< mentekid>
so maybe in Train() I should name the parameter something like maxKValue
< rcurtin>
that's exactly what I was thinking
< rcurtin>
if you pass k > maxKValue in Predict(), then you can call Train() again and call Log::Warn with a message like "larger k requested; retraining required" or something like that
< mentekid>
Right, so we enforce it :)
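The enforcement just discussed might look like this. The class and member names are illustrative stand-ins, not mlpack's actual LSHModel API, and the expensive work is elided:

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Sketch of the k > maxKValue guard: Train() does the dataset-dependent,
// expensive work up to maxKValue; Predict() retrains (with a warning)
// only when asked for a larger k than was prepared.
class LSHModel
{
 public:
  void Train(const std::vector<double>& dataset, size_t maxKValue)
  {
    data = dataset;   // keep a copy so Predict() can trigger a retrain
    maxK = maxKValue;
    // ... compute kNN distances up to maxK, fit regressor (omitted) ...
  }

  void Predict(size_t datasetSize, size_t k)
  {
    if (k > maxK)
    {
      // In mlpack this would go to Log::Warn instead of cerr.
      std::cerr << "larger k requested; retraining required" << std::endl;
      Train(data, k);  // redo the expensive work up to the new k
    }
    // ... fit gammas for (datasetSize, k), run binary search (omitted) ...
    (void)datasetSize;
  }

  size_t MaxK() const { return maxK; }

 private:
  std::vector<double> data;
  size_t maxK = 0;
};
```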
< mentekid>
is there a way when I call other mlpack objects to force them to be quiet? Like calling kNN from LSHModel and telling it not to write to Log::Info?
< mentekid>
it's more of an aesthetics question, nothing practical
< rcurtin>
hm, I guess you could do "Log::Info.ignoreInput = true" and then be sure to restore it to its original value afterwards
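Restoring the flag is easy to get wrong with early returns or exceptions, so one option is an RAII guard around the inner kNN call. This sketch uses a stub stream with an `ignoreInput` flag in the spirit of mlpack's Log::Info rather than the real object, and `SilenceGuard` is a hypothetical name:

```cpp
// Minimal stand-in for a stream carrying an ignoreInput flag, like
// mlpack's Log::Info.
struct StreamStub { bool ignoreInput = false; };

// RAII guard: silence the stream on construction, restore the previous
// value on scope exit, so nesting and early exits stay correct.
class SilenceGuard
{
 public:
  explicit SilenceGuard(StreamStub& s) : stream(s), old(s.ignoreInput)
  {
    stream.ignoreInput = true;  // quiet for the duration of the scope
  }
  ~SilenceGuard() { stream.ignoreInput = old; }  // restore original value

 private:
  StreamStub& stream;
  bool old;
};
```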