verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
govg has quit [Ping timeout: 244 seconds]
govg has joined #mlpack
wiking has quit [Quit: leaving]
wiking has joined #mlpack
wiking has quit [Quit: leaving]
wiking has joined #mlpack
wiking has quit [Quit: ZNC 1.6.3 - http://znc.in]
wiking has joined #mlpack
govg has quit [Ping timeout: 252 seconds]
govg has joined #mlpack
mentekid has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]
< mentekid> rcurtin: Design question - I've implemented the LSHModel::Train() function up to the point where we train the predictor for k and N
< mentekid> next step is fitting gamma distributions to the parameters we've estimated
< mentekid> The thing is, Train() is dataset-dependent, whereas fitting the gammas isn't, strictly speaking: the user could potentially use a dataset to train the model and then repeatedly call Predict() with different parameters for k and N
< mentekid> so, if we put fitting the Gammas in Train(), train will have to be called every time a new k or N is specified
marcosirc has joined #mlpack
< mentekid> on the other hand, we could put fitting the Gammas in Predict(), and they will be fit only for the specific k and N... The disadvantage in this is we have to re-fit them every time we call Predict()
< mentekid> but I think that's not so much of a problem. The work that Train() does now is already quite heavy and we should avoid re-doing it if at all possible.
< mentekid> What do you think?
< mentekid> From what I remember from the Minka paper, and what I saw from my test runs of the GammaDistribution code, it's quite fast... it usually converges after 3-4 iterations
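(The gamma fitting mentioned above can be sketched as follows. This is an illustrative standalone function, not mlpack's actual GammaDistribution API: it computes Minka-style moment statistics, the arithmetic mean and the mean of the logs, and applies the closed-form shape estimate that the iterative fit starts from. The function name is hypothetical.)

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Closed-form initial estimate of the gamma shape parameter alpha, from
// the arithmetic mean and the mean of the logs of the data (i.e. the log
// of the geometric mean). This is the starting point that the fixed-point
// iteration then refines in a few steps.
double EstimateGammaShape(const std::vector<double>& x)
{
  double mean = 0.0, meanLog = 0.0;
  for (const double v : x)
  {
    mean += v;
    meanLog += std::log(v);
  }
  mean /= x.size();
  meanLog /= x.size();

  // s = log(arithmetic mean) - log(geometric mean); s > 0 by the AM-GM
  // inequality whenever the data are not all equal.
  const double s = std::log(mean) - meanLog;
  return (3.0 - s + std::sqrt((s - 3.0) * (s - 3.0) + 24.0 * s)) /
      (12.0 * s);
}
```

For data drawn from a gamma distribution this estimate lands close to the true shape even before any iterations, which is consistent with the fast convergence observed above.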
< rcurtin> right, so you are saying that if we put all of the fitting into Predict(),
< rcurtin> er... I mean, if we put all of the fitting into Train(),
< rcurtin> then a user could call Predict() with some other dataset and get invalid results
< mentekid> well, that too
< mentekid> but I thought more like
< mentekid> call Train() with a dataset and k = 30
< mentekid> if we do all the fitting in Train, then if you decide to go with k = 32, you have to re-do everything
< mentekid> on the other hand, you could use Train() to fit the parameters
< mentekid> and have Predict not even accept a dataset, only a dataset size and a k
< mentekid> So Predict will predict the statistics from the regressor we trained for this datasetSize and k and then run binary search
< mentekid> I meant then fit the gammas and after that run binary search
< rcurtin> hm. I am not sure I fully grasp the issue
< rcurtin> I thought that once you had fit the gamma distributions, you could predict recall for any k and N (given that the dataset did not change)
< mentekid> well, the gamma distribution for Xk is fit using the arithmetic and geometric mean of kNN distances... And these are obtained from a regressor for a given k and N
< mentekid> so if we fit it inside Train(), we'll have to repeat the whole process of computing distances etc, which is slow.
< rcurtin> I thought if you fit it all inside Train(), then you would only need to calculate the distances and fit once
< rcurtin> instead of doing that every time in Predict()
< mentekid> Train() calculates the distances and fits the regressor. Predict() runs binary search.
< mentekid> so Train() is dataset-specific, it only needs to run whenever the dataset changes, not N or k
< rcurtin> okay, I see
< rcurtin> I agree with that design
< mentekid> but, if gammas are also fit inside Train(), it stops being dataset-specific and depends on N and k
< rcurtin> ah I spoke too soon :)
< mentekid> so yeah I think best choice is to go with fitting Gammas inside Predict(), keeping Train() dataset-specific
< rcurtin> hm. so is it correct that if I want to run LSH with some large value of k, then I must calculate E_k and G_k for every k up to that large value?
< mentekid> hm...
< mentekid> I would imagine no
< mentekid> but you might get better results if you do
< rcurtin> ah but I thought the recall depended on f_k() for all values of k between 0 and k
< rcurtin> I know I am abusing notation there, sorry
< mentekid> yeah, it does depend on f_k()
< rcurtin> ah, I have to turn my phone off for a while... I'll be back in 15/20
< rcurtin> maybe a few minutes more depending
< rcurtin> sorry about that
< mentekid> it's ok, talk to you later :)
< rcurtin> ok, I'm back now... sorry about that interruption
< rcurtin> it went a little longer than I thought because I fell asleep :)
< mentekid> hahaha it's ok
< mentekid> So I was confused about what you said about f_k()
< mentekid> How I have it in mind is, I'll fit k+1 distributions, k for 1:k and one for the simple pairwise distances
< rcurtin> right, I agree with that
< mentekid> and then in the recall integral, I'll call f_k() with a different k each time
< rcurtin> but then if you did the fitting in Train(), then you would need to accept a k value at that time
< rcurtin> and if you tried to run Predict() with a larger k, it couldn't work, because you would not have fit the distributions
< rcurtin> but, if you tried to run Predict() with a smaller k, it would be okay
< rcurtin> is that correct?
< mentekid> that's true, I thought about accepting a k in both Train() and Predict(). in Train() the k will have the notion of "this is the largest k I might be interested in"
< mentekid> and in Predict() it will have the meaning of "to find k nearest neighbors, what parameters do I need?"
< mentekid> so maybe in Train() I should name the parameter something like maxKValue
< rcurtin> that's exactly what I was thinking
< rcurtin> if you pass k > maxKValue in Predict(), then you can call Train() again and call Log::Warn with a message like "larger k requested; retraining required" or something like that
< mentekid> Right, so we enforce it :)
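(The design agreed on above can be sketched as a minimal class. This is an illustrative skeleton, not mlpack's actual LSHModel: Train() stands in for the expensive dataset-specific work up to maxKValue, and Predict() forces a retrain when a larger k is requested, as discussed. All names and the TrainCalls() counter are hypothetical, added only to make the retrain behavior observable.)

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>

// Sketch of the Train()/Predict() split: Train() is dataset-specific and
// takes the largest k we might query; Predict() only needs a dataset size
// and a k, and retrains when k exceeds maxKValue.
class LSHModelSketch
{
 public:
  // Stand-in for the expensive work: computing kNN distances and fitting
  // the regressor for all k up to maxKValue.
  void Train(const size_t datasetSize, const size_t maxKValue)
  {
    datasetSize_ = datasetSize;
    maxKValue_ = maxKValue;
    ++trainCalls_;
  }

  // Fit the gammas for this (datasetSize, k) and run the binary search.
  // If k > maxKValue, warn (as Log::Warn would) and retrain first.
  void Predict(const size_t datasetSize, const size_t k)
  {
    if (k > maxKValue_)
    {
      std::cerr << "larger k requested; retraining required" << std::endl;
      Train(datasetSize, k);
    }
    // ... fit the k+1 gamma distributions, then binary-search the
    // parameters for the requested recall ...
  }

  // Exposed only so the retrain behavior can be checked.
  size_t TrainCalls() const { return trainCalls_; }

 private:
  size_t datasetSize_ = 0;
  size_t maxKValue_ = 0;
  size_t trainCalls_ = 0;
};
```

With this shape, calling Predict() with any k up to maxKValue reuses the trained model, and only a larger k (or a new dataset) pays the training cost again.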
< mentekid> is there a way when I call other mlpack objects to force them to be quiet? Like calling kNN from LSHModel and telling it not to write to Log::Info?
< mentekid> it's more of an aesthetics question, nothing practical
< rcurtin> hm, I guess you could do "Log::Info.ignoreInput = false" and then be sure to restore it to its original value afterwards
< mentekid> you mean = true?
< rcurtin> oh... yeah. true :)
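(The "set it and be sure to restore it" pattern above is a natural fit for a small RAII guard. This is a sketch, not part of mlpack: the bool reference stands in for Log::Info.ignoreInput, so the flag is flipped to true on entry and restored to its original value when the guard leaves scope, even if the silenced call throws.)

```cpp
#include <cassert>

// RAII helper: silence a stream's ignoreInput-style flag for the current
// scope and restore the previous value on exit.
class SilenceGuard
{
 public:
  explicit SilenceGuard(bool& ignoreInput) :
      flag_(ignoreInput),
      oldValue_(ignoreInput)
  {
    flag_ = true;  // silence the stream
  }

  ~SilenceGuard() { flag_ = oldValue_; }  // restore on scope exit

  // Non-copyable: two guards restoring the same flag would race.
  SilenceGuard(const SilenceGuard&) = delete;
  SilenceGuard& operator=(const SilenceGuard&) = delete;

 private:
  bool& flag_;
  bool oldValue_;
};
```

Usage would be to construct the guard just before the nested kNN call from LSHModel and let it restore the flag automatically, so no code path can forget the cleanup.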
govg has quit [Ping timeout: 250 seconds]
mentekid has quit [Ping timeout: 264 seconds]
govg has joined #mlpack
mentekid has joined #mlpack
nilay has joined #mlpack
mentekid has quit [Ping timeout: 250 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Client Quit]
govg has quit [Ping timeout: 258 seconds]
govg has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]