#mlpack on 2014-05-26 — irc logs at libera.irclog.whitequark.org

2014-05-21 16:24 naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/

03:27 udit_s_ has joined #mlpack

05:34 < jenkins-mlpack> Project mlpack - nightly matrix build build #466: STILL UNSTABLE in 1 hr 33 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/466/

09:44 sumedhghaisas has joined #mlpack

10:28 sumedhghaisas has quit [Ping timeout: 245 seconds]

10:37 sumedhghaisas has joined #mlpack

11:42 sumedhghaisas has quit [Ping timeout: 258 seconds]

11:42 sumedhghaisas has joined #mlpack

11:57 sumedhghaisas has quit [Ping timeout: 264 seconds]

11:58 sumedhghaisas has joined #mlpack

12:24 sumedhghaisas has quit [Ping timeout: 245 seconds]

12:28 sumedhghaisas has joined #mlpack

13:25 oldbeardo has joined #mlpack

13:47 < naywhayare> oldbeardo: did you figure out what lowBound() was?

14:13 andrewmw94 has joined #mlpack

14:35 < oldbeardo> naywhayare: no, I didn't

14:39 < oldbeardo> naywhayare: I can see that it has something to do with confidence intervals, but not sure what exactly

14:41 < naywhayare> ok; I'm about to head to lab... when I get there I will read over it and get you an answer

14:41 < naywhayare> I'm sorry for the delay; I was out of town for the weekend

14:41 < oldbeardo> naywhayare: not a problem

15:00 sumedhghaisas has quit [Ping timeout: 252 seconds]

15:03 sumedhghaisas has joined #mlpack

15:08 < naywhayare> oldbeardo: in step 2 of the algorithm, a monte carlo estimate of the squared magnitude of each sampled row's projection onto V ( || S_i V ||^2_F )

15:08 < naywhayare> and we do this several times (s times, to be exact)

15:09 < naywhayare> the idea is that we want a bound on the error that is such that with probability at least 1 - \delta, || A - A V V^T ||_2^F \le sqErr

15:09 < naywhayare> so what's done in step 3 is the fitting of a gaussian; first \mu is calculated, then \sigma^2 is calculated

15:09 < naywhayare> all with standard formulas

15:09 sumedhghaisas has quit [Ping timeout: 252 seconds]

15:10 < oldbeardo> yeah, I understood all of that

15:10 < naywhayare> so this gives you a gaussian, and I think lowBound() will be the point in that Gaussian PDF curve with probability \delta

15:11 < naywhayare> sorry, the point on the CDF curve at \delta... so, the point in the Gaussian where the probability mass to the left is equal to \delta

15:11 < naywhayare> does that make sense?

15:12 < oldbeardo> yes, shouldn't that be delta / 2?

15:12 < oldbeardo> also, I don't know how to compute that

15:12 < naywhayare> I'm trying to find the right function to use to compute that... give me a few moments

15:12 < oldbeardo> and what is the role of s in that computation?

15:12 < naywhayare> I think it should be \delta and not \delta / 2 because I think we only want the left tail

15:14 < andrewmw94> excuse me for jumping in, but it sounds like the question is just whether the confidence interval is one-sided or two-sided

15:14 < naywhayare> yeah, that's basically it, but I believe it to be a one-sided interval

15:14 < naywhayare> if you're interested, the algorithm in question is Algorithm 3 of the paper "QUIC-SVD: Fast SVD Using Cosine Trees"

15:15 < oldbeardo> naywhayare: in the final lines of Theorem 1, they have given a reference

15:16 < naywhayare> ooh, thanks; I didn't see that. let me see if that clarifies anything

15:16 < oldbeardo> I looked that up, I could see formulae mentioning z_(alpha/2), hence I asked

15:18 < naywhayare> in this algorithm, we're trying to produce an upper bound on sqError with confidence 1 - \delta

15:19 < naywhayare> but an upper bound on sqError is a lower bound on magSqLB

15:20 < naywhayare> so all we need to say is, 'with probability 1 - \delta, magSqLB is greater than this value', which is a one-sided confidence interval

15:21 < oldbeardo> right, I get it

15:21 < naywhayare> now I just need to find the right formula to use... I don't think it's the standard gaussian phi() function because we have biased parameter estimates

15:21 < naywhayare> I once had this formula memorized for a test, but that was years ago and I've forgotten it since then...

15:28 < oldbeardo> naywhayare: so what do I do?

15:29 < andrewmw94> " now I just need to find the right formula to use... I don't think it's the standard gaussian phi() function because we have biased parameter estimates" Do you mean you should use a T test?

15:29 < naywhayare> no, we need to de-bias the gaussian parameter estimates then use the phi() functino

15:30 < naywhayare> I was trying to find the formula I was looking for to de-bias the variance estimate; it can be found here under the "Bias Correction" section: samlping bias

15:30 < naywhayare> oops, that was a bad paste

15:30 < naywhayare> http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

15:31 < andrewmw94> ahh, it's the same but with n-1 in the denominator iirc

15:31 < naywhayare> so divide the sample variance by c_4(n)^2 and then you have an unbiased estimator of the variance

15:31 < naywhayare> I think boost::math has an implementation of the gamma function... let me find it

15:32 < naywhayare> http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/sf_gamma/tgamma.html

15:34 < oldbeardo> what is c_4(n)^2?

15:34 < naywhayare> c_4(n) = \sqrt(2 / (n - 1)) * \Gamma(n / 2) / \Gamma((n - 1) / 2)

15:34 < naywhayare> where \Gamma() is implemented as tgamma() in the reference to the boost documentation I gave, and 'n' is the number of points sampled

15:35 < naywhayare> c_4(n) is defined on the wikipedia page I linked to under the 'bias correction' section

15:38 < naywhayare> once you have the unbiased parameters, you can then find the critical value for the given value of \delta, again using boost::math: http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/stat_tut/weg/normal_example/normal_misc.html

15:38 < naywhayare> it looks like the quantile() function is the right one to use, once you have created a 'normal' object with the unbiased mean and standard deviation

15:39 < jenkins-mlpack> Starting build #1917 for job mlpack - svn checkin test (previous build: SUCCESS)

15:46 < oldbeardo> okay, thanks for that

15:48 < naywhayare> sorry, I kept saying 'phi() function' but actually the phi() function is the gaussian PDF

15:48 < naywhayare> and what I really meant was the inverse of the gaussian CDF

15:52 < oldbeardo> this is still a little confusing to me, why can't we use the stats() thing in armadillo

15:52 < naywhayare> oh, hm, I didn't know about that. what functionality does it have?

15:57 < oldbeardo> http://arma.sourceforge.net/docs.html#stats_fns

15:58 < oldbeardo> I meant only the 'mean' and 'stddev'

16:04 < naywhayare> yes, I'm sorry. that was what I was looking for to begin with, not the complex c_4(n) de-biasing steps

16:04 < naywhayare> I was thinking that was way more complicated than what I remember...

16:05 < naywhayare> once you obtain the mean and variance estimates with armadillo, though, you'll still need to use the boost functionality to get the inverse of the gaussian CDF

16:07 < oldbeardo> let's say I calculate the mean and std_dev using those functions, how does 's' become useful then?

16:07 < naywhayare> according to our derivation, 's' is only useful for de-biasing the estimators, and if you've already done that, then it's no longer useful

16:08 < oldbeardo> to be frank, I still don't understand what the lowBound() value represents?

16:08 < oldbeardo> *no question mark

16:09 < naywhayare> with probability 1 - \delta, the value of || S_i V ||_F^2 for any sampled row S_i is greater than the lowBound() value

16:09 < naywhayare> that is my understanding of it

16:12 < oldbeardo> okay, I understood that, wrong framing, I didn't understand how the lowBound() value is computed

16:12 < jenkins-mlpack> Project mlpack - svn checkin test build #1917: SUCCESS in 33 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1917/

16:12 < jenkins-mlpack> * andrewmw94: more R tree stuff. Still no build

16:12 < jenkins-mlpack> * andrewmw94: more code for the RectangleTree. Still not built yet.

16:13 < naywhayare> it's a one-sided confidence interval

16:14 < naywhayare> basically, we have a gaussian PDF defined by the mean and variance that's calculated just before the call to lowBound()

16:15 < naywhayare> roughly, this gaussian represents the estimated distribution of the values || S_i V ||_2^F for arbitrary sampled rows S_i

16:16 < naywhayare> so we need the value of the gaussian PDF where the mass of the probability to the left is (\delta * 100) percent of the total probability mass

16:16 < naywhayare> this is done with the inverse gaussian CDF

16:19 < naywhayare> the Boost documentation I linked to should provide some information on how to calculate the inverse gaussian CDF value for a given \delta

16:19 < naywhayare> I think the quartile() function is the right one

16:21 sumedhghaisas has joined #mlpack

16:22 < oldbeardo> fine, I think I get it, so basically all the computation is really being done in the quantile() function

16:25 < naywhayare> yes, boost::math should do most of the hard work for that

16:26 sumedhghaisas has quit [Ping timeout: 245 seconds]

16:26 < oldbeardo> do I need to include anything separately for that or is it included in mlpack by default?

16:27 < naywhayare> you'll need to include the right boost::math header file, but CMake is already set up to do the linking (if necessary) against boost::math

16:45 sumedhghaisas has joined #mlpack

16:50 sumedhghaisas has quit [Ping timeout: 240 seconds]

17:08 sumedhghaisas has joined #mlpack

17:21 udit_s_ has quit [Quit: Ex-Chat]

17:28 oldbeardo has quit [Quit: Page closed]

18:01 sumedhghaisas has quit [Ping timeout: 252 seconds]

19:26 sumedhghaisas has joined #mlpack

19:29 < sumedhghaisas> naywhayare: did you had a chance to look at LMF code??

19:30 < naywhayare> I have been periodically looking through it today

19:30 < naywhayare> I was gone over the weekend; I thought I would have time to do it before I left, but I had to clean my house up before I left

19:30 < naywhayare> I'll have my review done tonight and I'll send you an email (it'll be too much information for IRC, probably)

19:30 < sumedhghaisas> yeah sure :)

19:31 < sumedhghaisas> not a problem...

20:27 < andrewmw94> naywhayare: is there an efficient way to add one column to an arma::mat ?

20:28 < naywhayare> andrewmw94: not really, but it's not entirely because of armadillo

20:28 < naywhayare> when you need larger memory, you have to reallocate. you can call 'realloc()' but it's not guaranteed that it will get you the same part of memory, and if it doesn't, it copies everything anyway

20:32 < naywhayare> so unfortunately for a point insertion, you often can't do better than the usual insert_rows() or insert_cols()

20:32 < andrewmw94> ahh. Hmm. So, how does the base case code currently work. Will it be possible to have a matrix that's only partially used. eg. the first 9 columns

20:32 < andrewmw94> or something like that

20:32 < naywhayare> right now BaseCase() takes the indices of the query point and the reference point

20:33 < naywhayare> but if each node is storing the points, then it's reasonable to refactor things so that BaseCase() actually takes the points themselves

20:33 < naywhayare> whether they're of type arma::vec, arma::sp_vec, arma::subview, etc.

20:34 < andrewmw94> well, refernce points should work too. I have to store the number of points in the current node. So I could allocate memory for say 20 points in each leaf node, and then only use the first n cols of the matrix

20:35 < naywhayare> yes, you could do that

20:35 < andrewmw94> alright, that's probably the fastest

20:37 < naywhayare> I think this could mean that the memory footprint of a tree could be larger than O(N)

20:37 < naywhayare> but I'm not sure exactly what that would be

20:37 < naywhayare> I don't think that's a big deal, though; if it's a sacrifice that has to be made for gains in runtime, it is probably worth it

20:38 < andrewmw94> it could be larger, but I think that's standard for R tree's. There's a parameter for the minimum gauranteed fill

20:38 < andrewmw94> of each node

20:53 sumedhghaisas has quit [Ping timeout: 265 seconds]

21:17 sumedhghaisas has joined #mlpack

21:21 andrewmw94 has left #mlpack []

21:26 < jenkins-mlpack> Starting build #1918 for job mlpack - svn checkin test (previous build: SUCCESS)

21:59 < jenkins-mlpack> Project mlpack - svn checkin test build #1918: SUCCESS in 33 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1918/

21:59 < jenkins-mlpack> andrewmw94: more R tree stuff. Still no build

22:39 sumedhghaisas has quit [Ping timeout: 240 seconds]

22:49 sumedhghaisas has joined #mlpack

23:47 sumedhghaisas has quit [Ping timeout: 252 seconds]

23:47 sumedhghaisas has joined #mlpack