naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
udit_s_ has joined #mlpack
< jenkins-mlpack> Project mlpack - nightly matrix build build #466: STILL UNSTABLE in 1 hr 33 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20nightly%20matrix%20build/466/
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 245 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 258 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 264 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 245 seconds]
sumedhghaisas has joined #mlpack
oldbeardo has joined #mlpack
< naywhayare> oldbeardo: did you figure out what lowBound() was?
andrewmw94 has joined #mlpack
< oldbeardo> naywhayare: no, I didn't
< oldbeardo> naywhayare: I can see that it has something to do with confidence intervals, but not sure what exactly
< naywhayare> ok; I'm about to head to lab... when I get there I will read over it and get you an answer
< naywhayare> I'm sorry for the delay; I was out of town for the weekend
< oldbeardo> naywhayare: not a problem
sumedhghaisas has quit [Ping timeout: 252 seconds]
sumedhghaisas has joined #mlpack
< naywhayare> oldbeardo: in step 2 of the algorithm, a monte carlo estimate of the squared magnitude of each sampled row's projection onto V ( || S_i V ||_F^2 ) is computed
< naywhayare> and we do this several times (s times, to be exact)
< naywhayare> the idea is that we want a bound on the error such that, with probability at least 1 - \delta, || A - A V V^T ||_F^2 \le sqErr
< naywhayare> so what's done in step 3 is the fitting of a gaussian; first \mu is calculated, then \sigma^2 is calculated
< naywhayare> all with standard formulas
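For concreteness, the "standard formulas" presumably meant here are the sample mean and the Bessel-corrected sample variance of the s Monte Carlo estimates; a sketch in the notation above:

```latex
\mu = \frac{1}{s} \sum_{i=1}^{s} x_i,
\qquad
\sigma^2 = \frac{1}{s - 1} \sum_{i=1}^{s} \left( x_i - \mu \right)^2,
\quad \text{where } x_i = \| S_i V \|_F^2 .
```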
sumedhghaisas has quit [Ping timeout: 252 seconds]
< oldbeardo> yeah, I understood all of that
< naywhayare> so this gives you a gaussian, and I think lowBound() will be the point in that Gaussian PDF curve with probability \delta
< naywhayare> sorry, the point on the CDF curve at \delta... so, the point in the Gaussian where the probability mass to the left is equal to \delta
< naywhayare> does that make sense?
< oldbeardo> yes, shouldn't that be delta / 2?
< oldbeardo> also, I don't know how to compute that
< naywhayare> I'm trying to find the right function to use to compute that... give me a few moments
< oldbeardo> and what is the role of s in that computation?
< naywhayare> I think it should be \delta and not \delta / 2 because I think we only want the left tail
< andrewmw94> excuse me for jumping in, but it sounds like the question is just whether the confidence interval is one-sided or two-sided
< naywhayare> yeah, that's basically it, but I believe it to be a one-sided interval
< naywhayare> if you're interested, the algorithm in question is Algorithm 3 of the paper "QUIC-SVD: Fast SVD Using Cosine Trees"
< oldbeardo> naywhayare: in the final lines of Theorem 1, they have given a reference
< naywhayare> ooh, thanks; I didn't see that. let me see if that clarifies anything
< oldbeardo> I looked that up and could see formulae mentioning z_(alpha/2); hence I asked
< naywhayare> in this algorithm, we're trying to produce an upper bound on sqError with confidence 1 - \delta
< naywhayare> but an upper bound on sqError is a lower bound on magSqLB
< naywhayare> so all we need to say is, 'with probability 1 - \delta, magSqLB is greater than this value', which is a one-sided confidence interval
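A sketch of that equivalence, under the assumption (implied above, and standard for orthogonal projections) that the squared error and the captured magnitude sum to the fixed total || A ||_F^2:

```latex
\|A - A V V^T\|_F^2 = \|A\|_F^2 - \|A V V^T\|_F^2
\quad\Longrightarrow\quad
P\big( \|A V V^T\|_F^2 \ge \mathrm{magSqLB} \big) \ge 1 - \delta
\;\Longleftrightarrow\;
P\big( \|A - A V V^T\|_F^2 \le \|A\|_F^2 - \mathrm{magSqLB} \big) \ge 1 - \delta .
```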
< oldbeardo> right, I get it
< naywhayare> now I just need to find the right formula to use... I don't think it's the standard gaussian phi() function because we have biased parameter estimates
< naywhayare> I once had this formula memorized for a test, but that was years ago and I've forgotten it since then...
< oldbeardo> naywhayare: so what do I do?
< andrewmw94> " now I just need to find the right formula to use... I don't think it's the standard gaussian phi() function because we have biased parameter estimates" Do you mean you should use a T test?
< naywhayare> no, we need to de-bias the gaussian parameter estimates and then use the phi() function
< naywhayare> I was trying to find the formula to de-bias the variance estimate; it can be found here under the "Bias Correction" section: samlping bias
< naywhayare> oops, that was a bad paste
< andrewmw94> ahh, it's the same but with n-1 in the denominator iirc
< naywhayare> so divide the sample standard deviation by c_4(n) and then you have an unbiased estimator of the standard deviation
< naywhayare> I think boost::math has an implementation of the gamma function... let me find it
< oldbeardo> what is c_4(n)?
< naywhayare> c_4(n) = \sqrt(2 / (n - 1)) * \Gamma(n / 2) / \Gamma((n - 1) / 2)
< naywhayare> where \Gamma() is implemented as tgamma() in the reference to the boost documentation I gave, and 'n' is the number of points sampled
< naywhayare> c_4(n) is defined on the wikipedia page I linked to under the 'bias correction' section
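A minimal sketch of that correction factor, assuming boost::math::tgamma() as suggested above (for large n one would switch to lgamma() to avoid overflow):

```cpp
#include <boost/math/special_functions/gamma.hpp>
#include <cmath>

// c_4(n) = sqrt(2 / (n - 1)) * Gamma(n / 2) / Gamma((n - 1) / 2).
double c4(const size_t n)
{
  return std::sqrt(2.0 / (n - 1)) * boost::math::tgamma(n / 2.0) /
      boost::math::tgamma((n - 1) / 2.0);
}
```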
< naywhayare> once you have the unbiased parameters, you can then find the critical value for the given value of \delta, again using boost::math: http://www.boost.org/doc/libs/1_55_0/libs/math/doc/html/math_toolkit/stat_tut/weg/normal_example/normal_misc.html
< naywhayare> it looks like the quantile() function is the right one to use, once you have created a 'normal' object with the unbiased mean and standard deviation
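A minimal sketch of that last step; the helper name is hypothetical (this is not the actual QUIC-SVD lowBound() implementation), and quantile() is boost::math's inverse CDF:

```cpp
#include <boost/math/distributions/normal.hpp>

// Returns the point with probability mass 'delta' to its left under a
// Gaussian with the given (de-biased) parameters.
double GaussianLowerBound(const double mean,
                          const double stddev,
                          const double delta)
{
  boost::math::normal dist(mean, stddev);
  return boost::math::quantile(dist, delta); // inverse CDF at delta
}
```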
< jenkins-mlpack> Starting build #1917 for job mlpack - svn checkin test (previous build: SUCCESS)
< oldbeardo> okay, thanks for that
< naywhayare> sorry, I kept saying 'phi() function' but actually the phi() function is the gaussian PDF
< naywhayare> and what I really meant was the inverse of the gaussian CDF
< oldbeardo> this is still a little confusing to me; why can't we use the stats() thing in armadillo?
< naywhayare> oh, hm, I didn't know about that. what functionality does it have?
< oldbeardo> I meant only the 'mean' and 'stddev'
< naywhayare> yes, I'm sorry. that was what I was looking for to begin with, not the complex c_4(n) de-biasing steps
< naywhayare> I was thinking that was way more complicated than what I remember...
< naywhayare> once you obtain the mean and variance estimates with armadillo, though, you'll still need to use the boost functionality to get the inverse of the gaussian CDF
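Putting the two pieces together, a hedged sketch (names here are hypothetical): Armadillo for the sample statistics, boost::math for the inverse Gaussian CDF.

```cpp
#include <armadillo>
#include <boost/math/distributions/normal.hpp>

// 'samples' would hold the s Monte Carlo estimates of || S_i V ||_F^2.
double LowBound(const arma::vec& samples, const double delta)
{
  const double mu = arma::mean(samples);
  const double sigma = arma::stddev(samples); // n - 1 denominator by default
  boost::math::normal dist(mu, sigma);
  return boost::math::quantile(dist, delta);
}
```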
< oldbeardo> let's say I calculate the mean and std_dev using those functions, how does 's' become useful then?
< naywhayare> according to our derivation, 's' is only useful for de-biasing the estimators, and if you've already done that, then it's no longer useful
< oldbeardo> to be frank, I still don't understand what the lowBound() value represents
< naywhayare> with probability 1 - \delta, the value of || S_i V ||_F^2 for any sampled row S_i is greater than the lowBound() value
< naywhayare> that is my understanding of it
< oldbeardo> okay, I understood that; I framed it badly. What I didn't understand is how the lowBound() value is computed
< jenkins-mlpack> Project mlpack - svn checkin test build #1917: SUCCESS in 33 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1917/
< jenkins-mlpack> * andrewmw94: more R tree stuff. Still no build
< jenkins-mlpack> * andrewmw94: more code for the RectangleTree. Still not built yet.
< naywhayare> it's a one-sided confidence interval
< naywhayare> basically, we have a gaussian PDF defined by the mean and variance that's calculated just before the call to lowBound()
< naywhayare> roughly, this gaussian represents the estimated distribution of the values || S_i V ||_F^2 for arbitrary sampled rows S_i
< naywhayare> so we need the point on that gaussian where the probability mass to the left is (\delta * 100) percent of the total probability mass
< naywhayare> this is done with the inverse of the gaussian CDF
< naywhayare> the Boost documentation I linked to should provide some information on how to calculate the inverse of the gaussian CDF for a given \delta
< naywhayare> I think the quantile() function is the right one
sumedhghaisas has joined #mlpack
< oldbeardo> fine, I think I get it, so basically all the computation is really being done in the quantile() function
< naywhayare> yes, boost::math should do most of the hard work for that
sumedhghaisas has quit [Ping timeout: 245 seconds]
< oldbeardo> do I need to include anything separately for that or is it included in mlpack by default?
< naywhayare> you'll need to include the right boost::math header file, but CMake is already set up to do the linking (if necessary) against boost::math
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 240 seconds]
sumedhghaisas has joined #mlpack
udit_s_ has quit [Quit: Ex-Chat]
oldbeardo has quit [Quit: Page closed]
sumedhghaisas has quit [Ping timeout: 252 seconds]
sumedhghaisas has joined #mlpack
< sumedhghaisas> naywhayare: did you have a chance to look at the LMF code?
< naywhayare> I have been periodically looking through it today
< naywhayare> I was gone over the weekend; I thought I would have time to do it before I left, but I had to clean my house up instead
< naywhayare> I'll have my review done tonight and I'll send you an email (it'll be too much information for IRC, probably)
< sumedhghaisas> yeah sure :)
< sumedhghaisas> not a problem...
< andrewmw94> naywhayare: is there an efficient way to add one column to an arma::mat ?
< naywhayare> andrewmw94: not really, but it's not entirely because of armadillo
< naywhayare> when you need larger memory, you have to reallocate. you can call 'realloc()' but it's not guaranteed that it will get you the same part of memory, and if it doesn't, it copies everything anyway
< naywhayare> so unfortunately for a point insertion, you often can't do better than the usual insert_rows() or insert_cols()
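For reference, a minimal sketch of what that looks like with Armadillo; insert_cols() reallocates and copies the existing elements, which is the cost being described:

```cpp
#include <armadillo>

int main()
{
  arma::mat X = arma::randu<arma::mat>(3, 5);
  arma::vec newCol = arma::randu<arma::vec>(3);
  // Appends one column; existing elements are copied to new storage.
  X.insert_cols(X.n_cols, newCol);
  X.print("X after insertion:");
  return 0;
}
```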
< andrewmw94> ahh, hmm. So, how does the base case code currently work? Will it be possible to have a matrix that's only partially used, e.g. the first 9 columns
< andrewmw94> or something like that
< naywhayare> right now BaseCase() takes the indices of the query point and the reference point
< naywhayare> but if each node is storing the points, then it's reasonable to refactor things so that BaseCase() actually takes the points themselves
< naywhayare> whether they're of type arma::vec, arma::sp_vec, arma::subview, etc.
< andrewmw94> well, reference points should work too. I have to store the number of points in the current node. So I could allocate memory for, say, 20 points in each leaf node, and then only use the first n cols of the matrix
< naywhayare> yes, you could do that
< andrewmw94> alright, that's probably the fastest
< naywhayare> I think this could mean that the memory footprint of a tree could be larger than O(N)
< naywhayare> but I'm not sure exactly what that would be
< naywhayare> I don't think that's a big deal, though; if it's a sacrifice that has to be made for gains in runtime, it is probably worth it
< andrewmw94> it could be larger, but I think that's standard for R-trees. There's a parameter for the minimum guaranteed fill
< andrewmw94> of each node
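A hedged sketch of the leaf layout described above (names are hypothetical, not mlpack's actual RectangleTree code): allocate maxLeafSize columns up front and track how many are in use.

```cpp
#include <armadillo>

// Hypothetical leaf node; not mlpack's actual RectangleTree API.
struct LeafNode
{
  arma::mat points; // dim x maxLeafSize, allocated once
  size_t count;     // number of columns actually in use

  LeafNode(const size_t dim, const size_t maxLeafSize = 20) :
      points(dim, maxLeafSize), count(0) { }

  // No reallocation happens until the leaf fills and must be split.
  void Insert(const arma::vec& point) { points.col(count++) = point; }
};
```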
sumedhghaisas has quit [Ping timeout: 265 seconds]
sumedhghaisas has joined #mlpack
andrewmw94 has left #mlpack []
< jenkins-mlpack> Starting build #1918 for job mlpack - svn checkin test (previous build: SUCCESS)
< jenkins-mlpack> Project mlpack - svn checkin test build #1918: SUCCESS in 33 min: http://big.cc.gt.atl.ga.us:8080/job/mlpack%20-%20svn%20checkin%20test/1918/
< jenkins-mlpack> andrewmw94: more R tree stuff. Still no build
sumedhghaisas has quit [Ping timeout: 240 seconds]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 252 seconds]
sumedhghaisas has joined #mlpack