verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
dhfromkorea has joined #mlpack
dhfromkorea has quit [Remote host closed the connection]
dhfromkorea has joined #mlpack
stephentu_ has quit [Ping timeout: 245 seconds]
< naywhayare> Squalluca: LARS stores the gram matrix internally (so, 189390x189390, which will be very large...)
< naywhayare> or, hang on... maybe 13233x13233 depending on if 13233 is the number of dimensions or the number of points
< Squalluca> 13233 is the number of points
< Squalluca> the bigger number is the number of dimensions
< naywhayare> hmm, I think it will be 13233x13233 then, so that should not be a problem
< Squalluca> so maybe i am passing the wrong matrix to LARS
< naywhayare> I think the bigger issue for RAM might be that LARS is storing the complete solution path (accessible via BetaPath()), which is std::vector<arma::vec> and each vector has length equal to the number of dimensions
< Squalluca> i understand; are these values all used in the computation, or are they stored for some other reason?
< Squalluca> i guess they are needed if you keep them
< naywhayare> I'm not the one who wrote LARS
< naywhayare> I'm taking a look at it now
< naywhayare> I don't *think* they're necessary, but let me glance at it some more...
< Squalluca> ok, thank you very much :D
< naywhayare> hm, okay, so what I think is that only the last two elements of betaPath are ever used
< naywhayare> in InterpolateBeta() (which is called once LARS converges), to calculate the final solution vector beta
< naywhayare> the code could probably be refactored to only hold the two most recent betas fairly easily (only lars.cpp uses betaPath)
< naywhayare> unfortunately I have a paper deadline on Friday so I can't look into it further, but my best guess is that that is where the huge memory usage is coming from
< naywhayare> once the paper deadline passes I'll have a little bit more time...
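A minimal sketch of the refactor naywhayare describes: keep only the two most recent solution vectors instead of the whole path. The names below are illustrative, not the actual lars.cpp members.

    #include <armadillo>
    #include <utility>

    // Instead of betaPath.push_back(beta) at every LARS iteration, keep just
    // the previous and current solutions, which is all InterpolateBeta() looks
    // at.  Memory drops from (iterations x dims) to (2 x dims).
    arma::vec previousBeta;
    arma::vec currentBeta;

    void StoreBeta(const arma::vec& newBeta)
    {
      previousBeta = std::move(currentBeta);
      currentBeta = newBeta;
    }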
< Squalluca> the bool parameter is referred to by 2 different names
< Squalluca> transposeData and rowMajor
< naywhayare> oops, that's a documentation bug... let me fix that quickly
< naywhayare> the variable should be called transposeData
< Squalluca> so i should set it to false if my matrix is row-major?
< naywhayare> yes, but be aware that if you load data with mlpack::data::Load() and it is row-major on disk, the matrix will be transposed by default
< Squalluca> no i am getting data from an opencv matrix
< naywhayare> ah, okay
< naywhayare> fixed the documentation, thanks for pointing it out -- https://github.com/mlpack/mlpack/commit/b4c08074ca03feaf38511e62fd0e928330b99d93
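For context on the transposition naywhayare mentions, a small sketch of the default loading behavior (the filename is a placeholder):

    #include <mlpack/core.hpp>

    arma::mat dataset;
    // data::Load() transposes by default, so a row-major file on disk (one
    // point per row) ends up with one point per column in memory.
    mlpack::data::Load("data.csv", dataset);

Data coming straight from an OpenCV matrix skips this step, so the transposeData flag has to be set to match however that matrix is laid out.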
< Squalluca> no problem, and thanks, i'll look into it more, because i think the Gram matrix is 189390x189390; that is the size that crashes my program
< Squalluca> when it grows
< Squalluca> thank you again, you have been very helpful, cya.
< naywhayare> yeah, the gram matrix should be (number of points) x (number of points), unless I've got my logic backwards
< naywhayare> crap! I do have it backwards. so the gram matrix is 189390x189390
< naywhayare> there's not a very easy way to solve that problem unless you have 267GB of RAM or so :)
< naywhayare> and since my logic is backwards, each element in betaPath will be of length (number of points), so that's probably not the bulk of your memory usage
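As a rough check on that figure: a dense 189390x189390 matrix of doubles is 189390^2 * 8 ≈ 2.87 * 10^11 bytes, i.e. about 287 GB (267 GiB), before counting anything else LARS allocates.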
< Squalluca> mhhh
< Squalluca> : (
< Squalluca> i'll try to figure something out
< naywhayare> sorry for the bad news...
< naywhayare> refactoring LARS to not require the whole Gram matrix in memory at once would be a very significant effort
< naywhayare> it might be better to think about doing some dimensionality reduction like PCA first, maybe?
< naywhayare> I don't know what the application is, so I don't know if that's a good or bad idea
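A minimal sketch of the PCA suggestion using mlpack's PCA class; the target dimension (500) is just an illustrative choice, and with far more dimensions than points the PCA step may itself need some care:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/pca/pca.hpp>

    // One point per column: 189390 dimensions x 13233 points.
    arma::mat data;
    mlpack::data::Load("faces.csv", data);

    // Project down to a much smaller dimensionality before running LARS.
    mlpack::pca::PCA pca;
    pca.Apply(data, 500);  // keep the top 500 principal components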
< Squalluca> it is work based on a paper called "blessing of dimensionality"; it uses regression to map from a really high-dimensional space down to a low-dimensional one, for face recognition
< Squalluca> it is intended to work with high dimensionality
< Squalluca> in fact they use a coordinate descent method for regression, maybe that doesn't use the gram matrix?
< naywhayare> yeah, techniques that work in extremely high dimensions should definitely avoid calculating the explicit gram matrix
< naywhayare> is this the paper by D. Chen et al. at CVPR 2013? it looks interesting
< Squalluca> yes
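To make the earlier point concrete, a generic textbook-style coordinate descent for the lasso (not code from the paper or from mlpack): each update only needs one column of X and the current residual, so the dims x dims Gram matrix is never formed.

    #include <armadillo>

    // Minimize 0.5 * ||y - X*beta||^2 + lambda * ||beta||_1 by cyclic
    // coordinate descent.  X is (points x dims).
    arma::vec CoordinateDescentLasso(const arma::mat& X, const arma::vec& y,
                                     const double lambda, const size_t maxIter = 100)
    {
      const size_t dims = X.n_cols;
      arma::vec beta(dims, arma::fill::zeros);
      arma::vec residual = y;  // y - X * beta, with beta = 0 initially
      arma::vec colNorms = arma::sum(arma::square(X), 0).t();  // ||x_j||^2

      for (size_t iter = 0; iter < maxIter; ++iter)
      {
        for (size_t j = 0; j < dims; ++j)
        {
          if (colNorms[j] == 0.0)
            continue;
          const double oldBeta = beta[j];
          // Correlation of column j with the partial residual (coordinate j removed).
          const double rho = arma::dot(X.col(j), residual) + colNorms[j] * oldBeta;
          // Soft-thresholding update.
          double newBeta = 0.0;
          if (rho > lambda)
            newBeta = (rho - lambda) / colNorms[j];
          else if (rho < -lambda)
            newBeta = (rho + lambda) / colNorms[j];
          beta[j] = newBeta;
          if (newBeta != oldBeta)
            residual -= X.col(j) * (newBeta - oldBeta);
        }
      }
      return beta;
    }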
stephentu_ has joined #mlpack
Squalluca has quit [Quit: Page closed]
dhfromkorea has quit [Remote host closed the connection]
stephentu_ has quit [Ping timeout: 252 seconds]
< stephentu> naywhayare: so i downloaded arma-4.0 from sourceforge
< stephentu> did make
< stephentu> and then i built mlpack via
< stephentu> cmake -DARMADILLO_LIBRARY=... -DARMADILLO_INCLUDE_DIR=...
< stephentu> oh i also modified the config.hpp to match my system-installed one as closely as possible
< stephentu> and now i get
< stephentu> but it's weird b/c my system config.hpp
< stephentu> uses this wrapper stuff
< stephentu> any hints
< stephentu> i'm on arch linux
< stephentu> slash if you have a better process
< stephentu> for building w/ older versions
< stephentu> i'm all ears
jbc__ has quit [Quit: jbc__]
jbc_ has joined #mlpack
dhfromkorea has joined #mlpack
dhfromkorea has quit [Ping timeout: 265 seconds]
curiousguy13 has quit [Read error: Connection timed out]
< naywhayare> stephentu: what did you set -DARMADILLO_LIBRARY to?
< naywhayare> it should be the path to libarmadillo.so, not the path to the directory containing it
< naywhayare> (that's my first guess)
< naywhayare> oh
< naywhayare> I bet I know what's gone wrong
< naywhayare> so, this is kind of an oddity and I'm not responsible for it. during the armadillo build, CMake generates a config.hpp and places it, along with armadillo and armadillo_bits/, into "tmp/include/"
< naywhayare> so ARMADILLO_INCLUDE_DIR should be /path/to/armadillo/tmp/include/, if you only did 'make' and not 'make install'
< naywhayare> but the library is still in /path/to/armadillo/libarmadillo.so...
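So, assuming an un-installed armadillo build, the configure step would look roughly like this (the paths are placeholders for wherever armadillo-4.0 was unpacked and built):

    cmake -DARMADILLO_INCLUDE_DIR=/path/to/armadillo-4.0/tmp/include \
          -DARMADILLO_LIBRARY=/path/to/armadillo-4.0/libarmadillo.so ..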
< stephentu> oooo
< stephentu> that might explain it!
< stephentu> ill try a make install
dhfromkorea has joined #mlpack
dhfromkorea has quit [Ping timeout: 265 seconds]
jbc_ has quit [Quit: jbc_]
kshitijk has joined #mlpack
vedhu63w has joined #mlpack
vedhu63w has quit [Remote host closed the connection]
dhfromkorea has joined #mlpack
dhfromkorea has quit [Ping timeout: 265 seconds]
udit_s has joined #mlpack
curiousguy13 has joined #mlpack
udit_s has quit [Ping timeout: 252 seconds]
kshitijk has quit [Ping timeout: 264 seconds]
govg has quit [Ping timeout: 256 seconds]
govg has joined #mlpack
curiousguy13 has quit [Ping timeout: 265 seconds]
udit_s has joined #mlpack
curiousguy13 has joined #mlpack
curiousguy13 has quit [Ping timeout: 265 seconds]
kshitijk has joined #mlpack
curiousguy13 has joined #mlpack
stephentu has quit [Quit: Lost terminal]
kshitijk has quit [Ping timeout: 264 seconds]
curiousguy13 has quit [Ping timeout: 256 seconds]
udit_s has quit [Remote host closed the connection]
kshitijk has joined #mlpack
kshitijk has quit [Ping timeout: 245 seconds]
kshitijk has joined #mlpack
curiousguy13 has joined #mlpack
jbc_ has joined #mlpack
stephentu has joined #mlpack
curiousguy13 has quit [Ping timeout: 252 seconds]
kshitijk has quit [Ping timeout: 245 seconds]
curiousguy13 has joined #mlpack
< stephentu> naywhayare: the problem was somebody put in a non-symmetric cov matrix
< stephentu> for the gaussian tests
< stephentu> and then i think the ifdef thing you did
< stephentu> caused it to factorize it differently
< stephentu> good times
kshitijk has joined #mlpack
< naywhayare> "2 1.5; 1 4"
< naywhayare> what was I thinking? :(
< naywhayare> well, thanks for digging to the bottom of it :)
< stephentu> naywhayare: i'm surprised the cholesky call didn't fail
< stephentu> maybe it just looks at the lower triangle
< stephentu> or something
< naywhayare> yeah; from the LAPACK documentation:
< naywhayare> If UPLO = 'U', the leading N-by-N upper triangular part of A contains the upper triangular part of the matrix A, and the strictly lower triangular part of A is not referenced.
< naywhayare> (the opposite applies for UPLO = 'L')
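A tiny demonstration of why the bogus covariance slipped through, assuming Armadillo's chol() hands the matrix to LAPACK potrf with UPLO = 'U' and does not check symmetry itself:

    #include <armadillo>
    #include <iostream>

    int main()
    {
      // The offending test matrix: c(0,1) = 1.5 but c(1,0) = 1, so not symmetric.
      arma::mat c("2 1.5; 1 4");

      // Only the upper triangle (2, 1.5, 4) is referenced by LAPACK, so the
      // factorization succeeds as if c(1,0) were also 1.5.
      arma::mat R = arma::chol(c);
      std::cout << R.t() * R << std::endl;  // prints the symmetrized matrix
      return 0;
    }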
< stephentu> should symmetric matrices be a separate type?
< stephentu> (this is a design question)
< stephentu> (not an actual suggestion)
< naywhayare> I wouldn't be opposed to the idea, especially because you can represent the matrix in far less space
< naywhayare> but I think in this case we're constrained by what LAPACK and BLAS want, which is an NxM block of memory, so being clever with the storage doesn't get you anything
< naywhayare> personally I would be more interested in a way to "mark" matrices as symmetric
< stephentu> actually i was thinking about that
< stephentu> like i'm wondering if there should be these bits
< stephentu> in every matrix
< stephentu> like IS_UPPER_TRIANGULAR
< stephentu> IS_SYMMETRIC
< stephentu> IS_DIAGONAL
< naywhayare> but your choices there in C++ seem to be (a) a runtime member... increases sizeof(mat), bad; (b) specify it as a template parameter, but now you have a thousand template parameters and the syntax starts to resemble Eigen, which in my opinion is overcomplex
< stephentu> hey eigen only has 4
< stephentu> template parameters
< naywhayare> (c) use some kind of expression to mark it... "this_matrix_is_diagonal(matrix)"
< stephentu> or maybe 5
< stephentu> i can't remember
< stephentu> :)
< naywhayare> :)
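A toy illustration of option (c), marking symmetry with a zero-overhead wrapper; none of these names exist in Armadillo or mlpack, they are purely hypothetical:

    #include <armadillo>

    // Carries the "this matrix is symmetric" fact in the type, without
    // changing sizeof(arma::mat) or how the data is stored.
    template<typename MatType>
    struct SymmetricView
    {
      const MatType& mat;
    };

    template<typename MatType>
    SymmetricView<MatType> MarkSymmetric(const MatType& m) { return { m }; }

    // Algorithms can overload on the tag and take a cheaper path when the
    // caller vouches for symmetry (e.g. eig_sym instead of eig_gen).
    template<typename MatType>
    arma::vec Eigenvalues(const SymmetricView<MatType>& s)
    {
      arma::vec eigval;
      arma::eig_sym(eigval, s.mat);
      return eigval;
    }

    // Usage: Eigenvalues(MarkSymmetric(someMatrix));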
< stephentu> i actually really like eigen
< naywhayare> either way I think Eigen is overcomplex and the learning curve is pretty steep, which is why I originally chose Armadillo
< stephentu> i'm starting a new project and i hate to say that i used eigen
< naywhayare> I mean, if it does the job, it does the job
< naywhayare> I imagine Eigen gives you more flexibility, but at the cost of the (reasonably) nice syntax that Armadillo has
< naywhayare> I'm not particularly experienced with Eigen... I perused the docs enough to make a design decision against it years ago
< stephentu> "Implementing an algorithm on top of Eigen feels like just copying pseudocode."
< stephentu> haha
< stephentu> how's the ICML paper going?
< naywhayare> I have an algorithm for k-means. it's reasonably fast, but it only really shines with large k and large datasets
< naywhayare> I've got 52.5 hours to get simulations run on datasets that are "large enough"
< naywhayare> so... we'll see...
< stephentu> good luck
< naywhayare> thanks... unfortunately, most of what I have to do is waiting
kshitijk has quit [Ping timeout: 240 seconds]
< stephentu> prove some theorems while waiting?
< stephentu> the way i see it
< stephentu> your algorithm could either
< stephentu> a) work in practice
< stephentu> or b) have theoretical guarantees
curiousguy13_ has joined #mlpack
curiousguy13 has quit [Read error: Connection timed out]
< naywhayare> stephentu: it is possible to have both :)
curiousguy13__ has joined #mlpack
curiousguy13_ has quit [Read error: Connection timed out]
stephentu_ has joined #mlpack
< stephentu_> naywhayare: that's like a phd :)
curiousguy13__ has quit [Read error: Connection timed out]
curiousguy13__ has joined #mlpack
curiousguy13_ has joined #mlpack
curiousguy13__ has quit [Read error: Connection timed out]
stephentu_ has quit [Read error: Connection reset by peer]
curiousguy13__ has joined #mlpack
stephentu_ has joined #mlpack
curiousguy13_ has quit [Ping timeout: 256 seconds]
curiousguy13__ has quit [Read error: Connection timed out]
curiousguy13 has joined #mlpack
stephentu_ has quit [Ping timeout: 245 seconds]
stephentu_ has joined #mlpack
curiousguy13_ has joined #mlpack
curiousguy13 has quit [Read error: Connection timed out]
curiousguy13_ has quit [Read error: Connection timed out]
curiousguy13_ has joined #mlpack