verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
palash123 has joined #mlpack
palash has quit [Ping timeout: 256 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 245 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 255 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 255 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 245 seconds]
govg has quit [Ping timeout: 245 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 240 seconds]
dinesh_ has quit [Ping timeout: 255 seconds]
govg has joined #mlpack
dineshraj01 has joined #mlpack
palash123 has joined #mlpack
palash has quit [Ping timeout: 240 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 245 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 245 seconds]
dineshraj01 has quit [Read error: Connection reset by peer]
dineshraj01 has joined #mlpack
palash has joined #mlpack
palash123 has quit [Ping timeout: 245 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 276 seconds]
dineshraj01 has quit [Ping timeout: 248 seconds]
vivekp has quit [Ping timeout: 258 seconds]
vivekp has joined #mlpack
dhawalht has joined #mlpack
< dhawalht> Hey, I want to contribute to your open-source project. I am a GSoC 2017 aspirant.
< dhawalht> Please reply to me at dhawalharkawat14@gmail.com
dhawalht has quit [Quit: Page closed]
vivekp has quit [Ping timeout: 240 seconds]
mikeling has joined #mlpack
vivekp has joined #mlpack
palash has joined #mlpack
palash123 has quit [Ping timeout: 276 seconds]
rcurtin_ has joined #mlpack
cult- has left #mlpack []
palash123 has joined #mlpack
palash123 has left #mlpack []
palash123 has joined #mlpack
palash has quit [Ping timeout: 252 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 252 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 255 seconds]
daasankur has joined #mlpack
govg has quit [Ping timeout: 258 seconds]
govg has joined #mlpack
< layback> i would like to talk to someone with knowledge about the collaborative filtering parts of mlpack. i am benchmarking it against other software with the MovieLens dataset, randomly splitting it .8/.2 training/testing. common benchmarks show ~.9, but with mlpack I get ~3.5 for NMF and ~.05 with RegSVD. not sure how they can differ that much, or why the resulting error is so different from others. my guess is
< layback> that it has something to do with the kNN evaluation protocol (or whatever you may call it). is there something i may be grossly overlooking? (ps. i have found an old Agrawal document showing results of ~.9 with RegSVD, albeit with an older CLI interface)
< rcurtin_> layback: hi there; the error measure you are using is RMSE I guess?
< layback> rcurtin_: yes!
< rcurtin> let me find the movielens dataset, hang on
< layback> it's the 1M one i've used!
< rcurtin> ok, I am only going to use the 100k one so it doesn't take so long to run simulations
< rcurtin> but we should check that the format is right
< rcurtin> the input CSV should be three columns: user id, movie id, rating
< layback> yes!
< rcurtin> ok, great
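The three-column layout described above (user id, movie id, rating) is easy to sanity-check before running mlpack_cf. A minimal Python sketch; the helper is hypothetical, not part of mlpack:

```python
# Quick sanity check that rating rows match the (user id, movie id, rating)
# layout mlpack_cf expects: exactly three columns, two integer ids, a
# numeric rating. Raises on non-numeric fields, returns False on bad shape.
def looks_like_ratings(rows):
    for row in rows:
        if len(row) != 3:
            return False
        int(row[0])    # user id must be an integer
        int(row[1])    # movie id must be an integer
        float(row[2])  # rating must be numeric
    return True

sample = [["1", "31", "2.5"], ["1", "1029", "3.0"]]
print(looks_like_ratings(sample))  # True
```

A common failure mode this catches is leaving the MovieLens timestamp as a fourth column, which makes the check return False.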
< rcurtin> next question: what are you using to calculate the RMSE, and what are you setting the rank of the decomposition to?
< layback> so, i've tried ranks in [5, 100] and get results in the range [3.8, 2.9] for default NMF; for RegSVD i've tried rank 20 (the same as the Agrawal document i mentioned used).
< layback> but say ~3.5 for NMF with rank 20.
< layback> sorry, this got messy, but for RegSVD -R 20 i get ~.02
< rcurtin> ok, and you're using the -T option to calculate RMSE?
< layback> im running: `mlpack_cf -v -t training.csv -T testing.csv -R 20 [-a RegSVD]`
< rcurtin> ok, great, let me try that and see what I get
< layback> i do split my datasets in an external script i have, but i assure you it is just randomly split .8/.2.
< rcurtin> ok, I see the same results for default NMF... I am playing with the --min_residue parameter, which controls how exact the decomposition is
< rcurtin> it makes it take longer to converge, so it may be a little bit until I get results...
< layback> yes! i feel like i've tried mixing every possible parameter, hehe. I very much appreciate the help!
< rcurtin> $ bin/mlpack_cf -t ~/datasets/ml-latest-small/ratings-train.csv -T ~/datasets/ml-latest-small/ratings-test.csv -a RegSVD -v --rank 20 | grep RMSE
< rcurtin> [INFO ] RMSE is 0.964829.
< rcurtin> so, that's with the 100k dataset
< rcurtin> let me download the 1M dataset and see if I see the same behavior
< rcurtin> if you are seeing an RMSE of 0.05, that is very strange
< rcurtin> I played with --min_residue for NMF, I am not getting RMSE much better than 3.0 or so
< layback> ok, I'll try also with the 100k dataset.
< rcurtin> $ bin/mlpack_cf -t ~/datasets/ml-1m/ratings-train.csv -T ~/datasets/ml-1m/ratings-test.csv -a RegSVD -v --rank 20 --max_iterations 30 | grep RMSE
< rcurtin> [INFO ] RMSE is 0.876338.
< rcurtin> it seems like the regularized SVD needs far fewer iterations than NMF to converge to a good solution
< rcurtin> 10 and 20 iterations gave good results too
< rcurtin> and that ran in about 10 seconds
< layback> that blows my mind, ok, great so i'm doing something wrong it seems. hehe.
< layback> how did you split the 1m data set?
< rcurtin> hehe, I did it hackily...
< layback> just so I can quickly validate.
< rcurtin> after I removed the fourth column from ratings.dat (the timestamp) and converted it to csv,
< rcurtin> cat ratings.csv | sort -R > ratings-random.csv
< rcurtin> head -800000 ratings-random.csv > ratings-train.csv
< rcurtin> tail -200204 ratings-random.csv > ratings-test.csv
< rcurtin> (seems like the 1M dataset actually had 1M + 204 ratings)
< rcurtin> if I was smart, I would have just used the mlpack_preprocess_split program, but I decided to do it by hand for some reason :)
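layback's external split script is not shown in the log, but the .8/.2 random split both sides describe can be sketched in stdlib-only Python; the function name and fixed seed are assumptions:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rating rows into train and test sets, like the
    sort -R / head / tail pipeline above (hypothetical helper; seeded
    so the split is reproducible)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 200 synthetic (user, movie, rating) rows -> 160 train, 40 test.
rows = [(u, m, r) for u in range(100) for m, r in [(1, 5.0), (2, 3.0)]]
train, test = train_test_split(rows)
print(len(train), len(test))  # 160 40
```

The `mlpack_preprocess_split` program mentioned above does the same job from the command line without the manual pipeline.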
< rcurtin> I can't seem to get RMSE < 2.6 with the NMF decomposition on the 1M dataset
< rcurtin> I'm playing with the SVDIncompleteIncremental and SVDCompleteIncremental decompositions; it seems like the results are trash... I think maybe they are being run with not-great parameters for the optimizers, which can't be tuned through the command-line program
< rcurtin> I think I'll update the GSoC description for the CF project to possibly include working with those algorithms to set reasonable defaults for the optimizers
< layback> hmm, ok, so I can confirm ~.88 RMSE on the 1M dataset as per your instructions. so there seems to be something wrong with how i prepare my data, I GUESS. ye, the thing that really tripped me was that all of them were showing such different results from each other.
< rcurtin> yeah, ideally, the defaults should be configured in such a way that each algorithm type converges to something similar
< rcurtin> but for now I guess RegSVD is the right one to use
< rcurtin> glad I could help sort it out!
< layback> i think the problem comes from me also trying to figure out how to get a good one-class rating, i.e. "like" or "don't know", and i might have passed that to my movielens data as well. if you have any tips for that, please pass them along. otherwise i'll continue tinkering.
< layback> thanks a lot for the help! very helpful.
< rcurtin> what do you mean by "one-class rating"?
< rcurtin> not sure I follow completely
< rcurtin> also, sure, glad to help, that's why I'm here :)
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#1777 (master - ec3f224 : Ryan Curtin): The build was broken.
travis-ci has left #mlpack []
< layback> instead of having a rating in [1, 5] for each entry, the actual data i want to work with only has ratings for "likes", so i really don't have an actual rating: basically only 5-star ratings and nothing else. so either i know a user likes something, or else i don't know anything.
< layback> so a row in my dataset is just an indication of a "like" between an item and a user.
< rcurtin> ah, ok, I guess that is a slightly different problem
< rcurtin> I think maybe any value for "like" will work (like 1 should be fine) but I think you will have to calculate RMSE differently
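The RMSE figure mlpack_cf reports is the standard root mean squared error over held-out (prediction, truth) pairs; for reference, a minimal Python sketch (the helper itself is hypothetical, not an mlpack API):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over matched (prediction, truth) pairs:
    sqrt of the mean squared difference."""
    assert len(predicted) == len(actual) and predicted
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(predicted))

print(round(rmse([4.1, 2.9, 5.2], [4.0, 3.0, 5.0]), 3))  # 0.141
```

This is also why RMSE is a poor fit for like-only data: with every observed "rating" equal to the same value, the error says little about ranking quality.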
mikeling has quit [Quit: Connection closed for inactivity]
palash has joined #mlpack
palashahuja has joined #mlpack
palash123 has quit [Ping timeout: 255 seconds]
palash has quit [Ping timeout: 258 seconds]
palash has joined #mlpack
palash has quit [Remote host closed the connection]
palashahuja has quit [Ping timeout: 245 seconds]
gtank has quit [Remote host closed the connection]
gtank has joined #mlpack
daasankur has quit [Ping timeout: 260 seconds]
palashahuja has joined #mlpack
< layback> well, i guess i can calculate the RMSE to get some indication about the model, but the actual evaluation should probably be some kind of precision measure.
< layback> since the movielens dataset is used in practically every written text on the topic, my thinking is that it would be helpful to have it as a benchmark in some form.
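For the like-only setting layback describes, one common precision-style evaluation is precision at k: the fraction of a user's top-k recommendations that they actually liked. A minimal sketch (hypothetical helper, not an mlpack API):

```python
def precision_at_k(recommended, liked, k=10):
    """Fraction of the top-k recommended items the user actually liked.
    `recommended` is a ranked list of item ids; `liked` is the set of
    held-out items the user is known to like."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in liked)
    return hits / len(top_k)

recs = [10, 4, 7, 99, 3]   # ranked recommendations for one user
liked = {4, 3, 42}         # held-out likes
print(precision_at_k(recs, liked, k=5))  # 0.4
```

Averaging this over all test users gives a single ranking-quality number that, unlike RMSE, is meaningful when every observed rating is the same "like" value.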