verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
palash123 has joined #mlpack
palash has quit [Ping timeout: 256 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 245 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 255 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 255 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 245 seconds]
govg has quit [Ping timeout: 245 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 240 seconds]
dinesh_ has quit [Ping timeout: 255 seconds]
govg has joined #mlpack
dineshraj01 has joined #mlpack
palash123 has joined #mlpack
palash has quit [Ping timeout: 240 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 245 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 245 seconds]
dineshraj01 has quit [Read error: Connection reset by peer]
dineshraj01 has joined #mlpack
palash has joined #mlpack
palash123 has quit [Ping timeout: 245 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 276 seconds]
dineshraj01 has quit [Ping timeout: 248 seconds]
vivekp has quit [Ping timeout: 258 seconds]
vivekp has joined #mlpack
dhawalht has joined #mlpack
< dhawalht> Hey, I want to contribute to your open-source project. I am a GSoC 2017 aspirant.
< dhawalht> Please reply to me at dhawalharkawat14@gmail.com
dhawalht has quit [Quit: Page closed]
vivekp has quit [Ping timeout: 240 seconds]
mikeling has joined #mlpack
vivekp has joined #mlpack
palash has joined #mlpack
palash123 has quit [Ping timeout: 276 seconds]
rcurtin_ has joined #mlpack
cult- has left #mlpack []
palash123 has joined #mlpack
palash123 has left #mlpack []
palash123 has joined #mlpack
palash has quit [Ping timeout: 252 seconds]
palash has joined #mlpack
palash123 has quit [Ping timeout: 252 seconds]
palash123 has joined #mlpack
palash has quit [Ping timeout: 255 seconds]
daasankur has joined #mlpack
govg has quit [Ping timeout: 258 seconds]
govg has joined #mlpack
< layback> i would like to talk to someone with knowledge about the collaborative filtering parts of mlpack. i am benchmarking it against other software with the MovieLens dataset, randomly splitting it .8/.2 training/testing. common benchmarks show ~.9, but with mlpack I get ~3.5 for NMF and ~.05 with RegSVD. not sure how they can differ that much, or why the resulting error is so different from others. my guess is
< layback> that it has something to do with the kNN evaluation protocol (or whatever you may call it). is there something i may be grossly overlooking? (ps. i have found an old Agrawal document showing results of ~.9 with RegSVD, albeit with an older CLI interface)
< rcurtin_> layback: hi there; the error measure you are using is RMSE I guess?
< layback> rcurtin_: yes!
< rcurtin> let me find the movielens dataset, hang on
< layback> it's the 1M one i've used!
< rcurtin> ok, I am only going to use the 100k one so it doesn't take so long to run simulations
< rcurtin> but we should check that the format is right
< rcurtin> the input CSV should be three columns: user id, movie id, rating
< layback> yes!
< rcurtin> ok, great
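The three-column layout described above (user id, movie id, rating) is easy to sanity-check before running mlpack_cf. A minimal Python sketch; the helper is hypothetical, not part of mlpack:

```python
# Quick sanity check that rating rows match the (user id, movie id, rating)
# layout mlpack_cf expects: exactly three columns, two integer ids, a
# numeric rating. Raises on non-numeric fields, returns False on bad shape.
def looks_like_ratings(rows):
    for row in rows:
        if len(row) != 3:
            return False
        int(row[0])    # user id must be an integer
        int(row[1])    # movie id must be an integer
        float(row[2])  # rating must be numeric
    return True

sample = [["1", "31", "2.5"], ["1", "1029", "3.0"]]
print(looks_like_ratings(sample))  # True
```

A common failure mode this catches is leaving the MovieLens timestamp as a fourth column, which makes the check return False.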
< rcurtin> next question: what are you using to calculate the RMSE, and what are you setting the rank of the decomposition to?
< layback> so, i've tried ranks in [5, 100] and get results in the range [3.8, 2.9] for default NMF; for RegSVD i've tried rank 20 (the same as the Agrawal document i mentioned used).
< layback> but say ~3.5 for NMF with rank 20.
< layback> sorry, this got messy, but for RegSVD -R 20 i get ~.02
< rcurtin> ok, and you're using the -T option to calculate RMSE?
< layback> im running: `mlpack_cf -v -t training.csv -T testing.csv -R 20 [-a RegSVD]`
< rcurtin> ok, great, let me try that and see what I get
< layback> i do split my datasets in an external script i have, but i assure you it is just randomly split .8/.2.
< rcurtin> ok, I see the same results for default NMF... I am playing with the --min_residue parameter, which controls how exact the decomposition is
< rcurtin> it makes it take longer to converge, so it may be a little bit until I get results...
< layback> yes! i feel like i've tried mixing every possible parameter, hehe. I very much appreciate the help!
< rcurtin> $ bin/mlpack_cf -t ~/datasets/ml-latest-small/ratings-train.csv -T ~/datasets/ml-latest-small/ratings-test.csv -a RegSVD -v --rank 20 | grep RMSE
< rcurtin> [INFO ] RMSE is 0.964829.
< rcurtin> so, that's with the 100k dataset
< rcurtin> let me download the 1M dataset and see if I see the same behavior
< rcurtin> if you are seeing an RMSE of 0.05, that is very strange
< rcurtin> I played with --min_residue for NMF, I am not getting RMSE much better than 3.0 or so
< layback> ok, I'll try also with the 100k dataset.
< rcurtin> $ bin/mlpack_cf -t ~/datasets/ml-1m/ratings-train.csv -T ~/datasets/ml-1m/ratings-test.csv -a RegSVD -v --rank 20 --max_iterations 30 | grep RMSE
< rcurtin> [INFO ] RMSE is 0.876338.
< rcurtin> it seems like the regularized SVD needs far fewer iterations than NMF to converge to a good solution
< rcurtin> 10 and 20 iterations gave good results too
< rcurtin> and that ran in about 10 seconds
< layback> that blows my mind, ok, great so i'm doing something wrong it seems. hehe.
< layback> how did you split the 1m data set?
< rcurtin> hehe, I did it hackily...
< layback> just so I can quickly validate.
< rcurtin> after I removed the fourth column from ratings.dat (the timestamp) and converted it to csv,
< rcurtin> cat ratings.csv | sort -R > ratings-random.csv
< rcurtin> head -800000 ratings-random.csv > ratings-train.csv
< rcurtin> tail -200204 ratings-random.csv > ratings-test.csv
< rcurtin> (seems like the 1M dataset actually had 1M + 204 ratings)
< rcurtin> if I was smart, I would have just used the mlpack_preprocess_split program, but I decided to do it by hand for some reason :)
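layback's external split script is not shown in the log, but the .8/.2 random split both sides describe can be sketched in stdlib-only Python; the function name and fixed seed are assumptions:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rating rows into train and test sets, like the
    sort -R / head / tail pipeline above (hypothetical helper; seeded
    so the split is reproducible)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 200 synthetic (user, movie, rating) rows -> 160 train, 40 test.
rows = [(u, m, r) for u in range(100) for m, r in [(1, 5.0), (2, 3.0)]]
train, test = train_test_split(rows)
print(len(train), len(test))  # 160 40
```

The `mlpack_preprocess_split` program mentioned above does the same job from the command line without the manual pipeline.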
< rcurtin> I can't seem to get RMSE < 2.6 with the NMF decomposition on the 1M dataset
< rcurtin> I'm playing with the SVDIncompleteIncremental and SVDCompleteIncremental decompositions; it seems like the results are trash... I think maybe they are being run with not-great parameters for the optimizers, which can't be tuned through the command-line program
< rcurtin> I think I'll update the GSoC description for the CF project to possibly include working with those algorithms to set reasonable defaults for the optimizers
< layback> hmm, ok, so I can confirm ~.88 RMSE on the 1M dataset as per your instructions. so there seems to be something wrong with how i prepare my data, I GUESS. ye, the thing that really tripped me was that all of them were showing such different results from each other.
< rcurtin> yeah, ideally, the defaults should be configured in such a way that each algorithm type converges to something similar
< rcurtin> but for now I guess RegSVD is the right one to use
< rcurtin> glad I could help sort it out!
< layback> i think the problem comes from me also trying to figure out how to get a good one-class rating, i.e. "like" or "don't know", and i might have passed that to my movielens data as well. if you have any tips for that, please pass them along. otherwise i'll continue tinkering.
< layback> thanks a lot for the help! very helpful.
< rcurtin> what do you mean by "one-class rating"?
< rcurtin> not sure I follow completely
< rcurtin> also, sure, glad to help, that's why I'm here :)
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#1777 (master - ec3f224 : Ryan Curtin): The build was broken.
travis-ci has left #mlpack []
< layback> instead of having a rating in [1, 5] for each entry, the actual data i want to work with only has ratings for "likes", so i really don't have an actual rating: basically only 5-star ratings and nothing else. so either i know a user likes something, or else i don't know anything.
< layback> so a row in my dataset is just an indication of a "like" between an item and a user.
< rcurtin> ah, ok, I guess that is a slightly different problem
< rcurtin> I think maybe any value for "like" will work (like 1 should be fine) but I think you will have to calculate RMSE differently
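The RMSE figure mlpack_cf reports is the standard root mean squared error over held-out (prediction, truth) pairs; for reference, a minimal Python sketch (the helper itself is hypothetical, not an mlpack API):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error over matched (prediction, truth) pairs:
    sqrt of the mean squared difference."""
    assert len(predicted) == len(actual) and predicted
    se = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(se / len(predicted))

print(round(rmse([4.1, 2.9, 5.2], [4.0, 3.0, 5.0]), 3))  # 0.141
```

This is also why RMSE is a poor fit for like-only data: with every observed "rating" equal to the same value, the error says little about ranking quality.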
mikeling has quit [Quit: Connection closed for inactivity]
palash has joined #mlpack
palashahuja has joined #mlpack
palash123 has quit [Ping timeout: 255 seconds]
palash has quit [Ping timeout: 258 seconds]
palash has joined #mlpack
palash has quit [Remote host closed the connection]
palashahuja has quit [Ping timeout: 245 seconds]
gtank has quit [Remote host closed the connection]
gtank has joined #mlpack
daasankur has quit [Ping timeout: 260 seconds]
palashahuja has joined #mlpack
< layback> well, i guess i can calculate the RMSE to get some indication about the model, but the actual evaluation should probably be some kind of precision measure.
< layback> since the movielens dataset is used in practically every written text on the topic, my thinking is that it would be helpful to have it as a benchmark in some form.
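For the like-only setting layback describes, one common precision-style evaluation is precision at k: the fraction of a user's top-k recommendations that they actually liked. A minimal sketch (hypothetical helper, not an mlpack API):

```python
def precision_at_k(recommended, liked, k=10):
    """Fraction of the top-k recommended items the user actually liked.
    `recommended` is a ranked list of item ids; `liked` is the set of
    held-out items the user is known to like."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in liked)
    return hits / len(top_k)

recs = [10, 4, 7, 99, 3]   # ranked recommendations for one user
liked = {4, 3, 42}         # held-out likes
print(precision_at_k(recs, liked, k=5))  # 0.4
```

Averaging this over all test users gives a single ranking-quality number that, unlike RMSE, is meaningful when every observed rating is the same "like" value.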