verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
tsathoggua has joined #mlpack
tsathoggua has quit [Client Quit]
mentekid has joined #mlpack
mentekid_ has quit [Remote host closed the connection]
mentekid_ has joined #mlpack
mentekid__ has joined #mlpack
mentekid has quit [Read error: Connection reset by peer]
mentekid has joined #mlpack
mentekid__ has quit [Read error: Connection reset by peer]
mentekid__ has joined #mlpack
mentekid has quit [Read error: Connection reset by peer]
mentekid__ has quit [Client Quit]
< mentekid_> rcurtin: here are the new results - http://pastebin.com/XQmgPgGv with both profiling and debugging turned off. These results look a lot more like yours
< mentekid_> rcurtin: though I noticed that we didn't test all the same datasets; I swapped your phy for pokerhand. Anyway, these results are much more consistent: as you said, unique is the fastest most of the time, and find behaves differently from unique most of the time
< mentekid_> I don't exclude the possibility that I did something stupid yesterday, like not saving the file and simply re-compiling unique (or find) twice and reporting the results as if they were different
< mentekid_> but it could also be the profiling I guess
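For context, here is a minimal, self-contained sketch of the kind of comparison being timed above, assuming the two strategies under test are Armadillo's unique() and find() for extracting candidate indices from second-level hash codes. The data and exact usage here are illustrative only, not mlpack's actual benchmark code.

```cpp
// Hypothetical micro-benchmark: time arma::unique() against arma::find()
// over a vector of fake second-level hash codes. Illustrative only.
#include <armadillo>
#include <iostream>

int main()
{
  // One million fake hash codes in [0, 100).
  arma::uvec codes = arma::randi<arma::uvec>(1000000, arma::distr_param(0, 99));

  arma::wall_clock timer;

  // Strategy 1: collect the distinct codes with unique().
  timer.tic();
  arma::uvec distinctCodes = arma::unique(codes);
  std::cout << "unique(): " << timer.toc() << "s ("
            << distinctCodes.n_elem << " distinct codes)" << std::endl;

  // Strategy 2: scan for the points matching one code with find().
  timer.tic();
  arma::uvec matches = arma::find(codes == 42);
  std::cout << "find():   " << timer.toc() << "s ("
            << matches.n_elem << " matches)" << std::endl;

  return 0;
}
```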
Mathnerd314 has quit [Ping timeout: 246 seconds]
mentekid_ has quit [Ping timeout: 244 seconds]
nilay has joined #mlpack
mentekid has joined #mlpack
mentekid has quit [Ping timeout: 250 seconds]
mentekid has joined #mlpack
< rcurtin> mentekid: now there is not a single case where unique does not outperform find... hmm...
< rcurtin> looking through my results too, there is only one case where find is faster, and that's with OpenBLAS on the corel dataset that I have (which is weird and small and not the same one you are using)
< rcurtin> I have to step out for a little bit... but I think I want to run a few more OpenBLAS simulations
< rcurtin> maybe a few more non-OpenBLAS simulations too, on other datasets
< rcurtin> it seems like maybe we can conclude that if not using OpenBLAS, unique is always the better choice
< mentekid> so no need for cutoff either
< mentekid> simply go with unique?
< rcurtin> yeah, I would say "go with unique", but I want to do a little more testing with OpenBLAS first
< rcurtin> I'll be back in a little bit
< rcurtin> sorry that something that was supposed to be easy got so complex :(
< mentekid> nah it's an interesting problem (made more complex by my inability to compile correctly :P)
< rcurtin> do you think you were compiling with debugging symbols before?
< rcurtin> er, with profiling symbols?
< mentekid> profiling symbols yeah
< mentekid> I'm not sure about debugging, I think they were off
< mentekid> I also had the idea of performing candidate retrieval during training. For each 2nd-level hash bucket, we could run retrieveIndicesFromTable and store the results in memory. That way queries would be O(1), if we use a hash table whose key is the level-2 code and whose value is the precomputed candidate set
< mentekid> I'm not sure what the impact on memory would be, or how much longer Train() would take to run though
< mentekid> but I think that's supposed to be the goal of LSH in the end, very fast queries after slow preprocessing
< rcurtin> so the idea is, instead of having secondHashTable store the indices of points, it would store the actual points themselves?
< mentekid> I am not sure how secondHashTable is implemented, but I think right now it works as secondHashTable[point] = code
< mentekid> so we check if secondHashTable[point] == secondHashTable[query] and if so add point to the candidates
< mentekid> couldn't we instead store secondHashTable[code] = list_of_points
< mentekid> so then queries would be faster?
< mentekid> again I haven't looked into it very carefully, I'm not sure what I'm saying makes sense
< rcurtin> yeah, I am not sure if I completely understand. but if you try it and it speeds things up I am all for it :)
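A rough sketch of the restructuring mentekid is describing above, under the assumption that the current code maps each point to its second-level code and scans all points per query. The names CandidateMap, BuildCandidateMap, and GetCandidates are hypothetical, not mlpack's actual API.

```cpp
// Hypothetical sketch: invert the point -> code mapping into a
// code -> list-of-points map at training time, so a query needs one
// hash lookup instead of a scan over all points. Not mlpack's real API.
#include <cstddef>
#include <unordered_map>
#include <vector>

using CandidateMap = std::unordered_map<size_t, std::vector<size_t>>;

// Build the inverted map once, during training.
CandidateMap BuildCandidateMap(const std::vector<size_t>& pointCodes)
{
  CandidateMap precomputedCandidates;
  for (size_t point = 0; point < pointCodes.size(); ++point)
    precomputedCandidates[pointCodes[point]].push_back(point);
  return precomputedCandidates;
}

// At query time, the candidate set is a single average-case O(1) lookup.
const std::vector<size_t>& GetCandidates(const CandidateMap& map,
                                         const size_t queryCode)
{
  static const std::vector<size_t> empty;
  const auto it = map.find(queryCode);
  return (it == map.end()) ? empty : it->second;
}
```

The trade-off, as noted above, is memory: every bucket's candidate list is held in RAM, and Train() pays the cost of building the map up front in exchange for faster queries.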
< mentekid> I'll try to do it before Sunday then, it will be a good warm-up
< rcurtin> sure, don't feel obligated
< rcurtin> I think it'll be a lot easier for me to understand if I see the code :)
< mentekid> to be honest my schedule is tight because I need to deliver a first draft of my thesis by next Friday... But I'm almost done with everything except the text, which is in bad shape :P
< mentekid> But I think I'll find time on Saturday
< rcurtin> okay, sounds good... hopefully the thesis text gets better :)
govg has joined #mlpack
Mathnerd314 has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]
marcosirc has joined #mlpack
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Client Quit]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Client Quit]
sumedhghaisas has joined #mlpack
< marcosirc> Hi Sumedh.
sumedhghaisas has quit [Read error: Connection reset by peer]
sumedhghaisas has joined #mlpack
< sumedhghaisas> Hi @marcosirc
< sumedhghaisas> there is some problem with my connection... it's getting disconnected frequently...
< marcosirc> Ok. No problem.
< marcosirc> Is it ok for you to chat now, or you would prefer another moment?
< sumedhghaisas> @marcosirc is 17:00 GMT okay for you??
< sumedhghaisas> I will be done with my dinner by then...
tsathoggua has joined #mlpack
< marcosirc> Yes. Ok.
< marcosirc> See you later.
mentekid has quit [Ping timeout: 276 seconds]
tsathoggua has quit [Client Quit]
< sumedhghaisas> @marcosirc Hey Marcos...
nilay has joined #mlpack
< marcosirc> Hi
< marcosirc> Here I am
< marcosirc> Sorry, I was having lunch.
< sumedhghaisas> Sure no problem...
< marcosirc> Thanks.
< marcosirc> If you agree, according to the timeline included in the proposal, next week I could start modifying the existing dual-tree algorithm to support approximate nearest neighbor search.
< marcosirc> I should modify the prune rule to include an epsilon, as mentioned by Ryan at the end of the paper: "Faster dual-tree traversal for nearest neighbor search". Something like: prune if d_min(N_q, N_r) > (1 / (1 + epsilon)) * B(N_q).
< marcosirc> Would you think this would be ok?
< marcosirc> Once this is working properly, I could work on the command line tool.
< sumedhghaisas> yes I talked with him on this...
< sumedhghaisas> The current implementation will support this epsilon addition ...
< marcosirc> Yes.
< sumedhghaisas> but I read that paper a long time ago... let me take a look at it again today...
< marcosirc> Ok.
< sumedhghaisas> I am going over it now.. just a quick glance...
< sumedhghaisas> Okay so as I understand it...
< sumedhghaisas> the current framework is
< sumedhghaisas> if the minimum distance between nodes is < the bound, then prune...
< sumedhghaisas> we want to add an epsilon parameter here to accommodate approximation...
< sumedhghaisas> is that right?
< marcosirc> Just the opposite, if the min distance is greater than the bound, then prune..
< sumedhghaisas> ohh sorry my bad...
< marcosirc> Yes. We want to add an epsilon to make it approximate.
< sumedhghaisas> < it is...
< sumedhghaisas> hmm...
< sumedhghaisas> if min distance > bound - epsilon then prune...
< sumedhghaisas> something like that...
< marcosirc> Yes.
< marcosirc> We make the bound smaller.
< marcosirc> We can prune more because we are looking for an approximate solution.
< sumedhghaisas> So your equation was (1 / (1 + epsilon) ) * bound...
< sumedhghaisas> yes that's right... that's the general idea I got from the paper you added in your proposal...
< marcosirc> Yes, ok.
< sumedhghaisas> is there any special reason for ( 1 / (1 + epsilon) ) ?
< marcosirc> It is mentioned in the paper: "Faster dual-tree traversal for nearest neighbor search", below table 3.
< sumedhghaisas> ohh... let me take a quick look...
< sumedhghaisas> I am not sure how exactly it will differ from bound - epsilon... both will achieve the same result I think...
< marcosirc> the difference is that epsilon is a relative coefficient here.
< marcosirc> It is not a fixed error. It is a relative error.
< sumedhghaisas> yes you are right...
< sumedhghaisas> In that scenario I think (1 / (1 + epsilon)) is a better choice...
< marcosirc> We can express the same thing as: if (1+e) * dmin(N_q, N_r) > B, prune, because we are sure we can get a neighbor at least as good as B.
< marcosirc> Ok.
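A hypothetical sketch of the relaxed prune rule just agreed on, using the formula from the discussion above. The function and variable names here are illustrative, not mlpack's actual Score() implementation.

```cpp
// Hypothetical sketch of the relaxed prune rule: with relative error
// parameter epsilon, a node combination can be pruned when it cannot
// contain a neighbor better than B(N_q) / (1 + epsilon).
bool CanPrune(const double minDist,  // d_min(N_q, N_r)
              const double bound,    // B(N_q)
              const double epsilon)  // relative approximation parameter
{
  // Equivalent forms: minDist > bound / (1 + epsilon),
  //               or  (1 + epsilon) * minDist > bound.
  return minDist > bound / (1.0 + epsilon);
}
```

With epsilon = 0 this reduces to the exact prune rule, which is why the existing implementation can support the addition directly.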
< sumedhghaisas> yes... reminds me of the definition of approximation algorithms...
< rcurtin> I think relative error is better to implement than absolute error -- I think more people will want to use relative error
< marcosirc> Ok. Maybe we can add an option later, to decide if absolute or relative error is considered.
< sumedhghaisas> yeah... even if we go by the definition of approximation algorithms... it's always a relative error... epsilon being the approximation parameter...
< sumedhghaisas> I think just relative is fine... I am trying to think of any advantage of absolute over relative...
< sumedhghaisas> but I think an absolute error can be translated into a relative error and vice versa...
< sumedhghaisas> so either is fine...
< sumedhghaisas> mention of it in the documentation will suffice...
< marcosirc> Ok. The advantage of absolute error is that you can be sure the real error won't be greater than that. With relative error, it depends on the real distance to the nearest neighbor: if that is greater, the real error is greater.
< marcosirc> But I think if we take one approach, we won't have trouble changing it in the future if we decide to.
< sumedhghaisas> yes true... but when I say a 1/2-approximation algorithm, it's always relative error...
< sumedhghaisas> that's why I thought people might get confused by using absolute error in approximation algorithms...
< sumedhghaisas> but you are right...
< sumedhghaisas> changing it later won't cause much of a trouble...
< marcosirc> Yes! Ok. So we agree on this. I will make it clear in the documentation.
< sumedhghaisas> Sure...
< sumedhghaisas> If we decide later we will change it...
< rcurtin> yeah, that sounds good to me too. I'm not sure what the right design is to support both relative and absolute approximation, but I don't think it will be too hard to do
< sumedhghaisas> I agree... we can go ahead with relative for now... maybe a working example will let us do some tests to figure this problem out...
< sumedhghaisas> We can check how absolute error fares in practice...
< marcosirc> Yes, sounds great.
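To make the distinction above concrete, here is a short note in symbols not used in the original discussion: write d* for the true nearest-neighbor distance and d-hat for the distance of the returned neighbor.

```latex
% Absolute error guarantee: the returned neighbor is within a fixed slack.
\hat{d} \le d^{*} + \epsilon_{\text{abs}}

% Relative error guarantee: the slack scales with the true distance.
\hat{d} \le (1 + \epsilon_{\text{rel}}) \, d^{*}

% The two only coincide pointwise, via
% \epsilon_{\text{abs}} = \epsilon_{\text{rel}} \, d^{*},
% which depends on d^{*}; this is exactly why the guarantee given by a
% relative bound varies with the true distance, as marcosirc notes above.
```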
< marcosirc> Another thing I would like to talk about is the github workflow.
< marcosirc> I will start working on my forked mlpack repo. What would you prefer: many pull requests during the summer, or a single pull request at the end of GSoC?
< sumedhghaisas> @rcurtin what happened to the naywhayare nickname? :P
< sumedhghaisas> hmm... yes... that's a good point...
< sumedhghaisas> separate pull requests would let us merge parts of the code as they're ready...
< sumedhghaisas> I mean even though you will basically work on the same thing...
< sumedhghaisas> I would like to follow the feature-branch style... I think that's more elegant... what do you think??
< marcosirc> Yes. Ok, for me. Many features branches in my own forked repo, and many pull requests during the summer.
mentekid has joined #mlpack
< sumedhghaisas> that will also give me time to go through the code while you work on the next feature...
< rcurtin> sumedhghaisas: I realized that people had no idea who I was, so I decided to sync github username with irc nick :)
< rcurtin> marcosirc: I agree, many PRs is better, it allows us to get your code into a release quicker
< marcosirc> Ok. Yes, I agree!
< sumedhghaisas> @rcurtin haha... I always wanted to ask.. pardon my ignorance... what does it mean??? naywhayare??
< sumedhghaisas> @marcosirc we can decide as time passes when to create a new branch, so there is no confusion
< sumedhghaisas> that way I can pull the specific feature from your repo if I want ...
< marcosirc> Ok.
< rcurtin> sumedhghaisas: it was some thing I came up with when I was a kid
< rcurtin> I was telling a friend of mine that my name was "nayr" (ryan backwards)
< rcurtin> and he said he knew someone named naywhayare
< rcurtin> I have no idea if he actually did, but I adopted it as an IM handle :)
mentekid has quit [Ping timeout: 260 seconds]
govg has quit [Ping timeout: 260 seconds]
< sumedhghaisas> @rcurtin ohh my god...
< sumedhghaisas> you are kidding me right??
< nilay> zoq: do you also host the mlpack irc server?
< nilay> sorry, a better question would be: can one channel be connected to by more than one server... or does it not depend on the channel...
< zoq> nilay: yeah, the irc network is an undirected graph, so my entry point could be different from yours
< nilay> zoq: so did you also see others getting split?
< zoq> nilay: sometimes, yes
nilay has quit [Ping timeout: 250 seconds]
nilay has joined #mlpack
< zoq> nilay: I've sent you the data by mail.
< nilay> zoq: ok
marcosirc has quit [Quit: WeeChat 1.4]
nilay has quit [Ping timeout: 250 seconds]
< rcurtin> sumedhghaisas: nope, no joke
< rcurtin> I think I was probably 8 or 9 years old? I can't remember
nilay has joined #mlpack
sumedhghaisas has quit [Ping timeout: 252 seconds]