verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
tsathoggua has joined #mlpack
tsathoggua has quit [Client Quit]
< keonkim> how is NaN treated in C++? I was thinking of treating missing variables as NaN (so that the user can decide what to replace it with). Could that present any possible problem?
tham has joined #mlpack
< tham> Last night I studied the code of the structured random forest and have some questions about it; I posted them at http://pastebin.com/6spZW4Fm
< tham> zoq nilay : Does armadillo or mlpack provide a distance_transform algorithm? I can implement it if there is no such algorithm
< rcurtin> keonkim: I thought about this idea for a while, but problems are that NaN will only be available for double and float, and if you use any algorithm with a data matrix that has a NaN in it, the results will probably all be useless
< rcurtin> but you are right, a missing value is, indeed, not a number :)
< rcurtin> I guess, depending on what you are using it for, missing values as NaNs might be reasonable
< keonkim> I originally thought the current string-to-value mapping (which uses size_t) could be replaced with a signed int to represent missing or wrong values as -1. But then I realized the original data can contain negative values.
< keonkim> how about we use unordered_map<size_t, pair<bimap<string, size_t*>, size_t>> instead, so that we could replace size_t* with null when a value is missing?
nilay has joined #mlpack
< nilay> tham: i implemented the distance transform algorithm
< nilay> tham: we add positive and negative locations and store them in the list, then compute reg_ftr and ss_ftr at both the edges (pos_loc) and non-edges (neg_loc)
< rcurtin> keonkim: not sure, that seems like it would be a lot of overhead. I need to go to sleep, I will get back to you tomorrow
< rcurtin> I think that one issue here is that it is possible to have a dataset where the word "null" does not indicate a null value but is instead a valid entry, so we need to be careful with "auto-detecting" missing values
< nilay> tham: i don't know the answer to the third question. "#store the unique, invertable index, why? "
< keonkim> rcurtin: thanks good night!
< keonkim> and yes, I was thinking we could let the user define what a missing value is by getting the mapped strings with their values, using UnmapValue, which returns a value given a string (instead of UnmapString, which returns a size_t).
Mathnerd314 has quit [Ping timeout: 240 seconds]
mentekid has joined #mlpack
< tham> nilay : Thanks, now I know why it adds pos_loc and neg_loc; it appends to the list rather than adding the values. Ah, stupid question
< tham> I will spend more time studying the code this weekend
< nilay> tham: yes :)
wasiq is now known as intense
intense has quit [Ping timeout: 258 seconds]
< tham> keonkim rcurtin : maybe we could define a template parameter to let the user define what kind of policy they want for dealing with missing values?
< tham> or set an std::function for them to specify the policy
< tham> I think std::unique_ptr<size_t> would be better than raw pointer
< tham> or use boost::optional<size_t> to represent null value
< tham> Not sure using unordered_map here is a good idea
< tham> it needs measurement; std::map outperforms it when the data is small
< tham> either using size_t*, std::unique<size_t*> or boost::optional
< tham> the api needs to change--do we have a better solution for missing values without changing the api?
< keonkim> I will research what tham listed above.
< keonkim> adding to what I've said earlier, this is the basic implementation of what I am trying to do.
wasiq has joined #mlpack
< tham> sorry, it should be std::unique_ptr<size_t>, typo
mentekid has quit [Ping timeout: 272 seconds]
mentekid has joined #mlpack
mentekid has quit [Client Quit]
mentekid has joined #mlpack
mentekid has quit [Client Quit]
mentekid-mobile has joined #mlpack
mentekid-mobile has quit [Quit: Bye]
mentekid-mobile has joined #mlpack
mentekid-mobile has quit [Client Quit]
< tham> keonkim rcurtin : Maybe we can provide a new api that returns the size_t wrapped in boost::optional?
mentekid has joined #mlpack
< tham> provide a template as a policy to allow users to set up their mapping strategy
< tham> if the strategy allows boost::optional, change the type from size_t to boost::optional<size_t>
< tham> if not, keep it as size_t
< tham> keonkim : I saw your imputer concept
< tham> I do not have a clear vision how to design it yet, I will reference another library in a few days, maybe we could discuss more about that later on
< keonkim> tham: ok
< tham> keonkim : I am just trying to give some pointers about the design; you do not need to wait for my response, you can design the function/class as you like
< tham> maybe your design will be much better than the existing libraries
tham has quit [Quit: Page closed]
< keonkim> I will try haha
nilay has quit [Quit: Page closed]
wasiq is now known as intense
< rcurtin> keonkim: I like your basic implementation, it seems like a good way to do it to me
< rcurtin> one thing to consider is that if a user gives a big CSV full of doubles but one of the features has a NULL or two in it, if DatasetInfo maps the data, the map will end up being huge...
< rcurtin> so I dunno, maybe DatasetInfo as it is might not be the best for imputation
< rcurtin> still I think the API you are providing is good, and I think maybe you can use a template parameter to the Imputer function to allow the user to specify their own imputation strategy
< rcurtin> tham: when you say, let them define a template parameter for the policy to deal with the missing value, do you mean a template parameter for DatasetInfo to be used during load and mapping?
< rcurtin> maybe that would be the best way to go... I am not sure, these are difficult problems to solve well :)
nilay has joined #mlpack
marcosirc has joined #mlpack
< keonkim> boost::optional seems like a good choice, but I got another idea.
< keonkim> How about we set maps to: unordered_map<size_t, pair<bimap<string, double>,size_t>> maps;
< keonkim> the difference from the original one is that size_t is changed to double in bimap<string, double>
< keonkim> this is a bit unintuitive, but I think this way we can assign NaN while keeping the added overhead to a minimum.
< keonkim> (i am referring to dataset_info.hpp#L104)
< keonkim> again, but this could increase the overhead of the whole dataset just because of a few missing values.
< rcurtin> keonkim: I agree, that is best for the situation where you are trying to find NaNs, but it breaks for categorical data
< rcurtin> since there might be more than one value that needs to be mapped
< rcurtin> maybe with some template trickery we can adapt the DatasetInfo class to use either doubles or size_ts for mapping, depending on the use case
< rcurtin> like DatasetInfo<double> will map anything unrecognized in a dimension to NaN and leave everything else alone, and DatasetInfo<size_t> will map any categorical features to size_t's (like it is now)
< rcurtin> but maybe at that point it is better to use a different class, because the functionality is too different... I am not sure
Mathnerd314 has joined #mlpack
< mentekid> rcurtin: I can't compile the tests for LSH Table Access, because compilation of the serialization test fails
< nilay> i am implementing a random_permutation function using the C++11 <random> library. my doubt is that this function gives the same output every time you run it because the seed is fixed. Is this ok, or should i somehow remove the seed?
< mentekid> I'm not sure I understand how the version == 0 piece of code in your comment works, where would the version variable come from?
< mentekid> ah i just noticed the commented out version variable sorry
< nilay> or maybe i should initialize the generator and distribution outside the function...
< nilay> srand(time(0)) doesn't do anything, seed is set internally.
< rcurtin> mentekid: yeah, in most serialization methods it is commented out because there has only ever been one version
< zoq> nilay: I'm not sure, but it looks like arma::shuffle would probably be the better solution.
< mentekid> so to get the version for if (version == 0) do I just need to uncomment it or run the expansion of BOOST_CLASS_VERSION() ?
< rcurtin> you just need to uncomment that parameter
< rcurtin> you also need the expansion of BOOST_CLASS_VERSION(), which I just posted in the ticket, but that expansion is what will set the version to 1 at compile time
< mentekid> Sorry it's still not clear to me - do I put that code in lsh_search_imp or some other file? I have never dealt with serialization before
< nilay> zoq: ah, i couldn't find such a thing at first. thanks.
< rcurtin> mentekid: ah, sorry, I should have clarified. yeah, add the BOOST_CLASS_VERSION() expansion to lsh_search.hpp
< rcurtin> I'd put it there instead of lsh_search_impl.hpp, so people can see what the version is
< zoq> nilay: yeah, the function name is probably not the best.
< nilay> zoq: yeah :)
< mentekid> rcurtin: I get "error: 'version' is not a class template"
< marcosirc> Hi zoq: rcurtin: I have added the number of "BaseCases" as a metric for allknn and allkfn, in the benchmarking system, to do some tests with different mlpack versions. Would you like this to be included in the main repo? I mean, do you think this would be useful? Do you use the benchmarking system to compare different versions of mlpack? If you want I can make a pull request...
< mentekid> I've put the snippet in lsh_search.hpp inside the neighbor and mlpack namespaces
< mentekid> I think it would be faster if you edited my pull request and then I pull it back, since I have no idea what I'm doing here :P
< rcurtin> marcosirc: I think it would be useful, but it might be difficult to extract that information from other libraries
< rcurtin> mentekid: okay, that's no problem, I'll take care of the serialization bit
< mentekid> thanks and sorry :/
< rcurtin> no, it's no issue, don't worry about it :)
< rcurtin> boost::serialization is... well, I don't want to call it a nightmare, because there are good things about it (like easy serialization), but it can be hard to work with
< rcurtin> kinda like CMake
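The expansion rcurtin posted in the ticket is not reproduced in this log; the following is a generic, hedged sketch (assuming Boost.Serialization) of what the manual expansion of BOOST_CLASS_VERSION() looks like for a class template. It also explains mentekid's error: the specialization must sit at global namespace scope, because nesting it inside the mlpack/neighbor namespaces makes the compiler resolve the wrong `version` name.

```cpp
// Hedged sketch, not the actual patch: BOOST_CLASS_VERSION cannot be applied
// directly to a class template, so the version trait is specialized by hand.
// This must appear at global namespace scope (conventionally in
// lsh_search.hpp, so the version is visible to readers).
namespace boost {
namespace serialization {

template<typename SortPolicy>
struct version<mlpack::neighbor::LSHSearch<SortPolicy>>
{
  BOOST_STATIC_CONSTANT(int, value = 1); // bump past the implicit version 0
};

} // namespace serialization
} // namespace boost
```

With this in place, uncommenting the `version` parameter of the serialize method lets old archives be loaded via `if (version == 0)` while new ones are written as version 1.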
intense has quit [Quit: Leaving]
< marcosirc> Yes, I agree. Ok, I will make the pull request; then in the future you can decide if it is useful.
< mentekid> unrelated question regarding installation and use (I'm using mlpack code for my thesis). I have cloned mlpack from github and compiled it, so it has generated the proper directories like include/ and lib/
< mentekid> I want to compile a program that uses mlpack/core.hpp and mlpack/methods/lsh/lsh_search.hpp
< mentekid> so I compile with g++ mylshv1.cpp -I include/ -L lib/libmlpack -lmlpack -larmadillo -lboost_program_options --std=c++11
< mentekid> mylsh.cpp is in the mlpack/build directory
< mentekid> but g++ says /usr/bin/ld: cannot find -lmlpack collect2: error: ld returned 1 exit status
< mentekid> what am I doing wrong? Since I've pointed to lib/libmlpack, should g++ find the library?
< mentekid> ah
< mentekid> disregard all that
< mentekid> sorry for spamming I was just doing something stupid
< rcurtin> marcosirc: zoq: if we only have one library calculating a particular metric, is that a problem?
< rcurtin> (for the benchmark system that is)
< rcurtin> mentekid: yeah, -Llib/ should fix that
< marcosirc> No, no problem. :)
< rcurtin> I guess the entries in the sqlite database for other libraries that don't calculate BaseCases are simply not there and will not be displayed by the frontend
< marcosirc> Yes.
< rcurtin> yeah, in that case, I think that would definitely be useful to add to the benchmarking system
< rcurtin> I am still thinking about why there was no difference for B_aux with cover trees
< marcosirc> rcurtin: yeah, it is intriguing...
< marcosirc> Now, I am focusing on modifying the prune rule to do approximate search...
< marcosirc> If you prefer I can investigate about cover trees, I really enjoy learning about this, but I am trying to be productive...
< rcurtin> cover trees take forever to learn about, I can think about that one, don't worry about it :)
< rcurtin> to be perfectly honest I don't think that cover trees are all that interesting despite the attention they receive in the community:
< rcurtin> a) the worst-case single-tree O(log N) search time proof is wrong
< rcurtin> b) construction time is huge for large datasets
< rcurtin> c) search time is often not faster than kd-trees
< rcurtin> mentekid: I have something working but the test segfaults, when I figure it out I will send you the patches :)
< rcurtin> so I guess, I don't have something working then, I just have something compiling :)
< zoq> marcosirc: feel free to open a pull request
< marcosirc> rcurtin: Ok, thanks for that information! Yes, I have been looking for further documentation. I didn't find much, apart from the main paper and a video of John Langford's talk.
< marcosirc> zoq: ok, just in a minute. Thanks.
< rcurtin> marcosirc: some of the papers I have written may contain better descriptions of the cover tree, like "Plug-and-play runtime analysis for dual-tree algorithms", but other than that and the implementation in mlpack, documentation is scarce
< rcurtin> I think the original paper is hard to read because they have this "implicit representation" and "explicit representation"
< rcurtin> they say the implicit representation is easier for proofs, but it really is not that much harder to just use the explicit representation for proofs, and then the paper is way easier to understand because you don't have to juggle two representations in your mind
< marcosirc> rcurtin: yeah, I agree that mixing both representations is confusing.
< marcosirc> I will take a look at your paper in my free time. Thanks!
< rcurtin> marcosirc: it is not guaranteed to be interesting :)
< marcosirc> haha
mentekid has quit [Ping timeout: 244 seconds]
sumedhghaisas has joined #mlpack
mentekid has joined #mlpack
mentekid has quit [Ping timeout: 260 seconds]
nilay has quit [Ping timeout: 250 seconds]
< keonkim> I messed up using double. I tried it, but having NaN all over the place gets messy really quickly. I will try to find another way tomorrow.
sumedhghaisas has quit [Ping timeout: 260 seconds]
nilay has joined #mlpack
< nilay> how can i do element-wise comparison of 2 matrices in armadillo?
< rcurtin> nilay: if I have two matrices, A and B, if I do "arma::umat C = (A == B)"
< rcurtin> then C will have a zero where the elements are not equal and a 1 where elements are equal
< rcurtin> you could do other comparisons too, like "(A < B)" and whatever else
< rcurtin> you could count the number of elements that are equal... arma::accu(A == B)
< rcurtin> hopefully this is helpful :)
< nilay> rcurtin: and to perform element-wise comparison of a matrix and a scalar? i know i can use find, but that does not return a matrix.
< nilay> rcurtin: i tried using transform() , but i don't know how to pass a variable into the lambda expression
< rcurtin> couldn't you do (A == 5.0)?
< nilay> can i do that
< nilay> ok i can.
< nilay> i was searching so much for this simple stupid thing. :/
< rcurtin> it's okay, sometimes that is the way it goes :)
< rcurtin> it can take a long time to get fully familiar with armadillo and its features
< rcurtin> sometimes there are things you can do that are not documented well
< nilay> yeah i guess.
tsathoggua has joined #mlpack
tsathoggua has quit [Client Quit]
mentekid has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]
nilay has joined #mlpack
< mentekid> rcurtin: tests pass, everything seems to work fine :)
< mentekid> I just pushed to my PR branch
< rcurtin> awesome, great
< rcurtin> is it ready to merge or were there any other comments that needed to be handled?
< mentekid> I haven't implemented the merge of BuildHash() and Train()
< mentekid> If you want I can do that tomorrow and then we can merge, or we can merge and then I can do that later
< mentekid> both are fine by me :)
< rcurtin> I'll go ahead and wait until you make those changes then
< rcurtin> also don't forget to return a const arma::cube&, not a const arma::cube, from Projections() :)
< mentekid> I thought I changed that
< mentekid> I'll do it together with the BuildHash merge
< rcurtin> yeah, looks like it is still arma::cube, but even if you didn't I'd just do it after the merge, it is not so hard to add a single character :)
< lozhnikov> rcurtin: Hi, I issued a PR.
< rcurtin> lozhnikov: I saw it, I was hoping to take a look later today or tomorrow
< rcurtin> unless you need me to do something with it now, just let me know
< rcurtin> I glanced at it, it looks pretty comprehensive, but I agree that an armadillo object would probably be better than std::list
< lozhnikov> No, it is not urgent. I can think up some new tests or revisit R+ trees.
< rcurtin> what do you think of the idea of refactoring the RectangleTree to use the localDataset in each node instead of just referencing the bigger dataset, in order to allow InsertPoint() and DeletePoint() to actually be useful?
< rcurtin> that isn't in the scope of your proposed project, so we can avoid doing that if you prefer, but I wanted to toss it out there as an idea
< lozhnikov> Hm.. I tried to do it with DiscreteHilbertValue, but it is only partially done.
< lozhnikov> I have some ideas. I should think about it.
< rcurtin> the Rules classes that are used by the dual-tree algorithms would possibly need to be updated, but I can do that
< lozhnikov> For example, we can add something like arma::Col<ElemType> *insertedPoint. When a point is being inserted this is not null. The HilbertValue class can contain the same variable and there is no need to calculate the Hilbert value several times.
< lozhnikov> Ok, I'll try to do the refactoring for the RectangleTree
< rcurtin> yeah, if you think that is reasonable and will not take too much time, I think that would be a nice change
< rcurtin> we can also remove some of Andrew's incorrect documentation about whether or not localDataset is used... there are some misleading comments in rectangle_tree.hpp, I think
< lozhnikov> I think it shouldn't take a lot of time. I didn't read all the comments :). I'll take a look.
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#870 (master - dca52fd : Ryan Curtin): The build passed.
travis-ci has left #mlpack []
travis-ci has joined #mlpack
< travis-ci> mlpack/mlpack#871 (master - eba4f99 : Ryan Curtin): The build passed.
travis-ci has left #mlpack []
nilay has quit [Ping timeout: 250 seconds]
mentekid has quit [Ping timeout: 276 seconds]
tsathoggua has joined #mlpack
tsathoggua has quit [Client Quit]