verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#896 (master - 06cae13 : Ryan Curtin): The build was broken.
< keonkim>
i tried with array of char to test. but not with armadillo.
< rcurtin>
let the user specify... they may have datasets with some numeric features and some categorical features they want to one-hot encode
< rcurtin>
and since Armadillo does not allow different data types inside of a matrix, we would want to just use 'double' there
< rcurtin>
but if the function for one hot encoding is templated and accepts ElemType as a parameter, then the user can use uint8_t if they want
< keonkim>
hm, I don't think I understood correctly.
< keonkim>
how should I fit [[1, 0, 0, 0] [0, 1, 0, 0] [0, 0, 1, 0] [0, 0, 0, 1]] inside Mat<double> ?
nilay has quit [Ping timeout: 250 seconds]
< rcurtin>
do you mean, how should you represent the numbers? you can use insert_cols() to add the columns in the right place
< rcurtin>
and for the values you can just cast 0 and 1 to their double representations
< keonkim>
oh I get what you mean
< keonkim>
I was wondering how to use that with the original matrix.
nilay has joined #mlpack
< nilay>
rcurtin: sorry about the force push, i just came to know about it now.
< rcurtin>
it's okay, fortunately this time it was easy to fix
< rcurtin>
zoq and I looked into the permissions that github allows, and it turns out you can disable force pushes for repositories, so we went ahead and did that
< rcurtin>
since ideally nobody should be force pushing anyway, so this should help prevent accidents :)
< keonkim>
Hmm.. the original matrix will still hold the incrementing integers... I will think more about it.
< nilay>
i was trying to undo my push, and well didn't what manifestations the commands i used could have
< nilay>
didn't know*
< rcurtin>
it's okay, git is a complex tool and takes a long time to learn fully :)
< rcurtin>
keonkim: I'm still not sure what you mean... if you want to operate in place, you can add the new features with insert_rows() (sorry, I misspoke earlier, it should not be insert_cols()), and then you can remove the original categorical feature with remove_rows()
< keonkim>
rcurtin: ok
< rcurtin>
maybe I have overlooked something, I hope what I wrote is helpful, but it's possible I am not understanding the actual problem
< keonkim>
I think I was overthinking :p it's clear to me now.
< rcurtin>
ok, glad I could help :)
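A minimal sketch of the in-place one-hot encoding discussed above. It assumes rows are features and columns are points, and that categories are already mapped to the integers 0..numCategories-1. To stay self-contained it uses nested `std::vector` instead of Armadillo, so `Matrix`, `OneHotEncode`, and its parameters are hypothetical names, not mlpack API. The erase/insert pair stands in for removing the original row and calling `insert_rows()` (in Armadillo the removal method is `shed_rows()`). `ElemType` is a template parameter, so `double` or `uint8_t` both work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a matrix whose rows are features and whose columns are points.
// (Hypothetical alias; mlpack itself would use an Armadillo matrix.)
template<typename ElemType>
using Matrix = std::vector<std::vector<ElemType>>;

// Replace the categorical feature stored in row `feature` with
// `numCategories` indicator rows inserted at the same position.
// Assumes every value in that row is an integer in [0, numCategories).
template<typename ElemType>
void OneHotEncode(Matrix<ElemType>& data,
                  const std::size_t feature,
                  const std::size_t numCategories)
{
  assert(feature < data.size());
  const std::vector<ElemType> original = data[feature];

  // One new row per category, initially all zeros.
  Matrix<ElemType> encoded(numCategories,
      std::vector<ElemType>(original.size(), ElemType(0)));
  for (std::size_t point = 0; point < original.size(); ++point)
  {
    const std::size_t category = static_cast<std::size_t>(original[point]);
    assert(category < numCategories);
    encoded[category][point] = ElemType(1);
  }

  // Remove the original row, then insert the indicator rows in its place.
  const std::ptrdiff_t pos = static_cast<std::ptrdiff_t>(feature);
  data.erase(data.begin() + pos);
  data.insert(data.begin() + pos, encoded.begin(), encoded.end());
}
```

With a numeric feature in row 0 and a three-category feature in row 1, calling `OneHotEncode(data, 1, 3)` leaves row 0 untouched and replaces row 1 with three 0/1 rows, so the numeric and encoded features share one matrix of a single element type.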
< keonkim>
rcurtin: I have another bigger problem. Currently, when there is a missing variable inside what is supposed to be a number feature, the missing variable is converted to 0 and DatasetInfo changes it to categorical feature.
< keonkim>
so it becomes impossible to track missing value after mapping.
< keonkim>
we talked about how to redesign it, but I cannot come up with a new strategy :(
< rcurtin>
the issue is, what do we take to represent a missing variable?
< rcurtin>
sometimes this can be the string "NULL", sometimes this can just be a lack of anything (like "5, , 7" in a CSV)
< rcurtin>
so I think the user needs to be able to specify what they consider to be a missing value
< rcurtin>
I suppose we could modify the behavior of DatasetInfo to map certain strings not to categorical features but instead to a specifically chosen value to represent missing values (like NaN for doubles)
< rcurtin>
or actually I guess that modification would be for data::Load()
< keonkim>
maybe specifying it while loading can make it work
< keonkim>
yup and I was thinking while loading the data::Load can pass the specified missing variable to MapString to make it NaN.
< rcurtin>
let's see what tham thinks, I wonder if DatasetInfo is the right thing to use for the imputer
< rcurtin>
I think if we just modified data::Load() to have two more parameters, like a string (or set of strings) that should map to a certain value, and then that value (used to represent missing values)
< rcurtin>
then you could just load a matrix that did no mapping except, e.g., "" to NaN
< rcurtin>
then your imputer functions are easy since they just need to look for NaN
< rcurtin>
I dunno, do you think that would work? the issue with DatasetInfo is that it is made for encoding categorical features, but if you try to encode NaNs in a feature that's mostly doubles, then it will end up mapping all of the doubles
< rcurtin>
and that could be a huge number of values to map, so it would be very slow
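A rough sketch of the loading idea proposed above: the caller names the strings that count as missing and the value they should become, and the imputer then only has to look for that value. This is plain standard C++, not the real `data::Load()`; `ParseLine`, `ImputeMean`, `missingTokens`, and `missingValue` are hypothetical names, and the mean-based imputation is just one example strategy.

```cpp
#include <cmath>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Remove leading and trailing whitespace from a CSV field.
std::string Trim(const std::string& s)
{
  const std::size_t first = s.find_first_not_of(" \t\r\n");
  if (first == std::string::npos)
    return "";
  const std::size_t last = s.find_last_not_of(" \t\r\n");
  return s.substr(first, last - first + 1);
}

// Parse one comma-separated line into doubles.  Any field listed in
// `missingTokens` is stored as `missingValue` instead of being parsed.
// Fields that are neither missing tokens nor numbers make std::stod throw.
std::vector<double> ParseLine(const std::string& line,
                              const std::set<std::string>& missingTokens,
                              const double missingValue)
{
  std::vector<double> values;
  std::istringstream stream(line);
  std::string field;
  while (std::getline(stream, field, ','))
  {
    field = Trim(field);
    if (missingTokens.count(field) > 0)
      values.push_back(missingValue);
    else
      values.push_back(std::stod(field));
  }
  return values;
}

// Example imputer: replace every NaN with the mean of the non-NaN entries.
// Leaves the data unchanged if every entry is NaN.
void ImputeMean(std::vector<double>& feature)
{
  double sum = 0.0;
  std::size_t count = 0;
  for (const double v : feature)
  {
    if (!std::isnan(v))
    {
      sum += v;
      ++count;
    }
  }
  if (count == 0)
    return;

  const double mean = sum / static_cast<double>(count);
  for (double& v : feature)
  {
    if (std::isnan(v))
      v = mean;
  }
}
```

Because no categorical mapping is involved, the numeric values are never passed through a string-to-category map, which is the slowdown the DatasetInfo approach would run into.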
tsathoggua has joined #mlpack
tsathoggua has quit [Client Quit]
< keonkim>
hmm that way data::Load should always take double type matrix. or provide different strategy for integer matrix right?
< rcurtin>
yeah, if the user can specify the value that a missing value should be mapped to, it is no problem
< rcurtin>
I think, unfortunately, that our loading needs are becoming too complex to keep using Armadillo's load functionality
< rcurtin>
and instead maybe we will have to switch to using boost::spirit or something like that, to handle situations like this
< rcurtin>
I don't like maintaining our own loading code, but maybe there is no alternative here
< rcurtin>
keonkim: your tutorial for VS2015 is really, really nice! do you mind if I link to it in the mlpack docs and wiki?
< keonkim>
rcurtin: thank you :) and sure I don't mind
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#907 (master - da7a2c0 : Ryan Curtin): The build was fixed.
< rcurtin>
zoq: great, thanks. ck.mitted a fix with those new feature requirements
< rcurtin>
*committed, maybe I am not the best at phone typing :)
marcosirc has joined #mlpack
benchmark has joined #mlpack
benchmark has quit [Client Quit]
< rcurtin>
nice, pretty significant speedups for LSH
< lozhnikov>
rcurtin: I am testing RectangleTree without the dataset variable. It seems I have to do the refactoring for all tree traversals and all pruning rules and base cases. Tell me if I am mistaken.
< mentekid>
rcurtin: nice, is this from the second hash table?
< rcurtin>
unfortunately I think you are right, but fortunately I think the refactoring is straightforward
< rcurtin>
mentekid: yeah, I think so
< mentekid>
cool :D
< rcurtin>
er wait no it is from the hybrid search
< rcurtin>
lozhnikov: basically we will need to make the Rules classes stop holding references to the dataset and instead always use node->Dataset()
< rcurtin>
but it should be as simple as that and it should be easy to test
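A toy illustration of the refactoring described above, where the rules object holds no dataset reference of its own and reads points through the node it is visiting. `Node`, `NearestNeighborRules`, and their members are hypothetical stand-ins, not the actual mlpack tree or Rules classes; only the `Dataset()` accessor name comes from the discussion.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::vector<double>;

// Toy tree node that exposes the points it was built on.
struct Node
{
  std::vector<Point> dataset;

  const std::vector<Point>& Dataset() const { return dataset; }
};

// Rules object with no reference-set member: each base case fetches the
// reference point from the node passed in.
class NearestNeighborRules
{
 public:
  explicit NearestNeighborRules(const Point& query) :
      query(query),
      best(std::numeric_limits<double>::infinity())
  { }

  // Euclidean distance between the query and point `index` of the node's
  // dataset; also tracks the smallest distance seen so far.
  double BaseCase(const Node& node, const std::size_t index)
  {
    const Point& reference = node.Dataset()[index];
    double sum = 0.0;
    for (std::size_t d = 0; d < query.size(); ++d)
    {
      const double diff = reference[d] - query[d];
      sum += diff * diff;
    }

    const double distance = std::sqrt(sum);
    if (distance < best)
      best = distance;
    return distance;
  }

  double Best() const { return best; }

 private:
  Point query;
  double best;
};
```

Since the dataset is looked up per node, a tree whose nodes hold different (or reordered) datasets can reuse the same rules object without it ever going stale.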
< mentekid>
ah. Still cool :)
< rcurtin>
if you want I can do that refactoring but it may be a few days
< lozhnikov>
I'll do the refactoring since the R tree doesn't work without that.
< rcurtin>
okay, thanks. like I said earlier, the changes you have made are great, the RectangleTree code is in *much* better shape now
< rcurtin>
:)
< lozhnikov>
thanks:)
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#911 (master - c0d0563 : Ryan Curtin): The build passed.
< rcurtin>
we could just #define deprecated arma_deprecated and use 'deprecated'
< rcurtin>
maybe there is a better word, but I think 'arma_deprecated' would be weird to use directly since that's from a dependency and not mlpack itself
< zoq>
yes, right
< rcurtin>
do you want to apply those changes? I have a lot of other things to fix this week :)
< zoq>
I guess I could do it, or I could also open a new issue. Not sure right now.