naywhayare changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
naywhayare has joined #mlpack
udit_s_ has joined #mlpack
< naywhayare> udit_s_: when did you want to go over your code?
udit_s_ has quit [Quit: Ex-Chat]
oldbeardo has joined #mlpack
< oldbeardo> naywhayare: just sent you a mail
< naywhayare> oldbeardo: ok, thanks
< oldbeardo> naywhayare: if you are looking at it right now, tell me if there are any changes I should make
< naywhayare> oldbeardo: not at the moment -- I have some things I have to get done by later this afternoon so I am focusing on those now. but I will look through it tonight
< oldbeardo> okay, good thing tomorrow is Friday
< oldbeardo> naywhayare: by the way, how's your publication coming along?
Anand has joined #mlpack
< naywhayare> oldbeardo: little by little... lots of writing to do though
< oldbeardo> naywhayare: heh, best of luck
< Anand> Hi Marcus!
< Anand> I have added the unit tests
< marcus_zoq> Anand: Hello
< Anand> Please have a look and let me know if everything looks good
< Anand> I am planning the first merge this Sunday
< marcus_zoq> Anand: Okay great, I'll take a look in a few minutes.
< marcus_zoq> Anand: Okay, the tests look good, but could you apply the style guidelines (convert indentation to spaces)? that would be great.
< marcus_zoq> Anand: And we should set the path for the 'true_labels.csv' and 'probabilities.csv' file to 'tests/true_labels.csv' and 'test/probabilities.csv' so that we can run 'make checks' in the repository root.
< marcus_zoq> Anand: Another thing: you don't need to add the 'util' folder to the pythonpath, and I think we can delete the 'self.verbose' parameter?
Anand has quit [Ping timeout: 240 seconds]
< marcus_zoq> naywhayare: Is the cooling failure fixed?
< naywhayare> marcus_zoq: sort of... here's what I heard: "The holland main breaker apparently rusted out, they are both trying to repair it and wire in generators which were rolled up."
< naywhayare> so they have it back up now, and there's cooling, but it's possible they'll bring it down again over the weekend
< naywhayare> I need to find a better way to backup mlpack.org for periods of long downtime. a backup MX works great for mail, but http is a bit harder
< naywhayare> I think the best I can do is mirror the site on one of the build nodes, and then switch DNS if the main server goes down, and keep the DNS TTL relatively low to minimize site inaccessibility
< marcus_zoq> Yeah, that's probably the only way. Run the IRC script on another server?
< naywhayare> maybe, I'd need redundant loggers for that; I think just logging into IRC from irssi on some other system while the main server is down is sufficient for that
< naywhayare> then the logs can get rebuilt when the main server comes back up and I resync them
< marcus_zoq> I thought the script builds the irc log html pages immediately if a user makes a get request for the current irc logs? But there are no logs?
oldbeardo has quit [Ping timeout: 240 seconds]
< naywhayare> marcus_zoq: yeah, that's true. I guess I'd have to do a bit of thinking for how to back that system up correctly...
< marcus_zoq> naywhayare: If you need someone who is 'always' online to create a backup log .... but I think you have a lot of other servers that are able to do that :)
< naywhayare> actually, if you have logs from the downtime, that'd be great, and I'll work them into my existing logs for completeness
udit_s has joined #mlpack
< marcus_zoq> naywhayare: I've sent you an email.
< naywhayare> excellent, thanks
< udit_s> hey ryan and marcus, you guys free now ?
< udit_s> to discuss the code ?
< naywhayare> udit_s: sure
< udit_s> marcus ?
< udit_s> naywhayare: I'll mail him the details after this...
< udit_s> have you gone through the code ?
< naywhayare> not the most recent version; let me do that now
< naywhayare> I looked at the old version in a good amount of detail, though, so it should only take me a few seconds to identify the changes you've made
< udit_s> okay
< udit_s> so basically -
< udit_s> after the train and labels are loaded,
< udit_s> I check for the default case of distinct classes, i.e. there should be at least 2 classes - otherwise any test set can be given the default class.
< naywhayare> this is in decision_stump_main.cpp, right?
< udit_s> the flow in this trivial case is :
< udit_s> yeah,
< udit_s> oneClass = 1 if !isDistinct (labels)
< udit_s> we check the value of oneClass in classify, and then accordingly set the predictedLabels vector as the default class if so.
< udit_s> the first unit test in decision_stump_test.cpp checks this: it's by the name of OneClass
< naywhayare> (sorry for the slow responses, I am looking through the code as you say this to understand better :))
< udit_s> it's okay, really.
< udit_s> just interrupt me wherever you feel you should
< udit_s> now, if there is >1 class,
< udit_s> then the else condition in constructor executes.
< udit_s> it sets the variable bestAtt to a default val of -1. this tracks the attribute to split upon
< naywhayare> so, to me it seems like this tree can only handle data points it's already been trained on, unless I've understood this wrong
< udit_s> now, for each row in the training set - check if it is distinct
< udit_s> yeah... I was going to talk about that in the end.
< naywhayare> so suppose I train on { (0, 0), (1, 0), (2, 1), (3, 1) } where (i, j) --> i is the value, j is the class
< naywhayare> oh, ok. I'll wait until then, then :)
< udit_s> no, go on - I wanted to ask about how to deal with it.
< udit_s> so you were saying, ryan ...
< naywhayare> sorry, I had to grab some water
< naywhayare> anyway, suppose you have that training dataset I wrote above
< naywhayare> now, I give a test point (-1)
< naywhayare> it seems to me like the code simply won't return a label for that point, because -1 won't exist in uniqueAtt
< udit_s> this code won't be able to label it
< udit_s> one way to do this would be also look for a default class
< udit_s> while training
< udit_s> and then assign...
< naywhayare> so if you do that, and we say the default label is class 0, then I'll get the label 0 for the point (-1)
< naywhayare> which makes sense
< udit_s> yeah..
< naywhayare> but if I pass in the point (4), it will give the label 0... when really the label should be 1
< udit_s> which raises the question: should this be incremental ?
< naywhayare> yes, I definitely think it should be, because mlpack can't really handle categorical data
< naywhayare> it can handle ordered categorical data as size_t or some other integer class, but not things like "blue", "red", "green", etc. where ordering doesn't make sense
< naywhayare> there are definitely decision tree algorithms that _can_ handle categorical data, but I think because mlpack has no support for categorical data anywhere else, we don't need to worry about un-orderable categorical data here
< naywhayare> do you think that's reasonable?
< udit_s> this kind of answers a question of mine: we're expecting the user to have mapped their dataset on their own, right ? I mean, mapping categorical, unorder-able data like "blue","red" to 0,1 ...
< udit_s> otherwise, that will itself be a task on its own.
< naywhayare> yeah, the user should be passing in ordered data
< udit_s> I do think that's reasonable. :)
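The user-side pre-mapping being discussed here might look something like the sketch below. This is a hypothetical illustration (the function name and shape are mine, not anything from mlpack): un-orderable categorical values are assigned integer codes before the data is handed to the learner. The ordering the map imposes is arbitrary, which is exactly why this step is left to the user.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: map un-orderable categorical values
// ("blue", "red", ...) to integer codes in order of first appearance.
std::vector<std::size_t> EncodeCategories(const std::vector<std::string>& raw)
{
  std::map<std::string, std::size_t> codes;
  std::vector<std::size_t> encoded;
  for (const std::string& value : raw)
  {
    // Assign the next unused code the first time a category is seen;
    // insert() is a no-op (returning the existing entry) otherwise.
    auto result = codes.insert({value, codes.size()});
    encoded.push_back(result.first->second);
  }
  return encoded;
}
```

For example, EncodeCategories({"blue", "red", "blue", "green"}) yields the codes {0, 1, 0, 2}.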
< udit_s> okay, looking at your responses, it seems you've gotten the gist of the code... I'll talk about the tricky, messy and brute force things I wanted to talk about... is that okay with you ?
< naywhayare> yeah -- I think that'll make the class much simpler. now all you need to do is hold the splitting dimension and some kind of range for each class, then say basically if (test(splitCol, i) is in range[j]) { labels[i] = j; }, or something
< naywhayare> yeah, go ahead
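The range check naywhayare sketches above ("if (test(splitCol, i) is in range[j]) { labels[i] = j; }") could look something like this. It is a minimal standalone sketch, not mlpack's actual DecisionStump interface; the names and the half-open-interval convention are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Each bin j of the stump owns a half-open interval [lo, hi) in the
// splitting dimension; a test value gets the label of the bin that
// contains it.
using Range = std::pair<double, double>; // [lo, hi)

// Returns the index of the bin whose range contains `value`.  Values
// falling outside every bin are clamped to the first or last bin, so
// points the stump was never trained on still receive a label.
std::size_t ClassifyValue(double value, const std::vector<Range>& bins)
{
  for (std::size_t j = 0; j < bins.size(); ++j)
    if (value >= bins[j].first && value < bins[j].second)
      return j;
  return (value < bins.front().first) ? 0 : bins.size() - 1;
}
```

With bins covering (-inf, 1.5) and [1.5, inf), the training set { (0, 0), (1, 0), (2, 1), (3, 1) } from earlier classifies a test point like 4 into the second bin rather than failing to label it.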
< udit_s> ^ about that - I could keep updating the counts of the classes, and each time there is a value which the set has not been trained upon, you could just take out the max of those counts - that takes care of the dynamic part
< naywhayare> that's true, but I think by ignoring non-ordered categorical data, we can make a smaller and faster class
< naywhayare> also, just to clarify, are more than two labels allowed?
< udit_s> but should this just be done for missing values ?
< udit_s> yes, > 2 labels are allowed
< udit_s> what do you mean by a smaller and faster class ?
< naywhayare> smaller memory footprint and faster execution time
< udit_s> but if the user is already mapping the values and giving the test data to us, we treat all columns as numerical values, don't we ?
< naywhayare> yes; when I said "ignore non-ordered categorical data", I mean that we should ignore the possibility of the user passing non-ordered categorical data
< naywhayare> and write a class that is specifically for numerical values
< udit_s> exactly what I've done. Or assumed when I wrote the class.
< udit_s> okay, so another question I had was -
< naywhayare> are you sure? consider again the sample dataset I wrote above; any test point less than 1 should be classified as class 0, and any test point greater than 2 should be classified as class 1
< naywhayare> a decision tree on numerical data should see that the split value is 1.5 (or really any value greater than 1 and less than 2), and use that to label the test data
sumedhghaisas has joined #mlpack
< udit_s> wait, I haven't taken a binary split. I've done it like so: http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html
< naywhayare> ok. even for a larger number of splits, each split should correspond to a range
< naywhayare> like in that first example on the page you linked to, the first split corresponds to (-infinity, 30000); the second to [30000, 70000]; the third to [70000, infinity)
< naywhayare> take a look at the paper "Very simple classification rules perform well on most commonly used datasets."
< naywhayare> Appendix A contains pseudocode for the OneR splitting criteria, that will take continuous variables and split them into ranges
< udit_s> but if you say we ignore the categorical data, then how do you group ? I mean, isn't a one-level decision stump more like a one-level rule learner - rule mining that stops after the first level ? here you are labeling each unique instance per column of training data by majority voting, after finding the column to best split on.
< naywhayare> you group by splitting the numerical data into ranges, like in the first example on the id3.html page you linked to
sumedhghaisas has quit [Ping timeout: 240 seconds]
< naywhayare> I thought you were implementing OneR, which is why I linked to the paper I did; if you're actually implementing another algorithm, that's okay, but we should make sure it handles numerical data just fine
< naywhayare> step 3 of the pseudocode in appendix A is the bit that splits a numerical feature into sets of intervals
< naywhayare> it's written somewhat confusingly though... :(
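One reading of that interval-construction step could be sketched as follows. This is my interpretation of the idea being discussed, not the paper's exact pseudocode or mlpack code: sort the values, then place a split point wherever the class changes, subject to a minimum bucket size so a noisy continuous attribute doesn't get one bin per value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of OneR-style interval construction (an interpretation, not
// the paper's exact algorithm).  Input: (value, class) pairs for one
// attribute.  Output: split points; each resulting interval gets
// labeled by majority vote during training.
std::vector<double> FindSplitPoints(std::vector<std::pair<double, int>> data,
                                    std::size_t minBucketSize)
{
  std::sort(data.begin(), data.end());
  std::vector<double> splits;
  std::size_t bucketStart = 0;
  for (std::size_t i = 1; i < data.size(); ++i)
  {
    // Start a new interval only when the class changes and the current
    // bucket already holds at least minBucketSize points.
    if (data[i].second != data[i - 1].second &&
        (i - bucketStart) >= minBucketSize)
    {
      splits.push_back((data[i].first + data[i - 1].first) / 2.0);
      bucketStart = i;
    }
  }
  return splits;
}
```

On the earlier example { (0, 0), (1, 0), (2, 1), (3, 1) }, this produces a single split at 1.5, i.e. the two ranges (-inf, 1.5) and [1.5, inf).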
< udit_s> right now, I had not taken care of handling continuous data - I was just assuming, say, given the playing data - outlook: {sunny, overcast, rainy}, temperature: {high, low}, humidity: {normal, high}, windy: {false, true} - class: play: yes or no.
< naywhayare> yeah -- those are all un-orderable categorical data, which we don't have a reasonable way to handle in mlpack
< naywhayare> or, I guess that's not quite true
< naywhayare> you could order each of those features reasonably
< naywhayare> even so, we should definitely focus on the case where the data is continuous too; continuous-valued data is likely to be the majority of data given to this class
< udit_s> say the user maps to numerical attributes and gives us the data - yes : 1 no: 0 and we make the decision stump from there. That is what train3.csv does
< udit_s> yes, I realise that.
sumedhghaisas has joined #mlpack
< naywhayare> ok. I took a look through the tests, too; for the type of data you're considering, they look just fine
< udit_s> if we do train on numerical data, how do you distinguish continuous attributes from numerically orderable categorical attributes ?
< udit_s> I think I need to go take care of continuous attributes now..
< naywhayare> I think that it's reasonable to just assume that all data is continuous
< naywhayare> and if the data is actually categorical, when the stump is training on data, it should manage to create intervals that correspond to the integer-valued categorical attributes
< naywhayare> (although I guess the categorical attributes don't have to be integer-valued, they just have to be ordered)
< naywhayare> does that make sense? or maybe my ideas aren't good? :)
< naywhayare> I suppose it's also possible for a user to pass in a flag specifying what type of data they have
< naywhayare> but I think that a decent interval creation procedure should give the same results
< udit_s> hmm... I'll have to think over this.
< udit_s> I feel like this is a conversation we should have before we begin something next time...
< udit_s> keyword being before :)
< naywhayare> yeah, I agree
< naywhayare> I'm sorry about the confusion
< udit_s> I'm sorry too - I should have thought about this..
< naywhayare> either way, I think much of the code you have written is reusable, so I don't think this is a huge problem
< naywhayare> you may want to refer to the Weka implementation of OneR to see how they are choosing intervals
< naywhayare> although, the pseudocode given in Appendix A isn't very clear on how the intervals should be chosen
< naywhayare> I'm sure we can figure something out, though. I'll dig a little more into papers and documentation and code and see what I can find
< udit_s> okay, let me think about this, I'll get back to you tomorrow, or over the weekend with a mail. Could you also tell me whether we could do a better implementation of what data is stored - like storing the splitting attribute label matrix
< naywhayare> you mean the variable DecisionStump::spL ?
< udit_s> I'm just storing the class labels and the split col and splitlabel matrix as class variables..
< udit_s> yeah. ( I need to come up with better variable names)
< naywhayare> it's okay :)
< naywhayare> I think that there's no way around storing a matrix like spL
< naywhayare> because you need the class probabilities for each interval (or category)
< naywhayare> which implies that you either need the counts for each interval/category and each class, or the probabilities for each interval/category and each class (which are basically the same thing)
< naywhayare> for smaller numbers of categories this shouldn't be a problem
< udit_s> yeah, exactly. I was initially storing the entire training data, to be reused again in classify, which is what I removed today.
< naywhayare> I would think the case where the number of splits in the decision stump is extremely high is rare
< naywhayare> yeah, storing a precomputed matrix like spL is, I think, the best thing you can do
< naywhayare> it may be better to normalize that matrix by the number of points in each class, so it's a probability instead of a count, but that depends on whether you are using the probability or the count later in the class
< udit_s> okay - also, I was just thinking before this chat - the default/starting value of bestEntropy in the constructor should be 0 instead of 1.
< udit_s> that's the worst case.
< naywhayare> are you sure the worst case isn't -infinity (or -DBL_MAX)?
< naywhayare> the entropy is something like \sum_i p_i \log p_i
< naywhayare> so if p_i is very small then \log p_i is very negative
< naywhayare> or hang on, maybe I have this backwards
< udit_s> maybe ?
< naywhayare> let me check
< naywhayare> ah, yeah, it's -\sum_i p_i \log p_i
< naywhayare> so I did have it backwards
< udit_s> when every bin is a unique value in itself - kind of like splitting on EVERY value in a continuous attribute - you would get 1*log2(1)
< udit_s> summing over all of them, you would sum to 0.
< udit_s> am I right ?
< naywhayare> I think so, but I think the formula is for only one bin
< naywhayare> so the formula for all bins is just the sum of the formula for only one bin
< udit_s> yeah, but the weighted sum.
< naywhayare> if each bin had one point that was correctly classified, then yeah, it would be 0
< naywhayare> oh, is it a weighted sum? ok
< udit_s> yeah.
< naywhayare> either way, if I had one bin that was 99.9999% classified accurately and 0.0001% classified wrong, then I'd have -(.999999 log_2 .999999 + .000001 log_2 .000001)
< naywhayare> that comes to 2.13e-5
< udit_s> so the worst case would be a 0. Keep in mind this is only the value of the entropy of the split, not the information gain, which is the entropy of the parent minus the weighted sum of the entropies of the children.
< udit_s> oh. yeah.
< naywhayare> hang on... I think I am wrong about the DBL_MAX
< naywhayare> yeah, I'm wrong; the worst case is an equal split of 0.5: -2(0.5 log2 0.5)
< naywhayare> and that's equal to 1
< naywhayare> so it seems to me like the worst case is 1 and the best case is 0
< naywhayare> so are you sure the default value of bestEntropy should be 0, then?
< naywhayare> or maybe I have screwed up my math
< udit_s> let me check this.
< naywhayare> ok, it turns out my math was wrong, sort of
< naywhayare> the worst case for entropy is when N classes each have equal probability
< naywhayare> which gives an entropy of
< naywhayare> -N * ((1 / N) * log2(1 / N))
< naywhayare> because (1/N) is the probability of each class, when they have equal probability
< udit_s> yeah, - worst case is equal probability of all classes.
< naywhayare> -N * ((1 / N) * log2(1 / N)) = -log2(1 / N) = -log2(N^-1) = log2(N)
< naywhayare> so in the worst case the entropy is log2(N), not 1
< naywhayare> and N can become very large, so as N goes to infinity, then I guess the worst case entropy goes to infinity too
< udit_s> whereas the best case would be all outcomes are in one class.
< udit_s> so DBL_MAX ?
< naywhayare> yeah, I think that's what you should use
< udit_s> okay,
< naywhayare> to produce an entropy of DBL_MAX, a user would need to have 2^(10^308) classes... that's a lot of classes :)
< udit_s> yeah. :D
< naywhayare> I'm not sure there are even that many particles in the universe
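The entropy arithmetic worked out above (N equally likely classes giving -N * (1/N) * log2(1/N) = log2(N)) can be checked numerically with a small standalone helper; this is an illustrative sketch, not mlpack's entropy code:

```cpp
#include <cmath>

// Entropy of a split with N equally likely classes:
//   -N * ((1/N) * log2(1/N)) = log2(N).
// So the worst case grows with N, which is why DBL_MAX (not 1) is the
// safe starting value for bestEntropy in the stump's constructor.
double UniformEntropy(int n)
{
  const double p = 1.0 / n;
  return -n * (p * std::log2(p));
}
```

For instance, two equally likely classes give an entropy of 1 (the binary case discussed above), and eight give log2(8) = 3.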
< udit_s> so...anything else you thought about when you looked at the code ? any specific test I should be writing ?
< udit_s> till now, I've taken note of: 1) handle continuous attr 2)worst case entropy 3) missing values - majority classes.
< naywhayare> I had some ideas for how to make the code more efficient, but it's usually not worth applying those ideas until the code is ready to commit
< naywhayare> I wouldn't worry about missing values. it's something that's never been thought of for mlpack methods, although it does happen often in real life
< udit_s> okay. I'll try and include it, if time permits.
< naywhayare> the only two other things I was thinking about are to actually test the entropy calculation inside of the EntropyCalculation test, and to implement the bit in decision_stump_main.cpp that says /*Code goes here*/ :)
< naywhayare> or, I guess in that section (lines 51 -- 65) you are testing the dimensionality of the labels and training data... so... nevermind :)
< sumedhghaisas> naywhayare: Premature optimization is the root of all evil -- Donald Knuth :) sorry for that... something I remembered while reading naywhayare's sentence
< udit_s> yeah :D I totally forgot about that.
< naywhayare> sumedhghaisas: definitely :)
< sumedhghaisas> naywhayare: Needed a little help.. you free right now??
< naywhayare> sort of... someone is coming to meet with me, so I'll have to step out
< naywhayare> but I'm probably here for about 5 minutes more
< naywhayare> sudden unexpected meeting :(
< sumedhghaisas> ohh... yeah no problem...
< sumedhghaisas> I was just reading Regularized ALS paper I mentioned in my application...
< naywhayare> by Cichocki and Zdunek?
< sumedhghaisas> In it... There in one paragraph they have mentioned that...
< sumedhghaisas> yup...
< sumedhghaisas> hierarchical multi layer system for optimizing matrix factorization...
< naywhayare> gotta go... back later, sorry
< sumedhghaisas> okay okay...
< udit_s> okay.
< sumedhghaisas> udit_s: Hello udit... Hows it going?? I heard from naywhayare that you are using Variadic Templates...
< udit_s> hi sumedh,
< udit_s> yeah I will be, later, when I implement adaboost
< udit_s> are you also working with them ?
< sumedhghaisas> I read about them but I was really confused about their application though... As in, in what situation they will be needed...
< sumedhghaisas> no... I am not quite familiar with C++11 yet...
< udit_s> me too,
< udit_s> the way I understood them was you can basically have varying number of template parameters
< udit_s> which in the case of how i will implement them helps pretty deftly.
< sumedhghaisas> yes thats what my understanding is right now...
< udit_s> I still need to implement them in some practical program... that should surely help.
< sumedhghaisas> Its a very cool thing if you think of it... But still haven't figured out a major situation in which it is very helpful...
< udit_s> It is, for me. It basically allows a variable number of weak learner models to be used in boosted methods... depending upon the parameters,
< sumedhghaisas> but do they necessarily have to be template parameters?? can they not be function parameters??
< udit_s> no, the class is an adaboost class - with models to learn from/through as variadic parameters.
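The idea being described (an AdaBoost class taking a variable number of weak-learner model types) could be sketched with a variadic template like the one below. The names here are hypothetical, not the eventual mlpack AdaBoost interface:

```cpp
#include <cstddef>
#include <tuple>

// Hypothetical sketch: a booster parameterized on any number of weak
// learner TYPES via a template parameter pack.  The pack is a template
// parameter (not a function parameter) because the set of learner
// types must be known at compile time to declare the tuple member.
template<typename... WeakLearners>
class Booster
{
 public:
  // One instance of each weak learner type, stored in a tuple.
  std::tuple<WeakLearners...> learners;

  // Number of weak learner types supplied at compile time.
  static constexpr std::size_t NumLearners = sizeof...(WeakLearners);
};

// Stand-in weak learners for illustration.
struct StumpA { };
struct StumpB { };
```

Instantiating `Booster<StumpA, StumpB>` gives NumLearners == 2; adding a third learner type requires no change to the class itself, which is the appeal over a fixed-arity design.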
< sumedhghaisas> Okay... I am not familiar with adaboost... But yeah, now at least I know what to look at...
< sumedhghaisas> thanks...
< udit_s> you're welcome.
< udit_s> anyways, ryan, I'll catch you later. let me look at the things we discussed.
udit_s has quit [Quit: Ex-Chat]
< naywhayare> sumedhghaisas: back; what was the paragraph you were having trouble with?
< sumedhghaisas> naywhayare: in section 3...
< sumedhghaisas> second paragraph
< naywhayare> ok, I see
< naywhayare> what's the question?
< sumedhghaisas> I didn't understand the part where they have said that hierarchical multi-layer system is better...
< naywhayare> I'm not sure that it's known _why_ the hierarchical multi-layer system is better, but it seems to be better empirically
< sumedhghaisas> And exactly how this system can be used??
< naywhayare> it looks like you just run RALS over and over again on the X_i matrix
< sumedhghaisas> But is the overhead of this system worth the improvement in the results??
< naywhayare> that's a very subjective question that depends on the application's needs
< sumedhghaisas> If it is then I can try to incorporate it while coding....
< naywhayare> if you're planning on implementing that support, it should definitely be something that the user specifies
< naywhayare> your choice
< naywhayare> either way is fine
< sumedhghaisas> Yes... I was planning like having another argument which specifies the number of such iterations...
< sumedhghaisas> default 1...
< sumedhghaisas> is there any other criterion which can be considered??
< sumedhghaisas> for stopping this process...
< sumedhghaisas> Also the paper mentions the Moore-Penrose pseudo-inverse.. I know matlab's pinv function computes this inverse... is the pseudo-inverse provided by armadillo the same??
< naywhayare> sorry, I stepped out again
< naywhayare> armadillo's pinv() is fine
< naywhayare> couldn't you use the residual of the decomposition for another stopping criterion?
< sumedhghaisas> yeah... so the user can also provide minResidual for the hierarchical multi-layer system...