< oldbeardo>
should I follow the same abstractions in my implementation?
< naywhayare>
I'm hoping to be up and at a computer by 11:30 UTC every morning on weekdays
< naywhayare>
let me look, but probably not; that was very old code
< naywhayare>
there is some of a cosine tree implementation in src/mlpack/core/tree/cosine_tree/ if you haven't seen it already
< naywhayare>
from Mudit last year
< oldbeardo>
wow, I didn't know that, I wasted the whole day trying to write one :(
< naywhayare>
that implementation needs a lot of work though
< naywhayare>
so I'm not certain of how useful it is
< naywhayare>
hang on... I am actually in transit right now
< naywhayare>
back in a moment or two
< oldbeardo>
sure
andrewmw94 has joined #mlpack
oldbeardo_ has joined #mlpack
< oldbeardo_>
naywhayare: sorry about that, network issue
< oldbeardo_>
and for some reason my earlier session has not expired
oldbeardo has quit [Ping timeout: 240 seconds]
oldbeardo_ has quit [Quit: Page closed]
oldbeardo has joined #mlpack
< naywhayare>
ok, back
< oldbeardo>
yeah, so what should I work on for the time being?
< naywhayare>
I figured you would be implementing quic-svd
< naywhayare>
I took a look at the API from mlpack 0.4 for the QuicSVD class
< naywhayare>
it's pretty simple... a constructor and a ComputeSVD() function
< oldbeardo>
yes, that's what I thought
< naywhayare>
you can use that for your QuicSVD API but it'll need to be changed a little because we don't use the weird internal GenMatrix class anymore
< naywhayare>
from what I remember, Mudit's cosine tree expects row major data
< naywhayare>
but realistically it should be expecting column major data
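(For reference, a minimal sketch of what a QuicSVD class along these lines might look like, taking column-major Armadillo matrices instead of the old GenMatrix class; the names and parameters here are assumptions, not the actual mlpack 0.4 interface:)

    // Hypothetical sketch only -- a constructor plus a ComputeSVD() function.
    #include <mlpack/core.hpp>

    class QuicSVD
    {
     public:
      // 'epsilon' and 'delta' stand in for the error and failure-probability
      // parameters of the QUIC-SVD algorithm; the names are assumed.
      QuicSVD(const arma::mat& data,  // column-major: one point per column
              const double epsilon = 0.03,
              const double delta = 0.1);

      // Compute the approximate SVD: data ~ u * diagmat(sigma) * v.t().
      void ComputeSVD(arma::mat& u, arma::vec& sigma, arma::mat& v);

     private:
      const arma::mat& data;
      double epsilon;
      double delta;
    };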
< oldbeardo>
right, so should I write a new module or improve the one written by Mudit?
< naywhayare>
I don't think it's tested either, so you may have an easier time starting from scratch
< naywhayare>
that's your call -- I'm fine with either as long as we have tests that show it works :-)
< oldbeardo>
okay, in that case I will write a new one, based on the old code
< naywhayare>
ok, sounds good
< naywhayare>
feel free to check unfinished code into trunk/, but if it doesn't compile, be sure that CMake won't try to compile it, so that the build doesn't break
< oldbeardo>
so we remove the one written by Mudit?
< naywhayare>
if you're going to rewrite it, then yes, we should remove his version at some point
< oldbeardo>
okay, will start working today
< naywhayare>
if you need any help, that is what I'm here for :)
koderok has joined #mlpack
< oldbeardo>
sure, one more thing: should I write the CosineTree module with QUIC-SVD in mind, or should I make it generic?
< naywhayare>
if you can make the API at least resemble BinarySpaceTree or CoverTree, that would be great
< naywhayare>
I'd like to try and use the cosine tree for some dual-tree algorithms
< naywhayare>
but don't worry about implementing things like MinDistance or MaxDistance that aren't necessary for QUIC-SVD
< naywhayare>
if I end up using it for a dual-tree algorithm I'll implement those myself, unless for some reason you just want to implement them
< oldbeardo>
okay, I will avoid those for the time being, will do it if I have some time remaining
< naywhayare>
sounds good
< oldbeardo>
thanks for the clarifications, I'll see you later
< udit_s>
naywhayare, I see you're up. I had a few questions
< udit_s>
so while declaring, say, an Armadillo Col, where does "size_t" come in?
< udit_s>
in arma::Col<size_t>& assignments - how does one use size_t?
< naywhayare>
udit_s: size_t is the type used in mlpack to represent an index of some sort
< udit_s>
also, for the decision stump, I believe I am working with discrete values...
< naywhayare>
I'm not totally sure I understand your question though
< naywhayare>
the object arma::Col<size_t>& assignments will be a vector holding size_t's
< naywhayare>
but I don't think that's the question you were asking
< udit_s>
looking at the Armadillo API, it says Col<type> is the format, so I'm wondering where size_t comes from... I'm pretty sure I had a handle on this, and then through the day it got muddled
< udit_s>
so say colvec := Col<double>
< udit_s>
so how does size_t specify the type ?
< udit_s>
say, double ?
< naywhayare>
oh, I think I understand
< naywhayare>
size_t _is_ the type being held, in this case
oldbeardo has quit [Quit: Page closed]
< naywhayare>
you can think of size_t as an unsigned integer value of some sort
< naywhayare>
(the number of bits in size_t is determined by the compiler and architecture, but for a couple of reasons 'size_t' is a better choice than 'unsigned int')
< naywhayare>
does that help? or have I misunderstood again? :)
< udit_s>
it helps; so get this: const arma::rowvec& feature_row
< udit_s>
for one dimension (in a decision stump) specifies an Armadillo matrix of one row, of type double
< naywhayare>
right
< udit_s>
okay, that helps.
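(A tiny illustration of the types being discussed; the variable names are just for the example:)

    #include <mlpack/core.hpp>
    #include <iostream>

    int main()
    {
      // A column vector of size_t's: each element is an unsigned index or label.
      arma::Col<size_t> assignments(5, arma::fill::zeros);
      assignments(3) = 2;  // e.g. point 3 is assigned to class 2

      // A row vector of doubles: one dimension (feature) across all points.
      arma::rowvec featureRow = { 24.7, 25.8, 24.6, 25.1 };

      // arma::colvec is shorthand for arma::Col<double>.
      arma::colvec values = { 1.5, 2.5, 3.5 };

      std::cout << assignments.t() << featureRow << values.t();
    }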
< udit_s>
also, I'm assuming discrete values while splitting
< naywhayare>
why? I thought a decision stump basically chose a (somewhat arbitrary) hyperplane to split the points on
< naywhayare>
but I don't see why that should be a discrete-valued hyperplane and not a floating-point-valued hyperplane
< naywhayare>
or is there some reason I've overlooked?
< udit_s>
No, I think I've overlooked something - help me here - I am splitting based on entropy for the decision stump, choosing the best attribute to split on by entropy.
< naywhayare>
ok; and you are only splitting on one attribute, right?
< udit_s>
yeah.
< naywhayare>
ok
< naywhayare>
I thought what you meant is that you would split on one attribute, but you would only split on a discrete value (1, 2, 3, ...) instead of a floating point value
< udit_s>
but multiple splits ?
< naywhayare>
well, it's a decision stump, so shouldn't it only have one split?
< naywhayare>
I suppose it could have multiple splits in that one attribute
< udit_s>
no, one level - splitting based on the attribute.
< naywhayare>
ok, that sounds reasonable to me
< naywhayare>
I think I misunderstood what you meant originally
< naywhayare>
I would imagine that all the DecisionStump class needs to hold is the index of the attribute/dimension it is splitting on, and the value(s) that define the split in that attribute/dimension
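(A rough sketch of the state such a DecisionStump class might hold, based on the description above; the names and signatures are illustrative, not a final API:)

    #include <mlpack/core.hpp>

    // Sketch only: the stump remembers which dimension it splits on, where the
    // split boundaries in that dimension are, and which label each bin gets.
    class DecisionStump
    {
     public:
      template<typename MatType>
      DecisionStump(const MatType& data,
                    const arma::Col<size_t>& labels,
                    const size_t classes);

      // Classify a point by finding which bin its value in splitDimension
      // falls into.
      size_t Classify(const arma::vec& point) const;

     private:
      size_t splitDimension;        // index of the attribute split on
      arma::vec splitValues;        // boundaries of the bins in that dimension
      arma::Col<size_t> binLabels;  // class label assigned to each bin
    };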
< udit_s>
yeah. Also, I think you didn't get the original question - should I assume that each attribute can be continuous (e.g. age, salary), or only discrete values like iris color = blue, red, green?
< naywhayare>
I would assume continuous attributes
< naywhayare>
most mlpack code assumes continuous... I think there are very few places where discrete-valued observations are supported
< udit_s>
That means I would need to take care of discretization - or is there an implementation of something like binning by mlpack?
< naywhayare>
there is currently no implementation for that. if you have ordered discrete values, it's easy: just cast the discrete value to continuous and it should work fine (mostly)
< naywhayare>
but for unordered discrete values it's a little harder, and I don't think any mlpack code has really considered that case in the past
< naywhayare>
for problems like k-nearest-neighbor and range search it doesn't make too much sense to use unordered discrete values, so that's part of why there's no code to deal with that case
< andrewmw94>
kind of a silly question, but would you prefer me to name things like: RTree.hpp or rectangle_tree.hpp or r_tree.hpp? I assume it's one of the latter two, but I don't think it's common to call them rectangle-trees rather than R-trees?
< naywhayare>
andrewmw94: I'd go with r_tree.hpp
< udit_s>
so basically, if I have something like: 24.7,0 | 25.8,0 | 24.6,1 | 25.1,1 (first attribute, class), then I have four splits, one on each value of the first attribute, right?
< naywhayare>
for whatever reason, the file naming convention seems to be converting the class name to lowercase then inserting underscores
< naywhayare>
probably this was originally because it makes it easier to type and tab-complete them in a terminal
< naywhayare>
udit_s: I'm not sure what you mean. "24.7, 0" seems like two attributes to me
< naywhayare>
oh, sorry, the second thing is the class label
< udit_s>
yeah :)
< naywhayare>
yeah, that looks like four splits to me, but you can't split on the label because you generally don't know that when you're not training the model
< naywhayare>
unless you mean that the label as given in your list is the label you will assign to the point when it falls into that bin
< andrewmw94>
naywhayare: yeah, but should it be the same for the .hpp files that will be shared? For example, the traverser in the R tree and the X tree will probably be the same. Should I just name that r_tree_traverser.hpp, or should it get something that sounds more general?
< naywhayare>
hmm, that's not a situation we've come across before. do whatever you think is best :)
< udit_s>
ok. here the first attribute is something like a score, and the class is a binary assignment.
< udit_s>
and generally, splitting based on a continuous attribute would generate poor entropy; so a decision stump would usually not split on a continuous attribute
< naywhayare>
I don't think splitting on a continuous attribute will necessarily imply generating poor entropy
< naywhayare>
you could have poor entropy cases with discrete variables too; it all depends on the distribution
< naywhayare>
if class 0 is a gaussian with variance 1 and mean -5, and class 1 is a gaussian with variance 1 and mean +5, you will get a very good split if you split at 0
< udit_s>
hmm. so how should I handle this ?
< udit_s>
also, this might seem like a silly question, but in: const MatType& data, const arma::Col<size_t>& labels, const size_t classes - labels are the column labels, and the last *row* of data is the class it is assigned to?
< naywhayare>
ah, no, labels should be the class that each point is assigned to
< naywhayare>
and the data matrix should only have the attributes of the data, and not the label
< naywhayare>
I think that it should be just fine to split on a continuous attribute, and I think that's what your implementation should do
< naywhayare>
and if you do that, then if a user passes in an arma::Mat<int> (or other discrete type), the decision stump should split on a discrete value
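(A rough sketch of how the entropy of one candidate threshold on a continuous attribute could be computed; this is just an illustration of the idea, not the actual decision stump code:)

    #include <mlpack/core.hpp>
    #include <cmath>

    // Weighted entropy of splitting featureRow at 'threshold' into two bins,
    // given the class label of each point; smaller is better.
    double SplitEntropy(const arma::rowvec& featureRow,
                        const arma::Col<size_t>& labels,
                        const size_t numClasses,
                        const double threshold)
    {
      arma::vec leftCounts(numClasses, arma::fill::zeros);
      arma::vec rightCounts(numClasses, arma::fill::zeros);
      for (size_t i = 0; i < featureRow.n_elem; ++i)
      {
        if (featureRow[i] < threshold)
          leftCounts[labels[i]]++;
        else
          rightCounts[labels[i]]++;
      }

      // Shannon entropy of one bin's class counts.
      auto entropy = [](const arma::vec& counts)
      {
        const double total = arma::accu(counts);
        double h = 0.0;
        for (const double c : counts)
          if (c > 0)
            h -= (c / total) * std::log2(c / total);
        return h;
      };

      const double n = featureRow.n_elem;
      return (arma::accu(leftCounts) / n) * entropy(leftCounts) +
             (arma::accu(rightCounts) / n) * entropy(rightCounts);
    }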
koderok has quit [Ping timeout: 240 seconds]
< udit_s>
okay, thanks.
< udit_s>
hold on. I just realised data is stored as column-major, but we are storing the labels as a single column? It doesn't really matter when the indexes are used, but is this interpretation right?
Anand has joined #mlpack
< naywhayare>
udit_s: that's correct. I suppose we could have used arma::Row<> but as you pointed out, it doesn't make a very big difference
< naywhayare>
I think it would probably be a good idea to open a ticket to transition uses of arma::Col<size_t> for holding labels to arma::Row<size_t>
< naywhayare>
I'll go ahead and do that...
< udit_s>
hmm...I'm going through each of the attributes in "data", so I'll be doing that in the arma::Row<size_t> format.
< naywhayare>
ok, yeah; use arma::Row<size_t> for labels
< naywhayare>
someday maybe I will get around to resolving it :)
< andrewmw94>
I assume that I should try to correct comments in other people's code as I go through it. E.g., the binary_space_tree.hpp code says it takes one parameter and then lists the three parameters to give it. Or should I try to keep the commits smaller?
< naywhayare>
andrewmw94: yes, please correct any comments or anything you find wrong
< naywhayare>
keeping documentation correct and up to date is actually quite tedious...
< Anand>
Hi Marcus!
< Anand>
As you might have already noticed, I am working on a different branch
< Anand>
Will merge it later
< Anand>
Btw, what significant information does reports/benchmark.db provide after running the benchmarks once?
< Anand>
I couldn't figure out much from it (opened it in vim)
< marcus_zoq>
Anand: Hello! The benchmark.db is a sqlite database. You can open the file with sqlite3 benchmark.db.
< naywhayare>
Anand: benchmark.db is sqlite3, I think, so you'll need to open it with sqlite3
< naywhayare>
hah... just a little too slow :)
< Anand>
Right. The README.md mentions that this stores the output from the benchmarks
< Anand>
what exactly does it store?
< marcus_zoq>
All results from the timing benchmark.
< marcus_zoq>
We use the database to generate the html page.
< naywhayare>
.schema is a nice sqlite3 command that lets you see the format of the tables in the database
< Anand>
Ok
< Anand>
Thanks!
< marcus_zoq>
I think you have to load the database within sqlite: attach "benchmark.db" as db1;
< Anand>
sqlite3 database.db loads the required database
< Anand>
It can then be explored through standard select queries
< Anand>
and .schema gives the table structures
< Anand>
Btw, the link is insightful! :)
< marcus_zoq>
Anand: Btw, jenkins keeps track of the master branch, so we will see if new code breaks anything.
< Anand>
What does that mean? Will I have to work on the master?
< marcus_zoq>
No, but if you merge your code we'll see if everything works as expected.
< Anand>
Essentially, I will keep merging at certain points in time. You can check things then
< marcus_zoq>
Sounds good, heading home, back in a few minutes ...
< Anand>
If some issue arises, let me know. Some dependency issues might come up
< Anand>
We will fix them, anyways!
Anand has quit [Ping timeout: 240 seconds]
udit_s has quit [Quit: Ex-Chat]
andrewmw94 has quit [Ping timeout: 258 seconds]
andrewmw94 has joined #mlpack
< naywhayare>
andrewmw94: I saw the comment in binary_space_tree.hpp you were talking about
< andrewmw94>
was I wrong?
< naywhayare>
it seems like it was referring to the constructor parameters, not the template parameters
< naywhayare>
so I'm committing it back in but with more clear wording to note that it's a constructor parameter
< andrewmw94>
ahh
< naywhayare>
still, thank you for pointing it out -- consistent documentation is very important
< andrewmw94>
most of the stuff I wrote is probably not how it should be
< naywhayare>
it's only day one :)
< andrewmw94>
I'm trying to get the things that should be similar between the BSP tree and the R tree
< naywhayare>
we'll get it in better shape as time goes on. don't worry if it isn't done yet, but please don't commit changes that break the build (or jenkins will send angry emails)
< andrewmw94>
and write them down, but I probably missed a lot and have a lot that shouldn't actually be there
< andrewmw94>
yeah, it's not in the CMake lists
< andrewmw94>
otherwise it probably would
< andrewmw94>
break the build I mean
< naywhayare>
yeah; it's fine if you commit code that doesn't compile, as long as CMake isn't compiling it by default and you intend to fix it at some point
< andrewmw94>
yeah, I thought I remembered you saying that, and I constantly worry that my computer will crash again
< andrewmw94>
so commit early and often
< naywhayare>
yeah; git is a better VCS tool for that philosophy because it makes branching and merging so easy
< naywhayare>
but svn branching is... sometimes quite painful...
< andrewmw94>
yeah, I had "fun" with that last year with my research project
< andrewmw94>
and that was just writing the paper, not code
< naywhayare>
when we did the original overhaul of mlpack code, we tried to do it in branches
< naywhayare>
and I worked really hard to preserve the revision history when I put them all back together
< naywhayare>
I had to give up. I simply couldn't make it work without either making svn hang or crash, or having a nightmare of manually merging thousands of lines of code
< naywhayare>
svn's better than CVS, but maybe not by much...
< andrewmw94>
what should be my policy on optimizations?
< andrewmw94>
for example, there's a swapping of two values in the mean_split_impl.hpp file that uses a temporary
< andrewmw94>
whereas using the xor method is probably faster
< andrewmw94>
however, I suspect that that is so common that compilers do it automatically
< naywhayare>
andrewmw94: sorry, I stepped out; let me take a look
< naywhayare>
I'd agree that the xor method is more likely to be faster, but I'd suspect that gcc does it automatically
< andrewmw94>
yeah, it seems like they must
< naywhayare>
you could write a quick test that swaps something a million times using xor and a million times using temporaries and see if the runtime is any different
< naywhayare>
you can do rough timing with either 'time' or using a builtin timer function or even mlpack::Timer
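(A quick-and-dirty comparison along those lines; results will vary with compiler and optimization flags, and gcc may well generate identical code for both:)

    #include <chrono>
    #include <cstddef>
    #include <iostream>

    int main()
    {
      const size_t iterations = 100000000;
      // volatile keeps the compiler from optimizing the loops away entirely.
      volatile size_t a = 123, b = 456;

      auto start = std::chrono::steady_clock::now();
      for (size_t i = 0; i < iterations; ++i)
      {
        size_t tmp = a; a = b; b = tmp;  // swap with a temporary
      }
      auto middle = std::chrono::steady_clock::now();
      for (size_t i = 0; i < iterations; ++i)
      {
        a = a ^ b; b = b ^ a; a = a ^ b;  // xor swap
      }
      auto end = std::chrono::steady_clock::now();

      std::cout << "temporary: "
                << std::chrono::duration<double>(middle - start).count() << "s\n"
                << "xor:       "
                << std::chrono::duration<double>(end - middle).count() << "s\n";
    }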
< sumedhghaisas>
naywhayare: yeah, for starting... will it be helpful to conduct this discussion when both siddharth and I are present?
< naywhayare>
sumedhghaisas: nah, he doesn't need to be here; I can discuss it with him later if necessary
< naywhayare>
I am sure we can find a way to modify the CF abstractions so both your changes and his can fit just fine :)
< naywhayare>
andrewmw94: I just realized I didn't actually answer your question. I'd prefer the easier-to-read swap with a temporary, but if you check and find that XOR is significantly faster then maybe we should use that
< sumedhghaisas>
okay okay... the first week I will be implementing the LMF module... the abstraction will be the same as NMF...
< andrewmw94>
they're basically the same
< andrewmw94>
also, I'm not sure if the size_t fits in a register
< andrewmw94>
which would change the effectiveness of the xor method
< naywhayare>
size_t is architecture dependent and should be the same size as ptrdiff_t; so on a 32-bit system size_t should be 32 bits, on a 64-bit system size_t should be 64 bits
< naywhayare>
and on the s390 architecture for a fascinating legacy reason size_t is 31 bits
< andrewmw94>
interesting
< naywhayare>
but nobody really uses s390 anymore...
< naywhayare>
basically, they were using the 32nd bit to indicate whether or not virtual memory was being used... something like that
< andrewmw94>
hmm
< naywhayare>
anyway, a sane compiler should define size_t in such a way that it fits in a register
< sumedhghaisas>
the template parameters will be initialization rule and update rule...
< naywhayare>
sumedhghaisas: yeah; so, that seems reasonable to me. if they have defaults like NMF, then I should be able to write 'CF<LMF<> >' and that should work
< sumedhghaisas>
yes... this module will accept current initialization rules...
< sumedhghaisas>
so for testing I need to have an update rule...
< sumedhghaisas>
hummm... let me see if some current update rules of NMF can be used or not...
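(For reference, a rough sketch of the kind of class template being described, modelled loosely on the existing NMF class; the defaults and names are assumptions, not a finished design:)

    #include <mlpack/core.hpp>

    // Sketch only: with default arguments for both template parameters (as NMF
    // has), writing 'CF<LMF<> >' would then work as discussed above.
    template<typename InitializationRule,
             typename UpdateRule>
    class LMF
    {
     public:
      LMF(const size_t maxIterations = 10000,
          const double minResidue = 1e-10,
          const InitializationRule initializeRule = InitializationRule(),
          const UpdateRule update = UpdateRule());

      // Factorize V (n x m) into W (n x rank) and H (rank x m); returns the
      // final residue.
      double Apply(const arma::mat& V,
                   const size_t rank,
                   arma::mat& W,
                   arma::mat& H) const;
    };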
sumedh_ has joined #mlpack
sumedhghaisas has quit [Ping timeout: 240 seconds]
sumedh_ has quit [Client Quit]
sumedhghaisas has joined #mlpack
sumedhghaisas has quit [Ping timeout: 240 seconds]
sumedhghaisas has joined #mlpack
< sumedhghaisas>
naywhayare: can you please resend any messages you sent...
sumedh_ has joined #mlpack
< naywhayare>
sumedh_: sumedhghaisas: I haven't sent any messages
< sumedh_>
okay sorry... So umm... what if in NMF a local minimum is reached before reaching minResidue...
< sumedh_>
in this case, if maxIterations is reached, the module will return the residue of the last iteration...
< sumedh_>
but we may have a solution with lesser residue...
< sumedh_>
The workaround could be to store the parameters corresponding to the least residue in each iteration and return those parameters when maxIterations is reached...
< naywhayare>
sumedh_: I think that the NMF optimizer is set up in such a way that it will not take steps that increase the residue
< sumedh_>
okay... let me check...
< sumedh_>
naywhayare: The residue is updated without any checks... and I don't think the update rules will consider the residue, as they update with derivatives...
< naywhayare>
sumedh_: I'm not convinced that this is a serious issue. if you can show me instances where the returned residue is significantly worse than the best residue during optimization, then we can investigate it further
< naywhayare>
but otherwise, I'd prefer to avoid the extra code and memory overhead of holding onto the best iterate
< sumedh_>
umm.. I don't have any instance of this sort right now... It is just a possibility that this could happen, as there can be a local minimum above minResidue...
< sumedh_>
okay right now I will just go ahead without this change...
< naywhayare>
ok; I don't think it's something that will happen very often for most datasets, and I think that if it does happen the difference in residue will be minimal enough that it doesn't matter
< sumedh_>
yeah I agree with that... But the workaround is not that costly either... there will be one extra if case and 2 variables to hold the best value...
< naywhayare>
I disagree; the W and H matrices could be large (though not as large as the data matrix), and every time the optimizer gets a better solution, those two matrices will need to be copied
< sumedh_>
yeah right... that copy will be costly...
< sumedh_>
can we add one more condition to the main loop... if residue of last iteration > residue of current ... then break...
< sumedh_>
but I don't think this will work... because the optimizer may find a lower value of residue later...
< naywhayare>
that condition would terminate on the second iteration...
< sumedh_>
naywhayare: okay I have prepared the LMF module ... basically I only needed to change NMF to LMF ... But now we need a naming convention for update rules as there will be both NMF and SVD update rules...
< sumedh_>
currently there are 2 update rules for NMF: als_update_rules.hpp and mult_dist_update_rules.hpp...
< sumedh_>
would it be better to create a folder of update rules... and then inside we can store update rules like NMF_ALS.hpp and NMF_multi_dist.hpp
< sumedh_>
svd update rules will just be prefixed with SVD_
< naywhayare>
yes, a folder of update rules is probably best
< sumedh_>
okay I will make suitable changes...
< sumedh_>
and rather than creating separate classes for WUpdate and HUpdate, we can create 2 separate functions inside the same class...
andrewmw94 has quit [Ping timeout: 258 seconds]
< naywhayare>
I would rather create two separate classes to keep the API as simple as possible
andrewmw94 has joined #mlpack
< sumedh_>
There would be too many classes when we don't need them... WUpdate and HUpdate are the only functions that are needed... they can be in the same class... they're both static...
< sumedh_>
Both the WUpdate and HUpdate rules are associated with the same update rule... so it makes sense hierarchically...
< naywhayare>
oh
< naywhayare>
sorry
< naywhayare>
I misunderstood
< naywhayare>
combining HUpdate and WUpdate is an idea I agree with
< sumedh_>
okay okay...
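(A minimal sketch of a combined update rule class along those lines; the multiplicative steps below are only a placeholder borrowed from the standard NMF rules, not the real SVD update rules:)

    #include <mlpack/core.hpp>

    // Sketch: both update steps live in one class, as static functions, so a
    // single update rule class can be passed as one template argument to LMF.
    class ExampleUpdateRule
    {
     public:
      // One update step for W, given the data V and the current H.
      static void WUpdate(const arma::mat& V, arma::mat& W, const arma::mat& H)
      {
        // Element-wise multiplicative update (placeholder).
        W %= (V * H.t()) / (W * H * H.t() + 1e-15);
      }

      // One update step for H, given the data V and the current W.
      static void HUpdate(const arma::mat& V, const arma::mat& W, arma::mat& H)
      {
        H %= (W.t() * V) / (W.t() * W * H + 1e-15);
      }
    };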
andrewmw94 has quit [Ping timeout: 240 seconds]
< sumedh_>
naywhayare: okay, I have made all the changes... combined WUpdate and HUpdate... created the LMF abstraction and removed NMF... created folders for update rules and init rules...
< sumedh_>
also modified cf to work with the LMF abstraction...
< sumedh_>
okay... how do I use svn to commit these changes?