verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
Mathnerd314 has quit [Quit: No Ping reply in 180 seconds.]
Mathnerd314 has joined #mlpack
tham has joined #mlpack
< tham>
nilay zoq : I saw your discussion on view_as_windows
< tham>
There is something quite confusing
< tham>
for me
< tham>
I typed the following code
< tham>
A = np.arange(3*4*4).reshape(3,2,8)
< tham>
The result looks like 3 channels, 2 rows, 8 columns
< tham>
It does not look like the code tries to extract 16*16 pixel patches from the input images (I think it is 32*32 in the paper)
< tham>
Am I interpreting the code the wrong way?
< tham>
keonkim : Hi, do you still have a problem with issue #658?
< tham>
I misunderstood the code; the Python code extracts features at different locations (but it still works); is this the intention of the paper?
< rcurtin>
tham: I think maybe Keon is right, it does not make sense to me why he is able to produce a dimension with only one mapping when the dimension takes more than one value
< rcurtin>
so maybe there is a bug in what I wrote... not sure, need to look more
< rcurtin>
I thought I wrote a lot of test cases for DatasetInfo but maybe not enough :)
< rcurtin>
but it is late and I need to sleep, so I will look more tomorrow and comment on the issue (unless you and keon solve the issue before I get to it :))
< tham>
rcurtin : I think it works if you do not transpose the matrix
< tham>
I am not sure about the behavior after transpose
< tham>
I can look into the code and contribute later on
< tham>
what behaviour would you expect after transpose?
< rcurtin>
hm good point this may have to do with the transposition being incorrectly handled...
< rcurtin>
let me look into it tomorrow morning, I need to go to bed for now
< tham>
ok, I do not know how to fix it before I confirm the correct behavior after transpose
nilay has joined #mlpack
< nilay>
tham: the code calculates features at smp_loc locations for a p_size * p_size patch.
< tham>
nilay : I think you are correct
< tham>
I checked the OpenCV docs, it is the right way to access the pixels
< tham>
thanks
Mathnerd314 has quit [Ping timeout: 240 seconds]
< keonkim>
tham: thanks for the response! I will further investigate :)
< keonkim>
sorry for the delayed response by the way :/
< tham>
keonkim : don't mind, I am not always online :)
< tham>
debugging
< tham>
load_impl.hpp
< tham>
rcurtin keonkim : I think I fixed the problem, but only for the non-transposed matrix
< tham>
Dealing with a transposed matrix is more difficult because it is unnatural to read the data by column
mentekid has joined #mlpack
< tham>
I think the easiest solutions for dealing with a transposed file are
< tham>
1 : read the whole file into a 2d matrix
< tham>
2 : write the whole file into another temporary file with transpose format
< tham>
3 : Remove transpose option, provide a file transpose api for the users, this way they could transpose the file, save it, and load the data
< tham>
This would not work if you read the matrix with transpose option as one
< tham>
if you turn on transpose option
< tham>
This function is quite long, I suggest we could put some implementation details in the details namespace
< tham>
I have not opened the pull request yet
< tham>
because I am not sure how you want to deal with the transpose option
< tham>
keonkim : you can copy and paste the code to implement your algo first if you like
< keonkim>
tham: thanks!
< tham>
keonkim : you are welcome, please tell me if there are bugs
tham has quit [Quit: Page closed]
nilay has quit [Quit: Page closed]
boby has joined #mlpack
boby has quit [Quit: Page closed]
< mentekid>
rcurtin: I am not convinced that my code is actually correct... I just noticed a weird behavior.
< mentekid>
If you print the resulting additionalProbingBins just before exiting GetAdditionalProbingBins, you get the same bin over and over again
< mentekid>
the first 4 or 5 are different but when I request 10 I get 4-5 different ones followed by the same over and over again
< mentekid>
I am probably creating the perturbation vectors wrong
< rcurtin>
mentekid: yeah, I did not really look into the correctness of the code at all yet since there are no tests
< rcurtin>
I think that is your goal for this week if I remember right
< mentekid>
I have implemented a test but apparently it isn't sufficient
< mentekid>
I thought I had pushed it
< mentekid>
anyway yeah I had hoped the code would be correct so I could start thinking about optimizations and parallelization, but it's debugging time
< rcurtin>
I thought that the test was for the recall calculation, not the multiprobe
< rcurtin>
maybe I misread the test
< mentekid>
no I ran multiprobe with increasing numProbes (for the same LSH object, so the tables didn't change)
< mentekid>
and expected the recall to improve or stay the same
< mentekid>
but the test passed without the code being correct because that's not a serious requirement... So it didn't really catch the bug I just discovered
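For reference, recall for a single query is the fraction of the true nearest neighbors that the approximate search also returned. The sketch below is only an illustration of that definition; `RecallSketch` is a hypothetical helper and is not taken from the mlpack code discussed here. It also shows why "recall does not decrease as numProbes grows" is a weak check: a buggy probing sequence that keeps revisiting the same bin leaves recall flat, which still satisfies the condition.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Fraction of the true neighbors that also appear in the found set.
// Both arguments hold point indices for one query. Hypothetical helper,
// not an mlpack function.
inline double RecallSketch(const std::vector<std::size_t>& found,
                           const std::vector<std::size_t>& truth)
{
  if (truth.empty())
    return 1.0; // Nothing to find, so nothing was missed.

  const std::unordered_set<std::size_t> foundSet(found.begin(), found.end());
  std::size_t hits = 0;
  for (const std::size_t index : truth)
    if (foundSet.count(index) > 0)
      ++hits;

  return static_cast<double>(hits) / static_cast<double>(truth.size());
}
```

A stricter test would fix the projection tables by hand and compare the probed bins against values worked out on paper, which is what the accessor idea later in the log is aiming at.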
< rcurtin>
oh, okay
< rcurtin>
I read it on a phone in the back of a car so it is not surprising that I misunderstood :)
< mentekid>
I think before going onto parallelization I should implement get/set functions for the projection tables, that way we can set our own and make better tests
< mentekid>
I just corrected the bug I mentioned (at least, I think I did) and there were so many (little) mistakes I'm surprised it even ran at all
< rcurtin>
yeah, functions to access the projection table are fine... I thought you had already done that earlier?
< mentekid>
yeah I did that but in a pretty hacky way just for the needs of my thesis
< mentekid>
I should do it properly, so the table sizes etc are checked
< mentekid>
and the functions documented
< mentekid>
I think if I try to parallelize stuff without better tests I will break the universe :P
< rcurtin>
yeah, always better to do tests first :)
< rcurtin>
although testing a parallelized version is not too hard, you can often uncover a lot of bugs by just running with different numbers of threads and checking the output to ensure it is the same
< rcurtin>
but actually I guess in this case since LSH is randomized you would need to set the same seed... but I am not sure of the behavior of the C++ RNGs when using multiple threads
< rcurtin>
so maybe that idea will not work
< mentekid>
actually I think you are right
tham has joined #mlpack
< mentekid>
if I only parallelize the search part
< mentekid>
then I don't need to re-train, so randomness doesn't affect me
< mentekid>
but that would require having an argument in Search() that defines number of threads (instead of automatically defining them)
< mentekid>
so I could run Search twice, once with numThreads = 1 and once with numThreads = 4 (for example) and check that results are the same
< mentekid>
that should work
< rcurtin>
I'd use openmp and then your user can just set the number of threads with OMP_NUM_THREADS as an environment variable
< mentekid>
oh yeah that's true
< mentekid>
I forgot about that
< mentekid>
but can you also set that from the boost testcases?
< rcurtin>
omp_set_num_threads(int);
< rcurtin>
but, openmp is an optional dependency for mlpack, so any test case should probably be surrounded by #ifdef _OPENMP or something like this
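A minimal sketch of the pattern described above, assuming a deterministic computation so that results can be compared across thread counts. `SumOfSquares` is a made-up stand-in for the search step, not mlpack code. Every OpenMP-specific line is guarded by `_OPENMP`, so the same file still builds as serial code when OpenMP is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#endif

// Sum of squares with a requested thread count. Integer arithmetic keeps
// the result independent of the order in which partial sums are combined.
inline long long SumOfSquares(const std::vector<int>& values, int numThreads)
{
#ifdef _OPENMP
  omp_set_num_threads(numThreads);
#else
  (void) numThreads; // Serial build: the argument has no effect.
#endif

  long long total = 0;
  const long long n = static_cast<long long>(values.size());
#ifdef _OPENMP
  #pragma omp parallel for reduction(+ : total)
#endif
  for (long long i = 0; i < n; ++i)
  {
    const long long v = values[static_cast<std::size_t>(i)];
    total += v * v;
  }

  return total;
}
```

A test would call the function once with one thread and once with several, then compare the two results. With floating-point reductions the comparison would need a tolerance, because the summation order can differ between runs.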
nihajsk has joined #mlpack
< mentekid>
Cool! I have completely forgotten openMP... I should refresh before starting the implementation
< mentekid>
yeah that makes sense
< mentekid>
did you see my comment about the low recall thing?
< mentekid>
have you also noticed something or is it my code? Though I did test the master branch as well and I think I got similar results
< mentekid>
actually no, wait... That's even weirder... Running with -K 1 returns much lower recall than -K 10
< mentekid>
-K is the number of hash functions per table so larger should mean fewer neighbors found... and L should increase it
Mathnerd314 has joined #mlpack
< rcurtin>
I am not too surprised about the low recall, I have found in my simulations with LSH that you have to tune it with very large bins to get good neighbors
< rcurtin>
but unfortunately I have to go for a little while so I can't help dig in right now... I will be back in some hours
< rcurtin>
I'll try and check in on my phone while I am out
< mentekid>
I'll try and figure it out too, we can talk later
< mentekid>
thanks for the help :)
< rcurtin>
sure, sorry I am flaky today :(
< tham>
rcurtin : what is your recommendation for issue #658?
< tham>
The matrix after transpose is not easy to deal with
< rcurtin>
tham: we have to support transposing, because users generally have their data in row-major form but we need it in column major form
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#857 (master - e36eec5 : Ryan Curtin): The build passed.
< rcurtin>
I was writing some tests based on Keon's example matrices, but I have to step out, can't commit them yet
< rcurtin>
I saw your commit, I think isdigit() is not applicable in this case because "1.000e+04" is a valid number that can be cast, but will fail the isdigit() test
< rcurtin>
I think the best way to test is to try to extract it into an eT, like "val << token" in the original code
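The difference can be sketched as follows. Both helpers are illustrative and are not the functions in load_impl.hpp; the names `AllDigits` and `ParsesAs` are assumptions made for this example. The first checks characters one by one, the second tries to extract the whole token into the target type with a string stream.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>

// True only if every character of the token is a decimal digit.
inline bool AllDigits(const std::string& token)
{
  if (token.empty())
    return false;
  for (const char c : token)
    if (!std::isdigit(static_cast<unsigned char>(c)))
      return false;
  return true;
}

// True if the whole token can be extracted into a value of type eT.
template<typename eT>
bool ParsesAs(const std::string& token)
{
  std::istringstream stream(token);
  eT value;
  if (!(stream >> value))
    return false; // Extraction failed: not a number of this type.

  // Reject tokens with trailing non-whitespace, such as "12abc".
  char leftover;
  return !(stream >> leftover);
}
```

With these definitions, `AllDigits("1.000e+04")` is false while `ParsesAs<double>("1.000e+04")` is true, and a token such as `"hello"` fails both checks and would be treated as a category.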
< rcurtin>
when I get back I will finish those test cases
< tham>
rcurtin : you are right, did not notice that
wasiq has joined #mlpack
< tham>
rcurtin : I can't find a memory-efficient way to parse the matrix after transpose, it has to record the data to find out which "row" is numerical vs categorical
< tham>
I will give the dumbest solution a try first: parse all of the lines and transpose them
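The "parse everything, then transpose" approach could look roughly like the sketch below. It assumes comma-separated lines with no quoting, and the names `SplitLines` and `TransposeSketch` are made up for illustration; this is not the code that was committed. The whole file is held in memory as strings, which is the memory cost mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

using TokenMatrix = std::vector<std::vector<std::string>>;

// Split every line on commas. Quoting and escaping are not handled.
inline TokenMatrix SplitLines(const std::vector<std::string>& lines)
{
  TokenMatrix rows;
  for (const std::string& line : lines)
  {
    std::vector<std::string> tokens;
    std::istringstream stream(line);
    std::string token;
    while (std::getline(stream, token, ','))
      tokens.push_back(token);
    rows.push_back(tokens);
  }
  return rows;
}

// Return the transpose. Every input row must have the same length.
inline TokenMatrix TransposeSketch(const TokenMatrix& rows)
{
  if (rows.empty())
    return TokenMatrix();

  const std::size_t cols = rows[0].size();
  TokenMatrix result(cols, std::vector<std::string>(rows.size()));
  for (std::size_t r = 0; r < rows.size(); ++r)
  {
    if (rows[r].size() != cols)
      throw std::invalid_argument("TransposeSketch(): ragged input");
    for (std::size_t c = 0; c < cols; ++c)
      result[c][r] = rows[r][c];
  }
  return result;
}
```

After the transpose, each inner vector holds all tokens of one dimension, so the numeric-versus-categorical decision can be made per dimension in a single pass.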
< zoq>
nilay: The mat file contains the segmentations and boundaries for one particular image, not the image itself.
< tham>
sorry, I misunderstood
< nilay>
ok, but it is very confusing, for one image of size (x,y,z) the boundary must be of size (x,y).
< zoq>
nilay: that's right, size size of the boundary and segmentation is (x,y)
< zoq>
nilay: *the size
< nilay>
but here, when he writes in prepare_data function, this loop: for j, (img, bnds, segs) in enumerate(input_data): bnds is a list of 6 or 7 matrices each of size (x,y) (which is the size of image, without the channels).
< zoq>
nilay: yeah, one mat file could contain more than one segmentation and boundary
< zoq>
nilay: But I think the python code just uses the first segmentation and boundary
< zoq>
nilay: as you can see the dataset contains 6 segmentation matrices.
< nilay>
zoq but in that loop img is one image only, of size (x,y,z)
< nilay>
i tried printing img.shape
< nilay>
zoq: so in the dataset you generated, you have also taken first boundary?
< nilay>
or how can we know
< zoq>
nilay: right, so you have your input image of size (x,y,z), and the mat file contains all segmentations and boundary for that particular image with size of (x, y) but the mat file often contains more than one boundary and segmentation for one particular image, because the authors who generated the dataset tested different parameters e.g. to get more fine-grained segmentations.
< zoq>
nilay: yes, right, the dataset I generated only contains the first segmentation and boundary
< mentekid>
rcurtin: I fixed the bugs, tested everything by hand/eye. It should be working properly now
< mentekid>
I also tested some other datasets, miniboone and phy, and saw no weird behavior regarding numProj. I'll test more to see if CorelColorHistogram is alone in this
sumedhghaisas has quit [Ping timeout: 260 seconds]
< tham>
rcurtin : I think I can write the tests if you don't mind, based on keonkim examples
< tham>
it should be done within a few hours
< rcurtin>
tham: sure, go ahead, I am almost done with lunch and will check in my code then
< rcurtin>
we can combine our tests
< tham>
rcurtin : ok, I am studying the code of the structured random forest, will finish the test later on
< rcurtin>
sure, sounds good
< rcurtin>
are you considering an RF implementation for mlpack?
< rcurtin>
I have implemented a random forest built on hoeffding trees but I have not finished testing yet
< nilay>
rcurtin: yes we need to implement RF for edge boxes algorithm
< zoq>
rcurtin: The structured random forest is one part of the edge boxes method nilay implements to get some ROIs. If you already have a random forest in place, based on the Hoeffding trees, that's even better, that would come in handy.
< zoq>
rcurtin: The plan was to use the hoeffding tree anyway.
< zoq>
rcurtin: Or the modified decision stump by Cloud, I think either one of these two should work.
nihajsk has quit [Ping timeout: 244 seconds]
tsathoggua has joined #mlpack
tsathoggua has quit [Client Quit]
nihajsk has joined #mlpack
< rcurtin>
zoq: maybe the decision stump is better, the Hoeffding random forest is not working anywhere near as well as Breiman's typical random forest (which is what scikit implements)
nihajsk has quit [Ping timeout: 260 seconds]
< rcurtin>
my observations so far are that the Hoeffding random forest sometimes doesn't even outperform a single Hoeffding tree
< zoq>
rcurtin: hm, okay, either way, it would be great to see the code, maybe there is an easy way, to make it work with the decision stump?
< rcurtin>
I need more time to look into this, but right now I don't have any time for it, so I don't know when I'll be able to look into that further
< rcurtin>
you could definitely use the same ideas I've used there, but you'd need to refactor the decision stump
< rcurtin>
the most important change I had to make was to add a template parameter "SplitSelectionStrategyType", which basically encodes the splitting dimensions the tree is allowed to consider
< rcurtin>
so for Breiman's random forest with single randomly chosen dimension, you use SingleRandomDimensionSplit, which only gives the tree one dimension to split on
< rcurtin>
but if you are using a default Hoeffding tree, AllDimensionSplit is used, which lets the tree split on any dimension
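The idea of a strategy template parameter can be sketched as below. The class names `AllDimensionSplit` and `SingleRandomDimensionSplit` come from the discussion, but their interface here (a `Candidates()` method) and the `TreeNodeSketch` class are assumptions made for illustration; the unfinished Hoeffding forest code may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

namespace sketch {

// Offers every dimension as a split candidate.
struct AllDimensionSplit
{
  std::vector<std::size_t> Candidates(std::size_t numDims,
                                      std::mt19937&) const
  {
    std::vector<std::size_t> dims(numDims);
    std::iota(dims.begin(), dims.end(), std::size_t(0));
    return dims;
  }
};

// Offers exactly one uniformly chosen dimension.
struct SingleRandomDimensionSplit
{
  std::vector<std::size_t> Candidates(std::size_t numDims,
                                      std::mt19937& rng) const
  {
    assert(numDims > 0);
    std::uniform_int_distribution<std::size_t> pick(0, numDims - 1);
    return std::vector<std::size_t>(1, pick(rng));
  }
};

// The node asks its strategy which dimensions it may consider.
template<typename SplitSelectionStrategyType>
class TreeNodeSketch
{
 public:
  std::vector<std::size_t> SplitCandidates(std::size_t numDims,
                                           std::mt19937& rng) const
  {
    return strategy.Candidates(numDims, rng);
  }

 private:
  SplitSelectionStrategyType strategy;
};

} // namespace sketch
```

Because the strategy is a compile-time parameter, switching between the two behaviours is a matter of changing the template argument, with no virtual dispatch inside the split search.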
< zoq>
rcurtin: okay great, I'll take a look at the code in the next days, and probably come back with some questions.
< rcurtin>
yeah, it is not production-quality, so maybe there will be many questions to answer :)
< rcurtin>
I still have not fully documented it, because my goal was to get simulations working quickly to see if the idea worked, and at the last point I had time to work on it, it did not work very well
< rcurtin>
I didn't know Cloud wrote a modified decision stump, is his code available anywhere?
< rcurtin>
I thought his ideas were great and would be good changes, I just didn't know if he actually implemented those changes
< rcurtin>
mentekid: it's possible the weird behavior of CorelColorHistogram has to do with some odd properties of the dataset
< rcurtin>
that dataset I think is somewhat high-dimensional but has clusters in it. that doesn't explain why more hash functions would result in better recall, though, unless the hash width is also getting larger when the number of hash functions increases
< zoq>
rcurtin: The last time I checked there was some code: https://mailman.cc.gatech.edu/pipermail/mlpack/2016-April/000991.html ... looks like he deleted his fork
< rcurtin>
yeah, I remember seeing that code, I thought it was just a stub outline of the changes he wanted to make, with no actual implementation
< rcurtin>
I guess we could email him and ask if he wrote anything, but I don't see anything else on his github page
< zoq>
rcurtin: I think it worked as a proof of concept.
< rcurtin>
ok, I did not remember that bit... maybe he still has it on his desktop or something... hopefully...
< zoq>
rcurtin: let's see , I'll go and write him an email
nihajsk has joined #mlpack
nilay has quit [Ping timeout: 250 seconds]
nihajsk has quit [Ping timeout: 260 seconds]
nihajsk has joined #mlpack
< rcurtin>
tham: you beat me to it, sorry I took so long... I was still writing some other test cases but I think you have already written them :)
< mentekid>
rcurtin: Should I make the changes regarding access to LSH hash tables in the same branch as Multiprobe?
< rcurtin>
mentekid: your call, I can easily merge those immediately
< mentekid>
Will there be no conflict if I have 2 different versions one with multiprobe and one with accessors?
< mentekid>
I am confused about how git handles that
< mentekid>
Or I guess if it's different parts of the file it's no problem
< tham>
rcurtin : the build failed, I need to find the reason
< mentekid>
since I'm not changing the same function in two different ways or anything
< rcurtin>
mentekid: git might be smart enough to merge it well, but if not, I can figure it out, I am an expert at git merges :)
< mentekid>
cool then, I'll start a branch so we can keep each one focused :)
< rcurtin>
or I guess, do you mean, that you would have the same commit in two different branches?
< rcurtin>
I'd need to merge that by hand, but that's easy to do, I just merge all the commits except the duplicate one
< rcurtin>
github's interface won't do that automatically I don't think, but that's no problem
< mentekid>
I was thinking about starting from master again
< mentekid>
which one is simpler for you?
< rcurtin>
tham: mark TransposeTokens() inline :)
< rcurtin>
mentekid: all the changes in one branch are easier, but I mean, for me they are both pretty simple, so it's no issue to start a new branch from master---go ahead and do that
< mentekid>
ok cool!
< mentekid>
rcurtin: Also, I found the reason behind the weird numProj behavior
< mentekid>
the second hash size is to blame... the default value is 500, so (2nd level) buckets with more points are just filled to capacity and ignore new ones
< tham>
rcurtin : thanks, I forgot the declaration and definition issue
< rcurtin>
tham: no problem. I am adapting one of the tests I wrote to make a more difficult one that you can add to the list of tests
< rcurtin>
I'll open a PR for it when it's done
< mentekid>
so setting it to 1 creates one huge bucket with (theoretically) 68000 points and then only keeps the first 500 of these
< rcurtin>
ahh, okay, this makes sense
< rcurtin>
but I guess, shouldn't there be more than one bucket if we set numProj = 1?
< rcurtin>
or I guess, is it the case that there are many buckets, but each end up with more than 500 points regardless?
< mentekid>
I'm not sure how many buckets are created to be honest, I think that's random
< mentekid>
but if it's less than N/500 then you can expect some of them to be overfilled
< mentekid>
and the spillover points are ignored
< mentekid>
at least, I think that's what is happening...
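The suspected effect can be reproduced with a toy model. The sketch below assumes each point maps to one bucket and that a bucket silently drops points once it holds its capacity; `PointsKept` is a hypothetical helper and does not mirror the actual LSHSearch data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count how many points survive when every bucket keeps at most
// bucketCapacity points. bucketOf[i] is the bucket index of point i.
inline std::size_t PointsKept(const std::vector<std::size_t>& bucketOf,
                              std::size_t numBuckets,
                              std::size_t bucketCapacity)
{
  std::vector<std::size_t> fill(numBuckets, 0);
  std::size_t kept = 0;
  for (const std::size_t b : bucketOf)
  {
    assert(b < numBuckets);
    if (fill[b] < bucketCapacity)
    {
      ++fill[b];
      ++kept;
    }
    // Otherwise the bucket is full and the point is dropped.
  }
  return kept;
}
```

With the numbers mentioned above, 68000 points landing in a single bucket of capacity 500 leave only 500 candidates. If the same points were spread over at least N/500 = 136 buckets and evenly balanced, none would be lost.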
< rcurtin>
do you think it's worth the time to check?
< rcurtin>
if it takes a while, maybe it is not worth it for just this dataset
< mentekid>
I would expect it to behave similarly for all big datasets
< mentekid>
I didn't run such extreme values for phy so maybe that's why I didn't see it
< rcurtin>
yeah, I agree with your explanation
< mentekid>
I'll run a few big ones at night so we can have a better idea. And I'll look at the code to see where that is happening
< rcurtin>
it might be worth thinking about what better defaults for secondHashSize are, but like I said maybe not worth investigating... we should limit the number of rabbit holes we crawl down :)
< mentekid>
well I enjoy this so much that I'll probably end up exploring it at some point
< mentekid>
I don't know why I'm so infatuated with this algorithm :P
< mentekid>
I think maybe replacing a "magic" 500 with some heuristic relative to reference size and numProj could work
tham has quit [Quit: Page closed]
< mentekid>
but maybe we should leave that for after the tuning algorithm is done - I'm not sure but they could have a model for the second hash size as well
< mentekid>
quick question - assert doesn't fail on non-debug builds does it?
< mentekid>
so if a user gives me wrong table sizes and I assert they are correct it won't stop them?
< rcurtin>
right
< rcurtin>
when compiling without debug symbols, assert (or Log::Assert) is not called
< mentekid>
ok then I guess it should throw an exception instead
< rcurtin>
Log::Assert is nice because it gives a backtrace
< rcurtin>
yeah, I'd go with a std::invalid_argument
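A size check that throws could be sketched like this. The table representation and the name `CheckTableSizes` are assumptions for illustration only, since the real accessor works on the LSH projection tables. Unlike `assert`, which the standard library turns into a no-op when `NDEBUG` is defined, the exception is raised in every build.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <vector>

// A table is modeled as a vector of rows (hypothetical representation).
using TableSketch = std::vector<std::vector<double>>;

// Throw std::invalid_argument if any table has the wrong number of rows.
inline void CheckTableSizes(const std::vector<TableSketch>& tables,
                            std::size_t expectedRows)
{
  for (std::size_t i = 0; i < tables.size(); ++i)
  {
    if (tables[i].size() != expectedRows)
    {
      std::ostringstream message;
      message << "CheckTableSizes(): table " << i << " has "
              << tables[i].size() << " rows, expected " << expectedRows;
      throw std::invalid_argument(message.str());
    }
  }
}
```

A caller can catch `std::invalid_argument` and report the message, and a unit test can assert that the exception is thrown for mismatched sizes.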
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#861 (master - 02e31b3 : Ryan Curtin): The build passed.
< mentekid>
does running make mlpack_lsh recompile the lib/ stuff as well? or do they remain the same?
< rcurtin>
depends on whether or not any .cpp file changed
< rcurtin>
if no .cpp file changed, then libmlpack.so is up to date and there is nothing to update there
< mentekid>
what if I changed .hpp files
< rcurtin>
(this is with the exception of, say, lsh_search_main.cpp, since that is not compiled into libmlpack.so)
< rcurtin>
if anything that includes those .hpp files is a .cpp file that is compiled into libmlpack.so, then that will need to be recompiled
< rcurtin>
so like if you change prereqs.hpp, I think everything will be recompiled
< rcurtin>
but if you change something in lsh_search.hpp I don't think anything will need to be recompiled in libmlpack.so, just in mlpack_test and the mlpack_lsh program
< mentekid>
I've written a custom cpp file that uses lsh_search.hpp. I compile it with g++ accesstables.cpp -L lib/libmlpack.so (and flags)
< mentekid>
accesstables.cpp is my file
< mentekid>
now I want to change something in lsh_search.hpp and re-compile libmlpack.so
< mentekid>
so that accesstables "sees" the update
< rcurtin>
so there is one other detail there that is important... whenever you make anything, all the header files in src/ need to be copied to <build-directory>/include/
< rcurtin>
you can do that by 'make mlpack_headers'
< rcurtin>
but also 'make mlpack' will do that, because the mlpack target depends on mlpack_headers
< mentekid>
aha
< rcurtin>
if you modify lsh_search.hpp and then type 'make mlpack', then it should show that it is copying all the header files with the mlpack_headers target, but then it will not actually compile anything for the mlpack target
< mentekid>
so in order to include my updated version I need to run make mlpack_headers
< rcurtin>
yes, that should do it
< mentekid>
I didn't know that, thanks :)
< rcurtin>
no problem, glad I could help
< rcurtin>
maybe it is worth collecting little things like this and writing them up somewhere
< rcurtin>
there are lots of little undocumented build system tricks like this in mlpack
< rcurtin>
but I dunno how to make all of that useful to someone new to the project... if you present them with a big list of "little tricks" they might not remember any of them because they had not encountered any of the issues where those tricks are relevant yet
< mentekid>
Could some "sample cases" tutorial or wiki help?
< mentekid>
like "Recompiling the library after changing header files"
< mentekid>
etc
< rcurtin>
yeah, maybe like some developer FAQ or something like that
< rcurtin>
either as a wiki page or in doc/
< mentekid>
by the way I still can't get it to work... I tried both make mlpack and make mlpack_headers
< mentekid>
it uses the old version of the .hpp file which doesn't throw exceptions
< rcurtin>
can you give your full g++ invocation?
< rcurtin>
do you have -Iinclude/ ?
< rcurtin>
if not it is probably searching in /usr/include/ or /usr/local/include/
< mentekid>
no I compile with g++ accesstables.cpp -L lib/libmlpack.so -lmlpack -larmadillo --std=c++11
< rcurtin>
okay, try adding -I include/, I think that may fix your issue
< mentekid>
ah yes
< mentekid>
perfect
< mentekid>
I thought it was the libmlpack.so file that was kept old
< rcurtin>
nope; actually, the way you were doing it could cause some really big weirdness to happen
< rcurtin>
if it had compiled
< rcurtin>
you would have ended up with a program that was built using headers in /usr/include/ which were maybe old, but implementations in lib/libmlpack.so which might work differently
< rcurtin>
like for instance if, e.g., ARMA_64BIT_WORD was for some reason set in the /usr/include/ version of mlpack but not in the one you built
< rcurtin>
then when you ran it, all calls to anything in libmlpack.so would be with the wrong size arma::uword, and then you would have a difficult-to-debug disaster :)
< mentekid>
ahhh so the whole problem was that I have an installed version of libmlpack
< rcurtin>
yeah, that's not necessarily a problem, but it is something to be aware of :)
< mentekid>
I'm not used to build systems so it only makes sense after you tell me
< rcurtin>
yeah, CMake can be quite complex
< mentekid>
Ok I finished the code, I'll start a PR so you can see what I've done. It's only a few lines of code really
< rcurtin>
sounds good
marcosirc has quit [Quit: WeeChat 1.4]
nilay has joined #mlpack
< rcurtin>
mentekid: added comments, did not expect to add so many. I think many of the comments have to do with Pari's original design, not your changes... let me know what you think
nilay has quit [Ping timeout: 250 seconds]
travis-ci has joined #mlpack
< travis-ci>
mlpack/mlpack#866 (master - e6d2ca7 : Ryan Curtin): The build passed.