verne.freenode.net changed the topic of #mlpack to: http://www.mlpack.org/ -- We don't respond instantly... but we will respond. Give it a few minutes. Or hours. -- Channel logs: http://www.mlpack.org/irc/
witness_ has quit [Quit: Connection closed for inactivity]
witness_ has joined #mlpack
lozhnikov has quit [Ping timeout: 268 seconds]
lozhnikov has joined #mlpack
ImQ009 has joined #mlpack
witness_ has quit [Quit: Connection closed for inactivity]
< Atharva> sumedhghaisas: I made a VAE model and tried to train it on MNIST. I had some doubts, when will you be free?
< ShikharJ> zoq: I have updated the PR again. rcurtin: I would appreciate it if you could please review it as well, as we are refactoring the design of the GAN module.
< jenkins-mlpack> Project docker mlpack nightly build build #367: SUCCESS in 3 hr 5 min: http://masterblaster.mlpack.org/job/docker%20mlpack%20nightly%20build/367/
< sumedhghaisas> Atharva: Hi Atharva
< sumedhghaisas> The last meeting I have today will end at 17:00 BST.
< sumedhghaisas> Will it be possible after that?
< Atharva> sumedhghaisas: Yeah sure Sumedh.
< sumedhghaisas> also, is the NormalDistribution error solved?
< Atharva> Yes :), I pushed the latest code.
< sumedhghaisas> Atharva: Great. Just out of curiosity, what was the problem with SoftPlus?
< Atharva> It was due to the fact that the approximate Jacobian was calculated w.r.t. the standard deviation, while logProbBackward was w.r.t. the pre-standard-deviation value. I tried perturbing the pre-standard-deviation value and the test passed.
< sumedhghaisas> ahh.... Yes.
< Atharva> Although, I had to add some functions.
< sumedhghaisas> Good catch :)
< Atharva> Your suggestion to try it without SoftPlus worked :)
< rcurtin> ShikharJ: I'll see if I can make time to take a look
< ShikharJ> rcurtin: Sure, no hurry :)
< Atharva> zoq: Are you there?
< Atharva> zoq: Sorry to bother, I got that figured out. :)
vivekp has joined #mlpack
vivekp has quit [Ping timeout: 276 seconds]
vivekp has joined #mlpack
wenhao has joined #mlpack
wenhao has quit [Ping timeout: 260 seconds]
< sumedhghaisas> Atharva: Hey Atharva
< sumedhghaisas> you there?
< Atharva> Yes
< Atharva> Hi Sumedh
< sumedhghaisas> great! :)
< sumedhghaisas> so... what's up?
< sumedhghaisas> how's it going?
< Atharva> I put everything together in a local branch and started building a model.
< Atharva> For some reason, the reconstruction loss isn't decreasing at all
< Atharva> It just keeps fluctuating around some value, which is negative.
< Atharva> So that leads to another doubt
< Atharva> We have used log probability as the error, should we be using negative log probability?
< sumedhghaisas> hmmm... okay for that I have to take a look at ReconstructionLoss :)
< sumedhghaisas> give me a minute
< sumedhghaisas> aha
< sumedhghaisas> Atharva: return dist->LogProbability(std::move(target));
< sumedhghaisas> yes... here it should be negative
< sumedhghaisas> loss = Negative Log Likelihood
< sumedhghaisas> Also is there a PR where I can see the code you are using to train?
< Atharva> sumedhghaisas: Okay, and also the corresponding changes in the `Backward()` function
< sumedhghaisas> yup
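For reference on the sign convention being discussed (a self-contained toy example, not the PR code): the loss the optimizer should minimize is the negative of the Gaussian log-likelihood, and the gradient produced in `Backward()` gets the same sign flip.

    #include <armadillo>
    #include <cmath>
    #include <iostream>

    int main()
    {
      // Toy reconstruction: target values and the decoder's predicted mean/stddev.
      arma::vec target = {0.5, -1.2, 0.3};
      arma::vec mean   = {0.4, -1.0, 0.2};
      arma::vec stddev = {1.0,  0.8, 1.1};

      // Log-likelihood of the target under an elementwise normal distribution.
      const double logProb = arma::accu(
          -0.5 * std::log(2.0 * arma::datum::pi)
          - arma::log(stddev)
          - 0.5 * arma::square((target - mean) / stddev));

      // The reconstruction loss to minimize is the *negative* log-likelihood.
      std::cout << "log-likelihood: " << logProb  << std::endl;
      std::cout << "loss (NLL):     " << -logProb << std::endl;
      return 0;
    }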
< Atharva> No, haven't pushed that yet
< sumedhghaisas> no worries
< sumedhghaisas> Also could you rebase the ReconstructionLoss PR on the NormalDistribution one?
< sumedhghaisas> maybe I have already mentioned this... if I have, ignore this one :)
< Atharva> I already did :)
< sumedhghaisas> ahh... great
< Atharva> Another thing: at first I wasn't using `Sequential<>` layers for the encoder and decoder. Now I am, but I'm facing some issues.
< rcurtin> ShikharJ: I had a question for you. I've been seeing comments that we are now able to train a GAN in ~7 hours or so, and the impression I get is that this is pretty fast
< rcurtin> on the whole MNIST dataset that is
< rcurtin> do you happen to know how long the same kind of training would take with other toolkits like TensorFlow or CNTK or whatever else?
< sumedhghaisas> Atharva: ahh sorry, I wanted you to rebase NormalDistribution on the latest master and ReconstructionLoss on the latest NormalDistribution, rather than merging master into ReconstructionLoss
< rcurtin> if you don't, it's ok, of course. I would love to be able to tell people "mlpack is really fast even when only using the CPU; you can train GANs very quickly", but I want to make sure I am not saying something wrong :)
< Atharva> Oh sorry, but I did rebase the ReconstructionLoss branch on the NormalDistribution PR branch. I haven't merged master into ReconstructionLoss
< sumedhghaisas> rcurtin, ShikharJ: 7 hours? huh? what kind of GAN is this?
< Atharva> Maybe I am doing something wrong with git here.
< ShikharJ> rcurtin: The O'Reilly blog mentions the network taking about a full day on a desktop CPU using TensorFlow. We can still expect about a 1.5x to 2.0x speedup if we take a desktop environment into consideration.
< rcurtin> sumedhghaisas: I have no idea, I don't know the details, I've only seen some posts and comments on IRC
< ShikharJ> sumedhghaisas: Just the basic GAN.
< ShikharJ> rcurtin: I should mention that with the Stochastic method, this still takes about 3 days for the model to converge.
< sumedhghaisas> ShikharJ: How many layers in generative and discriminative modules? Maybe we can estimate how much time tensorflow takes
< sumedhghaisas> Atharva: yes... git still amazes me sometimes :)
< Atharva> rcurtin: What additional flags do I need to compile a file with to see debugging symbols? I built mlpack with DEBUG and ARMA_EXTRA_DEBUG on.
< ShikharJ> rcurtin: The basic implementation and the number of layers are more or less the same as in the O'Reilly implementation.
< rcurtin> ShikharJ: ok, thanks for the clarification. I am not a GAN expert but that gives me enough to work with to understand the comparison
< Atharva> sumedhghaisas: I didn't merge master into the ReconstructionLoss PR. Can you tell me what led you to think that, so I can correct my mistake? What I did was rebase the Recon loss PR branch on the Normal Dist PR branch.
< ShikharJ> rcurtin: Part of the reason behind the speedup may also be that we're choosing big hyperparameters for the vanilla GAN implementation, so our network learns faster.
< rcurtin> Atharva: that should be everything you need
< sumedhghaisas> Atharva: I saw the commit history here
< Atharva> rcurtin: So when compiling a sample file, say vae.cpp, I don't need to use any extra flags other than -lmlpack and -larmadillo?
< rcurtin> ShikharJ: right, that makes sense. it could be interesting to do an exact comparison at some point, but I don't think it is very high priority
< rcurtin> TensorFlow uses Eigen internally, whereas Armadillo uses whatever BLAS replacement is available, so on the CPU, I would not be surprised if OpenBLAS can outperform Eigen, and this would be a big part of the speedup
< ShikharJ> rcurtin: Also, we have little to no copying in our routines, which use plain matrices. TensorFlow, as I remember, only takes 4-D tensors, which might be slow, especially if copying is involved.
< rcurtin> plus maybe our implementation has less overhead because it is less complex, but I don't know how much overhead will factor into it
< rcurtin> right, the copying can be super painful if that's happening
< rcurtin> Atharva: you'll need to make sure that you compiled mlpack itself with -DDEBUG=ON and -DARMA_EXTRA_DEBUG=ON
< rcurtin> and then when you run, e.g., g++ -o vae vae.cpp ...
< rcurtin> you'll want to do it as
< sumedhghaisas> rcurtin: Also, tensorflow has less lazy evaluation than Armadillo
< Atharva> sumedhghaisas: Yes, the first 7 commits on that page are from the Normal Dist PR which appear here because I rebased on it. Only the last two commits are from Recon Loss PR
< rcurtin> g++ -g -DDEBUG -DARMA_EXTRA_DEBUG -o vae vae.cpp -lmlpack -larmadillo
< Atharva> rcurtin: Thanks!
< sumedhghaisas> Atharva: Yes, but the last commit is the Merge from master into ReconstructionLoss
< rcurtin> sure, hope it helps, let me know if there are any issues :)
< Atharva> sumedhghaisas: Yes, it was because I had to resolve some merge conflicts.
< Atharva> sumedhghaisas: Sorry for the silly confusion.
< sumedhghaisas> rcurtin: A quick question, only if you have time: do you think that if we use Armadillo's internal lazy expression classes in our forward passes, we would get a speedup? Because we don't actually need to evaluate the expression until the last layer
< ShikharJ> sumedhghaisas: How can we estimate the time to be taken by Tensorflow from the layers?
< rcurtin> sumedhghaisas: I guess it could be possible, it would depend a lot on the network. if the network was just a bunch of chained linear layers I don't think we'd get any speedup
< rcurtin> since Armadillo would have to do all those multiplications sequentially anyway
< sumedhghaisas> Atharva: no worries; for a linear history, prefer rebase wherever possible.
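(For reference, the rebase workflow being suggested, with illustrative branch names: `git checkout ReconstructionLoss`, then `git rebase NormalDistribution`, then `git push --force-with-lease`; this replays the loss commits on top of the distribution branch instead of adding a merge commit.)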
< sumedhghaisas> Yeah, especially the non-linearities will give a huge speedup
< ShikharJ> rcurtin: I'm practically free from my planned goals, so if you want me to benchmark it, I can try.
< rcurtin> hmm, are you sure it would? if I write, e.g., arma::log(A * B), there's not much speedup there that Armadillo could give us
< rcurtin> I guess, internally Armadillo could avoid a temporary matrix C = A * B for that
< rcurtin> but I'm not sure if it does; I think Armadillo does a good job of avoiding a lot of temporaries, but there are a lot of optimizations that could be done
< sumedhghaisas> ShikharJ: I train a lot of VAEs for prototyping, and on MNIST too. So maybe I can estimate if you can give me the number of layers, and what kind of layers. :)
< rcurtin> ShikharJ: sure, up to you. if we can come up with some post showing that mlpack is faster than other toolkits on the CPU, this is an exciting result (although, admittedly, most people want to use the GPU these days)
< sumedhghaisas> rcurtin: arma::log(A * B) evaluated lazily would save a copy; over 1000 iterations, that's 1000 copies
< ShikharJ> rcurtin: Then there's people like me who are students and only have a decent GPU, who can't afford AWS or GCP :P
< rcurtin> right, definitely, but if the cost of the copy is O(N^2) and the multiplication is O(N^3), that limits the speedup we could have, so it may be more minor than we might hope
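(To put rough numbers on that: for N = 1000, the avoided copy touches about N^2 = 10^6 elements, while the multiplication itself does on the order of 2N^3 = 2x10^9 operations, so skipping the copy saves well under 1% of the work for that expression.)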
< Atharva> rcurtin: If we use bandicoot in place of Armadillo, how fast does it get? Does the speed increase proportionally to the speedup of, say, TensorFlow going from CPU to GPU?
< ShikharJ> sumedhghaisas: The blog was written in mid 2017, so I don't think a lot of difference in computing power has occurred in that time.
< rcurtin> Atharva: I would hope so. There is still a lot of work to be done on bandicoot, and right now it wraps clBLAS, not NVIDIA's cuBLAS
< rcurtin> so I think that it will not be as fast as cublas, but I am hoping to work with Conrad in the upcoming months on finishing the library and providing support for cublas
< rcurtin> that said, I am not sure of the speed difference between cublas and clblas
< sumedhghaisas> ShikharJ: So a 4-layer feedforward network? How much time do they say TensorFlow takes? I think it should take less than 7 hours...
< ShikharJ> rcurtin: Is there a benchmark utility in mlpack that I can use? I'm sure I have read somewhere about a benchmarking framework for mlpack.
< ShikharJ> sumedhghaisas: "If you want to run this code yourself, prepare to wait: it takes about 3 hours on a fast GPU, but could take ten times that long on a desktop CPU."
< sumedhghaisas> ShikharJ: ahh wait... I use CPU optimized tensorflow
< ShikharJ> Directly from the blog
< rcurtin> ShikharJ: https://github.com/mlpack/benchmarks is the project
< rcurtin> basically you write a Python script to run whatever you're planning to benchmark and extract certain metrics like runtime, accuracy, etc. from it
< sumedhghaisas> ShikharJ: no way... For some code, especially feedforward layers, I have noticed the GPU takes longer than the CPU
< rcurtin> it would take a little while to write a full script for the GANs, so talk with Marcus and decide if it's something you want to do, otherwise I think it would be interesting just to compare with a simple run or two
< sumedhghaisas> ShikharJ: That's due to GPU positioning overhead
< sumedhghaisas> Atharva: Sorry, you were asking me about some doubt with the Sequential layer?
< ShikharJ> sumedhghaisas: Hmm, interesting.
< sumedhghaisas> ShikharJ: My estimate, assuming you build TensorFlow on your machine with optimizations (which is fair, since mlpack is also built that way), is that it shouldn't take way more than 3 hours
< sumedhghaisas> rcurtin: Surely the speedup from CPU to GPU is much more than lazy evaluation
< ShikharJ> sumedhghaisas: I think the best way to check would be to run it in the same environment as we do our GAN implementation.
< ShikharJ> rcurtin: Could we install tensorflow on savannah? Maybe I can tmux a build and compare the runtimes?
< rcurtin> sure, would you like me to do it with pip?
< Atharva> sumedhghaisas: It wasn't exactly about the `Sequential` layer; the code is just failing with some matrix size mismatches. I will see where that is coming from and let you know.
< sumedhghaisas> rcurtin: but the lazy speedup will be there even if we move to GPU
< sumedhghaisas> Atharva: Sure
< rcurtin> ShikharJ: actually in that case, it's probably better to install with pip install --user
< rcurtin> sumedhghaisas: right, agreed. I'd be interested in seeing how much speedup we could get first, since rearchitecting the whole system could be really difficult
< ShikharJ> rcurtin: I don't think I have the necessary privileges to install any packages, so please go ahead.
< Atharva> rcurtin: I will try to use bandicoot even if it's experimental now and see how fast it can get.
< rcurtin> ShikharJ: I think you could use pip3 though
< rcurtin> Atharva: sure, give it a shot, but I don't know if it has enough functionality implemented to be a full substitute yet
< rcurtin> so compilation may fail
< Atharva> Oh, I will let you know what happens.
< sumedhghaisas> rcurtin: Agreed. If I use 'auto', that should pick up the internal Armadillo expression class rather than arma::mat, right?
< rcurtin> sumedhghaisas: usually, but auto can cause very weird things to happen so don't be surprised if there are problems :(
< sumedhghaisas> rcurtin: yeah :( I just wanted a way to infer the classes rather than opening up the Glue architecture of Armadillo again
< rcurtin> right, agreed, I see what you mean. well, see if 'auto' works... :)
< rcurtin> if we can get the correct Armadillo internal type back and manage to pass that through the different layers, maybe heavy modification of the abstractions is not necessary
< rcurtin> which would be really nice :)
< sumedhghaisas> rcurtin: Yes, that's precisely what I was thinking. We templatize the input and output anyway
< rcurtin> right, I think it could work
< sumedhghaisas> I always think the same about template substitution, until g++ shows me how difficult life can be
< rcurtin> heh, same...
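For concreteness, a small self-contained sketch (not mlpack code) of what `auto` does with an Armadillo expression: assigning to `arma::mat` forces evaluation, while `auto` deduces the lazy expression-template type, which only holds references to its operands.

    #include <armadillo>
    #include <iostream>

    int main()
    {
      arma::mat A(3, 3, arma::fill::randu);
      arma::mat B(3, 3, arma::fill::randu);

      arma::mat C = A + B;   // forces evaluation right away

      // 'auto' deduces Armadillo's lazy expression type (an eGlue here), not
      // arma::mat: no addition has happened yet, and the expression only holds
      // references to A and B, so both must outlive it.
      auto expr = A + B;

      arma::mat D = expr;    // the addition is performed here

      std::cout << "max difference: " << arma::abs(C - D).max() << std::endl;
      return 0;
    }

The "weird things" caveat: if the right-hand side contains nested temporaries (e.g. `arma::log(A) + arma::exp(B)`), the deduced type can end up holding references to sub-expressions that are destroyed at the end of the statement, so evaluating it later is undefined behaviour.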
< ShikharJ> sumedhghaisas: What did you say the expected time was? 3 hours?
< sumedhghaisas> ShikharJ: A little more, but it should be within 1.5 times that
< sumedhghaisas> ShikharJ: Maybe we will need to adjust the batch size accordingly
< ShikharJ> sumedhghaisas: Let's see, I'll tmux a build to check.
< sumedhghaisas> although the best way to estimate would be from a single update, if we know how many updates it usually takes to converge
< ShikharJ> sumedhghaisas: By default, both the implementations take single step updates.
< sumedhghaisas> try a single batch update and measure the time
< sumedhghaisas> yes... but how many iterations of the single-step update?
< ShikharJ> sumedhghaisas: We had a batch size of 50 for 100,000 iterations in the tensorflow implementation. So that's almost 71 full passes, but I guess it converges much earlier.
< ShikharJ> sumedhghaisas: We also pre-train the discriminator for 300 iterations on 50 sized batches.
< sumedhghaisas> 300 iterations of pretraining is nothing compared to 100,000 iterations, so we need to know how much time each iteration takes in TensorFlow and mlpack
< sumedhghaisas> convergence is also a property of the optimizer, so let's take that out of the equation; let's only compare a single step update
< sumedhghaisas> would be much faster as well
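One way to grab that single-update number on the mlpack side is `arma::wall_clock` (a rough sketch only; `gan` and `optimizer` are placeholders for whatever model and optimizer objects the experiment already builds, with the optimizer configured to stop after one batch):

    arma::wall_clock timer;
    timer.tic();
    gan.Train(optimizer);   // placeholder: a single 50-image batch update
    const double seconds = timer.toc();
    std::cout << "one update took " << seconds << " s" << std::endl;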
< ShikharJ> sumedhghaisas: Hmm, I can probably do that on my system; I'll share what I find later today!
< sumedhghaisas> ShikharJ: Sure thing!
< ShikharJ> sumedhghaisas: Is tensorflow multi-threaded?
< ShikharJ> sumedhghaisas: I am seeing almost all the cores occupied :/
< sumedhghaisas> ShikharJ: hmm... now that I am not sure about.
< sumedhghaisas> But I think yes,
< sumedhghaisas> TensorFlow prefers parallel execution wherever the graph lets it
< sumedhghaisas> I have heard this a lot of times...
< ShikharJ> sumedhghaisas: If it is, then there's no point in the comparison, because mlpack isn't :( We'll have to run TensorFlow on a single core to make the comparison fair.
< rcurtin> ShikharJ: or use OpenBLAS, which will do parallel matrix multiplication when possible
< ShikharJ> rcurtin: How can I do that? What would I have to change?
< rcurtin> a way to check (maybe not the best way) is just to do 'ldd libmlpack.so'
< rcurtin> and that will be linked to some BLAS library... libblas.so, libopenblas.so, something like this
< rcurtin> which one it links to tells you which one it's using
< rcurtin> if OpenBLAS is installed on the system, then Armadillo should be automatically using it anyway
< rcurtin> but I always use ldd to check... seeing exactly what it's linked against removes most doubts about what it could be
< ShikharJ> rcurtin: Where would I find libmlpack.so file?
< rcurtin> in your build directory under lib/
< rcurtin> or if you are building a standalone program, you can also just use ldd on that
< rcurtin> looks like on savannah, openblas is not installed. I should install it on all five benchmarking systems, but I won't interrupt if you are doing anything with them now
< ShikharJ> rcurtin: Hmm, only libblas is installed as far as I can see.
< rcurtin> right, so if you like I can just go ahead and install now, then you can rebuild and relink
< rcurtin> (or rather just rebuild, that will do the linking too of course)
< ShikharJ> rcurtin: Sure, let's even the odds a bit :P
< sumedhghaisas> rcurtin, ShikharJ: Although I am not sure they are equivalent, OpenBLAS does the same operation on multiple GPU right? Tensorflow I think finds whole operations that can be parallelized
< rcurtin> ShikharJ: ok, installed on all 5 benchmarking systems
< rcurtin> sumedhghaisas: I agree, they are not exactly equivalent, but in both cases I'd say each should be making pretty full use of all available CPU cores
< rcurtin> (I think you meant CPU not GPU there)
< rcurtin> I would say it is as fair a comparison as we can get for TensorFlow and mlpack in most "real life" use cases... people won't generally want to restrict their usage to one core
< ShikharJ> rcurtin: Agreed. Currently Tensorflow is taking up all the cores, so let me see how long that takes, and probably, then I can spawn an mlpack build.
< rcurtin> right
< ShikharJ> rcurtin: Meanwhile, I'll try and get some statistics on the single iteration timings :)
vivekp has quit [Ping timeout: 260 seconds]
witness_ has joined #mlpack
< zoq> ShikharJ: I guess if we'd like to test out some ideas, it might be useful to write a simple script for the benchmark system. That would allow us to easily track which approach worked and which one didn't. Let me know what you think.
sumedhghaisas has quit [Ping timeout: 260 seconds]
< ShikharJ> zoq: Seems like an interesting idea, but first I'd like to spend the time on our RBM implementation. If this takes too long, I don't want to keep those goals on hold. Is that fine by you?
< zoq> ShikharJ: absolutely
< ShikharJ> rcurtin: Hmm, interesting: for TensorFlow it seems that a single iteration is a lot faster, but the network still takes much longer to reach convergence.
< ShikharJ> rcurtin: The one beautiful thing about htop is that it aggregates the CPU usage and time elapsed by a single process over all its threads. TensorFlow has already crossed the 7-hour mark (as it would have on a single-core system, I believe).
< ShikharJ> rcurtin: It's been three hours into training.
< zoq> ShikharJ: Might be interesting to test TensorFlow Lite + TensorFlow Mobile as well, probably not at this point but we should keep that in mind.
< ShikharJ> zoq: Good point
< rcurtin> ShikharJ: yeah, that sounds reasonable so far. we should double-check for any copies that are happening during the mlpack training
< rcurtin> that can be a big source of slowdowns
< ShikharJ> rcurtin: Agreed. I'll take a thorough look tomorrow and update any changes in the WGAN PR itself.
< rcurtin> sometimes ARMA_EXTRA_DEBUG can be helpful here
witness_ has quit [Quit: Connection closed for inactivity]
ImQ009 has quit [Quit: Leaving]
< ShikharJ> rcurtin: It took about 11 hours (single-core aggregate), and about 4.5 hours (real-time multi-threaded), to train the complete model, which corresponds to 240-250% CPU usage (which is along the lines of what I saw on htop). So the baseline improvement of 1.57x on a single core is there.
< ShikharJ> rcurtin: I couldn't keep my eyes off the htop screen, I was watching it that anxiously :P
< ShikharJ> rcurtin: I'll tmux an OpenBLAS-based mlpack build in the morning. Going to get some sleep now :)
< ShikharJ> rcurtin: Though, could you please tell me if I will just have to do a make clean (or would I have to clear CMakeCache.txt as well)?
< zoq> ShikharJ: make clean followed by cmake should work just fine, but you can also remove the build folder to be sure.
< ShikharJ> zoq: Thanks, let's see if we can at least match TensorFlow's time with OpenBLAS and multiple threads!
< zoq> yeah fingers crossed :)