ChanServ changed the topic of #mlpack to: Due to ongoing spam on freenode, we've muted unregistered users. See http://www.mlpack.org/ircspam.txt for more information, or also you could join #mlpack-temp and chat there.
< ShikharJ>
rcurtin: Are you there?
rajat_ has joined #mlpack
< rcurtin>
ShikharJ: yeah, I'm here
< rcurtin>
a little later than usual... I am in California this week so three hours behind in timezones :)
vivekp has quit [Read error: Connection reset by peer]
< ShikharJ>
rcurtin: No worries, just wanted to talk. Do you have any experience with distributed ML software? If I'm not mistaken, isn't that your current area of work?
< rcurtin>
it's not what I'm doing currently, but I have used various distributed programming toolkits like MPI and have some knowledge of distributed ML libraries :)
< rcurtin>
so I might not be *that* helpful but I can try...
< ShikharJ>
rcurtin: I actually had some theoretical doubts. Would a distributed ML system running on multiple nodes converge more slowly (in terms of, say, classification accuracy) than one running on a single node?
< rcurtin>
ah, ok, I have a little more knowledge here then
< rcurtin>
it *could*, but I would expect in practice that the difference in convergence would be marginal
< rcurtin>
and actually if the distributed system runs faster, it may take more distributed iterations but that still may take less time than a single-node system
< rcurtin>
I guess you could say, the single-node system would use less power overall
< rcurtin>
but often, these systems might work a little like Hogwild!:
< ShikharJ>
rcurtin: What could be the prominent reasons behind slower convergence? I assume synchronization time to be one reason and extra I/O to be another, but what other reasons can you think of?
< rcurtin>
each node will compute a gradient on a subset of points, and then send the gradient update back to a master server
< rcurtin>
but since it's all in parallel, the gradient updates from each node may not be with respect to the most up-to-date parameters
< rcurtin>
so, for instance, consider that at some time t_0, we have parameters p_0
< rcurtin>
and we have nodes 0 and 1 that compute f'_0(p_0) and f'_1(p_0) simultaneously
< rcurtin>
where f'_0(p_0) is the gradient of the objective at the parameters p_0, computed on the points held by node 0
< rcurtin>
(and f'_1(p_0) is the same for the points held by node 1)
< rcurtin>
when these quantities are computed, they will be sent back to some central server (at least for some types of distributed optimizers) for an update
< ShikharJ>
rcurtin: Thanks, I get the idea; basically following from the DistBelief paper?
< rcurtin>
so if f'_0(p_0) is received first, we'll set p_1 = p_0 + \alpha * f'_0(p_0)
< rcurtin>
ah, ok, so yeah you got the rest :)
< rcurtin>
I am not sure if it's the same as the DistBelief paper; I'm actually thinking more about Hogwild! here
< rcurtin>
but I think the systems are pretty similar
< rcurtin>
the key is that when each node computes the gradient, it may be computing the gradient with an out-of-date set of parameters
< rcurtin>
which can add some "noise" to the gradient and slow down convergence somewhat
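A minimal sketch of the asynchronous update described above, as a toy single-process simulation on a least-squares objective split across two nodes. All names and data here are illustrative assumptions (not Hogwild!'s or DistBelief's actual interfaces), and the sign convention below is the usual minimization one, p <- p - alpha * gradient:

    import numpy as np

    # Two "nodes" read the same parameters, compute gradients on their own
    # data subsets, and the server applies the updates in arrival order, so
    # the second update is computed against parameters that are already stale.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w + 0.1 * rng.normal(size=200)
    splits = [(X[:100], y[:100]), (X[100:], y[100:])]  # points on node 0 and node 1

    def gradient(w, X_i, y_i):
        # Gradient of the mean squared error on node i's points w.r.t. w.
        return X_i.T @ (X_i @ w - y_i) / len(y_i)

    alpha = 0.05
    p = np.zeros(5)              # parameters held by the central server

    for it in range(200):
        p_stale = p.copy()       # both nodes read the same copy of the parameters
        grads = [gradient(p_stale, X_i, y_i) for X_i, y_i in splits]
        for g in grads:
            # The second gradient was computed w.r.t. p_stale, not the
            # just-updated p: that staleness is the "noise" mentioned above.
            p = p - alpha * g

    print("distance to true weights:", np.linalg.norm(p - true_w))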
< ShikharJ>
rcurtin: Yeah, based on what I read, DistBelief fixes the number of local gradient updates (the default is 1, but it can be increased) before each node pulls the most up-to-date parameters from the parameter server.
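A rough sketch of that scheme, where each node takes several local gradient steps before pushing the accumulated change and pulling fresh parameters, again on a toy least-squares problem. The names (local_steps, server_p) and the objective are assumptions for illustration, not DistBelief's actual API:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w
    splits = [(X[:100], y[:100]), (X[100:], y[100:])]

    def gradient(w, X_i, y_i):
        return X_i.T @ (X_i @ w - y_i) / len(y_i)

    alpha, local_steps = 0.05, 5
    server_p = np.zeros(5)

    for round_ in range(20):
        deltas = []
        for X_i, y_i in splits:
            local_p = server_p.copy()          # pull current parameters
            for _ in range(local_steps):       # several local gradient updates
                local_p -= alpha * gradient(local_p, X_i, y_i)
            deltas.append(local_p - server_p)  # accumulated change to push
        for d in deltas:                       # server applies pushed deltas
            server_p += d

    print("distance to true weights:", np.linalg.norm(server_p - true_w))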
< ShikharJ>
rcurtin: Funny thing is, there is no official Google open-source implementation of DistBelief. Wikipedia states that it was later developed into TensorFlow, though I'm not sure if Distributed TensorFlow would be the same as DistBelief :/
< ShikharJ>
rcurtin: So all in all, it could take more iterations, but eventually a distributed ML system should, for a given target accuracy, take less time than a single node? Would that be safe to say?
vivekp has joined #mlpack
< rcurtin>
ShikharJ: yeah, I think that would be reasonable to say (sorry for the slow response... meetings as usual)
< rcurtin>
probably, in the end, it could take more iterations, but both single-node and distributed optimizers should converge to solutions of basically equivalent quality (or accuracy)
< ShikharJ>
rcurtin: No worries, thanks for the help! Just one final query: as an alternative to DistBelief, which framework would you suggest for running the above experiment? I need to show the results, in addition to the inference drawn from them.
< rcurtin>
hmmm, the only other one I know would be Hogwild!, and I guess there is also Parameter Server (by Alex Smola and friends?)
< rcurtin>
but I am not too familiar with all that's available in that sense
< rcurtin>
is this for work you're doing for the GAN MLsys paper? how is that going?
< rcurtin>
I'm underwater with an MLsys paper along with Marcus and some others on the optimization framework :(
< rcurtin>
it's turning out to be far more time-consuming than I thought it would be...
< rcurtin>
almost done though
< ShikharJ>
rcurtin: Nah, this is something I was discussing with a faculty member at my college; I was curious, so I asked you.
< rcurtin>
ah, ok :)
< rcurtin>
the field of distributed machine learning is huge right now, there is a massive amount of work going on
< ShikharJ>
rcurtin: The MLSys work is underway, but far from completion; not sure if I'll be able to make it by the deadline this year. I need to make a number of changes to the framework, and the epoch timings need to be improved as well (didn't get to time them, but by the looks of it there is some potential work there). I kind of shifted to CycleGAN and its tests.
< rcurtin>
sounds good---sometimes it is better to wait until everything is more ready than to submit too early :)
< ShikharJ>
Plus my exams ate up most of my time. I should be getting back to contributing a bit more in the coming weeks.
< rcurtin>
ah, well hopefully the exams went well :)
< rcurtin>
I realized recently, it's actually been a long time since I've taken any type of exam
< rcurtin>
kind of a weird feeling given that the first 26 years of my life were basically entirely devoted to school, tests, exams, etc.
< rcurtin>
wait, I mean 28. I can't remember. a long time I guess...
< ShikharJ>
Though I must mention, mlpack is kind of well known in academic circles; I was talking with a professor in Germany, and he mentioned that he had heard of the library, Prof. Alex Gray, and your name.
< rcurtin>
oh, really?? that's cool :)
< rcurtin>
I have noticed over the years that the profile of mlpack has risen, and when I talk to people at conferences they are more likely to be familiar with it
< rcurtin>
did that professor happen to be at a university in Aachen? :)
< ShikharJ>
rcurtin: Though he ended up thinking that we're a commercial thing, and asked me where the funds for it come from.
< rcurtin>
hehe, I guess the funds come from Google for GSoC
< ShikharJ>
He asked who supports us and so on; yeah, I mentioned GSoC.
< rcurtin>
Alex Gray did start a company called Skytree in ~2011 based on a very old version of the mlpack codebase
< ShikharJ>
No, this was a professor from MPI-SWS.
< rcurtin>
ah, ok
< rcurtin>
I traveled to visit a professor in Aachen last year, and it was a cool experience; I'd never been to Germany
< rcurtin>
my five years of studying German weren't useful, it turned out :(
< rcurtin>
(I guess I forgot most of the language and my accent was very bad, and everyone spoke English anyway)
< ShikharJ>
rcurtin: Yeah, I mentioned that Alex Gray started a company that got acquired by Infosys, though I didn't know that he based that company on mlpack.
< rcurtin>
Skytree ended up being acquired by Infosys a couple years ago, and now Alex has left there and is (I think?) a VP of research at IBM
< rcurtin>
yeah, a very long time ago, like 2008, mlpack used to be called "fastlib/mlpack" (it was actually two libraries)
< ShikharJ>
Woah, quite the trajectory!
< rcurtin>
and just before I joined his lab, he formed the company and took the state of the code and the company built on that
< rcurtin>
yeah, he is a big shot now I guess :)
< rcurtin>
at some point I saw the codebase of the company, around 2014 I think, and it had evolved in an entirely different way than mlpack did
< ShikharJ>
Interesting, I wonder how much of their version of the library is still based on dual trees and the other things Dr. Gray researched.
< rcurtin>
so it was kinda like how crocodiles and cats have a common ancient ancestor somewhere, but they are totally different :)
< rcurtin>
hmmm, I'm not sure actually. I think they were starting to focus more on distributed machine learning, but doing dual-tree algorithms in that context actually turns out to be pretty hard
< rcurtin>
in the end, I think (but I am not sure) that most of their work was basic machine learning consulting for different companies, and in these cases it's not actually that important to have crazy dual-tree algorithms because mostly people just need a simple logistic regression model or similar
< rcurtin>
I could be totally wrong when I say that---I wasn't a part of the company, so I am kind of just guessing based on the things I've heard :)
< ShikharJ>
rcurtin: Is distributed ML still an active area of research? I have heard a lot about other areas of ML systems but not this one specifically.
< rcurtin>
definitely, very much so
< rcurtin>
I think actually that a lot of work at MLsys will be about distributed machine learning systems
< rcurtin>
the industry is pushing heavily for it---Google is making TPUs to put in big server farms, so that you can run giant models across tons of TPUs
< rcurtin>
but to do that, you need good algorithms to learn in the distributed setting
< rcurtin>
if you're looking for things to read, I know that Alex Smola is (or has recently been?) active in this area, or I guess you could just look at papers that cite DistBelief
< rcurtin>
there should be tons of them :)
< ShikharJ>
But what exactly are people trying to improve upon these days? I'm guessing a large part of the systems-related work would be done already; algorithms would probably be an area worth looking at in this regard.
< ShikharJ>
Oh yeah, I see you mentioned that
< rcurtin>
communications problems are a big issue; reducing that overhead is a big help for scaling
< rcurtin>
you can also learn by having a "central" parameter server, or by splitting the parameters over many systems
< rcurtin>
if you split the parameters themselves over many systems, there are all kinds of new problems that need to be solved, like synchronization, locality of gradient computations, data movement...
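As a very rough illustration of that second option, one could shard the parameter vector by index range across several server processes; gradient slices then have to be routed to the shard that owns them, and workers have to pull every shard to reassemble the full vector, which is exactly the synchronization and data-movement cost mentioned above. The shard layout and names here are assumptions for illustration only:

    import numpy as np

    dim, n_shards, alpha = 12, 3, 0.1

    # Each shard owns a contiguous slice of the parameter vector.
    bounds = np.linspace(0, dim, n_shards + 1).astype(int)
    shards = [np.zeros(hi - lo) for lo, hi in zip(bounds[:-1], bounds[1:])]

    def apply_gradient(full_grad):
        # Route each slice of the gradient to the shard that owns those indices.
        for shard, lo, hi in zip(shards, bounds[:-1], bounds[1:]):
            shard -= alpha * full_grad[lo:hi]

    def pull_parameters():
        # Workers reassemble the full vector by pulling every shard.
        return np.concatenate(shards)

    apply_gradient(np.ones(dim))
    print(pull_parameters())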
< rcurtin>
I'm not sure exactly where the state of the art is, but I do know for sure it's a very active field
< ShikharJ>
Downpour, asynchronous SGD, and L-BFGS are the only ones that come to mind on the algorithms side. Oh, I see, thanks for the insight; I can see why the effort is being put into this field.
< rcurtin>
yeah, there is a lot of money in it I think :) If Google has the best implementations of the algorithms, they can sell more time on TPUs than Amazon can sell on GPUs or whatever other competitors there are for cloud-based machine learning
< rcurtin>
many companies want to build their own hardware or infrastructure too; some months ago I was interviewing with Tesla and even they are building their own hardware
< rcurtin>
(whether or not they will be successful... we'll see. in the end I think it was good that one didn't work out... I'd probably be working 80+ hour weeks every week)
< ShikharJ>
rcurtin: Interesting talk, thanks! I have to go back and finish some assignments; I'll be back later!
< rcurtin>
sounds good. good luck with the assignments and talk to you later! :)