September 2018

ChanServ changed the topic of #mlpack to: "Due to ongoing spam on freenode, we've muted unregistered users. See http://www.mlpack.org/ircspam.txt for more information, or also you could join #mlpack-temp and chat there."

< rcurtin>
also, I believe (but am not 100% sure) that I will at least be in California during the mentor summit

< ShikharJ>
rcurtin: As far as I understand, a neuron in this context is just a single activation (computed by matrix operations). Also, the abstract mentions that the gain and bias parameters follow from the BatchNorm technique, and equation 4 says that these vectors have the same dimensions as the mean. So we need just a single vector.

< rcurtin>
ShikharJ: right, so to me this would imply that g and b should be vectors of length input.n_rows, and that we should be using each_col() when g and b are multiplied and added instead of each_row()

< zoq>
rcurtin: That sounds correct to me; I will incorporate that into the open PR. Shikhar, what do you think?

< rcurtin>
definitely. I will probably order the plane tickets today or tomorrow; just need to double-check

< ShikharJ>
rcurtin: No, quite the opposite. In LayerNorm, we have the mean vector as 1 x n_cols, so the g and b vectors should also be of the same shape. If you look carefully at equation 4, you'll see that we're doing an elementwise multiplication of g and (x - mu), which wouldn't be valid if the g vector had n_rows elements instead of n_cols.

< rcurtin>
ShikharJ: it seems to me like equation 4 in the paper is assuming that the input a^t is one single point with dimension (n_rows x 1)---so, that is, the paper assumes a batch size of 1

< rcurtin>
in this case, for elementwise multiplication to make sense, then g would have to have the shape (n_rows x 1) also

< rcurtin>
if we generalize to larger batch sizes... then A^t (let's call it capitalized since it's a matrix now not a vector) has size (n_rows x n_cols) where n_cols is the batch size

< rcurtin>
and in this case I agree, the mean vector has size (1 x n_cols), and the operation (A^t - \mu^t) would actually be implemented as A^t.each_row() -= \mu^t

< rcurtin>
but it only makes sense to learn a bias and gain (b and g) for all points, instead of one for each point

< rcurtin>
I hope this makes sense, I am not sure if I wrote it well. But to me it made sense when I considered that the equations are written for a batch size of 1, then I manually generalized them from there
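
A minimal sketch of the generalization rcurtin describes above, written in plain C++ with std::vector rather than Armadillo so that it compiles on its own. It assumes the column-major convention from the discussion (each column of A^t is one point, so n_cols is the batch size), computes one mean and one variance per point, and applies a single gain g and bias b of length n_rows to every point. The name LayerNormForward, the Batch alias, and the eps stabilizer are illustrative assumptions; none of this is mlpack's actual API or the code in the open PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Each inner vector is one data point (one column of A^t) holding n_rows
// feature values; the outer size is the batch size (n_cols).
using Batch = std::vector<std::vector<double>>;

// Hypothetical helper for illustration only.
// g and b have length n_rows and are shared by all points in the batch.
// eps is an assumed numerical stabilizer, not part of the discussion above.
Batch LayerNormForward(const Batch& input,
                       const std::vector<double>& g,
                       const std::vector<double>& b,
                       const double eps = 1e-8)
{
  Batch output(input.size());
  for (std::size_t j = 0; j < input.size(); ++j)
  {
    const std::vector<double>& point = input[j];
    const std::size_t nRows = point.size();
    assert(nRows > 0);
    assert(g.size() == nRows && b.size() == nRows);

    // One mean per point: this is the (1 x n_cols) mean vector.
    double mean = 0.0;
    for (const double v : point)
      mean += v;
    mean /= static_cast<double>(nRows);

    // One variance per point, taken over the feature dimension.
    double var = 0.0;
    for (const double v : point)
      var += (v - mean) * (v - mean);
    var /= static_cast<double>(nRows);

    const double invStd = 1.0 / std::sqrt(var + eps);

    // Gain and bias are indexed by feature (row), not by point (column).
    output[j].resize(nRows);
    for (std::size_t i = 0; i < nRows; ++i)
      output[j][i] = g[i] * (point[i] - mean) * invStd + b[i];
  }
  return output;
}
```

In Armadillo terms, the per-point mean subtraction corresponds to the each_row() operation mentioned above, while multiplying by g and adding b would correspond to each_col(), since both have one entry per row.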

< rcurtin>
I think Marcus said he will already handle it, but I am not sure if he's already done it yet :)

< ShikharJ>
rcurtin: This is a nice catch. I had only referred to the TensorFlow documentation, and maybe I misunderstood it (https://www.tensorflow.org/api_docs/python/tf/contrib/layers/layer_norm).