Boda Blog

Deep Learning Paper Summaries

Decoupled Neural Interfaces using Synthetic Gradients

  • in a NN, the training process has three locks (bottlenecks)
    • forward lock: you need to compute the output of the previous layer before you can move on to the next layer in the forward pass
    • backward lock: the same, but for back-propagation
    • weight-update lock: you can’t update a layer’s weights until the gradient has come back from the next layer
  • the paper tries to remove these locks by decoupling the layers, so that each layer can train on its own
  • it does that by introducing a Synthetic Gradient Model that predicts the gradient for the current layer without waiting for the true gradient from the next layer
  • this way we can compute the gradient and update the weights as soon as we have the activation of the current layer

Synthetic Gradient Model

  • can be just a simple NN that is trained to predict the gradient of the layer, using the true gradient as its target once it arrives (see the sketch below)
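
A minimal sketch of the idea (assuming PyTorch; the layer sizes and the two-layer synthetic gradient model are made up for illustration, not taken from the paper):

import torch
import torch.nn as nn

layer = nn.Linear(128, 64)                       # the decoupled layer
sg_model = nn.Sequential(nn.Linear(64, 64),      # synthetic gradient model:
                         nn.ReLU(),              # predicts dL/dh from the activation h
                         nn.Linear(64, 64))
opt_layer = torch.optim.SGD(layer.parameters(), lr=0.01)
opt_sg = torch.optim.SGD(sg_model.parameters(), lr=0.01)

x = torch.randn(32, 128)                         # a mini-batch of inputs to this layer
h = layer(x)                                     # forward pass of this layer only

# update the layer immediately, using the predicted gradient instead of waiting
synthetic_grad = sg_model(h.detach()).detach()
opt_layer.zero_grad()
h.backward(gradient=synthetic_grad)
opt_layer.step()

# later, when the true gradient dL/dh finally arrives from the next layer,
# train the synthetic gradient model to match it
true_grad = torch.randn(32, 64)                  # placeholder for the real dL/dh
opt_sg.zero_grad()
sg_loss = ((sg_model(h.detach()) - true_grad) ** 2).mean()
sg_loss.backward()
opt_sg.step()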

GIT

Beautiful commands

git log --oneline --decorate --all --graph

git merge --abort ==> abort the merge and go back as if it never happened

git reset --hard ==> your way to lose all uncommitted work in your working directory

  • a git fast-forward merge basically moves the branch pointer forward to the new commit, without creating a merge commit or anything
  • you can merge with the --no-ff flag to disable the fast-forward merge and force git to create a merge commit (example below)
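
For example (the branch name is made up):

git switch main
git merge --no-ff feature/login    # force a merge commit even when a fast-forward would be possible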

Git Bisect

  • used when something broke and you know what broke, but you can’t figure out which commit broke it
  • you just give it a test criterion to check the commit history against, and git binary-searches for the first bad commit (example below)
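
Typical session (the commit hash and the test script are made up):

git bisect start
git bisect bad                    # the current commit is broken
git bisect good a1b2c3d           # the last commit known to work
# git checks out a commit in the middle; test it, then mark it:
git bisect good                   # or: git bisect bad
# or let git run the test for you:
git bisect run ./run_tests.sh
git bisect reset                  # done, go back to where you started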

Methodology

  • everything inside git is an object
  • all your local branches are located in .git/refs/heads
  • a branch is basically a file that points to a commit, i.e. a pointer to a specific commit
  • every commit has a parent, so to assemble a branch’s history we follow the chain of parents (see below)
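
You can see this with a couple of plumbing commands (assuming a local branch called main whose ref has not been packed):

git rev-parse main             # resolve the branch name to a commit hash
cat .git/refs/heads/main       # the branch "file": just that same commit hash
git cat-file -p main           # the commit object: tree, parent(s), author, message
git log --oneline main         # walks the chain of parents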

Commits

  • keep the changes added to a commit related to the same topic
  • add an informative commit message
  • you can stage only parts of the changes in a single file using the -p flag: git add -p filename (example below)
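
Example (the file name is made up); git walks you through the changed hunks one by one:

git add -p src/train.py        # y = stage this hunk, n = skip it, s = split it, q = quit
git commit -m "Tune the learning-rate schedule"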

Branching

Long-running branches

  • Main branch
  • Dev branch

Short-lived branches

  • feature branches
  • bug-fix branches (example workflow below)
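
A typical short-lived branch (the names are made up):

git switch -c feature/signup dev    # branch off the long-running dev branch
# ...commit the feature work...
git switch dev
git merge --no-ff feature/signup
git branch -d feature/signup        # delete the short-lived branch after merging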

Merging

  • When the head of one of the two branches is the same as the common ancestor of the two branches, we can do a fast-forward merge: the commits of the other branch already sit on top of the common ancestor, so git just moves the branch pointer forward (example below)
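
For example (names made up): if main has not moved since feature branched off, merging feature into main is a fast-forward:

git switch main
git merge feature                 # main's pointer simply moves to feature's head
git log --oneline --graph main    # the history stays a straight line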

Rebase

  • rebase replays the commits of the current branch, one by one, on top of the head of the target branch; because the replayed commits get new hashes, it rewrites the commit history

Only use rebase to clean up local commit history; don’t rebase commits that have already been pushed to a shared remote (example below)
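
Typical usage (branch names made up):

git switch feature/signup
git rebase dev              # replay the feature commits on top of dev's head
git rebase -i HEAD~3        # interactively squash/reword the last 3 local commits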

Applied Deep Learning

Reference

Deep Learning overview

  • we can look at deep learning as an algorithm that writes algorithms, like a compiler
  • in this case, the source code would be the data (examples/experiences)
  • the executable code would be the deployable model
  • Deep: function composition $f_l \circ f_{l-1} \circ \dots \circ f_1$

  • Learning: Loss, Back-propagation, and Gradient Descent

  • $L(\theta) \approx J(\theta)$ -> the mini-batch loss is a noisy estimate of the objective function, due to mini-batching; that’s why we call it stochastic gradient descent (sketch below)
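
A minimal NumPy sketch of that point (the linear-regression objective, batch size, and learning rate are made up for illustration): each mini-batch gradient is a noisy but unbiased estimate of the full-data gradient.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                            # full dataset
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

def grad(theta, Xb, yb):
    # gradient of the mean squared error on the mini-batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

theta = np.zeros(5)
for step in range(200):
    idx = rng.choice(len(X), size=32, replace=False)      # sample a mini-batch
    theta -= 0.05 * grad(theta, X[idx], y[idx])           # noisy gradient step

print(np.round(theta, 2))                                 # ends up close to true_w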

NLP Specialization

Course 1: Classification and Vector Spaces

Week 4

Hashing

  • We can use hashing to search for the K nearest vectors, which heavily reduces the search space

Locality sensitive hashing

  • the idea is to put items that are close in the vector space into the same hash buckets

  • we can create a set of random planes, compute the relative position of each point compared to these planes (which side it falls on), and then compute the hash value for the point accordingly (see the sketch below)
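
A minimal NumPy sketch (the vector dimension and number of planes are arbitrary): the sign of the dot product with each plane's normal contributes one bit of the hash, so nearby vectors tend to land in the same bucket.

import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 300, 10                        # e.g. 300-d word vectors, 10 random planes
planes = rng.normal(size=(dim, n_planes))      # each column is a plane's normal vector

def hash_value(v, planes):
    bits = (v @ planes >= 0).astype(int)       # which side of each plane v falls on
    return int(sum(b * 2**i for i, b in enumerate(bits)))   # combine the bits into a bucket id

v = rng.normal(size=dim)
print(hash_value(v, planes))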

CS480/680 Intro to Machine Learning

Lecture 12

Gaussian process

  • an infinite-dimensional Gaussian distribution: any finite set of function values has a joint Gaussian distribution (sampling sketch below)
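
A minimal NumPy sketch (the RBF kernel and the input grid are arbitrary choices): to sample a function from a GP prior, sample from the finite-dimensional Gaussian over a grid of inputs.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)                                  # finite grid of inputs

# RBF (squared-exponential) covariance kernel
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)

# any finite set of function values is jointly Gaussian: f ~ N(0, K)
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))
print(f[:5])                                               # one sampled function on the grid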

Lecture 16

Convolutional NN

  • a rule of thumb: having many layers with smaller filters is better than having one big filter, as going deep captures better features and also uses fewer parameters (see the count below)
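
A quick parameter count behind that rule (the standard comparison, ignoring biases, with $C$ input and $C$ output channels): two stacked $3\times3$ convolutions see the same $5\times5$ receptive field as a single $5\times5$ convolution, but cost $2 \cdot 3 \cdot 3 \cdot C^2 = 18C^2$ weights versus $5 \cdot 5 \cdot C^2 = 25C^2$, and add an extra non-linearity in between.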

Residual Networks

  • even after using ReLU, a NN can still suffer from vanishing gradients
  • the idea is to add skip connections so that we create shorter paths for the gradient to flow back (see the sketch below)
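
A minimal sketch of a residual block (assuming PyTorch; the sizes are arbitrary, and real ResNet blocks use convolutions and batch norm rather than plain linear layers):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # skip connection: gradients can flow straight through the identity path
        return torch.relu(x + self.f(x))

x = torch.randn(8, 64)
print(ResidualBlock(64)(x).shape)    # torch.Size([8, 64])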

Lecture 18

LSTM vs GRU vs Attention

  • LSTM: 3 gates: a forget gate for the cell state, an input gate, and an output gate
  • GRU: only two gates: a reset gate and an update gate that takes a weighted combination of the previous hidden state and the new candidate computed from the input (update equations below)
    • uses fewer parameters
  • Attention: at every step of producing the output, create a new context vector that gives more attention to the input tokens that are important for this output token
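
For reference, the GRU updates (one common convention; some texts swap the roles of $z_t$ and $1 - z_t$):

$z_t = \sigma(W_z x_t + U_z h_{t-1})$ (update gate)
$r_t = \sigma(W_r x_t + U_r h_{t-1})$ (reset gate)
$\tilde h_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$ (candidate state)
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t$ (weighted combination of old state and candidate)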

Lecture 20

Autoencoder

takes an input and is trained to reproduce that same input at its output, forcing the bottleneck in the middle to learn a compressed representation (sketch below)
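
A minimal sketch (assuming PyTorch; the layer sizes are arbitrary):

import torch
import torch.nn as nn

# encoder compresses the input into a small latent code, decoder reconstructs it
autoencoder = nn.Sequential(
    nn.Linear(784, 32), nn.ReLU(),    # encoder: 784 -> 32 bottleneck
    nn.Linear(32, 784),               # decoder: 32 -> 784 reconstruction
)

x = torch.rand(16, 784)                       # a batch of flattened images (made up)
loss = ((autoencoder(x) - x) ** 2).mean()     # reconstruction loss: output should match the input
loss.backward()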