Brief outline below (more of a personal guide, actually). Read from the link.
- Convolution Operation Description
- Cross Correlation
- Why Convolution?
- Sparse Interaction
- Parameter Sharing
- Equivariant Representation
- Conv Nets Operation
- Detector (Nonlinear Function)
- Pooling – adds a strong prior that the function the layer learns must be invariant to small translations.
- Convolution implies an infinitely strong prior that weights are shared among neighbors and that weights outside a small local region are zero. This prior makes sense if the feature is equivariant to translation.
- Variants of Convolution
- 1 kernel = 1 kind of feature. Usually use many kinds of kernel.
- downsampling (stride)
- border – zero padding
- valid convolution
- same convolution
- full convolution
- locally connected layers / unshared convolution
- tiled convolution
- Structured Output
- Data Types – conv nets can process inputs of varying spatial extents (when the input contains varying amounts of observation of the same kind of thing, not when it optionally contains different kinds of observation).
- Efficient convolution algorithms – If the kernel is “separable”, a much more efficient approach can be used.
- We can use the following to train our convolutional network
- Greedy layer wise pre-training
- Unsupervised learning
- Neuroscience basis for conv nets
- Gabor Functions
- History – In a way, conv nets paved the way for the general acceptance of neural networks.
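To make the "separable kernel" point above concrete, here is a quick NumPy sketch (my own toy example, not from the book) showing that a separable 2D kernel, i.e. an outer product of two 1D kernels, can be applied as two cheap 1D convolutions instead of one full 2D convolution:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation (what most DL libraries call convolution)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))

# A separable kernel is an outer product of two 1D kernels.
col = np.array([1.0, 2.0, 1.0])    # vertical part
row = np.array([-1.0, 0.0, 1.0])   # horizontal part (Sobel-like)
kernel = np.outer(col, row)        # the full 3x3 kernel

full = conv2d_valid(image, kernel)

# Applying the two 1D kernels in sequence gives the same result with
# O(k) instead of O(k^2) multiplications per output pixel.
step1 = conv2d_valid(image, col[:, None])   # 3x1 kernel
step2 = conv2d_valid(step1, row[None, :])   # 1x3 kernel

assert np.allclose(full, step2)
```

For a k×k separable kernel the saving grows with k, which is why separability matters for efficiency.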
Brief outline below. Read from link.
- Cost Function
- Maximum Likelihood (cross-entropy)
- Mean Squared Error
- Mean Absolute Error
- Output Units
- Sigmoid + Maximum Likelihood
- Softmax + Maximum Likelihood (multiclass)
- Gaussian Mixture
- Hidden Units
- Rectified Linear Unit
- Absolute Value Rectification
- Leaky ReLU
- Parametric ReLU (PReLU)
- Maxout units
- Sigmoid Units
- Logistic Sigmoid
- Hard tanh
- Architecture Design
- Depth vs Width (exponential)
- Connection between layers
- Back Propagation
- Might need to implement one myself to truly understand this
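Since I noted that I might need to implement back propagation myself to truly understand it, here is a minimal sketch (my own toy example, not from the book) of a hand-derived backward pass for a tiny one-hidden-layer network, checked against a numerical gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
# Tiny 2-input, 3-hidden-unit network with a sigmoid hidden layer and
# squared-error loss, written out by hand to see what backprop computes.
x = rng.standard_normal(2)
y = 1.0
W = rng.standard_normal((3, 2)) * 0.5   # hidden-layer weights
v = rng.standard_normal(3) * 0.5        # output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, v):
    h = sigmoid(W @ x)          # forward pass
    yhat = v @ h
    return 0.5 * (yhat - y) ** 2

# Backward pass: apply the chain rule layer by layer.
h = sigmoid(W @ x)
yhat = v @ h
delta_out = yhat - y                     # dL/dyhat
grad_v = delta_out * h                   # dL/dv
delta_h = delta_out * v * h * (1 - h)    # dL/d(W @ x), using sigmoid'(z) = h(1-h)
grad_W = np.outer(delta_h, x)            # dL/dW

# Sanity check against a central-difference numerical gradient.
eps = 1e-6
num = np.empty_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(Wp, v) - loss(Wm, v)) / (2 * eps)

assert np.allclose(grad_W, num, atol=1e-6)
```

The numerical-gradient check is also the standard way to debug a real backprop implementation.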
After reading and digesting Chapter 4 (link), I aggregated the following questions to test my comprehension. I’ll post the answers to these questions when I review them.
- Define the underflow and overflow problem.
- For example, how can you modify softmax to evade the underflow and overflow problem?
- Define condition number.
- Define poor conditioning.
- Define the function you are trying to optimize in a gradient based optimization.
- Define the following:
- critical points
- stationary points
- local maximum
- local minimum
- saddle points
- Define partial derivatives and gradients.
- Define directional derivatives.
- Define the Jacobian matrix.
- Define the Hessian matrix.
- Define issues with Hessian matrix with poor conditioning.
- Define first order optimization algorithms, second order optimization algorithms.
- Define Lipschitz constant and its significance.
- Define Convex optimization algorithms
- Define constrained optimization and 3 approaches you can solve it.
- Define the Karush-Kuhn-Tucker (KKT) approach.
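For the softmax underflow/overflow question, here is a small NumPy sketch (my own, not from the book) of the standard fix: subtracting the maximum before exponentiating:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum()

def softmax_stable(z):
    # Subtracting max(z) leaves the result unchanged (the shift cancels
    # in the ratio) but keeps exp() from overflowing to inf.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1000.0, 1000.0, 1000.0])

# The naive version overflows: exp(1000) == inf, so inf/inf gives nan.
with np.errstate(over='ignore', invalid='ignore'):
    assert np.isnan(softmax_naive(z)).all()

# The stable version returns the correct uniform distribution.
assert np.allclose(softmax_stable(z), [1 / 3, 1 / 3, 1 / 3])
```

The same trick in log space (the "log-sum-exp" trick) handles the underflow side when taking log of a softmax output.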
After reading and digesting Chapter 3 (link), I aggregated the following questions to test my comprehension. I’ll post the answers to these questions when I review them.
- What is the purpose of probability theory?
- What are its two uses in Deep Learning?
- Why probability in ML?
- What are the three possible sources of uncertainty?
- Is it always better to use “complex and certain rules” than “simple and uncertain rules”?
- What is Frequentist probability?
- What is Bayesian probability?
- What is a random variable?
- A random variable can be __ and __ ?
- What is a probability distribution?
- What is a probability mass function?
- What is a joint probability distribution?
- What are the 3 properties that a probability mass function must satisfy?
- What is a probability density function?
- What are the 3 properties that a probability density function must satisfy?
- Define marginal probability and its key equation (also known as the sum rule).
- Define conditional probability and its key equation.
- Define intervention query and causal modeling.
- Define the chain rule of conditional probabilities.
- Define independence and conditional independence.
- Define the formula for expectation (for both discrete and continuous).
- Define variance and standard deviation.
- Define covariance and correlation.
- How is independence and covariance related?
- Define the covariance matrix.
- Define a Bernoulli Distribution.
- Define a Multinoulli Distribution.
- Define a Gaussian distribution.
- Define a Normal distribution.
- What is precision in the Gaussian distribution?
- In the absence of prior knowledge, why is the normal distribution a good default choice (2 reasons)?
- Define a multivariate normal distribution.
- Define an Exponential distribution.
- Define a Laplace distribution.
- Define a Dirac distribution.
- Define an Empirical distribution.
- Is the Dirac delta function a generalized function?
- Is the Dirac delta distribution necessary to define an empirical distribution over discrete variables?
- Define a Mixture distribution.
- Define a Latent variable.
- Define a Gaussian Mixture Model and explain why it is called a universal approximator.
- Explain what are prior and posterior probabilities.
- Define Bayes’ rule.
- Define briefly measure theory, measure zero, and almost everywhere.
- When two continuous random variables are related by a deterministic function, what should one be careful about (specifically, how does the function affect the domains of the two variables)?
- What equation relates the densities of the two variables? What is the equation in higher dimensions?
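As a note to myself on those last two questions, here is a quick numerical check (my own example, not from the book) of the change-of-variables formula p_y(y) = p_x(g^{-1}(y)) * |d g^{-1}(y) / dy|, using y = g(x) = 2x + 1 with x ~ N(0, 1):

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# If y = g(x) = 2x + 1 and x ~ N(0, 1), then
#   p_y(y) = p_x(g^{-1}(y)) * |d g^{-1}(y) / dy| = p_x((y - 1) / 2) * 1/2.
# Note that p_y(y) != p_x(g^{-1}(y)) alone: the Jacobian factor 1/2 is
# what keeps the total probability mass equal to 1 after stretching space.
y = 2.5
py_via_formula = normal_pdf((y - 1) / 2) * 0.5

# y itself is N(1, sigma=2), so we can check against the normal pdf directly.
py_direct = normal_pdf(y, mu=1.0, sigma=2.0)

assert np.isclose(py_via_formula, py_direct)
```

In higher dimensions the scalar derivative is replaced by the absolute value of the determinant of the Jacobian matrix.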
B. Common Functions
- Define a logistic sigmoid (including where does it saturate).
- Define a softplus function (including its range).
- Define a logit in statistics.
- Note about the math properties of these common functions (see the book).
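To check a few of those math properties for myself, here is a small NumPy sketch (my own, not from the book):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # log(1 + e^x): a smooth version of the rectifier max(0, x); range (0, inf)
    return np.log1p(np.exp(x))

x = np.linspace(-5, 5, 101)

# The sigmoid saturates: it approaches 0 as x -> -inf and 1 as x -> +inf.
assert sigmoid(-30) < 1e-12 and sigmoid(30) > 1 - 1e-12

# sigma(-x) = 1 - sigma(x)
assert np.allclose(sigmoid(-x), 1 - sigmoid(x))

# softplus(x) - softplus(-x) = x
assert np.allclose(softplus(x) - softplus(-x), x)

# The logit (log-odds) is the inverse of the sigmoid.
assert np.allclose(np.log(sigmoid(x) / (1 - sigmoid(x))), x)
```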
C. Information Theory
- Define Information Theory. What is the basic intuition behind it?
- Define self-information. Explain the unit nat, bit, and shannon.
- What is Shannon entropy?
- What is Differential entropy?
- Define the Kullback-Leibler (KL) divergence.
- Is KL divergence symmetric? Is it non-negative?
- Define cross entropy.
- How is cross entropy similar to KL divergence?
- What is “0 log 0”?
- Define a structured probabilistic model.
- Define a graphical model.
- What is the main equation for a Directed model?
- What is the main equation for an Undirected model? What is a clique?
- Can every probability distribution be classified as either a Directed or an Undirected model?
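For the KL divergence and cross-entropy questions, here is a small NumPy sketch (my own, not from the book) verifying the properties asked about above on two discrete distributions:

```python
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)); by convention 0 log 0 = 0,
    # so we simply skip the entries where P(x) = 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

# KL divergence is non-negative but NOT symmetric.
assert kl(p, q) >= 0 and kl(q, p) >= 0
assert not np.isclose(kl(p, q), kl(q, p))

# Cross entropy H(P, Q) = H(P) + D_KL(P || Q), so minimizing cross entropy
# with respect to Q is the same as minimizing KL(P || Q).
assert np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))
```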
Note to self: after reading the math taught in this chapter, I realized that many of the things I did not understand before suddenly started to make sense. I know I still need to study a lot of stuff, but this just got me really excited after seeing how math enables and serves as a language and framework of machine learning.
Research is only useful when it is shared, and teaching provides the best opportunity to share your research with the next generation of scientists.
- Be strict about preparation times for teaching material. Set a time allocation (e.g. 3 hrs) and stick to it!
- Keep the learning objectives in mind when writing lecture material. It will help you focus and cover the essentials.
- Remember you don’t have to talk for the full time.
Disclaimer: This is taken directly from a Mendeley article (link) about balancing research and teaching. I felt that reflecting upon its key points is essential to building in me the right character and principle in my path to becoming a researcher.
Deep learning has been a very hot topic lately. As part of my OMSCS Big Data for Healthcare class and PhD preparation, it seems like I also need to learn about “Deep Learning”. I did watch the Deep Learning videos from Udacity, and I honestly believe those videos are more than enough to give one an overview of Deep Learning. But to do more meaningful work in this area, I need a deeper understanding of “Deep Learning”. Hence, I started reading the popular “deep learning book”. Below is my internalization of Chapter 1. Note that aside from my opinions, most of the content below is just me retelling the contents of the book.
Looking at the hierarchy of the fields involved, Deep Learning falls under Representation Learning, which falls under Machine Learning, which falls under Artificial Intelligence. From the data, representation learning learns simple representations of the big problem and combines these representations to make more accurate predictions.
3 Notable Phases of Deep Learning History:
- Cybernetics (1940s-1960s)
- Linear models were created, along with the discovery of their limitations, such as being unable to learn the XOR function.
- Connectionism (1980s-1990s)
- One main idea of connectionism is that a large number of computational units can achieve intelligent behavior when networked together (as inspired by our brain and the network of neurons it contains).
- Distributed representation – the idea that each input of a system should be represented by many features, and each feature should be involved in the representation of many possible inputs (reading the example in the book will make things clearer).
- Deep learning (2006-present)
Two neural perspectives for deep learning
- the brain is a living example that intelligent behavior is possible, and a straightforward way to build intelligence is to reverse engineer the brain (which is easier said than done)
- assuming that machine learning models encapsulate a part of how our brain works, they become useful in shedding light on the brain and the underlying principles of human intelligence.
In recent years, there have been a lot of improvements in the field due to:
- Faster computers
- More data
- The models did not change much compared to the 1980s; what changed was the amount of data we used to train them.
- Rough rule of thumb as of 2016:
- 5000 labeled examples per category = acceptable performance.
- 10 million labeled examples = match or exceed human performance.
- New techniques to enable deeper networks
- We have more computational resources to run much larger models today. Model size roughly doubles every 2.4 years.
- If we continue on this track, we will probably reach the same number of neurons as the human brain by the 2050s. However, a biological neuron may be more complicated than an artificial one, so an apples-to-apples comparison might be wrong.
As data grows and as AI expertise increases, I believe it is important to think about how this will affect certain areas of my life and how I should act now in preparation for the future.
This is my very first blog post.
My wife and I just moved to Sydney last year. We are still getting used to the lifestyle here, but by God’s grace, things have been going well. Before coming here, we made the big decision of leaving our comfort zone to study abroad (and hopefully, in the process, be closer to doing what we believe is God’s plan for our lives). However, things did not go as planned and we got delayed for a few months, but now I think we are sort of back on track.
I am starting my PhD in Australia in the latter part of the year, and I have read from other posts (link) that starting a blog increases your chances of successfully finishing a PhD, which is why I am starting this one. My friends know that I am not really keen on social media and writing. I personally think I am bad at expressing myself, sharing my thoughts, and telling stories. However, just because I am bad at something doesn’t mean I should cower in fear and stay bad at it. As one of my favorite quotes says, “Courage is not the absence of fear, but doing the right thing in spite of one’s fear”. So here goes my first blog post!
Note to self: I hope I don’t become a “Mikka Bozu (三日坊主)”, a Japanese saying for those who start something with intense passion but lose interest quickly.