diff --git a/slides/NLP-linear.pdf b/slides/NLP-linear.pdf
index 166a63f..35516be 100644
Binary files a/slides/NLP-linear.pdf and b/slides/NLP-linear.pdf differ
diff --git a/slides/NLP-linear.tex b/slides/NLP-linear.tex
index 32a3c1b..471f25e 100644
--- a/slides/NLP-linear.tex
+++ b/slides/NLP-linear.tex
@@ -263,16 +263,28 @@
 \end{scriptsize}
 \end{frame}
+
 \begin{frame}{The Sigmoid function}
 \begin{figure}[htb]
 \centering
-\includegraphics[scale=0.3]{pics/sigmoid.png}
+\includegraphics[scale=0.5]{pics/sigmoid.png}
 \end{figure}
+\end{frame}
+
+
+
+\begin{frame}{The Sigmoid function}
+
 \begin{scriptsize}
 \begin{itemize}
-\item The output $f(\vec{x})$ is in the range $[-\infty,\infty]$ , and we map it to one of two classes $\{-1,+1\}$ using the $sign$ function.
-\item This is a good fit if all we care about is the assigned class.
-\item We may be interested also in the confidence of the decision, or the probability that the classifier assigns to the class.
+\item The sigmoid function is monotonically increasing, and maps values
+to the range $[0, 1]$, with $0$ being mapped to $\frac{1}{2}$.
+\item When used with a suitable loss function (discussed later), the binary predictions made through the log-linear model can be interpreted as class membership probability estimates:
+\begin{equation}
+  \sigma(f(\vec{x})) = P(\hat{y} = 1 | \vec{x}), \quad \text{the probability of $\vec{x}$ belonging to the positive class.}
+\end{equation}
+\item We also get $P(\hat{y} = 0 | \vec{x}) = 1 - P(\hat{y} = 1 | \vec{x}) = 1 - \sigma(f(\vec{x}))$.
+\item The closer the value is to $0$ or $1$, the more certain the model is in its class membership prediction, with the value $0.5$ indicating model uncertainty (a code sketch follows the multi-class slides).
 \end{itemize}
 \end{scriptsize}
 \end{frame}
@@ -280,6 +292,72 @@
+\begin{frame}{Multi-class Classification}
+
+\begin{scriptsize}
+\begin{itemize}
+\item Most classification problems are of a multi-class nature, in which we
+should assign an example to one of $k$ different classes.
+\item For example, we are given a document and asked to classify it into one of six possible languages: English, French, German, Italian, Spanish, Other.
+\item A possible solution is to consider six weight vectors $\vec{w}_{EN}, \vec{w}_{FR}, \dots$ and biases, one for each
+language, and predict the language resulting in the highest score:
+\begin{equation}
+  \hat{y} = f(\vec{x}) = \operatorname{argmax}_{L \in \{ EN,FR,GR,IT,SP,O \}} \quad \vec{x} \cdot \vec{w}_L + b_{L}
+\end{equation}
+\end{itemize}
+\end{scriptsize}
+\end{frame}
+
+
+
+\begin{frame}{Multi-class Classification}
+
+\begin{scriptsize}
+\begin{itemize}
+\item The six sets of parameters $\vec{w}_L \in \mathcal{R}^{784}$ and $b_L$ can be arranged as a matrix $W \in \mathcal{R}^{784\times6}$ and a vector $\vec{b} \in \mathcal{R}^6$, and the equation re-written as:
+\begin{equation}
+  \begin{split}
+  \vec{\hat{y}} = f(\vec{x}) = \quad & \vec{x} \cdot W + \vec{b}\\
+  \text{prediction} = \hat{y} = \quad & \operatorname{argmax}_i \vec{\hat{y}}_{[i]}
+  \end{split}
+\end{equation}
+
+\item Here $\vec{\hat{y}} \in \mathcal{R}^6$ is a vector of the scores assigned by the model to each language, and we again determine the predicted language by taking the argmax over the entries of $\vec{\hat{y}}$ (see the code sketches on the following slides).
+
+\end{itemize}
+\end{scriptsize}
+\end{frame}
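+
+
+
+\begin{frame}[fragile]{The Sigmoid function: a minimal sketch}
+\begin{scriptsize}
+A minimal NumPy sketch of the probabilistic reading of the binary model,
+with random stand-ins for a trained $\vec{w}$ and $b$, so the resulting
+numbers are illustrative only:
+\begin{verbatim}
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+rng = np.random.default_rng(0)
+x = rng.random(784)        # letter-bigram counts (stand-in)
+x /= x.sum()               # normalize the counts
+w = rng.normal(size=784)   # stand-in for trained weights
+b = 0.0                    # stand-in for a trained bias
+
+score = x.dot(w) + b       # f(x), any real number
+p_pos = sigmoid(score)     # P(y-hat = 1 | x)
+p_neg = 1.0 - p_pos        # P(y-hat = 0 | x)
+\end{verbatim}
+\end{scriptsize}
+\end{frame}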
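+
+
+
+\begin{frame}[fragile]{Multi-class Classification: a minimal sketch}
+\begin{scriptsize}
+The matrix form of the six-language scorer as a minimal NumPy sketch;
+$W$ and $\vec{b}$ are again random stand-ins rather than trained
+parameters, so only the shapes and the argmax step carry over:
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(1)
+LANGS = ["EN", "FR", "GR", "IT", "SP", "O"]
+
+x = rng.random(784); x /= x.sum()  # bigram features
+W = rng.normal(size=(784, 6))      # a column per language
+b = rng.normal(size=6)             # a bias per language
+
+y_hat = x.dot(W) + b               # six language scores
+pred = LANGS[int(np.argmax(y_hat))]
+\end{verbatim}
+\end{scriptsize}
+\end{frame}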
+
+
+\begin{frame}{Representations}
+
+\begin{scriptsize}
+\begin{itemize}
+\item Consider the vector $\vec{\hat{y}}$ resulting from applying a trained model to a document.
+\item The vector can be considered as a representation of the document, capturing the properties of the document that are important to us, namely the scores of the different languages.
+\item The representation $\vec{\hat{y}}$ contains strictly more information than the prediction $\operatorname{argmax}_i \vec{\hat{y}}_{[i]}$.
+\item For example, $\vec{\hat{y}}$ can be used to distinguish documents in which the main language is German, but which also contain a sizeable amount of French words.
+\item By clustering documents based on their vector representations as assigned by the model, we could perhaps discover documents written in regional dialects, or by multilingual authors.
+
+
+\end{itemize}
+\end{scriptsize}
+\end{frame}
+
+
+\begin{frame}{Representations}
+
+\begin{scriptsize}
+\begin{itemize}
+\item The vectors $\vec{x}$ containing the normalized letter-bigram counts for the documents are also representations of the documents, arguably containing a similar kind of information to the vectors $\vec{\hat{y}}$.
+\item However, the representation $\vec{\hat{y}}$ is more compact (6 entries instead of 784) and more specialized for the language prediction objective.
+\item Clustering by the vectors $\vec{x}$ would likely reveal document similarities that are not due to a particular mix of languages, but perhaps due to the documents' topics or writing styles (see the sketch on the next slide).
+\end{itemize}
+\end{scriptsize}
+\end{frame}
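+
+
+\begin{frame}[fragile]{Representations: a clustering sketch}
+\begin{scriptsize}
+A minimal sketch of clustering documents by their score vectors
+$\vec{\hat{y}}$, here with scikit-learn's KMeans as one possible
+clustering choice; documents and model parameters are random
+stand-ins, so the resulting clusters carry no real meaning:
+\begin{verbatim}
+import numpy as np
+from sklearn.cluster import KMeans
+
+rng = np.random.default_rng(2)
+W = rng.normal(size=(784, 6))   # stand-in model
+b = rng.normal(size=6)
+
+docs = rng.random((100, 784))   # 100 documents
+docs /= docs.sum(axis=1, keepdims=True)
+Y = docs.dot(W) + b             # 100 x 6 representations
+
+labels = KMeans(n_clusters=5, n_init=10).fit_predict(Y)
+\end{verbatim}
+\end{scriptsize}
+\end{frame}
+
+
 \begin{frame}{Training}
 \begin{scriptsize}
 \begin{itemize}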