Given a dataset \mathcal{D}=\{ (x_n, y_n) \}_{n=1}^N, where x_n \in \mathbb{R}^D and y_n \in \{0, 1\}, the logistic regression model is described as:
\begin{align*}
f_n &= w_0 + w_1 x_{n,1} + w_2 x_{n,2} + \dots + w_D x_{n,D} \\
y_n &\sim \mathrm{Bern} \left( \sigma( f_n ) \right) ,\quad \sigma(z) = 1 / (1 + e^{-z})
\end{align*}
where \mathrm{Bern}(p) is a Bernoulli distribution with parameter p \in [0, 1], and x_{n,d} denotes the d-th component of x_n. In what follows we absorb the bias by prepending a constant feature x_{n,0} = 1 to each x_n, so that f_n = w^\top x_n.
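As a concrete illustration, the forward pass of this model might be written as follows in NumPy (a minimal sketch; the design matrix X is assumed to already contain the constant column of ones, and the function names are mine):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # X: (N, D+1) design matrix with a leading column of ones (bias),
    # w: (D+1,) weight vector.
    # Returns the Bernoulli parameters p_n = sigma(w^T x_n).
    return sigmoid(X @ w)
```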
The log likelihood of the model is
\begin{align*}
\mathcal{L}(w) &= \ln \prod_{n=1}^N p(y_n \mid x_n, w) \\
&= \ln \prod_{n=1}^N \mathrm{Bern}(y_n \mid \sigma(f_n)) \\
&= \ln \prod_{n=1}^N \mathrm{Bern}(y_n \mid \sigma(w^\top x_n)) \\
&= \ln \prod_{n=1}^N \sigma(w^\top x_n)^{y_n} (1 - \sigma(w^\top x_n))^{1 - y_n} \\
&= \sum_{n=1}^N \left\{ \ln \sigma(w^\top x_n)^{y_n} + \ln (1 - \sigma(w^\top x_n))^{1 - y_n} \right\} \\
&= \sum_{n=1}^N \left\{ y_n \ln \sigma(w^\top x_n) + (1 - y_n) \ln (1 - \sigma(w^\top x_n)) \right\} \\
\end{align*}
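The last line maps directly to code; a sketch reusing sigmoid from above (the small eps is my own guard against taking the log of exactly 0 or 1):

```python
def log_likelihood(X, y, w):
    # L(w) = sum_n { y_n ln sigma(w^T x_n) + (1 - y_n) ln(1 - sigma(w^T x_n)) }
    p = sigmoid(X @ w)
    eps = 1e-12  # avoid log(0) when p saturates at 0 or 1
    return np.sum(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
```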
The gradient of \mathcal{L}(w) is
\begin{align*}
\frac{\partial \mathcal{L}}{\partial w}
&= \sum_{n=1}^N \left\{
y_n \frac{1}{\sigma(w^\top x_n)} \frac{\partial}{\partial w} \sigma(w^\top x_n)
+ (1 - y_n) \frac{1}{(1 - \sigma(w^\top x_n))} \frac{\partial}{\partial w} (1 - \sigma(w^\top x_n))
\right\} \\
&= \sum_{n=1}^N \left\{
y_n \frac{1}{\sigma(w^\top x_n)} \frac{\partial}{\partial w} \sigma(w^\top x_n)
- (1 - y_n) \frac{1}{(1 - \sigma(w^\top x_n))} \frac{\partial}{\partial w} \sigma(w^\top x_n)
\right\} \\
&= \sum_{n=1}^N \left\{
y_n \frac{1}{\sigma(w^\top x_n)}
- (1 - y_n) \frac{1}{(1 - \sigma(w^\top x_n))}
\right\} \frac{\partial}{\partial w} \sigma(w^\top x_n)\\
&\overset{\flat}{=} \sum_{n=1}^N \left\{
y_n \frac{1}{\sigma(w^\top x_n)}
- (1 - y_n) \frac{1}{(1 - \sigma(w^\top x_n))}
\right\} (x_n \sigma(w^\top x_n) (1 - \sigma(w^\top x_n))) \\
&= \sum_{n=1}^N \left\{
y_n x_n (1 - \sigma(w^\top x_n))
- (1 - y_n) x_n \sigma(w^\top x_n)
\right\} \\
&= \sum_{n=1}^N \left\{
y_n x_n
- \cancel{y_n x_n \sigma(w^\top x_n)}
- x_n \sigma(w^\top x_n)
+ \cancel{y_n x_n \sigma(w^\top x_n)}
\right\} \\
&= \sum_{n=1}^N (y_n - \sigma(w^\top x_n)) x_n \\
\end{align*}
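In vectorized form, the final expression is a single matrix product; a sketch under the same assumptions as above:

```python
def log_likelihood_grad(X, y, w):
    # dL/dw = sum_n (y_n - sigma(w^T x_n)) x_n, stacked into X^T (y - p).
    return X.T @ (y - sigmoid(X @ w))
```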
The derivation of the step marked \overset{\flat}{=} is as follows.
\begin{align*}
\frac{\partial}{\partial w} \sigma(w^\top x_n) &= \frac{\partial}{\partial w} \frac{1}{1 + e^{- w^\top x_n}} \\
&= \frac{- \frac{\partial}{\partial w} \left( 1 + e^{- w^\top x_n} \right)}{(1 + e^{- w^\top x_n})^2} \\
&= \frac{ x_n e^{- w^\top x_n}}{(1 + e^{- w^\top x_n})^2} \\
&= x_n \frac{1}{(1 + e^{- w^\top x_n})} \frac{e^{- w^\top x_n}}{(1 + e^{- w^\top x_n})} \\
&= x_n \sigma(w^\top x_n) (1 - \sigma(w^\top x_n)) \\
\end{align*}
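The scalar identity \sigma'(z) = \sigma(z)(1 - \sigma(z)) used here is easy to sanity-check with a finite difference (a throwaway check reusing sigmoid from above; the point z = 0.3 and step h are arbitrary):

```python
z, h = 0.3, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)   # central difference
analytic = sigmoid(z) * (1.0 - sigmoid(z))
print(numeric, analytic)  # the two values should agree to many decimal places
```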
As in linear regression, you can also add an L2 regularization term with a constant \lambda > 0,
\begin{align*}
\mathcal{L}_{\rm{ridge}}(w) &= \mathcal{L}(w) - \lambda \|w\|^2 \\
\frac{\partial \mathcal{L}_{\rm{ridge}}}{\partial w} &= \sum_{n=1}^N (y_n - \sigma(w^\top x_n)) x_n - \lambda w \\
\end{align*}
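In code, the only change relative to log_likelihood_grad above is the extra -\lambda w term (lam is my name for \lambda):

```python
def log_likelihood_ridge_grad(X, y, w, lam):
    # dL_ridge/dw = sum_n (y_n - sigma(w^T x_n)) x_n - lambda * w
    return X.T @ (y - sigmoid(X @ w)) - lam * w
```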
It should be noted that \lambda \|w\|^2 is subtracted this time because \mathcal{L} is a log likelihood to be maximized, not a loss to be minimized.
In order to optimize w, we can use a gradient ascent update such as
\begin{align*}
w \leftarrow w + \eta \frac{\partial \mathcal{L}_{\rm ridge}(w)}{\partial w}
\end{align*}
where \eta > 0 is a learning rate.
A sample implementation and a notebook.
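Combining the pieces above, a minimal gradient-ascent loop might look like the following sketch (this is not the referenced implementation; the learning rate, iteration count, regularization strength, and synthetic data are arbitrary illustrative choices):

```python
def fit_logistic_ridge(X, y, lam=0.1, eta=0.01, n_iters=1000):
    # Gradient ascent on the ridge-regularized log likelihood:
    #   w <- w + eta * dL_ridge/dw
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w += eta * log_likelihood_ridge_grad(X, y, w, lam)
    return w

# Tiny usage example on synthetic, roughly linearly separable data.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
y = (X_raw[:, 0] + X_raw[:, 1] > 0).astype(float)
X = np.hstack([np.ones((200, 1)), X_raw])  # prepend the bias column
w_hat = fit_logistic_ridge(X, y)
print(((predict_proba(X, w_hat) > 0.5) == y).mean())  # training accuracy
```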