Deriving the cost function using MLE: Why use the log function?

Let's try to derive, from first principles, why the logarithm appears in the cost function of logistic regression.

So we have a dataset $\mathbf{X}$ consisting of $m$ data points and $n$ features, and a class variable $\mathbf{y}$, a vector of length $m$ whose entries take the two values 1 or 0.

Now logistic regression says that the probability that the class variable takes the value $y_i = 1$, for $i = 1, 2, \ldots, m$, can be modelled as follows:

$$ P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{(- \theta^T \mathbf{x}_i)}} $$

so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.
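As a quick illustration (not from the original answer; the array shapes and names such as `h_theta` are my own assumptions), this hypothesis can be written in NumPy like so:

```python
import numpy as np

def h_theta(theta, X):
    """Logistic hypothesis: P(y=1 | x; theta) = 1 / (1 + exp(-theta^T x)).

    X is assumed to be an (m, n) design matrix and theta an (n,) vector,
    so the result is a vector of m probabilities in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-X @ theta))

# Tiny illustration with made-up numbers.
theta = np.array([0.5, -1.0])
X = np.array([[1.0, 2.0],
              [3.0, 0.5]])
print(h_theta(theta, X))  # two probabilities, each between 0 and 1
```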

These two cases can be combined into a single equation as follows (in other words, $y_i$ follows a Bernoulli distribution):

$$ P(y_i ) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$

$P(y_i)$ is known as the likelihood of the single data point $\mathbf{x}_i$, i.e. given the value of $y_i$, the probability of $\mathbf{x}_i$ occurring; it is the conditional probability $P(\mathbf{x}_i | y_i)$.
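A tiny numerical check (my own illustration, not part of the original argument) that the single expression above really does pick out $h_{\theta}(\mathbf{x}_i)$ when $y_i = 1$ and $1 - h_{\theta}(\mathbf{x}_i)$ when $y_i = 0$:

```python
def bernoulli_pmf(y, p):
    """P(y) = p^y * (1 - p)^(1 - y) for y in {0, 1}."""
    return p**y * (1 - p)**(1 - y)

p = 0.8  # an arbitrary value standing in for h_theta(x_i)
print(bernoulli_pmf(1, p))  # 0.8 -> equals p when y = 1
print(bernoulli_pmf(0, p))  # 0.2 -> equals 1 - p when y = 0
```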

The likelihood of the entire dataset $\mathbf{X}$ is the product of the individual data point likelihoods. Thus

$$ P(\mathbf{X}|\mathbf{y}) = \prod_{i=1}^{m} P(\mathbf{x}_i | y_i) = \prod_{i=1}^{m} h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$

Now the principle of maximum likelihood says that we choose the parameters $\theta$ that maximise the likelihood $P(\mathbf{X}|\mathbf{y})$.

As mentioned in the comment, logarithms are used because they convert products into sums and, being monotone increasing functions, do not change where the maximum is attained. Here too the likelihood has a product form, so we take the natural logarithm: maximising the likelihood is the same as maximising the log-likelihood. The log-likelihood $L(\theta)$ is then:

$$ L(\theta) = \log(P(\mathbf{X}|\mathbf{y})) = \sum_{i=1}^{m} \left[ y_i \log(h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log(1 - h_{\theta}(\mathbf{x}_i)) \right] $$
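Here is a small sketch (my own, with made-up probabilities standing in for the per-point likelihoods) of how the product form can underflow numerically while the equivalent sum of logs stays perfectly well behaved:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1000
# Made-up per-example probabilities P(y_i); the specific numbers
# are only for illustration.
p = rng.uniform(0.05, 0.95, size=m)

likelihood = np.prod(p)              # product form
log_likelihood = np.sum(np.log(p))   # equivalent sum-of-logs form

print(likelihood)      # 0.0 -- double precision cannot represent a number this small
print(log_likelihood)  # a finite negative number, easy to work with
```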

Since in linear regression we found the $\theta$ that minimizes our cost function, here too, for the sake of consistency, we would like to have a minimization problem, and we want the average cost over all the data points. Currently we have a maximization of $L(\theta)$. Maximizing $L(\theta)$ is equivalent to minimizing $-L(\theta)$, and taking the average over all data points, the cost function for logistic regression comes out to be:

$$ J(\theta) = - \dfrac{1}{m} L(\theta)$$

$$ = - \dfrac{1}{m} \left( \sum_{i=1}^{m} y_i \log (h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$$
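As a minimal sketch (my own, with made-up data; `h_theta` and `cost` are assumed names, not from the post), this cost function can be written directly from the formula above:

```python
import numpy as np

def h_theta(theta, X):
    """Logistic hypothesis, applied row-wise to an (m, n) matrix X."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    m = len(y)
    h = h_theta(theta, X)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Made-up data purely for illustration.
X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1, 0, 1])
theta = np.zeros(2)
print(cost(theta, X, y))  # log(2) ~ 0.693, since h = 0.5 everywhere at theta = 0
```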

Now we can also understand why the cost for a single data point comes out as follows:

The cost for a single data point is $-\log( P(\mathbf{x}_i | y_i))$, which can be written as $- \left ( y_i \log (h_{\theta}(\mathbf{x}_i)) + (1 - y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$.

We can now split the above into two cases depending on the value of $y_i$. Thus we get:

$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=1$, and

$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (1 - h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=0 $.
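A short sketch (my own illustration) of these two branches, showing how the cost stays small when the prediction agrees with the label and blows up when a confident prediction is wrong:

```python
import numpy as np

def single_cost(h, y):
    """Per-example cost: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1 - h)

for h in [0.99, 0.9, 0.5, 0.1, 0.01]:
    print(f"h={h:>4}:  cost(y=1)={single_cost(h, 1):6.3f}  "
          f"cost(y=0)={single_cost(h, 0):6.3f}")
# As h -> 1 the cost for y=1 goes to 0 while the cost for y=0 blows up,
# and vice versa as h -> 0.
```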


To calculate the gradient, we use the chain rule:

$$ \dfrac{dJ(\theta)}{d\theta} = \dfrac{dJ(\theta)}{dh_\theta (x^{(i)})}\,\dfrac{dh_\theta (x^{(i)})}{dZ}\,\dfrac{dZ}{d\theta} $$

where $Z = \theta^T x^{(i)}$. In your case, you are using the cross-entropy as the cost function:

$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)})) \right] $$

Below is the derivation of the gradient of the cross-entropy.

$$ \dfrac{dJ(\theta)}{dh_\theta (x^{(i)})} = -\dfrac{1}{m} \Big[\dfrac{y^{(i)}}{h_\theta (x^{(i)})}-\dfrac{1-y^{(i)}}{1-h_\theta (x^{(i)})}\Big] $$

$$ = -\dfrac{1}{m} \Big[\dfrac{y^{(i)}(1-h_\theta (x^{(i)})) - (1-y^{(i)})\,h_\theta (x^{(i)})}{h_\theta (x^{(i)})(1-h_\theta (x^{(i)}))}\Big] $$

$$ = -\dfrac{1}{m} \Big[\dfrac{y^{(i)}-y^{(i)} h_\theta (x^{(i)}) - h_\theta (x^{(i)})+y^{(i)} h_\theta (x^{(i)})}{h_\theta (x^{(i)})(1-h_\theta (x^{(i)}))}\Big] $$

$$ = \dfrac{1}{m} \Big[\dfrac{h_\theta (x^{(i)}) - y^{(i)}}{h_\theta (x^{(i)})(1-h_\theta (x^{(i)}))} \Big] $$

Considering that the derivative of the sigmoid function is

$$ \dfrac{dh_\theta (x^{(i)})}{dZ} = h_\theta (x^{(i)})(1-h_\theta (x^{(i)})) $$

and that $\dfrac{dZ}{d\theta} = x^{(i)}$, the gradient contributed by the $i$-th data point is:

$$ \dfrac{dJ(\theta)}{dh_\theta (x^{(i)})}\,\dfrac{dh_\theta (x^{(i)})}{dZ}\,\dfrac{dZ}{d\theta} = \dfrac{1}{m} \Big[\dfrac{h_\theta (x^{(i)}) - y^{(i)}}{h_\theta (x^{(i)})(1-h_\theta (x^{(i)}))} \Big] \, h_\theta (x^{(i)})(1-h_\theta (x^{(i)})) \, x^{(i)} $$

$$ = \dfrac{1}{m}\,(h_\theta (x^{(i)}) - y^{(i)})\, x^{(i)} $$

Summing over all $m$ data points gives the full gradient $\dfrac{dJ(\theta)}{d\theta} = \dfrac{1}{m} \sum_{i=1}^{m} (h_\theta (x^{(i)}) - y^{(i)})\, x^{(i)}$.
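Here is a sketch (my own; the made-up data and the finite-difference check are not part of the original derivation) of this gradient, verified numerically against the cost function:

```python
import numpy as np

def h_theta(theta, X):
    return 1.0 / (1.0 + np.exp(-X @ theta))

def cost(theta, X, y):
    m = len(y)
    h = h_theta(theta, X)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    """dJ/dtheta = (1/m) * X^T (h - y), i.e. (1/m) * sum_i (h_i - y_i) x_i."""
    m = len(y)
    return X.T @ (h_theta(theta, X) - y) / m

# Finite-difference check on made-up data (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = (rng.uniform(size=20) < 0.5).astype(float)
theta = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(gradient(theta, X, y), num_grad, atol=1e-6))  # True
```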


The answer by @user76170 should be correct. But there is one thing I don't understand, and as I am new here I am not allowed to comment, so I hope it is all right to write it here.

I think what @user76170 means by

$p(y_i)=h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i}$

is:

$$ p(y_i|x_i;\theta)=h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i} $$

And I don't understand why he/she changes the above to the conditional probability $P(x_i|y_i)$. As I have read from here: http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf

The expression of the log-likelihood of the parameters can be written as:

$$ L(\theta) = \log \Big(\prod_{i=1}^{m}p(y_i|x_i;\theta)\Big) $$
$$ L(\theta) = \sum_{i=1}^{m}\log\big(p(y_i|x_i;\theta)\big) $$
$$ L(\theta) = \sum_{i=1}^{m} y_i\log(h_\theta(x_i))+(1-y_i)\log(1-h_\theta(x_i)) $$

After that, in my schoolwork, we use the first and second derivatives to find the value of $\theta$ at which the above log-likelihood reaches its maximum, and we call that the MLE method.
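As a sketch of that MLE-by-derivatives approach (my own illustration, assuming a standard Newton-Raphson update built from the gradient and Hessian of the log-likelihood; the data and names here are made up, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_mle(X, y, iters=15):
    """Maximise the log-likelihood using its first and second derivatives.

    gradient  dL/dtheta    = X^T (y - h)
    Hessian   d^2L/dtheta^2 = -X^T diag(h (1 - h)) X
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        hess = -(X.T * (h * (1 - h))) @ X
        # Newton step: theta_new = theta - H^{-1} grad
        theta -= np.linalg.solve(hess, grad)
    return theta

# Made-up data for illustration only.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
true_theta = np.array([-0.5, 2.0])
y = (rng.uniform(size=100) < sigmoid(X @ true_theta)).astype(float)
print(newton_mle(X, y))  # should land reasonably close to true_theta
```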

Here, however, we want to use the cost-function approach, so we need a cost function such that minimising it maximises the log-likelihood. Hence, we just need to add a minus sign, as suggested by @user76170. For a reason I don't know, the cost function given below and in the lecture is the average of the cost over all data points:

$$ J(\theta)=-\frac{1}{m}L(\theta) $$

Why do we have to divide it by $m$? Will that make the gradient descent method go quicker? Probably, I should raise a question.
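For what it's worth, a quick check (my own, using the gradient derived earlier in the thread) that dividing by $m$ only rescales the gradient by a constant factor, so it does not change which $\theta$ minimises the cost; it mainly keeps the cost and the step size on a comparable scale regardless of how many data points there are:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(theta, X, y, average=True):
    """Gradient of the negative log-likelihood, optionally divided by m."""
    g = X.T @ (sigmoid(X @ theta) - y)
    return g / len(y) if average else g

# Made-up data for illustration only.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = (rng.uniform(size=50) < 0.5).astype(float)
theta = rng.normal(size=2)

g_avg = grad(theta, X, y, average=True)
g_sum = grad(theta, X, y, average=False)
print(np.allclose(g_sum, 50 * g_avg))  # True: same direction, just rescaled
```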