Logistic Regression
Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/∼jiali
Preserve linear classification boundaries. By the Bayes rule:
$$\hat{G}(x) = \arg\max_k \Pr(G = k \mid X = x).$$
The decision boundary between class $k$ and class $l$ is determined by the equation
$$\Pr(G = k \mid X = x) = \Pr(G = l \mid X = x).$$
Divide both sides by $\Pr(G = l \mid X = x)$ and take the log. The above equation is equivalent to
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = 0.$$
Since we enforce a linear boundary, we can assume
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = a_0^{(k,l)} + \sum_{j=1}^{p} a_j^{(k,l)} x_j.$$
For logistic regression, there are restrictive relations between $a^{(k,l)}$ for different pairs of $(k, l)$.
Assumptions
$$\log \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{10} + \beta_1^T x$$
$$\log \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{20} + \beta_2^T x$$
$$\vdots$$
$$\log \frac{\Pr(G = K-1 \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$$
For any pair $(k, l)$, subtracting the two log-odds taken against class $K$ gives
$$\log \frac{\Pr(G = k \mid X = x)}{\Pr(G = l \mid X = x)} = \beta_{k0} - \beta_{l0} + (\beta_k - \beta_l)^T x.$$
Number of parameters: $(K-1)(p+1)$.
Denote the entire parameter set by
$$\theta = \{\beta_{10}, \beta_1, \beta_{20}, \beta_2, \ldots, \beta_{(K-1)0}, \beta_{K-1}\}.$$
The log ratios of the posterior probabilities are called log-odds or logit transformations.
Under the assumptions, the posterior probabilities are given by
$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)} \quad \text{for } k = 1, \ldots, K-1,$$
$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}.$$
For $\Pr(G = k \mid X = x)$ given above, the probabilities obviously sum up to 1: $\sum_{k=1}^{K} \Pr(G = k \mid X = x) = 1$. A simple calculation shows that the assumptions are satisfied.
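As a quick illustration of these formulas, here is a minimal NumPy sketch that evaluates all $K$ posterior probabilities for one input; the function name and the array layout (intercepts and coefficients of the $K-1$ non-reference classes passed separately, $x$ without a leading 1) are my own choices rather than anything prescribed by the slides.

```python
import numpy as np

def posterior_probs(x, intercepts, coefs):
    """Posterior probabilities Pr(G = k | X = x) under the logistic model.

    intercepts: shape (K-1,), holding beta_{k0}
    coefs:      shape (K-1, p), holding beta_k
    Returns an array of length K; the last entry is the reference class K.
    """
    eta = intercepts + coefs @ x          # beta_{k0} + beta_k^T x, k = 1, ..., K-1
    expo = np.exp(eta)
    denom = 1.0 + expo.sum()
    return np.append(expo, 1.0) / denom   # sums to 1 by construction
```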
Comparison with LR on Indicators
Similarities:
Both attempt to estimate $\Pr(G = k \mid X = x)$.
Both have linear classification boundaries.
Difference:
Linear regression on the indicator matrix approximates $\Pr(G = k \mid X = x)$ by a linear function of $x$; the approximation is not guaranteed to fall between 0 and 1, nor to sum up to 1.
In logistic regression, $\Pr(G = k \mid X = x)$ is a nonlinear function of $x$; it is guaranteed to range from 0 to 1 and to sum up to 1.
Fitting Logistic Regression Models
Criterion: find the parameters that maximize the conditional likelihood of $G$ given $X$ using the training data.
Denote $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$.
Given the first input $x_1$, the posterior probability of its class being $g_1$ is $\Pr(G = g_1 \mid X = x_1)$.
Since the samples in the training data set are independent, the posterior probability for the $N$ samples each having class $g_i$, $i = 1, 2, \ldots, N$, given their inputs $x_1, x_2, \ldots, x_N$, is
$$\prod_{i=1}^{N} \Pr(G = g_i \mid X = x_i).$$
The conditional log-likelihood of the class labels in the training data set is
$$L(\theta) = \sum_{i=1}^{N} \log \Pr(G = g_i \mid X = x_i) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta).$$
Binary Classification
For binary classification, if $g_i = 1$, denote $y_i = 1$; if $g_i = 2$, denote $y_i = 0$.
Let $p_1(x; \theta) = p(x; \theta)$; then
$$p_2(x; \theta) = 1 - p_1(x; \theta) = 1 - p(x; \theta).$$
Since $K = 2$, the parameters are $\theta = \{\beta_{10}, \beta_1\}$. We denote $\beta = (\beta_{10}, \beta_1)^T$.
If $y_i = 1$, i.e., $g_i = 1$,
$$\log p_{g_i}(x; \beta) = \log p_1(x; \beta) = 1 \cdot \log p(x; \beta) = y_i \log p(x; \beta).$$
If $y_i = 0$, i.e., $g_i = 2$,
$$\log p_{g_i}(x; \beta) = \log p_2(x; \beta) = 1 \cdot \log(1 - p(x; \beta)) = (1 - y_i) \log(1 - p(x; \beta)).$$
Since either $y_i = 0$ or $1 - y_i = 0$, we have
$$\log p_{g_i}(x; \beta) = y_i \log p(x; \beta) + (1 - y_i) \log(1 - p(x; \beta)).$$
The conditional log-likelihood is
$$L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta) = \sum_{i=1}^{N} \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta)) \right].$$
There are $p + 1$ parameters in $\beta = (\beta_{10}, \beta_1)^T$. Assume a column vector form for $\beta$:
$$\beta = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \beta_{12} \\ \vdots \\ \beta_{1,p} \end{pmatrix}.$$
Here we add the constant term 1 to $x$ to accommodate the intercept:
$$x = \begin{pmatrix} 1 \\ x_{,1} \\ x_{,2} \\ \vdots \\ x_{,p} \end{pmatrix}.$$
By the assumption of the logistic regression model,
$$p(x; \beta) = \Pr(G = 1 \mid X = x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)},$$
$$1 - p(x; \beta) = \Pr(G = 2 \mid X = x) = \frac{1}{1 + \exp(\beta^T x)}.$$
Substituting the above into $L(\beta)$ gives
$$L(\beta) = \sum_{i=1}^{N} \left[ y_i \beta^T x_i - \log\left(1 + e^{\beta^T x_i}\right) \right].$$
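This simplified form is easy to evaluate directly. The following is a minimal sketch, assuming the rows of the input matrix already carry the leading 1 from the previous slide; the function name is mine, and np.log1p is used only for numerical convenience.

```python
import numpy as np

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i beta^T x_i - log(1 + exp(beta^T x_i)) ].

    X: (N, p+1) matrix whose i-th row is x_i with a leading 1.
    y: (N,) vector of 0/1 labels.
    """
    eta = X @ beta                                  # beta^T x_i for every sample
    return np.sum(y * eta - np.log1p(np.exp(eta)))
```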
To maximize $L(\beta)$, we set the first-order partial derivatives of $L(\beta)$ to zero:
$$\frac{\partial L(\beta)}{\partial \beta_{1j}} = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} \frac{x_{ij} e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} = \sum_{i=1}^{N} y_i x_{ij} - \sum_{i=1}^{N} p(x_i; \beta) x_{ij} = \sum_{i=1}^{N} x_{ij} \left( y_i - p(x_i; \beta) \right)$$
for all $j = 0, 1, \ldots, p$.
In matrix form, we write
$$\frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^{N} x_i \left( y_i - p(x_i; \beta) \right).$$
To solve the set of $p + 1$ nonlinear equations $\partial L(\beta) / \partial \beta_{1j} = 0$, $j = 0, 1, \ldots, p$, we use the Newton-Raphson algorithm.
The Newton-Raphson algorithm requires the second derivatives, i.e., the Hessian matrix:
$$\frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -\sum_{i=1}^{N} x_i x_i^T \, p(x_i; \beta)(1 - p(x_i; \beta)).$$
The element on the $j$th row and $n$th column is (counting from 0):
$$\frac{\partial^2 L(\beta)}{\partial \beta_{1j} \, \partial \beta_{1n}} = -\sum_{i=1}^{N} \frac{(1 + e^{\beta^T x_i}) e^{\beta^T x_i} x_{ij} x_{in} - (e^{\beta^T x_i})^2 x_{ij} x_{in}}{(1 + e^{\beta^T x_i})^2}$$
$$= -\sum_{i=1}^{N} \left[ x_{ij} x_{in} p(x_i; \beta) - x_{ij} x_{in} p(x_i; \beta)^2 \right]$$
$$= -\sum_{i=1}^{N} x_{ij} x_{in} \, p(x_i; \beta)(1 - p(x_i; \beta)).$$
Starting with $\beta^{\text{old}}$, a single Newton-Raphson update is
$$\beta^{\text{new}} = \beta^{\text{old}} - \left( \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} \right)^{-1} \frac{\partial L(\beta)}{\partial \beta},$$
where the derivatives are evaluated at $\beta^{\text{old}}$.
The iteration can be expressed compactly in matrix form.
Let $y$ be the column vector of $y_i$.
Let $X$ be the $N \times (p+1)$ input matrix.
Let $p$ be the $N$-vector of fitted probabilities with $i$th element $p(x_i; \beta^{\text{old}})$.
Let $W$ be an $N \times N$ diagonal matrix of weights with $i$th diagonal element $p(x_i; \beta^{\text{old}})(1 - p(x_i; \beta^{\text{old}}))$.
Then
$$\frac{\partial L(\beta)}{\partial \beta} = X^T (y - p), \qquad \frac{\partial^2 L(\beta)}{\partial \beta \, \partial \beta^T} = -X^T W X.$$
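These two matrix expressions translate line by line into NumPy. The sketch below is only illustrative; the function name is mine, and the dense diagonal $W$ mirrors the notation even though it is wasteful, a point the Computational Efficiency slides return to.

```python
import numpy as np

def gradient_and_hessian(beta_old, X, y):
    """Gradient X^T (y - p) and Hessian -X^T W X, evaluated at beta_old."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_old)))   # fitted probabilities p(x_i; beta_old)
    W = np.diag(p * (1.0 - p))                  # N x N diagonal weight matrix
    grad = X.T @ (y - p)
    hess = -X.T @ W @ X
    return grad, hess
```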
The Newton-Raphson step is
$$\beta^{\text{new}} = \beta^{\text{old}} + (X^T W X)^{-1} X^T (y - p) = (X^T W X)^{-1} X^T W \left( X \beta^{\text{old}} + W^{-1}(y - p) \right) = (X^T W X)^{-1} X^T W z,$$
where $z = X \beta^{\text{old}} + W^{-1}(y - p)$.
If $z$ is viewed as a response and $X$ as the input matrix, $\beta^{\text{new}}$ is the solution to a weighted least squares problem:
$$\beta^{\text{new}} \leftarrow \arg\min_{\beta} (z - X\beta)^T W (z - X\beta).$$
Recall that linear regression by least squares solves
$$\arg\min_{\beta} (z - X\beta)^T (z - X\beta).$$
$z$ is referred to as the adjusted response. The algorithm is referred to as iteratively reweighted least squares, or IRLS.
Pseudo Code
1. $\beta \leftarrow 0$.
2. Compute $y$ by setting its elements to
$$y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases}, \quad i = 1, 2, \ldots, N.$$
3. Compute $p$ by setting its elements to
$$p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}, \quad i = 1, 2, \ldots, N.$$
4. Compute the diagonal matrix $W$; the $i$th diagonal element is $p(x_i; \beta)(1 - p(x_i; \beta))$, $i = 1, 2, \ldots, N$.
5. $z \leftarrow X\beta + W^{-1}(y - p)$.
6. $\beta \leftarrow (X^T W X)^{-1} X^T W z$.
7. If the stopping criterion is met, stop; otherwise go back to step 3.
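A minimal NumPy implementation of this pseudo code, following the steps literally, might look as follows; the convergence test on the change in $\beta$ and the iteration cap are my own choices and are not specified by the slides.

```python
import numpy as np

def fit_logistic_irls(X, g, max_iter=100, tol=1e-8):
    """Binary logistic regression fitted by IRLS.

    X: (N, p+1) input matrix whose rows already include the leading 1.
    g: (N,) class labels taking the values 1 or 2.
    Returns the estimated coefficient vector beta of length p+1.
    """
    beta = np.zeros(X.shape[1])                   # step 1: beta <- 0
    y = (g == 1).astype(float)                    # step 2: y_i = 1 if g_i = 1, else 0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # step 3: p(x_i; beta)
        w = p * (1.0 - p)                         # step 4: diagonal entries of W
        W = np.diag(w)
        z = X @ beta + (y - p) / w                # step 5: z = X beta + W^{-1} (y - p)
        beta_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)   # step 6
        if np.max(np.abs(beta_new - beta)) < tol: # step 7: stopping criterion
            return beta_new
        beta = beta_new
    return beta
```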
Computational Efficiency
Since $W$ is an $N \times N$ diagonal matrix, direct matrix operations with it may be very inefficient. A modified pseudo code is provided next.
1. $\beta \leftarrow 0$.
2. Compute $y$ by setting its elements to
$$y_i = \begin{cases} 1 & \text{if } g_i = 1 \\ 0 & \text{if } g_i = 2 \end{cases}, \quad i = 1, 2, \ldots, N.$$
3. Compute $p$ by setting its elements to
$$p(x_i; \beta) = \frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}}, \quad i = 1, 2, \ldots, N.$$
4. Compute the $N \times (p+1)$ matrix $\tilde{X}$ by multiplying the $i$th row of the matrix $X$ by $p(x_i; \beta)(1 - p(x_i; \beta))$, $i = 1, 2, \ldots, N$:
$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix} \qquad \tilde{X} = \begin{pmatrix} p(x_1; \beta)(1 - p(x_1; \beta)) \, x_1^T \\ p(x_2; \beta)(1 - p(x_2; \beta)) \, x_2^T \\ \vdots \\ p(x_N; \beta)(1 - p(x_N; \beta)) \, x_N^T \end{pmatrix}$$
5. $\beta \leftarrow \beta + (X^T \tilde{X})^{-1} X^T (y - p)$.
6. If the stopping criterion is met, stop; otherwise go back to step 3.
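In NumPy the row scaling in step 4 is a broadcast multiplication, so the update step of the modified algorithm can be sketched as a single function; the names are mine, and the same $X$, $y$, $g$ conventions as in the sketch above are assumed.

```python
import numpy as np

def newton_step_efficient(beta, X, y):
    """One update of the modified algorithm: scale rows of X instead of forming the N x N W."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))          # step 3: fitted probabilities
    X_tilde = X * (p * (1.0 - p))[:, None]         # step 4: row i of X times p_i (1 - p_i)
    return beta + np.linalg.solve(X.T @ X_tilde, X.T @ (y - p))   # step 5
```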
Example
Diabetes data set.
The input $X$ is two dimensional. $X_1$ and $X_2$ are the two principal components of the original 8 variables.
Class 1: without diabetes; Class 2: with diabetes.
Applying logistic regression, we obtain
$$\beta = (0.7679, \; -0.6816, \; -0.3664)^T.$$
The posterior probabilities are
$$\Pr(G = 1 \mid X = x) = \frac{e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}},$$
$$\Pr(G = 2 \mid X = x) = \frac{1}{1 + e^{0.7679 - 0.6816 X_1 - 0.3664 X_2}}.$$
The classification rule is
$$\hat{G}(x) = \begin{cases} 1 & \text{if } 0.7679 - 0.6816 X_1 - 0.3664 X_2 \geq 0 \\ 2 & \text{if } 0.7679 - 0.6816 X_1 - 0.3664 X_2 < 0 \end{cases}$$
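Since the rule only checks the sign of the fitted linear predictor, it can be written as a one-line function; this is just an illustration of the rule above.

```python
def classify_diabetes(x1, x2):
    """Class 1 (without diabetes) if the fitted linear predictor is nonnegative, else class 2."""
    return 1 if 0.7679 - 0.6816 * x1 - 0.3664 * x2 >= 0 else 2
```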
Solid line: decision boundary obtained by logistic regression. Dashed line: decision boundary obtained by LDA.
Within the training data set, the classification error rate is 28.12%, the sensitivity is 45.9%, and the specificity is 85.8%.
Multiclass Case (K ≥ 3)
When $K \geq 3$, $\beta$ is a $(K-1)(p+1)$-vector:
$$\beta = \begin{pmatrix} \beta_{10} \\ \beta_1 \\ \beta_{20} \\ \beta_2 \\ \vdots \\ \beta_{(K-1)0} \\ \beta_{K-1} \end{pmatrix} = \begin{pmatrix} \beta_{10} \\ \beta_{11} \\ \vdots \\ \beta_{1p} \\ \beta_{20} \\ \vdots \\ \beta_{2p} \\ \vdots \\ \beta_{(K-1)0} \\ \vdots \\ \beta_{(K-1)p} \end{pmatrix}$$
Let $\bar{\beta}_l = \begin{pmatrix} \beta_{l0} \\ \beta_l \end{pmatrix}$.
The log-likelihood function becomes (with the convention $\bar{\beta}_K = 0$ for the reference class $K$)
$$L(\beta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \beta) = \sum_{i=1}^{N} \log \frac{e^{\bar{\beta}_{g_i}^T x_i}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} = \sum_{i=1}^{N} \left[ \bar{\beta}_{g_i}^T x_i - \log\left( 1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i} \right) \right].$$
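A sketch of this log-likelihood in NumPy, with the $K-1$ vectors $\bar{\beta}_l$ stacked as the rows of a matrix; the layout, the function name, and the zero-padding for the reference class are my own choices.

```python
import numpy as np

def multiclass_log_likelihood(Beta, X, g):
    """L(beta) = sum_i [ betabar_{g_i}^T x_i - log(1 + sum_l exp(betabar_l^T x_i)) ].

    Beta: (K-1, p+1) matrix whose l-th row is betabar_l.
    X:    (N, p+1) inputs, each row with a leading 1.
    g:    (N,) integer labels in {1, ..., K}; class K is the reference.
    """
    eta = X @ Beta.T                                    # (N, K-1) linear predictors
    eta_full = np.hstack([eta, np.zeros((len(g), 1))])  # betabar_K^T x_i = 0 for the reference class
    numer = eta_full[np.arange(len(g)), g - 1]          # betabar_{g_i}^T x_i
    log_denom = np.log1p(np.exp(eta).sum(axis=1))
    return np.sum(numer - log_denom)
```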
Note: the indicator function $I(\cdot)$ equals 1 when the argument is true and 0 otherwise.
First-order derivatives:
$$\frac{\partial L(\beta)}{\partial \beta_{kj}} = \sum_{i=1}^{N} \left[ I(g_i = k) \, x_{ij} - \frac{e^{\bar{\beta}_k^T x_i} x_{ij}}{1 + \sum_{l=1}^{K-1} e^{\bar{\beta}_l^T x_i}} \right] = \sum_{i=1}^{N} x_{ij} \left( I(g_i = k) - p_k(x_i; \beta) \right)$$
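The same derivative can be computed for all $k$ and $j$ at once, in the layout used by the previous sketch; the indicator-matrix construction and the function name are mine.

```python
import numpy as np

def multiclass_gradient(Beta, X, g):
    """dL / d beta_{kj} = sum_i x_{ij} ( I(g_i = k) - p_k(x_i; beta) ), k = 1, ..., K-1.

    Returns a (K-1, p+1) array whose (k, j) entry is the derivative above.
    """
    eta = X @ Beta.T                                                  # (N, K-1)
    P = np.exp(eta) / (1.0 + np.exp(eta).sum(axis=1, keepdims=True))  # p_k(x_i; beta)
    Y = np.zeros_like(P)                                              # indicator matrix I(g_i = k)
    rows = np.arange(len(g))
    mask = g <= P.shape[1]                                            # samples not in the reference class
    Y[rows[mask], g[mask] - 1] = 1.0
    return (Y - P).T @ X
```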