
Source Code for Module ivs.statistics.linearregression

   1  # -*- coding: utf-8 -*- 
   2  """ 
   3  Easy linear fitting. 
   4  Author: Joris De Ridder 
   5   
   6  A crash course 
   7  ============== 
   8  The following example will illustrate some of the core functionality of the package. 
   9  We first make some "fake" observations I{obs} that were measured as a function of I{x}. 
  10   
  11  >>> from numpy import * 
  12  >>> noise = array([0.44, -0.48, 0.26, -2.00, -0.93, 2.21, -0.57, -2.04, -1.09, 1.53]) 
  13  >>> x = linspace(0, 5, 10) 
  14  >>> obs = 2.0 + 3.0 * exp(x) + noise          # our "observations" 
  15   
  16  The model we want to fit is y = a_0 + a_1 * exp(x) = a_0 * 1 + a_1 * exp(x): 
  17   
  18  >>> myModel = LinearModel([ones(10), exp(x)], ["1", "exp(x)"]) 
  19  >>> print(myModel) 
  20  Model: y = a_0 + a_1 * exp(x) 
  21  Expected number of observations: 10 
  22   
  23  Now we fit the observations: 
  24   
  25  >>> myFit = myModel.fitData(obs) 
  26  >>> myFit.summary() 
  27  Model: y = a_0 + a_1 * exp(x) 
  28  Regression coefficients: 
  29      a_0 = 1.519644 +/- 0.574938 
  30      a_1 = 3.006151 +/- 0.010033 
  31  Sum of squared residuals: 16.755934 
  32  Estimated variance of the residuals: 2.094492 
  33  Coefficient of determination R^2: 0.999911 
  34  F-statistic: 89767.825042 
  35  AIC: 15.161673 
  36  BIC: 12.069429 
  37   
  38   
  39  The more advanced features 
  40  ========================== 
  41   
  42  A lot of interesting information can be extracted from the LinearModel and LinearFit classes. 
  43  A simple way to list the available methods at the prompt is: 
  44   
  45  >>> [method for method in dir(myModel) if not method.startswith("_")] 
  46   
  47  With the LinearModel class we can compute, for example: 
  48   
  49  >>> myModel.conditionNumber() 
  50  71.994675255533011 
  51  >>> myModel.singularValues() 
  52  array([ 181.21519026,    2.51706379]) 
  53  >>> myModel.degreesOfFreedom() 
  54  8 
  55  >>> myModel.traceHatMatrix() 
  56  1.9999999999999996 
  57   
  58  and more. Some examples with the LinearFit class: 
  59   
  60  >>> myFit.regressionCoefficients() 
  61  array([ 1.51964437,  3.00615141]) 
  62  >>> myFit.errorBars() 
  63  array([ 0.57493781,  0.01003345]) 
  64  >>> myFit.covarianceMatrix()            # correlation matrix also available 
  65  array([[  3.30553480e-01,  -3.49164676e-03], 
  66         [ -3.49164676e-03,   1.00670216e-04]]) 
  67  >>> myFit.confidenceIntervals(0.05) 
  68  (array([ 0.19383541,  2.98301422]), array([ 2.84545333,  3.0292886 ])) 
  69   
  70  The 95% confidence interval of coefficient a_0 is thus [0.19, 2.85]. See the list of 
  71  methods for more functionality. Note that weights can also be taken into account, as sketched below. 
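
      A minimal sketch of such a weighted fit, assuming uncorrelated observations with a
      per-point uncertainty of 0.5 (a made-up value, for illustration only; output omitted):

      >>> sigma = 0.5 * ones(10)                # assumed per-point uncertainties
      >>> covMatrix = diag(sigma**2)            # diagonal covariance matrix of the observations
      >>> myWeightedModel = LinearModel([ones(10), exp(x)], ["1", "exp(x)"], covMatrix)
      >>> myWeightedFit = myWeightedModel.fitData(obs)

      Error bars, confidence intervals, etc. then follow exactly as in the unweighted case.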
  72   
  73  To evaluate your linear model in a new set of covariates 'xnew', use the method evaluate(). 
  74  Be aware, however, that you need to give the regressors evaluated in the new covariates, 
  75  not just the covariates themselves. For example: 
  76   
  77  >>> xnew = linspace(-5.0, +5.0, 20) 
  78  >>> y = myFit.evaluate([ones_like(xnew), exp(xnew)]) 
  79  >>> print(y) 
  80  [   1.53989966    1.55393018    1.57767944    1.61787944    1.68592536 
  81      1.80110565    1.99606954    2.32608192    2.8846888     3.83023405 
  82      5.43074394    8.13990238   12.72565316   20.48788288   33.62688959 
  83     55.86708392   93.51271836  157.23490405  265.09646647  447.67207215] 
  84      
  85   
  86  Assessing the models 
  87  ==================== 
  88   
  89  Test if we can reject the null hypothesis that all coefficients are zero  
  90  at a 5% significance level: 
  91   
  92  >>> myFit.FstatisticTest(0.05) 
  93  True 
  94   
  95  The "True" result means that H0 can indeed be rejected. Now we test if we can 
  96  reject the null hypothesis that one particular coefficient is zero at a 5% 
  97  significance level: 
  98   
  99  >>> myFit.regressorTtest(0.05) 
 100  array([ True,  True], dtype=bool) 
 101   
 102  We obtained "True" for both coefficients, so both a_0 and a_1 are significant. 
 103   
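      Competing models can also be compared with the information criteria reported by
      summary(). A minimal sketch using the submodels() generator of the model defined
      above (output omitted; the criterion values depend on the data):

      >>> for subModel in myModel.submodels():
      ...     subFit = subModel.fitData(obs)
      ...     print(subModel.regressorNames(), subFit.BICvalue())   # lower BIC is preferred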
 104   
 105  Some handy derived classes 
 106  ========================== 
 107   
 108  There is a shortcut to define polynomials: 
 109   
 110  >>> myPolynomial = PolynomialModel(x, "x", [3,1]) 
 111  >>> print(myPolynomial) 
 112  Model: y = a_0 * x^3 + a_1 * x^1 
 113  Expected number of observations: 10 
 114   
 115  and to define harmonic models: 
 116   
 117  >>> myHarmonicModel = HarmonicModel(x, "x", [2.13, 3.88], ["f1", "f2"], maxNharmonics=2) 
 118  >>> print(myHarmonicModel) 
 119  Model: y = a_0 * sin(2*pi*1*f1*x) + a_1 * cos(2*pi*1*f1*x) + a_2 * sin(2*pi*2*f1*x) + a_3 * cos(2*pi*2*f1*x) + a_4 * sin(2*pi*1*f2*x) + a_5 * cos(2*pi*1*f2*x) + a_6 * sin(2*pi*2*f2*x) + a_7 * cos(2*pi*2*f2*x) 
 120  Expected number of observations: 10 
 121   
 122  One can even add the two models: 
 123   
 124  >>> myCombinedModel = myPolynomial + myHarmonicModel 
 125  >>> print(myCombinedModel) 
 126  Model: y = a_0 * x^3 + a_1 * x^1 + a_2 * sin(2*pi*1*f1*x) + a_3 * cos(2*pi*1*f1*x) + a_4 * sin(2*pi*2*f1*x) + a_5 * cos(2*pi*2*f1*x) + a_6 * sin(2*pi*1*f2*x) + a_7 * cos(2*pi*1*f2*x) + a_8 * sin(2*pi*2*f2*x) + a_9 * cos(2*pi*2*f2*x) 
 127  Expected number of observations: 10 
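
      The combined model is an ordinary LinearModel, so it is fitted in exactly the same
      way as before (output omitted):

      >>> myCombinedFit = myCombinedModel.fitData(obs)
      >>> myCombinedFit.summary()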
 128   
 129   
 130  Notes 
 131  ===== 
 132  The package aims to be robust and efficient. 
 133      - B{Robust}: the fit is computed with a singular value decomposition, in which 
 134                   near-zero singular values are discarded to avoid numerical problems. 
 135                   This is especially useful when the model is close to overfitting the data. 
 136      - B{Efficient}: asking twice for the same information will not lead to a 
 137                      recomputation. Results are kept internally. The reason for 
 138                      having a separate LinearModel and LinearFit class is that 
 139                      the bulk of the fitting computations can actually be done 
 140                      without the observations. Fitting the same model to different 
 141                      sets of observations can therefore be done a lot faster this 
 142                      way (see the sketch below). 
 143                 
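      The sketch: reusing one model for a second, hypothetical set of observations
      (the extra noise array below is made up for illustration only):

      >>> moreNoise = array([0.12, -1.30, 0.85, 0.45, -0.32, -1.11, 0.67, 1.80, -0.50, 0.08])
      >>> otherObs = 2.0 + 3.0 * exp(x) + moreNoise
      >>> myFit1 = myModel.fitData(obs)
      >>> myFit2 = myModel.fitData(otherObs)   # both fits share the SVD of myModel, computed at most once
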
 144  """ 
 145   
 146   
 147  import sys 
 148  import copy 
 149  from math import sqrt,log,pi 
 150  from itertools import combinations 
 151  import numpy as np 
 152  import scipy as sp 
 153  import scipy.stats as stats 
 154   
 155   
 156   
 157   
 158   
159 -class LinearModel(object):
160 161 """ 162 A linear model class 163 164 The class implements a linear model of the form 165 166 C{y(x) = a_0 * f_0(x) + a_1 * f_1(x) + ... + a_{n-1} f_{n-1}(x)} 167 168 where 169 - y are the responses (observables) 170 - x are the covariates 171 - a_i are the regression coefficients (to be determined) 172 - f_i(x) are the regressors evaluated in the covariates. 173 174 Note: LinearModel instances can be added and printed. 175 """ 176 177
178 - def __init__(self, regressors, nameList, covMatrix = None, regressorsAreWeighted = False):
179 180 """ 181 Initialiser of the LinearModel class. 182 183 Example: 184 185 >>> x = linspace(0,10,100) 186 >>> lm = LinearModel([sin(x),x*x], ["sin(x)", "x^2"]) 187 188 @param regressors: either a list of equally-sized numpy arrays with the regressors 189 evaluated in the covariates: [f_0(x),f_1(x),f_2(x),...], 190 or an N x M design matrix (numpy array) where these regressor arrays 191 are column-stacked, with N the number of regressors, and M the number 192 of data points. 193 @type regressors: list or ndarray 194 @param nameList: list of strings of the regressors names. E.g. ["sin(x)", "cos(x)"] 195 @type nameList: list 196 @param covMatrix: square array containing the covariance matrix of the 197 observations. Weights, and correlations can hence be 198 taken into account. 199 @type covMatrix: ndarray 200 @param regressorsAreWeighted: False if regressors are not yet weighted, True otherwise 201 @type regressorsAreWeighted: boolean 202 @return: a LinearModel instance 203 @rtype: LinearModel 204 205 """ 206 207 # Sanity check of the 'regressors' 208 209 if isinstance(regressors, list): 210 self._designMatrix = np.column_stack(np.double(regressors)) 211 self._nParameters = len(regressors) 212 self._nObservations = len(regressors[0]) 213 elif isinstance(regressors, np.ndarray): 214 self._designMatrix = np.double(regressors) 215 self._nParameters = regressors.shape[1] 216 self._nObservations = regressors.shape[0] 217 else: 218 raise TypeError, "LinearModel only accepts a list of regressors, or a design matrix" 219 220 221 # Sanity check of the 'nameList' 222 223 if len(nameList) != self._nParameters: 224 raise ValueError, "Number of names not equal to number of regressors" 225 else: 226 self._regressorNames = copy.copy(nameList) 227 228 229 # Sanity check of the 'covMatrix' 230 231 if covMatrix == None: 232 self._covMatrixObserv = None 233 elif not isinstance(covMatrix, np.ndarray): 234 raise TypeError, "Covariance matrix of observations needs to be an ndarray object" 235 else: 236 if len(covMatrix.shape) != 2: 237 raise TypeError, "Covariance matrix not a 2-dimensional array" 238 elif covMatrix.shape[0] != self._nObservations: 239 raise TypeError, "Size of covariance matrix not compatible with number of observations" 240 elif covMatrix.shape[1] != covMatrix.shape[0]: 241 raise TypeError, "Covariance matrix is not a square matrix" 242 else: 243 self._covMatrixObserv = covMatrix 244 245 246 # If a covariance matrix is specified, and if not already done, compute the weighted 247 # design matrix. The weights are determined by the cholesky decomposition of the 248 # inverse of the covariance matrix. A simple way would be to first invert the covariance 249 # matrix, cholesky-decompose the result, and then do a matrix multiplication. A faster 250 # method is to cholesky-decompose the covariance matrix, and then determine the 251 # weighted design matrix by a standard solve. 252 # 'choleskyLower' is the lower triangular matrix of the cholesky decomposition. 
253 254 if (self._covMatrixObserv != None) and (regressorsAreWeighted == False): 255 self._choleskyLower = np.linalg.cholesky(self._covMatrixObserv) 256 for n in range(self._designMatrix.shape[1]): 257 self._designMatrix[:,n] = np.linalg.solve(self._choleskyLower, self._designMatrix[:,n]) 258 else: 259 self._choleskyLower = None 260 261 262 263 # Initialisation of several internal variables 264 265 self._pseudoInverse = None 266 self._standardCovarianceMatrix = None 267 self._conditionNumber = None 268 self._hatMatrix = None 269 self._traceHatMatrix = None 270 self._degreesOfFreedom = None 271 self._withIntercept = None 272 self._U = None # svd: A = U * w * V^t 273 self._V = None 274 self._w = None 275 self._B = None 276 self._svdTOL = np.finfo(np.double).eps * self._nObservations # tolerance to ignore singular values
277 278 279 280 281
282 - def __add__(self, linearModel):
283 284 """ 285 Defines operator '+' of two LinearModel instances. 286 287 Example: 288 289 >>> x = np.linspace(0,10,100) 290 >>> lm1 = LinearModel([x], ["x"]) 291 >>> lm2 = LinearModel([x**2], ["x^2"]) 292 >>> lm3 = lm1 + lm2 # the same as LinearModel([x, x**2], ["x", "x^2"]) 293 294 295 @param linearModel: a LinearModel instance with the same number of observations as the 296 linear model in 'self' 297 @type linearModel: LinearModel class 298 @return: a newly created LinearModel object representing a linear model that is the sum 299 of 'self' and 'linearModel'. 'self' and 'linearModel' are unaltered. 300 @rtype: LinearModel 301 302 """ 303 304 if not isinstance(linearModel, LinearModel): 305 raise TypeError, "Only a LinearModel can be added to a LinearModel" 306 307 if linearModel.designMatrix().shape[0] != self._nObservations: 308 raise ValueError, "Linear model has incompatible design matrix" 309 310 if not np.alltrue(linearModel._covMatrixObserv == self._covMatrixObserv): 311 raise ValueError, "Linear model has a different covariance matrix" 312 313 designMatrix = np.hstack([self._designMatrix, linearModel.designMatrix()]) 314 regressorNames = self._regressorNames + linearModel.regressorNames() 315 316 if self._covMatrixObserv == None: 317 return LinearModel(designMatrix, regressorNames) 318 else: 319 return LinearModel(designMatrix, regressorNames, self._covMatrixObserv, regressorsAreWeighted = True)
320 321 322 323 324 325 326 327 328 329 330
331 - def designMatrix(self):
332 333 """ 334 Returns the design matrix of the linear model 335 336 Example: 337 338 >>> x = np.linspace(0,10,5) 339 >>> lm = LinearModel([x], ["x"]) 340 >>> lm.designMatrix() 341 array([[ 0. , 0. ], 342 [ 2.5 , 6.25], 343 [ 5. , 25. ], 344 [ 7.5 , 56.25], 345 [ 10. , 100. ]]) 346 347 @return: a N x M design matrix. If the model is written as 348 y(x) = a_0 * f_0(x) + a_1 * f_1(x) + ... + a_{n-1} f_{n-1}(x) 349 then the design matrix A is defined by A_{i,j} = f_i(x_j). N is 350 the number of regressors, M is the number of data points. 351 @rtype: ndarray 352 353 """ 354 355 return self._designMatrix
356 357 358 359 360 361 362 363 364 365 366 367
368 - def _singularValueDecompose(self):
369 370 """ 371 Performs and stores the singular value decomposition of the design matrix 372 373 Private class method, not to be used by the user. 374 375 The singular value decomposition of the design matrix is defined as 376 377 C{X = U * W * V^t } 378 379 where 380 - X: M x N design matrix. 381 - U: M x M unitary matrix 382 - W: M x N diagonal matrix 383 - V: N x N unitary matrix 384 with M the number of observations, and N the number of regressors. Note that for a 385 unitary matrix A: A^{-1} = A^t 386 387 @return: nothing 388 389 """ 390 391 # Compute the singular value decomposition of the design matrix. 392 # As the number of singular values is between 1 and min(M,N), only compute 393 # the first min(M,N) columns of U, as only these columns are ever used 394 # (i.e. multiplied with non-zero values). In the following decomposition: 395 # - w is a vector containing the non-zero diagonal elements of W 396 # - V^t is returned rather than V 397 398 399 self._U,self._w, Vt = np.linalg.svd(self._designMatrix, full_matrices=0) 400 self._V = Vt.T # svd() actually returns V^t rather than V 401 402 # Compute the inverse of the singular values. But set the inverse value 403 # to zero if the 404 405 wmax = self._w.max() 406 self._invSingularValues = np.zeros_like(self._w) 407 for n in range(len(self._w)): 408 if self._w[n] > self._svdTOL * wmax: 409 self._invSingularValues[n] = 1.0 / self._w[n]
410 411 412 413 414 415 416 417 418 419
420 - def pseudoInverse(self):
421 422 """ 423 Computes the pseudo-inverse in a robust way 424 425 Remarks: 426 - The "robustness" is obtained by using a singular value decomposition 427 of X, and treating the singular values in order to avoid numerical 428 problems. 429 430 Example: 431 432 >>> x = linspace(0,10,5) 433 >>> lm = LinearModel([x, x**2], ["x", "x^2"]) 434 >>> lm.pseudoInverse() 435 array([[ 8.95460520e-17, 1.63870968e-01, 1.98709677e-01, 436 1.04516129e-01, -1.18709677e-01], 437 [ -1.07455262e-17, -1.80645161e-02, -2.06451613e-02, 438 -7.74193548e-03, 2.06451613e-02]]) 439 440 @return: A numerical safe version of the N x M matrix 441 X^t X)^{-1} X^t where X is the M x N design matrix, 442 ^t the transpose, ^{-1} the inverse matrix. 443 The multiplications are matrix multiplications. 444 @rtype: ndarray 445 446 447 """ 448 449 if self._pseudoInverse is not None: 450 return self._pseudoInverse 451 else: 452 if self._U == None: self._singularValueDecompose() 453 M = self._U.shape[0] # number of observations 454 N = self._U.shape[1] # number of regressors 455 temp = np.zeros((N,M)) 456 for n in range(N): 457 temp[n] = self._invSingularValues[n] * self._U[:,n] 458 self._pseudoInverse = np.dot(self._V, temp) 459 460 return self._pseudoInverse
461 462 463 464 465 466 467 468 469 470 471 472 473
474 - def standardCovarianceMatrix(self):
475 476 """ 477 Computes the standard covariance matrix. 478 479 Remarks: 480 - The "safeness" is obtained by using a singular value 481 decomposition of X, and treating the singular values to avoid 482 numerical problems. 483 - The matrix C is the covariance matrix of the regression 484 coefficients if the signal noise would be i.i.d. with sigma = 1. 485 - In the case of sigma != 1, the actual covariance matrix 486 can be obtained by simply rescaling the standard covariance 487 matrix. See L{LinearFit.covarianceMatrix}. 488 489 Example: 490 491 >>> x = linspace(0,10,5) 492 >>> lm = LinearModel([x, x**2], ["x", "x^2"]) 493 >>> lm.standardCovarianceMatrix() 494 array([[ 0.09135484, -0.01032258], 495 [-0.01032258, 0.00123871]]) 496 497 @return: A numerical safe version of C = (X^t X)^{-1} where 498 X is the design matrix, ^t the transpose, ^{-1} the 499 inverse matrix. The multiplications are matrix 500 multiplications. 501 @rtype: ndarray 502 503 504 """ 505 506 if self._standardCovarianceMatrix != None: 507 return self._standardCovarianceMatrix 508 else: 509 if self._V == None: self._singularValueDecompose() 510 self._standardCovarianceMatrix = np.dot(self._V, np.dot(np.diag(self._invSingularValues**2), self._V.T)) 511 return self._standardCovarianceMatrix
512 513 514 515 516 517 518 519 520 521 522
    def conditionNumber(self):

        """
        Computes the condition number.

        Example:

        >>> x = linspace(0,10,5)
        >>> lm = LinearModel([x, x**2], ["x", "x^2"])
        >>> lm.conditionNumber()
        35.996606504814814

        @return: The ratio of the highest to the lowest singular value,
                 where the singular values are obtained by singular value
                 decomposition of the design matrix. A large condition number
                 implies an ill-conditioned design matrix.
        @rtype: double

        """

        if self._conditionNumber is not None:
            return self._conditionNumber
        else:
            if self._w is None: self._singularValueDecompose()
            self._conditionNumber = self._w.max() / self._w.min()
            return self._conditionNumber
550 551 552 553 554 555 556 557 558 559 560
561 - def singularValues(self):
562 563 """ 564 Return the non-zero singular values 565 as obtained with the singular value decomposition 566 567 Remarks: 568 - the singular values can be used a diagnostic to see how 569 ill-conditioned the matrix inversion is. 570 571 Example: 572 573 >>> x = linspace(0,10,5) 574 >>> lm = LinearModel([x, x**2], ["x", "x^2"]) 575 >>> lm.singularValues() 576 array([ 118.34194851, 3.28758625]) 577 578 @return: The non-zero singular values. 579 @rtype: ndarray 580 581 """ 582 583 if self._w != None: 584 return self._w 585 else: 586 self._singularValueDecompose() 587 return self._w
588 589 590 591 592 593 594 595 596 597 598 599
600 - def hatMatrix(self):
601 602 """ 603 Computes the hat matrix. 604 605 The hat matrix is defined by H = X (X^t X)^{-1} X^t with 606 - X the (weighted) design matrix 607 - ^t the transpose, 608 - ^{-1} the inverse matrix 609 and where all multiplications are matrix multiplications. 610 It has the property that \hat{y} = H * y where 611 - \hat{y} are the (weighted) predictions 612 - y the (weighted) observations. 613 So, it "puts a hat" on the (weighted) observation vector. 614 615 Remarks: 616 - as the number of data points may be large, this matrix may 617 become so huge that it no longer fits into your computer's 618 memory 619 - for weighted (and/or correlated) observations, the resulting 620 weighted hat matrix does not put a hat on the unweighted 621 observations vector, only on the weighted one. 622 623 Example: 624 625 >>> x = linspace(0,10,5) 626 >>> lm = LinearModel([x, x**2], ["x", "x^2"]) 627 >>> lm.hatMatrix() 628 array([[ 9.32150093e-32, 1.56705591e-16, 1.79092104e-16, 629 6.71595390e-17, -1.79092104e-16], 630 [ 1.56705591e-16, 2.96774194e-01, 3.67741935e-01, 631 2.12903226e-01, -1.67741935e-01], 632 [ 1.79092104e-16, 3.67741935e-01, 4.77419355e-01, 633 3.29032258e-01, -7.74193548e-02], 634 [ 6.71595390e-17, 2.12903226e-01, 3.29032258e-01, 635 3.48387097e-01, 2.70967742e-01], 636 [ -1.79092104e-16, -1.67741935e-01, -7.74193548e-02, 637 2.70967742e-01, 8.77419355e-01]]) 638 639 @return: The hat matrix. 640 @rtype: ndarray 641 642 """ 643 644 if self._hatMatrix is not None: 645 return self._hatMatrix 646 else: 647 if self._U is None: self._singularValueDecompose() 648 self._hatMatrix = np.dot(self._U, self._U.T) 649 return self._hatMatrix
650 651 652 653 654 655 656 657 658 659 660 661
    def traceHatMatrix(self):

        """
        Returns the trace of the (possibly weighted) hat matrix.

        The trace is calculated without computing the entire MxM hat matrix
        (M the number of observations). This is useful when the number of
        observations is so high that the hat matrix no longer fits into memory.

        Remarks:
            - If the hat matrix was computed anyway it will be used.
            - The trace of the hat matrix that puts a hat on the original
              observations equals the trace of the hat matrix that puts a
              hat on the weighted observations.

        Example:

        >>> x = linspace(0,10,5)
        >>> lm = LinearModel([x, x**2], ["x", "x^2"])
        >>> lm.traceHatMatrix()
        1.9999999999999993

        @return: The trace of the hat matrix.
        @rtype: double

        """

        if self._traceHatMatrix is not None:
            return self._traceHatMatrix
        else:
            if self._hatMatrix is not None:
                self._traceHatMatrix = self._hatMatrix.trace()
            else:
                if self._U is None: self._singularValueDecompose()
                self._traceHatMatrix = 0.0
                for k in range(self._U.shape[1]):
                    self._traceHatMatrix += np.sum(np.square(self._U[:,k]))
            return self._traceHatMatrix
701 702 703 704 705 706 707 708 709 710 711 712 713
714 - def degreesOfFreedom(self):
715 716 """ 717 Returns the effective number of degrees of freedom. 718 719 The d.o.f. is defined as the trace of the hat matrix. See also: 720 U{Wikipedia<http://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics)>} 721 722 Example: 723 724 >>> x = linspace(0,10,5) 725 >>> lm = LinearModel([x, x**2], ["x", "x^2"]) 726 >>> lm.degreesOfFreedom() 727 3 728 729 @return: The effective number of degrees of freedom. 730 @rtype: integer 731 732 """ 733 734 if self._degreesOfFreedom is not None: 735 return self._degreesOfFreedom 736 else: 737 self._degreesOfFreedom = self._nObservations - int(round(self.traceHatMatrix())) 738 return self._degreesOfFreedom
739 740 741 742 743 744 745 746 747 748 749
750 - def regressorNames(self):
751 752 """ 753 Returns the regressor names. 754 755 Example: 756 757 >>> x = linspace(0,10,100) 758 >>> lm = LinearModel([sin(x),x*x], ["sin(x)", "x^2"]) 759 >>> lm.regressorNames() 760 ["sin(x)", "x^2"] 761 762 @return: list with strings containing the names of the regressors. 763 @rtype: list 764 765 """ 766 767 return self._regressorNames
768 769 770 771 772 773 774 775 776 777 778 779 780 781
782 - def containsRegressorName(self, regressorName):
783 784 """ 785 Checks if the linear model contains the given regressor. 786 787 Example: 788 789 >>> x = linspace(0,10,100) 790 >>> lm = LinearModel([sin(x), x*x], ["sin(x)", "x^2"]) 791 >>> lm.containsRegressorName("x^2") 792 True 793 >>> lm.containsRegressorName("cos(x)") 794 False 795 796 @param regressorName: name of a regressor 797 @type regressorName: string 798 @return: True if one of the regressor names is identical 799 to the input 'regressorName', else False. 800 @rtype: boolean 801 802 """ 803 804 return regressorName in self._regressorNames
805 806 807 808 809 810 811 812 813
814 - def someRegressorNameContains(self, someString):
815 816 """ 817 Checks if a regressor name contains a given substring. 818 819 Example: 820 821 >>> x = linspace(0,10,100) 822 >>> lm = LinearModel([sin(x), x*x], ["sin(x)", "x^2"]) 823 >>> lm.someRegressorNameContains("sin") 824 True 825 826 @param someString: substring to be checked 827 @type someString: string 828 829 @return: True if at least one of the regressor names contains 830 the substring someString, else return false. 831 @rtype: boolean 832 833 """ 834 835 for name in self._regressorNames: 836 if someString in name: return True 837 838 return False
839 840 841 842 843 844 845 846 847 848
849 - def withIntercept(self):
850 851 """ 852 Checks if there is a constant regressor 853 854 Example: 855 856 >>> x = linspace(0,10,100) 857 >>> lm = LinearModel([1, x*x], ["1", "x^2"]) 858 >>> lm.withIntercept() 859 True 860 861 @return: True if there is a constant regressor (the intercept). 862 False otherwise. 863 @rtype: boolean 864 865 """ 866 867 if self._withIntercept is not None: 868 return self._withIntercept 869 else: 870 self._withIntercept = False 871 for n in range(self._nParameters): 872 if len(np.unique(self._designMatrix[:,n])) == 1: 873 self._withIntercept = True 874 break 875 return self._withIntercept
876 877 878 879 880 881 882 883 884 885
886 - def nObservations(self):
887 888 """ 889 Returns the number of observations 890 891 Example: 892 893 >>> x = linspace(0,10,23) 894 >>> lm = LinearModel([x], ["x"]) 895 >>> lm.nObservations() 896 23 897 898 @return: the number of observations that the LinearModel expects 899 @rtype: integer 900 901 """ 902 903 return self._nObservations
904 905 906 907 908 909 910 911 912 913 914 915
    def nParameters(self):

        """
        Returns the number of fit coefficients

        Example:

        >>> x = linspace(0,10,100)
        >>> lm = LinearModel([sin(x),x*x], ["sin(x)", "x^2"])
        >>> lm.nParameters()
        2

        @return: the number of fit coefficients of the linear model
        @rtype: integer

        """

        return self._nParameters
934 935 936 937 938 939 940 941 942 943 944 945
946 - def fitData(self, observations):
947 948 """ 949 Create and return a LinearFit object. 950 951 From this object the user can extract all kinds of quantities 952 related to the linear fit. See L{LinearFit}. 953 954 Example: 955 956 >>> x = linspace(0,10,100) 957 >>> lm = LinearModel([x], ["x"]) 958 >>> obs = x + normal(0.0, 1.0, 100) # some simulated observations 959 >>> linearfit = lm.fitData(obs) 960 961 @param observations: array with the observations y_i. The size 962 of the array should be compatible with the number 963 of expected observations in the LinearModel. 964 @type observations: ndarray 965 @return: a LinearFit instance, containing the fit of the observations 966 with the linear model. 967 @rtype: LinearFit 968 969 """ 970 971 if len(observations) != self._nObservations: 972 raise ValueError, "Number of observations should be %d != %d" % (self._nObservations, len(observations)) 973 974 return LinearFit(self, observations)
975 976 977 978 979 980 981 982 983 984 985 986 987
988 - def submodels(self, Nmin = 1, Nmax = None, nested = True, ranks = None, simpleFirst=True):
989 990 """ 991 Returns a generator capable of generating submodels 992 993 The generator sequentially returns the submodels which are instances 994 of LinearModel with only a subset of the regressors of the parent model. 995 The submodels with fewer regressors will always appear first in the list. 996 The order in which the regressors appear in the submodels will be from 997 low rank to high rank. 998 999 Examples: 1000 1001 If the regressors are [1, x, x**2] then 1002 - Nmin=1, nested=False, ranks=None gives the submodels 1003 [1], [x], [x**2], [1,x], [1,x**2], [x,x**2], [1,x,x**2] 1004 - Nmin=1, Nmax=2, nested=False, ranks=None gives the submodels 1005 [1], [x], [x**2], [1,x], [1,x**2], [x,x**2] 1006 - Nmin=2, nested=False, ranks=None gives the submodels 1007 [1,x], [1,x**2], [x,x**2], [1,x,x**2] 1008 - Nmin=3, nested=True, ranks=None gives the submodel 1009 [1,x,x**2] 1010 - Nmin=3, nested=True, ranks=[1,0,3] gives the submodel 1011 [x,1,x**2] 1012 - Nmin=2, nested=False, ranks=[1,0,3] gives the submodels 1013 [x,1], [x,x**2], [1,x**2], [x,1,x**2] 1014 - Nmin=2, nested=True, ranks=[1,0,3] gives the submodels 1015 [x,1], [x,1,x**2] 1016 If the regressors are [1,x,sin(2x),cos(2x)] then 1017 - Nmin=3, nested=True, ranks=[0,1,2,2] gives the submodel 1018 [1,x,sin(2x),cos(2x)] 1019 - Nmin=2, nested=True, ranks=[0,1,2,2] gives the submodels 1020 [1,x], [1,x,sin(2x),cos(2x)] 1021 - Nmin=2, nested=False, ranks=[0,1,2,2] gives the submodels 1022 [1,x], [1,sin(2x),cos(2x)], [x,sin(2x),cos(2x)], [1,x,sin(2x),cos(2x)] 1023 - Nmin=1, Nmax=2, nested=False, ranks=[0,1,2,2] gives the submodels 1024 [1], [x], [sin(2x),cos(2x)], [1,x], [1,sin(2x),cos(2x)], [x,sin(2x),cos(2x)] 1025 1026 @param Nmin: Minimum number of regressors (or groups of equally ranking 1027 regressors). Should be minimal 1. 1028 @type Nmin: integer 1029 @param Nmax: Maximum number of regressors (or groups of equally ranking 1030 regressors). If == None, it will take the maximum number 1031 possible. Note: if the ranks = [0,1,2,2] then there are 1032 maximum 3 "groups" of regressors, as the latter 2 regressors 1033 will always be taken together. 1034 @type Nmax: integer 1035 @param nested: if False: return all possible submodels, i.e. all possible 1036 combinations of regressors. If true: return only nested 1037 submodels with combinations of regressors (or groups of 1038 equally ranked regressors) so that a higher ranked regressors 1039 will never appear without the lower ranked regressors 1040 @type nested: boolean 1041 @param ranks: list with (integer) rankings of the regressors, only the 1042 relative ranking of the numbers will be used, so the exact 1043 numbers are not important. For example, [1,2,3] has the 1044 same result as [10,24,35]. Regressors having the same rank 1045 should always be included together in a submodel, regardless 1046 how the input parameter 'nested' has been set. In case of 1047 nested submodels, a regressor with a higher rank should 1048 never occur without the regressors with lower rank in the 1049 same submodel. 1050 @type ranks: list with integers 1051 @param simpleFirst: if True: generate the more simple submodels first. 1052 if False: generate the complicated submodels first. 
1053 @type simpleFirst: boolean 1054 @return: a python generator implementing the yield method 1055 @rtype: generator 1056 1057 """ 1058 1059 # If no ranking of the regressors are given, consider the first regressor 1060 # having the most important, and going down, the last regressor the least 1061 # important 1062 1063 if ranks == None: 1064 ranks = np.arange(self._nParameters) 1065 1066 # Sort the (unique) ranks from low (most important) to high (least important) 1067 1068 uniqueRanks = np.unique(ranks) 1069 if Nmax == None: 1070 Nmax = len(uniqueRanks) 1071 1072 # nRegressor is a list to be looped over, containing for each submodel the number 1073 # of regressors. Be aware that a group of regressors with equal rank are considered to 1074 # be only 1 regressor in this accounting. 1075 1076 if simpleFirst: 1077 nRegressorRange = range(Nmin,Nmax+1) 1078 else: 1079 nRegressorRange = range(Nmax,Nmin-1,-1) 1080 1081 # Make a generator yield the submodels 1082 1083 if nested == True: 1084 for n in nRegressorRange: 1085 nameListSub = [] 1086 regressorListSub = [] 1087 indices = [k for m in uniqueRanks[:n] for k in np.where(ranks == m)[0]] 1088 nameListSub += [self._regressorNames[k] for k in indices] 1089 regressorListSub += [self._designMatrix[:,k] for k in indices] 1090 if self._covMatrixObserv != None: 1091 covMatrixObservSub = self._covMatrixObserv[indices,:][:,indices] 1092 yield LinearModel(regressorListSub, nameListSub, covMatrixObservSub, regressorsAreWeighted=True) 1093 else: 1094 yield LinearModel(regressorListSub, nameListSub) 1095 else: 1096 comboList = [combo for comboSize in nRegressorRange for combo in combinations(uniqueRanks, comboSize)] 1097 for combo in comboList: 1098 indices = [k for n in combo for k in np.where(ranks==n)[0]] 1099 nameListSub = [self._regressorNames[k] for k in indices] 1100 regressorListSub = [self._designMatrix[:,k] for k in indices] 1101 if self._covMatrixObserv != None: 1102 covMatrixObservSub = self._covMatrixObserv[indices,:][:,indices] 1103 yield LinearModel(regressorListSub, nameListSub, covMatrixObservSub, regressorsAreWeighted=True) 1104 else: 1105 yield LinearModel(regressorListSub, nameListSub)
1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116
1117 - def copy(self):
1118 1119 """ 1120 Returns a duplicate of the current LinearModel 1121 1122 Example: 1123 1124 >>> x = linspace(0,10,100) 1125 >>> lm = LinearModel([x], ["x"]) 1126 >>> lm2 = lm.copy() 1127 1128 @return: of duplicate of self. Not a reference, a real copy. 1129 @rtype: LinearModel 1130 1131 """ 1132 1133 if self._covMatrixObserv != None: 1134 return LinearModel(self._designMatrix.copy(), copy.copy(self._regressorNames), self._covMatrixObserv.copy()) 1135 else: 1136 return LinearModel(self._designMatrix.copy(), copy.copy(self._regressorNames))
1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148
1149 - def __str__(self):
1150 1151 """ 1152 Generates what is written to the console if the LinearModel is printed 1153 1154 @return: nothing 1155 1156 """ 1157 1158 if self._regressorNames[0] == "1": 1159 str = "Model: y = a_0" 1160 else: 1161 str = "Model: y = a_0 * %s" % self._regressorNames[0] 1162 1163 for n in range(1, self._nParameters): 1164 if self._regressorNames[n] == "1": 1165 str += " + a_%d" % n 1166 else: 1167 str += " + a_%d * %s" % (n, self._regressorNames[n]) 1168 1169 str += "\nExpected number of observations: %d" % self._nObservations 1170 1171 return str
1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185
class PolynomialModel(LinearModel):

    """
    Class to implement a polynomial model

    The class implements a polynomial model function of the form

    C{y(x) = a_0 + a_1 * x + a_2 * x^2 + ...}

    where y are the responses (observables), x are the covariates, and
    the a_i are the regression coefficients (to be determined).
    Such a class provides a convenient way of creating a polynomial model.

    Remarks:
        - PolynomialModel instances can be added to other PolynomialModel
          or LinearModel instances to give a new LinearModel instance.
    """
1205 - def __init__(self, covariate, covariateName, orderList, covMatrix = None):
1206 1207 """ 1208 Initialiser of the PolynomialModel class. 1209 1210 Example: 1211 1212 >>> time = linspace(0,10,100) 1213 >>> pm = PolynomialModel(time, "t", [2,1]) 1214 1215 @param covariate: an array of size N with the covariate values, 1216 where N is the number of observations 1217 @type covariate: ndarray 1218 @param covariateName: the (preferably short) name of the covariate. 1219 E.g. "t", or "x". 1220 @type covariateName: string 1221 @param orderList: list of the orders that should be included in the 1222 model. E.g. [3,0,1] implies the model 1223 a_0 * x^3 + a_1 + a_2 * x 1224 @type orderList: list 1225 @param covMatrix: square array containing the covariance matrix of the 1226 observations. This way, weights and correlations can 1227 be taken into account. 1228 @type covMatrix: ndarray 1229 @return: an instance of PolynomialModel, subclass of LinearModel 1230 @rtype: PolynomialModel 1231 1232 """ 1233 1234 # Create and fill the design matrix 1235 1236 designMatrix = np.empty((len(covariate), len(orderList))) 1237 for n in range(len(orderList)): 1238 designMatrix[:,n] = covariate**orderList[n] 1239 1240 1241 # List the regressor names 1242 1243 regressorNames = [] 1244 for n in range(len(orderList)): 1245 if orderList[n] == 0: 1246 regressorNames += ["1"] 1247 else: 1248 regressorNames += ["%s^%d" % (covariateName, orderList[n])] 1249 1250 1251 # Let the base class do the initialising 1252 1253 super(PolynomialModel, self).__init__(designMatrix, regressorNames, covMatrix, regressorsAreWeighted=False) 1254 1255 1256 # Testing for an intercept can be done more efficiently for polynomials 1257 1258 if 0 in orderList: 1259 self._withIntercept = True 1260 else: 1261 self._withIntercept = False
1262 1263 1264 1265 1266
1267 -class HarmonicModel(LinearModel):
1268 1269 """ 1270 Class to implement a harmonic model function. 1271 1272 The class implements a harmonic model of the form 1273 1274 C{y(x) = a_0 * sin(2*pi*f1*t) + a_1 * cos(2*pi*f1*t) 1275 + a_2 * sin(2*pi*f2*t) + a_3 * cos(2*pi*f2*t) + ...} 1276 1277 where y are the responses (observables), x are the covariates, 1278 the a_i are the regression coefficients (to be determined). 1279 Such a class provides a more convenient short way of creating a 1280 harmonic model. 1281 1282 Remarks: 1283 - HarmonicModel instances can be added to other HarmonicModel 1284 or LinearModel instances to give a new LinearModel instance. 1285 """ 1286 1287
1288 - def __init__(self, covariate, covariateName, freqList, freqNameList, 1289 maxNharmonics=1, covMatrix = None):
1290 1291 """ 1292 Initialiser of HarmonicModel class, with *known* frequencies. 1293 1294 Example: 1295 1296 >>> time = linspace(0,10,100) 1297 >>> hm = HarmonicModel(time, "t", [2.34, 8.9], ["f_1", "f_2"], 1) 1298 1299 1300 @param covariate: an array of size N with the covariate values, 1301 where N is the number of observations. 1302 @type covariate: ndarray 1303 @param covariateName: a (preferably short) name of the covariate. 1304 E.g. "t", or "x". 1305 @type covariateName: string 1306 @param freqList: the frequencies (unit: inverse unit of covariate) 1307 @type freqList: list 1308 @param freqNameList: the (preferably short) names of the frequencies. 1309 E.g. "f_1" or "nu_1". 1310 @type freqNameList: list 1311 @param maxNharmonics: integer values > 1. The number of harmonics for 1312 each frequency. For example: 1313 1 : f_1, f_2, ... 1314 2 : f_1, 2*f_1, f_2, 2*f_2, ... 1315 @type maxNharmonics: list 1316 @param covMatrix: square array containing the covariance matrix of the 1317 observations. Weights, and correlations can hence be 1318 taken into account. 1319 @type covMatrix: ndarray 1320 @return: an instance of HarmonicModel, subclass of LinearModel 1321 @rtype: HarmonicModel 1322 1323 """ 1324 1325 # Create and fill the design matrix 1326 1327 nParam = 2*len(freqList)*maxNharmonics 1328 designMatrix = np.empty((len(covariate), nParam)) 1329 runningIndex = 0 1330 for n in range(len(freqList)): 1331 for k in range(1,maxNharmonics+1): 1332 designMatrix[:,runningIndex] = np.sin(2.*pi*k*freqList[n]*covariate) 1333 designMatrix[:,runningIndex+1] = np.cos(2.*pi*k*freqList[n]*covariate) 1334 runningIndex += 2 1335 1336 # List the regressor names 1337 1338 regressorNames = [] 1339 for n in range(len(freqList)): 1340 for k in range(1,maxNharmonics+1): 1341 regressorNames += ["sin(2*pi*%d*%s*%s)" % (k, freqNameList[n], covariateName), 1342 "cos(2*pi*%d*%s*%s)" % (k, freqNameList[n], covariateName)] 1343 1344 1345 # Let the base class do the initialising 1346 1347 super(HarmonicModel, self).__init__(designMatrix, regressorNames, covMatrix, regressorsAreWeighted=False) 1348 1349 1350 # There is no intercept for a harmonic model 1351 1352 self._withIntercept = False
1353 1354 1355 1356 1357 1358 1359 1360 1361 1362 1363 1364 1365 1366 1367 1368
1369 -class LinearFit(object):
1370 1371 """ 1372 A class that allows to easily extract the fit coefficients, 1373 covariance matrix, predictions, residuals, etc. 1374 1375 Remarks: LinearFit instances can be printed 1376 """ 1377 1378
1379 - def __init__(self, linearModel, observations):
1380 1381 """ 1382 Initialises the LinearFit instance 1383 1384 @param linearModel: a LinearModel instance 1385 @type linearModel: LinearModel 1386 @param observations: the observations. The size of the array must be 1387 compatible with the number of observations that 1388 the linearModel expects. 1389 @type observations: ndarray 1390 @return: a LinearFit instance 1391 @rtype: LinearFit 1392 1393 """ 1394 1395 # Store some ndarrays internally for later 1396 1397 self._linearModel = linearModel 1398 self._nObservations = len(observations) 1399 self._nParameters = linearModel.designMatrix().shape[1] 1400 1401 1402 # Init some quantities as None, signaling that they have not yet been computed 1403 1404 self._regressionCoefficients = None 1405 self._residuals = None 1406 self._weightedResiduals = None 1407 self._predictions = None 1408 self._weightedPredictions = None 1409 self._sumSqResiduals = None 1410 self._sumSqWeightedResiduals = None 1411 self._covarianceMatrix = None 1412 self._correlationMatrix = None 1413 self._errorBars = None 1414 self._t_values = None 1415 self._coefficientOfDetermination = None 1416 self._weightedCoeffOfDetermination = None 1417 self._residualVariance = None 1418 self._weightedResidualVariance = None 1419 self._BICvalue = None 1420 self._AICvalue = None 1421 self._Fstatistic = None 1422 self._weightedFstatistic = None 1423 1424 1425 # If a covariance of the observations was specified, weight the observations. 1426 1427 self._originalObservations = observations 1428 if linearModel._covMatrixObserv != None: 1429 self._weightedObservations = np.linalg.solve(linearModel._choleskyLower, observations) 1430 else: 1431 self._weightedObservations = observations
1432 1433 1434
1435 - def observations(self, weighted=False):
1436 1437 """ 1438 Returns the observations 1439 1440 Remarks: 1441 - if no covariance matrix of the observations was specified for 1442 the model, the "decorrelated" observations are identical to the 1443 original observations. 1444 1445 @param weighted: If false return the original observations, if 1446 True, return de decorrelated observations. 1447 @type weighted: boolean 1448 @return: the observations that were used to initialize the LinearFit instance 1449 @rtype: ndarray 1450 1451 """ 1452 1453 if weighted: 1454 return self._weightedObservations 1455 else: 1456 return self._originalObservations
1457 1458 1459 1460
1461 - def regressionCoefficients(self):
1462 1463 """ 1464 Returns the regression coefficients. 1465 1466 @return: the fit coefficients 1467 @rtype: ndarray 1468 1469 """ 1470 1471 if self._regressionCoefficients is not None: 1472 return self._regressionCoefficients 1473 else: 1474 self._regressionCoefficients = np.dot(self._linearModel.pseudoInverse(), 1475 self._weightedObservations) 1476 return self._regressionCoefficients
1477 1478 1479 1480
1481 - def sumSqResiduals(self, weighted=False):
1482 1483 """ 1484 Returns the sum of the squared residuals 1485 1486 @param weighted: If false the unweighted residuals are used, 1487 if True, the weighted ones are used. 1488 @type weighted: boolean 1489 @return: the sum of the squared residuals 1490 @rtype: double 1491 1492 """ 1493 1494 if weighted: 1495 if self._sumSqWeightedResiduals is not None: 1496 return self._sumSqWeightedResiduals 1497 else: 1498 self._sumSqWeightedResiduals = np.sum(np.square(self.residuals(weighted=True))) 1499 return self._sumSqWeightedResiduals 1500 else: 1501 if self._sumSqResiduals is not None: 1502 return self._sumSqResiduals 1503 else: 1504 self._sumSqResiduals = np.sum(np.square(self.residuals(weighted=False))) 1505 return self._sumSqResiduals
1506 1507 1508 1509 1510 1511
1512 - def residualVariance(self, weighted=False):
1513 1514 """ 1515 Estimates the variance of the residuals of the time series. 1516 1517 As normalization the degrees of freedom is used 1518 1519 @param weighted: If false the unweighted residuals are used, 1520 if True, the weighted ones are used. 1521 @type weighted: boolean 1522 @return: the variance of the residuals 1523 @rtype: double 1524 1525 """ 1526 1527 if weighted: 1528 if self._weightedResidualVariance is not None: 1529 return self._weightedResidualVariance 1530 else: 1531 self._weightedResidualVariance = self.sumSqResiduals(weighted=True) \ 1532 / self._linearModel.degreesOfFreedom() 1533 return self._weightedResidualVariance 1534 else: 1535 if self._residualVariance is not None: 1536 return self._residualVariance 1537 else: 1538 self._residualVariance = self.sumSqResiduals(weighted=False) \ 1539 / self._linearModel.degreesOfFreedom() 1540 return self._residualVariance
1541 1542 1543 1544
1545 - def covarianceMatrix(self):
1546 1547 """ 1548 Returns the covariance matrix of the fit coefficients 1549 1550 @return: the MxM covariance matrix of the fit coefficients with 1551 M the number of fit coefficients 1552 @rtype: ndarray 1553 1554 """ 1555 1556 if self._covarianceMatrix is not None: 1557 return self._covarianceMatrix 1558 else: 1559 self._covarianceMatrix = self._linearModel.standardCovarianceMatrix() \ 1560 * self.residualVariance(weighted=True) 1561 return self._covarianceMatrix
1562 1563 1564 1565
1566 - def correlationMatrix(self):
1567 1568 """ 1569 Returns the correlation matrix of the fit coefficients 1570 1571 @return: the MxM correlation matrix of the fit coefficients with 1572 M the number of fit coefficients 1573 @rtype: ndarray 1574 1575 """ 1576 1577 if self._correlationMatrix is not None: 1578 return self._correlationMatrix 1579 else: 1580 cov = self.covarianceMatrix() 1581 self._correlationMatrix = np.empty_like(cov) 1582 cor = self._correlationMatrix # shorter alias 1583 for n in range(cor.shape[0]): 1584 for m in range(n): 1585 cor[n,m] = cov[n,m] / sqrt(cov[n,n] * cov[m,m]) 1586 cor[m,n] = cor[n,m] 1587 cor[n,n] = 1.0 1588 return self._correlationMatrix
1589 1590 1591 1592
1593 - def errorBars(self):
1594 1595 """ 1596 Returns the formal error bars of the fit coefficients 1597 1598 @return: The square root of the diagonal of the covariance matrix 1599 @rtype: ndarray 1600 1601 """ 1602 1603 if self._errorBars is not None: 1604 return self._errorBars 1605 else: 1606 self._errorBars = np.sqrt(self.covarianceMatrix().diagonal()) 1607 return self._errorBars
1608 1609 1610
1611 - def confidenceIntervals(self, alpha):
1612 1613 """ 1614 Returns the symmetric (1-alpha) confidence interval around the fit coefficients 1615 E.g. if alpha = 0.05, the 95% symmetric confidence interval is returned. 1616 1617 Remarks: 1618 - The formula used assumes that the noise on the observations is 1619 independently, identically and gaussian distributed. 1620 1621 @param alpha: confidence level in ]0,1[ 1622 @type alpha: double 1623 @return: the arrays lowerBounds[0..K-1] and upperBounds[0..K-1] which contain 1624 respectively the lower and the upper bounds of the symmetric interval 1625 around the fit coefficients, where K is the number of fit coeffs. 1626 @rtype: tuple 1627 1628 """ 1629 1630 t = stats.t.ppf(1.0-alpha/2., self._linearModel.degreesOfFreedom()) 1631 1632 lowerBounds = self.regressionCoefficients() - t * self.errorBars() 1633 upperBounds = self.regressionCoefficients() + t * self.errorBars() 1634 1635 return lowerBounds, upperBounds
1636 1637 1638
1639 - def t_values(self):
1640 1641 """ 1642 Returns the formal t-values of the fit coefficients 1643 1644 @return: t-values = fit coefficients divided by their formal error bars 1645 @rtype: ndarray 1646 1647 """ 1648 1649 if self._t_values is not None: 1650 return self._t_values 1651 else: 1652 self._t_values = self.regressionCoefficients() / self.errorBars() 1653 return self._t_values
1654 1655 1656 1657
1658 - def regressorTtest(self, alpha):
1659 1660 """ 1661 Performs a hypothesis T-test on each of the regressors 1662 Null hypothesis: H0 : fit coefficient == 0 1663 Alternative hypothesis : H1 : fit coefficient != 0 1664 1665 Remarks: 1666 - This test assumes that the noise is independently, 1667 identically, and gaussian distributed. It is not robust 1668 against this assumption. 1669 1670 @param alpha: significance level of the hypothesis test. In ]0,1[. 1671 E.g.: alpha = 0.05 1672 @type alpha: double 1673 @return: a boolean array of length K with K the number of regressors. 1674 "True" if the null hypothesis was rejected for the regressor, 1675 "False" otherwise. 1676 @rtype: ndarray 1677 1678 """ 1679 1680 t = stats.t.ppf(1.0-alpha/2., self._linearModel.degreesOfFreedom()) 1681 1682 return np.fabs(self.t_values()) > t
1683 1684 1685
1686 - def predictions(self, weighted=False):
1687 1688 """ 1689 Returns the predicted (fitted) values 1690 1691 It concerns the predictions for the original observations, 1692 not the decorrelated ones. 1693 1694 Remarks: 1695 - If no covariance matrix of the observations was specified 1696 for the model, the weighted predictions are identical to 1697 the unweighted ones. 1698 1699 @param weighted: If True/False, the predictions for the 1700 decorrelated/original observations will be used. 1701 @type weighted: boolean 1702 1703 """ 1704 1705 if weighted: 1706 if self._weightedPredictions is not None: 1707 return self._weightedPredictions 1708 else: 1709 if self._linearModel._covMatrixObserv != None: 1710 self._weightedPredictions = np.dot(self._linearModel._choleskyLower, np.dot(self._linearModel.designMatrix(), self.regressionCoefficients())) 1711 else: 1712 self._weightedPredictions = np.dot(self._linearModel.designMatrix(), self.regressionCoefficients()) 1713 return self._weightedPredictions 1714 else: 1715 if self._predictions is not None: 1716 return self._predictions 1717 else: 1718 if self._linearModel._covMatrixObserv != None: 1719 self._predictions = np.dot(self._linearModel._choleskyLower, np.dot(self._linearModel.designMatrix(), self.regressionCoefficients())) 1720 else: 1721 self._predictions = np.dot(self._linearModel.designMatrix(), self.regressionCoefficients()) 1722 return self._predictions
1723 1724 1725 1726
1727 - def residuals(self, weighted=False):
1728 1729 """ 1730 Returns an array with the residuals. 1731 Residuals = observations minus the predictions 1732 1733 Remarks: 1734 - If no covariance matrix of the observations was specified for the 1735 model, the weighted residuals are identical to the unweighted ones. 1736 1737 @param weighted: If True/False, the residuals for the 1738 decorrelated/original observations will be used. 1739 @type weighted: boolean 1740 1741 """ 1742 1743 if weighted: 1744 if self._weightedResiduals is not None: 1745 return self._weightedResiduals 1746 else: 1747 self._weightedResiduals = self._weightedObservations - self.predictions(weighted=True) 1748 return self._weightedResiduals 1749 else: 1750 if self._residuals is not None: 1751 return self._residuals 1752 else: 1753 self._residuals = self._originalObservations - self.predictions(weighted=False) 1754 return self._residuals
1755 1756 1757 1758 1759
    def coefficientOfDetermination(self, weighted=False):

        """
        Returns the coefficient of determination

        The coefficient of determination is defined by 1 - S1 / S2 with
            - S1 the sum of squared residuals:
              S1 = \sum_{i=1}^N (y_i - \hat{y}_i)^2
            - S2 the sample variance w.r.t. the mean:
              S2 = \sum_{i=1}^N (y_i - \overline{y})^2
            - If there is no intercept term in the model, then S2 is computed as
              S2 = \sum_{i=1}^N y_i^2

        Remarks:
            - If no covariance matrix of the observations was specified
              for the model, the weighted coefficient of determination is
              identical to the unweighted one.

        @param weighted: If True/False, the residuals for the
                         decorrelated/original observations will be used.
        @type weighted: boolean
        @return: coefficient of determination
        @rtype: double

        """

        if weighted:
            if self._weightedCoeffOfDetermination is not None:
                return self._weightedCoeffOfDetermination
            else:
                if self._linearModel.withIntercept():
                    sampleVariance = np.sum(np.square(self._weightedObservations - np.mean(self._weightedObservations)))
                    self._weightedCoeffOfDetermination = 1.0 - self.sumSqResiduals(weighted=True) / sampleVariance
                else:
                    sampleVariance = np.sum(np.square(self._weightedObservations))
                    self._weightedCoeffOfDetermination = 1.0 - self.sumSqResiduals(weighted=True) / sampleVariance

                return self._weightedCoeffOfDetermination
        else:
            if self._coefficientOfDetermination is not None:
                return self._coefficientOfDetermination
            else:
                if self._linearModel.withIntercept():
                    sampleVariance = np.sum(np.square(self._originalObservations - np.mean(self._originalObservations)))
                    self._coefficientOfDetermination = 1.0 - self.sumSqResiduals(weighted=False) / sampleVariance
                else:
                    sampleVariance = np.sum(np.square(self._originalObservations))
                    self._coefficientOfDetermination = 1.0 - self.sumSqResiduals(weighted=False) / sampleVariance

                return self._coefficientOfDetermination
1810 1811 1812 1813 1814
1815 - def BICvalue(self):
1816 1817 """ 1818 Returns the Bayesian Information Criterion value. 1819 1820 Remarks: 1821 - Also called the Schwartz Information Criterion (SIC) 1822 - Gaussian noise is assumed, with unknown variance sigma^2 1823 - Constant terms in the log-likelihood are omitted 1824 1825 TODO: . make a weighted version 1826 1827 @return: BIC value 1828 @rtype: double 1829 1830 """ 1831 1832 if self._BICvalue is not None: 1833 return self._BICvalue 1834 else: 1835 # Include the sigma of the noise in the number of unknown fit parameters 1836 nParam = self._nParameters + 1 1837 self._BICvalue = self._nObservations * log(self.sumSqResiduals(weighted=False)/self._nObservations) \ 1838 + nParam * log(self._nObservations) 1839 return self._BICvalue
1840 1841 1842 1843
1844 - def AICvalue(self):
1845 1846 """ 1847 Returns the 2nd order Akaike Information Criterion value 1848 1849 Remarks: 1850 - Gaussian noise is assumed, with unknown variance sigma^2 1851 - Constant terms in the log-likelihood were omitted 1852 - If the number of observations equals the number of parameters + 1 1853 then the 2nd order AIC is not defined. In this case the method 1854 gives a nan. 1855 1856 TODO: . make a weighted version 1857 1858 @return: AIC value 1859 @rtype: double 1860 1861 """ 1862 1863 if self._AICvalue is not None: 1864 return self._AICvalue 1865 else: 1866 nParam = self._nParameters + 1 1867 if (self._nObservations - nParam - 1) == 0: 1868 # In this case we would get a division by zero 1869 self._AICvalue = np.nan 1870 else: 1871 # Include the sigma of the noise in the number of unknown fit parameters 1872 self._AICvalue = self._nObservations * log(self.sumSqResiduals(weighted=False)/self._nObservations) \ 1873 + 2*nParam + 2*nParam*(nParam+1)/(self._nObservations - nParam - 1) 1874 return self._AICvalue
1875 1876 1877 1878
    def Fstatistic(self, weighted=False):

        """
        Returns the F-statistic, commonly used to assess the fit.

        Remarks:
            - if no covariance matrix of the observations was specified for the
              model, the weighted F-statistic is identical to the unweighted one

        @param weighted: If True/False, the residuals for the
                         decorrelated/original observations will be used.
        @type weighted: boolean
        @return: the F-statistic
        @rtype: double

        """

        if weighted:
            if self._weightedFstatistic is not None:
                return self._weightedFstatistic
            else:
                Rsq = self.coefficientOfDetermination(weighted=True)
                df = self._linearModel.degreesOfFreedom()

                if self._nParameters == 1:
                    self._weightedFstatistic = Rsq / (1-Rsq) * df / self._nParameters
                else:
                    self._weightedFstatistic = Rsq / (1-Rsq) * df / (self._nParameters-1)

                return self._weightedFstatistic
        else:
            if self._Fstatistic is not None:
                return self._Fstatistic
            else:
                Rsq = self.coefficientOfDetermination(weighted=False)
                df = self._linearModel.degreesOfFreedom()

                if self._nParameters == 1:
                    self._Fstatistic = Rsq / (1-Rsq) * df / self._nParameters
                else:
                    self._Fstatistic = Rsq / (1-Rsq) * df / (self._nParameters-1)

                return self._Fstatistic
1922 1923 1924
1925 - def FstatisticTest(self, alpha, weighted=False):
1926 1927 """ 1928 Performs a hypothesis F-test on the fit 1929 1930 Null hypothesis: H0 : all fit coefficients == 0 1931 Alternative hypothesis : H1 : at least one fit coefficient != 0 1932 1933 Stated otherwise (with R^2 the coefficient of determination): 1934 Null hypothesis: H0 : R^2 == 0 1935 Alternative hypothesis: H1: R^2 != 0 1936 1937 Remarks: 1938 - if no covariance matrix of the observations was specified for the 1939 model, the weighted F-test is identical to the unweighted one 1940 1941 1942 @param alpha: significance level of the hypothesis test. In ]0,1[. 1943 @type alpha: double 1944 @param weighted: If True/False, the weighted/not-weighted 1945 F-statistic will be used. 1946 @type weighted: boolean 1947 @return: "True" if null hypothesis was rejected, "False" otherwise 1948 @rtype: boolean 1949 1950 """ 1951 1952 df = self._linearModel.degreesOfFreedom() 1953 1954 if self._nParameters == 1: 1955 Fcritical = stats.f.ppf(1.0-alpha, self._nParameters, df) 1956 else: 1957 Fcritical = stats.f.ppf(1.0-alpha, self._nParameters-1, df) 1958 1959 return (self.Fstatistic(weighted) > Fcritical)
1960 1961 1962
1963 - def summary(self, outputStream = sys.stdout):
1964 1965 """ 1966 Writes some basic results of fitting the model to the observations. 1967 1968 @param outputStream: defaulted to the standard output console, but can be 1969 replaced by another open stream, like a file stream. 1970 @type outputStream: stream class. The .write() method is used. 1971 @return: nothing 1972 1973 """ 1974 1975 regressorNames = self._linearModel.regressorNames() 1976 if regressorNames[0] == "1": 1977 outputStream.write("Model: y = a_0") 1978 else: 1979 outputStream.write("Model: y = a_0 * %s" % regressorNames[0]) 1980 for n in range(1, self._nParameters): 1981 if regressorNames[n] == "1": 1982 outputStream.write(" + a_%d" % n) 1983 else: 1984 outputStream.write(" + a_%d * %s" % (n, regressorNames[n])) 1985 outputStream.write("\n") 1986 1987 coeff = self.regressionCoefficients() 1988 errors = self.errorBars() 1989 outputStream.write("Regression coefficients:\n") 1990 for n in range(self._nParameters): 1991 outputStream.write(" a_%d = %f +/- %f\n" % (n, coeff[n], errors[n])) 1992 1993 outputStream.write("Sum of squared residuals: %f\n" % self.sumSqResiduals()) 1994 outputStream.write("Estimated variance of the residuals: %f\n" % self.residualVariance()) 1995 outputStream.write("Coefficient of determination R^2: %f\n" % self.coefficientOfDetermination()) 1996 outputStream.write("F-statistic: %f\n" % self.Fstatistic()) 1997 outputStream.write("AIC: %f\n" % self.AICvalue()) 1998 outputStream.write("BIC: %f\n" % self.BICvalue())
1999 2000 2001
2002 - def __str__(self):
2003 2004 """ 2005 Returns the string written by printing the LinearFit object 2006 2007 """ 2008 2009 return "Linear fit of a dataset of %d observations, " % self._nObservations \ 2010 + "using a linear model with %d regressors" % self._nParameters
2011 2012 2013 2014 2015 2016 2017
2018 - def evaluate(self, regressors):
2019 2020 """ 2021 Evaluates your current best fit in regressors evaluated in new covariates. 2022 2023 Remark: 2024 - The new regressor functions should be exactly the same ones as you used 2025 to define the linear model. They should only be evaluated in new covariates. 2026 This is not checked for! 2027 2028 Example: 2029 2030 >>> noise = array([0.44, -0.48, 0.26, -2.00, -0.93, 2.21, -0.57, -2.04, -1.09, 1.53]) 2031 >>> x = linspace(0, 5, 10) 2032 >>> obs = 2.0 + 3.0 * exp(x) + noise 2033 >>> myModel = LinearModel([ones(10), exp(x)], ["1", "exp(x)"]) 2034 >>> print(myModel) 2035 Model: y = a_0 + a_1 * exp(x) 2036 Expected number of observations: 10 2037 2038 >>> myFit = myModel.fitData(obs) 2039 >>> xnew = linspace(-5.0, +5.0, 20) 2040 >>> y = myFit.evaluate([ones_like(xnew), exp(xnew)]) 2041 >>> print(y) 2042 [ 1.53989966 1.55393018 1.57767944 1.61787944 1.68592536 2043 1.80110565 1.99606954 2.32608192 2.8846888 3.83023405 2044 5.43074394 8.13990238 12.72565316 20.48788288 33.62688959 2045 55.86708392 93.51271836 157.23490405 265.09646647 447.67207215] 2046 2047 @param regressors: either a list of equally-sized numpy arrays with the regressors 2048 evaluated in the new covariates: [f_0(xnew),f_1(xnew),f_2(xnew),...], 2049 or an N x M design matrix (numpy array) where these regressor arrays 2050 are column-stacked, with N the number of regressors, and M the number 2051 of data points. 2052 @type regressors: either a list or an ndarray 2053 @return: the linear model evaluated in the new regressors 2054 @rtype: ndarray 2055 """ 2056 2057 # Sanity check of the 'regressors' 2058 2059 if not isinstance(regressors, list) and not isinstance(regressors, np.ndarray): 2060 raise TypeError, "LinearModel only accepts a list of regressors, or a design matrix" 2061 2062 # Construct the design matrix, if required 2063 2064 if isinstance(regressors, list): 2065 designMatrix = np.column_stack(np.double(regressors)) 2066 else: 2067 designMatrix = regressors 2068 2069 # Check whether the design matrix has the proper shape 2070 2071 if designMatrix.ndim != 2: 2072 raise TypeError, "Design matrix is not 2-dimensional" 2073 2074 if designMatrix.shape[1] != self._nParameters: 2075 raise TypeError, "Number of regressors not equal to the one of the original model" 2076 2077 # Evaluate the model in the new regressors 2078 2079 return np.dot(designMatrix, self.regressionCoefficients())
__all__ = ["LinearModel", "PolynomialModel", "HarmonicModel", "LinearFit"]