Linear trend estimation

For a broader coverage related to this topic, see Curve fitting.

Trend estimation is a statistical technique to aid interpretation of data. When a series of measurements of a process are treated as a time series, trend estimation can be used to make and justify statements about tendencies in the data, by relating the measurements to the times at which they occurred. This model can then be used to describe the behaviour of the observed data.

In particular, it may be useful to determine if measurements exhibit an increasing or decreasing trend which is statistically distinguished from random behaviour. Some examples are determining the trend of the daily average temperatures at a given location from winter to summer, and determining the trend in a global temperature series over the last 100 years. In the latter case, issues of homogeneity are important (for example, about whether the series is equally reliable throughout its length).

Fitting a trend: least-squares

Given a set of data and the desire to produce some kind of model of those data, there are a variety of functions that can be chosen for the fit. If there is no prior understanding of the data, then the simplest function to fit is a straight line with the data plotted vertically and values of time (t = 1, 2, 3, ...) plotted horizontally.

Once it has been decided to fit a straight line, there are various ways to do so, but the most usual choice is a least-squares fit. This method minimizes the sum of the squared errors in the data series, denoted the y variable.

Given a set of points in time $t$ , and data values $y_{t}$ observed for those points in time, values of $a$ and $b$ are chosen so that

\sum _{t}\left\{[(at+b)-y_{t}]^{2}\right\}

is minimized. Here at + b is the trend line, so the sum of squared deviations from the trend line is what is being minimized. This can always be done in closed form since this is a case of simple linear regression.

For the rest of this article, “trend” will mean the slope of the least squares line, since this is a common convention.

Trends in random data

Before considering trends in real data, it is useful to understand trends in random data.

Red shaded values are greater than 99% of the rest; blue, 95%; green, 90%. In this case, the V values discussed in the text for (one-sided) 95% confidence is seen to be 0.2.

If a series which is known to be random is analysed – fair dice falls, or computer-generated pseudo-random numbers – and a trend line is fitted through the data, the chances of an exactly zero estimated trend are negligible. But the trend would be expected to be small. If an individual series of observations is generated from simulations that employ a given variance of noise that equals the observed variance of our data series of interest, and a given length (say, 100 points), a large number of such simulated series (say, 100,000 series) can be generated. These 100,000 series can then be analysed individually to calculate estimated trends in each series, and these results establish a distribution of estimated trends that are to be expected from such random data – see diagram. Such a distribution will be normal according to the central limit theorem except in pathological cases. A level of statistical certainty, S, may now be selected – 95% confidence is typical; 99% would be stricter, 90% looser – and the following question can be asked: what is the borderline trend value V that would result in S% of trends being between −V and +V?

The above procedure can be replaced by a permutation test. For this, the set of 100,000 generated series would be replaced by 100,000 series constructed by randomly shuffling the observed data series; clearly such a constructed series would be trend-free, so as with the approach of using simulated data these series can be used to generate borderline trend values V and −V.

In the above discussion the distribution of trends was calculated by simulation, from a large number of trials. In simple cases (normally distributed random noise being a classic) the distribution of trends can be calculated exactly without simulation.

The range (−V, V) can be employed in deciding whether a trend estimated from the actual data is unlikely to have come from a data series that truly has a zero trend. If the estimated value of the regression parameter a lies outside this range, such a result could have occurred in the presence of a true zero trend only, for example, one time out of twenty if the confidence value S=95% was used; in this case, it can be said that, at degree of certainty S, we reject the null hypothesis that the true underlying trend is zero.

However, note that whatever value of S we choose, then a given fraction, 1 − S, of truly random series will be declared (falsely, by construction) to have a significant trend. Conversely, a certain fraction of series that in fact have a non-zero trend will not be declared to have a trend.

Data as trend plus noise

To analyse a (time) series of data, we assume that it may be represented as trend plus noise:

y_{t}=at+b+e_{t}\,

where $a$ and $b$ are unknown constants and the $e$ 's are randomly distributed errors. If one can reject the null hypothesis that the errors are non-stationary, then the non-stationary series {y_t } is called trend stationary. The least squares method assumes the errors to be independently distributed with a normal distribution. If this is not the case, hypothesis tests about the estimated values of a and b may be inaccurate. It is simplest if the $e$ 's all have the same distribution, but if not (if some have higher variance, meaning that those data points are effectively less certain) then this can be taken into account during the least squares fitting, by weighting each point by the inverse of the variance of that point.

In most cases, where only a single time series exists to be analysed, the variance of the $e$ 's is estimated by fitting a trend, thus allowing $at+b$ to be subtracted from the data $y_{t}$ (thus detrending the data) and leaving the residuals $e_t$ as the detrended data, and calculating the variance of the $e_t$ 's from the residuals — this is often the only way of estimating the variance of the $e_t$ 's.

One particular special case of great interest, the (global) temperature time series, is known not to be homogeneous in time: apart from anything else, the number of weather observations has (generally) increased with time, and thus the error associated with estimating the global temperature from a limited set of observations has decreased with time. Though many people do attempt to fit a "trend" to climate data the climate trend is clearly not a straight line and the idea of attributing a straight line is not mathematically correct because the assumptions of the method are not valid in this context.

Once we know the "noise" of the series, we can then assess the significance of the trend by making the null hypothesis that the trend, $a$ , is not significantly different from 0. From the above discussion of trends in random data with known variance, we know the distribution of trends to be expected from random (trendless) data. If the calculated trend, $a$ , is larger than the value, $V$ , then the trend is deemed significantly different from zero at significance level $S$ .

The use of a linear trend line has been the subject of criticism, leading to a search for alternative approaches to avoid its use in model estimation. One of the alternative approaches involves unit root tests and the cointegration technique in econometric studies.

The estimated coefficient associated with a linear time trend variable is interpreted as a measure of the impact of a number of unknown or known but unmeasurable factors on the dependent variable over one unit of time. Strictly speaking, that interpretation is applicable for the estimation time frame only. Outside that time frame, one does not know how those unmeasurable factors behave both qualitatively and quantitatively. Furthermore, the linearity of the time trend poses many questions:

(i) Why should it be linear?

(ii) If the trend is non-linear then under what conditions does its inclusion influence the magnitude as well as the statistical significance of the estimates of other parameters in the model?

(iii) The inclusion of a linear time trend in a model precludes by assumption the presence of fluctuations in the tendencies of the dependent variable over time; is this necessarily valid in a particular context?

(iv) And, does a spurious relationship exist in the model because an underlying causative variable is itself time-trending?

Research results of mathematicians, statisticians, econometricians, and economists have been published in response to those questions. For example, detailed notes on the meaning of linear time trends in regression model are given in Cameron (2005);^[1] Granger, Engle and many other econometricians have written on stationarity, unit root testing, co-integration and related issues (a summary of some of the works in this area can be found in an information paper^[2] by the Royal Swedish Academy of Sciences (2003); and Ho-Trieu & Tucker (1990) have written on logarithmic time trends^[3] with results indicating linear time trends are special cases of cycles^[3]

Noisy time series, and an example

It is harder to see a trend in a noisy time series. For example, if the true series is 0, 1, 2, 3 all plus some independent normally distributed "noise" e of standard deviation E, and we have a sample series of length 50, then if E = 0.1 the trend will be obvious; if E = 100 the trend will probably be visible; but if E = 10000 the trend will be buried in the noise.

If we consider a concrete example, the global surface temperature record of the past 140 years as presented by the IPCC:^[4] then the interannual variation is about 0.2 °C and the trend about 0.6 °C over 140 years, with 95% confidence limits of 0.2 °C (by coincidence, about the same value as the interannual variation). Hence the trend is statistically different from 0. However, as noted elsewhere this time series doesn't conform to the assumptions necessary for least squares to be valid.

Goodness of fit (R-squared) and trend

Illustration of the effect of filtering on r². Black=unfiltered data; red=data averaged every 10 points; blue=data averaged every 100 points. All have the same trend, but more filtering leads to higher r² of fitted trend line.

The least-squares fitting process produces a value – r-squared (r²) – which is the square of the residuals of the data after the fit. It says what fraction of the variance of the data is explained by the fitted trend line. It does not relate to the statistical significance of the trend line (see graph); statistical significance of the trend is determined by its t-statistic. Often, filtering a series increases r² while making little difference to the fitted trend.

Real data need more complicated models

Thus far the data have been assumed to consist of the trend plus noise, with the noise at each data point being independent and identically distributed random variables and to have a normal distribution. Real data (for example climate data) may not fulfill these criteria. This is important, as it makes an enormous difference to the ease with which the statistics can be analysed so as to extract maximum information from the data series. If there are other non-linear effects that have a correlation to the independent variable (such as cyclic influences), the use of least-squares estimation of the trend is not valid. Also where the variations are significantly larger than the resulting straight line trend, the choice of start and end points can significantly change the result. That is, the result is mathematically inconsistent. Statistical inferences (tests for the presence of trend, confidence intervals for the trend, etc.) are invalid unless departures from the standard assumptions are properly accounted for, for example as follows:

Dependence: autocorrelated time series might be modeled using autoregressive moving average models.
Non-constant variance: in the simplest cases weighted least squares might be used.
Non-normal distribution for errors: in the simplest cases a generalised linear model might be applicable.
Unit root: taking first (or occasionally second) differences of the data, with the level of differencing being identified through various unit root tests.^[5]

In R, the linear trend in data can be estimated by using the 'tslm' function.

Notes

↑ "Making Regression More Useful II: Dummies and Trends" (PDF). Retrieved June 17, 2012.
↑ "The Royal Swedish Academy of Sciences" (PDF). 8 October 2003. Retrieved June 17, 2012.
1 2 "Note on the use of a Logarithm Time Trend" (PDF). Retrieved June 17, 2012.
↑ "IPCC Third Assessment Report - Climate Change 2001 - Complete online versions". Retrieved June 17, 2012.
↑ Forecasting: principles and practice. 20 September 2014. Retrieved May 17, 2015.

References

Bianchi, M.; Boyle, M.; Hollingsworth, D. (1999). "A comparison of methods for trend estimation". Applied Economics Letters. 6 (2): 103–109. doi:10.1080/135048599353726.
Cameron, S. (2005). "Making Regression Analysis More Useful, II". Econometrics. Maidenhead: McGraw Hill Higher Education. pp. 171–198. ISBN 0077104285.
Chatfield, C. (1993). "Calculating Interval Forecasts". Journal of Business and Economic Statistics. 11 (2): 121–135. doi:10.1080/07350015.1993.10509938.
Ho-Trieu, N. L.; Tucker, J. (1990). "Another note on the use of a logarithmic time trend". Review of Marketing and Agricultural Economics. 58 (1): 89–90.
Kungl. Vetenskapsakademien (The Royal Swedish Academy of Sciences) (2003). "Time-series econometrics: Cointegration and autoregressive conditional heteroskedasticity". Advanced information on the Bank of Sweden Prize in Economic Sciences in Memory of Alfred Nobel.
Arianos, S.; Carbone, A.; Turk, C. (2011). "Self-similarity of high-order moving averages". Physical Review E. 84: 046113. doi:10.1103/physreve.84.046113.

Statistics

Descriptive statistics

Continuous data

Center	Mean arithmetic geometric harmonic Median Mode

Dispersion	Variance Standard deviation Coefficient of variation Percentile Range Interquartile range

Shape	Moments Skewness Kurtosis L-moments

Count data

Index of dispersion

Summary tables

Dependence

Graphics

Data collection

Study design	Population Statistic Effect size Statistical power Sample size determination Missing data

Survey methodology	Sampling Standard error stratified cluster Opinion poll Questionnaire

Controlled experiments	Design control optimal Controlled trial Randomized Random assignment Replication Blocking Interaction Factorial experiment

Uncontrolled studies	Observational study Natural experiment Quasi-experiment

Statistical inference

Statistical theory

Frequentist inference

Point estimation	Estimating equations Maximum likelihood Method of moments M-estimator Minimum distance Unbiased estimators Mean-unbiased minimum-variance Rao–Blackwellization Lehmann–Scheffé theorem Median unbiased Plug-in

Interval estimation	Confidence interval Pivot Likelihood interval Prediction interval Tolerance interval Resampling Bootstrap Jackknife

Testing hypotheses	1- & 2-tails Power Uniformly most powerful test Permutation test Randomization test Multiple comparisons

Parametric tests	Likelihood-ratio Wald Score

Specific tests

Z (normal) Student's t-test F

Goodness of fit	Chi-squared Kolmogorov–Smirnov Anderson–Darling Normality (Shapiro–Wilk) Likelihood-ratio test Model selection Cross validation AIC BIC

Rank statistics	Sign Sample median Signed rank (Wilcoxon) Hodges–Lehmann estimator Rank sum (Mann–Whitney) Nonparametric anova 1-way (Kruskal–Wallis) 2-way (Friedman) Ordered alternative (Jonckheere–Terpstra)

Bayesian inference

Correlation	Pearson product–moment Partial correlation Confounding variable Coefficient of determination

Regression analysis	Errors and residuals Regression model validation Mixed effects models Simultaneous equations models Multivariate adaptive regression splines (MARS)

Linear regression	Simple linear regression Ordinary least squares General linear model Bayesian regression

Non-standard predictors	Nonlinear regression Nonparametric Semiparametric Isotonic Robust Heteroscedasticity Homoscedasticity

Generalized linear model	Exponential families Logistic (Bernoulli) / Binomial / Poisson regressions

Partition of variance	Analysis of variance (ANOVA, anova) Analysis of covariance Multivariate ANOVA Degrees of freedom

Categorical / Multivariate / Time-series / Survival analysis

Categorical

Multivariate

Time-series

General	Decomposition Trend Stationarity Seasonal adjustment Exponential smoothing Cointegration Structural break Granger causality

Specific tests	Dickey–Fuller Johansen Q-statistic (Ljung–Box) Durbin–Watson Breusch–Godfrey

Time domain	Autocorrelation (ACF) partial (PACF) Cross-correlation (XCF) ARMA model ARIMA model (Box–Jenkins) Autoregressive conditional heteroskedasticity (ARCH) Vector autoregression (VAR)

Frequency domain	Spectral density estimation Fourier analysis Wavelet

Survival

Survival function	Kaplan–Meier estimator (product limit) Proportional hazards models Accelerated failure time (AFT) model First hitting time

Hazard function	Nelson–Aalen estimator

Test	Log-rank test

Applications

Biostatistics	Bioinformatics Clinical trials / studies Epidemiology Medical statistics

Engineering statistics	Chemometrics Methods engineering Probabilistic design Process / quality control Reliability System identification

Social statistics	Actuarial science Census Crime statistics Demography Econometrics National accounts Official statistics Population statistics Psychometrics

Spatial statistics	Cartography Environmental statistics Geographic information system Geostatistics Kriging

Category
Portal
Commons
WikiProject

This article is issued from Wikipedia - version of the 12/3/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.