Basic Concepts
A kernel is a probability density function (pdf) f(x) which is symmetric around the y axis, i.e. f(-x) = f(x).
Kernel density estimation (KDE) is a non-parametric method for estimating the pdf of a random variable from a random sample, using some kernel K and some smoothing parameter (aka bandwidth) h > 0.
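Let {x1, x2, …, xn} be a random sample from some distribution whose pdf f(x) is not known. We estimate f(x) as follows:
f̂(x) = (1/(nh)) · Σ K((x – xi)/h), where the sum runs over i = 1, …, n.
Here the kernel argument is u = (x – xi)/h, i.e. the distance from x to the data value xi measured in units of the bandwidth.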
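As a minimal sketch of the estimator, the following Python code evaluates the standard form f̂(x) = (1/(nh)) · Σ K((x – xi)/h) with a Gaussian kernel. The function names and the sample values are hypothetical, chosen only for illustration; this is not the Real Statistics implementation.

```python
import math

def gaussian_kernel(u):
    """Standard normal pdf, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Kernel density estimate at x: (1/(n*h)) * sum of K((x - x_i)/h)."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)

# Hypothetical sample and bandwidth, for illustration only
sample = [2.1, 2.9, 3.4, 4.0, 4.2, 5.7]
density_at_4 = kde(4.0, sample, h=0.8)
```

Any other kernel can be passed through the kernel argument, as long as it is a function of the single variable u.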
The results are sensitive to the value chosen for h. Rules for choosing an optimum value for h are complex, but the following are some simple guidelines:
- You should use a larger bandwidth value when the sample size is small and the data are sparse. This gives each scaled kernel a larger standard deviation, so the estimate at a point places more weight on the neighboring data values.
- You can use a smaller bandwidth value when the sample size is large and the data are densely packed. This gives each scaled kernel a smaller standard deviation, so the estimate at a point places more weight on the specific data value and less on the neighboring data values.
Bandwidths that are too small result in a pdf that is too spiky, while bandwidths that are too large result in a pdf that is over-smoothed.
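One rough way to see this effect is to count the local maxima of the estimate for a very small and a very large bandwidth. The sketch below assumes a Gaussian kernel; the sample, the grid, and the two bandwidth values are hypothetical and were picked only to make the contrast visible.

```python
import math

def kde(x, sample, h):
    """Gaussian-kernel density estimate at x."""
    c = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)

def count_modes(sample, h, lo, hi, steps=2000):
    """Count strict local maxima of the estimate on an evenly spaced grid."""
    ys = [kde(lo + (hi - lo) * i / steps, sample, h) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Hypothetical sample with two loose clusters
sample = [1.0, 1.2, 1.9, 2.3, 5.8, 6.1, 6.5, 7.2]
modes_small = count_modes(sample, h=0.05, lo=-5.0, hi=13.0)  # spiky estimate
modes_large = count_modes(sample, h=5.0, lo=-5.0, hi=13.0)   # over-smoothed estimate
```

With the small bandwidth the estimate has a separate spike near each data value, while with the large bandwidth the two clusters merge into a single broad bump.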
If f(x) follows a normal distribution then an optimal estimate for h is
h = (4/(3n))^(1/5) · s ≈ 1.06 · s · n^(–1/5)
where s is the standard deviation of the sample and n is the sample size.
Silverman’s optimum estimate of h is
h = 0.9 · s* · n^(–1/5)
where s* = min(s, IQR/1.34), s is the standard deviation of the sample, n is the sample size and IQR is the interquartile range of the sample data.
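The two bandwidth rules can be sketched in Python as follows, assuming their commonly quoted forms h = 1.06 · s · n^(–1/5) for the normal case and h = 0.9 · s* · n^(–1/5) for Silverman's rule. The function names are hypothetical, and the quartiles come from the default convention of Python's statistics.quantiles, which may differ slightly from the convention used by Excel or other software.

```python
import statistics

def normal_reference_bandwidth(sample):
    """h = 1.06 * s * n^(-1/5), with s the sample standard deviation."""
    n = len(sample)
    s = statistics.stdev(sample)
    return 1.06 * s * n ** (-1 / 5)

def silverman_bandwidth(sample):
    """h = 0.9 * s_star * n^(-1/5), with s_star = min(s, IQR/1.34)."""
    n = len(sample)
    s = statistics.stdev(sample)
    # Quartile convention is an assumption: statistics.quantiles default method
    q1, _, q3 = statistics.quantiles(sample, n=4)
    s_star = min(s, (q3 - q1) / 1.34)
    return 0.9 * s_star * n ** (-1 / 5)

# Hypothetical sample, for illustration only
sample = [2.1, 2.9, 3.4, 4.0, 4.2, 5.7, 6.3, 9.8]
h_normal = normal_reference_bandwidth(sample)
h_silverman = silverman_bandwidth(sample)
```

Because s* is never larger than s and 0.9 is smaller than 1.06, Silverman's value never exceeds the normal-reference value for the same sample.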
Commonly used kernels
Some commonly used kernels are listed in Figure 1. Note that seven of the kernels restrict the domain to values |u| ≤ 1 (and are zero outside this domain). The Epanechnikov kernel is the most efficient in some sense that we won’t go into here. The efficiency column in the figure displays the efficiency of each of the kernel choices as a percentage of the efficiency of the Epanechnikov kernel.
Kernel name | Kernel pdf | Restriction | Efficiency |
uniform | K(u) = 1/2 | |u| ≤ 1 | 92.9% |
triangular | K(u) = 1 – |u| | |u| ≤ 1 | 98.6% |
biweight | K(u) = 15(1 – u^2)^2/16 | |u| ≤ 1 | 99.4% |
triweight | K(u) = 35(1 – u^2)^3/32 | |u| ≤ 1 | 98.7% |
tricube | K(u) = 70(1 – |u|^3)^3/81 | |u| ≤ 1 | 99.8% |
Epanechnikov | K(u) = 3(1 – u^2)/4 | |u| ≤ 1 | 100% |
cosine | K(u) = π·cos(π·u/2)/4 | |u| ≤ 1 | 99.9% |
Gaussian | K(u) = exp(–u^2/2)/√(2π) | none | 95.1% |
logistic | K(u) = 1/(e^u + e^(–u) + 2) | none | 88.7% |
sigmoid | K(u) = 2/[π(e^u + e^(–u))] | none | 84.3% |
Silverman | K(u) = exp(–|u|/√2) · sin(|u|/√2 + π/4)/2 | none | N/A |
Figure 1 – Kernels
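As a sanity check on the kernels in Figure 1, the sketch below codes each one in Python and integrates it numerically to confirm that the area under the curve is approximately 1. The cosine kernel is written in its standard form (π/4)·cos(π·u/2) and the Silverman kernel with its usual leading factor of 1/2. The dictionary name, the helper names, and the integration range are assumptions made for this illustration.

```python
import math

def _bounded(f):
    """Wrap f so that the kernel is zero outside |u| <= 1."""
    return lambda u: f(u) if abs(u) <= 1 else 0.0

def _silverman(u):
    a = abs(u) / math.sqrt(2)
    return 0.5 * math.exp(-a) * math.sin(a + math.pi / 4)

KERNELS = {
    "uniform": _bounded(lambda u: 0.5),
    "triangular": _bounded(lambda u: 1 - abs(u)),
    "biweight": _bounded(lambda u: 15 * (1 - u * u) ** 2 / 16),
    "triweight": _bounded(lambda u: 35 * (1 - u * u) ** 3 / 32),
    "tricube": _bounded(lambda u: 70 * (1 - abs(u) ** 3) ** 3 / 81),
    "epanechnikov": _bounded(lambda u: 3 * (1 - u * u) / 4),
    "cosine": _bounded(lambda u: math.pi / 4 * math.cos(math.pi * u / 2)),
    "gaussian": lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi),
    "logistic": lambda u: 1 / (math.exp(u) + math.exp(-u) + 2),
    "sigmoid": lambda u: 2 / (math.pi * (math.exp(u) + math.exp(-u))),
    "silverman": _silverman,
}

def integrate(f, lo=-40.0, hi=40.0, steps=80000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(f(lo + (i + 0.5) * width) for i in range(steps))

# Area under each kernel; every value should be close to 1
areas = {name: integrate(k) for name, k in KERNELS.items()}
```

Each kernel is also symmetric, K(–u) = K(u), which can be checked the same way by comparing KERNELS[name](u) with KERNELS[name](-u) at a few points.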
There is also another version of the Epanechnikov kernel, namely
K(u) = 0.75(1 – u^2/5)/√5 for |u| ≤ √5
and K(u) = 0 if |u| > √5.
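This version is the Epanechnikov kernel from Figure 1 stretched by a factor of √5, which gives it a variance of 1 instead of 1/5. The sketch below checks this numerically; the function names and the midpoint-rule integration are assumptions made for this illustration.

```python
import math

ROOT5 = math.sqrt(5.0)

def epanechnikov_scaled(u):
    """Alternative Epanechnikov kernel, supported on |u| <= sqrt(5)."""
    return 0.75 * (1 - u * u / 5) / ROOT5 if abs(u) <= ROOT5 else 0.0

def moment(f, power, lo, hi, steps=100000):
    """Midpoint-rule approximation of the integral of u^power * f(u)."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * width
        total += (u ** power) * f(u)
    return total * width

area = moment(epanechnikov_scaled, 0, -ROOT5, ROOT5)      # should be close to 1
variance = moment(epanechnikov_scaled, 2, -ROOT5, ROOT5)  # should be close to 1
```

A unit-variance kernel makes the bandwidth h directly comparable to the bandwidth of a Gaussian kernel, which also has variance 1.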
References
Silverman, B. W. (1986) Density estimation for statistics and data analysis. Monographs on Statistics and Applied Probability, London: Chapman and Hall
https://ned.ipac.caltech.edu/level5/March02/Silverman/paper.pdf
Zucchini, W. (2003) Applied smoothing techniques. Part 1: Kernel density estimation
http://staff.ustc.edu.cn/~zwp/teach/Math-Stat/kernel.pdf
Helwig, N. E. (2017) Density and distribution estimation
http://users.stat.umn.edu/~helwig/notes/den-Notes.pdf
Wikipedia (2019) Kernel (statistics)
https://en.wikipedia.org/wiki/Kernel_(statistics)
Thanks for your excellent webpage and the Excel codes.
I work with healthcare expenditure data which is positive, skewed and fat tailed. There are some methods of kernel density estimation based on both the Gamma and Generalised Gamma distribution. Would it be possible to implement them using your different tools?
Thank you in advance for your reply.
Alberto Holly
References:
Chen, S.X. (2000), ‘Probability Density Function Estimation Using Gamma Kernels’, Annals of the Institute of Statistical Mathematics, 52, 471–480
Hirukawa, M., & Sakudo, M. (2015). Family of the generalised gamma kernels: a generator of asymmetric kernels for nonnegative data. Journal of Nonparametric Statistics, 27(1), 41–63.
Hello Alberto,
Real Statistics supports only symmetric kernels. It looks like you are saying that you are looking for a skewed kernel. I am not familiar with the two references that you cited, but the current implementation is probably not adequate for this purpose.
The Real Statistics software supports Gamma and Generalised Gamma Distributions, including fitting data. These might be useful, but I can’t say for sure.
Charles
Many thanks. Most helpful!
Munir
Thanks for these excellent web pages. To remove any ambiguity, I would be grateful for your help in clarifying a couple of points:
(1)
Re the web page http://www.real-statistics.com/distribution-fitting/kernel-density-estimation
Is the variable u (eg, in k (u)) a reference to the standardised z used in general statistics (ie, z = (x – mean)/standard deviation)?
(2)
Re the web page http://www.real-statistics.com/distribution-fitting/kernel-density-estimation/kde-example
Where exactly can one download the Excel example/sheet. It does not seem to be bundled with other downloadable material.
Once again, many thanks. I should like to refer postgraduate researchers I am in contact with to these pages.
(1) u can be any variable. In some cases, the only restriction is |u| <= 1. (2) It is on the Distribution worksheet (see Distribution Fitting). You can download it from https://www.real-statistics.com/free-download/real-statistics-examples-workbook/
Charles