Difference between revisions of "Kendall tau metric"

From Encyclopedia of Mathematics

Revision as of 13:03, 9 February 2021

2020 Mathematics Subject Classification: Primary: 62H20

Kendall tau

The non-parametric correlation coefficient (or measure of association) known as Kendall's tau was first discussed by G.T. Fechner and others about 1900, and was rediscovered (independently) by M.G. Kendall in 1938 [a3], [a4]. In modern use, the term "correlation" refers to a measure of a linear relationship between variates (such as the Pearson product-moment correlation coefficient), while "measure of association" refers to a measure of a monotone relationship between variates (such as Kendall's tau and the Spearman rho metric). For a historical review of Kendall's tau and related coefficients, see [a5].

Underlying the definition of Kendall's tau is the notion of concordance. If $( x_j, y_j )$ and $( x _ { k } , y _ { k } )$ are two elements of a sample $\{ ( x _ { i } , y _ { i } ) \} _ { i = 1 } ^ { n }$ from a bivariate population, one says that $( x_j, y_j )$ and $( x _ { k } , y _ { k } )$ are concordant if $x _ { j } < x _ { k }$ and $y _ { j } < y _ { k }$ or if $x _ { j } > x _ { k }$ and $y _ { j } > y _ { k }$ (i.e., if $( x _ { j } - x _ { k } ) ( y _ { j } - y _ { k } ) > 0$); and discordant if $x _ { j } < x _ { k }$ and $y _ { j } > y _ { k }$ or if $x _ { j } > x _ { k }$ and $y _ { j } < y _ { k }$ (i.e., if $( x _ { j } - x _ { k } ) ( y _ { j } - y _ { k } ) < 0$). There are $\left( \begin{array} { l } { n } \\ { 2 } \end{array} \right)$ distinct pairs of observations in the sample, and each pair (barring ties) is either concordant or discordant. Let $c$ be the number of concordant pairs, $d$ the number of discordant pairs, and $S = c - d$. In the absence of ties, $c + d = \left( \begin{array} { l } { n } \\ { 2 } \end{array} \right)$, and Kendall's tau for the sample is defined as

\begin{equation*} \tau _ { n } = \frac { c - d } { c + d } = \frac { S } { \left( \begin{array} { l } { n } \\ { 2 } \end{array} \right) } = \frac { 2 S } { n ( n - 1 ) } \end{equation*}
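
As an illustration, the definition can be sketched in Python by direct enumeration of all pairs. The function name `kendall_tau_no_ties` and the sample data are hypothetical, introduced here only for illustration; the sketch assumes that there are no ties and takes $O(n^2)$ time.

```python
from itertools import combinations


def kendall_tau_no_ties(xs, ys):
    """Sample Kendall tau for paired data without ties: 2S / (n(n-1))."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    c = d = 0
    # Visit each of the n(n-1)/2 distinct pairs of observations once.
    for (xj, yj), (xk, yk) in combinations(zip(xs, ys), 2):
        prod = (xj - xk) * (yj - yk)
        if prod > 0:
            c += 1  # concordant pair
        elif prod < 0:
            d += 1  # discordant pair
        else:
            raise ValueError("tied observations; this sketch assumes no ties")
    s = c - d
    return 2 * s / (n * (n - 1))


# Hypothetical sample of n = 5 observations.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 2, 5, 4]
print(kendall_tau_no_ties(xs, ys))
```

For this sample $c = 7$ and $d = 3$, so $S = 4$ and $\tau _ { 5 } = 8 / 20 = 0.4$.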

When ties exist in the data, the following adjusted formula is used:

\begin{equation*} \tau _ { n } = \frac { S } { \sqrt { n ( n - 1 ) / 2 - T } \sqrt { n ( n - 1 ) / 2 - U } }, \end{equation*}

where $T = \sum _ { t } t ( t - 1 ) / 2$, the sum being taken over the groups of tied $X$ observations with $t$ the number of observations in a group, and $U = \sum _ { u } u ( u - 1 ) / 2$, defined in the same way for the groups of tied $Y$ observations with $u$ the number of observations in a group. For details on the use of $\tau _ { n }$ in hypothesis testing, and for large-sample theory, see [a2].
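
The tie-adjusted formula can be sketched along the same lines. The function name `kendall_tau_ties` and the data are hypothetical; the sketch assumes that a pair tied in $X$ or in $Y$ contributes zero to $S$, and it reduces to the unadjusted formula when $T = U = 0$.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_ties(xs, ys):
    """Tie-adjusted sample Kendall tau: S / (sqrt(n0 - T) * sqrt(n0 - U))."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    s = 0
    for (xj, yj), (xk, yk) in combinations(zip(xs, ys), 2):
        prod = (xj - xk) * (yj - yk)
        # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie.
        s += (prod > 0) - (prod < 0)
    # T and U count the pairs tied in x and in y: sum of t(t-1)/2 per tie group.
    t = sum(m * (m - 1) // 2 for m in Counter(xs).values())
    u = sum(m * (m - 1) // 2 for m in Counter(ys).values())
    n0 = n * (n - 1) // 2
    denom = sqrt(n0 - t) * sqrt(n0 - u)
    if denom == 0:
        raise ValueError("all x values or all y values are tied")
    return s / denom


# Hypothetical sample with one tie group in x and one in y.
print(kendall_tau_ties([1, 2, 2, 3], [1, 2, 3, 3]))
```

Here $S = 4$, $T = U = 1$ and $n ( n - 1 ) / 2 = 6$, so the value is $4 / 5$ up to floating-point rounding.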

Note that $\tau _ { n }$ is equal to the probability of concordance minus the probability of discordance for a pair of observations $( x_j, y_j )$ and $( x _ { k } , y _ { k } )$ chosen randomly from the sample $\{ ( x _ { i } , y _ { i } ) \} _ { i = 1 } ^ { n }$. The population version $\tau$ of Kendall's tau is defined similarly for random variables $X$ and $Y$ (cf. also Random variable). Let $( X _ { 1 } , Y _ { 1 } )$ and $( X _ { 2 } , Y _ { 2 } )$ be independent random vectors with the same distribution as $( X , Y )$. Then

\begin{equation*} \tau = \mathsf{P} [ ( X _ { 1 } - X _ { 2 } ) ( Y _ { 1 } - Y _ { 2 } ) > 0 ] - \mathsf{P} [ ( X _ { 1 } - X _ { 2 } ) ( Y _ { 1 } - Y _ { 2 } ) < 0 ] = \operatorname { corr } [ \operatorname { sign } ( X _ { 1 } - X _ { 2 } ) , \operatorname { sign } ( Y _ { 1 } - Y _ { 2 } ) ]. \end{equation*}

Since $\tau$ is the Pearson product-moment correlation coefficient of the random variables $\operatorname { sign } ( X _ { 1 } - X _ { 2 } )$ and $\operatorname { sign } ( Y _ { 1 } - Y _ { 2 } )$, $\tau$ is sometimes called the difference sign correlation coefficient.
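
The identification of $\tau$ with a correlation of signs can be checked directly when $X$ and $Y$ are continuous (an assumption made for this sketch, so that ties have probability zero). Write $A = \operatorname { sign } ( X _ { 1 } - X _ { 2 } )$ and $B = \operatorname { sign } ( Y _ { 1 } - Y _ { 2 } )$; these labels are introduced here only for illustration. Since $X _ { 1 } - X _ { 2 }$ is symmetric about zero, $A$ takes the values $\pm 1$ with probability $1 / 2$ each, and likewise for $B$, so that

\begin{equation*} \mathsf{E} [ A ] = \mathsf{E} [ B ] = 0 , \qquad \operatorname { var } [ A ] = \operatorname { var } [ B ] = 1 , \end{equation*}

and therefore

\begin{equation*} \operatorname { corr } [ A , B ] = \mathsf{E} [ A B ] = \mathsf{P} [ A B = 1 ] - \mathsf{P} [ A B = - 1 ] = \tau . \end{equation*}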

When $X$ and $Y$ are continuous,

\begin{equation*} \tau = 4 \int _ { 0 } ^ { 1 } \int _ { 0 } ^ { 1 } C _ { X , Y } ( u , v ) d C _ { X , Y } ( u , v ) - 1, \end{equation*}

where $C _ { X , Y }$ is the copula of $X$ and $Y$. Consequently, $\tau$ is invariant under strictly increasing transformations of $X$ and $Y$, a property $\tau$ shares with Spearman's rho, but not with the Pearson product-moment correlation coefficient. For a survey of copulas and their relationship with measures of association, see [a6].
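
The invariance property has a sample analogue that is easy to check numerically: a strictly increasing transformation preserves the sign of every difference $x _ { j } - x _ { k }$, hence every concordance. The following sketch uses a hypothetical helper `sample_tau` and hypothetical data, with $x \mapsto e ^ { x }$ and $y \mapsto y ^ { 3 }$ chosen arbitrarily as strictly increasing maps.

```python
from itertools import combinations
from math import exp


def sample_tau(xs, ys):
    """Sample Kendall tau 2S / (n(n-1)); tied pairs contribute 0 to S."""
    n = len(xs)
    s = 0
    for (xj, yj), (xk, yk) in combinations(zip(xs, ys), 2):
        prod = (xj - xk) * (yj - yk)
        s += (prod > 0) - (prod < 0)
    return 2 * s / (n * (n - 1))


# Hypothetical sample without ties.
xs = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7]
ys = [1.1, -0.7, 0.2, 2.4, -1.5, 0.8]

original = sample_tau(xs, ys)
# Apply strictly increasing transformations to each coordinate.
transformed = sample_tau([exp(x) for x in xs], [y ** 3 for y in ys])
print(original, transformed)
```

Both values coincide, since only the ordering of the observations enters the computation.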

Besides Kendall's tau, there are other measures of association based on the notion of concordance, one of which is Blomqvist's coefficient [a1]. Let $\{ ( x _ { i } , y _ { i } ) \} _ { i = 1 } ^ { n }$ denote a sample from a continuous bivariate population, and let $\tilde{x}$ and $\tilde{y}$ denote sample medians (cf. also Median (in statistics)). Divide the $( x , y )$-plane into four quadrants with the lines $x = \tilde { x }$ and $y = \tilde { y }$; and let $n_ 1$ be the number of sample points belonging to the first or third quadrants, and $n_{2}$ the number of points belonging to the second or fourth quadrants. If the sample size $n$ is even, the calculation of $n_ 1$ and $n_{2}$ is evident. If $n$ is odd, then one or two of the sample points fall on the lines $x = \tilde { x }$ and $y = \tilde { y }$. In the first case one ignores the point; in the second case one assigns one point to the quadrant touched by both points and ignores the other. Then Blomqvist's $q$ is defined as

\begin{equation*} q = \frac { n_1 - n_2 } { n_1 + n_2 }. \end{equation*}

For details on the use of $q$ in hypothesis testing, and for large-sample theory, see [a1].
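
A minimal Python sketch of the computation of $q$ follows. The function name `blomqvist_q` and the data are hypothetical. As a simplification, every sample point lying on one of the median lines is skipped; this agrees with the rule above when a single point falls on the lines, but it does not implement the reassignment described for the case of two such points.

```python
from statistics import median


def blomqvist_q(xs, ys):
    """Blomqvist's q = (n1 - n2) / (n1 + n2) about the sample medians."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_med = median(xs)
    y_med = median(ys)
    n1 = n2 = 0
    for x, y in zip(xs, ys):
        prod = (x - x_med) * (y - y_med)
        if prod > 0:
            n1 += 1  # first or third quadrant
        elif prod < 0:
            n2 += 1  # second or fourth quadrant
        # prod == 0: the point lies on a median line and is skipped
        # (simplifying assumption of this sketch).
    if n1 + n2 == 0:
        raise ValueError("no sample points off the median lines")
    return (n1 - n2) / (n1 + n2)


# Hypothetical sample with even n, so that no point falls on a median line.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
print(blomqvist_q(xs, ys))
```

For this sample both medians equal $3.5$, $n _ { 1 } = 4$ and $n _ { 2 } = 2$, so $q = 2 / 6 = 1 / 3$.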

The population parameter estimated by $q$, denoted by $\beta$, is defined analogously to the population version of Kendall's tau. If $\tilde{X}$ and $\tilde{Y}$ denote the population medians of $X$ and $Y$, then

\begin{equation*} \beta = \mathsf{P} [ ( X - \tilde { X } ) ( Y - \tilde { Y } ) > 0 ] - \mathsf{P} [ ( X - \tilde { X } ) ( Y - \tilde { Y } ) < 0 ] = 4 F_{X,Y}(\tilde{X},\tilde{Y}) - 1, \end{equation*}

where $F_{ X , Y}$ denotes the joint distribution function of $X$ and $Y$. Since $\beta$ depends only on the value of $F_{ X , Y}$ at the point whose coordinates are the population medians of $X$ and $Y$, it is sometimes called the medial correlation coefficient. When $X$ and $Y$ are continuous,

\begin{equation*} \beta = 4 C _ { X , Y } \left( \frac { 1 } { 2 } , \frac { 1 } { 2 } \right) - 1, \end{equation*}

where $C _ { X , Y }$ again denotes the copula of $X$ and $Y$. Thus $\beta$, like $\tau$, is invariant under strictly increasing transformations of $X$ and $Y$.
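
As a worked illustration of the copula formula for $\beta$, consider three standard copulas, denoted here by the customary symbols $\Pi$, $M$ and $W$ (notation not used elsewhere in this article). For independent $X$ and $Y$ the copula is $\Pi ( u , v ) = u v$, and

\begin{equation*} \beta = 4 \cdot \frac { 1 } { 4 } - 1 = 0 . \end{equation*}

When $Y$ is almost surely a strictly increasing function of $X$, the copula is $M ( u , v ) = \min ( u , v )$, and

\begin{equation*} \beta = 4 \cdot \frac { 1 } { 2 } - 1 = 1 . \end{equation*}

When $Y$ is almost surely a strictly decreasing function of $X$, the copula is $W ( u , v ) = \max ( u + v - 1 , 0 )$, and

\begin{equation*} \beta = 4 \cdot 0 - 1 = - 1 . \end{equation*}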

References

[a1] N. Blomqvist, "On a measure of dependence between two random variables", Ann. Math. Stat., 21 (1950) pp. 593–600
[a2] J.D. Gibbons, "Nonparametric methods for quantitative analysis", Holt, Rinehart & Winston (1976)
[a3] M.G. Kendall, "A new measure of rank correlation", Biometrika, 30 (1938) pp. 81–93
[a4] M.G. Kendall, "Rank correlation methods", 4th ed., Charles Griffin (1970)
[a5] W.H. Kruskal, "Ordinal measures of association", J. Amer. Statist. Assoc., 53 (1958) pp. 814–861
[a6] R.B. Nelsen, "An introduction to copulas", Springer (1999)
How to Cite This Entry:
Kendall tau metric. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Kendall_tau_metric&oldid=50721
This article was adapted from an original article by R.B. Nelsen (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.