Adaptive sampling

Adaptive sampling [a1] is a probabilistic algorithm invented by M. Wegman (unpublished) around 1980. It provides an unbiased estimator of the number of distinct elements (the "cardinality") of a file (a sequence of data items) of potentially large size that contains unpredictable replications. The algorithm is useful in data-base query optimization and in information retrieval. By standard hashing techniques [a3], [a6] the problem reduces to the following.

A sequence $(h_1, \dots, h_n)$ of real numbers is given. The sequence has been formed by drawing independently and randomly an unknown number $N$ of real numbers from $[0,1]$, after which the elements are replicated and permuted in some unknown fashion. The problem is to estimate the cardinality $N$ in a computationally efficient manner.
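
For concreteness, here is a minimal Python sketch of this reduction, assuming a generic hash function (SHA-256 is an arbitrary choice) that maps items to practically uniform reals in $[0,1)$; equal items hash to equal values, so distinct hashed values correspond to distinct items:

```python
import hashlib

def to_unit_interval(item: str) -> float:
    """Hash an item to a practically uniform real in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2^64).
    return int.from_bytes(digest[:8], "big") / 2**64

# A file with unpredictable replications.
items = ["the", "cat", "the", "dog", "the", "cat"]
h = [to_unit_interval(w) for w in items]  # the sequence (h_1, ..., h_n)
```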

Three algorithms can perform this task.

1) Straight scan computes incrementally the sets $U_j = \{h_1, \dots, h_j\}$, where replications are eliminated on the fly. (This can be achieved by keeping the successive $U_j$ in sorted order.) The cardinality is then determined exactly by $N = \operatorname{card}(U_n)$, but the auxiliary memory needed is $N$, which may be as large as $n$, resulting in a complexity that is prohibitive in many applications.
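
A minimal sketch of straight scan (here the running set $U_j$ is kept as a hash set rather than in sorted order, an implementation choice):

```python
def straight_scan(h):
    """Exact cardinality: replications are eliminated on the fly.

    The auxiliary memory grows with the number of distinct
    values, which may be as large as n.
    """
    u = set()       # U_j after processing h_1, ..., h_j
    for x in h:
        u.add(x)    # duplicates are absorbed
    return len(u)   # N = card(U_n)
```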

2) Static sampling is based on a fixed sampling ratio $p$, where $0 < p \leq 1$ (e.g., $p = 1/100$). One computes sequentially the samples $U_j^{*} = \{h_1, \dots, h_j\} \cap [0, p]$. The cardinality estimate returned is $N^{*} = \operatorname{card}(U_n^{*})/p$. The estimator $N^{*}$ is unbiased and the memory used is $Np$ on average.
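
In the same vein, a sketch of static sampling with the ratio $p$ as a parameter:

```python
def static_sampling(h, p=0.01):
    """Unbiased cardinality estimate from a fixed sampling ratio p.

    Each distinct value lands in [0, p] with probability p, so
    card(U_n^*) / p is unbiased; the expected memory is N * p.
    """
    sample = {x for x in h if x <= p}   # U_n^* = {h_1, ..., h_n} ∩ [0, p]
    return len(sample) / p              # N^* = card(U_n^*) / p
```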

3) Adaptive sampling is based on a design parameter $b \geq 2$ (e.g., $b = 100$) and maintains a dynamically changing sampling rate $p$ and a sequence of samples $U_j^{**}$. Initially, $p = 1$ and $U_0^{**} = \emptyset$. The rule is like that of static sampling, but with $p$ divided by $2$ each time the cardinality of $U_j^{**}$ would exceed $b$, and with $U_j^{**}$ modified accordingly in order to contain only $U_j \cap [0, p]$. The estimator $N^{**} = \operatorname{card}(U_n^{**})/p$ (where the final value of $p$ is used) is proved to be unbiased, and the memory used is at most $b$.
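
A sketch of the adaptive rule itself, under the same assumptions as above; the halving loop covers the rare case where one halving of $p$ does not bring the sample back down to $b$ elements:

```python
def adaptive_sampling(h, b=100):
    """Unbiased cardinality estimate keeping at most b sampled values."""
    p = 1.0
    sample = set()                  # U_j^{**}; initially empty, with p = 1
    for x in h:
        if x <= p:
            sample.add(x)
            while len(sample) > b:  # the sample would exceed b elements:
                p /= 2              # halve the sampling rate ...
                sample = {y for y in sample if y <= p}  # ... keep U_j ∩ [0, p]
    return len(sample) / p          # N^{**}, using the final value of p
```

Each halving only re-tests the at most $b$ retained values, so the pruning step never touches the full file again.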

The accuracy of any such unbiased estimator $\widetilde{N}$ of $N$ is measured by the standard deviation of $\widetilde{N}$ divided by $N$. For adaptive sampling, the accuracy is almost constant as a function of $N$ and asymptotically close to

$$ \frac{1.20}{\sqrt{b}}, $$

a result established in [a1] by generating functions and Mellin transform techniques. An alternative algorithm, called probabilistic counting [a2], provides an estimator $N^{***}$ of cardinalities that is unbiased only asymptotically but has a better accuracy, of about $0.78/\sqrt{b}$.
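
For instance, with $b = 100$ the relative standard error of adaptive sampling is about $1.20/\sqrt{100} = 0.12$, i.e. $12\%$, while probabilistic counting with the same memory achieves about $0.78/\sqrt{100} \approx 8\%$.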

Typically, the adaptive sampling algorithm can be applied to gather statistics on word usage in a large text. Well-designed hashing transformations are known to fulfill the uniformity assumption well in practice [a4]. A general perspective on probabilistic algorithms may be found in [a5].

References

[a1] P. Flajolet, "On adaptive sampling", Computing, 43 (1990) pp. 391–400
[a2] P. Flajolet, G.N. Martin, "Probabilistic counting algorithms for data base applications", J. Comp. System Sci., 31:2 (1985) pp. 182–209
[a3] D.E. Knuth, "The art of computer programming", 3. Sorting and searching, Addison-Wesley (1973)
[a4] V.Y. Lum, P.S.T. Yuen, M. Dodd, "Key to address transformations: a fundamental study based on large existing format files", Commun. ACM, 14 (1971) pp. 228–239
[a5] R. Motwani, P. Raghavan, "Randomized algorithms", Cambridge Univ. Press (1995)
[a6] R. Sedgewick, "Algorithms", Addison-Wesley (1988) (Second edition)
This article was adapted from an original article by Ph. Flajolet (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.