
The Fundamentals of Sampling

In statistics and finance, we often want to learn about a large group — for example, all stocks in an index, all bond yields in a market, or all investor returns over a decade. This large group is called the population.

Studying every single member of the population (a census) is usually impractical due to time, cost, or logistical constraints. Instead, we select a smaller, manageable subset called a sample. The process of selecting this subset is called sampling.

Key Definitions

  • Population: The complete set of items or individuals we're interested in.
  • Sample: A subset of the population that we actually observe and analyze.
  • Parameter: A fixed, unknown numerical measure that describes a characteristic of the population.
    • Example: The true mean return of all S&P 500 stocks.
  • Sample Statistic: A value calculated from the sample data, used to estimate the population parameter.
    • Example: The average return of 50 randomly selected S&P 500 stocks.

Example: Estimating Average Dividend Yield

Suppose you want to know the average dividend yield of all 2,000 stocks in the Russell 2000 index.

You select a sample of 100 stocks and find their average dividend yield is 2.3%.

Interpretation: You don't know the true population mean, but you use 2.3% as your best estimate of it.
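This estimation step can be sketched with a simple random sample; the population values below are simulated stand-ins (the real Russell 2000 yields are not given in the text):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Simulated stand-in for the population: dividend yields (%) of 2,000 stocks.
population_yields = rng.gamma(shape=2.0, scale=1.2, size=2000)

# Simple random sample of 100 stocks, drawn without replacement.
sample = rng.choice(population_yields, size=100, replace=False)

# The sample statistic estimates the (normally unknown) population parameter.
print(f"Sample mean yield (statistic):     {sample.mean():.2f}%")
print(f"Population mean yield (parameter): {population_yields.mean():.2f}%")
```

In practice you would only observe the sample; the simulation just lets us confirm that the statistic lands close to the parameter.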

Why Sampling Is Essential

Sampling is far more efficient than a full census. Benefits include:

  • Cost-effective — saves money and resources
  • Time-efficient — results are available faster
  • Reduces data collection errors — easier to manage quality control
  • Often just as accurate — if the sample is well-chosen

Sampling Methods: How to Choose a Sample

The quality of your statistical conclusions depends heavily on how you select your sample. There are two broad categories: probability sampling and non-probability sampling.

Probability vs. Non-Probability Sampling

  • Probability Sampling: Every member of the population has a known, non-zero chance of being selected. These methods allow for valid statistical inference because they minimize selection bias.
  • Non-Probability Sampling: Selection is based on convenience, judgment, or other non-random criteria. These methods are quick and easy but cannot support reliable inference about the population.

Types of Probability Sampling

  • Simple Random Sampling: Every possible sample of size \(n\) has an equal chance of being selected; can be done using random number generators.
    • ✔️ Pro: Truly random and unbiased.
    • ❌ Con: May, by chance, produce a non-representative sample (e.g., all tech stocks).
  • Stratified Random Sampling: The population is divided into homogeneous subgroups (strata) based on a key variable (e.g., sector, risk level). A random sample is then drawn from each stratum, proportional to its size.
    • ✔️ Pro: More precise and representative than simple random sampling.
    • ❌ Con: Requires prior knowledge of the population structure.
  • Cluster Sampling: The population is divided into clusters (e.g., geographic regions). A random set of clusters is selected, and individuals are sampled from within them.
    • One-stage: All members of selected clusters are included.
    • Two-stage: A random subsample is drawn from each selected cluster.
    • ✔️ Pro: Cost-effective for large, dispersed populations.
    • ❌ Con: Less accurate than stratified sampling due to potential cluster heterogeneity.
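A minimal sketch of proportional stratified sampling, using a made-up sector breakdown (the sector counts are illustrative, not from the text):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(seed=0)

# Illustrative population: 1,000 stocks tagged with a sector (the key variable).
sectors = ["Tech"] * 500 + ["Financials"] * 300 + ["Energy"] * 200

# Build the strata: one group of stock indices per sector.
strata = defaultdict(list)
for idx, sector in enumerate(sectors):
    strata[sector].append(idx)

# Proportional allocation: each stratum contributes according to its share.
total_n, pop_n = 100, len(sectors)
sample = []
for sector, members in strata.items():
    n_stratum = round(total_n * len(members) / pop_n)
    sample.extend(rng.choice(members, size=n_stratum, replace=False).tolist())

print(f"Sampled {len(sample)} stocks: "
      + ", ".join(f"{s}={round(total_n * len(m) / pop_n)}" for s, m in strata.items()))
```

With a 50/30/20 sector split and a total sample of 100, the allocation works out to 50, 30, and 20 stocks per stratum, so every sector is represented in proportion to its size.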

Types of Non-Probability Sampling

  • Convenience Sampling: Individuals are selected based on ease of access (e.g., surveying people at a coffee shop).
    • ✔️ Pro: Fast and inexpensive.
    • ❌ Con: High risk of bias; not representative.
  • Judgmental (Purposive) Sampling: The researcher uses their expertise to select a sample they believe is representative.
    • ✔️ Pro: Can be useful in exploratory research.
    • ❌ Con: Subject to researcher bias; not random.

The Central Limit Theorem (CLT) and Inference

The Central Limit Theorem (CLT) is one of the most powerful and foundational concepts in statistics. It allows us to make reliable inferences about population parameters even when we don't know the shape of the population distribution.

What Is the Central Limit Theorem?

The CLT states that:

For any population with mean \(\mu\) and finite variance \(\sigma^2\),
the sampling distribution of the sample mean \(\bar{X}\)
will be approximately normally distributed as the sample size \(n\) increases,
regardless of the shape of the population distribution.

Rule of Thumb: A sample size of \(n \geq 30\) is generally sufficient for the approximation to be good.

Properties of the Sampling Distribution of the Sample Mean

Let's define the sampling distribution of the sample mean — that is, the distribution of all possible sample means of size \(n\).

  • Mean: The mean of the sample means equals the population mean.
    \( \mu_{\bar{X}} = \mu \)
  • Variance: The variance of the sample means is the population variance divided by the sample size.
    \( \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n} \)
  • Standard Deviation (Standard Error): This is the most important practical measure.
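These properties can be checked empirically with a quick simulation; the exponential population below is an arbitrary choice of a skewed, clearly non-normal distribution:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Skewed, non-normal population: Exp(1), which has mu = 1 and sigma^2 = 1.
mu, sigma2 = 1.0, 1.0
n = 30                 # sample size (the usual rule-of-thumb threshold)
num_samples = 100_000  # number of repeated samples of size n

# Build the sampling distribution of the mean: many samples, one mean each.
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

print(f"Mean of sample means:     {sample_means.mean():.4f} (theory: {mu:.4f})")
print(f"Variance of sample means: {sample_means.var():.4f} (theory: {sigma2 / n:.4f})")
```

The simulated mean of the sample means converges on \(\mu\) and their variance on \(\sigma^2/n\), even though the underlying population is heavily skewed.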

Standard Error of the Sample Mean

The standard error (SE) is the standard deviation of the sampling distribution of the sample mean. It tells us how much the sample mean is likely to vary from the true population mean due to random sampling.

If population standard deviation \(\sigma\) is known: \( SE = \frac{\sigma}{\sqrt{n}} \)
If \(\sigma\) is unknown (most real-world cases): \( SE = \frac{s}{\sqrt{n}} \)

Where:

  • \(s\) = sample standard deviation
  • \(n\) = sample size

Example: Calculating Standard Error

You sample 49 stocks and find:

  • Sample mean return = 8%
  • Sample standard deviation = 14%

Since the population standard deviation is unknown, use the sample SD:

\( SE = \frac{14\%}{\sqrt{49}} = \frac{14\%}{7} = 2\% \)

Interpretation: On average, the sample mean will differ from the true population mean by about 2 percentage points.
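Reproducing the arithmetic above:

```python
import math

sample_sd = 14.0  # sample standard deviation, in percent
n = 49            # number of stocks sampled

# Population sigma is unknown, so use the sample SD: SE = s / sqrt(n)
standard_error = sample_sd / math.sqrt(n)
print(f"SE = {standard_error:.1f}%")  # SE = 2.0%
```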

Standard Deviation vs. Standard Error

Don't confuse these two:

  • Standard Deviation (SD): Measures how spread out the individual data points are in a single sample.
  • Standard Error (SE): Measures how spread out the sample means are across many samples — it's about the precision of the mean estimate.

Resampling Techniques

Resampling involves repeatedly drawing samples from the observed data to estimate the variability of a statistic (like the mean or standard error) without relying on strict assumptions about the population distribution.

Bootstrapping

Bootstrapping is a powerful resampling method that uses sampling with replacement from the original dataset.

  • You take many random samples (e.g., 10,000) of the same size as the original, drawing with replacement.
  • For each bootstrap sample, you calculate the statistic of interest (e.g., mean, median).
  • The distribution of these bootstrap statistics estimates the sampling distribution.

Uses: Estimating standard errors, confidence intervals, and bias — especially when parametric assumptions (like normality) are questionable.

Example: Bootstrapping the Mean

You have 30 daily returns. To estimate the 95% confidence interval for the mean:

  1. Draw 1,000 bootstrap samples (each with 30 returns, with replacement).
  2. Calculate the mean of each sample.
  3. Take the 2.5th and 97.5th percentiles of the 1,000 means.

This gives you a robust confidence interval without assuming normality.
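A sketch of those three steps, using simulated returns in place of the 30 observed ones:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Simulated stand-in for the 30 observed daily returns (in %).
returns = rng.normal(loc=0.05, scale=1.2, size=30)

# Step 1: 1,000 bootstrap samples, each 30 draws WITH replacement.
n_boot = 1_000
boot_samples = rng.choice(returns, size=(n_boot, returns.size), replace=True)

# Step 2: the mean of each bootstrap sample.
boot_means = boot_samples.mean(axis=1)

# Step 3: the 2.5th and 97.5th percentiles bracket the 95% CI.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lower:.3f}%, {upper:.3f}%]")
```

Nothing here assumes the returns are normal; the interval comes entirely from the empirical distribution of the resampled means.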

Jackknife

The Jackknife is another resampling technique that works by leaving out one observation at a time.

  • For a sample of size \(n\), you create \(n\) subsamples, each missing one different observation.
  • You calculate the statistic (e.g., mean) for each subsample.
  • These values are used to estimate bias and standard error.
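A minimal jackknife sketch for the mean, again with simulated data; a useful sanity check is that, for the sample mean, the jackknife standard error reduces exactly to \(s/\sqrt{n}\):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
data = rng.normal(loc=8.0, scale=14.0, size=20)  # simulated returns, in %
n = data.size

# Leave-one-out: the i-th jackknife subsample omits observation i.
jack_means = np.array([np.delete(data, i).mean() for i in range(n)])

# Jackknife estimate of the standard error of the mean.
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))

# For the mean, this agrees exactly with the classic s / sqrt(n).
classic_se = data.std(ddof=1) / np.sqrt(n)
print(f"Jackknife SE: {jack_se:.4f}   s/sqrt(n): {classic_se:.4f}")
```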

Bootstrapping vs. Jackknife: Key Difference

  • Jackknife: Always produces exactly \(n\) samples for a sample of size \(n\). Deterministic.
  • Bootstrapping: You choose how many samples to draw (e.g., 1,000 or 10,000). Stochastic (random).

Bootstrapping is more flexible and more widely used; the jackknife is simpler but less powerful for complex statistics.

Progress:
Chapter 7 of 11