This project provides implementations of the `plfit` and `plpva` functions, which are essential tools for fitting power-law distributions to empirical data. Power-law distributions are significant in various fields, including physics, biology, economics, and social sciences, because they describe a wide range of natural and man-made phenomena. Understanding and accurately fitting these distributions allows researchers to better model complex systems and predict future events based on observed data.
The `plfit` function is designed to fit a power-law distribution to a given data set using maximum likelihood estimation (MLE). This method is known for providing robust parameter estimates, particularly in the presence of large fluctuations in the tail of the distribution, which are characteristic of power-law behavior. The function determines the scaling parameter $\alpha$ and the lower bound $x_{min}$ above which the power law holds.
The `plpva` function complements `plfit` by performing a statistical test to assess the goodness-of-fit of the power-law model. Using a p-value derived from the Kolmogorov–Smirnov (KS) statistic, `plpva` quantifies how well the power-law distribution matches the empirical data compared to synthetic datasets generated from the fitted model. This allows researchers to judge whether the power-law hypothesis is plausible for their data.
This project builds upon the methodologies presented in the research paper "Power-Law Distributions in Empirical Data" by Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman (2009). The paper provides a principled statistical framework for detecting and quantifying power-law behavior in empirical data, combining MLE with goodness-of-fit tests. Our MATLAB implementation closely follows the algorithms and statistical techniques discussed in the paper.
The MATLAB code for this project has been further developed based on the code used in another paper, "Molecular motors robustly drive active gels to a critically connected state". That study used the power-law fit MATLAB code to produce key figures, demonstrating the practical application of these functions in current scientific research. This project aims to document and expand upon this code, providing detailed explanations and enhanced usability for researchers and analysts.
By documenting and explaining the power-law fitting process in detail, this project serves as a comprehensive resource for those looking to understand and apply power-law models to their data. Whether for academic research, data analysis, or practical applications, the tools provided here offer robust solutions for fitting and validating power-law distributions.
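A power-law distribution is a type of probability distribution that has the form $p(x) \propto x^{-\alpha}$, where $\alpha$ is the scaling parameter (the exponent of the power law).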
A power-law distribution can be characterized differently depending on whether the data is continuous or discrete.
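In the continuous case, the variable $x$ can take any real value $x \geq x_{min}$, and $p(x)$ is a probability density.
In the discrete case, the variable $x$ takes only integer values $x \geq x_{min}$, and $p(x)$ is a probability mass function.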
Power-law distributions are crucial because they often describe the distribution of various types of data, such as:
- Earthquake magnitudes.
- Word frequencies in natural language.
- Wealth distribution.
- Sizes of cities and towns.
- Internet traffic patterns.
Understanding power-law distributions allows researchers and analysts to better comprehend the underlying mechanisms that generate such data and to make more accurate predictions and models.
The `plfit` function fits a power-law distribution to empirical data using maximum likelihood estimation. It determines the scaling parameter $\alpha$ and the lower bound $x_{min}$.
- Estimates the scaling parameter $\alpha$.
- Identifies the lower bound $x_{min}$.
- Provides diagnostics for assessing the fit quality.
In the continuous case, the power-law distribution is represented as $p(x) = C x^{-\alpha}$ for $x \geq x_{min}$.
- Normalization Constant: The normalization constant $C$ ensures that the total probability integrates to 1 over the range $x \geq x_{min}$: $C = (\alpha - 1)x_{min}^{\alpha-1}$.
- Maximum Likelihood Estimation for $\alpha$: Given a set of observed values $x_i$ such that $x_i \geq x_{min}$, the MLE for the scaling parameter $\alpha$ is $\hat{\alpha} = 1 + n \left[ \sum\limits_{i=1}^n \ln \left( \frac{x_i}{x_{min}} \right) \right]^{-1}$, where $n$ is the number of observations with $x_i \geq x_{min}$.
- Estimating $x_{min}$: To find the optimal $x_{min}$, the function iteratively tests different values of $x_{min}$ and selects the one that minimizes the Kolmogorov-Smirnov (KS) statistic, which measures the maximum distance between the empirical distribution function and the fitted power-law model.
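The three steps above can be sketched in Python as follows. This is an illustrative sketch, not the repository's MATLAB code: the function names `fit_alpha`, `ks_distance`, and `fit_power_law` and the `min_tail` cutoff are assumptions made for this example.

```python
import math

def fit_alpha(data, xmin):
    """Continuous MLE: alpha = 1 + n / sum(ln(x_i / xmin)) over x_i >= xmin."""
    tail = [x for x in data if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    return 1.0 + len(tail) / log_sum

def ks_distance(data, xmin, alpha):
    """Largest gap between the empirical and fitted CDFs on the tail x >= xmin."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        fitted = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        worst = max(worst, abs(fitted - i / n), abs(fitted - (i + 1) / n))
    return worst

def fit_power_law(data, min_tail=10):
    """Scan candidate xmin values and keep the fit with the smallest KS distance.

    Returns (alpha, xmin, ks). The largest value is skipped as a candidate
    because a tail made of a single repeated value cannot be fitted.
    """
    best = None
    for xmin in sorted(set(data))[:-1]:
        n_tail = sum(1 for x in data if x >= xmin)
        if n_tail < min_tail:  # too few points left for a meaningful fit
            break
        alpha = fit_alpha(data, xmin)
        d = ks_distance(data, xmin, alpha)
        if best is None or d < best[2]:
            best = (alpha, xmin, d)
    return best
```

Calling `fit_power_law(samples)` on a list of positive numbers returns the selected `(alpha, xmin, ks)` triple. Scanning every distinct value is simple but quadratic in the sample size, so a real implementation may restrict the candidate set for large inputs.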
In the discrete case, the power-law distribution is represented as $p(x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{min})}$ for integer $x \geq x_{min}$, where $\zeta$ is the Hurwitz zeta function.
- Hurwitz Zeta Function: The Hurwitz zeta function is defined as $\zeta(\alpha, x_{min}) = \sum\limits_{n=0}^{\infty} (n + x_{min})^{-\alpha}$.
- Maximum Likelihood Estimation for $\alpha$: For discrete data, the MLE for $\alpha$ is found by solving the following equation numerically: $\frac{\zeta'(\hat{\alpha}, x_{min})}{\zeta(\hat{\alpha}, x_{min})} = - \frac{1}{n} \sum\limits_{i=1}^n \ln x_i$, where $\zeta'(\hat{\alpha}, x_{min})$ is the derivative of the Hurwitz zeta function with respect to $\alpha$ and $n$ is the number of observations with $x_i \geq x_{min}$.
- Estimating $x_{min}$: Similar to the continuous case, the optimal $x_{min}$ is determined by iteratively testing different values and minimizing the KS statistic.
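As a rough illustration of the discrete estimate, the Python sketch below maximizes the log-likelihood $-n \ln \zeta(\alpha, x_{min}) - \alpha \sum \ln x_i$ directly, which is equivalent to solving the equation above. It is not the repository's MATLAB code: `hurwitz_zeta` and `fit_alpha_discrete` are hypothetical names, the zeta function is approximated by a finite sum plus an Euler-Maclaurin tail estimate, and the search bracket `[1.05, 6.0]` is an assumption.

```python
import math

def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate zeta(alpha, q) = sum_{n>=0} (n + q)^(-alpha) for alpha > 1.

    Sums the first `terms` terms directly and adds an Euler-Maclaurin
    estimate of the remaining tail.
    """
    head = sum((n + q) ** -alpha for n in range(terms))
    a = terms + q
    tail = (a ** (1.0 - alpha) / (alpha - 1.0)
            + 0.5 * a ** -alpha
            + alpha * a ** (-alpha - 1.0) / 12.0)
    return head + tail

def fit_alpha_discrete(data, xmin, lo=1.05, hi=6.0, tol=1e-6):
    """Discrete MLE for alpha via golden-section search on the negative log-likelihood."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x) for x in tail)

    def neg_log_likelihood(alpha):
        return n * math.log(hurwitz_zeta(alpha, xmin)) + alpha * log_sum

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if neg_log_likelihood(c) < neg_log_likelihood(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0
```

Golden-section search is used here because the negative log-likelihood has a single minimum in $\alpha$; a root finder applied to the derivative condition would serve equally well.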
The `plpva` function performs a statistical test to determine whether the power-law model is a good fit for the given data. It uses a p-value to quantify the goodness-of-fit by comparing the empirical data to synthetic datasets generated from the fitted power-law distribution.
- Computes the p-value for the power-law fit.
- Generates synthetic datasets for comparison.
- Assesses the statistical significance of the fit.
The goodness-of-fit (GoF) statistic is the Kolmogorov-Smirnov (KS) statistic: the maximum absolute difference between the empirical cumulative distribution function (CDF) $S(x)$ and the CDF of the fitted power-law model $P(x)$, taken over the fitted range, $D = \max\limits_{x \geq x_{min}} \left| S(x) - P(x) \right|$. A smaller $D$ means the power-law model tracks the data more closely.
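The CDF of a random variable $X$ is the function $F(x) = P(X \leq x)$, the probability that $X$ takes a value less than or equal to $x$.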
The CDF provides a complete description of the distribution of a random variable. It allows us to calculate the probability that the random variable falls within a certain range. It’s useful for comparing different distributions and for goodness-of-fit tests.
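For integer-valued data, the CDF of the fitted power-law model is $P(x) = 1 - \frac{\zeta(\alpha, x + 1)}{\zeta(\alpha, x_{min})}$ for integer $x \geq x_{min}$.
Here, $\zeta$ is the Hurwitz zeta function, $\alpha$ is the scaling parameter, and $x_{min}$ is the lower bound.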
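To make the comparison of CDFs concrete for integer-valued data, here is a small Python sketch that evaluates the fitted discrete CDF $1 - \zeta(\alpha, x+1)/\zeta(\alpha, x_{min})$ and the KS distance to the empirical CDF. The names `hurwitz_zeta`, `discrete_cdf`, and `ks_distance_discrete` are hypothetical, the zeta function is approximated numerically, and the code is a sketch rather than the repository's MATLAB implementation.

```python
import math

def hurwitz_zeta(alpha, q, terms=1000):
    """Approximate zeta(alpha, q) by a finite sum plus an Euler-Maclaurin tail."""
    head = sum((n + q) ** -alpha for n in range(terms))
    a = terms + q
    tail = (a ** (1.0 - alpha) / (alpha - 1.0)
            + 0.5 * a ** -alpha
            + alpha * a ** (-alpha - 1.0) / 12.0)
    return head + tail

def discrete_cdf(x, alpha, xmin):
    """P(X <= x) for a discrete power law on the integers x >= xmin."""
    return 1.0 - hurwitz_zeta(alpha, x + 1) / hurwitz_zeta(alpha, xmin)

def ks_distance_discrete(data, alpha, xmin):
    """Maximum |empirical CDF - fitted CDF| over the integers in the tail."""
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    counts = {}
    for x in tail:
        counts[x] = counts.get(x, 0) + 1
    seen = 0
    worst = 0.0
    # Both CDFs are step functions that only change at integers,
    # so comparing them at each integer is sufficient.
    for x in range(xmin, max(tail) + 1):
        seen += counts.get(x, 0)
        worst = max(worst, abs(seen / n - discrete_cdf(x, alpha, xmin)))
    return worst
```

The loop evaluates the zeta function once per integer up to the largest observation, which is fine for illustration but would need caching or a cumulative sum for data with very large values.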
- Generate Synthetic Data: Generate synthetic datasets under the null hypothesis that the data follows a power-law distribution. For each bootstrap sample, generate synthetic data that follows the power-law distribution with the parameters $\alpha$ and $x_{min}$ estimated from the empirical data.
- Calculate GoF for Synthetic Data: Calculate the GoF statistic for each synthetic dataset in the same way as for the empirical data.
- Compare GoF Statistics: Compare the GoF statistic of the empirical data to the distribution of GoF statistics from the synthetic datasets.
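One common way to generate the synthetic data for the continuous case is inverse transform sampling: if $u$ is uniform on $[0, 1)$, then $x = x_{min}(1 - u)^{-1/(\alpha - 1)}$ follows the continuous power law. The Python sketch below illustrates this; `sample_power_law` is a hypothetical name, and the sketch only draws values from the tail $x \geq x_{min}$, so it is a simplification and not the repository's MATLAB code.

```python
import random

def sample_power_law(n, alpha, xmin, seed=None):
    """Draw n values from a continuous power law with exponent alpha on x >= xmin.

    Uses inverse transform sampling: the CDF 1 - (x / xmin)^(1 - alpha)
    is inverted at a uniform random number u in [0, 1).
    """
    rng = random.Random(seed)
    exponent = -1.0 / (alpha - 1.0)
    # 1 - u lies in (0, 1], so the power is always well defined.
    return [xmin * (1.0 - rng.random()) ** exponent for _ in range(n)]
```

For example, `sample_power_law(1000, 2.5, 1.0, seed=42)` returns a reproducible list of 1000 values, all at least `1.0`.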
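The p-value is the proportion of bootstrap GoF statistics that are greater than or equal to the empirical goodness-of-fit (GoF) statistic, such that $p = \frac{1}{B} \sum\limits_{b=1}^{B} \mathbf{1}\left[ D_b \geq D_{emp} \right]$, where $B$ is the number of synthetic datasets, $D_b$ is the GoF statistic of the $b$-th synthetic dataset, and $D_{emp}$ is the GoF statistic of the empirical data.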
- Estimating the scaling parameter ($\alpha$) and the lower bound ($x_{min}$) using the `plfit` function.
- Calculating the GoF statistic for the empirical data.
- Generating multiple synthetic datasets (typically 1000) that follow the power-law distribution with the estimated parameters.
- Calculating the GoF statistic for each synthetic dataset.
- Determining the proportion of synthetic GoF statistics that are greater than or equal to the empirical GoF statistic, which gives the p-value.
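The whole procedure can be sketched in Python for the continuous case as shown below. This is a simplified illustration, not the repository's MATLAB code: the names `fit_alpha`, `ks_distance`, and `bootstrap_p_value` are hypothetical, and $x_{min}$ is treated as given and held fixed when refitting each synthetic dataset, which is an assumption made to keep the example short.

```python
import math
import random

def fit_alpha(tail, xmin):
    """Continuous MLE for alpha on values that are all >= xmin."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)

def ks_distance(tail, xmin, alpha):
    """Maximum gap between the empirical CDF of `tail` and the fitted CDF."""
    ordered = sorted(tail)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        fitted = 1.0 - (x / xmin) ** (1.0 - alpha)
        worst = max(worst, abs(fitted - i / n), abs(fitted - (i + 1) / n))
    return worst

def bootstrap_p_value(data, xmin, reps=1000, seed=0):
    """Fraction of synthetic datasets whose KS distance is >= the empirical one."""
    rng = random.Random(seed)
    tail = [x for x in data if x >= xmin]
    alpha = fit_alpha(tail, xmin)
    d_emp = ks_distance(tail, xmin, alpha)
    exceed = 0
    for _ in range(reps):
        # Synthetic data of the same size, drawn by inverse transform sampling.
        synth = [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
                 for _unused in tail]
        # Refit alpha so each synthetic dataset is treated like the real data.
        alpha_s = fit_alpha(synth, xmin)
        if ks_distance(synth, xmin, alpha_s) >= d_emp:
            exceed += 1
    return exceed / reps
```

A p-value close to 1 means the synthetic datasets usually fit worse than the empirical data, so the power law is plausible; a p-value close to 0 means the empirical data deviates from the model more than sampling noise would explain.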
The `plfit` and `plpva` functions implemented in this project use the two methodologies described in the original power-law paper: maximum likelihood estimation (MLE) and goodness-of-fit tests.
- Maximum Likelihood Estimation (MLE): Used for fitting the power-law model to the data, providing robust parameter estimates.
- Goodness-of-Fit Tests: Based on the KS statistic, these tests help determine the plausibility of the power-law model.
The MATLAB code provided in this project closely follows the algorithms and statistical techniques discussed in the paper, ensuring accurate and reliable power-law fits.
- Identifying Patterns: By fitting power-law distributions, we can identify patterns in data that are not apparent with other distributions.
- Modeling Complex Systems: Power-law models are instrumental in understanding and modeling complex systems with scale-invariant properties.
- Predictive Analysis: Accurate power-law fits allow for better predictive analysis in fields such as finance, risk assessment, and network analysis.
- Informing Policy and Decision Making: Insights from power-law fits can inform policy decisions in economics, urban planning, and disaster management.
To use the `plfit` and `plpva` functions, clone this repository:
git clone https://github.com/omarmnfy/Power-Law-Fit-Distribution-MATLAB.git
cd Power-Law-Fit-Distribution-MATLAB