Independent Component Analysis (ICA)

Key idea

"Two microphones recorded two speakers. Which speaker said what?" If you observe linear mixtures of statistically-independent sources, ICA can recover the sources — without ever hearing them in isolation. PCA finds uncorrelated directions; ICA finds independent ones. Independence is a much stronger condition than uncorrelation, and it's what makes ICA work.

Cocktail party — two sources, two mixed observations, two ICA-recovered signals

Two ground-truth sources: a sine wave (speaker A) and a square wave (speaker B). Each microphone hears a different linear combination. PCA decorrelates the recordings but does not separate them. ICA recovers the original sources up to order, sign, and scale.

The model. x = As, where s are independent sources, A is an unknown mixing matrix, x is what you observe. ICA finds an unmixing matrix W such that Wx ≈ s (up to permutation, sign, scale).

Why it works. By the Central Limit Theorem, a linear combination of independent non-Gaussian sources is typically "more Gaussian" than the sources themselves. ICA therefore looks for the unmixing that makes the recovered components as non-Gaussian as possible, scoring non-Gaussianity with kurtosis, negentropy, or a log-cosh contrast. Undoing the mixing increases non-Gaussianity, and the most non-Gaussian directions are the original sources.
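The Central Limit Theorem argument can be checked numerically. The sketch below is an illustration only: the uniform sources, the equal mixing weights, the sample size, and the seed are all assumptions made for the demo. It mixes more and more independent sources and prints the excess kurtosis of the mixture, which should shrink towards the Gaussian value of 0.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, an arbitrary choice
n_samples = 200_000

def excess_kurtosis(y):
    """Sample excess kurtosis: about 0 for a Gaussian, negative for flat distributions."""
    y = (y - y.mean()) / y.std()
    return float(np.mean(y**4) - 3.0)

# Independent uniform sources: clearly non-Gaussian (excess kurtosis about -1.2)
sources = rng.uniform(-1.0, 1.0, size=(n_samples, 8))

# Equal-weight mixtures of the first k sources
kurt_by_k = {k: excess_kurtosis(sources[:, :k].sum(axis=1)) for k in (1, 2, 4, 8)}
for k, kurt in kurt_by_k.items():
    print(f"{k} source(s) mixed: excess kurtosis = {kurt:+.3f}")
```

For an equal-weight sum of k identically distributed independent sources the excess kurtosis is the single-source value divided by k, so the printed values should sit near -1.2, -0.6, -0.3, and -0.15.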

The Gaussian impossibility. If the sources are themselves Gaussian, ICA can't tell them apart: after whitening, every rotation of a multivariate Gaussian has exactly the same distribution, so no rotation looks more "source-like" than another. Hence the two requirements: the sources are independent, and at most one of them is Gaussian.
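A small numerical check makes the rotation argument concrete. This is a sketch under assumed settings (unit-variance sources, 19 projection angles, a fixed seed): it projects 2-D data onto a sweep of directions and compares how much the excess kurtosis changes with angle for Gaussian versus uniform sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return float(np.mean(y**4) - 3.0)

def kurtosis_along_angles(Z, angles):
    """Excess kurtosis of the projection of 2-D data Z onto each direction."""
    return np.array([
        excess_kurtosis(Z @ np.array([np.cos(a), np.sin(a)])) for a in angles
    ])

angles = np.linspace(0.0, np.pi / 2, 19)

# Both datasets have independent, unit-variance coordinates (already white)
gauss = rng.standard_normal((n_samples, 2))
unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_samples, 2))

k_gauss = kurtosis_along_angles(gauss, angles)
k_unif = kurtosis_along_angles(unif, angles)

# How much the non-Gaussianity score changes as the direction rotates
spread_gauss = float(k_gauss.max() - k_gauss.min())
spread_unif = float(k_unif.max() - k_unif.min())
print(f"Gaussian sources: kurtosis spread over angles = {spread_gauss:.3f}")
print(f"Uniform sources:  kurtosis spread over angles = {spread_unif:.3f}")
```

For the Gaussian data the score is flat up to sampling noise, so there is nothing for ICA to optimise. For the uniform data the score varies with angle, which is the signal ICA exploits.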

Reach for it when

Audio source separation (cocktail party)
EEG / MEG artefact removal (eye blinks, muscle noise)
fMRI component analysis
Removing baseline drift from signals
You need independent features, not just decorrelated ones

Doesn't help when

Sources are themselves Gaussian (no information in higher moments)
Mixing is non-linear
You only need visualisation (PCA, t-SNE, UMAP fit better)
You need predictive features (use the original data + regularisation)

from sklearn.decomposition import FastICA, PCA
import numpy as np

# Toy cocktail party
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                                       # speaker A
s2 = np.sign(np.sin(3 * t))                              # speaker B (square)
S  = np.column_stack([s1, s2])
A  = np.array([[1.0, 0.5], [0.5, 2.0]])                  # mixing matrix
X  = S @ A.T                                              # observed mixtures

# Recover with FastICA
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                              # recovered sources

# Compare with PCA — same dimensions, different result
S_pca = PCA(n_components=2).fit_transform(X)              # decorrelated but not independent

Maximising non-Gaussianity

$$ \max_w \; J(w^\top x) \;\;\text{s.t.}\;\; \mathbb{E}[(w^\top x)^2] = 1 $$

J: non-Gaussianity measure (kurtosis, negentropy, or log-cosh contrast)
Constraint keeps the solution unique (otherwise just scale w)
One w per source; deflate to find subsequent ones

$$ \max_w \;\; \text{non-Gaussianity of (combination of observations)} \;\;\text{such that}\;\; \text{average squared combination} \;=\; 1 $$

In words. ICA searches for a weight vector w such that the linear combination w · x (a dot product of weights with each observation) looks as un-Gaussian as possible. The Central Limit Theorem says mixtures of independent sources drift towards Gaussian — so undoing the mixing means heading away from Gaussian. J is just whatever score you use to measure non-Gaussianity: kurtosis (how heavy the tails are), negentropy, or a smoother contrast like log-cosh. The "s.t." constraint says the recovered signal must have unit variance — otherwise the optimiser would cheat by simply scaling w.

w: unmixing weights, one vector per source
w · x: weighted combination of the observations (one candidate recovered source)
non-Gaussianity (J): how far the combination's distribution is from a bell curve
average squared = 1: unit-variance constraint that makes the answer unique
One w for each source; deflate to find the next

Pre-whitening. Centre the data, decorrelate it, and rescale each direction to unit variance (PCA followed by a rescaling). After whitening, the remaining unmixing matrix W is orthogonal, i.e. a rotation (possibly with a reflection). ICA then only has to search for the rotation that makes the components maximally non-Gaussian. Two steps: whiten, then rotate.
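The whiten-then-rotate recipe can be sketched for two channels with a brute-force search over the rotation angle. This is not how FastICA searches; it is a minimal illustration that assumes uniform sources, an eigendecomposition-based whitening, and the summed absolute excess kurtosis as the non-Gaussianity score. All names in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50_000
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n_samples, 2))  # independent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])                          # mixing matrix
X = S @ A.T                                                     # observed mixtures

# Step 1: whiten (centre, decorrelate, rescale to unit variance)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
K = eigvecs / np.sqrt(eigvals)       # scales eigenvector j by 1/sqrt(eigenvalue j)
Z = Xc @ K                           # whitened data, covariance close to identity

# Step 2: search for the rotation that maximises non-Gaussianity
def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def total_abs_kurtosis(Y):
    """Sum of |excess kurtosis| over columns; columns are assumed unit variance."""
    return float(np.sum(np.abs(np.mean(Y**4, axis=0) - 3.0)))

angles = np.linspace(0.0, np.pi / 2, 181)            # 0.5 degree grid
scores = [total_abs_kurtosis(Z @ rotation(a)) for a in angles]
best_angle = angles[int(np.argmax(scores))]
S_hat = Z @ rotation(best_angle)

# Absolute correlation between recovered components (rows) and true sources (columns)
corr = np.abs(np.corrcoef(S_hat.T, S.T)[:2, 2:])
print(np.round(corr, 3))
```

A quarter turn is enough for the search because rotating by 90 degrees only swaps the two components and flips a sign, which ICA cannot distinguish anyway.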

FastICA. Hyvärinen (1999). An approximate Newton method that is fast and reliable in practice. Each component is recovered via a fixed-point iteration. It is the ICA algorithm implemented in scikit-learn, and a MATLAB package is also widely used.
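The fixed-point iteration can be written in a few lines of NumPy. The following is a simplified sketch, not scikit-learn's implementation: it assumes the tanh non-linearity (the derivative of log-cosh), deflationary orthogonalisation, and three illustrative sources. The function names `whiten` and `fastica_deflation` are made up for this example.

```python
import numpy as np

def whiten(X):
    """Centre, decorrelate, and rescale X (rows = samples) to identity covariance."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ (eigvecs / np.sqrt(eigvals))

def fastica_deflation(Z, n_components, max_iter=200, tol=1e-6, seed=0):
    """One-unit fixed-point updates on whitened data Z, one component at a time."""
    rng = np.random.default_rng(seed)
    d = Z.shape[1]
    W = np.zeros((n_components, d))
    for k in range(n_components):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = Z @ w                                   # current candidate source
            g = np.tanh(y)                              # g = derivative of log cosh
            g_prime = 1.0 - g**2
            # Fixed-point update: w <- E[z g(w.z)] - E[g'(w.z)] w
            w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
            # Deflation: remove the directions already found, then renormalise
            w_new -= W[:k].T @ (W[:k] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol  # same direction up to sign
            w = w_new
            if converged:
                break
        W[k] = w
    return W

rng = np.random.default_rng(1)
n_samples = 20_000
S = np.column_stack([
    rng.uniform(-np.sqrt(3), np.sqrt(3), n_samples),    # sub-Gaussian
    rng.laplace(0.0, 1.0 / np.sqrt(2), n_samples),      # super-Gaussian
    rng.choice([-1.0, 1.0], n_samples),                 # binary
])
A = np.array([[1.0, 0.5, 0.3], [0.5, 2.0, 0.4], [0.2, 0.3, 1.5]])
X = S @ A.T

Z = whiten(X)
W = fastica_deflation(Z, n_components=3)
S_hat = Z @ W.T

# Absolute correlation between recovered components (rows) and true sources (columns)
corr = np.abs(np.corrcoef(S_hat.T, S.T)[:3, 3:])
print(np.round(corr, 3))
```

Each true source should line up with exactly one recovered component, in an arbitrary order and with an arbitrary sign.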

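Maximum likelihood ICA. Equivalent formulation: maximise the joint likelihood of the recovered components under an assumed non-Gaussian source density. A log-cosh density is a common heavy-tailed choice. InfoMax corresponds to a fixed super-Gaussian density, and Extended InfoMax switches between sub- and super-Gaussian densities per component. (JADE, by contrast, is cumulant-based rather than likelihood-based.)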

Indeterminacies. ICA can't recover (a) the order of sources, (b) their scale, or (c) their sign. The unmixing matrix is right up to a permutation and a diagonal sign-and-scale matrix. Usually you can resolve order/scale by post-hoc convention.
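The indeterminacies follow directly from the model: any reordering, rescaling, or sign flip of the sources can be absorbed into the mixing matrix without changing the observations. The sketch below demonstrates this with an arbitrary permutation P and diagonal matrix D (both chosen for illustration), then applies one possible post-hoc convention, unit-variance sources.

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 2))     # illustrative independent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])         # mixing matrix
X = S @ A.T                                    # observations

P = np.array([[0.0, 1.0], [1.0, 0.0]])         # swap the two sources
D = np.diag([-3.0, 0.5])                       # flip one sign, rescale both

S_alt = S @ P @ D                              # reordered, rescaled, sign-flipped sources
A_alt = A @ P @ np.linalg.inv(D)               # mixing matrix that compensates exactly
X_alt = S_alt @ A_alt.T
same_observations = bool(np.allclose(X, X_alt))
print("Same observations from different (S, A):", same_observations)

# One possible convention: unit-variance sources, with the scale moved into A's columns
scale = S_alt.std(axis=0)
S_std = S_alt / scale
A_std = A_alt * scale
```

Because X alone cannot distinguish (S, A) from (S_alt, A_alt), any ordering, sign, or scale reported by an ICA routine is a convention rather than a property of the data.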

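Number of components. Standard ICA assumes as many sources as observed channels (a square mixing matrix). With more channels than sources, reduce the dimension first (for example with PCA during whitening). For the overcomplete case (more sources than mics), use overcomplete ICA or related sparse methods.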

Non-Gaussianity measures. Kurtosis is simple but sensitive to outliers. Negentropy is the principled choice but expensive. The log-cosh non-linearity is a smooth proxy for negentropy — what FastICA uses by default.
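To compare the measures, the sketch below scores three standardised samples with excess kurtosis and with a one-term log-cosh contrast, (E[G(y)] - E[G(v)])^2 where G = log cosh and v is standard Gaussian. The Gaussian reference value is estimated by sampling, which is an assumption made for simplicity, as are the sample sizes and seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 200_000

def standardise(y):
    return (y - y.mean()) / y.std()

def excess_kurtosis(y):
    y = standardise(y)
    return float(np.mean(y**4) - 3.0)

# Reference value E[log cosh(v)] for a standard Gaussian v, estimated by sampling
gauss_ref = float(np.mean(np.log(np.cosh(rng.standard_normal(1_000_000)))))

def logcosh_contrast(y):
    """(E[G(y)] - E[G(v)])^2 with G = log cosh: a one-term negentropy approximation."""
    y = standardise(y)
    return float((np.mean(np.log(np.cosh(y))) - gauss_ref) ** 2)

samples = {
    "gaussian": rng.standard_normal(n_samples),
    "uniform": rng.uniform(-1.0, 1.0, n_samples),
    "laplace": rng.laplace(0.0, 1.0, n_samples),
}
scores = {
    name: (excess_kurtosis(y), logcosh_contrast(y)) for name, y in samples.items()
}
for name, (kurt, contrast) in scores.items():
    print(f"{name:9s} excess kurtosis = {kurt:+.3f}   log-cosh contrast = {contrast:.2e}")
```

Kurtosis is signed (negative for the flat uniform, positive for the heavy-tailed Laplace), while the log-cosh contrast is non-negative and close to zero only for the Gaussian sample.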

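from sklearn.decomposition import FastICA
import numpy as np

# Toy mixtures (same construction as the cocktail-party example)
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
A = np.array([[1.0, 0.5], [0.5, 2.0]])
X = S @ A.T

# Different non-linearities for different source distributions
ica_logcosh = FastICA(n_components=2, fun="logcosh", random_state=0)  # default; heavy-tailed sources
ica_exp     = FastICA(n_components=2, fun="exp", random_state=0)      # very heavy tails
ica_cube    = FastICA(n_components=2, fun="cube", random_state=0)     # sub-Gaussian sources

# The mixing matrix is recoverable too
S_hat = ica_logcosh.fit_transform(X)
A_hat = ica_logcosh.mixing_              # estimate of A, up to column order, sign, and scale

# Verify: S_hat @ A_hat.T + mean_ reproduces X. The reconstruction is exact because
# the order / sign / scale ambiguity cancels between S_hat and A_hat.
assert np.allclose(X, S_hat @ A_hat.T + ica_logcosh.mean_, atol=1e-6)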

Negentropy

$$ J(y) = H(y_\text{gauss}) - H(y) \;\;\geq\;\; 0 $$

H(y): differential entropy of y
y_gauss: Gaussian with the same variance as y
Always non-negative; zero iff y is Gaussian
Hard to compute exactly — approximated by polynomial cumulants or contrast functions

$$ \text{non-Gaussianity score} \;=\; \text{entropy of matched Gaussian} \;-\; \text{entropy of signal} \;\;\geq\;\; 0 $$

In words. Negentropy is a principled way to measure how far a signal is from a Gaussian. H stands for "entropy" — a measure of how unpredictable a distribution is. The Gaussian is the most unpredictable distribution for a given variance, so its entropy is the maximum. Subtracting the actual signal's entropy from that maximum gives a non-negative score that's zero only when the signal itself is Gaussian — and bigger the further the signal is from Gaussian. Computing the entropy exactly is hard, so practical ICA uses cheap approximations (kurtosis, log-cosh).

non-Gaussianity score (J): how far the signal is from a bell curve
entropy (H): measure of unpredictability of a distribution
matched Gaussian: a Gaussian with the same variance as the signal
Always non-negative; equals zero iff the signal itself is Gaussian
Approximated in practice by kurtosis or log-cosh contrast
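For a few textbook distributions the entropies have closed forms, so negentropy can be computed exactly. The sketch below does this for unit-variance Gaussian, uniform, and Laplace distributions. The choice of distributions is illustrative, and entropies are in nats.

```python
import numpy as np

# Differential entropies (in nats) of three unit-variance distributions, closed form
h_gauss = 0.5 * np.log(2 * np.pi * np.e)      # Gaussian with variance 1
h_uniform = np.log(2 * np.sqrt(3))            # uniform on [-sqrt(3), sqrt(3)]: log(width)
h_laplace = 1 + np.log(2 / np.sqrt(2))        # Laplace with scale b = 1/sqrt(2): 1 + log(2b)

# Negentropy J = H(matched Gaussian) - H(signal)
negentropy = {
    "gaussian": h_gauss - h_gauss,            # 0 by construction
    "uniform": h_gauss - h_uniform,           # about 0.176
    "laplace": h_gauss - h_laplace,           # about 0.072
}
for name, j in negentropy.items():
    print(f"{name:9s} J = {j:.4f}")
```

Both non-Gaussian distributions score strictly above zero, consistent with the Gaussian having maximum entropy at fixed variance. Real signals rarely have known densities, which is why practical ICA falls back on approximations.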

InfoMax. Bell & Sejnowski (1995). Maximise the mutual information between the input and a non-linearly transformed output. Equivalent to maximum-likelihood ICA when the non-linearity matches the cumulative distribution of the sources. Foundational; FastICA is a more efficient cousin.

JADE. Cardoso & Souloumiac (1993). Joint approximate diagonalisation of eigen-matrices. Uses fourth-order cumulants directly. Slower than FastICA but better-conditioned in some cases. Available in the MATLAB-based EEGLAB toolbox, whose default algorithm is InfoMax.

Non-linear ICA. The mixing x = f(s) is non-linear. Hard: without further assumptions, recovery is fundamentally ambiguous. Recent work by Hyvärinen and colleagues uses temporal structure and auxiliary variables (time index, group labels) to make non-linear ICA identifiable, for example Time-Contrastive Learning and Permutation-Contrastive Learning.

Applications in neuroscience. EEG / MEG: separate eye-blink and muscle-noise components from neural sources. fMRI: identify spatially independent brain networks. ICA-based artefact removal is now a standard pre-processing step in many EEG / MEG pipelines.

ICA vs sparse coding. Both find a sparse / non-Gaussian basis. ICA forces statistical independence; sparse coding only forces sparsity. Sparse coding is more flexible but has more local optima.

Identifiability. Comon (1994) proved that linear ICA is identifiable up to permutation and sign/scale when at most one source is Gaussian. Stronger identifiability theorems exist for time-dependent sources and for some forms of non-linear ICA. The frontier of modern ICA theory.

Variational autoencoders as non-linear ICA? Khemakhem et al. (2020) showed that under specific conditions, identifiable non-linear ICA falls out of VAEs with auxiliary variables. Connects two seemingly unrelated communities (signal processing, deep generative modelling).

import mne                    # mne-python: neuroscience-grade EEG / MEG

# raw_eeg is assumed to be a preloaded mne.io.Raw recording that includes
# EEG channels and at least one EOG channel; loading it is not shown here.

# Classical EEG artefact removal with ICA
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw_eeg)

# Find components that look like blinks via correlation with the EOG channel
eog_idx, eog_scores = ica.find_bads_eog(raw_eeg)
ica.exclude = eog_idx
raw_clean = ica.apply(raw_eeg.copy())     # copy so the original recording is untouched

# Or run scikit-learn's FastICA directly on the channel data
from sklearn.decomposition import FastICA
ica_sk = FastICA(n_components=20, max_iter=500, tol=1e-4, random_state=0)
S_hat = ica_sk.fit_transform(raw_eeg.get_data().T)   # rows = time samples, columns = channels
