Key idea

"Two microphones recorded two speakers. Which speaker said what?" If you observe linear mixtures of statistically independent sources, ICA can recover the sources without ever hearing them in isolation. PCA finds uncorrelated directions; ICA finds independent ones. Independence is a much stronger condition than uncorrelatedness, and it is what makes ICA work.

Cocktail party — two sources, two mixed observations, two ICA-recovered signals

Two ground-truth sources: a sine wave (speaker A) and a square wave (speaker B). Each microphone hears a different linear combination. PCA decorrelates them but doesn't separate them. ICA gets the original sources back up to order, sign and scale.

The model. x = As, where s are independent sources, A is an unknown mixing matrix, x is what you observe. ICA finds an unmixing matrix W such that Wx ≈ s (up to permutation, sign, scale).

Why it works. Linear combinations of independent non-Gaussian sources are "more Gaussian" than the sources themselves (Central Limit Theorem). ICA maximises the non-Gaussianity of the recovered components — kurtosis, negentropy, log-cosh. Demixing increases non-Gaussianity → reveals the original sources.
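A quick numerical check of the "mixtures are more Gaussian" claim, as a sketch rather than a proof. It assumes two uniform sources and an illustrative mixing matrix, and uses excess kurtosis (zero for a Gaussian) as the non-Gaussianity score. The helper name excess_kurtosis is made up for this example.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardised moment minus 3; zero for a Gaussian."""
    y = (y - y.mean()) / y.std()
    return float(np.mean(y ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 200_000
S = rng.uniform(-1.0, 1.0, size=(n, 2))    # two independent uniform sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])     # illustrative mixing matrix (an assumption)
X = S @ A.T                                # each column is one mixture

k_sources = [excess_kurtosis(S[:, i]) for i in range(2)]
k_mixtures = [excess_kurtosis(X[:, i]) for i in range(2)]
print("sources :", np.round(k_sources, 2))
print("mixtures:", np.round(k_mixtures, 2))
```

A uniform distribution has excess kurtosis of -1.2 in theory. Both mixtures should land closer to zero than that, which is the drift towards Gaussian that ICA tries to undo.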

The Gaussian impossibility. If the sources are themselves Gaussian, ICA can't tell them apart: once whitened, a multivariate Gaussian looks identical under every rotation, so no rotation stands out as the right one. Two requirements: at most one source is Gaussian, and the sources are independent.
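The rotation argument can be seen numerically. The sketch below, under the assumption of unit-variance sources and kurtosis as the contrast, projects two-dimensional data onto a sweep of directions. For Gaussian data the contrast is flat across all angles, so there is nothing to optimise; for uniform data it clearly varies. Function names are hypothetical.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardised moment minus 3; zero for a Gaussian."""
    y = (y - y.mean()) / y.std()
    return float(np.mean(y ** 4) - 3.0)

def kurtosis_over_rotations(Z, angles):
    """Excess kurtosis of the projection of Z onto the unit vector at each angle."""
    out = []
    for a in angles:
        w = np.array([np.cos(a), np.sin(a)])   # unit-norm projection direction
        out.append(excess_kurtosis(Z @ w))
    return np.array(out)

rng = np.random.default_rng(0)
n = 100_000
angles = np.linspace(0.0, np.pi / 2, 19)

Z_gauss = rng.standard_normal((n, 2))                        # two Gaussian sources
Z_unif = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))   # two unit-variance uniform sources

k_gauss = kurtosis_over_rotations(Z_gauss, angles)
k_unif = kurtosis_over_rotations(Z_unif, angles)

spread_gauss = k_gauss.max() - k_gauss.min()   # close to zero: flat contrast
spread_unif = k_unif.max() - k_unif.min()      # clearly non-zero: a preferred rotation exists
print(round(spread_gauss, 3), round(spread_unif, 3))
```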

Reach for it when

  • Audio source separation (cocktail party)
  • EEG / MEG artefact removal (eye blinks, muscle noise)
  • fMRI component analysis
  • Removing baseline drift from signals
  • You need independent features, not just decorrelated ones

Doesn't help when

  • Sources are themselves Gaussian (no information in higher moments)
  • Mixing is non-linear
  • You only need visualisation (PCA, t-SNE, UMAP fit better)
  • You need predictive features (use the original data + regularisation)

from sklearn.decomposition import FastICA, PCA
import numpy as np

# Toy cocktail party
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                                       # speaker A
s2 = np.sign(np.sin(3 * t))                              # speaker B (square)
S  = np.column_stack([s1, s2])
A  = np.array([[1.0, 0.5], [0.5, 2.0]])                  # mixing matrix
X  = S @ A.T                                              # observed mixtures

# Recover with FastICA
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                              # recovered sources

# Compare with PCA — same dimensions, different result
S_pca = PCA(n_components=2).fit_transform(X)              # decorrelated but not independent

Maximising non-Gaussianity

$$ \max_w \; J(w^\top x) \;\;\text{s.t.}\;\; \mathbb{E}[(w^\top x)^2] = 1 $$

  • J: non-Gaussianity measure (kurtosis, negentropy, or log-cosh contrast)
  • Constraint keeps the solution unique (otherwise just scale w)
  • One w per source; deflate to find subsequent ones

$$ \max_w \;\; \text{non-Gaussianity of (combination of observations)} \;\;\text{such that}\;\; \text{average squared combination} \;=\; 1 $$

In words. ICA searches for a weight vector w such that the linear combination w · x (a dot product of weights with each observation) looks as un-Gaussian as possible. The Central Limit Theorem says mixtures of independent sources drift towards Gaussian — so undoing the mixing means heading away from Gaussian. J is just whatever score you use to measure non-Gaussianity: kurtosis (how heavy the tails are), negentropy, or a smoother contrast like log-cosh. The "s.t." constraint says the recovered signal must have unit variance — otherwise the optimiser would cheat by simply scaling w.

  • w: unmixing weights, one per source
  • w · x: weighted combination of the observations (one candidate recovered source)
  • non-Gaussianity (J): how far the combination's distribution is from a bell curve
  • average squared = 1: unit-variance constraint that makes the answer unique
  • One w for each source; deflate to find the next
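The objective above can be tried by brute force in two dimensions. This is a minimal sketch, not how production ICA works: it assumes one Laplace and one uniform source, whitens the mixtures so that any unit-norm w automatically satisfies the unit-variance constraint, and then scans directions for the largest absolute excess kurtosis. All names and the mixing matrix are illustrative.

```python
import numpy as np

def excess_kurtosis(y):
    """Fourth standardised moment minus 3; zero for a Gaussian."""
    y = (y - y.mean()) / y.std()
    return float(np.mean(y ** 4) - 3.0)

rng = np.random.default_rng(0)
n = 50_000
S = np.column_stack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.5], [0.5, 2.0]])     # illustrative mixing matrix
X = S @ A.T

# Whiten: after this step a unit-norm w gives a unit-variance projection
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ eigvecs / np.sqrt(eigvals)

# Scan unit vectors w(angle) and keep the one with the largest |excess kurtosis|
angles = np.linspace(0.0, np.pi, 361)
scores = [abs(excess_kurtosis(Z @ np.array([np.cos(a), np.sin(a)]))) for a in angles]
best = angles[int(np.argmax(scores))]
w = np.array([np.cos(best), np.sin(best)])
y = Z @ w                                   # one candidate recovered source

corr = [abs(np.corrcoef(y, S[:, i])[0, 1]) for i in range(2)]
print("abs correlation with each true source:", np.round(corr, 3))
```

The winning direction should line up with one of the true sources. Real algorithms replace the grid scan with gradient or fixed-point updates, which is what makes them scale beyond two dimensions.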

Pre-whitening. Centre and decorrelate (PCA) the data first. After whitening, the unmixing matrix W is orthogonal — a rotation. ICA then just searches for the right rotation that makes the components maximally non-Gaussian. Two-step: whiten, then rotate.
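A minimal whitening sketch with NumPy, assuming unit-variance sources so that the claim "what is left is a rotation" can be checked directly. The function name whiten and the matrix names K and R are choices made for this example.

```python
import numpy as np

def whiten(X):
    """Centre X (n_samples, n_features); return whitened data and the whitening matrix K."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    K = (eigvecs / np.sqrt(eigvals)).T      # rows are eigenvectors scaled by 1/sqrt(eigenvalue)
    return Xc @ K.T, K

rng = np.random.default_rng(0)
n = 100_000
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, 2))   # unit-variance independent sources
A = np.array([[1.0, 0.5], [0.5, 2.0]])                  # illustrative mixing matrix
X = S @ A.T

Z, K = whiten(X)
cov_Z = np.cov(Z, rowvar=False)   # identity up to floating-point error
R = K @ A                         # mixing that remains after whitening: Z is roughly S @ R.T
print(np.round(cov_Z, 3))
print(np.round(R @ R.T, 3))       # close to identity, so R is approximately orthogonal
```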

FastICA. Hyvärinen (1999). Approximate Newton method that's fast and reliable. Each component is recovered via a fixed-point iteration. It is the ICA implementation that scikit-learn ships (FastICA), and a MATLAB package is available too.
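The fixed-point iteration can be sketched in a few lines. This is a simplified deflation variant with the tanh non-linearity (the derivative of log-cosh), written for illustration under the assumption of whitened input. It is not scikit-learn's implementation, and the function names are made up.

```python
import numpy as np

def whiten(X):
    """Centre and whiten X (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs / np.sqrt(eigvals)

def fastica_deflation(Z, n_components, max_iter=200, tol=1e-8, seed=0):
    """Estimate unmixing rows one at a time on whitened data Z."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, Z.shape[1]))
    for k in range(n_components):
        w = rng.standard_normal(Z.shape[1])
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            y = Z @ w
            g = np.tanh(y)                      # derivative of the log-cosh contrast
            g_prime = 1.0 - g ** 2
            # Fixed-point update: E[z g(w.z)] - E[g'(w.z)] w
            w_new = (Z * g[:, None]).mean(axis=0) - g_prime.mean() * w
            w_new -= W[:k].T @ (W[:k] @ w_new)  # deflation: project out earlier components
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < tol   # direction stable up to sign
            w = w_new
            if converged:
                break
        W[k] = w
    return W

rng = np.random.default_rng(1)
n = 20_000
S = np.column_stack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.5], [0.5, 2.0]])          # illustrative mixing matrix
Z = whiten(S @ A.T)

W = fastica_deflation(Z, n_components=2)
S_hat = Z @ W.T
C = np.abs(np.corrcoef(S_hat.T, S.T)[:2, 2:])   # rows: estimates, columns: true sources
print(np.round(C, 3))
```

Each true source should show an absolute correlation near 1 with exactly one estimate. The sign-insensitive convergence test matters because the update can flip the sign of w from one iteration to the next.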

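Maximum likelihood ICA. Equivalent formulation: maximise the joint likelihood of the recovered components under an assumed non-Gaussian source density. A heavy-tailed density whose negative log is log-cosh is a common choice. InfoMax can be read as maximum likelihood with a fixed super-Gaussian density, and Extended InfoMax switches between sub- and super-Gaussian densities per component. JADE belongs to a different family: it works from fourth-order cumulants rather than a likelihood.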
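A sketch of the maximum-likelihood view, assuming super-Gaussian (Laplace) sources and the density p(s) ∝ 1/cosh(s), whose score function is tanh. It uses full-batch natural-gradient ascent; the learning rate and iteration count are arbitrary choices for this toy problem, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
S = rng.laplace(size=(n, 2))               # super-Gaussian sources, matching the assumed density
A = np.array([[1.0, 0.5], [0.5, 2.0]])     # illustrative mixing matrix
X = S @ A.T
X = X - X.mean(axis=0)

W = np.eye(2)                              # initial unmixing matrix
lr = 0.05                                  # step size (an assumption for this toy problem)
for _ in range(2000):
    Y = X @ W.T
    # Natural gradient of the average log-likelihood: (I - E[tanh(y) y^T]) W
    grad = np.eye(2) - (np.tanh(Y).T @ Y) / n
    W = W + lr * grad @ W

S_hat = X @ W.T
C = np.abs(np.corrcoef(S_hat.T, S.T)[:2, 2:])   # rows: estimates, columns: true sources
print(np.round(C, 3))
```

With sub-Gaussian sources this fixed tanh score would be the wrong model, which is the gap Extended InfoMax is designed to close.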

Indeterminacies. ICA can't recover (a) the order of sources, (b) their scale, or (c) their sign. The unmixing matrix is determined only up to a permutation and a diagonal sign-and-scale matrix. Usually you can resolve order/scale by post-hoc convention.
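In simulations, where the true sources are known, one way to undo the indeterminacies is to match components by absolute correlation and then fix sign and scale. The helper below is a hypothetical sketch of that convention. It uses a greedy match, which can assign the same estimate twice if separation is poor.

```python
import numpy as np
from sklearn.decomposition import FastICA

def align_to_reference(S_hat, S_ref):
    """Reorder, flip and rescale estimated components to best match reference signals."""
    k = S_ref.shape[1]
    corr = np.corrcoef(S_hat.T, S_ref.T)[:k, k:]   # rows: estimates, columns: references
    order = np.abs(corr).argmax(axis=0)            # best-matching estimate for each reference
    aligned = S_hat[:, order].copy()
    for j in range(k):
        aligned[:, j] *= np.sign(corr[order[j], j])                 # fix the sign
        aligned[:, j] *= S_ref[:, j].std() / aligned[:, j].std()    # fix the scale
    return aligned, order

rng = np.random.default_rng(0)
n = 10_000
S = np.column_stack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.5], [0.5, 2.0]])     # illustrative mixing matrix
X = S @ A.T

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
aligned, order = align_to_reference(S_hat, S)
print("estimate used for each source:", order)
```

With real recordings there is no reference, so the convention has to come from the data itself, for example unit variance per component.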

Number of components. Basic ICA assumes as many sensors as sources (square mixing matrix). With more sensors than sources, reduce the dimension first, typically with PCA during whitening. For the overcomplete case (more sources than mics), use overcomplete ICA or related sparse methods.
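The "more sensors than sources" case is the easy one in practice. A sketch with scikit-learn, assuming three microphones, two sources and a little sensor noise: setting n_components=2 lets the whitening step drop the redundant dimension. The mixing matrix and noise level are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
n = 10_000
S = np.column_stack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.5],
              [0.5, 2.0],
              [1.5, -1.0]])                          # 3 microphones, 2 sources
X = S @ A.T + 0.01 * rng.standard_normal((n, 3))     # small sensor noise (an assumption)

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)                         # shape (n, 2): one column per source

C = np.abs(np.corrcoef(S_hat.T, S.T)[:2, 2:])        # rows: estimates, columns: true sources
print(S_hat.shape, ica.mixing_.shape)
print(np.round(C, 3))
```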

Non-Gaussianity measures. Kurtosis is simple but sensitive to outliers. Negentropy is the principled choice but expensive. The log-cosh non-linearity is a smooth proxy for negentropy — what FastICA uses by default.

from sklearn.decomposition import FastICA
import numpy as np

# Different non-linearities for different source distributions
ica_logcosh = FastICA(n_components=2, fun="logcosh")        # default; good general-purpose choice
ica_exp     = FastICA(n_components=2, fun="exp")            # very heavy tails, more robust to outliers
ica_cube    = FastICA(n_components=2, fun="cube")           # kurtosis-based; sub-Gaussian sources, no outliers

# The mixing matrix is recoverable too (X is the mixture matrix from the cocktail-party example)
S_hat = ica_logcosh.fit_transform(X)
A_hat = ica_logcosh.mixing_              # estimate of A up to column order, sign and scale

# Verify: S_hat @ A_hat.T plus the mean reproduces X; the order/sign/scale ambiguities cancel in the product
assert np.allclose(X, S_hat @ A_hat.T + ica_logcosh.mean_, atol=1e-6)

Negentropy

$$ J(y) = H(y_\text{gauss}) - H(y) \;\;\geq\;\; 0 $$

  • H(y): differential entropy of y
  • y_gauss: Gaussian with the same variance
  • Always non-negative; zero iff y is Gaussian
  • Hard to compute exactly — approximated by polynomial cumulants or contrast functions

$$ \text{non-Gaussianity score} \;=\; \text{entropy of matched Gaussian} \;-\; \text{entropy of signal} \;\;\geq\;\; 0 $$

In words. Negentropy is a principled way to measure how far a signal is from a Gaussian. H stands for "entropy" — a measure of how unpredictable a distribution is. The Gaussian is the most unpredictable distribution for a given variance, so its entropy is the maximum. Subtracting the actual signal's entropy from that maximum gives a non-negative score that's zero only when the signal itself is Gaussian — and bigger the further the signal is from Gaussian. Computing the entropy exactly is hard, so practical ICA uses cheap approximations (kurtosis, log-cosh).

  • non-Gaussianity score (J): how far the signal is from a bell curve
  • entropy (H): measure of unpredictability of a distribution
  • matched Gaussian: a Gaussian with the same variance as the signal
  • Always non-negative; equals zero iff the signal itself is Gaussian
  • Approximated in practice by kurtosis or log-cosh contrast
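One common contrast-function approximation takes negentropy to be proportional to (E[G(y)] − E[G(ν)])², with G = log cosh and ν a standard Gaussian. The sketch below estimates both expectations by sampling, so the result is correct only up to a positive constant and Monte Carlo noise. The function names are hypothetical.

```python
import numpy as np

def logcosh(u):
    """Numerically stable log(cosh(u))."""
    return np.logaddexp(u, -u) - np.log(2.0)

def negentropy_logcosh(y, n_ref=1_000_000, seed=0):
    """Approximate negentropy of a 1-D sample, up to a positive constant."""
    y = (y - y.mean()) / y.std()                              # standardise first
    nu = np.random.default_rng(seed).standard_normal(n_ref)   # Gaussian reference sample
    return float((logcosh(y).mean() - logcosh(nu).mean()) ** 2)

rng = np.random.default_rng(42)
n = 200_000
J_gauss = negentropy_logcosh(rng.standard_normal(n))
J_laplace = negentropy_logcosh(rng.laplace(size=n))
J_uniform = negentropy_logcosh(rng.uniform(-1, 1, size=n))
print(f"gaussian={J_gauss:.2e}  laplace={J_laplace:.2e}  uniform={J_uniform:.2e}")
```

The Gaussian sample should score close to zero, while both the heavy-tailed Laplace and the flat uniform sample score visibly higher. Unlike kurtosis, the score does not say which side of Gaussian the signal is on.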

InfoMax. Bell & Sejnowski (1995). Maximise the mutual information between the input and a non-linearly transformed output. Equivalent to maximum-likelihood ICA when the non-linearity matches the sources' cumulative distribution function. Foundational; the FastICA approach is a more efficient cousin.

JADE. Cardoso & Souloumiac (1993). Joint approximate diagonalisation of eigen-matrices. Uses fourth-order cumulants directly. Slower than FastICA but better-conditioned in some cases. Available in EEGLAB, the MATLAB toolbox for EEG analysis, alongside its default InfoMax algorithm (runica).

Non-linear ICA. The mixing x = f(s) is non-linear. Hard: without extra assumptions, the recovery is fundamentally ambiguous. Recent work by Hyvärinen and colleagues uses auxiliary variables (time, group labels) or temporal structure to make non-linear ICA identifiable, for example Time-Contrastive Learning and Permutation-Contrastive Learning.

Applications in neuroscience. EEG / MEG: separate eye-blink and muscle-noise components from neural sources. fMRI: identify spatially-independent brain networks. Now standard pre-processing in most analysis pipelines.

ICA vs sparse coding. Both find a sparse / non-Gaussian basis. ICA forces statistical independence; sparse coding only forces sparsity. Sparse coding is more flexible but has more local optima.

Identifiability. Comon (1994) proved that linear ICA is identifiable up to permutation and sign/scale when at most one source is Gaussian. Stronger identifiability theorems exist for time-dependent sources and for some forms of non-linear ICA. The frontier of modern ICA theory.

Variational autoencoders as non-linear ICA? Khemakhem et al. (2020) showed that under specific conditions, identifiable non-linear ICA falls out of VAEs with auxiliary variables. Connects two seemingly unrelated communities (signal processing, deep generative modelling).

import mne                    # mne-python: neuroscience-grade EEG / MEG
import numpy as np

# Classical EEG artefact removal with ICA
# raw_eeg is assumed to be a preloaded mne.io.Raw that includes an EOG channel
ica = mne.preprocessing.ICA(n_components=20, random_state=0)
ica.fit(raw_eeg)

# Find components that look like blinks via correlation with EOG channel
eog_idx, eog_scores = ica.find_bads_eog(raw_eeg)
ica.exclude = eog_idx
raw_clean = ica.apply(raw_eeg.copy())        # apply() modifies its argument in place, hence the copy

# Or use scikit-learn's FastICA programmatically
from sklearn.decomposition import FastICA
ica_sk = FastICA(n_components=20, max_iter=500, tol=1e-4, random_state=0)
S_hat = ica_sk.fit_transform(raw_eeg.get_data().T)   # rows: time samples, columns: channels