Module Documentation

clustering-mi: Compute the mutual information between two clusterings of the same objects

clustering_mi.mutual_information(input_data_1, input_data_2=None, *, variation='reduced')

Compute the mutual information between two labelings, given as a pair of label lists, the path to a space-separated file of label pairs, or a contingency table. The variation argument selects which variant of mutual information is computed.

Raises AssertionError for invalid inputs.

Parameters:

input_data_1 (ArrayLike | str) – First argument. One of: a 2-D array-like giving the contingency table, whose columns correspond to the first labeling and whose rows correspond to the second labeling; a string giving the path to a file containing a list of pairs of labels; or a 1-D array-like of labels.
input_data_2 (ArrayLike | None) – Second argument. A 1-D array-like of labels; only valid when the first argument is also a 1-D array-like of labels.
variation (str) –
Variation of mutual information to compute. Options are:
- "reduced" (default): Reduced mutual information (RMI), Dirichlet-multinomial reduction of https://arxiv.org/pdf/2405.05393. Note that this can be (slightly) asymmetric.
- "reduced_flat": Reduced mutual information (RMI), flat reduction of https://arxiv.org/pdf/1907.12581
- "adjusted": Adjusted mutual information (AMI), correcting for chance: https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
- "traditional": Traditional mutual information (MI), microcanonical
- "stirling": Stirling’s approximation of the traditional mutual information, equal to the mutual information of the corresponding probability distributions (times the number of objects).

Returns:

Mutual information value in bits (base 2).

Return type:

ArrayLike
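The sketch below illustrates the documented contingency-table layout (columns for the first labeling, rows for the second) and the two unreduced quantities named in the options above. It is not the package's implementation: the helper names contingency_table, stirling_mi_bits and microcanonical_mi_bits are hypothetical, and the count-based formula used for the microcanonical value (the base-2 logarithm of n! times the product of cell factorials, divided by the products of row-sum and column-sum factorials) is an assumption about what "traditional" refers to.

```python
from collections import Counter
from math import lgamma, log, log2


def contingency_table(labels_1, labels_2):
    """Count co-occurrences; rows index the second labeling, columns the first."""
    assert len(labels_1) == len(labels_2), "labelings must cover the same objects"
    cols = sorted(set(labels_1))
    rows = sorted(set(labels_2))
    counts = Counter(zip(labels_2, labels_1))
    return [[counts[(r, c)] for c in cols] for r in rows]


def stirling_mi_bits(table):
    """N times the mutual information of the empirical joint distribution, in bits."""
    n = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, c in enumerate(row):
            if c > 0:  # empty cells contribute nothing
                total += c * log2(c * n / (row_sums[i] * col_sums[j]))
    return total


def _log2_factorial(k):
    """log2(k!) computed through the log-gamma function."""
    return lgamma(k + 1) / log(2)


def microcanonical_mi_bits(table):
    """Assumed count-based form: log2[n! * prod(c_rs!) / (prod(a_r!) * prod(b_s!))]."""
    n = sum(map(sum, table))
    bits = _log2_factorial(n)
    bits += sum(_log2_factorial(c) for row in table for c in row)
    bits -= sum(_log2_factorial(sum(row)) for row in table)
    bits -= sum(_log2_factorial(sum(col)) for col in zip(*table))
    return bits


labels_1 = [0, 0, 0, 1, 1, 1]
labels_2 = ["a", "a", "b", "b", "b", "b"]
table = contingency_table(labels_1, labels_2)
print(table)  # [[2, 0], [1, 3]]
print(stirling_mi_bits(table))
print(microcanonical_mi_bits(table))
```

For small tables the two values can differ noticeably; under the assumed formula the gap comes only from replacing the factorials with their Stirling approximation.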

clustering_mi.normalized_mutual_information(input_data_1, input_data_2=None, *, variation='reduced', normalization='second')

Compute the normalized mutual information between two labelings, given as a pair of label lists, the path to a space-separated file of label pairs, or a contingency table. The variation argument selects which variant of mutual information is computed, and the normalization argument selects how it is normalized. For the default asymmetric normalization, the result is reported as a fraction of the entropy of the second labeling, which is treated as the ground truth.

Raises AssertionError for invalid inputs.

Parameters:

input_data_1 (ArrayLike | str) – First argument. One of: a 2-D array-like giving the contingency table, whose columns correspond to the first labeling and whose rows correspond to the second labeling; a string giving the path to a file containing a list of pairs of labels; or a 1-D array-like of labels.
input_data_2 (ArrayLike | None) – Second argument. A 1-D array-like of labels; only valid when the first argument is also a 1-D array-like of labels.
variation (str) –
Variation of mutual information to compute. Options are:
- "reduced" (default): Reduced mutual information (RMI), Dirichlet-multinomial reduction of https://arxiv.org/pdf/2405.05393
- "reduced_flat": Reduced mutual information (RMI), flat reduction of https://arxiv.org/pdf/1907.12581
- "adjusted": Adjusted mutual information (AMI), correcting for chance: https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
- "traditional": Traditional mutual information (MI), microcanonical
- "stirling": Stirling’s approximation of the traditional mutual information, equal to the mutual information of the corresponding probability distributions (times the number of objects).
normalization (str) –
Type of normalization to apply. Options are:
- "second" (default): Asymmetric normalization, measures how much the first labeling tells us about the second, as a fraction of all there is to know about the second labeling.
- "first": Asymmetric normalization, measures how much the second labeling tells us about the first, as a fraction of all there is to know about the first labeling.
- "mean": Symmetric normalization by the arithmetic mean of the two entropies.
- "min": Normalize by the minimum of the two entropies.
- "max": Normalize by the maximum of the two entropies.
- "geometric": Normalize by the geometric mean of the two entropies.
- "none": No normalization, returns the mutual information in bits.

Returns:

Normalized mutual information

Return type:

ArrayLike
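As a rough illustration of how the normalization options relate to one another, the sketch below divides a given mutual information value by the denominator each option describes. It assumes plain Shannon entropies of the two labelings, measured in the same units as the mutual information; the entropies the package pairs with the reduced or adjusted variations may differ. The names entropy_bits and normalize are hypothetical helpers, not part of clustering_mi.

```python
from math import log2, sqrt


def entropy_bits(counts):
    """Shannon entropy (bits per object) of a labeling given its group sizes."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def normalize(mi, h_first, h_second, normalization="second"):
    """Divide mi by the entropy-based denominator named by `normalization`."""
    denominators = {
        "second": h_second,                      # fraction of the second labeling's entropy
        "first": h_first,                        # fraction of the first labeling's entropy
        "mean": (h_first + h_second) / 2,        # arithmetic mean
        "min": min(h_first, h_second),
        "max": max(h_first, h_second),
        "geometric": sqrt(h_first * h_second),   # geometric mean
        "none": 1.0,                             # leave the value unnormalized
    }
    return mi / denominators[normalization]


# Group sizes for a first labeling (3, 3) and a second labeling (2, 4).
h_first = entropy_bits([3, 3])
h_second = entropy_bits([2, 4])
mi_per_object = 0.25  # illustrative value in bits per object, not a computed result

for option in ("second", "first", "mean", "min", "max", "geometric", "none"):
    print(option, normalize(mi_per_object, h_first, h_second, option))
```

The min, geometric, mean and max denominators are ordered from smallest to largest, so for a fixed mutual information value the "min" option gives the largest score and "max" the smallest.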