Module Documentation
Copyright (c) 2025 Max Jerdee. All rights reserved.
clustering-mi: Compute the mutual information between two clusterings of the same objects
- clustering_mi.mutual_information(input_data_1, input_data_2=None, *, variation='reduced')
Compute the mutual information between two labelings, given as a pair of lists, the name of a space-separated file of labels, or a contingency table. The variation of mutual information to compute can be specified.
Raises AssertionError for invalid inputs.
- Parameters:
input_data_1 (ArrayLike or str) – First argument. One of: a 2-D array-like giving the contingency table, whose columns correspond to the first labeling and whose rows correspond to the second labeling; a string giving the path to a file containing a list of pairs of labels; or a 1-D array-like of labels.
input_data_2 (ArrayLike, optional) – Second argument. A 1-D array-like of labels, valid only when the first argument is also a 1-D array-like of labels.
variation (str, optional) –
- Variation of mutual information to compute. Options are:
"reduced" (default): Reduced mutual information (RMI), using the Dirichlet-multinomial reduction of https://arxiv.org/pdf/2405.05393. Note that this can be (slightly) asymmetric.
"reduced_flat": Reduced mutual information (RMI), using the flat reduction of https://arxiv.org/pdf/1907.12581
"adjusted": Adjusted mutual information (AMI), correcting for chance: https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
"traditional": Traditional mutual information (MI), microcanonical.
"stirling": Stirling's approximation of the traditional mutual information, equal to the mutual information of the corresponding probability distributions (times the number of objects).
- Returns:
Mutual information value in bits (base 2).
- Return type:
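The input forms and the "stirling" variation described above can be illustrated without the package. The following is a minimal sketch using only the standard library; the helpers `contingency_table` and `stirling_mi_bits` are hypothetical names introduced here, not part of clustering_mi. The table orientation (columns for the first labeling, rows for the second) follows the parameter description, and the quantity computed is the number of objects times the mutual information of the empirical joint distribution, in bits, as the "stirling" option is described. Whether the package's own output agrees digit for digit is an assumption that has not been checked here.

```python
from collections import Counter
from math import log2


def contingency_table(labels_1, labels_2):
    """Count co-occurrences; columns follow labels_1, rows follow labels_2."""
    assert len(labels_1) == len(labels_2), "labelings must cover the same objects"
    cols = sorted(set(labels_1))
    rows = sorted(set(labels_2))
    counts = Counter(zip(labels_2, labels_1))
    return [[counts[(r, c)] for c in cols] for r in rows]


def stirling_mi_bits(table):
    """N * I(p) in bits, where p is the empirical joint distribution of the table."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing
                total += n_ij * log2(n_ij * n / (row_sums[i] * col_sums[j]))
    return total


labels_a = [0, 0, 1, 1]
labels_b = ["x", "x", "y", "y"]  # same partition as labels_a, different names
labels_c = [0, 1, 0, 1]          # statistically independent of labels_a

table_ab = contingency_table(labels_a, labels_b)
table_ac = contingency_table(labels_a, labels_c)
mi_ab = stirling_mi_bits(table_ab)  # identical partitions share all information
mi_ac = stirling_mi_bits(table_ac)  # independent partitions share none
```

With the package installed, the corresponding call under the signature documented above would be `clustering_mi.mutual_information(labels_a, labels_b, variation="stirling")`.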
- clustering_mi.normalized_mutual_information(input_data_1, input_data_2=None, *, variation='reduced', normalization='second')
Compute the normalized mutual information between two labelings, given as a pair of lists, the name of a space-separated file of labels, or a contingency table. The variation of mutual information and the type of normalization can both be specified. With the default asymmetric normalization, the result is reported as a fraction of the entropy of the second labeling, which is treated as the ground truth.
Raises AssertionError for invalid inputs.
- Parameters:
input_data_1 (ArrayLike or str) – First argument. One of: a 2-D array-like giving the contingency table, whose columns correspond to the first labeling and whose rows correspond to the second labeling; a string giving the path to a file containing a list of pairs of labels; or a 1-D array-like of labels.
input_data_2 (ArrayLike, optional) – Second argument. A 1-D array-like of labels, valid only when the first argument is also a 1-D array-like of labels.
variation (str, optional) –
- Variation of mutual information to compute. Options are:
"reduced" (default): Reduced mutual information (RMI), using the Dirichlet-multinomial reduction of https://arxiv.org/pdf/2405.05393. Note that this can be (slightly) asymmetric.
"reduced_flat": Reduced mutual information (RMI), using the flat reduction of https://arxiv.org/pdf/1907.12581
"adjusted": Adjusted mutual information (AMI), correcting for chance: https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
"traditional": Traditional mutual information (MI), microcanonical.
"stirling": Stirling's approximation of the traditional mutual information, equal to the mutual information of the corresponding probability distributions (times the number of objects).
normalization (str, optional) –
- Type of normalization to apply. Options are:
"second" (default): Asymmetric normalization. Measures how much the first labeling tells us about the second, as a fraction of all there is to know about the second labeling.
"first": Asymmetric normalization. Measures how much the second labeling tells us about the first, as a fraction of all there is to know about the first labeling.
"mean": Symmetric normalization by the arithmetic mean of the two entropies.
"min": Normalize by the minimum of the two entropies.
"max": Normalize by the maximum of the two entropies.
"geometric": Normalize by the geometric mean of the two entropies.
"none": No normalization; returns the mutual information in bits.
- Returns:
Normalized mutual information.
- Return type:
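The normalization options listed above differ only in the denominator applied to the mutual information. The sketch below makes that explicit using only the standard library; `entropy_bits` and `normalizer` are hypothetical helpers introduced for illustration, not functions of clustering_mi. It uses Shannon entropies of the empirical label distributions, which pairs naturally with the "stirling" variation; how the package defines the entropies for the other variations is not shown here and should not be inferred from this example.

```python
from collections import Counter
from math import log2, sqrt


def entropy_bits(labels):
    """N * H(p) in bits for the empirical distribution of the labels."""
    n = len(labels)
    return -sum(c * log2(c / n) for c in Counter(labels).values())


def normalizer(h_first, h_second, normalization="second"):
    """Denominator corresponding to each documented normalization option."""
    options = {
        "second": h_second,
        "first": h_first,
        "mean": (h_first + h_second) / 2,
        "min": min(h_first, h_second),
        "max": max(h_first, h_second),
        "geometric": sqrt(h_first * h_second),
        "none": 1.0,
    }
    assert normalization in options, f"unknown normalization: {normalization}"
    return options[normalization]


first = [0, 0, 1, 1]   # candidate labeling: two groups of two
second = [0, 1, 2, 3]  # reference labeling: every object in its own group

h_first = entropy_bits(first)
h_second = entropy_bits(second)
# Mutual information via the identity I = H(first) + H(second) - H(first, second)
mi = h_first + h_second - entropy_bits(list(zip(first, second)))

nmi = {
    name: mi / normalizer(h_first, h_second, name)
    for name in ("second", "first", "mean", "min", "max", "geometric")
}
```

In this example the first labeling recovers only half of the information in the second, so the default "second" normalization gives 0.5, while "first" gives 1.0 because the second labeling fully determines the first. The symmetric options fall between these two values.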