Module Documentation

Copyright (c) 2025 Max Jerdee. All rights reserved.

clustering-mi: Compute the mutual information between two clusterings of the same objects

clustering_mi.mutual_information(input_data_1, input_data_2=None, *, variation='reduced')

Compute the mutual information between two labelings of the same objects. The input can be a pair of label lists, the path to a space-separated file of label pairs, or a contingency table. The variation parameter selects which variant of mutual information is computed.

Raises AssertionError for invalid inputs.
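The three input forms carry the same information, and the following minimal sketch shows how they relate. The helper names contingency_table and labels_from_lines are hypothetical and not part of clustering_mi. The table orientation (columns for the first labeling, rows for the second) follows the input_data_1 description. The file layout of one "label_1 label_2" pair per line is an assumption based on the phrase "space separated file of labels".

```python
from collections import Counter


def contingency_table(labels_1, labels_2):
    """Count co-occurrences; rows follow labels_2, columns follow labels_1."""
    assert len(labels_1) == len(labels_2), "labelings must cover the same objects"
    cols = sorted(set(labels_1))  # groups of the first labeling
    rows = sorted(set(labels_2))  # groups of the second labeling
    counts = Counter(zip(labels_2, labels_1))
    return [[counts[(r, c)] for c in cols] for r in rows]


def labels_from_lines(lines):
    """Split 'label_1 label_2' lines into two parallel label lists.

    The one-pair-per-line layout is an assumption, not a documented format.
    """
    pairs = [line.split() for line in lines if line.strip()]
    assert all(len(p) == 2 for p in pairs), "expected two labels per line"
    return [p[0] for p in pairs], [p[1] for p in pairs]


labels_1 = ["a", "a", "b", "b", "b"]
labels_2 = ["x", "y", "y", "y", "y"]
table = contingency_table(labels_1, labels_2)
print(table)  # rows: x, y; columns: a, b
```

Under these assumptions, passing the two lists, a file holding the same pairs, or the table itself should all describe the same pair of labelings.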

Parameters:
  • input_data_1 (ArrayLike or str) – First argument. One of: a 2D array-like contingency table whose columns correspond to the first labeling and whose rows correspond to the second labeling; a string giving the path to a file containing a list of pairs of labels; or a 1-D array-like of labels.

  • input_data_2 (ArrayLike, optional) – Second argument. A 1-D array-like of labels. Only valid when the first argument is also a 1-D array-like of labels.

  • variation (str, optional) –

    Variation of mutual information to compute. Options are:
    • "reduced" (default): Reduced mutual information (RMI), Dirichlet-multinomial reduction of https://arxiv.org/pdf/2405.05393. Note that this can be (slightly) asymmetric.

    • "reduced_flat": Reduced mutual information (RMI), flat reduction of https://arxiv.org/pdf/1907.12581

    • "adjusted": Adjusted mutual information (AMI), correcting for chance: https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf

    • "traditional": Traditional mutual information (MI), microcanonical

    • "stirling": Stirling’s approximation of the traditional mutual information, equal to the mutual information of the corresponding probability distributions (times the number of objects).

Returns:

Mutual information value in bits (base 2).

Return type:

float
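To make the difference between the "stirling" and "traditional" variations concrete, here is a minimal standard-library sketch that works on a contingency table. The function names are hypothetical. The "stirling" formula follows the description above (mutual information of the empirical distributions, times the number of objects). The microcanonical formula, log2 of N! ∏ n_rs! / (∏ a_r! ∏ b_s!), is an assumption about what "traditional" means here and may not match the library's implementation in every detail.

```python
from math import lgamma, log, log2


def _margins(table):
    """Return (total count, row sums, column sums) of a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return sum(row_sums), row_sums, col_sums


def stirling_mi_bits(table):
    """N times the mutual information of the empirical distributions, in bits."""
    n, row_sums, col_sums = _margins(table)
    mi = 0.0
    for r, row in enumerate(table):
        for c, n_rc in enumerate(row):
            if n_rc > 0:  # empty cells contribute nothing
                mi += n_rc * log2(n * n_rc / (row_sums[r] * col_sums[c]))
    return mi


def log2_factorial(k):
    """log2(k!) computed through the log-gamma function."""
    return lgamma(k + 1) / log(2)


def microcanonical_mi_bits(table):
    """Assumed microcanonical form: log2 of N! * prod n_rs! / (prod a_r! * prod b_s!)."""
    n, row_sums, col_sums = _margins(table)
    return (
        log2_factorial(n)
        + sum(log2_factorial(x) for row in table for x in row)
        - sum(log2_factorial(a) for a in row_sums)
        - sum(log2_factorial(b) for b in col_sums)
    )


identical = [[2, 0], [0, 2]]    # two labelings that agree exactly
independent = [[1, 1], [1, 1]]  # two labelings that share no information

print(stirling_mi_bits(identical), microcanonical_mi_bits(identical))
print(stirling_mi_bits(independent), microcanonical_mi_bits(independent))
```

With the package installed, the corresponding library call would presumably be clustering_mi.mutual_information(labels_1, labels_2, variation="stirling"), following the signature documented above.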

clustering_mi.normalized_mutual_information(input_data_1, input_data_2=None, *, variation='reduced', normalization='second')

Compute the normalized mutual information between two labelings of the same objects. The input can be a pair of label lists, the path to a space-separated file of label pairs, or a contingency table. The variation parameter selects the variant of mutual information, and the normalization parameter selects the type of normalization. With the default asymmetric normalization, the result is reported as a fraction of the entropy of the second labeling, which is treated as the ground truth.

Raises AssertionError for invalid inputs.

Parameters:
  • input_data_1 (ArrayLike or str) – First argument. One of: a 2D array-like contingency table whose columns correspond to the first labeling and whose rows correspond to the second labeling; a string giving the path to a file containing a list of pairs of labels; or a 1-D array-like of labels.

  • input_data_2 (ArrayLike, optional) – Second argument. A 1-D array-like of labels. Only valid when the first argument is also a 1-D array-like of labels.

  • variation (str, optional) –

    Variation of mutual information to compute. Options are:
    • "reduced" (default): Reduced mutual information (RMI), Dirichlet-multinomial reduction of https://arxiv.org/pdf/2405.05393

    • "reduced_flat": Reduced mutual information (RMI), flat reduction of https://arxiv.org/pdf/1907.12581

    • "adjusted": Adjusted mutual information (AMI), correcting for chance: https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf

    • "traditional": Traditional mutual information (MI), microcanonical

    • "stirling": Stirling’s approximation of the traditional mutual information, equal to the mutual information of the corresponding probability distributions (times the number of objects).

  • normalization (str, optional) –

    Type of normalization to apply. Options are:
    • "second" (default): Asymmetric normalization. Measures how much the first labeling tells us about the second, as a fraction of all there is to know about the second labeling.

    • "first": Asymmetric normalization. Measures how much the second labeling tells us about the first, as a fraction of all there is to know about the first labeling.

    • "mean": Symmetric normalization by the arithmetic mean of the two entropies.

    • "min": Normalize by the minimum of the two entropies.

    • "max": Normalize by the maximum of the two entropies.

    • "geometric": Normalize by the geometric mean of the two entropies.

    • "none": No normalization. Returns the mutual information in bits.

Returns:

Normalized mutual information value. With normalization "none", this is the unnormalized mutual information in bits.

Return type:

float
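The normalization options differ only in the quantity that divides the mutual information. The sketch below shows each denominator as described in the option list. The helper name normalize and its arguments are hypothetical, and it is an assumption that the library computes the two entropies in the same variation as the mutual information itself. Zero entropies (a labeling with a single group) are not handled here.

```python
from math import sqrt


def normalize(mi, h_first, h_second, normalization="second"):
    """Divide a mutual information value by the chosen entropy-based denominator.

    mi, h_first and h_second are assumed to share the same units (bits).
    """
    denominators = {
        "second": h_second,                     # fraction of the second labeling's entropy
        "first": h_first,                       # fraction of the first labeling's entropy
        "mean": (h_first + h_second) / 2,       # arithmetic mean of the entropies
        "min": min(h_first, h_second),
        "max": max(h_first, h_second),
        "geometric": sqrt(h_first * h_second),  # geometric mean of the entropies
        "none": 1.0,                            # leave the value in bits
    }
    assert normalization in denominators, f"unknown normalization: {normalization}"
    return mi / denominators[normalization]


# Illustrative numbers only: 1 bit shared, entropies of 2 and 4 bits.
for option in ("second", "first", "mean", "min", "max", "geometric", "none"):
    print(option, normalize(1.0, 2.0, 4.0, option))
```

Since the "min" denominator is the smallest of the entropy-based choices and "max" the largest, "min" gives the largest normalized value and "max" the smallest for the same pair of labelings.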