I recently found this answer, which provides code for a bias-corrected version of Cramer's V for measuring the association between two categorical variables:

    import numpy as np
    import scipy.stats as ss

    def cramers_corrected_stat(confusion_matrix):
        """Calculate Cramer's V statistic for categorical-categorical association.

        Uses the correction from Bergsma, Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        """
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        # Grand total of the table; a single .sum() on a DataFrame only gives per-column sums
        n = confusion_matrix.sum().sum()
        phi2 = chi2/n
        r,k = confusion_matrix.shape
        phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
        rcorr = r - ((r-1)**2)/(n-1)
        kcorr = k - ((k-1)**2)/(n-1)
        return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

However, if the number of samples `n` is equal to the number of categories `r` of the first feature, then `rcorr = r - (r-1)**2/(n-1) = n - (n-1) = 1`, so `rcorr-1 = 0`. This yields a division by zero in `np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))` whenever `(kcorr-1)` is non-negative. I confirmed this with a simple example:

    import pandas as pd

    data = [
        {'name': 'Alice', 'occupation': 'therapist', 'favorite_color': 'red'},
        {'name': 'Bob', 'occupation': 'fisherman', 'favorite_color': 'blue'},
        {'name': 'Carol', 'occupation': 'scientist', 'favorite_color': 'orange'},
        {'name': 'Doug', 'occupation': 'scientist', 'favorite_color': 'red'},
    ]
    df = pd.DataFrame(data)
    confusion_matrix = pd.crosstab(df['name'], df['occupation'])
    # n = 4 (number of samples), r = 4 (number of unique names), k = 3 (number of unique occupations)
    print(cramers_corrected_stat(confusion_matrix))

Output:

    /tmp/ipykernel_227998/749514942.py:45: RuntimeWarning: invalid value encountered in scalar divide
      return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
    nan

Is this expected behavior?
If so, how should I use the corrected Cramer's V in cases where n = r (or, symmetrically, n = k), e.g., when all samples have a unique value for some feature?
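
For reference, the only workaround I can think of is to guard the denominator explicitly, as in the sketch below. The function name `cramers_corrected_stat_guarded` and the choice to return NaN are my own assumptions, not something taken from the linked answer or the paper, and I am not sure NaN is the statistically right convention here.

```python
import numpy as np
import scipy.stats as ss


def cramers_corrected_stat_guarded(confusion_matrix):
    """Bias-corrected Cramer's V with an explicit guard on the denominator.

    Hypothetical variant: returning NaN when the corrected denominator
    is not positive is an assumption, not part of the published correction.
    """
    # Accept both pandas crosstabs and plain arrays
    table = np.asarray(confusion_matrix, dtype=float)
    n = table.sum()
    if n <= 1:
        # (n - 1) would be zero or negative, so the correction is undefined
        return float("nan")

    chi2 = ss.chi2_contingency(table)[0]
    r, k = table.shape
    phi2 = chi2 / n
    phi2corr = max(0.0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
    rcorr = r - ((r - 1) ** 2) / (n - 1)
    kcorr = k - ((k - 1) ** 2) / (n - 1)

    denom = min(kcorr - 1, rcorr - 1)
    if denom <= 0:
        # Happens e.g. when r == n, where rcorr - 1 == 0
        return float("nan")
    return float(np.sqrt(phi2corr / denom))
```

Called on the `pd.crosstab(df['name'], df['occupation'])` table from my example, this returns `nan` without emitting the `RuntimeWarning`, but it only hides the symptom rather than answering whether the statistic is meaningful in that case.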