I am selecting a subset of data from a larger dataframe.

    dataset = df.select('RatingScore','CategoryScore','CouponBin','TTM','Price','Spread','Coupon', 'WAM', 'DV')
    dataset = dataset.fillna(0)
    dataset.show(5,True)
    dataset.printSchema()

Now, I feed that into my KMeans model:

    from numpy import array
    from math import sqrt
    from pyspark.mllib.clustering import KMeans, KMeansModel
    import numpy as np

    data_array=np.array(dataset)
    #data_array = np.array(dataset.select('RatingScore', 'CategoryScore', 'CouponBin', 'TTM', 'Price', 'Spread', 'Coupon', 'WAM',
    #'DV').collect())

    # Build the model (cluster the data)
    clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")

    # Evaluate clustering by computing Within Set Sum of Squared Errors
    def error(point):
        center = clusters.centers[clusters.predict(point)]
        return sqrt(sum([x**2 for x in (point - center)]))

    WSSSE = data_array.map(lambda point: error(point)).reduce(lambda x, y: x + y)
    print("Within Set Sum of Squared Error = "+ str(WSSSE))

This line:

    clusters = KMeans.train(data_array, 2, maxIterations=10, initializationMode="random")

Throws this error: AttributeError: 'numpy.ndarray' object has no attribute 'map'
From the code, you can see that I tried to create the array two different ways. Neither worked. If I try to feed in the items straight from the subset dataframe, I get this error:
AttributeError: 'DataFrame' object has no attribute 'map'

What am I missing here?