K-Means Clustering (Data Science: Unsupervised Machine Learning Example)

For all of the details, you can get a paid Pro subscription plan from Codecademy.

Here are the basics: Clustering is one of the best-known unsupervised learning techniques. It finds structure in unlabeled data by identifying similar groups, or clusters. “Unsupervised” means that your data isn’t labeled beforehand.

You simply decide how many groups you’d like to split the data into (here, n_clusters = 3) and go from there. Setting this up correctly in Python along with all the modules (a PYTHONPATH issue) could be another blog post in itself (see Stack Overflow for help), but the basic code looks something like this:

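from sklearn.cluster import KMeans
import pandas as pd  # structures your data as a spreadsheet
import numpy as np  # handles the arrays created by your code

# header=None because the CSV has no column labels
qds = pd.read_csv("qds3.csv", header=None)

model = KMeans(n_clusters=3)
model.fit(qds)  # the model must be fitted before it can predict
labels = model.predict(qds)

matrix = np.array(labels)
# fmt="%d" writes the labels as integers
np.savetxt("file_2", matrix, fmt="%d", delimiter=",")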

In this example, I put all of my data in a csv file with unlabeled columns (each cell was marked 0 or 1). Once the code runs, it spits out which of the three groups (0, 1, or 2) each row belongs to.
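If you don’t have a csv handy, here is a rough sketch of the same idea on made-up data. The toy table (30 rows, 5 columns of random 0/1 values), the variable names, and the n_init and random_state settings are all my own assumptions for illustration, not part of the original setup. It uses fit_predict, which fits the model and returns the labels in one step.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the csv: 30 rows, 5 columns of 0/1 values.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(30, 5)))

# fit_predict learns the cluster centers and returns one label per row.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(toy)

# Count how many rows landed in each of the three groups.
counts = np.bincount(labels, minlength=3)
print(labels)
print(counts)
```

The group numbers themselves are arbitrary: group 0 in one run could come out as group 2 in another, so what matters is which rows end up together.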

It’s an interesting way to group large chunks of data, and I’m just getting started. More to come soon.
