Clustering in Ruby

512 days ago


Clustering algorithms play a very important role in the current ecosystem of web applications. If you’re at all interested in automated data grouping/sorting you should at least be familiar with one or two types of these algorithms (in addition to more advanced Machine Learning topics, but this serves a good start). In this article, I’m going to go through the process of implementing a very simple Ruby program that can group a set of 2D coordinates into clusters, where each group is composed of a center point, and all of the data points closest to it.


The algorithm that we’ll be using to accomplish this task is a simple one; k-means clustering will give us the behavior that we want with little fuss.


Overview


Our k-means clustering algorithm takes in, as input, a set of points in the 2-dimensional plane. As output, the points will be grouped into k clusters, where k is an integer specified by the user. Unfortunately, the algorithm can’t decide how many groups there are by itself without more complication, so k must be given. Nonetheless, I’ll describe the algorithm below:


  1. Start by choosing k random points within the dataset’s range as an initial guess for the positions of all the clusters. These points form the centroid point of all the clusters. All distances to other points will be measured from here.
  2. For each point in the input data, assign it to the cluster that it is nearest to. After this step, each cluster will somehow be associated with a set of nearby points.
  3. For each cluster, go through the set of associated datapoints and calculate the average among them. This will give a new centroid point that is directly in the center of all of the member points.
  4. If the clusters didn’t move from their previous locations after recentering, or if they all move less than a certain delta value, return the k clusters and their associated points. Otherwise, go back to Step 2 after deassociating all of the associated points with their cluster. This lets the algorithm start fresh, but with more accurate centroid points.



Implementation


To begin with, we’ll need a class to store the points to be clustered by the algorithm. Essentially, we just need a Point class to hold x and y values. This is implemented below (I won’t insult your intelligence trying to explain it).



Next, there has to be a class to hold clusters of data. As the algorithm described, clusters have groups of member points and a center point (not necessarily in the dataset) associated with them. This corresponds to two instance variables: points</tt>, a list of Points, and <tt>center, a single Point. Additionally, there needs to be a way for Clusters to update by averaging their member points. This is implemented in recenter!.



Finally, the algorithm itself needs to be implemented.


The parameters to the kmeans function are a dataset (list of Points), data, number of clusters to find, k, and an optional halting delta, delta. The algorithm will halt when all of the clusters are updated by a value less than delta on an iteration.

Initially, the algorithm needs to choose the starting guesses for cluster centers. It does this by generating k Cluster objects, and assigning them a center from a randomly selected Point from data.



Next is the main meat of the algorithm. The code loops indefinitely and assigns points to clusters by finding, for each point, which cluster center is the closest. This assignment will be updated, and become more accurate, each iteration of the loop while the clusters recenter.



Finally, in the code at the bottom of the while loop, we recalculate the centers of the clusters for the next iteration. This is done by calling recenter! on all of the Cluster objects. Additionally, we do some delta checking because we need to leave the loop eventually. By keeping track of the most that any Cluster was updated, we can compare it against delta to see if all of the Clusters were below the input delta. If the delta was hit, the algorithm terminates, returning a list of all of the Clusters found in the dataset.



Overall, k-means clustering is a pretty simple algorithm, as you can see from above. The entire source file, along with glue/integration code, is available here to download and/or view. Next, let’s see the program in action.


Example Run


As you can see from the link above, I ended up writing some additional shell code to implement reading in data, getting k, and plotting the output of the algorithm. For the plotting, I used rgplot (gem install gnuplot) to pipe commands to a running instance of Gnuplot.


To run the full program, open up a terminal and execute: $ ruby kmeans.rb CSVFILE.

The program assumes CSVFILE has two comma-separated floating point numbers per line, specifying both the x and y coordinates of a single point.


To give you a feel for the output that k-means generates, I ran it on a random dataset. In the graph below, each set of points plotted with the same color indicates a Cluster object’s member points. To run this yourself, you can grab my dataset here, although creating your own with more definite, pre-designed clusters may be more interesting to see.



Conclusion


While this is a very simple example, note that the x and y axis can be whatever you want them to be (latitude/longitude of households, baseball stats, etc). You could even (easily) extend the program to support 3 or more parameters (dimensions). Thus, k-means clustering can actually be a powerful tool for grouping real-world datasets, despite the apparent simplicity.

Colin Drake

,

Comment

---

Why I Like the BSD

512 days ago

Let’s find some common ground here before we start. For those of you that don’t know, the BSD is an open source software license. It is one of many “competing” licenses that aims at maximizing freedom for some group in the realm of software development and distribution. Specifically, the BSD attempts to maximize this desired freedom by saying as little as possible. Less clauses means less Legalese means less words means less chances to restrict a participant in the software development cycle. This stark simplicity has many implications which I will discuss later in the article.

Note: I’m not going to provide the text of the BSD license here, however short it may be. There’s Wikipedia for that. I want to discourage copying, pasting, and using without thinking (even though I’d love for the world to have more BSD-licensed software). This article merely reflects my own opinions and preferences. Furthermore, I am not a lawyer, so take what I say about this with a grain of salt. But enough with the disclaimers…

Control

As I’ve said before, the BSD is a very simple license. Unlike many of its competitors, it directly restricts only a handful of freedoms, putting it particularly near a declaration of public domain. Most importantly though, the BSD opts to say less about the developer’s liberties, which in turn increases the liberties allowed towards the end user. This is the most important quality of the BSD, because in that regard, it is unique when compared to other open source licenses, such as the GPL.

This simple ruleset manages to elegantly retain developer’s most basic rights to their code while keeping other routes of usage, not traditionally allowed by open source, free for use.

Commercialization

This lack of control allows for corporations, businesses, and users alike to share, copy, sell, and modify the project to their heart’s content. The amount of freedom given to them is very high: users can extend, rebrand, give commercial support for, or even start a business around somebody else’s BSD project. As long as a copy of the BSD license with the original author’s name is retained, the end user basically has the right to do whatever they want with it. The BSD’s sole goal here is the lack of warranty and prevention of somebody claiming other’s work as their own.

This freedom allowed Microsoft to implement Windows networking systems at a minimal cost, and gave Apple a stable base to start Mac OSX from. Rather than having to redesign and implement the wheel, both of these companies could focus on the facets of their respective project thats were the most important. This helped them get their product out of the door, and gave end users a time-tested infrastructure to work on. In short, the BSD isn’t just free. It’s profitable (both monetarily and pragmatically speaking) for all involved parties.

Simplicity

If you know me, it’s no big secret that I’m a fan of minimalism. Philip Glass’ simple arpeggios, Hemingway’s “iceburg” writing style, and the BSD all share an underlying philosophy behind them. Make your point with the least amount of extraneous bells and whistles as possible.

The BSD excels in this arena. Look at the GPL. Now, look at your man — Err, the BSD (reference). In a few short paragraphs, the BSD manages to describe a system of distribution that strikes a perfect balance between exclusivity and openness in a way that’s fair and beneficial to everyone.

Counterarguments

Typically, I’ve found that the BSD is very versatile. Whether you’re writing your sassy new jQuery plugin to turn all H2 tags baby blue, or a device driver for an OS kernel, the BSD normally has some type of net benefit to offer you. However, I concede that there are special cases where other licenses might be more apt to your interests.

Sometimes, by choice, you don’t want people to go off and run with your idea or code. This suggests that the GPL may be a better route. This is especially true in academic or learning projects, where your main goal is not utility, but obtaining an increase in personal knowledge. The “cost” of using or modifying your project essentially becomes more open source code given back to you.

Furthermore, not all source code can or should be open sourced. I like to make as much code as I can free or open source, but sometimes you may find that your business is surviving purely because you are the only one that knows how to do something, or at least do it effectively. In these cases, a more limited, or even proprietary license, may suffice.

Cases like these don’t illustrate a failing of the BSD. They only show that you should always take into consideration your goals for any given project. Software licenses should not be a dogmatic matter. No matter how much I like the BSD, I still evaluate different licenses on a per-project basis, because sometimes a different license is just flat out better suited.

Conclusion

Again, I’d like to state that the goal of this article is not to undermine the efforts of the GPL, WTFPL, et. al. I’m only here to provide a rationale for my own preference. That being said, I think the BSD is, in general, a good choice for many kinds of projects. It is open, allows authors to get credit for their work, is compatible with for-profit software, and yet still provides freedom all around the table.

Colin Drake

,

Comment

---