What's the best approach to recognize patterns in data, and what's the best way to learn more on the topic?

In addition to the useful comments about image processing, it also sounds like you're dealing with a clustering problem.

Clustering algorithms come from the machine learning literature, specifically unsupervised learning. As the name implies, the basic idea is to try to identify natural clusters of data points within some large set of data.

For example, the picture below shows how a clustering algorithm might group a bunch of points into 7 clusters (indicated by circles and color):

k-means
(source: natekohl.net)

In your case, a clustering algorithm would attempt to repeatedly merge small cracks to form larger cracks, until some stopping criteria is met. The end result would be a smaller set of joined cracks. Of course, cracks are a little different than two-dimensional points -- part of the trick in getting a clustering algorithm to work here will be defining a useful distance metric between two cracks.

Popular clustering algorithms include k-means clustering (demo) and hierarchical clustering. That second link also has a nice step-by-step explanation of how k-means works.

EDIT: This paper by some engineers at Phillips looks relevant to what you're trying to do:

  • Chenn-Jung Huang, Chua-Chin Wang, Chi-Feng Wu, "Image Processing Techniques for Wafer Defect Cluster Identification," IEEE Design and Test of Computers, vol. 19, no. 2, pp. 44-48, March/April, 2002.

They're doing a visual inspection for defects on silicon wafers, and use a median filter to remove noise before using a nearest-neighbor clustering algorithm to detect the defects.

Here are some related papers/books that they cite that might be useful:

  • M. Taubenlatt and J. Batchelder, “Patterned Wafer Inspection Using Spatial Filtering for Cluster Environment,” Applied Optics, vol. 31, no. 17, June 1992, pp. 3354-3362.
  • F.-L. Chen and S.-F. Liu, “A Neural-Network Approach to Recognize Defect Spatial Pattern in Semiconductor Fabrication.IEEE Trans. Semiconductor Manufacturing, vol. 13, no. 3, Aug. 2000, pp. 366-373.
  • G. Earl, R. Johnsonbaugh, and S. Jost, Pattern Recognition and Image Analysis, Prentice Hall, Upper Saddle River, N.J., 1996.

Your problem falls in the very broad field of image classification. These types of problems can be notoriously difficult, and at the end of the day, solving them is an art. You must exploit every piece of knowledge you have about the problem domain to make it tractable.

One fundamental issue is normalization. You want to have similarly classified objects to be as similar as possible in their data representation. For example, if you have an image of the cracks, do all images have the same orientation? If not, then rotating the image may help in your classification. Similarly, scaling and translation (refer to this)

You also want to remove as much irrelevant data as possible from your training sets. Rather than directly working on the image, perhaps you could use edge extraction (for example Canny edge detection). This will remove all the 'noise' from the image, leaving only the edges. The exercise is then reduced to identifying which edges are the cracks and which are the natural pavement.

If you want to fast track to a solution then I suggest you first try the your luck with a Convolutional Neural Net, which can perform pretty good image classification with a minimum of preprocessing and noramlization. Its pretty well known in handwriting recognition, and might be just right for what you're doing.


I'm a bit confused by the way you've chosen to break down the problem. If your coworker isn't identifying complete cracks, and that's the spec, then that makes it your problem. But if you manage to stitch all the cracks together, and avoid his false positives, then haven't you just done his job?

That aside, I think this is an edge detection problem rather than a classification problem. If the edge detector is good, then your issues go away.

If you are still set on classification, then you are going to need a training set with known answers, since you need a way to quantify what differentiates a false positive from a real crack. However I still think it is unlikely that your classifier will be able to connect the cracks, since these are specific to each individual paving slab.