Recognizing similar shapes at random scale and translation
Being a computer vision guy, I would normally point to feature extraction and matching (SIFT, SURF, LBP, etc.), but this is almost certainly overkill, since most of these methods offer more invariances (= tolerance against transformations) than you actually require (e.g. against rotation, luminance changes, ...). Also, using features would involve either OpenCV or lots of programming.
So here is my proposal for a simple solution - you judge whether it passes the smartness threshold:
It looks like the image you are looking for has some very distinct structures (the letters, the logos, etc.). I would suggest that you do a pixel-to-pixel match for every possible translation, and for a number of different scales (I assume the range of scales is limited) - but only for a small, distinctive patch of the image you are looking for (say, a square portion of the yellow text). This is much faster than matching the whole thing. If you want a fancy name for it: in image processing this is called template matching by correlation. The "template" is the thing you are looking for.
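A minimal sketch of what this could look like in Python, using only NumPy and PIL (the helper names, the brute-force loop and the scale range are my own illustrative choices, not fixed by the problem):

```python
import numpy as np
from PIL import Image

def normalized_cross_correlation(image, template):
    """Brute-force NCC of `template` against every offset in `image`.
    Purely illustrative: fine for a small patch, slow for large templates."""
    image = image.astype(np.float32)
    template = template.astype(np.float32)
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum()) + 1e-8
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    scores = np.empty((out_h, out_w), dtype=np.float32)
    for y in range(out_h):
        for x in range(out_w):
            p = image[y:y + th, x:x + tw]
            p = p - p.mean()
            scores[y, x] = (p * t).sum() / (np.sqrt((p ** 2).sum()) * t_norm + 1e-8)
    return scores

def find_patch(image, patch, scales=(0.75, 0.9, 1.0, 1.1, 1.25)):
    """Scan a handful of scales (assumed range - adjust to your setup) and
    return (best_score, best_scale, (y, x)) of the best correlation peak.
    `image` and `patch` are 2D uint8 grayscale arrays."""
    best = (-1.0, None, None)
    for s in scales:
        th, tw = int(patch.shape[0] * s), int(patch.shape[1] * s)
        scaled = np.asarray(Image.fromarray(patch).resize((tw, th), Image.BILINEAR))
        scores = normalized_cross_correlation(image, scaled)
        y, x = np.unravel_index(scores.argmax(), scores.shape)
        if scores[y, x] > best[0]:
            best = (float(scores[y, x]), s, (int(y), int(x)))
    return best
```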
Once you have found a few candidate locations for your small distinctive patch, you can verify that you have a hit by testing either the whole image or, more efficiently, a few other distinctive patches of the image (using, of course, the translation and scale you found). This makes your search robust against accidental matches of the original patch without costing too much performance.
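Continuing the sketch above, a hypothetical verification step could look like this (the patch offsets, the 0.8 threshold and all names are assumptions, not part of the method itself):

```python
import numpy as np
from PIL import Image

def ncc(a, b):
    """Normalized correlation of two equally sized float patches."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-8))

def verify_match(image, extra_patches, offsets, scale, origin, threshold=0.8):
    """Check a candidate match `origin` = (y, x) by testing a few additional
    distinctive patches at their expected positions. `offsets` are the patch
    positions relative to the first patch in the unscaled reference image;
    `threshold` is an illustrative starting value."""
    oy, ox = origin
    for patch, (dy, dx) in zip(extra_patches, offsets):
        y, x = int(oy + dy * scale), int(ox + dx * scale)
        th, tw = int(patch.shape[0] * scale), int(patch.shape[1] * scale)
        if y < 0 or x < 0:                      # expected location runs off the image
            return False
        window = image[y:y + th, x:x + tw].astype(np.float32)
        if window.shape != (th, tw):
            return False
        t = np.asarray(Image.fromarray(patch).resize((tw, th), Image.BILINEAR),
                       dtype=np.float32)
        if ncc(window, t) < threshold:
            return False
    return True
```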
Regarding dithering tolerance, I would go for simple pre-filtering of both images (the template you are looking for, and the image that is your search space). Depending on the properties of the dithering, you can start experimenting with a simple box blur, and proceed to a median filter with a small kernel (3 x 3) if that does not work. This will not get you 100% identity between template and searched image, but it will give you robust numerical scores you can compare.
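With SciPy, that pre-filtering is a one-liner per image; which filter and kernel size work best is something you would determine experimentally:

```python
import numpy as np
from scipy.ndimage import uniform_filter, median_filter

def prefilter(img, mode="box", size=3):
    """Smooth dithering noise before matching. `mode` and `size` are knobs
    to experiment with; nothing here is specific to your images."""
    img = np.asarray(img, dtype=np.float32)
    if mode == "box":
        return uniform_filter(img, size=size)   # simple box blur
    return median_filter(img, size=size)        # e.g. a 3x3 median filter

# Apply the SAME filter to both the template and the search image,
# so their statistics stay comparable:
# image = prefilter(image, mode="median")
# patch = prefilter(patch, mode="median")
```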
Edit in the light of comments
I understand that (1) you want something more robust, more "CV-like" and a bit more fancy as a solution, and that (2) you are skeptical about achieving scale invariance by simply scanning through a large stack of different scales.
Regarding (1), the canonical approach is, as mentioned above, to use feature descriptors. Feature descriptors do not describe a complete image (or shape), but a small portion of an image in a way that is invariant against various transformations. Have a look at SIFT and SURF, and at VLFeat, which has a good SIFT implementation and also implements MSER and HOG (and is much smaller than OpenCV). SURF is easier to implement than SIFT; both are heavily patented. Both have an "upright" version, which drops the rotation invariance. This should increase robustness in your case.
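If you do end up pulling in OpenCV after all, a minimal SIFT matching sketch looks roughly like this (the file names and the 0.75 ratio threshold are placeholders; it only counts unambiguous matches, you would still estimate translation/scale from them):

```python
import cv2

# Load the search image and the template as grayscale.
img = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
tpl = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(tpl, None)
kp2, des2 = sift.detectAndCompute(img, None)

# Brute-force matching with Lowe's ratio test to discard ambiguous matches.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])
print(f"{len(good)} good matches")
```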
The strategy you describe in your comment goes more in the direction of shape descriptors than image feature descriptors. Make sure that you understand the difference between the two! 2D shape descriptors aim at shapes that are typically described by an outline or a binary mask. Image feature descriptors (in the sense used above) aim at images with intensity values, typically photographs. An interesting shape descriptor is shape context; many others are summarized here. I don't think that your problem is best solved by shape descriptors, but maybe I misunderstood something. I would be very careful with shape descriptors on image edges, as edges, being first derivatives, can be strongly altered by dithering noise.
Regarding (2): I'd like to convince you that scanning through a bunch of different scales is not just a stupid hack for those who don't know computer vision! Actually, it's done a lot in vision; we just have a fancy name for it to mislead the uninitiated - scale space search. That's a bit of an oversimplification, but really just a bit. Most image feature descriptors that are used in practice achieve scale invariance using a scale space, which is a stack of increasingly downscaled (and low-pass filtered) images. The only trick they add is to look for extrema in the scale space and compute descriptors only at those extrema. But still, the complete scale space is computed and traversed to find those extrema. Have a look at the original SIFT paper for a good explanation of this.
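As an illustration, a naive scale space is just a loop of "low-pass, then downsample"; the number of levels, the scale factor and the sigma below are arbitrary starting values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def scale_space(image, levels=5, factor=0.8, sigma=1.0):
    """Build a simple scale space: blur, then downscale, repeatedly.
    `levels`, `factor` and `sigma` are illustrative parameters."""
    image = np.asarray(image, dtype=np.float32)
    stack = [image]
    for _ in range(levels - 1):
        smoothed = gaussian_filter(stack[-1], sigma=sigma)  # low-pass first
        stack.append(zoom(smoothed, factor))                # then downscale
    return stack

# Searching the stack with a fixed-size template is equivalent to searching
# the original image at several template scales, e.g. with the correlation
# sketch above:
# for level, img in enumerate(scale_space(image)):
#     scores = normalized_cross_correlation(img, patch)
#     ...
```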
Nice. I once implemented a cheat for a Flash game by capturing the screen as well :). If you need to find the exact border you showed in the image, you could create a color filter to remove everything else, leaving you with a binary image that you can use for further processing (the task at hand would then be to find a matching rectangle with a certain border ratio). You could also implement four kernels that would find the corners at a few different scales.
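A rough sketch of such a color filter (the target color and tolerance are made-up values you would tune to your border):

```python
import numpy as np

def color_mask(rgb, target, tol=30):
    """Return a binary mask of pixels within `tol` of the border color `target`.
    `rgb` is an HxWx3 uint8 array; `target` and `tol` are illustrative values."""
    diff = np.abs(rgb.astype(np.int16) - np.asarray(target, dtype=np.int16))
    return diff.max(axis=2) <= tol

# Example (hypothetical yellow-ish border):
# mask = color_mask(frame, target=(255, 215, 0))
```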
If you have an image stream and know there is movement, you can also monitor the difference between frames to capture the active parts of your screen by employing a background-modeling solution. Combine these and you will get quite far, I guess, without resorting to more exotic methods like multi-scale analysis.
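A crude frame-differencing sketch, as a stand-in for a proper background model (the threshold is an arbitrary starting point):

```python
import numpy as np

def motion_mask(prev_frame, frame, threshold=25):
    """Mark pixels that changed noticeably between two consecutive grayscale
    frames. A real background model (e.g. a running average of past frames)
    would be more robust; this is just the simplest possible version."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold
```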
Is performance an issue? My cheat ran at about 20 fps, as it needed to click a ball fast enough.