.NET and Hadoop - What should I know / learn and what is available?

At the moment, there is not much .NET-specific material for Hadoop, so you just follow the regular Hadoop tutorials. The SQL Server connector only lets you import input data and export results to a format that is easier to access from the rest of your application.

You can run Hadoop on Windows. However, it requires Cygwin (a Unix-like environment and command-line interface for Microsoft Windows).

Basically, to use Hadoop you will need to learn Linux anyway.


It's a vague question, so here's a vague answer :)

Hadoop on its own is a tool for running map-reduce jobs on a cluster. It's highly optimized for performance, and a good deal of that optimization comes from distributing the data in a way that makes it easy to consume without incurring I/O penalties.

For this you should read about HDFS and its internals, which explain how this is done. In a nutshell, the input data is clumped together on nodes so that the processes run locally and read it sequentially (this is a property/limitation of HDFS).
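To make the "run the processes locally" idea a bit more concrete, here is a toy sketch in plain Python. It is not HDFS code: the function names, the round-robin placement and the tiny block size are all assumptions made up for illustration (real HDFS uses much larger blocks and also replicates them).

```python
def split_into_blocks(data, block_size):
    """Cut the input into fixed-size blocks (toy sizes, for illustration only)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (a simplifying assumption)."""
    placement = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


def run_locally(blocks, placement, task):
    """Run the task on each node against only the blocks that node holds."""
    return {
        node: [task(blocks[i]) for i in indexes]
        for node, indexes in placement.items()
    }


if __name__ == "__main__":
    blocks = split_into_blocks("abcdefghij", 4)
    placement = place_blocks(blocks, ["node-a", "node-b"])
    # Each node only ever touches its own blocks, so no data has to move.
    print(run_locally(blocks, placement, len))
```

The point of the sketch is only that the work is sent to where the data already sits, instead of shipping the data to the work.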

This way you feed in your "BigData" and it gets split and processed in the most efficient way inside the cluster.
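In case "map-reduce job" is still abstract, here is a minimal word-count sketch of the three steps (map, shuffle, reduce), simulated in memory with plain Python. It does not use any Hadoop API and the function names are my own; on a real cluster the framework does the shuffle and runs the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three steps over a list of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    # prints {'big': 2, 'data': 3, 'is': 2}
    print(word_count(["big data is big", "data is data"]))
```

Because each line is mapped independently and each key is reduced independently, both steps can be spread over as many nodes as you have.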

Now, that's all there is to Hadoop itself. There are tools that work on top of it and let you perform high-level abstractions on the data (map-reduce is among the simplest procedures).

Those include:

  • Pig http://pig.apache.org/ which is a language to work with the map-reduce process and construct more complex operations
  • Hive http://hive.apache.org/ similar to the previous but more SQL-oriented
  • Cascading http://www.cascading.org/ yet another, more focused on data flow than queries
  • Cascalog https://github.com/nathanmarz/cascalog based on Cascading, written in Clojure
  • HBase http://hbase.apache.org/ a type of NoSQL database on top of HDFS
  • ElephantDB https://github.com/nathanmarz/elephantdb another NoSQL database for Hadoop
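As a rough idea of what "higher level" means for the query-oriented tools in this list: you state something like a SQL "sum per group" and the map-reduce plumbing is handled for you. The sketch below is plain Python, not Pig or Hive code, and the function and field names are hypothetical; it only shows how such a query lines up with the map, shuffle and reduce steps.

```python
from collections import defaultdict


def group_by_sum(rows, key_field, value_field):
    """Roughly: SELECT key_field, SUM(value_field) ... GROUP BY key_field."""
    # Map: project each row down to a (key, value) pair.
    pairs = [(row[key_field], row[value_field]) for row in rows]
    # Shuffle: bring together all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Reduce: aggregate each group into one number.
    return {key: sum(values) for key, values in grouped.items()}


if __name__ == "__main__":
    sales = [
        {"country": "NL", "sales": 10},
        {"country": "BE", "sales": 5},
        {"country": "NL", "sales": 7},
    ]
    # prints {'NL': 17, 'BE': 5}
    print(group_by_sum(sales, "country", "sales"))
```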

Specifics for .NET

For Hadoop on Azure (.NET), there's an introduction on MSDN here, with more info here, on building Hadoop applications through their platform. It's only a CTP for now, but of course this will change.

Here's another good blogpost about Hadoop and MapReduce with code

Additionally, there's a company that frequently publishes information about Hadoop: Cloudera. Check the Cloudera page linked above regularly; it covers all the concepts behind Hadoop (it's pretty advanced, though).

I'm pretty sure this isn't what you were looking for, but I've no idea what you want, so I hope you can at least check out a few new projects that may help.

Also check Storm: https://github.com/nathanmarz/storm (it's not related to Hadoop, but it handles realtime scenarios, which Hadoop is not suited for).