Help Understanding Cross Validation and Decision Trees

Cross validation isn't used for buliding/pruning the decision tree. It's used to estimate how good the tree (built on all of the data) will perform by simulating arrival of new data (by building the tree without some elements just as you wrote). I doesn't really make sense to pick one of the trees generated by it because the model is constrained by the data you have (and not using it all might actually be worse when you use the tree for new data).
The tree is built over the data that you choose (usualy all of it). Pruning is usually done by using some heuristic (i.e. 90% of the elements in the node belongs to class A so we don't go any further or the information gain is too small).


The problem I can't figure out is at the end you'll have k Decision trees that could all be slightly different because they might not split the same way, etc. Which tree do you pick?

The purpose of cross validation is not to help select a particular instance of the classifier (or decision tree, or whatever automatic learning application) but rather to qualify the model, i.e. to provide metrics such as the average error ratio, the deviation relative to this average etc. which can be useful in asserting the level of precision one can expect from the application. One of the things cross validation can help assert is whether the training data is big enough.

With regards to selecting a particular tree, you should instead run yet another training on 100% of the training data available, as this typically will produce a better tree. (The downside of the Cross Validation approach is that we need to divide the [typically little] amount of training data into "folds" and as you hint in the question this can lead to trees which are either overfit or underfit for particular data instances).

In the case of decision tree, I'm not sure what your reference to statistics gathered in the node and used to prune the tree pertains to. Maybe a particular use of cross-validation related techniques?...


For the first part, and like the others have pointed out, we usually use the entire dataset for building the final model, but we use cross-validation (CV) to get a better estimate of the generalization error on new unseen data.

For the second part, I think you are confusing CV with the validation set, used to avoid overfitting the tree by pruning a node when some function value computed on the validation set does not increase before/after the split.