Specify list of possible values for Pandas get_dummies
First, if you want pandas to take more values simply add them to the list sent to the get_dummies
method
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
'categorical_1':['A', 'B', 'C', 'B', 'B']}
frame = pd.DataFrame(data)
dummy_values = pd.get_dummies(data['categorical_1'] + ['D','E'])
as in python +
on lists works as a concatenate
operation, so
['A','B','C','B','B'] + ['D','E']
results in
['A', 'B', 'C', 'B', 'B', 'D', 'E']
In my mind this is necessary to account for test data with a value for that column outside of the values used in the training set, but being a novice in machine learning, perhaps that is not necessary so I'm open to a different way to approach this.
From the machine learning perspective, it is quite redundant. This column is a categorical one, so value 'D' means completely nothing to the model, that never seen it before. If you are coding the features unary (which I assume after seeing that you create columns for each value) it is enough to simply represent these 'D', 'E' values with
A B C
0 0 0
(i assume that you represent the 'B' value with 0 1 0
, 'C' with 0 0 1
etc.)
because if there were no such values in the training set, during testing - no model will distinguish between giving value 'D', or 'Elephant'
The only reason for such action would be to assume, that in the future you wish to add data with 'D' values, and simply do not want to modify the code, then it is reasonable to do it now, even though it could make training a bit more complex (as you add a dimension that as for now - carries completely no knowledge), but it seems a small problem.
If you are not going to encode it in the unary format, but rather want to use these values as one feature, simply with categorical values, then you would not need to create these "dummies" at all, and use a model which can work with such values, such as Naive Bayes, which could simply be trained with "Laplacian smoothing" to be able to work around non-existing values.