Strings in a DataFrame, but dtype is object
The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that size is 8 bytes. Strings, however, do not have a fixed length. So instead of storing the bytes of the strings directly in the ndarray, pandas uses an object ndarray, which stores pointers to objects; that is why the dtype of this kind of ndarray is object.
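A minimal sketch of what this looks like from the pandas side. The column names and values are made up for illustration, and the string column requests dtype=object explicitly, since depending on the pandas version and options, strings may otherwise be inferred as a dedicated string dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical data; dtype=object is requested explicitly for the string column.
df = pd.DataFrame({
    "n": pd.Series([1, 2, 3], dtype="int64"),
    "s": pd.Series(["a", "bb", "ccc"], dtype=object),
})

n_values = df["n"].to_numpy()  # fixed-size 8-byte integers
s_values = df["s"].to_numpy()  # pointers to Python str objects

print(df.dtypes)
print(n_values.dtype, n_values.itemsize)
print(s_values.dtype)
print(type(s_values[0]))  # each element is an ordinary Python str
```

The column reports object even though every value in it is a str; the dtype describes what the array stores (references), not what those references point to.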
Here is an example:
- the int64 array contains 4 int64 values.
- the object array contains 4 pointers to 3 string objects.
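The two cases above can be sketched roughly as follows. The concrete values are assumptions chosen for illustration; the object array reuses one string so that its 4 slots refer to only 3 distinct objects.

```python
import numpy as np

# Four 8-byte integers stored directly in the array's buffer.
int_array = np.array([1, 2, 3, 4], dtype=np.int64)

# Four references; the first and last slot refer to the same str object.
a, b, c = "apple", "banana", "cherry"
obj_array = np.array([a, b, c, a], dtype=object)

# Count how many distinct Python objects the four slots refer to.
distinct_objects = len({id(x) for x in obj_array})

print(int_array.dtype, int_array.itemsize, int_array.nbytes)
print(obj_array.dtype, len(obj_array), distinct_objects)
print(obj_array[0] is obj_array[3])
```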
@HYRY's answer is great. I just want to provide a little more context.
Arrays store data as contiguous, fixed-size memory blocks. The combination of these two properties is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].
If you ask your computer to fetch the 3rd element in the array, it'll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.
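The jump described here is plain arithmetic: index times element size. A small sketch, assuming NumPy's int32 as the 32-bit integer type and reading the raw bytes back by hand to mimic the lookup:

```python
import numpy as np

arr = np.array([3, 0, 1], dtype=np.int32)

index = 2                             # the 3rd element
byte_offset = index * arr.itemsize    # 2 * 4 bytes
bit_offset = byte_offset * 8          # the same distance in bits

# Read one int32 straight out of the raw buffer at that offset.
raw = arr.tobytes()
third = np.frombuffer(raw, dtype=np.int32, count=1, offset=byte_offset)[0]

print(arr.itemsize, byte_offset, bit_offset)
print(third)
```

No element before the requested one has to be inspected; the position follows from the index alone.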
Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, each element would start at an irregular offset that depends on the lengths of all the strings before it.
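To make the problem concrete, here is a rough sketch that packs the strings back to back into one bytes object. The ASCII encoding and the fetch helper are assumptions made for illustration, not how any particular library stores text.

```python
from itertools import accumulate

words = ["hello", "i", "am", "a", "banana"]

# Pack the strings back to back, one byte per character (ASCII assumed).
encoded = [w.encode("ascii") for w in words]
packed = b"".join(encoded)

lengths = [len(e) for e in encoded]
# Start offset of each string: the running total of the earlier lengths.
starts = [0] + list(accumulate(lengths))[:-1]

def fetch(index):
    """Hypothetical lookup: must add up every earlier length first."""
    start = sum(lengths[:index])
    return packed[start:start + lengths[index]].decode("ascii")

print(packed)
print(starts)
print(fetch(2))
```

The start offsets are not a multiple of any fixed element size, so finding element k means knowing the lengths of elements 0 through k-1.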
Now your computer doesn't have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some arbitrary memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) The array itself is then once again a contiguous block of fixed-size elements.
Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.
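The two-step lookup can be imitated from Python, with some caveats. This sketch relies on a CPython implementation detail (id() returning the object's memory address) and uses ctypes to follow that address by hand, so treat it as an illustration rather than something to use in real code. On a typical 64-bit build each slot is 8 bytes rather than the 4 bytes assumed in the text.

```python
import ctypes
import numpy as np

words = np.array(["hello", "i", "am", "a", "banana"], dtype=object)

# CPython-specific assumption: id() is the memory address of the object.
addresses = [id(w) for w in words]

# Step 1: pick the fixed-size slot for the 3rd element.
third_address = addresses[2]

# Step 2: follow the address to the actual string object.
third = ctypes.cast(third_address, ctypes.py_object).value

print(words.itemsize)  # size of one slot (one pointer) in bytes
print(third)
```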
The challenge for NumPy is that there's no guarantee the pointers are actually pointing to strings. That's why it reports the dtype as 'object'.
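A short sketch of why the label has to be that general: an object array accepts any Python object in any slot, and the reported dtype is the same whether or not every element happens to be a string. The values below are arbitrary examples.

```python
import numpy as np

# An object array can hold references to anything.
mixed = np.array(["hello", 42, 3.5, None], dtype=object)
element_types = [type(x).__name__ for x in mixed]

# An object array that happens to contain only strings looks the same to NumPy.
all_strings = np.array(["hello", "i", "am"], dtype=object)

print(mixed.dtype, element_types)
print(all_strings.dtype)
```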
Shamelessly gonna plug my own course on NumPy where I originally discussed this.