Python: Pandas Series - Why use loc?
Explicit is better than implicit.
df[boolean_mask]
selects rows whereboolean_mask
is True, but there is a corner case when you might not want it to: whendf
has boolean-valued column labels:In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df Out[229]: False True 0 3 1 1 4 2 2 5 3
You might want to use
df[[True]]
to select theTrue
column. Instead it raises aValueError
:In [230]: df[[True]] ValueError: Item wrong length 1 instead of 3.
Versus using
loc
:In [231]: df.loc[[True]] Out[231]: False True 0 3 1
In contrast, the following does not raise
ValueError
even though the structure ofdf2
is almost the same asdf1
above:In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2 Out[258]: A B 0 1 3 1 2 4 2 3 5 In [259]: df2[['B']] Out[259]: B 0 3 1 4 2 5
Thus,
df[boolean_mask]
does not always behave the same asdf.loc[boolean_mask]
. Even though this is arguably an unlikely use case, I would recommend always usingdf.loc[boolean_mask]
instead ofdf[boolean_mask]
because the meaning ofdf.loc
's syntax is explicit. Withdf.loc[indexer]
you know automatically thatdf.loc
is selecting rows. In contrast, it is not clear ifdf[indexer]
will select rows or columns (or raiseValueError
) without knowing details aboutindexer
anddf
.df.loc[row_indexer, column_index]
can select rows and columns.df[indexer]
can only select rows or columns depending on the type of values inindexer
and the type of column valuesdf
has (again, are they boolean?).In [237]: df2.loc[[True,False,True], 'B'] Out[237]: 0 3 2 5 Name: B, dtype: int64
When a slice is passed to
df.loc
the end-points are included in the range. When a slice is passed todf[...]
, the slice is interpreted as a half-open interval:In [239]: df2.loc[1:2] Out[239]: A B 1 2 4 2 3 5 In [271]: df2[1:2] Out[271]: A B 1 2 4
Performance Consideration on multiple columns "Chained Assignment" with and without using .loc
Let me supplement the already very good answers with the consideration of system performance.
The question itself includes a comparison on the system performance (execution time) of 2 pieces of codes with and without using .loc. The execution times are roughly the same for the code samples quoted. However, for some other code samples, there could be considerable difference on execution times with and without using .loc: e.g. several times difference or more!
A common case of pandas dataframe manipulation is we need to create a new column derived from values of an existing column. We may use the codes below to filter conditions (based on existing column) and set different values to the new column:
df[df['mark'] >= 50]['text_rating'] = 'Pass'
However, this kind of "Chained Assignment" does not work since it could create a "copy" instead of a "view" and assignment to the new column based on this "copy" will not update the original dataframe.
2 options available:
-
- We can either use .loc, or
-
- Code it another way without using .loc
2nd case e.g.:
df['text_rating'][df['mark'] >= 50] = 'Pass'
By placing the filtering at the last (after specifying the new column name), the assignment works well with the original dataframe updated.
The solution using .loc is as follows:
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
Now, let's see their execution time:
Without using .loc:
%%timeit
df['text_rating'][df['mark'] >= 50] = 'Pass'
2.01 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
With using .loc:
%%timeit
df.loc[df['mark'] >= 50, 'text_rating'] = 'Pass'
577 µs ± 5.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As we can see, with using .loc, the execution time is more than 3X times faster!
For a more detailed explanation of "Chained Assignment", you can refer to another related post How to deal with SettingWithCopyWarning in pandas? and in particular the answer of cs95. The post is excellent in explaining the functional differences of using .loc. I just supplement here the system performance (execution time) difference.
In addition to what has already been said (issues with having True, False as column name without using loc and ability to select rows and columns with loc and ability to do slicing for row and column selections), another big difference is that you can use loc to assign values to specific rows and columns. If you try to select a subset of the dataframe using boolean series and attempt to change a value of that subset selection you will likely get the SettingWithCopy warning.
Let's say you're trying to change the "upper management" column for all the rows whose salary is bigger than 60000.
This:
mask = df["salary"] > 60000
df[mask]["upper management"] = True
throws the warning that "A value is is trying to be set on a copy of a slice from a Dataframe" and won't work because df[mask] creates a copy and trying to update "upper management" of that copy has no effect on the original df.
But this succeeds:
mask = df["salary"] > 60000
df.loc[mask,"upper management"] = True
Note that in both cases you can do df[df["salary"] > 60000]
or df.loc[df["salary"] > 60000]
, but I think storing boolean condition in a variable first is cleaner.