Why is SQLAlchemy count() much slower than the raw query?
It took me a long time to find this as the solution to my problem. I was getting the following error:
sqlalchemy.exc.DatabaseError: (mysql.connector.errors.DatabaseError) 126 (HY000): Incorrect key file for table '/tmp/#sql_40ab_0.MYI'; try to repair it
The problem was resolved when I changed this:
query = session.query(rumorClass).filter(rumorClass.exchangeDataState == state)
return query.count()
to this:
query = session.query(func.count(rumorClass.id)).filter(rumorClass.exchangeDataState == state)
return query.scalar()
Unfortunately MySQL has terrible, terrible support of subqueries and this is affecting us in a very negative way. The SQLAlchemy docs point out that the "optimized" query can be achieved using query(func.count(Segment.id))
:
Return a count of rows this Query would return.
This generates the SQL for this Query as follows:
SELECT count(1) AS count_1 FROM ( SELECT <rest of query follows...> ) AS anon_1
For fine grained control over specific columns to count, to skip the usage of a subquery or otherwise control of the FROM clause, or to use other aggregate functions, use func expressions in conjunction with query(), i.e.:
from sqlalchemy import func # count User records, without # using a subquery. session.query(func.count(User.id)) # return count of user "id" grouped # by "name" session.query(func.count(User.id)).\ group_by(User.name) from sqlalchemy import distinct # count distinct "name" values session.query(func.count(distinct(User.name)))
The reason is that SQLAlchemy's count() is counting the results of a subquery which is still doing the full amount of work to retrieve the rows you are counting. This behavior is agnostic of the underlying database; it isn't a problem with MySQL.
The SQLAlchemy docs explain how to issue a count without a subquery by importing func
from sqlalchemy
.
session.query(func.count(User.id)).scalar()
>>>SELECT count(users.id) AS count_1 \nFROM users')