How to handle a large set of data using Spring Data Repositories?
We have the classical consulting answer here: it depends. As the implementation of the method is store specific, we depend on the underlying store API. In case of JPA there's no chance to provide streaming access as ….getResultList()
returns a List
. Hence we also expose the List
to the client as especially JPA developers might be used to working with lists. So for JPA the only option is using the pagination API.
For a store like Neo4j we support the streaming access as the repositories return Iterable
on CRUD methods as well as on the execution of finder methods.
The implementation of findAll()
simply loads the entire list of all entities into memory. Its Iterable
return type doesn't imply that it implements some sort of database level cursor handling.
On the other hand your custom myQuery(Pageable)
method will only load one page worth of entities, because the generated implementation honours its Pageable
parameter. You can declare its return type either as Page
or List
. In the latter case you still receive the same (restricted) number of entities, but not the metadata that a Page
would additionally carry.
So you basically did the right thing to avoid loading all entities into memory in your custom query.
Please review the related documentation here.
I think what you are looking for is Spring Data JPA Stream. It brings a significant performance boost to data fetching particularly in databases with millions of record. In your case you have several options which you can consider
- Pull all data once in memory
- Use pagination and read pages each time
- Use something like Apache Spark
- Streaming data using Spring Data JPA
In order to make Spring Data JPA Stream to work, we need to modify our MyRepository
to return Stream<MyEntity>
like this:
public interface MyRepository extends PagingAndSortingRepository<MyEntity, Integer> {
@QueryHints(value = {
@QueryHint(name = HINT_CACHEABLE, value = "false"),
@QueryHint(name = READ_ONLY, value = "true")
})
@Query(value="SELECT * ...")
Stream<MyEntity> myQuery();
}
In this example, we disable second level caching and hint Hibernate that the entities will be read only. If your requirement is different, make sure to change those settings accordingly for your requirements.