Top per group: Take(1) works but FirstOrDefault() doesn't?
Looking at:
http://msdn.microsoft.com/en-us/library/system.linq.enumerable.firstordefault
http://msdn.microsoft.com/en-us/library/bb503062.aspx
there's a very nice explanation of how Take works (lazy, early breaking) but none for FirstOrDefault. What's more, seeing the explanation of Take, I'd guesstimate that queries with Take might cut the number of rows in an attempt to emulate lazy evaluation in SQL, and your case indicates it's the other way around! I do not understand why you are observing such an effect.
It's probably just implementation-specific. To me, both Take(1) and FirstOrDefault might look like TOP 1; however, from a functional point of view, there may be a slight difference in their 'laziness': one function may evaluate all elements and return the first, the other may evaluate the first, return it, and stop evaluating. This is only a "hint" at what might have happened. To me it makes no sense, because I see no docs on this subject, and in general I'm sure that both Take and FirstOrDefault are lazy and should evaluate only the first N elements.
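To check the 'only the first N elements' part at least for plain LINQ to Objects, here is a minimal sketch. The LazinessDemo class and its Numbers helper are made up for illustration, and this says nothing about what the EF translator emits as SQL:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazinessDemo
{
    // Hypothetical helper: yields 1..count and reports every element it is asked for.
    public static IEnumerable<int> Numbers(int count, Action onPull)
    {
        for (int i = 1; i <= count; i++)
        {
            onPull();
            yield return i;
        }
    }

    // How many source elements does Take(1) pull when it is materialized?
    public static int PulledByTake()
    {
        int pulled = 0;
        Numbers(1000, () => pulled++).Take(1).ToList();
        return pulled;
    }

    // How many source elements does FirstOrDefault() pull?
    public static int PulledByFirstOrDefault()
    {
        int pulled = 0;
        Numbers(1000, () => pulled++).FirstOrDefault();
        return pulled;
    }
}
```

In memory, both methods stop after the first element, so any difference you observe has to come from the query translation, not from the operators themselves.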
In the first part of your query, the group.Select + OrderBy + TOP 1 is a "clear indication" that you are interested in the single row with the highest 'value' in a column per group. In fact, though, there is no simple way to declare that in SQL, so the indication is not that clear at all, neither for the SQL engine nor for the EF engine.
As for me, the behaviour you present could indicate that the FirstOrDefault was 'propagated' by the EF translator one layer of inner queries too far upwards, as if to the Articles.GroupBy() (are you sure you have not misplaced the parens after the OrderBy? :) ) - and that would be a bug.
But -
As the difference must be somewhere in the meaning and/or order of execution, let's see what EF can guess about the meaning of your query. How does the Author entity get its Articles? How does EF know which Article to bind to your author? Of course, via the nav property. But how does it happen that only some of the articles are preloaded? Seems simple: the query returns some results with some columns, the columns describe whole Authors and whole Articles, so let's map them to authors and articles and match them to each other via the nav keys. OK. But add the complex filtering to that..?
With a simple filter like by-date, it is a single subquery for all articles, rows are truncated by date, and all rows are consumed. But what about a complex query that uses several intermediate orderings and produces several subsets of articles? Which subset should be bound to the resulting Author? The union of all of them? That would nullify all top-level where-like clauses. The first of them? Nonsense, the first subqueries tend to be intermediary helpers. So, when a query is seen as a set of subqueries with similar structure that could all be taken as the data source for partial loading of a nav property, most probably only the last subquery is taken as the actual result.
This is all abstract thinking, but it made me notice that Take() versus FirstOrDefault, and their overall JOIN versus LEFT JOIN meaning, could in fact change the order in which the result set is scanned. Somehow, Take() was optimized and done in one scan over the whole result, thus visiting all of an author's articles at once, while FirstOrDefault was executed as a direct scan: for each author * for each title-group * select top one, check the count, and substitute null. That would have produced, many times over, small one-item collections of articles for each author, and thus resulted in one result, coming only from the last title-grouping visited.
This is the only explanation I can think of, apart from the obvious "BUG!" shout. As a LINQ user, for me, it still is a bug. Either such an optimization should not have taken place at all, or it should cover FirstOrDefault too - as it is the same as Take(1).DefaultIfEmpty(). Heh, by the way - have you tried that? As I said, Take(1) is not the same as FirstOrDefault due to the JOIN/LEFT JOIN meaning - but Take(1).DefaultIfEmpty() is actually semantically the same. It could be fun to see what SQL queries it produces and what results come out in the EF layers.
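To make the Take(1).DefaultIfEmpty() comparison concrete, here is a small in-memory sketch (the DefaultIfEmptyDemo class is hypothetical, and again this is LINQ to Objects, not EF). Strictly speaking, FirstOrDefault returns a single element while Take(1).DefaultIfEmpty() returns a one-item sequence; the point is that both always produce exactly one value, which is the LEFT JOIN-like behaviour, while a bare Take(1) may produce none:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DefaultIfEmptyDemo
{
    // Single element, or null when the source is empty.
    public static string ViaFirstOrDefault(IEnumerable<string> titles)
    {
        return titles.FirstOrDefault();
    }

    // Sequence with exactly one item: the first element, or null when the source is empty.
    public static IEnumerable<string> ViaTakeDefaultIfEmpty(IEnumerable<string> titles)
    {
        return titles.Take(1).DefaultIfEmpty();
    }

    // Sequence with zero or one items: an empty source stays empty (JOIN-like).
    public static IEnumerable<string> ViaTakeOnly(IEnumerable<string> titles)
    {
        return titles.Take(1);
    }
}
```

For an empty input, ViaTakeOnly yields nothing, whereas the other two still hand back one (null) value.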
I have to admit that the selection of related entities in partial loading was never clear to me, and I have actually not used partial loading for a long time, as I always state the queries so that the results and groupings are explicitly defined (*). Hence, I could simply have forgotten some key aspect/rule/definition of its inner workings, and maybe it actually selects every related record from the result set (not just the last subcollection, as I described above). If I have forgotten something, everything I just described would obviously be wrong.
(*) In your case, I'd make Article.AuthorID a nav property too (public Author Author { get; set; }), and then rewrite the query to be more flat/pipelined, like:
    var aths = db.Articles
        .GroupBy(ar => new { ar.Author, ar.Title })
        .Take(10)
        .Select(grp => new { grp.Key.Author, Arts = grp.OrderByDescending(ar => ar.Revision).Take(1) });
and then fill the View with pairs of Author and Arts separately, instead of trying to partially fill the author and use the author only. Btw., I've not tested it against EF and SQL Server; it is just an example of 'flipping the query upside down' and 'flattening' the subqueries in the case of JOINs. It is unusable for LEFT JOINs, so if you'd also like to show the authors without articles, it has to start from the Authors like your original query.
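If you want to sanity-check the 'top revision per (Author, Title)' shape without a database, a rough LINQ to Objects version could look like this. The Author and Article classes are my guess at your model, not your actual entities, and the in-memory result proves nothing about the SQL that EF would generate:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shapes; the real model is not shown here.
public class Author
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class Article
{
    public Author Author { get; set; }
    public string Title { get; set; }
    public int Revision { get; set; }
}

public static class TopPerGroupDemo
{
    // Returns the highest-revision article of each (Author, Title) group.
    public static List<Article> LatestRevisions(IEnumerable<Article> articles)
    {
        return articles
            .GroupBy(ar => new { ar.Author, ar.Title })
            .SelectMany(grp => grp.OrderByDescending(ar => ar.Revision).Take(1))
            .ToList();
    }
}
```

Here the groups come from the articles themselves, so no group is ever empty and Take(1) is enough; that is exactly why this form cannot show authors without articles.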
I hope these loose thoughts will help a bit in finding 'why'..
The FirstOrDefault() method executes immediately, while the other one (Take(int)) is deferred until the query is actually enumerated.
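A quick way to see the immediate versus deferred difference in plain LINQ to Objects (the DeferredDemo class is a made-up name, and this does not show how EF treats the two operators inside a larger query):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Returns how many source elements had been pulled after each step.
    public static int[] Run()
    {
        int pulled = 0;

        // Local iterator that counts every element it hands out.
        IEnumerable<int> Source()
        {
            foreach (var n in new[] { 10, 20, 30 })
            {
                pulled++;
                yield return n;
            }
        }

        var deferred = Source().Take(1);   // only builds the query, nothing runs yet
        int afterTake = pulled;

        Source().FirstOrDefault();         // runs right away and reads one element
        int afterFirstOrDefault = pulled;

        deferred.ToList();                 // Take(1) runs only now
        int afterToList = pulled;

        return new[] { afterTake, afterFirstOrDefault, afterToList };
    }
}
```

The counter stays at zero after Take(1) is called and only moves once the deferred query is materialized.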