Forcing EventProcessorHost to re-deliver failed Azure Event Hub eventData's to IEventProcessor.ProcessEvents method
TL;DR: The only reliable way to replay a failed batch of events to IEventProcessor.ProcessEventsAsync is to shut down the EventProcessorHost (EPH) immediately, either by calling eph.UnregisterEventProcessorAsync() or by terminating the process, depending on the situation. This lets other EPH instances acquire the lease for this partition and start from the previous checkpoint.
Before explaining this, I want to call out that this is a great question; it was indeed one of the toughest design choices we had to make for EPH. In my view, it was a trade-off between the usability/supportability of the EPH framework and technical correctness.
The ideal situation would have been: when the user code in IEventProcessorImpl.ProcessEventsAsync throws an Exception, the EPH library shouldn't catch it. It should have let the Exception crash the process, so that the crash dump clearly shows the call stack responsible. I still believe this is the most technically correct solution.
Current situation: The contract between the IEventProcessorImpl.ProcessEventsAsync API and EPH is:
- As long as EventData can be received from the Event Hubs service, EPH continues invoking the user callback (IEventProcessorImpl.ProcessEventsAsync) with the EventData, and if the user callback throws while being invoked, it notifies EventProcessorOptions.ExceptionReceived.
- User code inside IEventProcessorImpl.ProcessEventsAsync should handle all errors and incorporate retries as necessary. EPH doesn't set any timeout on this callback, to give users full control over processing time.
- If a specific event is the cause of trouble, mark the EventData with a special property, for example type=poison-event, and re-send it to the same Event Hub (include a pointer to the actual event by copying its EventData.Offset and SequenceNumber into the new EventData.ApplicationProperties), or forward it to a Service Bus queue, or store it elsewhere. Basically, identify the poison event and defer processing it.
- If you have handled all possible cases and are still running into Exceptions, catch them and shut down EPH or fail fast the process with this exception. When the EPH comes back up, it will start from where it left off.
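The handling order described in the list above (retry, then defer the poison event, then fail fast) can be sketched in a language-neutral way. This is only a toy model in Python, not the EPH API: the Event class, process_events, FatalProcessingError, the handle and defer callbacks, and the property names original-offset and original-sequence-number are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for an EventData: just the fields used below."""
    offset: str
    sequence_number: int
    body: bytes


class FatalProcessingError(Exception):
    """Raised when the handler gives up; the host is expected to shut down."""


def process_with_retries(event, handle, max_attempts=3):
    """Call handle(event) up to max_attempts times; return True on success."""
    for _ in range(max_attempts):
        try:
            handle(event)
            return True
        except Exception:
            continue  # a real handler would log and probably back off here
    return False


def poison_properties(event):
    """Build the marker properties that point back at the original event."""
    return {
        "type": "poison-event",
        "original-offset": event.offset,  # assumed property name
        "original-sequence-number": event.sequence_number,  # assumed property name
    }


def process_events(events, handle, defer, max_attempts=3):
    """Process a batch: retry each event, defer poison events, else fail fast."""
    for event in events:
        if process_with_retries(event, handle, max_attempts):
            continue
        try:
            # defer() stands for "re-send, forward to a queue, or store elsewhere".
            defer(poison_properties(event))
        except Exception as exc:
            # Nothing worked: surface the error so the caller can shut the host down.
            raise FatalProcessingError(
                f"giving up at sequence number {event.sequence_number}"
            ) from exc
```

In this sketch the caller of process_events is the one that would unregister the processor or terminate the process when FatalProcessingError escapes, so that a restart resumes from the last checkpoint.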
Why does checkpointing 'the old event' NOT work (read this to understand EPH in general):
Behind the scenes, EPH runs one pump per Event Hub consumer group partition receiver. The pump's job is to start the receiver from a given checkpoint (if present), create a dedicated instance of the IEventProcessor implementation, receive from the designated Event Hub partition starting at the Offset specified in the checkpoint (or, if no checkpoint is present, at the position given by EventProcessorOptions.InitialOffsetProvider), and eventually invoke IEventProcessorImpl.ProcessEventsAsync. The purpose of the checkpoint is to be able to reliably resume processing messages when the EPH process shuts down and ownership of the partition moves to another EPH instance. So the checkpoint is consumed only while starting the pump and is NOT read again once the pump has started.
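To make that last point concrete, here is a small toy model in Python of a pump that reads its checkpoint exactly once, at startup. It is not EPH code: CheckpointStore, PartitionPump, and their methods are hypothetical, and as a simplification the stored offset is treated as the index of the next event to read in an in-memory list.

```python
class CheckpointStore:
    """Hypothetical durable store: partition id -> saved offset."""

    def __init__(self):
        self._offsets = {}

    def read(self, partition_id):
        return self._offsets.get(partition_id)

    def write(self, partition_id, offset):
        self._offsets[partition_id] = offset


class PartitionPump:
    """Toy pump: consults the checkpoint only while starting up."""

    def __init__(self, partition_id, log, store, initial_offset=0):
        self.partition_id = partition_id
        self.log = log
        self.store = store
        saved = store.read(partition_id)  # the only read of the checkpoint
        self.position = initial_offset if saved is None else saved

    def receive(self, max_count):
        """Return the next batch and advance the in-memory position."""
        batch = self.log[self.position:self.position + max_count]
        self.position += len(batch)
        return batch

    def checkpoint(self, offset):
        """Persist an offset; this does not move the running receiver."""
        self.store.write(self.partition_id, offset)


log = ["e0", "e1", "e2", "e3", "e4", "e5"]
store = CheckpointStore()

pump = PartitionPump("0", log, store)
first = pump.receive(2)       # processed successfully
pump.checkpoint(2)            # checkpoint after the first batch
failed = pump.receive(2)      # suppose processing this batch fails
pump.checkpoint(2)            # re-checkpointing the old position...
after_failure = pump.receive(2)  # ...does not rewind the running pump

restarted = PartitionPump("0", log, store)  # a new pump takes over the partition
replayed = restarted.receive(2)             # starts from the stored checkpoint
```

In this model the running pump moves straight on to the next batch after the failure, while the restarted pump receives the failed batch again. That is the behavior the shutdown-based approach relies on.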
As I am writing this, EPH is at version 2.2.10.
more general reading on Event Hubs...