Storm max spout pending
One solution is to populate an input queue outside the nextTuple method, so that the only work nextTuple does is poll the queue and emit. If you are processing multiple files, nextTuple should also check whether the poll returned null and, if so, atomically reset the source file that is populating your queue.
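A minimal sketch of that pattern, assuming Storm's Java spout API (org.apache.storm packages; older releases used backtype.storm). The class name, queue capacity, and the file-reading thread are illustrative, not part of Storm itself:

```java
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.LinkedBlockingQueue;

// A background reader thread fills the queue; nextTuple only polls
// and emits, so Storm can throttle the spout one call at a time.
public class QueueBackedSpout extends BaseRichSpout {
    private transient LinkedBlockingQueue<String> queue;
    private transient SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new LinkedBlockingQueue<>(10_000);
        // start a reader thread here that populates `queue` from the current
        // source file and advances to the next file when one is exhausted (omitted)
    }

    @Override
    public void nextTuple() {
        String line = queue.poll();   // non-blocking; never block inside nextTuple
        if (line == null) {
            // queue drained: e.g. atomically reset/advance the source file
            return;
        }
        // emit with a message id so acks/fails (and max spout pending) apply
        collector.emit(new Values(line), UUID.randomUUID().toString());
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```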
Exactly! Storm can only throttle your spout through its calls to nextTuple, so if you emit everything on the first call, Storm has no way to throttle your spout.
The Storm developers recommend emitting a single tuple per nextTuple call; the framework will then throttle your spout as needed to respect the max spout pending limit. If you need to emit a large number of tuples, you can batch your emits so that each nextTuple call emits at most about a tenth of your max spout pending, giving Storm a chance to throttle.
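For example, a variant of nextTuple from the sketch above (only a sketch: `batchSize` is an assumed field, set to roughly maxSpoutPending / 10):

```java
@Override
public void nextTuple() {
    // Emit at most one small batch per call so Storm can still throttle.
    for (int i = 0; i < batchSize; i++) {
        String line = queue.poll();
        if (line == null) {
            break;  // queue drained; yield control back to Storm
        }
        collector.emit(new Values(line), UUID.randomUUID().toString());
    }
}
```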
Storm topologies have a max spout pending parameter, configured via the "topology.max.spout.pending" setting in the topology configuration YAML file. This value limits how many tuples can be in flight, i.e. emitted but not yet acked or failed, in a Storm topology at any point in time.

The need for this parameter comes from the fact that Storm uses ZeroMQ to dispatch tuples from one task to another. If the consumer side of ZeroMQ cannot keep up with the tuple rate, the ZeroMQ queue starts to build up. Eventually tuples time out at the spout and get replayed into the topology, adding even more pressure on the queues. To avoid this pathological failure case, Storm lets the user put a limit on the number of tuples in flight in the topology. Note that this limit takes effect per spout task, not per topology. (source)

For unreliable spouts, i.e. those that don't emit a message id with their tuples, this value has no effect.

One problem Storm users continually face is coming up with the right value for this max spout pending parameter. Too small a value can easily starve the topology, while too large a value can overload it with so many tuples that it causes failures and replays. Users have to go through several iterations of topology deployments with different max spout pending values to find the value that works best for them.
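If you build and submit the topology in Java rather than through the YAML file, the same limit can be set programmatically via Config.setMaxSpoutPending. A sketch, with the topology name, spout, and value chosen arbitrarily:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitWithMaxPending {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("lines", new QueueBackedSpout(), 1);
        // ... bolts omitted ...

        Config conf = new Config();
        // equivalent to topology.max.spout.pending in the YAML config;
        // caps un-acked tuples per spout *task*, not per topology
        conf.setMaxSpoutPending(1000);

        StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}
```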