Middleware to build data-gathering and monitoring for a distributed system
Disclosure: I am a long-time DDS specialist/enthusiast and I work for one of the DDS vendors.
Good DDS implementations will provide you with what you are looking for. Collection of data and monitoring of nodes is a traditional use-case for DDS and should be its sweet spot. Interacting with nodes and tweaking them is possible as well, for example by using so-called content filters to send data to a particular node. This assumes that you have a means to uniquely identify each node in the system, for example by means of a string or integer ID.
Because of the hierarchical nature of the system and its sheer (potential) size, you will probably have to introduce some routing mechanisms to forward data between clusters. Some DDS implementations can provide generic services for that. Bridging to other technologies, like DBMS or web-interfaces, is often supported as well.
Especially if you have multicast at your disposal, discovery of all participants in the system can be done automatically and will require minimal configuration. This is not required though.
To me, it looks like your system is complicated enough to require customization. I do not believe that any solution will "fit the bill easily", especially if your system needs to be fault-tolerant and robust. Most of all, you need to be aware of your requirements. A few words about DDS in the context of the ones you have mentioned:
1000+ nodes publishing/offering continuous data
This is a big number, but should be possible, especially since you have the option to take advantage of the data-partitioning features supported by DDS.
Data needs to be reliably (in some way) and continuously gathered to one or more servers. This will likely be built on top of the middleware using some kind of explicit request/response to ask for lost data. If this could be handled automatically by the middleware this is of course a plus.
DDS supports a rich set of so-called Quality of Service (QoS) settings specifying how the infrastructure should treat that data it is distributing. These are name-value pairs set by the developer. Reliability and data-availability area among the supported QoS-es. This should take care of your requirement automatically.
More than one server/subscriber needs to be able to be connected to the same data producer/publisher and receive the same data
One-to-many or many-to-many distribution is a common use-case.
Data rate is max in the range of 10-20 per second per group
Adding up to a total maximum of 20,000 messages per second is doable, especially if data-flows are partitioned.
Messages sizes range from maybe ~100 bytes to 4-5 kbytes
As long as messages do not get excessively large, the number of messages is typically more limiting than the total amount of kbytes transported over the wire -- unless large messages are of very complicated structure.
Nodes range from embedded constrained systems to normal COTS Linux/Windows boxes
Some DDS implementations support a large range of OS/platform combinations, which can be mixed in a system.
Nodes generally use C/C++, servers and clients generally C++/C#
These are typically supported and can be mixed in a system.
Nodes should (preferable) not need to install additional SW or servers, i.e. one dedicated broker or extra service per node is expensive
Such options are available, but the need for extra services depends on the DDS implementation and the features you want to use.
Security will be message-based, i.e. no transport security needed
That certainly makes life easier for you -- but not so much for those who have to implement that protection at the message level. DDS Security is one of the newer standards in the DDS ecosystem that provides a comprehensive security model transparent to the application.
Seems ZeroMQ will fit the bill easily, with no central infrastructure to manage. Since your monitoring servers are fixed, it's really quite a simple problem to solve. This section in the 0MQ Guide may help:
http://zguide.zeromq.org/page:all#Distributed-Logging-and-Monitoring
You mention "reliability", but could you specify the actual set of failures you want to recover? If you are using TCP then the network is by definition "reliable" already.