What's the best practice for centralised logging?
One word of caution: at 100+ apps in a big shop, with hundreds, perhaps thousands, of hosts running those apps, steer clear of anything that induces tight coupling. This pretty much rules out connecting directly to SQL Server or any other database, because your application logging would then depend on the availability of the log repository.
Availability of the central repository is a little more complicated than just 'if you can't connect, don't log it', because the most interesting events usually occur when there are problems, not when things go smoothly. If your logging drops entries exactly when things turn interesting, it will never be trusted to solve incidents, and as such it will fail to gain traction and support from other stakeholders (i.e. the application owners).
If you decide that you can implement retention and retry of failed log delivery on your own, you are facing an uphill battle: it is not a trivial task and is much more complex than it sounds, starting with efficient and reliable storage of the retained information and ending with good retry and intelligent fallback logic.
You must also have an answer to the problems of authentication and security. Large orgs have multiple domains with various trust relations, employees venture in via VPN or DirectAccess from home, some applications run unattended, some services are configured to run as local users, some machines are not joined to the domain, etc. You had better have an answer to the question of how the logging module of each application, everywhere it is deployed, is going to authenticate with the central repository (and which situations are going to be unsupported).
Ideally you would use an out-of-the-box delivery mechanism for your logging module. MSMQ is probably the most appropriate fit: robust, asynchronous, reliable delivery (at least to the extent of most use cases), available on every Windows host once it is installed (it is an optional component). That is the major pain point: your applications will take a dependency on a non-default OS component.
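To give a feel for what "retention and retry" involves, here is a minimal store-and-forward sketch. It is an illustration under assumptions, not a production design: it is written in Python rather than .NET, the `SpoolingLogger` class and the `send` callback are hypothetical names, and a local SQLite file stands in for whatever durable spool you would actually use. Even this toy version has to decide on ordering, backoff, and when it is safe to delete an entry.

```python
import json
import sqlite3
import time
from typing import Callable


class SpoolingLogger:
    """Persist every entry locally first, forward it later (hypothetical helper)."""

    def __init__(self, spool_path: str, send: Callable[[dict], None]) -> None:
        # `send` is an assumed delivery function; it must raise on failure.
        self._send = send
        self._db = sqlite3.connect(spool_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS spool ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self._db.commit()

    def log(self, entry: dict) -> None:
        # Writing to the local spool never depends on the central repository.
        self._db.execute(
            "INSERT INTO spool (payload) VALUES (?)", (json.dumps(entry),)
        )
        self._db.commit()

    def pending(self) -> int:
        return self._db.execute("SELECT COUNT(*) FROM spool").fetchone()[0]

    def flush(self, max_attempts: int = 3, base_delay: float = 0.5) -> int:
        """Forward spooled entries oldest-first; return how many were delivered."""
        delivered = 0
        rows = self._db.execute(
            "SELECT id, payload FROM spool ORDER BY id"
        ).fetchall()
        for row_id, payload in rows:
            for attempt in range(max_attempts):
                try:
                    self._send(json.loads(payload))
                except Exception:
                    # Exponential backoff before the next attempt.
                    time.sleep(base_delay * (2 ** attempt))
                else:
                    break
            else:
                # Every attempt failed: keep this and later entries for next time.
                return delivered
            # Delete only after a successful send (at-least-once delivery).
            self._db.execute("DELETE FROM spool WHERE id = ?", (row_id,))
            self._db.commit()
            delivered += 1
        return delivered
```

Note what is still missing: spool size limits, duplicate detection on the receiving side, and concurrent writers, which is exactly why an existing delivery mechanism is attractive.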
The central repository storage has to be able to deliver the information requested by its different audiences, perhaps:
- the application developers investigating incidents
- customer support team investigating a lost transaction reported by a customer complaint
- the security org doing forensics
- the business managers demanding statistics, trends and aggregated info (BI).
The only storage capable of delivering this for any serious org (size, lifetime) is a relational engine, so probably SQL Server. Doing analysis over text files is really not going to go the distance.
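As a rough sketch of why a relational store fits these audiences, the example below uses Python's built-in `sqlite3` purely as a stand-in for SQL Server. The `log_entry` table, its columns, the sample rows, and both helper functions are hypothetical. The point is that one table answers both the support question ("what happened to this transaction?") and the BI question ("how many errors per application?").

```python
import sqlite3

# sqlite3 stands in for SQL Server here; schema and names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE log_entry (
        id             INTEGER PRIMARY KEY,
        logged_at      TEXT NOT NULL,  -- ISO 8601 UTC timestamp
        application    TEXT NOT NULL,
        host           TEXT NOT NULL,
        severity       TEXT NOT NULL,
        transaction_id TEXT,
        message        TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX ix_log_txn ON log_entry (transaction_id)")

# Made-up sample rows, for illustration only.
rows = [
    ("2024-01-01T10:00:00Z", "billing", "host-01", "ERROR", "txn-42", "payment gateway timeout"),
    ("2024-01-01T10:00:05Z", "billing", "host-02", "INFO", "txn-43", "payment accepted"),
    ("2024-01-01T10:01:00Z", "orders", "host-07", "ERROR", "txn-42", "order rolled back"),
]
conn.executemany(
    "INSERT INTO log_entry "
    "(logged_at, application, host, severity, transaction_id, message) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    rows,
)


def trace_transaction(txn_id):
    """Support view: every entry for one transaction, across applications."""
    return conn.execute(
        "SELECT logged_at, application, message FROM log_entry "
        "WHERE transaction_id = ? ORDER BY logged_at",
        (txn_id,),
    ).fetchall()


def errors_per_application():
    """BI view: error counts aggregated by application."""
    return dict(
        conn.execute(
            "SELECT application, COUNT(*) FROM log_entry "
            "WHERE severity = 'ERROR' GROUP BY application"
        ).fetchall()
    )
```

Doing the same two lookups over text files scattered across hosts means parsing and joining by hand every time.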
So I would recommend a messaging-based log transport/delivery (MSMQ) and a relational central repository (SQL Server), perhaps with an analytical component on top of it (Analysis Services Data Mining). As you can see, this is clearly no small feat, and it covers slightly more than just configuring log4net.
As for what to log, you say you have already given it some thought, but I'd like to chime in with my extra 2c: often, especially during incident investigation, you will want the ability to request extra information. This means you would like to know the content of certain files on the incident machine, or some registry keys, or some performance counter values, or a full process dump. It is very useful to be able to request this information from the central repository interface, but it is impractical to always collect it just in case it is needed. This implies there has to be some sort of bidirectional communication between the application and the central repository: when the application reports an incident, it can be asked to add extra information (e.g. a dump of the process at fault). There has to be a lot of infrastructure in place for something like this to occur, from the protocol between the application logging and the central repository, to the ability of the central repository to recognize a repeat of an incident, to the capacity of the logging library to collect the extra information required, and not least the ability of an operator to mark incidents as needing extra information on the next occurrence.
I understand this answer probably seems like overkill at the moment, but I was involved with this problem space for quite a while. I looked at many online crash reports from Dr. Watson back in the day when I was with MS, and I can tell you that these requirements exist, they are valid concerns, and when implemented the solution helps tremendously. Ultimately, you can't fix what you cannot measure. A large organisation depends on good management and monitoring of its application stock, including logging and auditing.
There are some third-party vendors that offer solutions, some even integrated with log4net, like bugcollect.com (full disclosure: that's my own company), Error Traffic Controller, Exceptioneer and others.
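The moving parts of that bidirectional exchange can be sketched in a few lines. This is only an in-memory illustration under assumptions: the `Repository` class, the `fingerprint` scheme (application, exception type, top stack frame), and the `collectors` mapping are all hypothetical, and a real system would put a network protocol between the two sides.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(application: str, exception_type: str, top_frame: str) -> str:
    """Stable id so that repeats of the same incident can be recognised."""
    raw = "|".join((application, exception_type, top_frame))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


@dataclass
class Repository:
    """In-memory stand-in for the central repository (hypothetical)."""

    counts: dict = field(default_factory=dict)  # fingerprint -> occurrences
    wanted: dict = field(default_factory=dict)  # fingerprint -> items requested
    extras: dict = field(default_factory=dict)  # fingerprint -> collected data

    def request_extra(self, fp: str, items: list) -> None:
        # An operator marks the incident as needing more data next time.
        self.wanted[fp] = list(items)

    def report(self, fp: str) -> list:
        # Record the occurrence and answer with whatever is still wanted.
        self.counts[fp] = self.counts.get(fp, 0) + 1
        return list(self.wanted.get(fp, []))

    def attach(self, fp: str, extra: dict) -> None:
        # Store the follow-up data and clear the outstanding request.
        self.extras[fp] = extra
        self.wanted.pop(fp, None)


def report_incident(repo: Repository, fp: str, collectors: dict) -> dict:
    """Application side: report, then collect only what was asked for."""
    requested = repo.report(fp)
    extra = {name: collectors[name]() for name in requested if name in collectors}
    if requested:
        repo.attach(fp, extra)
    return extra
```

For example, an operator could call `repo.request_extra(fp, ["process_dump"])` after the first report; the next occurrence with the same fingerprint then triggers the `process_dump` collector, and nothing expensive is gathered before that.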
Logstash + Elasticsearch + Kibana + Redis or RabbitMQ + NLog or Log4net
Storage + Search & Analytics: Elasticsearch
Collecting & Parsing: Logstash
Visualize: Kibana
Queue & Buffer: Redis
In Application: NLog
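A pipeline like this is easiest to feed when the application emits structured events, for instance one JSON object per line. The sketch below illustrates that idea only: it uses Python's standard `logging` module instead of NLog, the field names are assumptions, and shipping the lines to Redis or RabbitMQ is left out.

```python
import io
import json
import logging


class JsonLineFormatter(logging.Formatter):
    """Render each log record as a single JSON object (field names assumed)."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event)


def build_logger(name: str, stream) -> logging.Logger:
    """Attach a JSON-lines handler writing to `stream` (file, socket wrapper, ...)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonLineFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # keep events out of the root logger's handlers
    return logger


# Usage: write one event into an in-memory buffer.
buffer = io.StringIO()
app_log = build_logger("billing", buffer)
app_log.info("payment accepted for %s", "txn-43")
```

Each line in `buffer` is then a self-describing event that the collecting stage can parse without custom patterns.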
The 1024 byte Syslog message length limit mentioned so far is misleading and incorrectly biases against Syslog-based solutions to the problem.
The limit for the obsolete "BSD Syslog Protocol" is indeed 1024 bytes.
The BSD syslog Protocol - 4.1 syslog Message Parts
The limit for the modern "Syslog Protocol" is implementation-dependent but MUST be at least 480 bytes, SHOULD be at least 2048 bytes, and MAY be even higher.
The Syslog Protocol - 6.1. Message Length
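The thresholds above can be written down as a small check. This is just a sketch of the rules as quoted: the function names are hypothetical, and the RFC numbers in the comments (RFC 3164 for the BSD syslog Protocol, RFC 5424 for the Syslog Protocol) are added here for orientation.

```python
BSD_SYSLOG_MAX = 1024         # obsolete BSD syslog Protocol (RFC 3164)
SYSLOG_MUST_ACCEPT = 480      # Syslog Protocol (RFC 5424), section 6.1
SYSLOG_SHOULD_ACCEPT = 2048   # Syslog Protocol (RFC 5424), section 6.1


def fits_bsd_syslog(message: bytes) -> bool:
    """True if the message fits within the old 1024-byte limit."""
    return len(message) <= BSD_SYSLOG_MAX


def receiver_obligation(message: bytes) -> str:
    """How strongly a modern receiver is required to accept this message."""
    size = len(message)
    if size <= SYSLOG_MUST_ACCEPT:
        return "MUST accept"
    if size <= SYSLOG_SHOULD_ACCEPT:
        return "SHOULD accept"
    # Larger messages are allowed; whether they survive depends on the receiver.
    return "implementation-dependent"
```

So a 1500-byte message, which the old protocol would not carry, falls into the "SHOULD accept" band of the modern one.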
As an example, Rsyslog's configuration setting is called MaxMessageSize, which the documentation suggests can be set at least as high as 64 KB.
rsyslog - Configuration Directives
That the asker's organisation is "a Microsoft house" where "UNIX solutions are no good" should not prevent less discriminatory readers from getting accurate information.