wa (waiting for I/O) from the top command is high
Solution 1:
Here are a few tools to find disk activity:
iotop
vmstat 1
iostat 1
lsof
strace -e trace=open,openat <application>
strace -e trace=open,openat -p <pid>
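For a quick first pass, iotop can be run in batch mode so that only processes actually doing I/O are shown (a minimal sketch; flags per the iotop man page):

# -o: only show processes/threads actually doing I/O
# -b: non-interactive batch mode, suitable for logging
# -P: show processes rather than individual threads
# -a: show accumulated I/O instead of current bandwidth
iotop -obPa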
In the output of ps auxf you'll also see which processes are in uninterruptible disk sleep (state D) because they are waiting for I/O.
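To catch those processes in the act, you can poll the process table for state D (a small sketch using standard ps options):

# Print state, PID, and command for every process, keeping only
# those in uninterruptible sleep (state starts with D)
for i in 1 2 3 4 5; do ps -eo state,pid,cmd | awk '$1 ~ /^D/'; sleep 1; done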
Some days the load increases to 40 without any increase in the number of visitors.
You may also want to create a backup and check whether the hard drive is slowly failing. A hard drive generally starts to slow down before it dies, which could also explain the high load.
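If you suspect the drive, SMART data is worth a look (assuming smartmontools is installed; replace /dev/sda with your actual device):

# Quick pass/fail health check
smartctl -H /dev/sda
# Full attribute dump; reallocated or pending sectors are bad signs
smartctl -a /dev/sda | grep -i -e reallocated -e pending -e uncorrect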
Solution 2:
The output from top suggests that the DBMS is experiencing most of the I/O waits, so database tuning issues are an obvious candidate to investigate.
I/O waiting on a database server - particularly on load spikes - is a clue that your DBMS might be either disk bound (i.e. you need a faster disk subsystem) or it might have a tuning issue. You should probably also look into profiling your database server - i.e. get a trace of what it's doing and what queries are taking the time.
Some starter points for diagnosing database tuning issues:
Find the queries that take up the most time and look at their query plans. See if any have an odd plan, such as a table scan where there shouldn't be one; maybe the database needs an index (see the first sketch after this list).
Long resource wait times may mean that some key resource pool needs to be expanded.
Long I/O wait times may mean that you need a faster disk subsystem.
Are your log and data volumes on separate drives? Database logs see a lot of small sequential writes (essentially they behave like a ring buffer). If a busy random-access workload shares the same disks as your logs, it will disproportionately hurt logging throughput; a quick way to check is the second sketch after this list. For a database transaction to commit, the log entries must be written out to disk, so this puts a bottleneck on the whole system.
Note that some MySQL storage engines don't use logs so this may not be an issue in your case.
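For MySQL, one way to find the queries that take the most time is the slow query log combined with EXPLAIN. A minimal sketch follows; the one-second threshold, the log path, and the table in the EXPLAIN are all illustrative:

# Enable the slow query log at runtime and log anything slower than 1 second
mysql -e "SET GLOBAL slow_query_log = 'ON'; SET GLOBAL long_query_time = 1;"
# Summarize the log once it has collected some traffic
# (by default it lives in the datadir as <hostname>-slow.log)
mysqldumpslow /var/lib/mysql/$(hostname)-slow.log
# Inspect the plan of a suspect query; type=ALL means a full table scan
mysql -e "EXPLAIN SELECT * FROM orders WHERE customer_id = 42;"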
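To check whether your data files and InnoDB logs share a device, you can ask MySQL where it keeps them and then map those paths to block devices (a sketch; the paths your server reports will differ):

# Where do the data files and the InnoDB log files live?
mysql -e "SHOW VARIABLES LIKE 'datadir'; SHOW VARIABLES LIKE 'innodb_log_group_home_dir';"
# Map each reported path to its filesystem and underlying device
df /var/lib/mysql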
Footnote: Queuing systems
Queuing systems (a statistical model for throughput) get hyperbolically slower as the system approaches saturation. As a rough approximation, on a system that is 50% saturated a request takes twice as long as it would on an idle one; at 90% saturation the factor is 10, and at 99% saturation it is 100.
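These figures follow from the standard M/M/1 queueing result (a simplification; a real disk subsystem is not exactly M/M/1): with utilization ρ and bare service time S, the mean time a request spends in the system is W = S / (1 − ρ), which gives 2S at ρ = 0.5, 10S at ρ = 0.9, and 100S at ρ = 0.99.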
Thus, on a system that is close to saturation, small changes in load can produce large changes in wait times, which in this case manifest as time spent waiting on I/O. If the I/O capacity of your disk subsystem is nearly saturated, a modest increase in load can cause a dramatic increase in response times.
Solution 3:
Run iotop, or atop -dD, to see which processes are doing I/O. Use strace if you need a closer look.
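If you do reach for strace, its summary mode is a gentle starting point (the PID is hypothetical; replace it with whatever iotop identified):

# Attach to PID 1234 and count syscalls and the time spent in each;
# press Ctrl-C to detach and print the summary table
strace -c -p 1234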