soanen - Martti's SOA Blog: Troubleshoot performance part 4 - top to pinpoint the bottleneck process

Top for pinpointing the bottleneck process

Top can be used to pinpoint the exact process eating CPU resources. By default, top sorts processes by CPU usage, so a process hogging the CPU will appear at the top of the list.

Top differs from most other commands: they produce their output and exit, whereas top displays its results on the screen and keeps refreshing them with new information until you stop it by pressing q or Control-C.

There is also a command-line option -b that runs top in batch mode. In batch mode you can use -n to set how many iterations to run. This is handy if, for example, you want a test script that runs stability tests: use batch mode to write top's results to a file, then inspect the results to quickly spot memory leaks and similar problems.
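
As a rough sketch of the memory-leak check, the helper below pulls the RES column of one process out of a batch-mode log, one value per iteration. The function name res_of_pid and the file name top.log are made up for illustration, and the column position assumes the default field layout shown in this post.

```shell
# Collect a log first, for example 60 snapshots taken 10 seconds apart:
#   top -b -n 60 -d 10 > top.log

# Print the RES value of one PID for every snapshot read from stdin.
# Assumes the default columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
res_of_pid() {
    awk -v pid="$1" '$1 == pid { print $6 }'
}

# Usage: res_of_pid 2382 < top.log
```

If the printed values keep growing from one snapshot to the next, the process is worth a closer look.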

Below is an example of top in batch mode:

[root@soaserver ~]# top -b -n 1
top - 12:48:20 up 79 days, 16:09, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.1%sy, 0.0%ni, 98.8%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 2621440k total, 2544616k used, 76824k free, 223376k buffers
Swap: 2104504k total, 131176k used, 1973328k free, 1974716k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1 root 15 0 10348 684 572 S 0.0 0.0 1:17.69 init
2 root RT -5 0 0 0 S 0.0 0.0 0:04.96 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.30 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0.0 0.0 0:00.04 events/0
6 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
9 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenbus
20 root RT -5 0 0 0 S 0.0 0.0 0:03.27 migration/1
21 root 34 19 0 0 0 S 0.0 0.0 0:00.34 ksoftirqd/1
22 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
23 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 events/1
26 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kblockd/0
27 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kblockd/1
28 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/0
29 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/1
33 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
35 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
102 root 15 0 0 0 0 S 0.0 0.0 0:00.00 khungtaskd
105 root 10 -5 0 0 0 S 0.0 0.0 0:29.50 kswapd0
106 root 12 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
107 root 12 -5 0 0 0 S 0.0 0.0 0:00.00 aio/1
225 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenfb thread
243 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
270 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 ata/0
271 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 ata/1
272 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 ata_aux
282 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kstriped
298 root 10 -5 0 0 0 S 0.0 0.0 0:01.87 kjournald
331 root 10 -5 0 0 0 S 0.0 0.0 0:03.82 kauditd
364 root 11 -4 14040 2240 488 S 0.0 0.1 0:00.04 udevd
847 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
848 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kmpathd/1
849 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kmpath_handlerd
916 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kjournald
918 root 10 -5 0 0 0 S 0.0 0.0 0:04.80 kjournald
920 root 10 -5 0 0 0 S 0.0 0.0 0:03.26 kjournald
1247 root 11 -4 27324 828 584 S 0.0 0.0 0:00.44 auditd
1249 root 7 -8 81800 772 616 S 0.0 0.0 0:00.19 audispd
1271 root 15 0 5908 608 488 S 0.0 0.0 0:00.13 syslogd
1274 root 15 0 3804 428 344 S 0.0 0.0 0:00.00 klogd
1313 root 18 0 10760 372 244 S 0.0 0.0 0:00.50 irqbalance
1334 rpc 15 0 8052 572 452 S 0.0 0.0 0:00.00 portmap
1365 root 12 -5 0 0 0 S 0.0 0.0 0:00.00 rpciod/0
1366 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 rpciod/1
1375 root 17 0 10160 796 656 S 0.0 0.0 0:00.00 rpc.statd
1409 root 19 0 55180 768 304 S 0.0 0.0 0:00.00 rpc.idmapd
1434 dbus 18 0 31500 1100 832 S 0.0 0.0 0:00.00 dbus-daemon
1500 haldaemo 15 0 30520 3524 1520 S 0.0 0.1 0:00.11 hald
1501 root 18 0 21692 1032 868 S 0.0 0.0 0:00.00 hald-runner
1536 root 20 0 119m 1520 1116 S 0.0 0.1 0:00.32 automount
1587 root 15 0 62608 1212 656 S 0.0 0.0 0:00.00 sshd
1626 root 17 0 21644 880 668 S 0.0 0.0 0:00.00 xinetd
1643 ntp 15 0 23388 5028 3904 S 0.0 0.2 0:00.07 ntpd
1657 root 15 0 74808 1220 644 S 0.0 0.0 0:00.52 crond
1691 xfs 18 0 20828 1636 704 S 0.0 0.1 0:00.00 xfs
1753 oracle 15 0 81248 12m 9416 S 0.0 0.5 0:01.94 tnslsnr
1897 root 17 0 3792 484 412 S 0.0 0.0 0:00.00 mingetty
1898 root 16 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1899 root 16 0 3792 484 412 S 0.0 0.0 0:00.00 mingetty
1900 root 16 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1901 root 16 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1906 root 20 0 3792 480 412 S 0.0 0.0 0:00.00 mingetty
1913 root 18 0 3800 540 464 S 0.0 0.0 0:00.00 agetty
2048 root 15 0 0 0 0 S 0.0 0.0 0:00.30 pdflush
2059 root 10 -5 0 0 0 S 0.0 0.0 0:01.24 kjournald
2096 root 15 0 0 0 0 S 0.0 0.0 0:00.02 pdflush
2382 oracle 15 0 1257m 391m 373m S 0.0 15.3 0:03.40 oracle
8025 oracle 18 0 1235m 59m 57m S 0.0 2.3 0:00.16 oracle
8032 root 15 0 90112 3384 2608 S 0.0 0.1 0:00.03 sshd
8034 root 15 0 66060 1528 1144 S 0.0 0.1 0:00.00 bash
8195 oracle 15 0 1236m 27m 24m S 0.0 1.1 0:00.04 oracle
8197 oracle 15 0 1236m 32m 29m S 0.0 1.3 0:00.05 oracle
8199 oracle 15 0 1235m 14m 13m S 0.0 0.6 0:00.02 oracle
8200 root 15 0 12604 948 708 R 0.0 0.0 0:00.00 top
20456 oracle 15 0 1237m 18m 16m S 0.0 0.7 0:01.98 oracle
20458 oracle -2 0 1235m 15m 13m S 0.0 0.6 0:00.05 oracle
20462 oracle 15 0 1235m 15m 13m S 0.0 0.6 0:00.10 oracle
20464 oracle 18 0 1235m 15m 13m S 0.0 0.6 0:00.18 oracle
20466 oracle 15 0 1235m 123m 121m S 0.0 4.8 0:00.40 oracle
20468 oracle 15 0 1235m 15m 13m S 0.0 0.6 0:32.55 oracle
20470 oracle 18 0 1235m 19m 17m S 0.0 0.8 0:03.46 oracle
20472 oracle 15 0 1235m 34m 33m S 0.0 1.4 0:00.20 oracle
20474 oracle 15 0 1263m 194m 169m S 0.0 7.6 0:30.04 oracle
20476 oracle 15 0 1251m 37m 35m S 0.0 1.5 15:30.81 oracle
20478 oracle 16 0 1235m 26m 24m S 0.0 1.0 0:08.56 oracle
20480 oracle 15 0 1245m 417m 412m S 0.0 16.3 2:49.99 oracle
20482 oracle 15 0 1236m 124m 121m S 0.0 4.9 0:00.39 oracle
20484 oracle 15 0 1241m 397m 391m S 0.0 15.5 0:39.48 oracle
20486 oracle 15 0 1235m 64m 62m S 0.0 2.5 0:01.28 oracle
20488 oracle 18 0 1241m 15m 13m S 0.0 0.6 0:00.07 oracle
20490 oracle 18 0 1236m 14m 12m S 0.0 0.6 0:00.08 oracle
20528 oracle 15 0 1235m 17m 15m S 0.0 0.7 0:00.33 oracle
20540 oracle 15 0 1240m 451m 443m S 0.0 17.6 10:28.91 oracle
20552 oracle 15 0 1240m 288m 280m S 0.0 11.3 0:02.58 oracle
20608 oracle 15 0 1235m 16m 14m S 0.0 0.6 0:01.02 oracle

Let's go through the data reported by top.

First line:
top - 12:48:20 up 79 days, 16:09, 1 user, load average: 0.00, 0.00, 0.00
The first line tells the current time (12:48:20), that the system has been up for 79 days, 16 hours and 9 minutes, that only one user is logged on, and that the system is idle. The three numbers are the load averages for the last 1, 5 and 15 minutes. The uptime command gives the same report as the first line of the top command.
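
If a script needs the three load averages as plain numbers, something like the sketch below can extract them from that first line. The function name load_averages is hypothetical; it only assumes the "load average: a, b, c" wording shown above, which uptime prints as well.

```shell
# Print the 1, 5 and 15 minute load averages found on stdin, separated by spaces.
load_averages() {
    awk -F'load average: ' 'NF > 1 {
        split($2, a, ", ")      # the three values are separated by ", "
        print a[1], a[2], a[3]
    }'
}

# Usage: uptime | load_averages
#        top -b -n 1 | load_averages
```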

The second line tells the number of processes (tasks) in the system and their states:
Tasks: 97 total, 1 running, 96 sleeping, 0 stopped, 0 zombie

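The third line tells the CPU utilization. The system is almost idle, with 98.8% of CPU time available (id). The other fields are us (user), sy (system), ni (niced processes), wa (waiting for IO), hi and si (hardware and software interrupts) and st (time stolen by the hypervisor). On a multi-CPU system this line is a summary over all CPUs; in interactive mode, pressing 1 shows a separate line for each CPU.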
Cpu(s): 0.7%us, 0.1%sy, 0.0%ni, 98.8%id, 0.3%wa, 0.0%hi, 0.0%si, 0.0%st
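
For scripting, the busy percentage can be derived as 100 minus the idle value. The sketch below assumes the older "98.8%id" format shown in this post (other top versions format this line differently), and the function name cpu_busy is made up for illustration.

```shell
# Print the CPU busy percentage (100 - idle) for each Cpu(s) line on stdin.
# Assumes the "NN.N%id" field format used in the example output above.
cpu_busy() {
    awk '/^Cpu\(s\)/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /%id,?$/) {
                sub(/%id,?$/, "", $i)       # strip the suffix, keep the number
                printf "%.1f\n", 100 - $i
            }
        }
    }'
}

# Usage: top -b -n 1 | cpu_busy
```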

The following two lines report memory and swap usage:
Mem: 2621440k total, 2544616k used, 76824k free, 223376k buffers
Swap: 2104504k total, 131176k used, 1973328k free, 1974716k cached

There is about 2.5 GB of main memory (2621440k). The free figure shows only about 75 MB (76824k). It is common to be alarmed when the amount of free memory looks this low, but Linux uses otherwise free memory as IO cache and releases it when applications need it. The memory really available is roughly free + buffers + cached (the cached figure is printed at the end of the Swap line but refers to main memory), so in my example about 2.2 GB, that is most of the memory, is effectively free. For more information see: http://serverfault.com/questions/377617/how-to-interpret-output-from-linux-top-command.
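
The free + buffers + cached arithmetic can be sketched as a small filter. It assumes the older Mem:/Swap: layout shown above, with values in kilobytes; the function name mem_available_kb is hypothetical.

```shell
# Sum free + buffers (Mem: line) and cached (Swap: line), in kilobytes.
# Assumes the summary layout of the example output in this post.
mem_available_kb() {
    awk '
    /^Mem:/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^free/)    free = $(i - 1)    # "76824k" converts to 76824
            if ($i ~ /^buffers/) buf  = $(i - 1)
        }
    }
    /^Swap:/ {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^cached/) cached = $(i - 1)
    }
    END { print free + buf + cached }'
}

# Usage: top -b -n 1 | mem_available_kb
```

For the example output this gives 76824 + 223376 + 1974716 = 2274916 kilobytes.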

The lines that follow in the top output describe the individual processes running on the system.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20587 root 15 0 10860 1060 772 R 0.3 0.0 0:00.03 top
1 root 15 0 10348 696 584 S 0.0 0.0 0:00.82 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.17 migration/0

For details on the fields (and more on the top command), see:

http://linux.die.net/man/1/top

The fields used are:
PID = process id
USER = user who owns the process
PR = priority of the process (RT means real-time priority)
NI = nice value; higher values indicate lower priority. You can set the nice value when starting a process with the nice command and change it for a running process with renice
VIRT = virtual memory used by this process
RES = resident memory, that is the physical memory (not swapped out) used by this process
SHR = shared memory used by this process
S = status of the task, which can be one of: 'D' = uninterruptible sleep, 'R' = running, 'S' = sleeping, 'T' = traced or stopped, 'Z' = zombie
%CPU = percentage of CPU time used by this process since the last refresh. On a multi-CPU system the value is by default relative to a single CPU, so a multi-threaded process can show more than 100%
%MEM = percentage of physical memory used by this process
TIME+ = total CPU time used by this process since it started
COMMAND = the command that was used to start this process
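
As an illustration of using these fields in a script, the sketch below lists the processes in a given state (for example D or Z) from batch-mode output. The function name procs_in_state is made up, and the column positions assume the default field layout described above.

```shell
# Print "PID COMMAND" for every process whose S column equals the given state.
# Assumes the default columns, where S is field 8 and COMMAND starts at field 12.
procs_in_state() {
    awk -v state="$1" '$1 ~ /^[0-9]+$/ && $8 == state {
        cmd = $12
        for (i = 13; i <= NF; i++) cmd = cmd " " $i   # command names may contain spaces
        print $1, cmd
    }'
}

# Usage: top -b -n 1 | procs_in_state D
```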

Some formatting and display options of top
If you run top in interactive mode, pressing the uppercase M key sorts the output by memory usage. (Note that using lowercase m will turn the memory summary lines on or off at the top of the display.) This is very useful when you want to find out who is consuming the memory.
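
In batch mode there is no keyboard to press, so a similar view can be approximated by sorting the output on the %MEM column afterwards. This is only a sketch: the function name top_by_mem is hypothetical, and it assumes %MEM is the tenth column as in the example output.

```shell
# Print the N process lines (default 5) with the highest %MEM from stdin.
top_by_mem() {
    awk '$1 ~ /^[0-9]+$/' |            # keep only process lines (numeric PID)
        LC_ALL=C sort -k10,10nr |      # sort on %MEM, largest first
        head -n "${1:-5}"
}

# Usage: top -b -n 1 | top_by_mem 10
```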

The most useful command-line option is -d, which sets the delay in seconds between screen refreshes. To refresh every second, use top -d 1.

The other useful option is -p. If you want to monitor only a few processes, not all, you can specify only those after the -p option. To monitor processes 13609, 13608 and 13554, issue:

top -p 13609 -p 13608 -p 13554
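
The -p option also accepts a comma-separated list of PIDs, with a limit of 20 PIDs. A small helper can build such a list from the output of another command; the name pid_list is made up for illustration, and the pgrep call in the usage line assumes the processes run as the user oracle.

```shell
# Join the PIDs read from stdin (one per line) with commas, keeping at most 20.
pid_list() {
    head -n 20 | paste -s -d, -
}

# Usage: top -p "$(pgrep -u oracle | pid_list)"
```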

Tip for Oracle database

If the process that is causing either CPU or IO load is an Oracle database process, you can use the following handy query to find out which database session is the cause:

select s.sid, s.username, s.program
from v$session s, v$process p
where p.spid = <process id from top command>
and p.addr = s.paddr
/

This tip is from this good article (the actual tip is in the middle of the article):

http://www.oracle.com/technetwork/articles/linux/part2-085179.html


Wednesday, 27 February 2013
