Details
Type: Bug
Resolution: Won't Fix
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.5.0
Labels: None
Environment: lustre-master build #1687 tag 2.4.93; OpenSFS cluster with combined MGS/MDS, a single OSS with two OSTs, and four clients: one agent (c07), one running robinhood/db (c08), and two running only as Lustre clients (c09, c10)
Severity: 3
Description
There is an issue when running acceptance-small where the file system hangs during or right after the racer test; it will sometimes recover after several hours, and sometimes it has not returned after many (~6) hours. This is a long description, so here is an outline of the testing and results:
1. Run acceptance-small with normal Lustre configuration and HSM not enabled. acceptance-small completes.
2. Run sanity-hsm.sh with HSM enabled. sanity-hsm completes.
3. Run acceptance-small with HSM enabled. acceptance-small will not complete.
The problem of acceptance-small not completing is very reproducible; every time I run acceptance-small with HSM enabled, it hangs on some unmount/cleanup call in the tests.
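For reference, here is roughly how the three runs above are driven from the launch client (c10). The scripts and paths are the stock ones shipped in lustre/tests; the NAME value is from my configuration and is shown only for illustration:

cd /usr/lib64/lustre/tests
# Run 1: full acceptance-small with HSM left stopped
NAME=ncli sh acceptance-small.sh
# Run 2: just the HSM suite, with coordinator and copytool running
NAME=ncli sh sanity-hsm.sh
# Run 3: full acceptance-small again with HSM still enabled (this is the run that hangs)
NAME=ncli sh acceptance-small.sh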
Here are the details:
On a newly built Lustre file system, I ran the acceptance-small suite without starting HSM (I left hsm_control = stopped, didn't start the copytool, registered no agents, and didn't mount the archive), but I had all the agent variables set in my configuration file. acceptance-small completed, and when it came to sanity-hsm, the last suite in acceptance-small, HSM ran fine: some test failures, but all tests completed. Some of the failures here could be related to this mount/unmount problem. For example, lfsck fails with:
oss: removing objects from /dev/sda3 on ost1:
oss: mount: mount point /mnt/ost1 does not exist
pdsh@c10: oss: ssh exited with exit code 32
Test results are at https://maloo.whamcloud.com/test_sessions/eb9d697c-2658-11e3-8d26-52540035b04c.
Next, I ran sanity-hsm with HSM enabled (started the agent, started the copytool, registered the agent, enabled HSM, etc.), and the tests completed with some failures. Results are at https://maloo.whamcloud.com/test_sessions/15535878-2639-11e3-8d26-52540035b04c.
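For clarity, "HSM enabled" above means roughly the following steps. The lctl parameter names are the standard 2.4/2.5 HSM interface; the archive path and index are illustrative, not necessarily the exact values used here:

# On the MDS: start the HSM coordinator
lctl set_param mdt.scratch-MDT0000.hsm_control=enabled
# On the agent (c07): mount scratch, then start the POSIX copytool,
# which registers the agent with the coordinator
lhsmtool_posix --daemon --hsm-root /archive --archive=1 /lustre/scratch
# On the MDS: verify the agent shows up
lctl get_param mdt.scratch-MDT0000.hsm.agents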
Next, I started acceptance-small without changing any parameters, i.e. HSM was still enabled as above. Everything looked like it was running fine until I saw these messages on the console of the client that kicked off the tests (c10):
...
file_create: FILE=/lustre/scratch/racer/1 SIZE=66368
file_create: FILE=/lustre/scratch/racer/4 SIZE=155632
file_create: FILE=/lustre/scratch/racer/7 SIZE=36848
file_create: FILE=/lustre/scratch/racer/4 SIZE=198040
file_create: FILE=/lustre/scratch/racer/12 SIZE=70920
file_create: FILE=/lustre/scratch/racer/17 SIZE=163576
racer cleanup
racer cleanup
racer cleanup
racer cleanup
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
sleeping 5 sec ...
there should be NO racer processes:
root 7383 0.0 0.0 103244 880 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 7385 0.0 0.0 103244 880 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
there should be NO racer processes:
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 3805 0.0 0.0 103244 872 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 3807 0.0 0.0 103244 872 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
root 31318 0.0 0.0 103244 892 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 31320 0.0 0.0 103244 888 ? R 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 30362 0.0 0.0 103244 892 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
pid=16205 rc=0
there should be NO racer processes:
root 3805 0.0 0.0 103244 872 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 3807 0.0 0.0 103244 872 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
root 31318 0.0 0.0 103244 892 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 31320 0.0 0.0 103244 896 ? R 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
there should be NO racer processes:
root 30362 0.0 0.0 103244 892 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
root 30364 0.0 0.0 103244 892 ? S 15:59 0:00 grep -E file_create|dir_create|file_rm|file_rename|file_link|file_symlink|file_list|file_concat|file_exec
Filesystem 1K-blocks Used Available Use% Mounted on
mds@o2ib:/scratch 374928 346576 8576 98% /lustre/scratch2
We survived /usr/lib64/lustre/tests/racer/racer.sh for 300 seconds.
pid=16206 rc=0
Resetting fail_loc on all nodes...done.
PASS 1 (306s)
== racer test complete, duration 312 sec == 15:59:37 (1380149977)
Stopping clients: c07,c08,c09,c10 /lustre/scratch2 (opts:)
Stopping client c07 /lustre/scratch2 opts:
Stopping client c09 /lustre/scratch2 opts:
Stopping client c08 /lustre/scratch2 opts:
Stopping client c10 /lustre/scratch2 opts:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dd 24324 root 1w REG 3953,876890 31744 144115574404849344 /lustre/scratch2/racer/1
/lustre/scratch2 is still busy, wait one second
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dd 23480 root 1w REG 3953,876890 3895296 144115574438378493 /lustre/scratch/racer/13
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dd 9884 root 1w REG 3953,876890 21794816 144115574438363138 /lustre/scratch/racer/11
dd 30142 root 1w REG 3953,876890 13414400 144115574388043790 /lustre/scratch2/racer/3
/lustre/scratch2 is still busy, wait one second
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dd 11466 root 1w REG 3953,876890 3895296 144115574438378493 /lustre/scratch2/racer/13 (deleted)
dd 16589 root 1w REG 3953,876890 21794816 144115574438363138 /lustre/scratch/racer/11
dd 24466 root 1w REG 3953,876890 31744 144115574404849344 /lustre/scratch/racer/1
/lustre/scratch2 is still busy, wait one second
/lustre/scratch2 is still busy, wait one second
/lustre/scratch2 is still busy, wait one second
And the "still busy message" continued for several hours. After approximately 3.5 hours, the tests started running again. So, I let things run and went to bed. When I checked back in the morning, the testing was stuck in conf-sanity with the following on the c10 console:
oss: Are the lustre modules loaded?
oss: Check /etc/modprobe.conf and /proc/filesystems
pdsh@c10: oss: ssh exited with exit code 19
Start of /dev/sda3 on ost1 failed 19
pdsh@c10: mds: ssh exited with exit code 3
can't get osc.scratch-OST0000-osc-MDT0000.ost_server_uuid by list_param in 40 secs
Go with osc.scratch-OST0000-osc-MDT0000.ost_server_uuid directly
mds: rpc : @@@@@@ FAIL: can't put import for osc.scratch-OST0000-osc-MDT0000.ost_server_uuid into FULL state after 40 sec, have
mds: Trace dump:
mds: = /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
mds: = /usr/lib64/lustre/tests/test-framework.sh:4291:error()
mds: = /usr/lib64/lustre/tests/test-framework.sh:5374:_wait_import_state()
mds: = /usr/lib64/lustre/tests/test-framework.sh:5393:wait_import_state()
mds: = rpc.sh:20:main()
mds: Dumping lctl log to /tmp/test_logs/1380167996/rpc..*.1380168036.log
mds: Dumping logs only on local client.
pdsh@c10: mds: ssh exited with exit code 1
conf-sanity test_54a: @@@@@@ FAIL: import is not in FULL state
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4291:error()
= /usr/lib64/lustre/tests/test-framework.sh:5481:_wait_osc_import_state()
= /usr/lib64/lustre/tests/test-framework.sh:5496:wait_osc_import_state()
= /usr/lib64/lustre/tests/conf-sanity.sh:99:gen_config()
= /usr/lib64/lustre/tests/conf-sanity.sh:109:reformat_and_config()
= /usr/lib64/lustre/tests/conf-sanity.sh:3321:test_54a()
= /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
= /usr/lib64/lustre/tests/conf-sanity.sh:3323:main()
Dumping lctl log to /tmp/test_logs/2013-09-25/143504/conf-sanity.test_54a.*.1380168037.log
oss: Host key verification failed.
oss: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
mds: Host key verification failed.
oss: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
mds: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
mds: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
pdsh@c10: oss: ssh exited with exit code 255
pdsh@c10: mds: ssh exited with exit code 255
FAIL 54a (100s)
== conf-sanity test 54b: test llverfs and partial verify of filesystem == 21:00:38 (1380168038)
start mds service on mds
Starting mds1: /dev/sda3 /lustre/scratch/mdt0
mds: mount.lustre: according to /etc/mtab /dev/sda3 is already mounted on /lustre/scratch/mdt0
pdsh@c10: mds: ssh exited with exit code 17
Start of /dev/sda3 on mds1 failed 17
conf-sanity test_54b: @@@@@@ FAIL: MDT start failed
Trace dump:
= /usr/lib64/lustre/tests/test-framework.sh:4264:error_noexit()
= /usr/lib64/lustre/tests/test-framework.sh:4291:error()
= /usr/lib64/lustre/tests/conf-sanity.sh:231:setup()
= /usr/lib64/lustre/tests/conf-sanity.sh:3331:test_54b()
= /usr/lib64/lustre/tests/test-framework.sh:4530:run_one()
= /usr/lib64/lustre/tests/test-framework.sh:4563:run_one_logged()
= /usr/lib64/lustre/tests/test-framework.sh:4433:run_test()
= /usr/lib64/lustre/tests/conf-sanity.sh:3336:main()
Dumping lctl log to /tmp/test_logs/2013-09-25/143504/conf-sanity.test_54b.*.1380168039.log
oss: Host key verification failed.
oss: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
oss: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
mds: Host key verification failed.
pdsh@c10: oss: ssh exited with exit code 255
mds: rsync: connection unexpectedly closed (0 bytes received so far) [sender]
mds: rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
pdsh@c10: mds: ssh exited with exit code 255
FAIL 54b (2s)
== conf-sanity test 55: check lov_objid size == 21:00:40 (1380168040)
Stopping /lustre/scratch/mdt0 (opts:-f) on mds
pdsh@c10: mds: ssh exited with exit code 1
Permanent disk data:
Target: scratch:MDT0000
Index: 0
Lustre FS: scratch
Mount type: ldiskfs
Flags: 0x65 (MDT MGS first_time update )
Persistent mount opts: user_xattr,errors=remount-ro
Parameters: sys.timeout=20 lov.stripesize=1048576 lov.stripecount=0 mdt.identity_upcall=/usr/sbin/l_getidentity
device size = 20481MB
formatting backing filesystem ldiskfs on /dev/sda3
target name scratch:MDT0000
4k blocks 50000
options -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_itable_init,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L scratch:MDT0000 -I 512 -i 2048 -q -O dirdata,uninit_bg,^extents,dir_nlink,quota,huge_file,flex_bg -E lazy_itable_init,lazy_journal_init -F /dev/sda3 50000
Writing CONFIGS/mountdata
pdsh@c10: oss: ssh exited with exit code 1
pdsh@c10: oss: ssh exited with exit code 1
Permanent disk data:
Target: scratch:OST03ff
Index: 1023
Lustre FS: scratch
Mount type: ldiskfs
Flags: 0x62 (OST first_time update )
Persistent mount opts: errors=remount-ro
Parameters: mgsnode=192.168.2.105@o2ib sys.timeout=20
device size = 12292MB
formatting backing filesystem ldiskfs on /dev/sda3
target name scratch:OST03ff
4k blocks 50000
options -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F
mkfs_cmd = mke2fs -j -b 4096 -L scratch:OST03ff -I 256 -q -O extents,uninit_bg,dir_nlink,quota,huge_file,flex_bg -G 256 -E lazy_itable_init,resize=4290772992,lazy_journal_init -F /dev/sda3 50000
Writing CONFIGS/mountdata
start mds service on mds
Starting mds1: /dev/sda3 /lustre/scratch/mdt0
Started scratch-MDT0000
start ost1 service on oss
Starting ost1: /dev/sda3 /lustre/scratch/ost1
oss: mount.lustre: mount /dev/sda3 at /lustre/scratch/ost1 failed: No such device
oss: Are the lustre modules loaded?
oss: Check /etc/modprobe.conf and /proc/filesystems
pdsh@c10: oss: ssh exited with exit code 19
Start of /dev/sda3 on ost1 failed 19
mount scratch on /lustre/scratch.....
Starting client: c10: -o user_xattr,flock mds@o2ib:/scratch /lustre/scratch
At this time, only the MDT was mounted: no OSTs were mounted, and none of the nodes had the Lustre file system mounted.
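As a sketch, this state is easy to confirm from c10 with standard commands (node names per this cluster):

# List Lustre-type mounts on each node; only the MDS should report the MDT
pdsh -w mds,oss,c[07-10] mount -t lustre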
I started testing again with a clean 2.4.93 system, meaning I used loadjenkins to load a new OS and reformatted all storage. After everything was up and running, I enabled HSM on the MDS, registered the agent, and started the copytool, but I left the agent (c07) off of the remote client list in case it was having problems with the mounts/unmounts. Leaving the agent out didn't help anything, as you can see below. sanity, sanityn, sanity-benchmark, and metadata-updates all ran, but while racer was running or just after it completed, the following appeared on the c10 console:
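(The "remote client list" here is the test framework's RCLIENTS variable; the change amounted to something like the following in my test configuration file, with node names shown for illustration:)

# Before: agent c07 took part in the tests as a client
#RCLIENTS="c07 c08 c09"
# After: c07 only mounts scratch and runs the copytool
RCLIENTS="c08 c09"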
....
== racer test complete, duration 312 sec == 14:47:42 (1380232062)
Stopping clients: c08,c09,c10 /lustre/scratch2 (opts:)
Stopping client c10 /lustre/scratch2 opts:
Stopping client c08 /lustre/scratch2 opts:
Stopping client c09 /lustre/scratch2 opts:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
dd 1344 root 1w REG 3953,876890 11075584 144115473691245005 /lustre/scratch/racer/10
dd 10191 root 1w REG 3953,876890 171792384 144115473741559103 /lustre/scratch2/racer/19
dd 13174 root 1w REG 3953,876890 172275712 144115473691238078 /lustre/scratch/racer/19
cat 13200 root 1w REG 3953,876890 342456320 144115473724745995 /lustre/scratch2/racer/6
cat 13200 root 3r REG 3953,876890 172275712 144115473691238078 /lustre/scratch2/racer/19
cat 14311 root 1w REG 3953,876890 2856960 144115473691246648 /lustre/scratch/racer/4
cat 14311 root 3r REG 3953,876890 2857984 144115473775086376 /lustre/scratch/racer/12
dd 29283 root 1w REG 3953,876890 373760 144115473758289951 /lustre/scratch/racer/6
dd 29288 root 1w REG 3953,876890 373760 144115473758289951 /lustre/scratch2/racer/13
dd 32281 root 1w REG 3953,876890 4388864 144115473758324783 /lustre/scratch2/racer/7
/lustre/scratch2 is still busy, wait one second
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 1744 root 1w REG 3953,876890 1141760 144115473791881355 /lustre/scratch2/racer/12
cat 1744 root 3r REG 3953,876890 1142784 144115473741554890 /lustre/scratch2/racer/14
dd 2731 root 1w REG 3953,876890 40736768 144115473791888614 /lustre/scratch2/racer/4
dd 3692 root 1w REG 3953,876890 159873024 144115473724773982 /lustre/scratch/racer/15
cat 7919 root 1w REG 3953,876890 4327424 144115473791855853 /lustre/scratch2/racer/14
cat 7919 root 3r unknown 3953,876890 /lustre/scratch2/racer/8/4
dd 16792 root 1w REG 3953,876890 424960 144115473758289949 /lustre/scratch2/racer/16
dd 16795 root 1w REG 3953,876890 424960 144115473758289949 /lustre/scratch/racer/16
dd 16826 root 1w REG 3953,876890 934912 144115473775067150 /lustre/scratch2/racer/14
dd 16851 root 1w REG 3953,876890 934912 144115473775067150 /lustre/scratch/racer/14
cat 19347 root 3r REG 3953,876890 11075584 144115473691245005 /lustre/scratch/racer/16
cat 30055 root 1w REG 3953,876890 342456320 144115473724745995 /lustre/scratch/racer/1
cat 30055 root 3r REG 3953,876890 172275712 144115473691238078 /lustre/scratch/racer/4
/lustre/scratch2 is still busy, wait one second
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 9753 root 3r REG 3953,876890 4389888 144115473758324783 /lustre/scratch/racer/7
dd 19612 root 1w REG 3953,876890 2859008 144115473775086376 /lustre/scratch2/racer/12
dd 20439 root 1w unknown 3953,876890 /lustre/scratch/racer/8/1
dd 20860 root 1w REG 3953,876890 1143808 144115473741554890 /lustre/scratch/racer/5
dd 23548 root 1w REG 3953,876890 374784 144115473758289951 /lustre/scratch/racer/13
dd 27659 root 1w REG 3953,876890 4328448 144115473691239547 /lustre/scratch2/racer/6
/lustre/scratch2 is still busy, wait one second
/lustre/scratch2 is still busy, wait one second
/lustre/scratch2 is still busy, wait one second
...
The MDS and OSS have all their targets mounted, the agent has scratch mounted with the copytool running, and c08, c09, and c10 all have the Lustre file system mounted on /lustre/scratch and /lustre/scratch2. From the MDS dmesg:
LustreError: 25757:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) scratch-MDT0000: Cannot find agent for archive 0: rc = -11
LustreError: 25757:0:(mdt_hsm_cdt_agent.c:338:mdt_hsm_agent_send()) Skipped 179 previous similar messages
The agent did mount the Lustre file system, but it is not taking part in the testing, i.e. it is not on the remote client list.
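rc = -11 is -EAGAIN: the coordinator had requests but could not find a registered agent for archive 0 to hand them to. For reference, the coordinator's view can be checked on the MDS with the standard 2.4/2.5 HSM parameters:

# Agents currently registered with the coordinator
lctl get_param mdt.scratch-MDT0000.hsm.agents
# Queued/active HSM requests
lctl get_param mdt.scratch-MDT0000.hsm.actions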
On the agent, I see:
LustreError: 166-1: MGC192.168.2.105@o2ib: Connection to MGS (at 192.168.2.105@o2ib) was lost; in progress operations using this service will fail
LustreError: 11-0: scratch-OST0001-osc-ffff88082d9d7000: Communicating with 192.168.2.106@o2ib, operation obd_ping failed with -107.
Lustre: Evicted from MGS (at 192.168.2.105@o2ib) after server handle changed from 0xc3f51c4989c55d76 to 0xc3f51c4989e1bc9e
Lustre: MGC192.168.2.105@o2ib: Connection restored to MGS (at 192.168.2.105@o2ib)
On all the client nodes, in dmesg I see:
LustreError: 26096:0:(file.c:3068:ll_inode_revalidate_fini()) scratch: revalidate FID [0x200004283:0x9f09:0x0] error: rc = -116
Only on c10 do I see:
Lustre: DEBUG MARKER: == sanity-benchmark test bonnie: bonnie++ == 14:14:25 (1380230065)
Lustre: DEBUG MARKER: min OST has 19147140kB available, using 28720710kB file size
INFO: task bonnie++:19773 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++ D 0000000000000004 0 19773 19510 0x00000080
ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
Call Trace:
[<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
[<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
[<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
[<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
[<ffffffff81055f96>] ? enqueue_task+0x66/0x80
[<ffffffff8150ec23>] wait_for_common+0x123/0x180
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
[<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
[<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
[<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
[<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
[<ffffffff811b2177>] sys_sync+0x17/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task bonnie++:19773 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++ D 0000000000000004 0 19773 19510 0x00000080
ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
Call Trace:
[<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
[<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
[<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
[<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
[<ffffffff81055f96>] ? enqueue_task+0x66/0x80
[<ffffffff8150ec23>] wait_for_common+0x123/0x180
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
[<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
[<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
[<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
[<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
[<ffffffff811b2177>] sys_sync+0x17/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task bonnie++:19773 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++ D 0000000000000004 0 19773 19510 0x00000080
ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
Call Trace:
[<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
[<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
[<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
[<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
[<ffffffff81055f96>] ? enqueue_task+0x66/0x80
[<ffffffff8150ec23>] wait_for_common+0x123/0x180
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
[<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
[<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
[<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
[<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
[<ffffffff811b2177>] sys_sync+0x17/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task bonnie++:19773 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++ D 0000000000000004 0 19773 19510 0x00000080
ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
Call Trace:
[<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
[<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
[<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
[<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
[<ffffffff81055f96>] ? enqueue_task+0x66/0x80
[<ffffffff8150ec23>] wait_for_common+0x123/0x180
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
[<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
[<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
[<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
[<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
[<ffffffff811b2177>] sys_sync+0x17/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
INFO: task bonnie++:19773 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bonnie++ D 0000000000000004 0 19773 19510 0x00000080
ffff8807b77a7d38 0000000000000082 0000000000000000 ffffffff81277679
0000000000016700 ffff8800458d01f0 ffff8807b77a7cf8 ffffffff810573c5
ffff8807b797dab8 ffff8807b77a7fd8 000000000000fb88 ffff8807b797dab8
Call Trace:
[<ffffffff81277679>] ? cpumask_next_and+0x29/0x50
[<ffffffff810573c5>] ? select_idle_sibling+0x95/0x150
[<ffffffff8150efa5>] schedule_timeout+0x215/0x2e0
[<ffffffff81065d54>] ? enqueue_task_fair+0x64/0x100
[<ffffffff81055f96>] ? enqueue_task+0x66/0x80
[<ffffffff8150ec23>] wait_for_common+0x123/0x180
[<ffffffff81063410>] ? default_wake_function+0x0/0x20
[<ffffffff8150ed3d>] wait_for_completion+0x1d/0x20
[<ffffffff811ac055>] writeback_inodes_sb_nr+0x75/0xa0
[<ffffffff811ac0c3>] writeback_inodes_sb+0x43/0x50
[<ffffffff811b1f9e>] __sync_filesystem+0x3e/0x90
[<ffffffff811b20e8>] sync_filesystems+0xf8/0x130
[<ffffffff811b2177>] sys_sync+0x17/0x40
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
LNet: 19792:0:(debug.c:218:libcfs_debug_str2mask()) You are trying to use a numerical value for the mask - this will be deprecated in a future release.
Lustre: DEBUG MARKER: SKIP: sanity-benchmark test_iozone skipping SLOW test iozone ...
Lustre: DEBUG MARKER: == racer test 1: racer on clients: c08,c09,c10 DURATION=300 == 14:42:36 (1380231756)
LustreError: 30894:0:(mdc_locks.c:915:mdc_enqueue()) ldlm_cli_enqueue: -2
LustreError: 29189:0:(file.c:3068:ll_inode_revalidate_fini()) scratch: revalidate FID [0x200004282:0x3fd:0x0] error: rc = -116
While scratch2 is still busy, on the clients I see variations of:
[root@c09 ~]# lsof /lustre/scratch2
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
cat 9753 root 3r REG 3953,876890 23490560 144115473758324783 /lustre/scratch/racer/7
dd 19612 root 1w REG 3953,876890 21051392 144115473775086376 /lustre/scratch2/racer/12
dd 20860 root 1w REG 3953,876890 18997248 144115473741554890 /lustre/scratch/racer/5
dd 23548 root 1w REG 3953,876890 4968448 144115473758289951 /lustre/scratch/racer/13
dd 27659 root 1w REG 3953,876890 20217856 144115473691239547 /lustre/scratch2/racer/6
and
[root@c09 ~]# ps ax | grep racer
2328 pts/0 S+ 0:00 grep racer
9753 ? S 0:10 cat /lustre/scratch/racer/16
19612 ? D 0:08 dd if=/dev/zero of=/lustre/scratch2/racer/12 bs=1k count=90648
20860 ? S 0:08 dd if=/dev/zero of=/lustre/scratch/racer/5 bs=1k count=141568
23548 ? S 0:02 dd if=/dev/zero of=/lustre/scratch/racer/17 bs=1k count=180680
27659 ? S 0:08 dd if=/dev/zero of=/lustre/scratch2/racer/0 bs=1k count=62512
After five hours, the state of the nodes had not changed. I was able to upload partial logs for this last run at https://maloo.whamcloud.com/test_sessions/a4370948-2721-11e3-b741-52540035b04c.