Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1
Lustre Branch: master
Lustre Build: http://build.whamcloud.com/job/lustre-master/1486
Distro/Arch: RHEL6.4/x86_64
Test Group: failover
Severity: 3
Rank (Obsolete): 8208
Description
After running recovery-mds-scale test_failover_ost for 1.5 hours (OST failed over 6 times), client load on one of the clients failed as follows:
<snip>
tar: etc/mail/submit.cf: Cannot open: No space left on device
tar: etc/mail/trusted-users: Cannot open: No space left on device
tar: etc/mail/virtusertable: Cannot open: No space left on device
tar: etc/mail/access: Cannot open: No space left on device
tar: etc/mail/aliasesdb-stamp: Cannot open: No space left on device
tar: etc/gssapi_mech.conf: Cannot open: No space left on device
tar: Exiting with failure status due to previous errors
Console log on the client (client-32vm6) showed that:
19:40:31:INFO: task tar:2790 blocked for more than 120 seconds.
19:40:31:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
19:40:31:tar D 0000000000000000 0 2790 2788 0x00000080
19:40:31: ffff88004eb73a28 0000000000000082 ffff88004eb739d8 ffff88007c24fe50
19:40:31: 0000000000000286 0000000000000003 0000000000000001 0000000000000286
19:40:31: ffff88007bcb3ab8 ffff88004eb73fd8 000000000000fb88 ffff88007bcb3ab8
19:40:31:Call Trace:
19:40:31: [<ffffffffa03d775a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
19:40:31: [<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0
19:40:31: [<ffffffffa068517c>] ? ptlrpc_request_bufs_pack+0x5c/0x80 [ptlrpc]
19:40:31: [<ffffffffa069a770>] ? lustre_swab_ost_body+0x0/0x10 [ptlrpc]
19:40:31: [<ffffffff8150e683>] wait_for_common+0x123/0x180
19:40:31: [<ffffffff81063310>] ? default_wake_function+0x0/0x20
19:40:31: [<ffffffff8150e79d>] wait_for_completion+0x1d/0x20
19:40:31: [<ffffffffa08cbf6c>] osc_io_setattr_end+0xbc/0x190 [osc]
19:40:31: [<ffffffffa095cde0>] ? lov_io_end_wrapper+0x0/0x100 [lov]
19:40:31: [<ffffffffa055cf30>] cl_io_end+0x60/0x150 [obdclass]
19:40:31: [<ffffffffa055d7e0>] ? cl_io_start+0x0/0x140 [obdclass]
19:40:31: [<ffffffffa095ced1>] lov_io_end_wrapper+0xf1/0x100 [lov]
19:40:31: [<ffffffffa095c86e>] lov_io_call+0x8e/0x130 [lov]
19:40:31: [<ffffffffa095e3bc>] lov_io_end+0x4c/0xf0 [lov]
19:40:31: [<ffffffffa055cf30>] cl_io_end+0x60/0x150 [obdclass]
19:40:31: [<ffffffffa0561f92>] cl_io_loop+0xc2/0x1b0 [obdclass]
19:40:31: [<ffffffffa0a2aa08>] cl_setattr_ost+0x208/0x2c0 [lustre]
19:40:31: [<ffffffffa09f8b0e>] ll_setattr_raw+0x9ce/0x1000 [lustre]
19:40:31: [<ffffffffa09f919b>] ll_setattr+0x5b/0xf0 [lustre]
19:40:31: [<ffffffff8119e708>] notify_change+0x168/0x340
19:40:31: [<ffffffff811b284c>] utimes_common+0xdc/0x1b0
19:40:31: [<ffffffff811828d1>] ? __fput+0x1a1/0x210
19:40:31: [<ffffffff811b29fe>] do_utimes+0xde/0xf0
19:40:31: [<ffffffff811b2b12>] sys_utimensat+0x32/0x90
19:40:31: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
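To make the hung-task trace easier to read, the sketch below keeps only the frames that are not prefixed with "?" (in Linux kernel traces, "?" marks an address found on the stack that the unwinder could not confirm as part of the call chain). It is a minimal illustration, not part of the test suite: the names FRAME_RE and reliable_frames are hypothetical, the sample is a short excerpt of the console log in this report, and labelling frames without a module annotation as "kernel" is an assumption made here for display.

```python
import re

# Matches one frame such as
#   [<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0
#   [<ffffffffa03d775a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs]
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<guess>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def reliable_frames(trace):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for m in FRAME_RE.finditer(trace):
        if m.group("guess"):
            continue  # skip frames the kernel flagged as unverified
        # Assumption: a frame with no [module] suffix is core kernel code.
        frames.append((m.group("sym"), m.group("module") or "kernel"))
    return frames


# Excerpt of the console log from this report (innermost frames first).
sample = (
    "19:40:31: [<ffffffffa03d775a>] ? cfs_waitq_signal+0x1a/0x20 [libcfs] "
    "19:40:31: [<ffffffff8150ea05>] schedule_timeout+0x215/0x2e0 "
    "19:40:31: [<ffffffff8150e79d>] wait_for_completion+0x1d/0x20 "
    "19:40:31: [<ffffffffa08cbf6c>] osc_io_setattr_end+0xbc/0x190 [osc]"
)

for sym, module in reliable_frames(sample):
    print(f"{module:8} {sym}")
```

Applied to the full trace, this filtering shows the tar process sleeping in wait_for_completion, called from osc_io_setattr_end, on the utimensat path (sys_utimensat, ll_setattr, cl_setattr_ost).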
Maloo report: https://maloo.whamcloud.com/test_sets/053120d2-bb19-11e2-8824-52540035b04c