[LU-3230] conf-sanity fails to start run: umount of OST fails Created: 25/Apr/13 Updated: 10/Jun/14 Resolved: 21/Feb/14 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0, Lustre 2.4.2, Lustre 2.5.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.5.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mn4, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7893 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite runs:

The sub-test conf-sanity failed with the following error:

Info required for matching: conf-sanity conf-sanity |
| Comments |
| Comment by Niu Yawei (Inactive) [ 26/Apr/13 ] |
|
https://maloo.whamcloud.com/test_sets/2ad1712a-ae62-11e2-a8d0-52540035b04c |
| Comment by Chris Gearing (Inactive) [ 26/Apr/13 ] |
|
Niu: Apparently you believe this is a test environment issue; could you possibly indicate why? I'm open to the idea, and your thoughts will help us in our work. |
| Comment by Andreas Dilger [ 26/Apr/13 ] |
|
It seems this problem is only being hit intermittently, though it appears more common on review-zfs tests than on regular review tests. I don't know whether it is a test environment problem or not, but there isn't any data in the Maloo results to diagnose what the problem really is. |
| Comment by Niu Yawei (Inactive) [ 27/Apr/13 ] |
|
Chris, Maloo shows "Failure Rate: 100.00% of last 100 executions [all branches]", and there isn't any log in the Maloo results, so I think we should keep you in the loop to see whether there is a problem in the testing environment. I'm not sure what the problem really is. |
| Comment by Jian Yu [ 27/Apr/13 ] |
|
Another instance occurred while testing patch http://review.whamcloud.com/6154: there were no console logs on Maloo. After looking into the console log file of client-14vm2 on the brent node, I found:

Lustre: DEBUG MARKER: -----============= acceptance-small: conf-sanity ============----- Fri Apr 26 20:27:00 PDT 2013^M
Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 32newtarball 59 64^M
Lustre: DEBUG MARKER: excepting tests: 32newtarball 59 64^M
Lustre: DEBUG MARKER: /usr/sbin/lctl mark skipping tests SLOW=no: 30a 31 45^M
Lustre: DEBUG MARKER: skipping tests SLOW=no: 30a 31 45^M
Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);^M
if [ $running -ne 0 ] ; then^M
echo Stopping client $(hostname) /mnt/lustre opts:;^M
lsof /mnt/lustre || need_kill=no;^M
if [ x != x -a x$need_kill != xno ]; then^M
pids=$(lsof -t /mnt/lustre | sort -u);^M
if ^M
LustreError: 4077:0:(osc_cache.c:2381:osc_teardown_async_page()) extent ffff88006c4317c8@{[0 -> 127/255], [2|0|-|cache|wi|ffff88006c8953c8], [524288|128|+|-|ffff88005e101978|256|(null)]} trunc at 0.^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) page@ffff88007a7e6600[2 ffff88007ac06a50:0 ^(null)_ffff8800597fce00 4 0 1 (null) (null) 0x0]^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) page@ffff8800597fce00[2 ffff88007c9a5c48:0 ^ffff88007a7e6600_(null) 4 0 1 (null) (null) 0x0]^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) vvp-page@ffff88007a7e66c0(0:0:0) vm@ffffea0000f5be88 20000000000075 3:0 0 0 lru^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) lov-page@ffff88007a7e6710^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) osc-page@ffff8800597fcee8: 1< 0x845fed 258 0 + - > 2< 0 0 4096 0x0 0x520 | (null) ffff88006e6bc7c0 ffff88006c8953c8 > 3< + ffff88007a141540 0 0 0 > 4< 0 0 8 0 - | - - + - > 5< - - + - | 0 - | 128 - ->^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) end page@ffff88007a7e6600^M
LustreError: 4077:0:(osc_page.c:430:osc_page_delete()) Trying to teardown failed: -16^M
LustreError: 4077:0:(osc_page.c:431:osc_page_delete()) ASSERTION( 0 ) failed: ^M
LustreError: 4077:0:(osc_page.c:431:osc_page_delete()) LBUG^M
Pid: 4077, comm: umount^M
^M
Call Trace:^M
[<ffffffffa0b74895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]^M
[<ffffffffa0b74e97>] lbug_with_loc+0x47/0xb0 [libcfs]^M
[<ffffffffa0f7fbf1>] osc_page_delete+0x311/0x320 [osc]^M
[<ffffffffa0c4feb5>] cl_page_delete0+0xc5/0x4e0 [obdclass]^M
[<ffffffffa0c50312>] cl_page_delete+0x42/0x120 [obdclass]^M
[<ffffffffa10a556d>] ll_invalidatepage+0x8d/0x160 [lustre]^M
[<ffffffff811304e5>] do_invalidatepage+0x25/0x30^M
[<ffffffff81130802>] truncate_inode_page+0xa2/0xc0^M
[<ffffffff81130baf>] truncate_inode_pages_range+0x16f/0x500^M
[<ffffffff81130fd5>] truncate_inode_pages+0x15/0x20^M
[<ffffffff8119d188>] dispose_list+0xe8/0x120^M
[<ffffffff8119d58a>] invalidate_inodes+0xea/0x190^M
[<ffffffff8118333c>] generic_shutdown_super+0x4c/0xe0^M
[<ffffffff81183436>] kill_anon_super+0x16/0x60^M
[<ffffffffa0c35b2a>] lustre_kill_super+0x4a/0x60 [obdclass]^M
[<ffffffff81183bd7>] deactivate_super+0x57/0x80^M
[<ffffffff811a1c4f>] mntput_no_expire+0xbf/0x110^M
[<ffffffff811a26bb>] sys_umount+0x7b/0x3a0^M
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b^M
^M
Kernel panic - not syncing: LBUG^M
Pid: 4077, comm: umount Not tainted 2.6.32-358.2.1.el6.x86_64 #1^M
Call Trace:^M
[<ffffffff8150d248>] ? panic+0xa7/0x16f^M
[<ffffffffa0b74eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]^M
[<ffffffffa0f7fbf1>] ? osc_page_delete+0x311/0x320 [osc]^M
[<ffffffffa0c4feb5>] ? cl_page_delete0+0xc5/0x4e0 [obdclass]^M
[<ffffffffa0c50312>] ? cl_page_delete+0x42/0x120 [obdclass]^M
[<ffffffffa10a556d>] ? ll_invalidatepage+0x8d/0x160 [lustre]^M
[<ffffffff811304e5>] ? do_invalidatepage+0x25/0x30^M
[<ffffffff81130802>] ? truncate_inode_page+0xa2/0xc0^M
[<ffffffff81130baf>] ? truncate_inode_pages_range+0x16f/0x500^M
[<ffffffff81130fd5>] ? truncate_inode_pages+0x15/0x20^M
[<ffffffff8119d188>] ? dispose_list+0xe8/0x120^M
[<ffffffff8119d58a>] ? invalidate_inodes+0xea/0x190^M
[<ffffffff8118333c>] ? generic_shutdown_super+0x4c/0xe0^M
[<ffffffff81183436>] ? kill_anon_super+0x16/0x60^M
[<ffffffffa0c35b2a>] ? lustre_kill_super+0x4a/0x60 [obdclass]^M
[<ffffffff81183bd7>] ? deactivate_super+0x57/0x80^M
[<ffffffff811a1c4f>] ? mntput_no_expire+0xbf/0x110^M
[<ffffffff811a26bb>] ? sys_umount+0x7b/0x3a0^M
[<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b^M
|
| Comment by Niu Yawei (Inactive) [ 28/Apr/13 ] |
|
Thanks, Yujian. This failure is caused by the patch. I'll update the patch. |
| Comment by Peter Jones [ 28/Apr/13 ] |
|
Dropping priority because these failures were seen during patch testing |
| Comment by Nathaniel Clark [ 31/May/13 ] |
|
https://maloo.whamcloud.com/test_sets/84ae39b2-c942-11e2-97fe-52540035b04c Pulled console logs from Rosso directly: Lustre: DEBUG MARKER: /usr/sbin/lctl mark -----============= acceptance-small: conf-sanity ============----- Wed May 29 21:46:54 PDT 2013 Lustre: DEBUG MARKER: -----============= acceptance-small: conf-sanity ============----- Wed May 29 21:46:54 PDT 2013 Lustre: DEBUG MARKER: /usr/sbin/lctl mark excepting tests: 32newtarball 59 64 57b 50h Lustre: DEBUG MARKER: excepting tests: 32newtarball 59 64 57b 50h Lustre: DEBUG MARKER: /usr/sbin/lctl mark skipping tests SLOW=no: 30a 31 45 Lustre: DEBUG MARKER: skipping tests SLOW=no: 30a 31 45 Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts Lustre: DEBUG MARKER: umount -d -f /mnt/ost1 Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 3. Is it stuck? Lustre: server umount lustre-OST0000 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' Lustre: DEBUG MARKER: grep -c /mnt/ost2' ' /proc/mounts Lustre: DEBUG MARKER: umount -d -f /mnt/ost2 Lustre: server umount lustre-OST0001 complete Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST ' Lustre: DEBUG MARKER: grep -c /mnt/ost3' ' /proc/mounts Lustre: DEBUG MARKER: umount -d -f /mnt/ost3 Lustre: lustre-OST0002 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? Lustre: 5845:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369889304/real 1369889304] req@ffff8800633a8000 x1436410772180168/t0(0) o38->lustre-MDT0000-lwp-OST0004@10.10.17.6@tcp:12/10 lens 400/544 e 0 to 1 dl 1369889329 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Lustre: 5845:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 39 previous similar messages Lustre: lustre-OST0002 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck? Lustre: lustre-OST0002 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck? Lustre: lustre-OST0002 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck? INFO: task umount:21142 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 21142 21141 0x00000080 ffff88006bb17aa8 0000000000000086 ffffffff00000010 ffff88006bb17a58 ffff88006bb17a18 ffff88004943e800 ffffffffa078c44c 0000000000000000 ffff880067569098 ffff88006bb17fd8 000000000000fb88 ffff880067569098 Call Trace: [<ffffffff8150ed82>] schedule_timeout+0x192/0x2e0 [<ffffffff810810e0>] ? process_timeout+0x0/0x10 [<ffffffffa05d56bd>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs] [<ffffffffa070c848>] obd_exports_barrier+0x98/0x170 [obdclass] [<ffffffffa0e44a72>] ofd_device_fini+0x42/0x230 [ofd] [<ffffffffa0739b57>] class_cleanup+0x577/0xda0 [obdclass] [<ffffffffa070eae6>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa073b43c>] class_process_config+0x10bc/0x1c80 [obdclass] [<ffffffffa0734c63>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] [<ffffffffa073c179>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa070eae6>] ? 
class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0770e7c>] server_put_super+0x5bc/0xf00 [obdclass] [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff81183436>] kill_anon_super+0x16/0x60 [<ffffffffa073dfd6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff81183bd7>] deactivate_super+0x57/0x80 [<ffffffff811a1c4f>] mntput_no_expire+0xbf/0x110 [<ffffffff811a26bb>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Lustre: lustre-OST0002 is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck? INFO: task umount:21142 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D 0000000000000000 0 21142 21141 0x00000080 ffff88006bb17aa8 0000000000000086 ffffffff00000010 ffff88006bb17a58 ffff88006bb17a18 ffff88004943e800 ffffffffa078c44c 0000000000000000 ffff880067569098 ffff88006bb17fd8 000000000000fb88 ffff880067569098 Call Trace: [<ffffffff8150ed82>] schedule_timeout+0x192/0x2e0 [<ffffffff810810e0>] ? process_timeout+0x0/0x10 [<ffffffffa05d56bd>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs] [<ffffffffa070c848>] obd_exports_barrier+0x98/0x170 [obdclass] [<ffffffffa0e44a72>] ofd_device_fini+0x42/0x230 [ofd] [<ffffffffa0739b57>] class_cleanup+0x577/0xda0 [obdclass] [<ffffffffa070eae6>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa073b43c>] class_process_config+0x10bc/0x1c80 [obdclass] [<ffffffffa0734c63>] ? lustre_cfg_new+0x353/0x7e0 [obdclass] [<ffffffffa073c179>] class_manual_cleanup+0x179/0x6f0 [obdclass] [<ffffffffa070eae6>] ? class_name2dev+0x56/0xe0 [obdclass] [<ffffffffa0770e7c>] server_put_super+0x5bc/0xf00 [obdclass] [<ffffffff8118334b>] generic_shutdown_super+0x5b/0xe0 [<ffffffff81183436>] kill_anon_super+0x16/0x60 [<ffffffffa073dfd6>] lustre_kill_super+0x36/0x60 [obdclass] [<ffffffff81183bd7>] deactivate_super+0x57/0x80 [<ffffffff811a1c4f>] mntput_no_expire+0xbf/0x110 [<ffffffff811a26bb>] sys_umount+0x7b/0x3a0 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b Lustre: lustre-OST0002 is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? Lustre: 5845:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1369889919/real 1369889919] req@ffff88006eacd000 x1436410772180668/t0(0) o38->lustre-MDT0000-lwp-OST0003@10.10.17.6@tcp:12/10 lens 400/544 e 0 to 1 dl 1369889945 |
| Comment by Nathaniel Clark [ 31/May/13 ] |
|
This bug seems related to |
| Comment by Keith Mannthey (Inactive) [ 03/Jun/13 ] |
|
https://maloo.whamcloud.com/test_sets/e980c046-caee-11e2-95b5-52540035b04c Still an issue as of the 1st. conf-sanity has no logs and reports 100/100 failure rate. This was on zfs. |
| Comment by Nathaniel Clark [ 05/Jun/13 ] |
|
https://maloo.whamcloud.com/test_sets/86686766-cd5f-11e2-a1e0-52540035b04c

04:38:20:Lustre: DEBUG MARKER: umount -d /mnt/ost4
04:38:20:Lustre: Failing over lustre-OST0003
04:38:20:Lustre: lustre-OST0003: Not available for connect from 10.10.16.173@tcp (stopping)
04:38:20:Lustre: Skipped 1 previous similar message
04:38:20:Lustre: lustre-OST0003 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?

trimmed OST debug log:

(genops.c:1581:obd_exports_barrier()) lustre-OST0003 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
(genops.c:1541:print_export_data()) lustre-OST0003: UNLINKED ffff88003890ac00 lustre-MDT0000-mdtlov_UUID 10.10.16.170@tcp 1 (0 0 0) 1 0 1 0: (null) 0 |
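A minimal helper sketch (not part of the test suite) for capturing the OSS debug log as soon as the "waiting for obd_unlinked_exports" console message appears, so that print_export_data() records like the ones above survive long enough to be collected. The lctl set_param debug/debug_mb tunables and lctl dk are standard; the polling loop, buffer size and dump file name are assumptions.

#!/bin/bash
# Enable all debug masks and grow the Lustre debug ring buffer on the OSS.
lctl set_param debug=-1
lctl set_param debug_mb=512

DUMP=/tmp/oss-debug-$(date +%s).log

# Poll the kernel log and dump the Lustre debug buffer on the first hit.
while ! dmesg | grep -q 'waiting for obd_unlinked_exports'; do
    sleep 5
done
lctl dk "$DUMP"
echo "Lustre debug log saved to $DUMP"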
| Comment by Bruno Faccini (Inactive) [ 02/Jul/13 ] |
|
Just got an other occurence of such hang/loop in obd_exports_barrier() on OSS causing auto-test at https://maloo.whamcloud.com/test_sets/67d6eaea-e28e-11e2-b3c9-52540035b04c to timeout. Looks like an export waits for ever to be unlinked from obd_unlinked_exports object's list. BTW, looking in the source it's not obvious for me where this should happen. Any advice ? Here a some logs/traces extracts from the failing test. Test log : == replay-single test 90: lfs find identifies the missing striped file segments == 08:45:48 (1372693548) Create the files CMD: wtm-13vm4 lctl get_param -n obdfilter.lustre-OST0004.uuid Fail ost5 lustre-OST0004_UUID, display the list of affected files CMD: wtm-13vm4 grep -c /mnt/ost5' ' /proc/mounts Stopping /mnt/ost5 (opts:) on wtm-13vm4 CMD: wtm-13vm4 umount -d /mnt/ost5 <<<---- Stuck here OSS debug_log : 00000020:00080000:0.0:1372693555.591347:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: ACTIVE ffff8800720a9800 lustre-MDT0000-mdtlov_UUID 10.10.16.143@tcp 3 (0 0 0) 0 0 0 0: (null) 0 00000020:00080000:0.0:1372693555.591350:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: ACTIVE ffff880068d27400 480e6903-ca7b-ddb8-25aa-8c792a44efa7 10.10.16.145@tcp 3 (0 0 0) 0 0 0 0: (null) 0 00000020:00080000:0.0:1372693555.591352:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: ACTIVE ffff88004462d800 87ca4efc-b558-d054-8613-d9adf4621433 10.10.16.146@tcp 5 (0 0 2) 0 0 0 0: (null) 0 00000020:00080000:0.0:1372693555.591354:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: ACTIVE ffff88006ab4d800 lustre-OST0004_UUID (no nid) 1 (0 0 0) 0 0 0 0: (null) 0 00000020:00080000:0.0:1372693555.591356:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: UNLINKED ffff88002030d800 lustre-MDT0000-mdtlov_UUID 10.10.16.143@tcp 1 (0 0 0) 1 0 1 0: (null) 0 00000020:00080000:0.0:1372693555.591357:0:9638:0:(genops.c:1308:class_disconnect_exports()) OBD device 15 (ffff88006aa343f8) has exports, disconnecting them 00000020:00080000:0.0:1372693555.591359:0:9638:0:(genops.c:1283:class_disconnect_export_list()) lustre-OST0004: disconnecting export at 10.10.16.143@tcp (ffff8800720a9800), last request at 1372693545 00000100:00080000:0.0:1372693555.591366:0:9638:0:(import.c:1512:ptlrpc_cleanup_imp()) ffff88001973f000 : changing import state from FULL to CLOSED 00000020:00080000:0.0:1372693555.591377:0:9638:0:(genops.c:1288:class_disconnect_export_list()) disconnected export at 10.10.16.143@tcp (ffff8800720a9800): rc 0 00000020:00080000:0.0:1372693555.591379:0:9638:0:(genops.c:1283:class_disconnect_export_list()) lustre-OST0004: disconnecting export at 10.10.16.145@tcp (ffff880068d27400), last request at 1372693549 00000100:00080000:0.0:1372693555.591381:0:9638:0:(import.c:1512:ptlrpc_cleanup_imp()) ffff8800190a4000 : changing import state from FULL to CLOSED 00000020:00080000:0.0:1372693555.591386:0:9638:0:(genops.c:1288:class_disconnect_export_list()) disconnected export at 10.10.16.145@tcp (ffff880068d27400): rc 0 00000020:00080000:0.0:1372693555.591388:0:9638:0:(genops.c:1283:class_disconnect_export_list()) lustre-OST0004: disconnecting export at 10.10.16.146@tcp (ffff88004462d800), last request at 1372693549 00000100:00080000:0.0:1372693555.591389:0:9638:0:(import.c:1512:ptlrpc_cleanup_imp()) ffff880019151800 : changing import state from FULL to CLOSED 00010000:00010000:0.0:1372693555.591393:0:9638:0:(ldlm_lock.c:2218:ldlm_cancel_locks_for_export_cb()) ### export ffff88004462d800 ns: filter-lustre-OST0004_UUID lock: 
ffff88007a1969c0/0xf03433df160b73d8 lrc: 4/0,0 mode: PR/PR res: [0x2d02:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000000000 nid: 10.10.16.146@tcp remote: 0xe2b9ddc0f9be730c expref: 4 pid: 10014 timeout: 0 lvb_type: 1 00010000:00010000:0.0:1372693555.591417:0:9638:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: filter-lustre-OST0004_UUID lock: ffff88007a1969c0/0xf03433df160b73d8 lrc: 0/0,0 mode: --/PR res: [0x2d02:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x44801000000000 nid: 10.10.16.146@tcp remote: 0xe2b9ddc0f9be730c expref: 4 pid: 10014 timeout: 0 lvb_type: 1 00010000:00010000:0.0:1372693555.591424:0:9638:0:(ldlm_lock.c:2218:ldlm_cancel_locks_for_export_cb()) ### export ffff88004462d800 ns: filter-lustre-OST0004_UUID lock: ffff880024fe3280/0xf03433df160b741e lrc: 4/0,0 mode: PR/PR res: [0x2d03:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000000000 nid: 10.10.16.146@tcp remote: 0xe2b9ddc0f9be74c5 expref: 3 pid: 7587 timeout: 0 lvb_type: 1 00010000:00010000:0.0:1372693555.591434:0:9638:0:(ldlm_lock.c:219:ldlm_lock_put()) ### final lock_put on destroyed lock, freeing it. ns: filter-lustre-OST0004_UUID lock: ffff880024fe3280/0xf03433df160b741e lrc: 0/0,0 mode: --/PR res: [0x2d03:0x0:0x0].0 rrc: 1 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x44801000000000 nid: 10.10.16.146@tcp remote: 0xe2b9ddc0f9be74c5 expref: 3 pid: 7587 timeout: 0 lvb_type: 1 00000020:00080000:0.0:1372693555.591440:0:9638:0:(genops.c:1288:class_disconnect_export_list()) disconnected export at 10.10.16.146@tcp (ffff88004462d800): rc 0 00000020:00080000:0.0:1372693555.591442:0:9638:0:(genops.c:1271:class_disconnect_export_list()) exp ffff88006ab4d800 export uuid == obd uuid, don't discon 00000020:00080000:0.0:1372693555.591444:0:9638:0:(genops.c:1313:class_disconnect_exports()) OBD device 15 (ffff88006aa343f8) has no exports and then several of the following with increasing time 00000020:02000400:0.0:1372693569.591340:0:9638:0:(genops.c:1575:obd_exports_barrier()) lustre-OST0004 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? 00000020:00080000:0.0:1372693569.594019:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: UNLINKED ffff88002030d800 lustre-MDT0000-mdtlov_UUID 10.10.16.143@tcp 1 (0 0 0) 1 0 1 0: (null) 0 ..... 00000020:02000400:0.0:1372693585.593287:0:9638:0:(genops.c:1575:obd_exports_barrier()) lustre-OST0004 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck? 00000020:00080000:0.0:1372693585.594530:0:9638:0:(genops.c:1535:print_export_data()) lustre-OST0004: UNLINKED ffff88002030d800 lustre-MDT0000-mdtlov_UUID 10.10.16.143@tcp 1 (0 0 0) 1 0 1 0: (null) 0 ..... OSS syslog : ..... Jul 1 08:32:53 wtm-13vm4 mrshd[31217]: root@wtm-13vm6.rosso.whamcloud.com as root: cmd='/usr/sbin/lctl mark "/usr/sbin/lctl mark test_70b fail mds1 1 times";echo XXRETCODE:$?' 
Jul 1 08:32:53 wtm-13vm4 kernel: Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70b fail mds1 1 times Jul 1 08:32:53 wtm-13vm4 xinetd[1761]: EXIT: mshell status=0 pid=31216 duration=0(sec) Jul 1 08:32:53 wtm-13vm4 xinetd[1761]: START: mshell pid=31239 from=::ffff:10.10.16.146 Jul 1 08:32:53 wtm-13vm4 mrshd[31240]: root@wtm-13vm6.rosso.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" MGSFSTYPE=zfs MDSFSTYPE=zfs OSTFSTYPE=zfs FSTYPE=zfs sh -c "/usr/sbin/lctl mark test_70b fail mds1 1 times");echo XXRETCODE:$?' Jul 1 08:32:53 wtm-13vm4 kernel: Lustre: DEBUG MARKER: test_70b fail mds1 1 times Jul 1 08:32:53 wtm-13vm4 xinetd[1761]: EXIT: mshell status=0 pid=31239 duration=0(sec) Jul 1 08:32:56 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_wio#012 Jul 1 08:33:10 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for proc_run#012 Jul 1 08:33:11 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for bytes_out#012 Jul 1 08:33:11 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for pkts_out#012 Jul 1 08:33:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_user#012 Jul 1 08:33:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_idle#012 Jul 1 08:33:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_aidle#012 Jul 1 08:33:36 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_user#012 Jul 1 08:33:36 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_idle#012 Jul 1 08:33:36 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for load_one#012 Jul 1 08:33:36 wtm-13vm4 kernel: Lustre: lustre-OST0004: Export ffff88002030d800 already connecting from 10.10.16.143@tcp Jul 1 08:33:36 wtm-13vm4 kernel: Lustre: lustre-OST0005: Export ffff88006c972400 already connecting from 10.10.16.143@tcp Jul 1 08:33:39 wtm-13vm4 kernel: Lustre: 5417:0:(client.c:1868:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1372692794/real 1372692794] req@ffff880070a3bc00 x1439352804834872/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.16.143@tcp:12/10 lens 400/544 e 0 to 1 dl 1372692819 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 Jul 1 08:33:39 wtm-13vm4 kernel: Lustre: 5417:0:(client.c:1868:ptlrpc_expire_one_request()) Skipped 34 previous similar messages Jul 1 08:33:41 wtm-13vm4 kernel: Lustre: lustre-OST0004: Export ffff88002030d800 already connecting from 10.10.16.143@tcp Jul 1 08:33:46 wtm-13vm4 kernel: Lustre: lustre-OST0004: Export ffff88002030d800 already connecting from 10.10.16.143@tcp Jul 1 08:33:46 wtm-13vm4 kernel: Lustre: Skipped 1 previous similar message ..... Jul 1 08:33:56 wtm-13vm4 kernel: Lustre: lustre-OST0004: Export ffff88002030d800 already connecting from 10.10.16.143@tcp Jul 1 08:33:56 wtm-13vm4 kernel: Lustre: Skipped 1 previous similar message Jul 1 08:34:07 wtm-13vm4 kernel: Lustre: 7568:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20:16s); client may timeout. 
req@ffff880044924800 x1439352772828224/t0(0) o8->lustre-MDT0000-mdtlov_UUID@10.10.16.143@tcp:0/0 lens 400/264 e 0 to 0 dl 1372692831 ref 1 fl Complete:/0/0 rc 0/0 Jul 1 08:34:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_nice#012 Jul 1 08:34:20 wtm-13vm4 kernel: Lustre: lustre-OST0004: haven't heard from client lustre-MDT0000-mdtlov_UUID (at (no nid)) in 49 seconds. I think it's dead, and I am evicting it. exp ffff88002030d800, cur 1372692860 expire 1372692830 last 1372692811 ..... Jul 1 08:45:50 wtm-13vm4 mrshd[9593]: root@wtm-13vm6.rosso.whamcloud.com as root: cmd='/usr/sbin/lctl mark "umount -d /mnt/ost5";echo XXRETCODE:$?' Jul 1 08:45:50 wtm-13vm4 kernel: Lustre: DEBUG MARKER: umount -d /mnt/ost5 Jul 1 08:45:50 wtm-13vm4 xinetd[1761]: EXIT: mshell status=0 pid=9592 duration=0(sec) Jul 1 08:45:50 wtm-13vm4 xinetd[1761]: START: mshell pid=9615 from=::ffff:10.10.16.146 Jul 1 08:45:50 wtm-13vm4 mrshd[9616]: root@wtm-13vm6.rosso.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" sh -c "umount -d /mnt/ost5");echo XXRETCODE:$?' Jul 1 08:45:50 wtm-13vm4 kernel: Lustre: Failing over lustre-OST0004 Jul 1 08:45:51 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for mem_free#012 Jul 1 08:45:51 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for mem_shared#012 Jul 1 08:45:51 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for mem_cached#012 Jul 1 08:45:51 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for swap_free#012 Jul 1 08:45:51 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for bytes_in#012 Jul 1 08:45:56 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for load_five#012 Jul 1 08:45:56 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for load_fifteen#012 Jul 1 08:45:56 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_user#012 Jul 1 08:45:58 wtm-13vm4 kernel: Lustre: lustre-OST0004: Not available for connect from 10.10.16.143@tcp (stopping) Jul 1 08:45:58 wtm-13vm4 kernel: Lustre: Skipped 1 previous similar message Jul 1 08:46:03 wtm-13vm4 kernel: Lustre: lustre-OST0004: Not available for connect from 10.10.16.143@tcp (stopping) Jul 1 08:46:03 wtm-13vm4 kernel: Lustre: Skipped 2 previous similar messages Jul 1 08:46:08 wtm-13vm4 kernel: Lustre: lustre-OST0004: Not available for connect from 10.10.16.143@tcp (stopping) Jul 1 08:46:08 wtm-13vm4 kernel: Lustre: Skipped 2 previous similar messages Jul 1 08:46:09 wtm-13vm4 kernel: Lustre: lustre-OST0004 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? Jul 1 08:46:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_user#012 Jul 1 08:46:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_system#012 Jul 1 08:46:16 wtm-13vm4 /usr/sbin/gmond[1777]: Error 1 sending the modular data for cpu_idle#012 Jul 1 08:46:18 wtm-13vm4 kernel: Lustre: lustre-OST0004: Not available for connect from 10.10.16.143@tcp (stopping) Jul 1 08:46:18 wtm-13vm4 kernel: Lustre: Skipped 5 previous similar messages Jul 1 08:46:25 wtm-13vm4 kernel: Lustre: lustre-OST0004 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck? .... 
then several stacks are dumped for same hung process umount/9638 like the following : Jul 1 08:52:18 wtm-13vm4 kernel: INFO: task umount:9638 blocked for more than 120 seconds. Jul 1 08:52:18 wtm-13vm4 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Jul 1 08:52:18 wtm-13vm4 kernel: umount D 0000000000000000 0 9638 9637 0x00000080 Jul 1 08:52:18 wtm-13vm4 kernel: ffff8800194b3aa8 0000000000000082 ffffffff00000010 ffff8800194b3a58 Jul 1 08:52:18 wtm-13vm4 kernel: ffff8800194b3a18 ffff88002030d800 ffffffffa078e0ff 0000000000000000 Jul 1 08:52:18 wtm-13vm4 kernel: ffff880019e91ab8 ffff8800194b3fd8 000000000000fb88 ffff880019e91ab8 Jul 1 08:52:18 wtm-13vm4 kernel: Call Trace: Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff8150ee02>] schedule_timeout+0x192/0x2e0 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff810810e0>] ? process_timeout+0x0/0x10 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa05d662d>] cfs_schedule_timeout_and_set_state+0x1d/0x20 [libcfs] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa070f4d8>] obd_exports_barrier+0x98/0x170 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa0e42962>] ofd_device_fini+0x42/0x230 [ofd] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa073c2e7>] class_cleanup+0x577/0xda0 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa0711686>] ? class_name2dev+0x56/0xe0 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa073dbcc>] class_process_config+0x10bc/0x1c80 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa07375b3>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa073e909>] class_manual_cleanup+0x179/0x6f0 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa0711686>] ? class_name2dev+0x56/0xe0 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa077845c>] server_put_super+0x5ec/0xf60 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff811833ab>] generic_shutdown_super+0x5b/0xe0 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff81183496>] kill_anon_super+0x16/0x60 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffffa0740756>] lustre_kill_super+0x36/0x60 [obdclass] Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff81183c37>] deactivate_super+0x57/0x80 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff811a1c8f>] mntput_no_expire+0xbf/0x110 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff811a26fb>] sys_umount+0x7b/0x3a0 Jul 1 08:52:18 wtm-13vm4 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b and until auto-test detects a time-out and request all threads to dump their stack where nothing interesting be found, only the concerned export address (ffff88002030d800) being in ll_evictor thread context/stack : Jul 1 09:50:26 wtm-13vm4 kernel: ll_evictor S 0000000000000000 0 5227 2 0x00000080 Jul 1 09:50:26 wtm-13vm4 kernel: ffff88006d29fe00 0000000000000046 0000000000000000 0000000000000002 Jul 1 09:50:26 wtm-13vm4 kernel: ffff88006d29fd80 ffff88002030d800 000000000000000a 0000000000000000 Jul 1 09:50:26 wtm-13vm4 kernel: ffff880069e9e638 ffff88006d29ffd8 000000000000fb88 ffff880069e9e638 Jul 1 09:50:26 wtm-13vm4 kernel: Call Trace: Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffffa05d666e>] cfs_waitq_wait+0xe/0x10 [libcfs] Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffffa08ea98b>] ping_evictor_main+0x4ab/0x640 [ptlrpc] Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffff81063310>] ? default_wake_function+0x0/0x20 Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffffa08ea4e0>] ? 
ping_evictor_main+0x0/0x640 [ptlrpc] Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffff81096936>] kthread+0x96/0xa0 Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20 Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffff810968a0>] ? kthread+0x0/0xa0 Jul 1 09:50:26 wtm-13vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 and still with the hung umount/9638 process !! So seems that the orphan export still on the unlinked list and preventing ost umount, comes from a MDS eviction during a previous test (test_70b). |
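For reference, a hedged way to watch the stuck target from the OSS while the umount hangs: 'lctl dl' prints the per-device refcount that the console message above reports, and the obdfilter num_exports/exports parameters list the exports still attached. The target name below matches this log and would need adjusting for other runs.

# Device list; the last column is the obd refcount from the console message.
lctl dl

# Exports still known to the stuck OST (standard lctl procfs parameters).
lctl get_param obdfilter.lustre-OST0004.num_exports
lctl get_param obdfilter.lustre-OST0004.exports.*.uuid 2>/dev/null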
| Comment by Bruno Faccini (Inactive) [ 08/Jul/13 ] |
|
Same at https://maloo.whamcloud.com/test_sets/229cc1ca-e5d6-11e2-a8a4-52540035b04c, again for review-zfs/replay-single/test_90, where the orphaned/unlinked export seems to be the one for the MDS-lov that was evicted from the OSS in the earlier test_70b. |
| Comment by Nathaniel Clark [ 11/Jul/13 ] |
|
Bruno, I agree it's definitely the lustre-MDT0000-mdtlov export that has an extra reference. OSS debug log from one of the replay-single/90 failures:

00000020:00080000:0.0:1373044595.526429:0:6742:0:(genops.c:1535:print_export_data()) lustre-OST0002: ACTIVE ffff880071708000 lustre-MDT0000-mdtlov_UUID 10.10.16.107@tcp 3 (0 0 0) 0 0 0 0: (null) 0
00000020:00080000:0.0:1373044595.526433:0:6742:0:(genops.c:1535:print_export_data()) lustre-OST0002: ACTIVE ffff8800705e1400 75dd01f6-8a7e-1e4e-7e18-9ea0580f4dfc 10.10.16.109@tcp 3 (0 0 0) 0 0 0 0: (null) 0
00000020:00080000:0.0:1373044595.526435:0:6742:0:(genops.c:1535:print_export_data()) lustre-OST0002: ACTIVE ffff8800692a9000 08b7a486-2e63-5cd2-f09f-fd1b2503a9b4 10.10.16.110@tcp 5 (0 0 2) 0 0 0 0: (null) 0
00000020:00080000:0.0:1373044595.526437:0:6742:0:(genops.c:1535:print_export_data()) lustre-OST0002: ACTIVE ffff88006e732400 lustre-OST0002_UUID (no nid) 1 (0 0 0) 0 0 0 0: (null) 0
00000020:00080000:0.0:1373044595.526440:0:6742:0:(genops.c:1535:print_export_data()) lustre-OST0002: UNLINKED ffff88004d2cfc00 lustre-MDT0000-mdtlov_UUID 10.10.16.107@tcp 1 (0 0 0) 1 0 1 0: (null) 0

This is while the OST is being unmounted. |
| Comment by Nathaniel Clark [ 11/Jul/13 ] |
|
OSS console log from replay-single/70b:

10:06:06:LNet: Service thread pid 5468 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
10:06:06:Pid: 5468, comm: ll_ost_io00_000
10:06:06:
10:06:06:Call Trace:
10:06:06: [<ffffffff81096eae>] ? prepare_to_wait_exclusive+0x4e/0x80
10:06:06: [<ffffffffa03e6dad>] cv_wait_common+0xed/0x100 [spl]
10:06:06: [<ffffffff81096ca0>] ? autoremove_wake_function+0x0/0x40
10:06:06: [<ffffffffa03e6e15>] __cv_wait+0x15/0x20 [spl]
10:06:06: [<ffffffffa04e518b>] txg_wait_open+0x7b/0xa0 [zfs]
10:06:06: [<ffffffffa04aa6ad>] dmu_tx_wait+0xed/0xf0 [zfs]
10:06:06: [<ffffffffa04aa736>] dmu_tx_assign+0x86/0x480 [zfs]
10:06:06: [<ffffffffa0dab56c>] osd_trans_start+0x9c/0x410 [osd_zfs]
10:06:06: [<ffffffffa0e58cbd>] ofd_commitrw_write+0x92d/0x11b0 [ofd]
10:06:06: [<ffffffffa0e5b77d>] ofd_commitrw+0x5cd/0xbb0 [ofd]
10:06:06: [<ffffffffa06847e5>] ? lprocfs_counter_add+0x125/0x182 [lvfs]
10:06:06: [<ffffffffa0e101d8>] obd_commitrw+0x128/0x3d0 [ost]
10:06:06: [<ffffffffa0e1a0f1>] ost_brw_write+0xea1/0x15d0 [ost]
10:06:06: [<ffffffff81281536>] ? vsnprintf+0x336/0x5e0
10:06:06: [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
10:06:06: [<ffffffffa0897350>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
10:06:06: [<ffffffffa0e2033b>] ost_handle+0x3ecb/0x48e0 [ost]
10:06:06: [<ffffffffa05e1d94>] ? libcfs_id2str+0x74/0xb0 [libcfs]
10:06:07: [<ffffffffa08e7378>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
10:06:07: [<ffffffffa05d654e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
10:06:07: [<ffffffffa05e7a9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
10:06:07: [<ffffffffa08de799>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
10:06:07: [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
10:06:07: [<ffffffffa08e86fd>] ptlrpc_main+0xabd/0x1700 [ptlrpc]
10:06:07: [<ffffffffa08e7c40>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
10:06:07: [<ffffffff81096936>] kthread+0x96/0xa0
10:06:07: [<ffffffff8100c0ca>] child_rip+0xa/0x20
10:06:07: [<ffffffff810968a0>] ? kthread+0x0/0xa0
10:06:07: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Bruno Faccini (Inactive) [ 12/Jul/13 ] |
|
Yes, these stacks are suspect, but they are no longer present in the stack listing by the time the umount of the OST in test_90 becomes stuck ... Also, the OSS debug_log is not big enough at the time of the test_90 failure to still contain the necessary traces from test_70b for further debugging ... Do you agree it could be worthwhile to isolate test_70b in a loop and detect its failure via the OST umounts? A sketch of such a loop follows. |
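A rough reproducer sketch along those lines, assuming a standard lustre/tests setup; ONLY and FSTYPE are the usual test-framework variables, while the iteration count and the one-hour timeout are arbitrary assumptions.

#!/bin/bash
# Loop replay-single test_70b followed by test_90 (whose OST umount is where
# the hang appears) and stop on the first iteration that fails or times out.
cd /usr/lib64/lustre/tests || exit 1

for i in $(seq 1 50); do
    echo "=== iteration $i ==="
    if ! timeout 3600 env ONLY="70b 90" FSTYPE=zfs bash replay-single.sh; then
        echo "replay-single hung or failed on iteration $i; collect OSS logs now"
        exit 1
    fi
done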
| Comment by Nathaniel Clark [ 15/Jul/13 ] |
| Comment by Li Wei (Inactive) [ 25/Jul/13 ] |
|
https://maloo.whamcloud.com/test_sets/9bd8497c-f46f-11e2-b8a2-52540035b04c |
| Comment by Jian Yu [ 12/Aug/13 ] |
|
The failure occurs consistently on Lustre b2_4 branch while running obdfilter-survey test 3a with FSTYPE=zfs: https://maloo.whamcloud.com/test_sets/1c69df6c-025b-11e3-b384-52540035b04c |
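For what it's worth, a reproduction sketch under the same assumptions as the autotest runs (ZFS-backed targets, standard lustre/tests layout); the exact invocation is an assumption, but ONLY and FSTYPE are the usual test-framework variables.

# Run only obdfilter-survey test 3a against ZFS-backed targets.
cd /usr/lib64/lustre/tests || exit 1
ONLY=3a FSTYPE=zfs bash obdfilter-survey.sh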
| Comment by Nathaniel Clark [ 20/Aug/13 ] |
|
Landed on master |
| Comment by Andreas Dilger [ 20/Aug/13 ] |
|
Cherry-picked to b2_4 as commit d4532fbf06d392212463355e2ba1525e9aea97eb. |
| Comment by Jian Yu [ 24/Sep/13 ] |
|
While testing patch http://review.whamcloud.com/7644 on the master branch, I hit the failure on replay-single test 53e: https://maloo.whamcloud.com/test_sets/ddb85db2-208b-11e3-b9bc-52540035b04c

03:55:06:Lustre: DEBUG MARKER: umount -d /mnt/mds1
03:55:06:LustreError: 166-1: MGC10.10.4.154@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
03:55:07:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
03:55:31:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
03:56:05:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck?
03:57:09:Lustre: MGS is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 5. Is it stuck?
03:59:15:Lustre: MGS is waiting for obd_unlinked_exports more than 128 seconds. The obd refcount = 5. Is it stuck?
04:02:31:INFO: task umount:2412 blocked for more than 120 seconds.
04:02:31:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
04:02:32:umount D 0000000000000000 0 2412 2411 0x00000080
04:02:32: ffff88005ead3aa8 0000000000000086 ffff88005ead3a08 ffff880079da1400
04:02:32: ffffffffa062f3e0 0000000000000000 ffff88006684f084 ffffffffa062f3e0
04:02:32: ffff8800567645f8 ffff88005ead3fd8 000000000000fb88 ffff8800567645f8
04:02:32:Call Trace:
04:02:33: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
04:02:33: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
04:02:33: [<ffffffffa05b06fb>] obd_exports_barrier+0xab/0x180 [obdclass]
04:02:34: [<ffffffffa0d2852e>] mgs_device_fini+0xfe/0x580 [mgs]
04:02:35: [<ffffffffa05dbc93>] class_cleanup+0x573/0xd30 [obdclass]
04:02:35: [<ffffffffa05b2896>] ? class_name2dev+0x56/0xe0 [obdclass]
04:02:35: [<ffffffffa05dd9ba>] class_process_config+0x156a/0x1ad0 [obdclass]
04:02:35: [<ffffffffa05d6b13>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
04:02:36: [<ffffffffa05de099>] class_manual_cleanup+0x179/0x6f0 [obdclass]
04:02:36: [<ffffffffa05b2896>] ? class_name2dev+0x56/0xe0 [obdclass]
04:02:37: [<ffffffffa061955d>] server_put_super+0x45d/0xf60 [obdclass]
04:02:38: [<ffffffff8118363b>] generic_shutdown_super+0x5b/0xe0
04:02:38: [<ffffffff81183726>] kill_anon_super+0x16/0x60
04:02:39: [<ffffffffa05dff46>] lustre_kill_super+0x36/0x60 [obdclass]
04:02:39: [<ffffffff81183ec7>] deactivate_super+0x57/0x80
04:02:39: [<ffffffff811a21bf>] mntput_no_expire+0xbf/0x110
04:02:42: [<ffffffff811a2c2b>] sys_umount+0x7b/0x3a0
04:02:42: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
04:03:25:Lustre: MGS is waiting for obd_unlinked_exports more than 256 seconds. The obd refcount = 5. Is it stuck? |
| Comment by Nathaniel Clark [ 01/Oct/13 ] |
|
This seems to have cropped up several times recently with sanity/132 https://maloo.whamcloud.com/test_sets/47f3dcd2-2a88-11e3-864b-52540035b04c |
| Comment by Nathaniel Clark [ 01/Oct/13 ] |
|
sanity/132 seem to share the following OST logs: 15:51:18:Lustre: DEBUG MARKER: == sanity test 132: som avoids glimpse rpc == 15:50:26 (1380581426) 15:51:18:LustreError: 23533:0:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel 15:51:18:Lustre: lustre-OST0006: Client lustre-MDT0000-mdtlov_UUID (at 10.10.16.120@tcp) reconnecting 15:51:18:Lustre: lustre-OST0006: Client lustre-MDT0000-mdtlov_UUID (at 10.10.16.120@tcp) refused reconnection, still busy with 1 active RPCs 15:51:18:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n ost.OSS.ost.stats 15:51:18:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n ost.OSS.ost.stats 15:51:18:Lustre: lustre-OST0006: Client lustre-MDT0000-mdtlov_UUID (at 10.10.16.120@tcp) reconnecting 15:51:18:Lustre: lustre-OST0006: Client lustre-MDT0000-mdtlov_UUID (at 10.10.16.120@tcp) refused reconnection, still busy with 1 active RPCs 15:51:18:LustreError: 11-0: lustre-MDT0000-lwp-OST0001: Communicating with 10.10.16.120@tcp, operation obd_ping failed with -107. 15:51:18:Lustre: lustre-MDT0000-lwp-OST0001: Connection to lustre-MDT0000 (at 10.10.16.120@tcp) was lost; in progress operations using this service will wait for recovery to complete Then a umount of OST0006 which never completes: 15:52:09:Lustre: 7404:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1380581484/real 1380581484] req@ffff8800634d5800 x1447637766224616/t0(0) o250->MGC10.10.16.120@tcp@10.10.16.120@tcp:26/25 lens 400/544 e 0 to 1 dl 1380581500 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1 15:52:09:Lustre: lustre-OST0006 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck? From the MDT console log: 16:51:27:Lustre: DEBUG MARKER: == sanity test 132: som avoids glimpse rpc == 15:50:26 (1380581426) 16:51:27:LustreError: 11-0: lustre-OST0006-osc-MDT0000: Communicating with 10.10.16.121@tcp, operation ost_connect failed with -16. 16:51:27:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param mdt.*.som 16:51:27:LustreError: 11-0: lustre-OST0006-osc-MDT0000: Communicating with 10.10.16.121@tcp, operation ost_connect failed with -16. 
16:51:27:Lustre: DEBUG MARKER: /usr/sbin/lctl conf_param lustre.mdt.som=enabled 16:51:27:Lustre: Setting parameter lustre-MDT0000.mdt.som in log lustre-MDT0000 16:51:27:Lustre: Skipped 5 previous similar messages 16:51:27:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts 16:51:27:Lustre: DEBUG MARKER: umount -d -f /mnt/mds1 16:51:27:LustreError: 3509:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88004efcd000 x1447637735940204/t0(0) o13->lustre-OST0000-osc-MDT0000@10.10.16.121@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 16:51:27:LustreError: 3509:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88004efcd000 x1447637735940208/t0(0) o13->lustre-OST0002-osc-MDT0000@10.10.16.121@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 16:51:27:LustreError: 3509:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff88004efcd000 x1447637735940216/t0(0) o6->lustre-OST0003-osc-MDT0000@10.10.16.121@tcp:28/4 lens 664/432 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 16:51:27:LustreError: 3509:0:(client.c:1076:ptlrpc_import_delay_req()) Skipped 1 previous similar message 16:51:27:Lustre: lustre-MDT0000: Not available for connect from 10.10.16.121@tcp (stopping) 16:51:27:Lustre: lustre-MDT0000: Not available for connect from 10.10.16.121@tcp (stopping) 16:51:27:LustreError: 3508:0:(client.c:1076:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8800569b5400 x1447637735940228/t0(0) o13->lustre-OST0004-osc-MDT0000@10.10.16.121@tcp:7/4 lens 224/368 e 0 to 0 dl 0 ref 1 fl Rpc:/0/ffffffff rc 0/-1 16:51:27:Lustre: 15981:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1380581444/real 1380581444] req@ffff8800569b5400 x1447637735940248/t0(0) o251->MGC10.10.16.120@tcp@0@lo:26/25 lens 224/224 e 0 to 1 dl 1380581450 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1 16:51:27:LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.10.16.121@tcp (no target). If you are running an HA pair check that the target is mounted on the other server. 16:51:27:Lustre: server umount lustre-MDT0000 complete From debug log on OST: ... 1380410772.384659:(ldlm_lock.c:454:lock_handle_free()) slab-freed 'lock': 504 at ffff880025067c80. 1380410772.386661:(ldlm_lock.c:454:lock_handle_free()) slab-freed 'lock': 504 at ffff88002583e380. 1380410831.744886:(ofd_objects.c:563:ofd_attr_get()) Process entered 1380410831.744887:(ofd_objects.c:588:ofd_attr_get()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe) 1380410831.744889:(lprocfs_jobstats.c:217:lprocfs_job_stats_log()) Process entered 1380410831.744890:(lprocfs_jobstats.c:224:lprocfs_job_stats_log()) Process leaving (rc=18446744073709551594 : -22 : ffffffffffffffea) 1380410831.744891:(ofd_obd.c:1456:ofd_sync()) Process leaving 1380410831.744892:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4279240389 : 4279240389 : ff1006c5) 1380410831.744893:(lustre_fid.h:719:fid_flatten32()) Process leaving (rc=4279240389 : 4279240389 : ff1006c5) 1380410831.744897:(ofd_dev.c:285:ofd_object_free()) Process entered 1380410831.744897:(ofd_dev.c:289:ofd_object_free()) object free, fid = [0x100000000:0x17c5:0x0] 1380410831.744898:(ofd_dev.c:293:ofd_object_free()) slab-freed '(of)': 160 at ffff880026e3e9f0. 
1380410831.744899:(ofd_dev.c:294:ofd_object_free()) Process leaving 1380410831.744899:(obd_class.h:1326:obd_sync()) Process leaving (rc=18446744073709551614 : -2 : fffffffffffffffe) 1380410831.744900:(ost_handler.c:1775:ost_blocking_ast()) Error -2 syncing data on lock cancel 1380410831.745806:(ost_handler.c:1777:ost_blocking_ast()) slab-freed '((oa))': 208 at ffff88002690ca40. 1380410831.745808:(ost_handler.c:1778:ost_blocking_ast()) kfreed 'oinfo': 112 at ffff880026b61140. |
| Comment by Nathaniel Clark [ 03/Oct/13 ] |
|
sanity/132 failures appear to be |
| Comment by Nathaniel Clark [ 17/Oct/13 ] |
|
There have been two "recent" (Sept 2013) non-conf-sanity failures (both in replay-single):

replay-single/74 https://maloo.whamcloud.com/test_sets/f441c460-227f-11e3-af6a-52540035b04c

21:28:53:Lustre: DEBUG MARKER: umount -d /mnt/ost1
21:28:53:Lustre: Failing over lustre-OST0000
21:28:53:LustreError: 15640:0:(ost_handler.c:1782:ost_blocking_ast()) Error -2 syncing data on lock cancel
21:28:53:Lustre: 15640:0:(service.c:2030:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (50:74s); client may timeout. req@ffff880046d72c00 x1446662193136696/t0(0) o103->cea0ffc2-1873-4321-a1a2-348391764373@10.10.16.253@tcp:0/0 lens 328/192 e 0 to 0 dl 1379651120 ref 1 fl Complete:H/0/0 rc -19/-19
21:28:53:LustreError: 7671:0:(ost_handler.c:1782:ost_blocking_ast()) Error -2 syncing data on lock cancel
21:28:53:Lustre: lustre-OST0000: Not available for connect from 10.10.17.1@tcp (stopping)
21:28:53:Lustre: Skipped 5 previous similar messages
21:28:53:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 7. Is it stuck?
21:28:53:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 7. Is it stuck?
21:28:53:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 7. Is it stuck?
21:40:22:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 64 seconds. The obd refcount = 7. Is it stuck?

The other is review run replay-single/53e https://maloo.whamcloud.com/test_sets/ddb85db2-208b-11e3-b9bc-52540035b04c (NOT ZFS)

03:55:06:Lustre: DEBUG MARKER: umount -d /mnt/mds1
03:55:06:LustreError: 166-1: MGC10.10.4.154@tcp: Connection to MGS (at 0@lo) was lost; in progress operations using this service will fail
03:55:07:Lustre: MGS is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 5. Is it stuck?
03:55:31:Lustre: MGS is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 5. Is it stuck?
03:56:05:Lustre: MGS is waiting for obd_unlinked_exports more than 32 seconds. The obd refcount = 5. Is it stuck? |
| Comment by Nathaniel Clark [ 17/Oct/13 ] |
|
Debugging patch pushed to see whether change 6988 was on the right track but not broad enough. |
| Comment by Jian Yu [ 04/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/47/ FSTYPE=zfs obdfilter-survey test 3a hit the same failure: |
| Comment by Nathaniel Clark [ 08/Nov/13 ] |
| Comment by Jian Yu [ 26/Nov/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/58/ FSTYPE=zfs obdfilter-survey test 3a hit the same failure: |
| Comment by Jian Yu [ 05/Dec/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/63/ replay-dual test 3 hit this failure: |
| Comment by Jian Yu [ 13/Dec/13 ] |
|
More instances on Lustre b2_4 branch: |
| Comment by Nathaniel Clark [ 16/Dec/13 ] |
|
It looks like this bug is fixed with the landing of change #7995. Should I create a Gerrit patch to port it to b2_4 and b2_5? |
| Comment by Nathaniel Clark [ 16/Dec/13 ] |
|
back-port to b2_4 http://review.whamcloud.com/8591 |
| Comment by Andreas Dilger [ 17/Dec/13 ] |
|
Typically, if a patch can be cherry-picked cleanly to the older branches there is no need for a separate patch. No harm in doing this, but it is also possible to ask Oleg to do the cherry-pick into the maintenance branch(es). |
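A hedged illustration of that cherry-pick flow against a maintenance branch; the commit id is a placeholder and the branch, remote and Gerrit refspec names assume a standard lustre-release checkout.

COMMIT="master-commit-id-of-the-fix"   # placeholder for the landed master commit
git fetch origin
git checkout -b lu3230-b2_4 origin/b2_4
git cherry-pick -x "$COMMIT"
# resolve any conflicts, then push the result for review on the maintenance branch
git push origin HEAD:refs/for/b2_4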
| Comment by Jian Yu [ 21/Dec/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/69/ (2.4.2 RC1) obdfilter-survey hit this failure again: |
| Comment by Jian Yu [ 03/Jan/14 ] |
|
Although patch http://review.whamcloud.com/7995 was cherry-picked to the Lustre b2_5 branch, the failure still occurred while running obdfilter-survey test 3a with FSTYPE=zfs on Lustre b2_5 build #5:

01:44:19:Lustre: DEBUG MARKER: umount -d -f /mnt/ost1
01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
01:44:19:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 8 seconds. The obd refcount = 4. Is it stuck?
01:44:19:Lustre: 4092:0:(client.c:1897:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1388655826/real 1388655826] req@ffff88006090dc00 x1456104749935752/t0(0) o38->lustre-MDT0000-lwp-OST0001@10.10.4.39@tcp:12/10 lens 400/544 e 0 to 1 dl 1388655842 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
01:44:19:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
01:46:34:Lustre: lustre-OST0000: Not available for connect from 10.10.4.37@tcp (stopping)
01:46:34:Lustre: lustre-OST0000 is waiting for obd_unlinked_exports more than 16 seconds. The obd refcount = 4. Is it stuck?
<~snip~>
01:50:58:INFO: task umount:11612 blocked for more than 120 seconds.
01:50:58:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
01:50:58:umount D 0000000000000001 0 11612 11611 0x00000080
01:50:58: ffff88002be63ab8 0000000000000082 0000000000000000 000000006e5e0af6
01:50:58: ffffffffa07ca7f0 ffff88006e68236f ffff88005a64d184 ffffffffa0788975
01:50:58: ffff8800638d7098 ffff88002be63fd8 000000000000fb88 ffff8800638d7098
01:50:58:Call Trace:
01:50:58: [<ffffffff8150f362>] schedule_timeout+0x192/0x2e0
01:50:58: [<ffffffff810811e0>] ? process_timeout+0x0/0x10
01:50:58: [<ffffffffa070aeab>] obd_exports_barrier+0xab/0x180 [obdclass]
01:50:58: [<ffffffffa0e8194f>] ofd_device_fini+0x5f/0x240 [ofd]
01:50:58: [<ffffffffa0736493>] class_cleanup+0x573/0xd30 [obdclass]
01:50:58: [<ffffffffa070d046>] ? class_name2dev+0x56/0xe0 [obdclass]
01:50:58: [<ffffffffa07381ba>] class_process_config+0x156a/0x1ad0 [obdclass]
01:50:58: [<ffffffffa0731313>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
01:50:58: [<ffffffffa0738899>] class_manual_cleanup+0x179/0x6f0 [obdclass]
01:50:58: [<ffffffffa070d046>] ? class_name2dev+0x56/0xe0 [obdclass]
01:50:59: [<ffffffffa0773edc>] server_put_super+0x5ec/0xf60 [obdclass]
01:50:59: [<ffffffff8118363b>] generic_shutdown_super+0x5b/0xe0
01:50:59: [<ffffffff81183726>] kill_anon_super+0x16/0x60
01:50:59: [<ffffffffa073a746>] lustre_kill_super+0x36/0x60 [obdclass]
01:50:59: [<ffffffff81183ec7>] deactivate_super+0x57/0x80
01:50:59: [<ffffffff811a21bf>] mntput_no_expire+0xbf/0x110
01:50:59: [<ffffffff811a2c2b>] sys_umount+0x7b/0x3a0
01:50:59: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

Maloo report: https://maloo.whamcloud.com/test_sets/8b620634-73c5-11e3-b4ff-52540035b04c |
| Comment by Jian Yu [ 03/Jan/14 ] |
|
I have to reopen the ticket because the failure still occurs. |
| Comment by Jian Yu [ 06/Jan/14 ] |
|
Another instance on Lustre b2_5 build #5 with FSTYPE=zfs: |
| Comment by Jian Yu [ 26/Jan/14 ] |
|
More instances on Lustre b2_5 branch: |
| Comment by Nathaniel Clark [ 21/Feb/14 ] |
|
All the b2_5 TIMEOUTs happened in obdfilter-survey/3a, but in each case there were errors in test 1c or 2a that I believe left an echo-client attached on the OST, which then caused the umount to time out. A quick check for that situation is sketched below. |
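A hedged check sketch for that situation before the OST is unmounted: leftover echo_client devices from obdfilter-survey show up in the device list and can be torn down with lctl. The device name used here is only an example; the real name would come from the 'lctl dl' output.

# Any echo_client left behind by an earlier obdfilter-survey run?
lctl dl | grep echo_client

# Tear down a leftover device by name before unmounting the OST.
lctl --device echo_cli_survey cleanup force
lctl --device echo_cli_survey detach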
| Comment by Nathaniel Clark [ 21/Feb/14 ] |
|
The b2_5 issues are different from the original issue and are better handled by |