[LU-5105] Test failure sanity-lfsck test_18d: umount mds hung Created: 27/May/14  Updated: 23/Oct/15  Resolved: 23/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Nathaniel Clark Assignee: nasf (Inactive)
Resolution: Won't Fix Votes: 0
Labels: zfs

Attachments: Text File 0001-dump-referenced-dnodes-at-umount.patch    
Severity: 3
Rank (Obsolete): 14085

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c.

The sub-test test_18d failed with the following error:

test failed to respond and timed out

Info required for matching: sanity-lfsck 18d

MDS syslog:

umount        D 0000000000000000     0 19510  19509 0x00000080
 ffff880051eaf8b8 0000000000000082 0000000000000000 ffff880051eaf87c
 0000000000000282 0000000000000282 ffff880051eaf858 ffffffff8108410c
 ffff8800554de5f8 ffff880051eaffd8 000000000000fbc8 ffff8800554de5f8
Call Trace:
 [<ffffffff8108410c>] ? lock_timer_base+0x3c/0x70
 [<ffffffff815291c2>] schedule_timeout+0x192/0x2e0
 [<ffffffff81084220>] ? process_timeout+0x0/0x10
 [<ffffffff8152932e>] schedule_timeout_uninterruptible+0x1e/0x20
 [<ffffffffa123ddea>] dnode_special_close+0x2a/0x60 [zfs]
 [<ffffffffa1232652>] dmu_objset_evict+0x92/0x400 [zfs]
 [<ffffffffa1243c50>] dsl_dataset_evict+0x30/0x1b0 [zfs]
 [<ffffffffa1223dd9>] dbuf_evict_user+0x49/0x80 [zfs]
 [<ffffffffa1225087>] dbuf_rele_and_unlock+0xf7/0x1e0 [zfs]
 [<ffffffffa12254e0>] dmu_buf_rele+0x30/0x40 [zfs]
 [<ffffffffa1249170>] dsl_dataset_disown+0xb0/0x1d0 [zfs]
 [<ffffffffa1231751>] dmu_objset_disown+0x11/0x20 [zfs]
 [<ffffffffa18f690e>] udmu_objset_close+0x2e/0x40 [osd_zfs]
 [<ffffffffa18f4f86>] osd_device_fini+0x366/0x5c0 [osd_zfs]
 [<ffffffffa0d9dd53>] class_cleanup+0x573/0xd30 [obdclass]
 [<ffffffffa0d757a6>] ? class_name2dev+0x56/0xe0 [obdclass]
 [<ffffffffa0d9fa7a>] class_process_config+0x156a/0x1ad0 [obdclass]
 [<ffffffffa0d97d53>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
 [<ffffffffa0da0159>] class_manual_cleanup+0x179/0x6f0 [obdclass]
 [<ffffffffa0d73c7b>] ? class_export_put+0x10b/0x2c0 [obdclass]
 [<ffffffffa18f412d>] osd_obd_disconnect+0x1bd/0x1c0 [osd_zfs]
 [<ffffffffa0da273b>] lustre_put_lsi+0x1ab/0x11a0 [obdclass]
 [<ffffffffa0daacf8>] lustre_common_put_super+0x5d8/0xbe0 [obdclass]
 [<ffffffffa0dd8c70>] server_put_super+0x180/0xe40 [obdclass]
 [<ffffffff8118b31b>] generic_shutdown_super+0x5b/0xe0
 [<ffffffff8118b406>] kill_anon_super+0x16/0x60
 [<ffffffffa0da2016>] lustre_kill_super+0x36/0x60 [obdclass]
 [<ffffffff8118bba7>] deactivate_super+0x57/0x80
 [<ffffffff811aabdf>] mntput_no_expire+0xbf/0x110
 [<ffffffff811ab72b>] sys_umount+0x7b/0x3a0


 Comments   
Comment by nasf (Inactive) [ 09/Oct/14 ]

Another failure instance:
https://testing.hpdd.intel.com/test_sets/adb9bee6-4b17-11e4-941e-5254006e85c2

Comment by nasf (Inactive) [ 07/Jan/15 ]

For the failures in https://testing.hpdd.intel.com/test_sets/adb9bee6-4b17-11e4-941e-5254006e85c2:
The sanity-lfsck test_18c failure is another instance of LU-5848. test_18d hung when umounting mds4 because the former lfsck assistant thread for test_18c was blocked at dt_sync(), so it is a side effect of the test_18c failure (LU-5848), not the same as the original ZFS-based test_18d hang in http://maloo.whamcloud.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c.
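
The shape of that hang, sketched with generic kernel primitives (this is not the actual Lustre lfsck code; the names below are made up for illustration): the stop request issued on the umount path cannot complete while the assistant thread is still blocked inside the synchronous backend call.

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/delay.h>

/* Hypothetical stand-in for the dt_sync() call that never returned. */
static void backend_sync_stuck(void)
{
	/* Blocked in D state indefinitely, like dt_sync() in the failure. */
	schedule_timeout_uninterruptible(MAX_SCHEDULE_TIMEOUT);
}

static int assistant_thread(void *arg)
{
	while (!kthread_should_stop()) {
		backend_sync_stuck();	/* never returns ... */
		msleep(100);
	}
	return 0;
}

static void umount_path(struct task_struct *assistant)
{
	/* ... so this wait never finishes and umount hangs. */
	kthread_stop(assistant);
}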

Comment by nasf (Inactive) [ 08/Jan/15 ]

Alex, do you have any idea about the umount MDS hang for the ZFS-based backend?
https://testing.hpdd.intel.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c

Comment by Alex Zhuravlev [ 08/Jan/15 ]

well, in that specific case it looks like some dnode was still referenced:
Call Trace:
[<ffffffff8108410c>] ? lock_timer_base+0x3c/0x70
[<ffffffff815291c2>] schedule_timeout+0x192/0x2e0
[<ffffffff81084220>] ? process_timeout+0x0/0x10
[<ffffffff8152932e>] schedule_timeout_uninterruptible+0x1e/0x20
[<ffffffffa123ddea>] dnode_special_close+0x2a/0x60 [zfs]
[<ffffffffa1232652>] dmu_objset_evict+0x92/0x400 [zfs]
[<ffffffffa1243c50>] dsl_dataset_evict+0x30/0x1b0 [zfs]
[<ffffffffa1223dd9>] dbuf_evict_user+0x49/0x80 [zfs]
[<ffffffffa1225087>] dbuf_rele_and_unlock+0xf7/0x1e0 [zfs]
[<ffffffffa12254e0>] dmu_buf_rele+0x30/0x40 [zfs]
[<ffffffffa1249170>] dsl_dataset_disown+0xb0/0x1d0 [zfs]
[<ffffffffa1231751>] dmu_objset_disown+0x11/0x20 [zfs]
[<ffffffffa18f690e>] udmu_objset_close+0x2e/0x40 [osd_zfs]
[<ffffffffa18f4f86>] osd_device_fini+0x366/0x5c0 [osd_zfs]

so the meta-dnode can't go away, which blocks umount.

But this seems to be some old version? We don't have the udmu wrappers anymore.
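
For reference, the wait in that trace comes from dnode_special_close() busy-waiting for the last hold on the meta-dnode to drop. Roughly paraphrased (from memory, not quoted from the exact sources of this ZFS version), the logic is:

/*
 * Rough paraphrase of dnode_special_close(): it spins until the
 * meta-dnode's hold count reaches zero, so a single leaked hold keeps
 * umount parked in schedule_timeout_uninterruptible() forever.
 */
void
dnode_special_close(dnode_handle_t *dnh)
{
	dnode_t *dn = dnh->dnh_dnode;

	/* Wait for final references to the dnode to clear. */
	while (refcount_count(&dn->dn_holds) > 0)
		delay(1);	/* -> schedule_timeout_uninterruptible() */

	dnode_destroy(dn);
	dnh->dnh_dnode = NULL;
}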

Comment by nasf (Inactive) [ 08/Jan/15 ]

Yes, there seems to be no way to know which dnode is still referenced. The original issue was hit with the patch http://review.whamcloud.com/#/c/10223/. I am not sure whether it is specific to that patch or not. Since that patch has been landed to master, there should be similar trouble on the master branch. But it is also possible that the trouble has been fixed incidentally by another patch.

Comment by Alex Zhuravlev [ 11/Jan/15 ]

Attached a patch to dump referenced dnodes at umount.
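
The attachment itself is not inlined here; as a purely hypothetical sketch of the idea (names and placement assumed, not taken from the actual patch), such a helper could walk the objset's dnode list right before eviction and log anything still holding a reference:

/*
 * Hypothetical sketch only (not the attached patch): before evicting the
 * objset, walk its dnode list and report every dnode that is still held,
 * so the leaked reference can be identified from the console log.
 */
static void
dump_referenced_dnodes(objset_t *os)
{
	dnode_t *dn;

	mutex_enter(&os->os_lock);
	for (dn = list_head(&os->os_dnodes); dn != NULL;
	    dn = list_next(&os->os_dnodes, dn)) {
		int64_t holds = refcount_count(&dn->dn_holds);

		if (holds != 0)
			printk(KERN_ERR "referenced dnode: obj %llu "
			    "type %d holds %lld\n",
			    (unsigned long long)dn->dn_object,
			    (int)dn->dn_type, (long long)holds);
	}
	mutex_exit(&os->os_lock);
}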

Comment by nasf (Inactive) [ 23/Oct/15 ]

Closing, since the issue has only been reported on a very old version.
