
Test failure sanity-lfsck test_18d: umount mds hung

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.6.0

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c.

      The sub-test test_18d failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity-lfsck 18d

      MDS syslog:

      umount        D 0000000000000000     0 19510  19509 0x00000080
       ffff880051eaf8b8 0000000000000082 0000000000000000 ffff880051eaf87c
       0000000000000282 0000000000000282 ffff880051eaf858 ffffffff8108410c
       ffff8800554de5f8 ffff880051eaffd8 000000000000fbc8 ffff8800554de5f8
      Call Trace:
       [<ffffffff8108410c>] ? lock_timer_base+0x3c/0x70
       [<ffffffff815291c2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffff8152932e>] schedule_timeout_uninterruptible+0x1e/0x20
       [<ffffffffa123ddea>] dnode_special_close+0x2a/0x60 [zfs]
       [<ffffffffa1232652>] dmu_objset_evict+0x92/0x400 [zfs]
       [<ffffffffa1243c50>] dsl_dataset_evict+0x30/0x1b0 [zfs]
       [<ffffffffa1223dd9>] dbuf_evict_user+0x49/0x80 [zfs]
       [<ffffffffa1225087>] dbuf_rele_and_unlock+0xf7/0x1e0 [zfs]
       [<ffffffffa12254e0>] dmu_buf_rele+0x30/0x40 [zfs]
       [<ffffffffa1249170>] dsl_dataset_disown+0xb0/0x1d0 [zfs]
       [<ffffffffa1231751>] dmu_objset_disown+0x11/0x20 [zfs]
       [<ffffffffa18f690e>] udmu_objset_close+0x2e/0x40 [osd_zfs]
       [<ffffffffa18f4f86>] osd_device_fini+0x366/0x5c0 [osd_zfs]
       [<ffffffffa0d9dd53>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa0d757a6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0d9fa7a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa0d97d53>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0da0159>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa0d73c7b>] ? class_export_put+0x10b/0x2c0 [obdclass]
       [<ffffffffa18f412d>] osd_obd_disconnect+0x1bd/0x1c0 [osd_zfs]
       [<ffffffffa0da273b>] lustre_put_lsi+0x1ab/0x11a0 [obdclass]
       [<ffffffffa0daacf8>] lustre_common_put_super+0x5d8/0xbe0 [obdclass]
       [<ffffffffa0dd8c70>] server_put_super+0x180/0xe40 [obdclass]
       [<ffffffff8118b31b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b406>] kill_anon_super+0x16/0x60
       [<ffffffffa0da2016>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bba7>] deactivate_super+0x57/0x80
       [<ffffffff811aabdf>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab72b>] sys_umount+0x7b/0x3a0
      

      Attachments

        Activity


          yong.fan nasf (Inactive) added a comment - Closing, since this issue has only been reported on very old versions.

          bzzz Alex Zhuravlev added a comment - a patch to dump referenced dnodes.

          yong.fan nasf (Inactive) added a comment - Yes, there seems to be no way to know which dnode is still referenced. The original issue was hit with the patch http://review.whamcloud.com/#/c/10223/. I am not sure whether it is specific to that patch. But since that patch has landed on master, there should be similar trouble on the master branch. It is also possible that the trouble has since been fixed incidentally by some other patch.

          bzzz Alex Zhuravlev added a comment - well, in that specific case it looks like some dnode was still referenced:

          Call Trace:
           [<ffffffff8108410c>] ? lock_timer_base+0x3c/0x70
           [<ffffffff815291c2>] schedule_timeout+0x192/0x2e0
           [<ffffffff81084220>] ? process_timeout+0x0/0x10
           [<ffffffff8152932e>] schedule_timeout_uninterruptible+0x1e/0x20
           [<ffffffffa123ddea>] dnode_special_close+0x2a/0x60 [zfs]
           [<ffffffffa1232652>] dmu_objset_evict+0x92/0x400 [zfs]
           [<ffffffffa1243c50>] dsl_dataset_evict+0x30/0x1b0 [zfs]
           [<ffffffffa1223dd9>] dbuf_evict_user+0x49/0x80 [zfs]
           [<ffffffffa1225087>] dbuf_rele_and_unlock+0xf7/0x1e0 [zfs]
           [<ffffffffa12254e0>] dmu_buf_rele+0x30/0x40 [zfs]
           [<ffffffffa1249170>] dsl_dataset_disown+0xb0/0x1d0 [zfs]
           [<ffffffffa1231751>] dmu_objset_disown+0x11/0x20 [zfs]
           [<ffffffffa18f690e>] udmu_objset_close+0x2e/0x40 [osd_zfs]
           [<ffffffffa18f4f86>] osd_device_fini+0x366/0x5c0 [osd_zfs]

          so the metadnode can't go away, which blocks umount.

          but this seems to be an old version? we don't have the udmu wrappers anymore.
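          The trace is consistent with dnode_special_close() spinning in a delay loop until every hold on the objset's special dnodes has been dropped; a leaked hold then makes that loop (and umount) wait forever. A minimal userspace model of that wait, bounded for illustration — all names here are hypothetical, not the actual ZFS code:

          ```c
          #include <stdio.h>

          /* Toy model of a dnode with an outstanding-hold count. */
          struct dnode_model {
                  int holds;              /* outstanding references */
          };

          /*
           * Model of the close-side wait: teardown cannot finish until
           * holds reaches zero.  Real ZFS code sleeps (delay(1)) each
           * iteration, which is the schedule_timeout_uninterruptible
           * frame in the stack; here each iteration simulates one
           * holder dropping its reference, and max_iters bounds the
           * loop so a "leaked" hold is detectable instead of hanging.
           * Returns 0 on clean close, -1 if the wait would never end.
           */
          static int special_close_model(struct dnode_model *dn, int max_iters)
          {
                  int iters = 0;

                  while (dn->holds > 0) {
                          if (++iters > max_iters)
                                  return -1;      /* would hang: hold leaked */
                          dn->holds--;            /* a holder releases */
                  }
                  return 0;
          }

          int main(void)
          {
                  struct dnode_model ok = { .holds = 3 };
                  struct dnode_model leaked = { .holds = 5 };

                  printf("balanced holds: %d\n", special_close_model(&ok, 10));
                  printf("leaked hold: %d\n", special_close_model(&leaked, 3));
                  return 0;
          }
          ```

          The model only illustrates why the hang is silent: nothing in the wait loop reports which hold is outstanding, which is what the dump-referenced-dnodes patch mentioned above was meant to address.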

          yong.fan nasf (Inactive) added a comment - Alex, do you have any idea about the umount MDS hang for the ZFS-based backend?
          https://testing.hpdd.intel.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c
          yong.fan nasf (Inactive) added a comment - edited

          For the failures in https://testing.hpdd.intel.com/test_sets/adb9bee6-4b17-11e4-941e-5254006e85c2: the sanity-lfsck test_18c failure is another instance of LU-5848. test_18d hung when unmounting mds4 because the former lfsck assistant thread from test_18c was blocked in dt_sync(); it is therefore a side effect of the test_18c failure (LU-5848), not the same as the original ZFS-based test_18d hang in http://maloo.whamcloud.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c.
          yong.fan nasf (Inactive) added a comment - Another failure instance: https://testing.hpdd.intel.com/test_sets/adb9bee6-4b17-11e4-941e-5254006e85c2

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: utopiabound Nathaniel Clark
            Votes: 0
            Watchers: 4
