
Test failure sanity-lfsck test_18d: umount mds hung

Details

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.6.0

    Description

      This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

      This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c.

      The sub-test test_18d failed with the following error:

      test failed to respond and timed out

      Info required for matching: sanity-lfsck 18d

      MDS syslog:

      umount        D 0000000000000000     0 19510  19509 0x00000080
       ffff880051eaf8b8 0000000000000082 0000000000000000 ffff880051eaf87c
       0000000000000282 0000000000000282 ffff880051eaf858 ffffffff8108410c
       ffff8800554de5f8 ffff880051eaffd8 000000000000fbc8 ffff8800554de5f8
      Call Trace:
       [<ffffffff8108410c>] ? lock_timer_base+0x3c/0x70
       [<ffffffff815291c2>] schedule_timeout+0x192/0x2e0
       [<ffffffff81084220>] ? process_timeout+0x0/0x10
       [<ffffffff8152932e>] schedule_timeout_uninterruptible+0x1e/0x20
       [<ffffffffa123ddea>] dnode_special_close+0x2a/0x60 [zfs]
       [<ffffffffa1232652>] dmu_objset_evict+0x92/0x400 [zfs]
       [<ffffffffa1243c50>] dsl_dataset_evict+0x30/0x1b0 [zfs]
       [<ffffffffa1223dd9>] dbuf_evict_user+0x49/0x80 [zfs]
       [<ffffffffa1225087>] dbuf_rele_and_unlock+0xf7/0x1e0 [zfs]
       [<ffffffffa12254e0>] dmu_buf_rele+0x30/0x40 [zfs]
       [<ffffffffa1249170>] dsl_dataset_disown+0xb0/0x1d0 [zfs]
       [<ffffffffa1231751>] dmu_objset_disown+0x11/0x20 [zfs]
       [<ffffffffa18f690e>] udmu_objset_close+0x2e/0x40 [osd_zfs]
       [<ffffffffa18f4f86>] osd_device_fini+0x366/0x5c0 [osd_zfs]
       [<ffffffffa0d9dd53>] class_cleanup+0x573/0xd30 [obdclass]
       [<ffffffffa0d757a6>] ? class_name2dev+0x56/0xe0 [obdclass]
       [<ffffffffa0d9fa7a>] class_process_config+0x156a/0x1ad0 [obdclass]
       [<ffffffffa0d97d53>] ? lustre_cfg_new+0x2d3/0x6e0 [obdclass]
       [<ffffffffa0da0159>] class_manual_cleanup+0x179/0x6f0 [obdclass]
       [<ffffffffa0d73c7b>] ? class_export_put+0x10b/0x2c0 [obdclass]
       [<ffffffffa18f412d>] osd_obd_disconnect+0x1bd/0x1c0 [osd_zfs]
       [<ffffffffa0da273b>] lustre_put_lsi+0x1ab/0x11a0 [obdclass]
       [<ffffffffa0daacf8>] lustre_common_put_super+0x5d8/0xbe0 [obdclass]
       [<ffffffffa0dd8c70>] server_put_super+0x180/0xe40 [obdclass]
       [<ffffffff8118b31b>] generic_shutdown_super+0x5b/0xe0
       [<ffffffff8118b406>] kill_anon_super+0x16/0x60
       [<ffffffffa0da2016>] lustre_kill_super+0x36/0x60 [obdclass]
       [<ffffffff8118bba7>] deactivate_super+0x57/0x80
       [<ffffffff811aabdf>] mntput_no_expire+0xbf/0x110
       [<ffffffff811ab72b>] sys_umount+0x7b/0x3a0
      

      Attachments

        Activity


          yong.fan nasf (Inactive) added a comment - Closing, since this issue has only been reported on very old versions.

          bzzz Alex Zhuravlev added a comment - a patch to dump referenced dnodes.

          yong.fan nasf (Inactive) added a comment - Yes, there seems to be no way to know which dnode is still referenced. The original issue was hit with the patch http://review.whamcloud.com/#/c/10223/. I am not sure whether it is specific to that patch. But since that patch has landed on master, there should be similar trouble on the master branch. It is also possible that the trouble has since been fixed incidentally by some other patch.

          bzzz Alex Zhuravlev added a comment - well, in that specific case it looks like some dnode was still referenced:

          Call Trace:
           [<ffffffff8108410c>] ? lock_timer_base+0x3c/0x70
           [<ffffffff815291c2>] schedule_timeout+0x192/0x2e0
           [<ffffffff81084220>] ? process_timeout+0x0/0x10
           [<ffffffff8152932e>] schedule_timeout_uninterruptible+0x1e/0x20
           [<ffffffffa123ddea>] dnode_special_close+0x2a/0x60 [zfs]
           [<ffffffffa1232652>] dmu_objset_evict+0x92/0x400 [zfs]
           [<ffffffffa1243c50>] dsl_dataset_evict+0x30/0x1b0 [zfs]
           [<ffffffffa1223dd9>] dbuf_evict_user+0x49/0x80 [zfs]
           [<ffffffffa1225087>] dbuf_rele_and_unlock+0xf7/0x1e0 [zfs]
           [<ffffffffa12254e0>] dmu_buf_rele+0x30/0x40 [zfs]
           [<ffffffffa1249170>] dsl_dataset_disown+0xb0/0x1d0 [zfs]
           [<ffffffffa1231751>] dmu_objset_disown+0x11/0x20 [zfs]
           [<ffffffffa18f690e>] udmu_objset_close+0x2e/0x40 [osd_zfs]
           [<ffffffffa18f4f86>] osd_device_fini+0x366/0x5c0 [osd_zfs]

          so the metadnode can't go away, which blocks umount.

          but this seems to be an old version? we don't have the udmu wrappers anymore.
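          The trace is consistent with dnode_special_close() spinning in a delay loop until every hold on the objset's special dnodes has been dropped; a leaked hold then makes that loop (and umount) wait forever. A minimal userspace model of that wait, bounded for illustration — all names here are hypothetical, not the actual ZFS code:

          ```c
          #include <stdio.h>

          /* Toy model of a dnode with an outstanding-hold count. */
          struct dnode_model {
                  int holds;              /* outstanding references */
          };

          /*
           * Model of the close-side wait: teardown cannot finish until
           * holds reaches zero.  Real ZFS code sleeps (delay(1)) each
           * iteration, which is the schedule_timeout_uninterruptible
           * frame in the stack; here each iteration simulates one
           * holder dropping its reference, and max_iters bounds the
           * loop so a "leaked" hold is detectable instead of hanging.
           * Returns 0 on clean close, -1 if the wait would never end.
           */
          static int special_close_model(struct dnode_model *dn, int max_iters)
          {
                  int iters = 0;

                  while (dn->holds > 0) {
                          if (++iters > max_iters)
                                  return -1;      /* would hang: hold leaked */
                          dn->holds--;            /* a holder releases */
                  }
                  return 0;
          }

          int main(void)
          {
                  struct dnode_model ok = { .holds = 3 };
                  struct dnode_model leaked = { .holds = 5 };

                  printf("balanced holds: %d\n", special_close_model(&ok, 10));
                  printf("leaked hold: %d\n", special_close_model(&leaked, 3));
                  return 0;
          }
          ```

          The model only illustrates why the hang is silent: nothing in the wait loop reports which hold is outstanding, which is what the dump-referenced-dnodes patch mentioned above was meant to address.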

          yong.fan nasf (Inactive) added a comment - Alex, do you have any idea about the umount MDS hang for the ZFS-based backend?
          https://testing.hpdd.intel.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c
          yong.fan nasf (Inactive) added a comment - edited

          For the failures in https://testing.hpdd.intel.com/test_sets/adb9bee6-4b17-11e4-941e-5254006e85c2: the sanity-lfsck test_18c failure is another instance of LU-5848. test_18d hung when unmounting mds4 because the former lfsck assistant thread from test_18c was blocked in dt_sync(); it is therefore a side effect of the test_18c failure (LU-5848), not the same as the original ZFS-based test_18d hang in http://maloo.whamcloud.com/test_sets/6c3d597e-e351-11e3-93d9-52540035b04c.
          yong.fan nasf (Inactive) added a comment - Another failure instance: https://testing.hpdd.intel.com/test_sets/adb9bee6-4b17-11e4-941e-5254006e85c2

          People

            Assignee: yong.fan nasf (Inactive)
            Reporter: utopiabound Nathaniel Clark
            Votes: 0
            Watchers: 4
