Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-8399

MDT hung at lu_object_find_at during umount

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.9.0
    • Lustre 2.9.0
    • None
    • autotest review-dne
    • 3
    • 9223372036854775807

    Description

      sanity-scrub test_1c and test_4a are timing out with a hang in unmount of a MDT. All failures so far are in review-dne.

      From the test_log, we see the test is umounting MDTs:

      …
      stop mds1
      CMD: onyx-33vm7 grep -c /mnt/lustre-mds1' ' /proc/mounts
      CMD: onyx-33vm7 umount -d /mnt/lustre-mds1
      CMD: onyx-33vm7 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      stop mds2
      CMD: onyx-33vm3 grep -c /mnt/lustre-mds2' ' /proc/mounts
      CMD: onyx-33vm3 umount -d /mnt/lustre-mds2
      CMD: onyx-33vm3 lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
      stop mds3
      CMD: onyx-33vm7 grep -c /mnt/lustre-mds3' ' /proc/mounts
      CMD: onyx-33vm7 umount -d /mnt/lustre-mds3
      

      Looking at the MDS that is in the process of unmounting an MDT, in the console logs, we see errors when trying to unmount and the stack trace:

      06:49:24:[12121.531608] Lustre: DEBUG MARKER: umount -d /mnt/lustre-mds3
      06:49:24:[12141.673117] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:49:24:[12141.681956] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:49:24:[12141.688142] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:49:24:[12161.692141] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:49:24:[12161.701736] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:49:24:[12161.708145] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:49:24:[12181.712143] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:49:24:[12181.722112] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:49:24:[12181.729016] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:49:24:[12201.733130] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:49:24:[12201.742520] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:49:24:[12201.749824] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:49:24:[12221.754157] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:49:24:[12221.763268] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:49:24:[12221.770918] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:49:24:[12241.776150] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:49:24:[12241.784285] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:49:24:[12241.792494] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:49:24:[12261.798128] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:54:55:[12261.805202] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:54:55:[12261.813777] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:54:55:[12301.819116] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:54:55:[12301.825881] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 1 previous similar message
      06:54:55:[12301.831998] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:54:55:[12301.840643] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) Skipped 1 previous similar message
      06:54:55:[12301.843920] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      06:54:55:[12301.849581] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) Skipped 1 previous similar message
      06:54:55:[12360.587118] INFO: task osp_up0-2:22450 blocked for more than 120 seconds.
      06:54:55:[12360.592257] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      06:54:55:[12360.595199] osp_up0-2       D ffffc90004019000     0 22450      2 0x00000080
      06:54:55:[12360.597925]  ffff88007a277a98 0000000000000046 ffff8800441b8000 ffff88007a277fd8
      06:54:55:[12360.600858]  ffff88007a277fd8 ffff88007a277fd8 ffff8800441b8000 ffff88004435c240
      06:54:55:[12360.603645]  ffff8800435be000 ffff880042fd6020 0000000000000000 ffffc90004019000
      06:54:55:[12360.606403] Call Trace:
      06:54:55:[12360.608738]  [<ffffffff8163ba29>] schedule+0x29/0x70
      06:54:55:[12360.611237]  [<ffffffffa07f797d>] lu_object_find_at+0x4d/0xe0 [obdclass]
      06:54:55:[12360.613840]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
      06:54:55:[12360.616305]  [<ffffffffa07f7e3f>] lu_object_find_slice+0x1f/0x90 [obdclass]
      06:54:55:[12360.618842]  [<ffffffffa0f7c5f4>] osp_trans_stop_cb+0x1b4/0x2c0 [osp]
      06:54:55:[12360.621315]  [<ffffffffa0f7eeeb>] osp_update_interpret+0x21b/0x4b0 [osp]
      06:54:55:[12360.623814]  [<ffffffffa0a0d725>] ptlrpc_check_set.part.23+0x425/0x1dd0 [ptlrpc]
      06:54:55:[12360.626273]  [<ffffffff8108bf50>] ? internal_add_timer+0x70/0x70
      06:54:55:[12360.628649]  [<ffffffffa0a0f12b>] ptlrpc_check_set+0x5b/0xe0 [ptlrpc]
      06:54:55:[12360.631038]  [<ffffffffa0a0f6a1>] ptlrpc_set_wait+0x4f1/0x900 [ptlrpc]
      06:54:55:[12360.633299]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
      06:54:55:[12360.635543]  [<ffffffffa0a0fb2d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
      06:54:55:[12360.637787]  [<ffffffffa0f7f6b2>] osp_send_update_req+0x1c2/0x830 [osp]
      06:54:55:[12360.640042]  [<ffffffffa0f80573>] osp_send_update_thread+0x233/0x5e0 [osp]
      06:54:55:[12360.642227]  [<ffffffff810b88c0>] ? wake_up_state+0x20/0x20
      06:54:55:[12360.644346]  [<ffffffffa0f80340>] ? osp_invalidate_request+0x370/0x370 [osp]
      06:54:55:[12360.646486]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
      06:54:55:[12360.648492]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
      06:54:55:[12360.650536]  [<ffffffff816469d8>] ret_from_fork+0x58/0x90
      06:54:55:[12360.652551]  [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
      06:54:55:[12372.704108] Lustre: 6959:0:(client.c:2113:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1468417841/real 1468417841]  req@ffff880054be9500 x1539734990553872/t0(0) o250->MGC10.2.4.126@tcp@0@lo:26/25 lens 520/544 e 0 to 1 dl 1468417866 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
      06:54:55:[12372.714842] Lustre: 6959:0:(client.c:2113:ptlrpc_expire_one_request()) Skipped 46 previous similar messages
      06:54:55:[12381.852134] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: rc = -110 waiting for callback (1 != 0)
      06:54:55:[12381.861581] LustreError: 23989:0:(import.c:338:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
      06:54:55:[12381.864156] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) @@@ still on delayed list  req@ffff880054be8300 x1539734990549200/t0(0) o1000->lustre-MDT0000-osp-MDT0002@0@lo:24/4 lens 488/192 e 0 to 0 dl 1468417620 ref 2 fl Interpret:ES/0/0 rc -5/-107
      06:54:55:[12381.870152] LustreError: 23989:0:(import.c:372:ptlrpc_invalidate_import()) Skipped 3 previous similar messages
      06:54:55:[12381.872522] LustreError: 23989:0:(import.c:378:ptlrpc_invalidate_import()) lustre-MDT0000_UUID: Unregistering RPCs found (0). Network is sluggish? Waiting them to error out.
      …
      

      sanity-scrub test_1c started failing with this error on July 13, 2016. Logs for some of these failures are at:
      https://testing.hpdd.intel.com/test_sets/c22c1f38-48c5-11e6-bf87-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/1bf26ff2-48eb-11e6-bf87-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/055ea44e-4900-11e6-bf87-5254006e85c2

      sanity-scrub test_4a started failing with this error on July 13, 2016. Logs for some of these failures are at:
      https://testing.hpdd.intel.com/test_sets/a3238bf4-48d8-11e6-9f8e-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/61c003cc-492d-11e6-8968-5254006e85c2
      https://testing.hpdd.intel.com/test_sets/cdb52f20-492e-11e6-8968-5254006e85c2

      Attachments

        Issue Links

          Activity

            People

              yong.fan nasf (Inactive)
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: