
mds-survey ASSERTION( lu_device_is_mdt(o->lo_dev) )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.7.0
    • Affects Version/s: Lustre 2.4.0, Lustre 2.5.1
    • Severity: 3
    • Rank (Obsolete): 9570

    Description

      Observed fairly repeatably when running mds-survey.sh after the first few million creates. It appears unlikely to occur when creating fewer than 100,000 objects (the default).

      2013-08-06 15:05:51 LustreError: 4379:0:(mdt_handler.c:2352:mdt_obj()) ASSERTION( lu_device_is_mdt(o->lo_dev) ) failed:
      2013-08-06 15:05:51 LustreError: 4379:0:(mdt_handler.c:2352:mdt_obj()) LBUG
      

      The failures were observed with code post-2.4.0 which includes the LU-2758 fix.
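
      For reference, a minimal sketch of the kind of invocation described above, assuming the stock lustre-iokit mds-survey script; the thread and file counts are placeholders for illustration only:

      # Hedged example (parameter names per the lustre-iokit mds-survey
      # documentation): push file_count well past the 100,000 default so the
      # run reaches the range where the assertion was observed. Run on the MDS.
      thrlo=4 thrhi=64 file_count=2000000 sh mds-survey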

      Attachments

        Activity

          [LU-3716] mds-survey ASSERTION( lu_device_is_mdt(o->lo_dev) )
          yujian Jian Yu added a comment -

          Both sanity tests 225a and 225b run mds-survey.

          I saw the following error in the test output of sanity test 225a in https://maloo.whamcloud.com/test_sets/f9ad31f6-aac8-11e3-a41c-52540035b04c :

          ERROR: Module obdecho is in use
          

          So, this is the same situation that Wang Di described above:

          Usually this LBUG means that mds-survey and other clients (probably a normal Lustre client) are accessing the same object at the same time, or that the object has been accessed by normal clients previously and the object stack is still in the server cache.

          Wang Di, should I improve mds-survey to check whether obdecho is in use or not so as to avoid the LBUG?
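
          A minimal sketch of the kind of pre-flight check being proposed, assuming it would be done from the mds-survey shell script by reading /proc/modules; the exact hook point in the script is hypothetical:

          # Hedged sketch: abort (or warn) if obdecho is already loaded with a
          # non-zero use count, since a live echo_client/object stack in the
          # server cache is what triggers the LBUG described above.
          refcnt=$(awk '$1 == "obdecho" { print $3 }' /proc/modules)
          if [ -n "$refcnt" ] && [ "$refcnt" -gt 0 ]; then
              echo "ERROR: Module obdecho is in use (use count=$refcnt)" >&2
              exit 1
          fi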


          jamesanunez James Nunez (Inactive) added a comment -

          Running 2.5.1-RC4, using ldiskfs, I ran into this assertion when running sanity test_225a. The test results are here https://maloo.whamcloud.com/test_sets/13e6472a-aba9-11e3-a696-52540035b04c, but there are no MDS logs. I was able to capture some information from the crash dump.

          From the MDS console:

          Message from syslogd@c16 at Mar 13 16:10:16 ...
           kernel:LustreError: 32557:0:(mdt_handler.c:2414:mdt_obj()) ASSERTION( lu_device_is_mdt(o->lo_dev) ) failed: 
          
          Message from syslogd@c16 at Mar 13 16:10:16 ...
           kernel:LustreError: 32557:0:(mdt_handler.c:2414:mdt_obj()) LBUG
          

          From dmesg saved by crash:

          <4>Lustre: DEBUG MARKER: == sanity test 225a: Metadata survey sanity with zero-stripe == 16:10:12 (1394752212)
          <6>Lustre: Echo OBD driver; http://www.lustre.org/
          <3>LustreError: 2019:0:(echo_client.c:1743:echo_md_lookup()) lookup MDT0000-tests: rc = -2
          <3>LustreError: 2019:0:(echo_client.c:1942:echo_md_destroy_internal()) Can't find child MDT0000-tests: rc = -2
          <3>LustreError: 2041:0:(echo_client.c:1743:echo_md_lookup()) lookup MDT0000-tests1: rc = -2
          <3>LustreError: 2041:0:(echo_client.c:1942:echo_md_destroy_internal()) Can't find child MDT0000-tests1: rc = -2
          <0>LustreError: 32557:0:(mdt_handler.c:2414:mdt_obj()) ASSERTION( lu_device_is_mdt(o->lo_dev) ) failed: 
          <0>LustreError: 32557:0:(mdt_handler.c:2414:mdt_obj()) LBUG
          <4>Pid: 32557, comm: mdt00_004
          <4>
          <4>Call Trace:
          <4> [<ffffffffa052e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          <4> [<ffffffffa052ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
          <4> [<ffffffffa0e37365>] mdt_obj+0x55/0x80 [mdt]
          <4> [<ffffffffa0e3bc46>] mdt_object_find+0x66/0x170 [mdt]
          <4> [<ffffffffa0e4ef84>] mdt_getattr_name_lock+0x7f4/0x1aa0 [mdt]
          <4> [<ffffffffa0821ba5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
          <4> [<ffffffffa0848bb6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
          <4> [<ffffffffa0823e34>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
          <4> [<ffffffffa0e504c9>] mdt_intent_getattr+0x299/0x480 [mdt]
          <4> [<ffffffffa0e3ea0e>] mdt_intent_policy+0x3ae/0x770 [mdt]
          <4> [<ffffffffa07d9511>] ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
          <4> [<ffffffffa0802abf>] ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
          <4> [<ffffffffa0e3eed6>] mdt_enqueue+0x46/0xe0 [mdt]
          <4> [<ffffffffa0e45bca>] mdt_handle_common+0x52a/0x1470 [mdt]
          <4> [<ffffffffa0e80545>] mds_regular_handle+0x15/0x20 [mdt]
          <4> [<ffffffffa0832a45>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
          <4> [<ffffffffa05403df>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
          <4> [<ffffffffa082a0e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
          <4> [<ffffffffa0833dad>] ptlrpc_main+0xaed/0x1740 [ptlrpc]
          <4> [<ffffffffa08332c0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
          <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
          <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
          <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
          <3>LustreError: 2242:0:(echo_client.c:1016:echo_device_free()) echo_client still has objects at cleanup time, wait for 1 second
          <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
          <4>
          <0>Kernel panic - not syncing: LBUG
          <4>Pid: 32557, comm: mdt00_004 Not tainted 2.6.32-431.5.1.el6_lustre.x86_64 #1
          <4>Call Trace:
          <4> [<ffffffff81527983>] ? panic+0xa7/0x16f
          <4> [<ffffffffa052eeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
          <4> [<ffffffffa0e37365>] ? mdt_obj+0x55/0x80 [mdt]
          <4> [<ffffffffa0e3bc46>] ? mdt_object_find+0x66/0x170 [mdt]
          <4> [<ffffffffa0e4ef84>] ? mdt_getattr_name_lock+0x7f4/0x1aa0 [mdt]
          <4> [<ffffffffa0821ba5>] ? lustre_msg_buf+0x55/0x60 [ptlrpc]
          <4> [<ffffffffa0848bb6>] ? __req_capsule_get+0x166/0x710 [ptlrpc]
          <4> [<ffffffffa0823e34>] ? lustre_msg_get_flags+0x34/0xb0 [ptlrpc]
          <4> [<ffffffffa0e504c9>] ? mdt_intent_getattr+0x299/0x480 [mdt]
          <4> [<ffffffffa0e3ea0e>] ? mdt_intent_policy+0x3ae/0x770 [mdt]
          <4> [<ffffffffa07d9511>] ? ldlm_lock_enqueue+0x361/0x8c0 [ptlrpc]
          <3>LustreError: 2242:0:(echo_client.c:1016:echo_device_free()) echo_client still has objects at cleanup time, wait for 1 second
          <4> [<ffffffffa0802abf>] ? ldlm_handle_enqueue0+0x4ef/0x10a0 [ptlrpc]
          <4> [<ffffffffa0e3eed6>] ? mdt_enqueue+0x46/0xe0 [mdt]
          <4> [<ffffffffa0e45bca>] ? mdt_handle_common+0x52a/0x1470 [mdt]
          <4> [<ffffffffa0e80545>] ? mds_regular_handle+0x15/0x20 [mdt]
          <4> [<ffffffffa0832a45>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
          <4> [<ffffffffa05403df>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
          <4> [<ffffffffa082a0e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
          <4> [<ffffffffa0833dad>] ? ptlrpc_main+0xaed/0x1740 [ptlrpc]
          <4> [<ffffffffa08332c0>] ? ptlrpc_main+0x0/0x1740 [ptlrpc]
          <4>Lustre: DEBUG MARKER: sanity test_225a: @@@@@@ FAIL: LBUG/LASSERT detected
          <4> [<ffffffff8109aee6>] ? kthread+0x96/0xa0
          <4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
          <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
          <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
          

          I experienced this crash both times I ran sanity test 225a.


          jamesanunez James Nunez (Inactive) added a comment -

          I ran into this assertion when running sanity test 225a: https://maloo.whamcloud.com/test_sessions/f6c5b8e6-aac8-11e3-a41c-52540035b04c .
          Unfortunately, there are no logs on the MDS. I'll try running it again to see what I can collect.


          behlendorf Brian Behlendorf added a comment -

          Yes, we do have other Lustre clients mounted during the test. It's very inconvenient to have to unmount them. It would be best if it were possible to run mds-survey safely with mounted clients.

          di.wang Di Wang added a comment -

          Usually this LBUG means that mds-survey and other clients (probably a normal Lustre client) are accessing the same object at the same time, or that the object has been accessed by normal clients previously and the object stack is still in the server cache.

          Were any normal Lustre clients attached at the same time you were running the test? I would suggest you unmount all of the clients and remount the server (to clear the cache) before running mds-survey.
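
          A minimal sketch of the suggested workaround, with placeholder mount points and device names (the actual paths depend on the local setup):

          # Hedged sketch: unmount all normal Lustre clients, then remount the
          # MDT so no cached object stacks from regular client access remain
          # before starting mds-survey.

          # on every client node:
          umount -a -t lustre

          # on the MDS node (device and mount point are illustrative only):
          umount /mnt/lustre-mdt
          mount -t lustre /dev/mdt-device /mnt/lustre-mdt

          # then run mds-survey on the MDS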


          behlendorf Brian Behlendorf added a comment -

          We have the MDT and MGS for a single Lustre filesystem running in different ZFS datasets on the same node.


          bzzz Alex Zhuravlev added a comment -

          Is it the only filesystem, or do you have a few running on the same node?

          behlendorf Brian Behlendorf added a comment (edited)

          Sorry, I meant to add the full backtrace but was distracted. Here it is.

          I've hit this failure a few more times now, and while I haven't really investigated, I'm now less inclined to think it has anything to do with the number of files. I haven't been able to reproduce the problem when the filesystem is idle aside from mds-survey. That makes a lot of sense given the stack trace, since the offending call path occurred while handling a request from a real client. When there's also a user workload on the system, I can trigger this fairly easily. We do have full crash dumps available if needed, and since this is a test system it's easy to add a debugging patch.

          2013-08-06 15:22:10 Lustre: lcz-MDT0000: Recovery over after 1:15, of 146 clients 146 recovered and 0 were evicted.
          2013-08-06 15:23:34 Lustre: Echo OBD driver; http://www.lustre.org/
          2013-08-06 15:23:34 LustreError: 4487:0:(echo_client.c:1959:echo_md_destroy_internal()) Can not unlink child tests: rc = -39
          2013-08-06 15:23:34 LustreError: 4489:0:(echo_client.c:1959:echo_md_destroy_internal()) Can not unlink child tests1: rc = -39
          2013-08-06 15:24:13 LustreError: 4557:0:(echo_client.c:1744:echo_md_lookup()) lookup tests: rc = -2
          2013-08-06 15:24:13 LustreError: 4557:0:(echo_client.c:1943:echo_md_destroy_internal()) Can't find child tests: rc = -2
          2013-08-06 15:24:13 LustreError: 4559:0:(echo_client.c:1744:echo_md_lookup()) lookup tests1: rc = -2
          2013-08-06 15:24:13 LustreError: 4559:0:(echo_client.c:1943:echo_md_destroy_internal()) Can't find child tests1: rc = -2
          2013-08-06 15:25:58 LustreError: 5279:0:(echo_client.c:1744:echo_md_lookup()) lookup tests: rc = -2
          2013-08-06 15:25:58 LustreError: 5279:0:(echo_client.c:1744:echo_md_lookup()) Skipped 2 previous similar messages
          2013-08-06 15:25:58 LustreError: 5279:0:(echo_client.c:1943:echo_md_destroy_internal()) Can't find child tests: rc = -2
          2013-08-06 15:25:58 LustreError: 5279:0:(echo_client.c:1943:echo_md_destroy_internal()) Skipped 2 previous similar messages
          2013-08-06 15:27:38 LustreError: 4181:0:(mdt_handler.c:2352:mdt_obj()) ASSERTION( lu_device_is_mdt(o->lo_dev) ) failed:
          2013-08-06 15:27:38 LustreError: 4181:0:(mdt_handler.c:2352:mdt_obj()) LBUG
          2013-08-06 15:27:38 Pid: 4181, comm: mdt01_010
          2013-08-06 15:27:38
          2013-08-06 15:27:38 Call Trace:
          2013-08-06 15:27:38  [<ffffffffa071f8f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
          2013-08-06 15:27:38  [<ffffffffa071fef7>] lbug_with_loc+0x47/0xb0 [libcfs]
          2013-08-06 15:27:38  [<ffffffffa0f9f365>] mdt_obj+0x55/0x80 [mdt]
          2013-08-06 15:27:38  [<ffffffffa0fa2db6>] mdt_object_find+0x66/0x170 [mdt]
          2013-08-06 15:27:38  [<ffffffffa0fa317c>] mdt_unpack_req_pack_rep+0x2bc/0x4d0 [mdt]
          2013-08-06 15:27:38  [<ffffffffa0fa5ef3>] mdt_intent_policy+0x353/0x720 [mdt]
          2013-08-06 15:27:38  [<ffffffffa09d559e>] ldlm_lock_enqueue+0x11e/0x960 [ptlrpc]
          2013-08-06 15:27:38  [<ffffffffa09fc2bf>] ldlm_handle_enqueue0+0x4ef/0x10b0 [ptlrpc]
          2013-08-06 15:27:38  [<ffffffffa0fa63c6>] mdt_enqueue+0x46/0xe0 [mdt]
          2013-08-06 15:27:39  [<ffffffffa0facab8>] mdt_handle_common+0x648/0x1660 [mdt]
          2013-08-06 15:27:39  [<ffffffffa0fe6145>] mds_regular_handle+0x15/0x20 [mdt]
          2013-08-06 15:27:39  [<ffffffffa0a2e738>] ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
          2013-08-06 15:27:39  [<ffffffffa072063e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
          2013-08-06 15:27:39  [<ffffffffa0731def>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
          2013-08-06 15:27:39  [<ffffffffa0a25a99>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
          2013-08-06 15:27:39  [<ffffffff81055ab3>] ? __wake_up+0x53/0x70
          2013-08-06 15:27:39  [<ffffffffa0a2face>] ptlrpc_main+0xace/0x1700 [ptlrpc]
          2013-08-06 15:27:39  [<ffffffffa0a2f000>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
          2013-08-06 15:27:39  [<ffffffff8100c0ca>] child_rip+0xa/0x20
          2013-08-06 15:27:39  [<ffffffffa0a2f000>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
          2013-08-06 15:27:39  [<ffffffffa0a2f000>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
          2013-08-06 15:27:39  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
          
          yujian Jian Yu added a comment -

          Hi Brian,

          Could you please tell me what "file_count" value you specified when running mds-survey.sh to reproduce the failure?

          I tested the following numbers separately on the latest Lustre b2_4 branch (build http://build.whamcloud.com/job/lustre-b2_4/27/), and all of the test runs passed.

          file_count=150000
          https://maloo.whamcloud.com/test_sets/b6c02f58-ff4a-11e2-a3fb-52540035b04c

          file_count=1000000
          https://maloo.whamcloud.com/test_sets/a45000fe-ff5f-11e2-a3fb-52540035b04c

          file_count=1610021
          https://maloo.whamcloud.com/test_sets/b50fc3b6-ff5f-11e2-a3fb-52540035b04c


          bzzz Alex Zhuravlev added a comment -

          Brian, can you paste the backtrace, please?

          pjones Peter Jones added a comment -

          Yu Jian

          Could you please look into this one?

          Thanks

          Peter


          People

            Assignee: yujian Jian Yu
            Reporter: behlendorf Brian Behlendorf
            Votes: 0
            Watchers: 10
