Lustre / LU-6312

soft lockup in sanity-lfsck test_18f


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: Lustre 2.7.0, Lustre 2.8.0
    • Fix Version/s: Lustre 2.7.0, Lustre 2.8.0
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 17667

    Description

      I was running sanity-lfsck on master (v2_7_50_0-3-g7ed514f) and hit a soft lockup in test_18f:

      Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 sleeping for 10000ms
      Feb 27 05:01:30 LustreError: 8252:0:(fail.c:132:__cfs_fail_timeout_set()) Skipped 1 previous similar message
      Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) cfs_fail_timeout id 1602 awake
      Feb 27 05:01:40 LustreError: 8252:0:(fail.c:136:__cfs_fail_timeout_set()) Skipped 1 previous similar message
      Feb 27 05:01:43 Lustre: DEBUG MARKER: == sanity-lfsck test 18f: Skip the failed OST(s) when handle orphan OST-objects ====================== 05:01:42 (1425038502)
      Feb 27 05:01:43 Pid: 7574, comm: mdt01_002
      Feb 27 05:01:43 Call Trace:
      Feb 27 05:01:43 [<ffffffffa0f1e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      Feb 27 05:01:43 [<ffffffffa0ff462f>] osd_trans_start+0x63f/0x660 [osd_ldiskfs]
      Feb 27 05:01:43 [<ffffffffa11eea1f>] lod_trans_start+0x9f/0x190 [lod]
      Feb 27 05:01:43 [<ffffffffa10e0624>] mdd_trans_start+0x14/0x20 [mdd]
      Feb 27 05:01:43 [<ffffffffa10cbd05>] mdd_create+0xbf5/0x1730 [mdd]
      Feb 27 05:01:43 [<ffffffffa113ecb8>] mdo_create+0x18/0x50 [mdt]
      Feb 27 05:01:43 [<ffffffffa1148e6f>] mdt_reint_open+0x1f8f/0x2c70 [mdt]
      Feb 27 05:01:43 [<ffffffffa0f3b84c>] ? upcall_cache_get_entry+0x29c/0x880 [libcfs]
      Feb 27 05:01:43 [<ffffffffa11300cd>] mdt_reint_rec+0x5d/0x200 [mdt]
      Feb 27 05:01:43 [<ffffffffa111423b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
      Feb 27 05:01:43 [<ffffffffa1114706>] mdt_intent_reint+0x1f6/0x430 [mdt]
      Feb 27 05:01:43 [<ffffffffa1112cf4>] mdt_intent_policy+0x494/0xce0 [mdt]
      Feb 27 05:01:43 [<ffffffffa0af94f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0b2566b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0ba6331>] tgt_enqueue+0x61/0x230 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0ba6f7e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0b569e1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
      Feb 27 05:01:43 [<ffffffffa0b55ba0>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
      Feb 27 05:01:43 [<ffffffff8109abf6>] kthread+0x96/0xa0
      Feb 27 05:01:43 [<ffffffff8100c20a>] child_rip+0xa/0x20
      Feb 27 05:01:43 [<ffffffff8109ab60>] ? kthread+0x0/0xa0
      Feb 27 05:01:43 [<ffffffff8100c200>] ? child_rip+0x0/0x20
      Feb 27 05:01:43 
      Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Feb 27 05:01:44 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      Feb 27 05:01:44 Lustre: *** cfs_fail_loc=1616, val=0***
      Feb 27 05:01:44 Lustre: Skipped 1 previous similar message
      Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks mdc stop
      Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc start
      Feb 27 05:01:47 Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      Feb 27 05:01:47 Lustre: *** cfs_fail_loc=161c, val=0***
      Feb 27 05:01:54 Lustre: testfs-OST0002: Client testfs-MDT0001-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:01:54 Lustre: testfs-OST0002: deleting orphan objects from 0x280000400:102 to 0x280000400:193
      Feb 27 05:01:54 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:929
      Feb 27 05:02:01 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:01 Lustre: Skipped 1 previous similar message
      Feb 27 05:02:01 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:961
      Feb 27 05:02:08 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:08 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:993
      Feb 27 05:02:15 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:15 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1025
      Feb 27 05:02:22 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:22 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1057
      Feb 27 05:02:29 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1089
      Feb 27 05:02:36 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:36 Lustre: Skipped 1 previous similar message
      Feb 27 05:02:36 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1121
      Feb 27 05:02:43 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1153
      Feb 27 05:02:50 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1185
      Feb 27 05:02:57 Lustre: testfs-OST0000: Client testfs-MDT0000-mdtlov_UUID (at 0@lo) reconnecting
      Feb 27 05:02:57 Lustre: Skipped 2 previous similar messages
      Feb 27 05:02:57 Lustre: testfs-OST0000: deleting orphan objects from 0x0:901 to 0x0:1217
      Feb 27 05:03:00 BUG: soft lockup - CPU#0 stuck for 67s! [ll_ost00_015:7824]
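
      The cfs_fail_timeout lines at the top of the log come from libcfs fault injection: the test arms a fail point by setting fail_loc (via lctl set_param), and the matching code path in fail.c then sleeps for the configured interval, printing the "sleeping for"/"awake" messages. The following is a rough userspace mock of that pattern, illustrative only; fail_timeout/fail_ms here are simplified stand-ins, not the real libcfs API:

      /* build: cc -o faildemo faildemo.c
       * Rough mock of the fail-point pattern (NOT the real libcfs fail.c):
       * a code path sleeps only when the globally set fail_loc matches the
       * id compiled into that path. */
      #include <stdio.h>
      #include <time.h>

      static unsigned long fail_loc = 0x1602;  /* armed by the test via lctl */
      static unsigned int  fail_ms  = 10000;   /* delay in milliseconds */

      static int fail_timeout(unsigned long id)
      {
              struct timespec ts;

              if (fail_loc != id)
                      return 0;               /* fail point not armed */

              printf("cfs_fail_timeout id %lx sleeping for %ums\n", id, fail_ms);
              ts.tv_sec  = fail_ms / 1000;
              ts.tv_nsec = (fail_ms % 1000) * 1000000L;
              nanosleep(&ts, NULL);
              printf("cfs_fail_timeout id %lx awake\n", id);
              return 1;
      }

      int main(void)
      {
              fail_timeout(0x1602);   /* armed: sleeps 10s, as in the log */
              fail_timeout(0x1234);   /* not armed: returns immediately */
              return 0;
      }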
      

      The stack trace is a bit strange, since it lists all of the addresses on the stack as "uncertain" (the leading '?' marks frames that the unwinder found on the stack but could not verify as part of the active call chain):

      Pid: 7824, comm: ll_ost00_015 Tainted: P           ---------------    2.6.32-431.29.2.el6_lustre.g36cd22b.x86_64
      RIP: 0010:[<ffffffff8152c10e>]  [<ffffffff8152c10e>] _spin_lock+0x1e/0x30
      Process ll_ost00_015 (pid: 7824, threadinfo ffff8800ca29a000, task ffff8800d91e0040)
      Call Trace:
       ? lfsck_layout_slave_in_notify+0x210/0x15b0 [lfsck]
       ? null_alloc_rs+0xf3/0x390 [ptlrpc]
       ? sptlrpc_svc_alloc_rs+0x74/0x360 [ptlrpc]
       ? lustre_msg_buf+0x55/0x60 [ptlrpc]
       ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
       ? lfsck_in_notify+0xef/0x330 [lfsck]
       ? tgt_handle_lfsck_notify+0x64/0x150 [ptlrpc]
       ? tgt_request_handle+0x8be/0x1000 [ptlrpc]
       ? ptlrpc_main+0xe41/0x1960 [ptlrpc]
      

      The test did not progress beyond this point, and the watchdog kept reporting the same process as stuck for many hours.
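
      The RIP in _spin_lock together with lfsck_layout_slave_in_notify() at the top of the trace suggests the service thread is spinning on a spinlock that some earlier path acquired and never dropped. A minimal userspace sketch of that failure pattern follows (assumptions: pthread spinlocks standing in for kernel spinlocks, and a hypothetical leaked-lock path; this is not the actual lfsck code):

      /* build: cc -pthread -o lockdemo lockdemo.c
       * This program hangs by design: the main thread leaks the lock, so
       * the worker spins forever. In the kernel the same pattern pins the
       * CPU and triggers "BUG: soft lockup". */
      #include <pthread.h>
      #include <stdio.h>

      static pthread_spinlock_t lock;

      static void *notify_handler(void *arg)
      {
              (void)arg;
              pthread_spin_lock(&lock);       /* spins forever: lock leaked */
              puts("handler ran");            /* never reached */
              pthread_spin_unlock(&lock);
              return NULL;
      }

      int main(void)
      {
              pthread_t t;

              pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
              pthread_spin_lock(&lock);       /* error path "forgot" to unlock */
              pthread_create(&t, NULL, notify_handler, NULL);
              pthread_join(&t, NULL);         /* never returns */
              return 0;
      }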


          People

            Assignee: yong.fan (nasf, Inactive)
            Reporter: adilger (Andreas Dilger)
