Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.6
    • Labels: None
    • Environment: 3.10.0-693.2.2.el7_lustre.pl1.x86_64
    • Severity: 3
    • Rank: 9223372036854775807

    Description

      We had an issue yesterday on Oak storage with Lustre 2.10.6. MDT0000 didn't crash, but the filesystem got stuck. Several stack traces showed up on oak-md1-s2 (serving MDT0000). Note: Oak uses DNE1 and a second MDT (MDT0001) is mounted on oak-md1-s1, but I didn't find any stack traces on that server. A restart of MDT0000 fixed the issue (after a workaround to mitigate LU-8992).

      My short-term plan is to upgrade Oak to 2.10.7 in a rolling fashion, but I thought it would be useful to have a ticket to track this issue. I'm also attaching the kernel logs from this server as oak-md1-s2-kernel.log, where all of the stack traces can be seen.

      First call trace was:

      Mar 29 09:38:38 oak-md1-s2 kernel: INFO: task mdt00_003:3491 blocked for more than 120 seconds.
      Mar 29 09:38:38 oak-md1-s2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      Mar 29 09:38:38 oak-md1-s2 kernel: mdt00_003       D ffffffff00000000     0  3491      2 0x00000080
      Mar 29 09:38:38 oak-md1-s2 kernel: ffff88201e3f74b8 0000000000000046 ffff88201e3c3f40 ffff88201e3f7fd8
      Mar 29 09:38:38 oak-md1-s2 kernel: ffff88201e3f7fd8 ffff88201e3f7fd8 ffff88201e3c3f40 ffff88201e3c3f40
      Mar 29 09:38:38 oak-md1-s2 kernel: ffff88101fc13248 ffff88101fc13240 fffffffe00000001 ffffffff00000000
      Mar 29 09:38:38 oak-md1-s2 kernel: Call Trace:
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816aadd5>] rwsem_down_write_failed+0x225/0x3a0
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff81332047>] call_rwsem_down_write_failed+0x17/0x30
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816a87cd>] down_write+0x2d/0x3d
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121b34f>] lod_alloc_qos.constprop.17+0x1af/0x1590 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0fa49a1>] ? qsd_op_begin0+0x181/0x940 [lquota]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ed322f>] ? ldiskfs_xattr_ibody_get+0xef/0x1a0 [ldiskfs]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc12204d1>] lod_qos_prep_create+0x1291/0x17f0 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1220bf9>] ? lod_prepare_inuse+0x1c9/0x2e0 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1220f6d>] lod_prepare_create+0x25d/0x360 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc121578e>] lod_declare_striped_create+0x1ee/0x970 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1217c04>] lod_declare_create+0x1e4/0x540 [lod]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc12828cf>] mdd_declare_create_object_internal+0xdf/0x2f0 [mdd]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1274023>] mdd_declare_create+0x53/0xe20 [mdd]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1277ec9>] mdd_create+0x879/0x1400 [mdd]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc114ab93>] mdt_reint_open+0x2173/0x3190 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0931dde>] ? upcall_cache_get_entry+0x20e/0x8f0 [obdclass]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112fad3>] ? ucred_set_jobid+0x53/0x70 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc113fa40>] mdt_reint_rec+0x80/0x210 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112131b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc1121842>] mdt_intent_reint+0x162/0x430 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc112c5ae>] mdt_intent_policy+0x43e/0xc70 [mdt]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0afc12f>] ? ldlm_resource_get+0x9f/0xa30 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0af5277>] ldlm_lock_enqueue+0x387/0x970 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b1e9e3>] ldlm_handle_enqueue0+0x9c3/0x1680 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b46bc0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ba3e92>] tgt_enqueue+0x62/0x210 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0ba7d95>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b50bf6>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b4d228>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810c4822>] ? default_wake_function+0x12/0x20
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810ba588>] ? __wake_up_common+0x58/0x90
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b54332>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffffc0b538a0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff816b4f58>] ret_from_fork+0x58/0x90
      Mar 29 09:38:38 oak-md1-s2 kernel: [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
      

      Attachments

        1. LU-12136-OSS-logs.tar.gz
          49 kB
          Stephane Thiell
        2. oak-md1-s2-kernel.log
          450 kB
          Stephane Thiell

          Activity

            [LU-12136] mdt threads blocked in mdd_create

            Thanks Alex! Agreed that this is an old ticket and is probably not relevant anymore.

            sthiell Stephane Thiell added a comment

            A number of improvements have been made in this area (LU-8367, LU-15082, LU-15724, LU-13073, LU-13513, LU-12397, etc.). It probably makes sense to check whether this is still a problem with the newer versions.

            bzzz Alex Zhuravlev added a comment

            Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/38428
            Subject: LU-12136 osp: blocked-in-precreation stats
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a3c054c4b39c49a750666698c52385d4f7133a98

            gerrit Gerrit Updater added a comment

            Hi Nathan,

            Just FYI, we haven't seen this problem on our 2.10 servers since we started using 2.10.8 + 3 patches (https://github.com/stanford-rc/lustre/commits/b2_10_oak). I'm wondering if you have "LU-12018 quota: do not start a thread under memory pressure". It's a critical patch that is missing from 2.10.8, IMHO.

            sthiell Stephane Thiell added a comment

            Any update on getting a fix for this issue?  We have hit it at NOAA at least twice, with no sign of any LBUG.  This is with lustre-2.10.7_ddn8.  Here is the call trace in case it helps:

            Oct 21 23:08:11 lfs-mds-1-1 kernel: INFO: task mdt00_002:17259 blocked for more than 120 seconds.
            Oct 21 23:08:11 lfs-mds-1-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            Oct 21 23:08:11 lfs-mds-1-1 kernel: mdt00_002       D ffff96927956d140     0 17259      2 0x00000000
            Oct 21 23:08:11 lfs-mds-1-1 kernel: Call Trace:
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa4b68e19>] schedule+0x29/0x70
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa4b6a705>] rwsem_down_write_failed+0x225/0x3a0
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa47874a7>] call_rwsem_down_write_failed+0x17/0x30
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa4b6814d>] down_write+0x2d/0x3d
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1624cef>] lod_alloc_qos.constprop.17+0x1af/0x1590 [lod]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc12d7a47>] ? qsd_op_begin0+0xc7/0x940 [lquota]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa467966d>] ? __getblk+0x2d/0x300
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1628d51>] lod_qos_prep_create+0x12b1/0x1810 [lod]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1629479>] ? lod_prepare_inuse+0x1c9/0x2e0 [lod]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc16297ed>] lod_prepare_create+0x25d/0x360 [lod]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc161e11e>] lod_declare_striped_create+0x1ee/0x970 [lod]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1620594>] lod_declare_create+0x1e4/0x550 [lod]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc168dc72>] mdd_declare_create_object_internal+0xe2/0x2f0 [mdd]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc167d1a3>] mdd_declare_create+0x53/0xe40 [mdd]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1681739>] mdd_create+0x9c9/0x1420 [mdd]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1552685>] mdt_reint_open+0x2175/0x3190 [mdt]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0ff2037>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1547403>] mdt_reint_rec+0x83/0x210 [mdt]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc152818b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc15286b2>] mdt_intent_reint+0x162/0x480 [mdt]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc152b50b>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0ff4060>] ? lustre_swab_ldlm_policy_data+0x30/0x30 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1533e70>] mdt_intent_policy+0x130/0x320 [mdt]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0fa2226>] ldlm_lock_enqueue+0x396/0x960 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0fcbcd7>] ldlm_handle_enqueue0+0xa97/0x1640 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0ff40e0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1054882>] tgt_enqueue+0x62/0x210 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc10587ba>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0ffe12b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc0ffa758>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa44d6af2>] ? default_wake_function+0x12/0x20
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa44cbb8b>] ? __wake_up_common+0x5b/0x90
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1001872>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffc1000de0>] ? ptlrpc_register_service+0xe30/0xe30 [ptlrpc]
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa44c1d21>] kthread+0xd1/0xe0
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa44c1c50>] ? insert_kthread_work+0x40/0x40
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa4b75c37>] ret_from_fork_nospec_begin+0x21/0x21
            Oct 21 23:08:11 lfs-mds-1-1 kernel:  [<ffffffffa44c1c50>] ? insert_kthread_work+0x40/0x40
            
            dauchy Nathan Dauchy (Inactive) added a comment

            Andreas, normally we don't wait on an RPC in that path; we only do when a specific OST is out of precreated objects, but once we hit that case the calling thread waits for obd_timeout.

            bzzz Alex Zhuravlev added a comment
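
            To illustrate the pattern Alex describes (one create thread holding the global QoS allocation lock while it waits for OST precreation, so every other mdt thread piles up in down_write() as in the traces above), here is a minimal, compilable userspace sketch. It is not Lustre code; the pthread lock, wait_for_precreate() and the thread count are hypothetical stand-ins for the rwsem taken in lod_alloc_qos() and the wait bounded by obd_timeout.

            #include <pthread.h>
            #include <stdio.h>
            #include <unistd.h>

            /* Global allocation lock, standing in for the rwsem the traces show
             * the mdt00_* threads blocking on in down_write(). */
            static pthread_rwlock_t alloc_lock = PTHREAD_RWLOCK_INITIALIZER;

            /* Stand-in for "wait until the OST precreate pool is refilled";
             * in the real system this wait can last up to obd_timeout. */
            static void wait_for_precreate(void)
            {
                    sleep(30);
            }

            static void *create_thread(void *arg)
            {
                    (void)arg;
                    pthread_rwlock_wrlock(&alloc_lock);  /* global lock taken...   */
                    wait_for_precreate();                /* ...and held while waiting */
                    pthread_rwlock_unlock(&alloc_lock);
                    return NULL;
            }

            int main(void)
            {
                    pthread_t tid[4];

                    /* Only the first thread makes progress; the rest block in
                     * pthread_rwlock_wrlock(), like the hung mdt threads above. */
                    for (int i = 0; i < 4; i++)
                            pthread_create(&tid[i], NULL, create_thread, NULL);
                    for (int i = 0; i < 4; i++)
                            pthread_join(tid[i], NULL);
                    printf("all creates finished\n");
                    return 0;
            }

            Built with cc -pthread, one worker sleeps while the other three sit blocked in pthread_rwlock_wrlock(), which is the same shape as the hung-task reports on oak-md1-s2 and lfs-mds-1-1.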
            bzzz Alex Zhuravlev added a comment - - edited

            Hmm, the cause for the reboot was an LBUG, AFAICS:

            Mar 29 10:28:48 oak-md1-s2 kernel: LustreError: 118086:0:(osd_handler.c:1982:osd_object_release()) LBUG
            Mar 29 10:28:48 oak-md1-s2 kernel: Pid: 118086, comm: tgt_recover_0 3.10.0-693.2.2.el7_lustre.pl1.x86_64 #1 SMP Sat Sep 30 00:44:25 PDT 2017
            Mar 29 10:28:48 oak-md1-s2 kernel: Call Trace:
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffff8103a212>] save_stack_trace_tsk+0x22/0x40
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc08247cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc082487c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc0fe3660>] osd_get_ldiskfs_dirent_param+0x0/0x130 [osd_ldiskfs]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc096ac4d>] lu_object_put+0x17d/0x3d0 [obdclass]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc0c0e738>] distribute_txn_replay_handle+0x418/0xc00 [ptlrpc]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc0b54987>] replay_request_or_update.isra.21+0x247/0x8c0 [ptlrpc]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffffc0b55616>] target_recovery_thread+0x616/0x1360 [ptlrpc]
            Mar 29 10:28:48 oak-md1-s2 kernel: [<ffffffff810b098f>] kthread+0xcf/0xe0
            

            Alex, did you ever get a chance to look into this code? It seems like we should not be holding this global mutex over RPC calls. If the OST is unavailable, or doesn't have any objects, then the call should trigger the precreate thread and return immediately, rather than blocking the MDS thread.

            adilger Andreas Dilger added a comment
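
            As a purely illustrative counterpart to the sketch above, this is roughly the non-blocking behaviour Andreas suggests: reserve an object only if one is already available, otherwise kick the precreate worker and return immediately so the caller can skip that OST. Again, struct ost_model, try_reserve_object() and the condition variable are hypothetical names for the sketch, not Lustre APIs.

            #include <errno.h>
            #include <pthread.h>

            struct ost_model {                     /* hypothetical per-OST state */
                    pthread_mutex_t lock;
                    pthread_cond_t  kick;          /* wakes the precreate worker */
                    int             precreated;    /* objects already precreated */
            };

            /* Non-blocking reservation: never sleeps waiting for the worker. */
            static int try_reserve_object(struct ost_model *ost)
            {
                    int rc = 0;

                    pthread_mutex_lock(&ost->lock);
                    if (ost->precreated > 0) {
                            ost->precreated--;               /* got one, no waiting */
                    } else {
                            pthread_cond_signal(&ost->kick); /* ask worker for more */
                            rc = -EAGAIN;                    /* caller skips this OST */
                    }
                    pthread_mutex_unlock(&ost->lock);
                    return rc;
            }

            A caller in the allocation loop would treat -EAGAIN as "skip this OST for now and try another target", so the global allocation lock is never held across a wait for precreation.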

            Hi Alex and Joe – I just attached the OSS kernel logs, thanks!
            Note: We upgraded Oak servers to 2.10.7 last night.

            sthiell Stephane Thiell added a comment

            Hi Stephane,

            Any luck in getting the OST logs as requested?

            Thanks.

            Joe

            jgmitter Joseph Gmitter (Inactive) added a comment

            People

              bzzz Alex Zhuravlev
              sthiell Stephane Thiell
              Votes: 1
              Watchers: 14