[LU-13540] MDS "lock on destroyed export", "lock timed out", client hangs Created: 08/May/20  Updated: 02/Feb/22  Resolved: 02/Feb/22

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.4
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Michael Sternberg Assignee: WC Triage
Resolution: Duplicate Votes: 1
Labels: None
Environment:

OS: CentOS 7.7.1908
Kernel: 3.10.0-1062.9.1.el7_lustre.x86_64
lustre version: lustre-2.12.4-1.el7.src.rpm - server and clients
heterogeneous IB network: servers on IB-EDR, clients on IB-

{DDR,QDR,FDR40,FDR56,EDR}

* 2 MDS (as 1 HA pair), 6 OSS (as 3 HA pairs), serving 2 Lustre file systems: carbofs2, sandbox2.

  • approx. 220 clients

Attachments: File lustre-log.1588945198.180635.dump    
Epic/Theme: Lustre-2.12
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We've been struggling for several weeks with recurring issues where at a high turnover rates of files I get MDS hangs like the following:

May  8 08:36:16 mds12 kernel: LustreError: 161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.2.4@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d827671b00/0xf489323b9f7cfa63 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 229 type: IBT flags: 0x60200400000020 nid: 172.17.2.4@o2ib remote: 0x819f729a7001b5e5 expref: 74 pid: 162427 timeout: 1597115 lvb_type: 0
May  8 08:36:16 mds12 kernel: LustreError: 162425:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffff90d7f7331c00 ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d827673cc0/0xf489323b9f7cfdf8 lrc: 3/0,0 mode: PW/PW res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 207 type: IBT flags: 0x50200400000020 nid: 172.17.2.2@o2ib remote: 0x3006f447d61f79eb expref: 26 pid: 162425 timeout: 0 lvb_type: 0
May  8 08:36:16 mds12 kernel: LustreError: 162425:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 195 previous similar messages
May  8 08:36:16 mds12 kernel: Lustre: carbofs2-MDT0001: Connection restored to ab1e3886-7712-33ee-ea50-ce69a69a491d (at 172.17.2.4@o2ib)
May  8 08:36:16 mds12 kernel: Lustre: Skipped 454 previous similar messages
May  8 08:38:04 mds12 kernel: LustreError: 161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 99s: evicting client at 172.17.1.221@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d7f6fbee40/0xf489323b9f82b37e lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5d4:0x0].0x0 bits 0x40/0x0 rrc: 211 type: IBT flags: 0x60200400000020 nid: 172.17.1.221@o2ib remote: 0x85f23451b667ea0d expref: 138 pid: 162424 timeout: 1597223 lvb_type: 0
May  8 08:38:04 mds12 kernel: LustreError: 161992:0:(ldlm_lockd.c:256:expired_lock_main()) Skipped 8 previous similar messages
May  8 08:38:54 mds12 kernel: LustreError: 161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 149s: evicting client at 172.17.2.5@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d61bebe9c0/0xf489323b9f82b53e lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5ca:0x0].0x0 bits 0x40/0x0 rrc: 197 type: IBT flags: 0x60200400000020 nid: 172.17.2.5@o2ib remote: 0xe98766af8c44579e expref: 71 pid: 162423 timeout: 1597273 lvb_type: 0
May  8 08:38:54 mds12 kernel: LustreError: 162415:0:(service.c:2128:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-172.17.1.221@o2ib: deadline 138:11s ago#012  req@ffff90ced5611f80 x1662838202338496/t0(0) o101->454af10f-c871-de3c-e61f-fa5345d10e3a@172.17.1.221@o2ib:58/0 lens 480/0 e 0 to 0 dl 1588945123 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May  8 08:38:54 mds12 kernel: LustreError: 180178:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) ### lock on destroyed export ffff90d819744000 ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d276622ac0/0xf489323b9f82ba1c lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5ca:0x0].0x0 bits 0x40/0x0 rrc: 185 type: IBT flags: 0x50200400000020 nid: 172.17.2.5@o2ib remote: 0xe98766af8c4457ac expref: 15 pid: 180178 timeout: 0 lvb_type: 0
May  8 08:38:54 mds12 kernel: LustreError: 180178:0:(ldlm_lockd.c:1348:ldlm_handle_enqueue0()) Skipped 50 previous similar messages
May  8 08:38:54 mds12 kernel: Lustre: 162443:0:(service.c:2165:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (138:11s); client may timeout.  req@ffff90ced5616300 x1662838202338560/t0(0) o101->454af10f-c871-de3c-e61f-fa5345d10e3a@172.17.1.221@o2ib:58/0 lens 480/0 e 0 to 0 dl 1588945123 ref 1 fl Interpret:/0/ffffffff rc 0/-1
May  8 08:38:54 mds12 kernel: Lustre: 162443:0:(service.c:2165:ptlrpc_server_handle_request()) Skipped 1754 previous similar messages
May  8 08:38:54 mds12 kernel: Lustre: carbofs2-MDT0001: Connection restored to 454af10f-c871-de3c-e61f-fa5345d10e3a (at 172.17.1.221@o2ib)
May  8 08:38:54 mds12 kernel: LustreError: 162415:0:(service.c:2128:ptlrpc_server_handle_request()) Skipped 7 previous similar messages
May  8 08:39:34 mds12 kernel: Lustre: carbofs2-MDT0001: Client 454af10f-c871-de3c-e61f-fa5345d10e3a (at 172.17.1.221@o2ib) reconnecting
May  8 08:39:34 mds12 kernel: Lustre: Skipped 8 previous similar messages
May  8 08:39:34 mds12 kernel: Lustre: carbofs2-MDT0001: Connection restored to 454af10f-c871-de3c-e61f-fa5345d10e3a (at 172.17.1.221@o2ib)
May  8 08:39:34 mds12 kernel: Lustre: Skipped 4 previous similar messages
May  8 08:39:58 mds12 kernel: LNet: Service thread pid 180635 was inactive for 212.81s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
May  8 08:39:58 mds12 kernel: LNet: Skipped 2 previous similar messages
May  8 08:39:58 mds12 kernel: Pid: 180635, comm: mdt00_041 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
May  8 08:39:58 mds12 kernel: Call Trace:
May  8 08:39:58 mds12 kernel: [<ffffffffc135ac90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc135b7b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc17ad5cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May  8 08:39:58 mds12 kernel: [<ffffffffc17adc50>] mdt_object_lock_internal+0x70/0x360 [mdt]
May  8 08:39:58 mds12 kernel: [<ffffffffc17adf60>] mdt_object_lock+0x20/0x30 [mdt]
May  8 08:39:58 mds12 kernel: [<ffffffffc17ed0eb>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May  8 08:39:58 mds12 kernel: [<ffffffffc179b4bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May  8 08:39:58 mds12 kernel: [<ffffffffc17b3d35>] mdt_intent_policy+0x435/0xd80 [mdt]
May  8 08:39:58 mds12 kernel: [<ffffffffc1341e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc136a516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc13f2bb2>] tgt_enqueue+0x62/0x210 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc13f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc139e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffffc13a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May  8 08:39:58 mds12 kernel: [<ffffffff928c61f1>] kthread+0xd1/0xe0
May  8 08:39:58 mds12 kernel: [<ffffffff92f8dd37>] ret_from_fork_nospec_end+0x0/0x39
May  8 08:39:58 mds12 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May  8 08:39:58 mds12 kernel: LustreError: dumping log to /tmp/lustre-log.1588945198.180635
May  8 08:39:59 mds12 kernel: Pid: 180647, comm: mdt00_053 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
May  8 08:39:59 mds12 kernel: Call Trace:
May  8 08:39:59 mds12 kernel: [<ffffffffc135ac90>] ldlm_completion_ast+0x430/0x860 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc135b7b1>] ldlm_cli_enqueue_local+0x231/0x830 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc17ad5cb>] mdt_object_local_lock+0x50b/0xb20 [mdt]
May  8 08:39:59 mds12 kernel: [<ffffffffc17adc50>] mdt_object_lock_internal+0x70/0x360 [mdt]
May  8 08:39:59 mds12 kernel: [<ffffffffc17adf60>] mdt_object_lock+0x20/0x30 [mdt]
May  8 08:39:59 mds12 kernel: [<ffffffffc17ed0eb>] mdt_brw_enqueue+0x44b/0x760 [mdt]
May  8 08:39:59 mds12 kernel: [<ffffffffc179b4bf>] mdt_intent_brw+0x1f/0x30 [mdt]
May  8 08:39:59 mds12 kernel: [<ffffffffc17b3d35>] mdt_intent_policy+0x435/0xd80 [mdt]
May  8 08:39:59 mds12 kernel: [<ffffffffc1341e06>] ldlm_lock_enqueue+0x356/0xa20 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc136a516>] ldlm_handle_enqueue0+0xa56/0x15f0 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc13f2bb2>] tgt_enqueue+0x62/0x210 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc13f99ca>] tgt_request_handle+0xada/0x1570 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc139e47b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffffc13a1de4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
May  8 08:39:59 mds12 kernel: [<ffffffff928c61f1>] kthread+0xd1/0xe0
May  8 08:39:59 mds12 kernel: [<ffffffff92f8dd37>] ret_from_fork_nospec_end+0x0/0x39
May  8 08:39:59 mds12 kernel: [<ffffffffffffffff>] 0xffffffffffffffff
May  8 08:39:59 mds12 kernel: Pid: 180247, comm: mdt00_035 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
May  8 08:39:59 mds12 kernel: Call Trace:

It's been frustratingly difficult to pinpoint a single cause and reproduce those hangs. The stacktrace led me to the following possibly related issues:

Thus, I've tried:

  • Increasing or decreasing mds.MDS.mdt.threads_max.
    • Default was 120. I (perhaps unwisely) increased to 240, then set back to 120. Current value is 80.
  • Setting enable_remote_rename to 0
  • Disabling DoM for new files.
    • DoM was originally active, at max 64KB on MDT.
    • I recently disabled it by lfs setstripe on directories.

There are no lustre-log dumps on the first MDS mds11, and the following on the second MDS mds12.

Counts of entry time stamps near the crash time are:

1588930000 = 2020-05-08 04:26:40 CDT -0500
1588940000 = 2020-05-08 07:13:20 CDT -0500
1588950000 = 2020-05-08 10:00:00 CDT -0500
# grep -E -c ':15889[345]....\.' /tmp/lustre-log.15889*[0-9].dump
/tmp/lustre-log.1588945198.180635.dump:168133
/tmp/lustre-log.1588945199.180195.dump:131
/tmp/lustre-log.1588945459.92545.dump:2283
/tmp/lustre-log.1588945461.162415.dump:31
/tmp/lustre-log.1588948374.168310.dump:293947
/tmp/lustre-log.1588948526.162419.dump:218
/tmp/lustre-log.1588949132.162453.dump:3906
/tmp/lustre-log.1588949280.92548.dump:7764
/tmp/lustre-log.1588949296.180233.dump:51
/tmp/lustre-log.1588949447.180235.dump:396
/tmp/lustre-log.1588949599.180642.dump:365
/tmp/lustre-log.1588949734.180631.dump:395
/tmp/lustre-log.1588949886.180651.dump:339
/tmp/lustre-log.1588950037.24710.dump:662
/tmp/lustre-log.1588950488.180208.dump:6239
/tmp/lustre-log.1588950643.162456.dump:4
/tmp/lustre-log.1588950795.180657.dump:243
/tmp/lustre-log.1588951549.180643.dump:1670
/tmp/lustre-log.1588952274.180185.dump:2231
/tmp/lustre-log.1588952278.168309.dump:12
/tmp/lustre-log.1588952282.180237.dump:26
/tmp/lustre-log.1588952425.162437.dump:788
/tmp/lustre-log.1588952433.24712.dump:20
/tmp/lustre-log.1588952581.180200.dump:384
/tmp/lustre-log.1588952585.162224.dump:14
/tmp/lustre-log.1588952737.180196.dump:478
/tmp/lustre-log.1588952909.180649.dump:1548

Excerpt from near the crash time, around first error:

# grep -E ':15889[345]....\.' /tmp/lustre-log.15889*[0-9].dump | sort -t: -k5,7n
...
/tmp/lustre-log.1588945198.180635.dump:00000004:00080000:3.0:1588944970.333403:0:122652:0:(osp_object.c:1592:osp_create()) carbofs2-OST0005-osc-MDT0001: Wrote last used FID: [0x380000400:0xf75cd:0x0], index 5: 0
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944970.334436:0:161976:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) carbofs2-MDT0001: transno 193291912093 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:3.0:1588944970.334684:0:122652:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912094
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:3.0:1588944970.335904:0:122652:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912095
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944970.375259:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 197573715030
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944971.150971:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 197573715038
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944971.151852:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715038 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944971.151861:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715033 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944971.151867:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715026 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944971.966343:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912098
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944971.968021:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912100
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944971.969193:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912102
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944972.404425:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 197573715052
/tmp/lustre-log.1588945198.180635.dump:00000004:00080000:7.0:1588944972.955662:0:92555:0:(osp_object.c:1592:osp_create()) carbofs2-OST0008-osc-MDT0001: Wrote last used FID: [0x480000400:0xf780d:0x0], index 8: 0
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944972.957507:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912104
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944972.958735:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912105
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944974.376516:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 197573715066
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944974.595200:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912108
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944974.596092:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912110
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944974.597258:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912112
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944975.133486:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 197573715072
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944975.486761:0:161976:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) carbofs2-MDT0001: transno 193291912113 is committed
/tmp/lustre-log.1588945198.180635.dump:00000004:00080000:0.0:1588944975.565523:0:92555:0:(osp_object.c:1592:osp_create()) carbofs2-OST0002-osc-MDT0001: Wrote last used FID: [0x340000400:0xf75cd:0x0], index 2: 0
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:1.0:1588944975.567362:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912115
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:1.0:1588944975.568690:0:92555:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 193291912116
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944976.478896:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715077 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944976.478911:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715064 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944976.478915:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715062 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944976.478920:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715057 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944976.478922:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715055 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:0.0:1588944976.478933:0:161974:0:(tgt_lastrcvd.c:902:tgt_cb_last_committed()) sandbox2-MDT0001: transno 197573715042 is committed
/tmp/lustre-log.1588945198.180635.dump:00000001:00080000:7.0:1588944976.494952:0:168303:0:(tgt_lastrcvd.c:1873:tgt_txn_stop_cb()) More than one transaction 197573715078
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.790931:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.2.4@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d827671b00/0xf489323b9f7cfa63 lr
c: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 229 type: IBT flags: 0x60200400000020 nid: 172.17.2.4@o2ib remote: 0x819f729a7001b5e5 expref: 74 pid: 162427 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795299:0:161992:0:(genops.c:1662:class_fail_export()) disconnecting export ffff90d80e74e400/89c3daaf-89ab-87e8-0414-0b4c5a62ec8e
/tmp/lustre-log.1588945198.180635.dump:00000020:00000080:8.0:1588944976.795310:0:161992:0:(genops.c:1501:class_disconnect()) disconnect: cookie 0xf489323b396e171f
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795456:0:161992:0:(genops.c:1679:class_fail_export()) disconnected export ffff90d80e74e400/89c3daaf-89ab-87e8-0414-0b4c5a62ec8e
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795466:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.2.5@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d5dfcf9d40/0xf489323b9f7cfb35 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 225 type: IBT flags: 0x60200400000020 nid: 172.17.2.5@o2ib remote: 0xe98766af8c442ce1 expref: 396 pid: 180230 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795489:0:161992:0:(genops.c:1662:class_fail_export()) disconnecting export ffff90d7f7330c00/7685d096-da00-7900-01cd-d51973277575
/tmp/lustre-log.1588945198.180635.dump:00000020:00000080:8.0:1588944976.795492:0:161992:0:(genops.c:1501:class_disconnect()) disconnect: cookie 0xf489323b396e1589
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795644:0:161992:0:(genops.c:1679:class_fail_export()) disconnected export ffff90d7f7330c00/7685d096-da00-7900-01cd-d51973277575
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795652:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.1.227@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90cfe0a10fc0/0xf489323b9f7cfb58 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 220 type: IBT flags: 0x60200400000020 nid: 172.17.1.227@o2ib remote: 0xf0c3059052dfd0f8 expref: 423 pid: 162415 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795663:0:161992:0:(genops.c:1662:class_fail_export()) disconnecting export ffff90d3dd5f4400/e294594d-e357-7d83-07c6-94cee77191e2
/tmp/lustre-log.1588945198.180635.dump:00000020:00000080:8.0:1588944976.795666:0:161992:0:(genops.c:1501:class_disconnect()) disconnect: cookie 0xf489323b396e144e
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795715:0:161992:0:(genops.c:1679:class_fail_export()) disconnected export ffff90d3dd5f4400/e294594d-e357-7d83-07c6-94cee77191e2
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795722:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.1.227@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90ce9bb90000/0xf489323b9f7cfb66 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 220 type: IBT flags: 0x60200400000020 nid: 172.17.1.227@o2ib remote: 0xf0c3059052dfd0f1 expref: 420 pid: 162417 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795732:0:161992:0:(genops.c:1657:class_fail_export()) disconnecting dead export ffff90d3dd5f4400/e294594d-e357-7d83-07c6-94cee77191e2; skipping
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795738:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.1.227@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90cd4ae01b00/0xf489323b9f7cfba5 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 220 type: IBT flags: 0x60200400000020 nid: 172.17.1.227@o2ib remote: 0xf0c3059052dfd0ff expref: 420 pid: 162220 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795748:0:161992:0:(genops.c:1657:class_fail_export()) disconnecting dead export ffff90d3dd5f4400/e294594d-e357-7d83-07c6-94cee77191e2; skipping
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795754:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.2.2@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d5dfcfec00/0xf489323b9f7cfc70 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5b6:0x0].0x0 bits 0x40/0x0 rrc: 220 type: IBT flags: 0x60200400000020 nid: 172.17.2.2@o2ib remote: 0x3006f447d61f79e4 expref: 78 pid: 180225 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795765:0:161992:0:(genops.c:1662:class_fail_export()) disconnecting export ffff90d7f7331c00/644262f6-0235-7572-f194-d4689a69670b
/tmp/lustre-log.1588945198.180635.dump:00000020:00000080:8.0:1588944976.795767:0:161992:0:(genops.c:1501:class_disconnect()) disconnect: cookie 0xf489323b396e15cf
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795814:0:161992:0:(genops.c:1679:class_fail_export()) disconnected export ffff90d7f7331c00/644262f6-0235-7572-f194-d4689a69670b
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795822:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.2.10@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d5cee0ba80/0xf489323b9f7cf77d lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5c0:0x0].0x0 bits 0x40/0x0 rrc: 204 type: IBT flags: 0x60200400000020 nid: 172.17.2.10@o2ib remote: 0x70316307a2048b75 expref: 66 pid: 180231 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795861:0:161992:0:(genops.c:1662:class_fail_export()) disconnecting export ffff90d80cbae400/629b4428-94ac-83ba-f61c-69aa684bee44
/tmp/lustre-log.1588945198.180635.dump:00000020:00000080:8.0:1588944976.795865:0:161992:0:(genops.c:1501:class_disconnect()) disconnect: cookie 0xf489323b396e1703
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795920:0:161992:0:(genops.c:1679:class_fail_export()) disconnected export ffff90d80cbae400/629b4428-94ac-83ba-f61c-69aa684bee44
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795926:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.2.10@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d51f3cb180/0xf489323b9f7cf78b lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5c0:0x0].0x0 bits 0x40/0x0 rrc: 204 type: IBT flags: 0x60200400000020 nid: 172.17.2.10@o2ib remote: 0x70316307a2048b7c expref: 63 pid: 162423 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795936:0:161992:0:(genops.c:1657:class_fail_export()) disconnecting dead export ffff90d80cbae400/629b4428-94ac-83ba-f61c-69aa684bee44; skipping
/tmp/lustre-log.1588945198.180635.dump:00010000:00020000:8.0:1588944976.795943:0:161992:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 100s: evicting client at 172.17.1.224@o2ib  ns: mdt-carbofs2-MDT0001_UUID lock: ffff90d36f8bad00/0xf489323b9f7cf784 lrc: 3/0,0 mode: PR/PR res: [0x24000b480:0xc5c0:0x0].0x0 bits 0x40/0x0 rrc: 204 type: IBT flags: 0x60200400000020 nid: 172.17.1.224@o2ib remote: 0xed5de96130f26ae expref: 108 pid: 180199 timeout: 1597115 lvb_type: 0
/tmp/lustre-log.1588945198.180635.dump:00000020:00080000:8.0:1588944976.795952:0:161992:0:(genops.c:1662:class_fail_export()) disconnecting export ffff90d7f7334000/30180198-3c05-91ca-6f14-1fc4bc7a164e
...


 Comments   
Comment by Michael Sternberg [ 08/May/20 ]

PS: Similar to several of the LU issues referenced, stopping and restarting the MDT mounts will usually "unwedge" the file system.

Comment by Oleg Drokin [ 02/Feb/22 ]

this is a dup of LU-14522

Generated at Sat Feb 10 03:02:08 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.