[LU-9988] Sometimes the MDS stops serving requests, but unmounting and remounting the MDT makes it work again Created: 14/Sep/17  Updated: 21/Dec/17  Resolved: 21/Dec/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: songzhlong Assignee: WC Triage
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

Lustre version on the clients: 2.5.1
Lustre version on the MDS: 2.5.3


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Sometimes the MDS stops serving requests. How can this be fixed?
The MDS dmesg output looks like this:
Sep 8 02:44:05 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 2 previous similar messages
Sep 8 02:44:05 mds0 kernel: LustreError: 6956:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881022d83800 ns: mdt-THFS-MDT0000_UUID lock: ffff88101be459c0/0x81ad926eaa8a4f44 lrc: 3/0,0 mode: CR/CR res: [0x20000f70d:0x3e:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 12.0.5.19@tcp1 remote: 0x721abb56a3b64281 expref: 3 pid: 6956 timeout: 0 lvb_type: 0
Sep 8 02:44:05 mds0 kernel: Lustre: 7357:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:14871s); client may timeout. req@ffff8810011a0c00 x1573500607175288/t236318034970(0) o101->44dc37a9-770f-5801-6327-54280e43793f@12.0.2.94@tcp1:0/0 lens 584/600 e 0 to 0 dl 1504794974 ref 1 fl Complete:/0/0 rc -107/-107
Sep 8 02:44:05 mds0 kernel: Lustre: 7357:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 404 previous similar messages
Sep 8 02:44:05 mds0 kernel: LNet: Service thread pid 7357 completed after 15625.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 8 02:44:05 mds0 kernel: LNet: Skipped 2 previous similar messages
Sep 8 02:44:05 mds0 kernel: LustreError: 7357:0:(service.c:2007:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-12.0.3.0@tcp1: deadline 755:893s ago
Sep 8 02:44:05 mds0 kernel: req@ffff881003fd0400 x1573501172011116/t0(0) o101->f8b0131d-9be5-85ce-cf36-173369e67ace@12.0.3.0@tcp1:0/0 lens 584/0 e 0 to 0 dl 1504808952 ref 1 fl Interpret:/2/ffffffff rc 0/-1
Sep 8 02:44:05 mds0 kernel: LustreError: 7357:0:(service.c:2007:ptlrpc_server_handle_request()) Skipped 284 previous similar messages
Sep 8 02:44:05 mds0 kernel: LustreError: 6956:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Sep 8 02:46:34 mds0 kernel: Lustre: 7280:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Sep 8 02:46:34 mds0 kernel: req@ffff88102370bc00 x1573501624739012/t0(0) o101->6a87f33d-82c7-dcec-e57f-0df777a4974e@12.0.4.128@tcp1:0/0 lens 576/0 e 0 to 0 dl 1504809999 ref 2 fl New:/0/ffffffff rc 0/-1
Sep 8 02:46:34 mds0 kernel: Lustre: 7280:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 155 previous similar messages
Sep 8 02:52:23 mds0 kernel: Lustre: THFS-MDT0000: Client 80d94066-14b1-587f-c9b4-1ee79e3cf810 (at 12.0.2.134@tcp1) reconnecting
Sep 8 02:52:23 mds0 kernel: Lustre: Skipped 8534 previous similar messages
Sep 8 02:52:23 mds0 kernel: Lustre: THFS-MDT0000: Client 80d94066-14b1-587f-c9b4-1ee79e3cf810 (at 12.0.2.134@tcp1) refused reconnection, still busy with 1 active RPCs
Sep 8 02:52:23 mds0 kernel: Lustre: Skipped 8448 previous similar messages
Sep 8 02:54:04 mds0 kernel: Lustre: lock timed out (enqueued at 1504809244, 1200s ago)
Sep 8 02:54:04 mds0 kernel: Lustre: Skipped 5 previous similar messages
Sep 8 02:54:06 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 16227s: evicting client at 12.0.5.9@tcp1 ns: mdt-THFS-MDT0000_UUID lock: ffff881023a3b980/0x81ad926eaa82125a lrc: 3/0,0 mode: PR/PR res: [0x200004f76:0x6698:0x0].0 bits 0x13 rrc: 524 type: IBT flags: 0x60200400000020 nid: 12.0.5.9@tcp1 remote: 0xad23bcd378391d2a expref: 851 pid: 6993 timeout: 4956172002 lvb_type: 0
Sep 8 02:54:06 mds0 kernel: LustreError: 7332:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88101b680000 ns: mdt-THFS-MDT0000_UUID lock: ffff8810003a8c80/0x81ad926eaa8a56c9 lrc: 3/0,0 mode: CR/CR res: [0x20000f708:0x21:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 12.0.3.16@tcp1 remote: 0x129fa813fa73c9 expref: 3 pid: 7332 timeout: 0 lvb_type: 0
Sep 8 02:54:06 mds0 kernel: Lustre: 6986:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:15472s); client may timeout. req@ffff88101f784400 x1573501840576392/t236318035045(0) o101->902f9430-1cb5-bcd4-0b1b-7653215af491@12.0.5.147@tcp1:0/0 lens 584/600 e 0 to 0 dl 1504794974 ref 1 fl Complete:/0/0 rc -107/-107
Sep 8 02:54:06 mds0 kernel: Lustre: 6986:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 811 previous similar messages
Sep 8 02:54:06 mds0 kernel: LNet: Service thread pid 6986 completed after 16226.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 8 02:54:06 mds0 kernel: LNet: Skipped 3 previous similar messages
Sep 8 02:54:06 mds0 kernel: LustreError: 6986:0:(service.c:2007:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-12.0.2.67@tcp1: deadline 755:583s ago
Sep 8 02:54:06 mds0 kernel: req@ffff88100cf4c850 x1573500701417108/t0(0) o101->833faabe-1727-671c-00c6-93dce877c09a@12.0.2.67@tcp1:0/0 lens 576/0 e 0 to 0 dl 1504809863 ref 1 fl Interpret:/2/ffffffff rc 0/-1
Sep 8 02:54:06 mds0 kernel: LustreError: 6986:0:(service.c:2007:ptlrpc_server_handle_request()) Skipped 602 previous similar messages
Sep 8 02:54:06 mds0 kernel: LustreError: 7332:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Sep 8 02:56:35 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Sep 8 02:56:35 mds0 kernel: req@ffff88022cd0b800 x1573500512886332/t0(0) o101->ea21e094-8402-cce5-b0c6-954851181800@12.0.2.27@tcp1:0/0 lens 576/3384 e 0 to 0 dl 1504810600 ref 2 fl Interpret:/0/0 rc 0/0
Sep 8 02:56:35 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 159 previous similar messages
Sep 8 03:02:09 mds0 kernel: Lustre: THFS-MDT0000: haven't heard from client a16e49fa-13a1-36d5-748c-e80828650109 (at 12.0.4.21@tcp1) in 4089 seconds. I think it's dead, and I am evicting it. exp ffff880ff8b5e400, cur 1504810929 expire 1504810779 last 1504806840
Sep 8 03:02:09 mds0 kernel: Lustre: Skipped 1 previous similar message
Sep 8 03:02:23 mds0 kernel: Lustre: THFS-MDT0000: Client cd066d1d-c979-29a2-a453-d5878a89c5da (at 12.0.3.1@tcp1) reconnecting
Sep 8 03:02:23 mds0 kernel: Lustre: Skipped 8591 previous similar messages
Sep 8 03:02:23 mds0 kernel: Lustre: THFS-MDT0000: Client cd066d1d-c979-29a2-a453-d5878a89c5da (at 12.0.3.1@tcp1) refused reconnection, still busy with 1 active RPCs
Sep 8 03:02:23 mds0 kernel: Lustre: Skipped 8513 previous similar messages
Sep 8 03:04:05 mds0 kernel: Lustre: lock timed out (enqueued at 1504809845, 1200s ago)
Sep 8 03:04:05 mds0 kernel: Lustre: Skipped 2 previous similar messages
Sep 8 03:04:07 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 16828s: evicting client at 12.0.5.26@tcp1 ns: mdt-THFS-MDT0000_UUID lock: ffff8801e194a0c0/0x81ad926eaa821310 lrc: 3/0,0 mode: PR/PR res: [0x200004f76:0x6698:0x0].0 bits 0x13 rrc: 521 type: IBT flags: 0x60200400000020 nid: 12.0.5.26@tcp1 remote: 0xec650b58537f9b4b expref: 327 pid: 7022 timeout: 4956773004 lvb_type: 0
Sep 8 03:04:07 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 1 previous similar message
Sep 8 03:04:07 mds0 kernel: LustreError: 7199:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881022548800 ns: mdt-THFS-MDT0000_UUID lock: ffff880ff52f30c0/0x81ad926eaa8a5bed lrc: 3/0,0 mode: CR/CR res: [0x200010662:0x1:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 12.0.2.204@tcp1 remote: 0x99a417b55be15413 expref: 3 pid: 7199 timeout: 0 lvb_type: 0
Sep 8 03:04:07 mds0 kernel: Lustre: 7288:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:16073s); client may timeout. req@ffff8801e46a5800 x1573501446921444/t0(0) o101->d233abf3-44fe-8871-0243-4d52ca064c3d@12.0.4.15@tcp1:0/0 lens 576/536 e 0 to 0 dl 1504794974 ref 1 fl Complete:/0/0 rc 0/0
Sep 8 03:04:07 mds0 kernel: Lustre: 7288:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 465 previous similar messages
Sep 8 03:04:07 mds0 kernel: LNet: Service thread pid 7288 completed after 16827.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 8 03:04:07 mds0 kernel: LNet: Skipped 2 previous similar messages
Sep 8 03:04:07 mds0 kernel: LustreError: 7288:0:(service.c:2007:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-12.0.7.69@tcp1: deadline 755:444s ago
Sep 8 03:04:07 mds0 kernel: req@ffff8802791dcc00 x1573351331945832/t0(0) o101->26e8264d-f311-a187-3b70-7126cc743384@12.0.7.69@tcp1:0/0 lens 584/0 e 0 to 0 dl 1504810603 ref 1 fl Interpret:/2/ffffffff rc 0/-1
Sep 8 03:04:07 mds0 kernel: LustreError: 7288:0:(service.c:2007:ptlrpc_server_handle_request()) Skipped 331 previous similar messages
Sep 8 03:06:36 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Sep 8 03:06:36 mds0 kernel: req@ffff8805131b4800 x1573500512894404/t0(0) o101->ea21e094-8402-cce5-b0c6-954851181800@12.0.2.27@tcp1:0/0 lens 576/0 e 0 to 0 dl 1504811201 ref 2 fl New:/0/ffffffff rc 0/-1
Sep 8 03:06:36 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 221 previous similar messages
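
As described in the title, the reporter's workaround is to unmount and remount the MDT on the MDS node, which forces client eviction and recovery. A minimal sketch of that procedure (the device path and mount point below are placeholders, not taken from this report; adjust for the actual setup):

```shell
# Unmount the MDT; in-flight client requests will block until it returns.
# /dev/mdtdev and /mnt/mdt are assumed placeholders for the real MDT
# block device and mount point.
umount /mnt/mdt

# Remount the MDT; the MDS restarts its services and clients reconnect
# and go through recovery.
mount -t lustre /dev/mdtdev /mnt/mdt

# Optionally watch recovery progress (procfs path may vary by version):
cat /proc/fs/lustre/mdt/*/recovery_status
```

This is a workaround, not a fix: the log messages above (service threads completing after 15000+ seconds, "too many service threads, or there were not enough hardware resources") suggest the underlying cause is MDS overload or a stuck lock holder, which remounting only clears temporarily.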



 Comments   
Comment by Peter Jones [ 21/Dec/17 ]

Hi there

This project is intended for tracking work on current Lustre community releases. I suggest upgrading to the latest 2.10.x release to see whether that helps. Alternatively, there may be someone on the Lustre mailing lists who can give you some tips for these older versions.

Peter

Generated at Sat Feb 10 02:31:04 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.