Sometimes the MDS stops serving requests. How can I fix this?
The kernel log (dmesg) on the MDS looks like this:
Sep 8 02:44:05 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 2 previous similar messages
Sep 8 02:44:05 mds0 kernel: LustreError: 6956:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881022d83800 ns: mdt-THFS-MDT0000_UUID lock: ffff88101be459c0/0x81ad926eaa8a4f44 lrc: 3/0,0 mode: CR/CR res: [0x20000f70d:0x3e:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 12.0.5.19@tcp1 remote: 0x721abb56a3b64281 expref: 3 pid: 6956 timeout: 0 lvb_type: 0
Sep 8 02:44:05 mds0 kernel: Lustre: 7357:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:14871s); client may timeout. req@ffff8810011a0c00 x1573500607175288/t236318034970(0) o101->44dc37a9-770f-5801-6327-54280e43793f@12.0.2.94@tcp1:0/0 lens 584/600 e 0 to 0 dl 1504794974 ref 1 fl Complete:/0/0 rc -107/-107
Sep 8 02:44:05 mds0 kernel: Lustre: 7357:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 404 previous similar messages
Sep 8 02:44:05 mds0 kernel: LNet: Service thread pid 7357 completed after 15625.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 8 02:44:05 mds0 kernel: LNet: Skipped 2 previous similar messages
Sep 8 02:44:05 mds0 kernel: LustreError: 7357:0:(service.c:2007:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-12.0.3.0@tcp1: deadline 755:893s ago
Sep 8 02:44:05 mds0 kernel: req@ffff881003fd0400 x1573501172011116/t0(0) o101->f8b0131d-9be5-85ce-cf36-173369e67ace@12.0.3.0@tcp1:0/0 lens 584/0 e 0 to 0 dl 1504808952 ref 1 fl Interpret:/2/ffffffff rc 0/-1
Sep 8 02:44:05 mds0 kernel: LustreError: 7357:0:(service.c:2007:ptlrpc_server_handle_request()) Skipped 284 previous similar messages
Sep 8 02:44:05 mds0 kernel: LustreError: 6956:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Sep 8 02:46:34 mds0 kernel: Lustre: 7280:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Sep 8 02:46:34 mds0 kernel: req@ffff88102370bc00 x1573501624739012/t0(0) o101->6a87f33d-82c7-dcec-e57f-0df777a4974e@12.0.4.128@tcp1:0/0 lens 576/0 e 0 to 0 dl 1504809999 ref 2 fl New:/0/ffffffff rc 0/-1
Sep 8 02:46:34 mds0 kernel: Lustre: 7280:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 155 previous similar messages
Sep 8 02:52:23 mds0 kernel: Lustre: THFS-MDT0000: Client 80d94066-14b1-587f-c9b4-1ee79e3cf810 (at 12.0.2.134@tcp1) reconnecting
Sep 8 02:52:23 mds0 kernel: Lustre: Skipped 8534 previous similar messages
Sep 8 02:52:23 mds0 kernel: Lustre: THFS-MDT0000: Client 80d94066-14b1-587f-c9b4-1ee79e3cf810 (at 12.0.2.134@tcp1) refused reconnection, still busy with 1 active RPCs
Sep 8 02:52:23 mds0 kernel: Lustre: Skipped 8448 previous similar messages
Sep 8 02:54:04 mds0 kernel: Lustre: lock timed out (enqueued at 1504809244, 1200s ago)
Sep 8 02:54:04 mds0 kernel: Lustre: Skipped 5 previous similar messages
Sep 8 02:54:06 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 16227s: evicting client at 12.0.5.9@tcp1 ns: mdt-THFS-MDT0000_UUID lock: ffff881023a3b980/0x81ad926eaa82125a lrc: 3/0,0 mode: PR/PR res: [0x200004f76:0x6698:0x0].0 bits 0x13 rrc: 524 type: IBT flags: 0x60200400000020 nid: 12.0.5.9@tcp1 remote: 0xad23bcd378391d2a expref: 851 pid: 6993 timeout: 4956172002 lvb_type: 0
Sep 8 02:54:06 mds0 kernel: LustreError: 7332:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88101b680000 ns: mdt-THFS-MDT0000_UUID lock: ffff8810003a8c80/0x81ad926eaa8a56c9 lrc: 3/0,0 mode: CR/CR res: [0x20000f708:0x21:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 12.0.3.16@tcp1 remote: 0x129fa813fa73c9 expref: 3 pid: 7332 timeout: 0 lvb_type: 0
Sep 8 02:54:06 mds0 kernel: Lustre: 6986:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:15472s); client may timeout. req@ffff88101f784400 x1573501840576392/t236318035045(0) o101->902f9430-1cb5-bcd4-0b1b-7653215af491@12.0.5.147@tcp1:0/0 lens 584/600 e 0 to 0 dl 1504794974 ref 1 fl Complete:/0/0 rc -107/-107
Sep 8 02:54:06 mds0 kernel: Lustre: 6986:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 811 previous similar messages
Sep 8 02:54:06 mds0 kernel: LNet: Service thread pid 6986 completed after 16226.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 8 02:54:06 mds0 kernel: LNet: Skipped 3 previous similar messages
Sep 8 02:54:06 mds0 kernel: LustreError: 6986:0:(service.c:2007:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-12.0.2.67@tcp1: deadline 755:583s ago
Sep 8 02:54:06 mds0 kernel: req@ffff88100cf4c850 x1573500701417108/t0(0) o101->833faabe-1727-671c-00c6-93dce877c09a@12.0.2.67@tcp1:0/0 lens 576/0 e 0 to 0 dl 1504809863 ref 1 fl Interpret:/2/ffffffff rc 0/-1
Sep 8 02:54:06 mds0 kernel: LustreError: 6986:0:(service.c:2007:ptlrpc_server_handle_request()) Skipped 602 previous similar messages
Sep 8 02:54:06 mds0 kernel: LustreError: 7332:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) Skipped 1 previous similar message
Sep 8 02:56:35 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Sep 8 02:56:35 mds0 kernel: req@ffff88022cd0b800 x1573500512886332/t0(0) o101->ea21e094-8402-cce5-b0c6-954851181800@12.0.2.27@tcp1:0/0 lens 576/3384 e 0 to 0 dl 1504810600 ref 2 fl Interpret:/0/0 rc 0/0
Sep 8 02:56:35 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 159 previous similar messages
Sep 8 03:02:09 mds0 kernel: Lustre: THFS-MDT0000: haven't heard from client a16e49fa-13a1-36d5-748c-e80828650109 (at 12.0.4.21@tcp1) in 4089 seconds. I think it's dead, and I am evicting it. exp ffff880ff8b5e400, cur 1504810929 expire 1504810779 last 1504806840
Sep 8 03:02:09 mds0 kernel: Lustre: Skipped 1 previous similar message
Sep 8 03:02:23 mds0 kernel: Lustre: THFS-MDT0000: Client cd066d1d-c979-29a2-a453-d5878a89c5da (at 12.0.3.1@tcp1) reconnecting
Sep 8 03:02:23 mds0 kernel: Lustre: Skipped 8591 previous similar messages
Sep 8 03:02:23 mds0 kernel: Lustre: THFS-MDT0000: Client cd066d1d-c979-29a2-a453-d5878a89c5da (at 12.0.3.1@tcp1) refused reconnection, still busy with 1 active RPCs
Sep 8 03:02:23 mds0 kernel: Lustre: Skipped 8513 previous similar messages
Sep 8 03:04:05 mds0 kernel: Lustre: lock timed out (enqueued at 1504809845, 1200s ago)
Sep 8 03:04:05 mds0 kernel: Lustre: Skipped 2 previous similar messages
Sep 8 03:04:07 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 16828s: evicting client at 12.0.5.26@tcp1 ns: mdt-THFS-MDT0000_UUID lock: ffff8801e194a0c0/0x81ad926eaa821310 lrc: 3/0,0 mode: PR/PR res: [0x200004f76:0x6698:0x0].0 bits 0x13 rrc: 521 type: IBT flags: 0x60200400000020 nid: 12.0.5.26@tcp1 remote: 0xec650b58537f9b4b expref: 327 pid: 7022 timeout: 4956773004 lvb_type: 0
Sep 8 03:04:07 mds0 kernel: LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 1 previous similar message
Sep 8 03:04:07 mds0 kernel: LustreError: 7199:0:(ldlm_lockd.c:1335:ldlm_handle_enqueue0()) ### lock on destroyed export ffff881022548800 ns: mdt-THFS-MDT0000_UUID lock: ffff880ff52f30c0/0x81ad926eaa8a5bed lrc: 3/0,0 mode: CR/CR res: [0x200010662:0x1:0x0].0 bits 0x9 rrc: 2 type: IBT flags: 0x50200000000000 nid: 12.0.2.204@tcp1 remote: 0x99a417b55be15413 expref: 3 pid: 7199 timeout: 0 lvb_type: 0
Sep 8 03:04:07 mds0 kernel: Lustre: 7288:0:(service.c:2039:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (755:16073s); client may timeout. req@ffff8801e46a5800 x1573501446921444/t0(0) o101->d233abf3-44fe-8871-0243-4d52ca064c3d@12.0.4.15@tcp1:0/0 lens 576/536 e 0 to 0 dl 1504794974 ref 1 fl Complete:/0/0 rc 0/0
Sep 8 03:04:07 mds0 kernel: Lustre: 7288:0:(service.c:2039:ptlrpc_server_handle_request()) Skipped 465 previous similar messages
Sep 8 03:04:07 mds0 kernel: LNet: Service thread pid 7288 completed after 16827.97s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Sep 8 03:04:07 mds0 kernel: LNet: Skipped 2 previous similar messages
Sep 8 03:04:07 mds0 kernel: LustreError: 7288:0:(service.c:2007:ptlrpc_server_handle_request()) @@@ Dropping timed-out request from 12345-12.0.7.69@tcp1: deadline 755:444s ago
Sep 8 03:04:07 mds0 kernel: req@ffff8802791dcc00 x1573351331945832/t0(0) o101->26e8264d-f311-a187-3b70-7126cc743384@12.0.7.69@tcp1:0/0 lens 584/0 e 0 to 0 dl 1504810603 ref 1 fl Interpret:/2/ffffffff rc 0/-1
Sep 8 03:04:07 mds0 kernel: LustreError: 7288:0:(service.c:2007:ptlrpc_server_handle_request()) Skipped 331 previous similar messages
Sep 8 03:06:36 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/-150), not sending early reply
Sep 8 03:06:36 mds0 kernel: req@ffff8805131b4800 x1573500512894404/t0(0) o101->ea21e094-8402-cce5-b0c6-954851181800@12.0.2.27@tcp1:0/0 lens 576/0 e 0 to 0 dl 1504811201 ref 2 fl New:/0/ffffffff rc 0/-1
Sep 8 03:06:36 mds0 kernel: Lustre: 7056:0:(service.c:1347:ptlrpc_at_send_early_reply()) Skipped 221 previous similar messages
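
In case it helps anyone read a log like the one above, here is a minimal Python sketch that tallies the messages by the function that emitted them and lists the clients evicted by the lock callback timer. It assumes every message carries the `(file.c:line:function())` tag seen above and that those evictions are logged as `evicting client at <nid>`; the `summarize` name is just an illustrative helper, not part of Lustre. Evictions logged with other wording (such as the "haven't heard from client" message) are not counted.

```python
import re
from collections import Counter

# Matches the "(file.c:line:function())" source tag seen in the messages above.
SOURCE_TAG = re.compile(r"\((\w+\.c):(\d+):(\w+)\(\)\)")
# Matches the NID in "evicting client at <nid>" messages (assumed wording).
EVICTED = re.compile(r"evicting client at (\S+)")


def summarize(lines):
    """Count messages per emitting function and collect evicted client NIDs."""
    by_function = Counter()
    evicted = []
    for line in lines:
        tag = SOURCE_TAG.search(line)
        if tag:
            by_function[tag.group(3)] += 1
        hit = EVICTED.search(line)
        if hit:
            evicted.append(hit.group(1))
    return by_function, evicted
```

To use it, pass any iterable of log lines, for example an open file holding the saved dmesg output, and print `by_function.most_common()` to see which code paths dominate the log.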