Details
- Bug
- Resolution: Fixed
- Minor
Description
We're seeing quite a few errors on clients, OSSes and the MDS.
For example on clients:
Jan 16 09:53:09 juliet2 kernel: LustreError: 49499:0:(mdc_request.c:1441:mdc_read_page()) juliet-MDT0000-mdc-ffff99a3723aa800: [0x200001b3e:0x5f66:0x0] lock enqueue fails: rc = -4
Jan 16 21:30:41 juliet2 kernel: LustreError: 11-0: juliet-OST002a-osc-ffff99a3723aa800: operation ldlm_enqueue to node 10.29.22.93@tcp failed: rc = -107
Jan 16 21:30:41 juliet2 kernel: Lustre: juliet-OST002a-osc-ffff99a3723aa800: Connection to juliet-OST002a (at 10.29.22.93@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 16 21:30:41 juliet2 kernel: LustreError: 167-0: juliet-OST002a-osc-ffff99a3723aa800: This client was evicted by juliet-OST002a; in progress operations using this service will fail.
Jan 16 21:30:41 juliet2 kernel: Lustre: 4193:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) juliet: dirty page discard: 10.29.22.90@tcp:/juliet/fid: [0x20002dd8a:0x16daa:0x0]/ may get corrupted (rc -108)
Jan 16 21:30:41 juliet2 kernel: Lustre: 4191:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) juliet: dirty page discard: 10.29.22.90@tcp:/juliet/fid: [0x20002dd8a:0x16cb1:0x0]/ may get corrupted (rc -108)
OSS:
Jan 16 06:17:54 joss1 kernel: LustreError: 6496:0:(events.c:455:server_bulk_callback()) event type 3, status -5, desc ffff92ef5dbb3000
Jan 16 06:17:54 joss1 kernel: LustreError: 16260:0:(ldlm_lib.c:3363:target_bulk_io()) @@@ network error on bulk WRITE req@ffff92f3dfcbb850 x1760556572171776/t0(0) o4->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:446/0 lens 488/448 e 0 to 0 dl 1673867911 ref 1 fl Interpret:/0/0 rc 0/0
Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Client bd9b8fe9-b80f-7114-7b35-663a8e9d48db (at 10.29.22.97@tcp) reconnecting
Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Connection restored to 3d01cce1-cfce-5103-0db6-32c1aa8f728c (at 10.29.22.97@tcp)
Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Bulk IO write error with bd9b8fe9-b80f-7114-7b35-663a8e9d48db (at 10.29.22.97@tcp), client will retry: rc = -110
Jan 16 06:17:54 joss1 kernel: Lustre: Skipped 1 previous similar message
Jan 16 06:17:54 joss1 kernel: LustreError: 16218:0:(ldlm_lib.c:3357:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff92eb76e54050 x1760556572184448/t0(0) o4->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:452/0 lens 488/448 e 0 to 0 dl 1673867917 ref 1 fl Interpret:/0/0 rc 0/0
MDS:
Jan 16 19:52:10 jmds1 kernel: LustreError: 47609:0:(ldlm_lib.c:3357:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff995a6c544850 x1760579715652736/t0(0) o37->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:220/0 lens 448/440 e 1 to 0 dl 1673916760 ref 1 fl Interpret:/0/0 rc 0/0
Jan 19 12:11:29 jmds1 kernel: LustreError: 15481:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11
Could you give us an idea of what these errors might indicate (for example network issues, misconfiguration, or load) so that we can narrow the focus of our investigation? Please let us know what further details (logs, cluster settings) you need.
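For reference when reading the logs above: the rc and status values appear to be negated Linux errno codes. The following is a minimal sketch for decoding them, assuming the standard Linux errno numbering. The table is hardcoded rather than taken from the host so the result does not depend on the machine it runs on, and the names `LINUX_ERRNO` and `decode_rc` are illustrative, not part of Lustre.

```python
# Linux errno numbers for the codes seen in the logs above
# (assumption: standard asm-generic numbering, as used on x86_64).
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    5: ("EIO", "Input/output error"),
    11: ("EAGAIN", "Resource temporarily unavailable"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    108: ("ESHUTDOWN", "Cannot send after transport endpoint shutdown"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_rc(rc):
    """Return (name, description) for a negative rc, or None if it is not in the table."""
    return LINUX_ERRNO.get(-rc)


# Decode every rc/status value that appears in the client, OSS and MDS excerpts.
for rc in (-4, -5, -11, -107, -108, -110):
    name, text = decode_rc(rc)
    print(f"rc = {rc}: {name} ({text})")
```

Read this way, the client-side -107 and -108 and the OSS-side -110 all describe a lost or timed-out connection, which would be consistent with (but does not prove) a network or load problem.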
Issue Links
- is related to: LU-14644 IOR SSF PFL ill-formed I/O job aborted with EIO during automated FOFB testing (Resolved)