We're seeing quite a few errors on clients, OSSes and the MDS.
For example on clients:
Jan 16 09:53:09 juliet2 kernel: LustreError: 49499:0:(mdc_request.c:1441:mdc_read_page()) juliet-MDT0000-mdc-ffff99a3723aa800: [0x200001b3e:0x5f66:0x0] lock enqueue fails: rc = -4
Jan 16 21:30:41 juliet2 kernel: LustreError: 11-0: juliet-OST002a-osc-ffff99a3723aa800: operation ldlm_enqueue to node 10.29.22.93@tcp failed: rc = -107
Jan 16 21:30:41 juliet2 kernel: Lustre: juliet-OST002a-osc-ffff99a3723aa800: Connection to juliet-OST002a (at 10.29.22.93@tcp) was lost; in progress operations using this service will wait for recovery to complete
Jan 16 21:30:41 juliet2 kernel: LustreError: 167-0: juliet-OST002a-osc-ffff99a3723aa800: This client was evicted by juliet-OST002a; in progress operations using this service will fail.
Jan 16 21:30:41 juliet2 kernel: Lustre: 4193:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) juliet: dirty page discard: 10.29.22.90@tcp:/juliet/fid: [0x20002dd8a:0x16daa:0x0]/ may get corrupted (rc -108)
Jan 16 21:30:41 juliet2 kernel: Lustre: 4191:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) juliet: dirty page discard: 10.29.22.90@tcp:/juliet/fid: [0x20002dd8a:0x16cb1:0x0]/ may get corrupted (rc -108)
OSS:
Jan 16 06:17:54 joss1 kernel: LustreError: 6496:0:(events.c:455:server_bulk_callback()) event type 3, status -5, desc ffff92ef5dbb3000
Jan 16 06:17:54 joss1 kernel: LustreError: 16260:0:(ldlm_lib.c:3363:target_bulk_io()) @@@ network error on bulk WRITE req@ffff92f3dfcbb850 x1760556572171776/t0(0) o4->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:446/0 lens 488/448 e 0 to 0 dl 1673867911 ref 1 fl Interpret:/0/0 rc 0/0
Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Client bd9b8fe9-b80f-7114-7b35-663a8e9d48db (at 10.29.22.97@tcp) reconnecting
Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Connection restored to 3d01cce1-cfce-5103-0db6-32c1aa8f728c (at 10.29.22.97@tcp)
Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Bulk IO write error with bd9b8fe9-b80f-7114-7b35-663a8e9d48db (at 10.29.22.97@tcp), client will retry: rc = -110
Jan 16 06:17:54 joss1 kernel: Lustre: Skipped 1 previous similar message
Jan 16 06:17:54 joss1 kernel: LustreError: 16218:0:(ldlm_lib.c:3357:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff92eb76e54050 x1760556572184448/t0(0) o4->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:452/0 lens 488/448 e 0 to 0 dl 1673867917 ref 1 fl Interpret:/0/0 rc 0/0
MDS:
Jan 16 19:52:10 jmds1 kernel: LustreError: 47609:0:(ldlm_lib.c:3357:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff995a6c544850 x1760579715652736/t0(0) o37->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:220/0 lens 448/440 e 1 to 0 dl 1673916760 ref 1 fl Interpret:/0/0 rc 0/0
Jan 19 12:11:29 jmds1 kernel: LustreError: 15481:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11
Is it possible to give us an idea of what these errors might indicate (e.g. network issues, misconfiguration, load), so we can narrow the focus of our investigation? Let us know what extra details (logs, cluster settings) you need from us.
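For reference, the rc values in these messages are negated Linux errno codes: -4 is EINTR, -5 is EIO, -11 is EAGAIN, -107 is ENOTCONN, -108 is ESHUTDOWN and -110 is ETIMEDOUT. A quick way to confirm the mapping on any of the nodes, assuming python3 is installed, is:

    python3 -c 'import errno, os; [print(-n, errno.errorcode[n], os.strerror(n)) for n in (4, 5, 11, 107, 108, 110)]'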
Peter Jones added a comment - The one tip that I have heard is to be cautious not to inadvertently reformat your OSTs, but, basically, this should be a standard OS upgrade. Based on the showing at the Lustre BOF at SC22, it seems like a number of people have already navigated this upgrade, so you could always poll lustre-discuss to see if other community members have any experiences to share.
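One way to guard against an accidental reformat, assuming ldiskfs-backed targets and using /dev/sdX purely as a placeholder device name, is to inspect each target read-only before and after the OS upgrade rather than re-running mkfs.lustre:

    tunefs.lustre --dryrun /dev/sdX    # prints the existing target configuration without changing anything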
Dneg (Inactive) added a comment - Quick question Peter: since we're upgrading from EL7 to EL8 as well as from 2.12 to 2.15, are there any special steps we should take?
Peter Jones added a comment - Campbell
There are no current plans to issue a 2.12.10. What Colin means is that the fix has been merged to the branch, so that if we ever did decide to do one it would include this fix. Using the latest 2.15.x LTS release (2.15.2) would be the most expedient option.
Peter
Dneg (Inactive) added a comment - Thanks Colin. Looking at the Whamcloud public repos, the 'latest-2.12-release' link points to 2.12.9-1. Does this have the required patches, or should we wait for 2.12.10 (and when will that be released)?
Regards,
Campbell
Colin Faber added a comment - Hi dneg
Based on what I'm seeing, this looks very similar to LU-14644, which has been addressed in 2.15.0 and 2.12.10. Can you try upgrading and see if you still experience the problem?
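As a quick sanity check before and after such an upgrade, assuming the standard RPM packaging, the version actually running on each node can be confirmed with something like:

    lctl get_param version     # reports the loaded Lustre module version
    rpm -qa | grep -i lustre   # lists the installed Lustre packages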
Dneg (Inactive) added a comment - edited - A bit more information: this cluster was upgraded to 2.12.8_6_g5457c37 to address issues seen in LU-15915 and LU-16343. As a temporary mitigation, we lowered lru_size from 10000 to 128. We set it back to 10000 just recently, and since then (around 19 January) evictions have started occurring as well. Logs attached.
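For reference, a change like that is typically applied on the clients with lctl; a minimal sketch, assuming the value is overridden across all LDLM namespaces rather than per OSC/MDC:

    lctl set_param ldlm.namespaces.*.lru_size=128      # shrink the client-side lock LRU (not persistent across remount)
    lctl set_param ldlm.namespaces.*.lru_size=10000    # restore the previous value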
[{"id":-1,"name":"My open issues","jql":"assignee = currentUser() AND resolution = Unresolved order by updated DESC","isSystem":true,"sharePermissions":[],"requiresLogin":true},{"id":-2,"name":"Reported by me","jql":"reporter = currentUser() order by created DESC","isSystem":true,"sharePermissions":[],"requiresLogin":true},{"id":-4,"name":"All issues","jql":"order by created DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false},{"id":-5,"name":"Open issues","jql":"resolution = Unresolved order by priority DESC,updated DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false},{"id":-9,"name":"Done issues","jql":"statusCategory = Done order by updated DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false},{"id":-3,"name":"Viewed recently","jql":"issuekey in issueHistory() order by lastViewed DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false},{"id":-6,"name":"Created recently","jql":"created >= -1w order by created DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false},{"id":-7,"name":"Resolved recently","jql":"resolutiondate >= -1w order by updated DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false},{"id":-8,"name":"Updated recently","jql":"updated >= -1w order by updated DESC","isSystem":true,"sharePermissions":[],"requiresLogin":false}]
Resolving as this is fixed in 2.15.2