LU-16497: various lustre errors on clients and servers

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Lustre 2.15.2

    Description

      We're seeing quite a few errors on clients, OSSes and the MDS.

      For example on clients:

      Jan 16 09:53:09 juliet2 kernel: LustreError: 49499:0:(mdc_request.c:1441:mdc_read_page()) juliet-MDT0000-mdc-ffff99a3723aa800: [0x200001b3e:0x5f66:0x0] lock enqueue fails: rc = -4
      Jan 16 21:30:41 juliet2 kernel: LustreError: 11-0: juliet-OST002a-osc-ffff99a3723aa800: operation ldlm_enqueue to node 10.29.22.93@tcp failed: rc = -107
      Jan 16 21:30:41 juliet2 kernel: Lustre: juliet-OST002a-osc-ffff99a3723aa800: Connection to juliet-OST002a (at 10.29.22.93@tcp) was lost; in progress operations using this service will wait for recovery to complete
      Jan 16 21:30:41 juliet2 kernel: LustreError: 167-0: juliet-OST002a-osc-ffff99a3723aa800: This client was evicted by juliet-OST002a; in progress operations using this service will fail.
      Jan 16 21:30:41 juliet2 kernel: Lustre: 4193:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) juliet: dirty page discard: 10.29.22.90@tcp:/juliet/fid: [0x20002dd8a:0x16daa:0x0]/ may get corrupted (rc -108)
      Jan 16 21:30:41 juliet2 kernel: Lustre: 4191:0:(llite_lib.c:2762:ll_dirty_page_discard_warn()) juliet: dirty page discard: 10.29.22.90@tcp:/juliet/fid: [0x20002dd8a:0x16cb1:0x0]/ may get corrupted (rc -108)
      

      OSS:

      Jan 16 06:17:54 joss1 kernel: LustreError: 6496:0:(events.c:455:server_bulk_callback()) event type 3, status -5, desc ffff92ef5dbb3000
      Jan 16 06:17:54 joss1 kernel: LustreError: 16260:0:(ldlm_lib.c:3363:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff92f3dfcbb850 x1760556572171776/t0(0) o4->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:446/0 lens 488/448 e 0 to 0 dl 1673867911 ref 1 fl Interpret:/0/0 rc 0/0
      Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Client bd9b8fe9-b80f-7114-7b35-663a8e9d48db (at 10.29.22.97@tcp) reconnecting
      Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Connection restored to 3d01cce1-cfce-5103-0db6-32c1aa8f728c (at 10.29.22.97@tcp)
      Jan 16 06:17:54 joss1 kernel: Lustre: juliet-OST0009: Bulk IO write error with bd9b8fe9-b80f-7114-7b35-663a8e9d48db (at 10.29.22.97@tcp), client will retry: rc = -110
      Jan 16 06:17:54 joss1 kernel: Lustre: Skipped 1 previous similar message
      Jan 16 06:17:54 joss1 kernel: LustreError: 16218:0:(ldlm_lib.c:3357:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff92eb76e54050 x1760556572184448/t0(0) o4->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:452/0 lens 488/448 e 0 to 0 dl 1673867917 ref 1 fl Interpret:/0/0 rc 0/0
      

      MDS:

      Jan 16 19:52:10 jmds1 kernel: LustreError: 47609:0:(ldlm_lib.c:3357:target_bulk_io()) @@@ Reconnect on bulk READ  req@ffff995a6c544850 x1760579715652736/t0(0) o37->bd9b8fe9-b80f-7114-7b35-663a8e9d48db@10.29.22.97@tcp:220/0 lens 448/440 e 1 to 0 dl 1673916760 ref 1 fl Interpret:/0/0 rc 0/0
      Jan 19 12:11:29 jmds1 kernel: LustreError: 15481:0:(mgs_handler.c:282:mgs_revoke_lock()) MGS: can't take cfg lock for 0x736d61726170/0x3 : rc = -11
      

      Could you give us an idea of what these errors might indicate (e.g. network issues, misconfiguration, load), so we can narrow the focus of our investigation? Let us know what extra details (logs, cluster settings) you need from us.
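      For reference while triaging: the negative "rc" values in these messages are standard Linux errno codes (-4 = EINTR, -5 = EIO, -11 = EAGAIN, -107 = ENOTCONN, -108 = ESHUTDOWN, -110 = ETIMEDOUT). Below is a minimal triage sketch (ours, not part of the original report) that tallies Lustre console messages by return code; the script name and the assumption of plain syslog input are ours.

      #!/usr/bin/env python3
      # lustre_rc_tally.py -- rough triage sketch, not an official tool:
      # count Lustre/LustreError console lines per negative return code and
      # map each code to its errno symbol. Assumes plain syslog text like
      # the attached *-messages files.
      import errno
      import re
      import sys
      from collections import Counter

      # Matches "rc = -108", "rc -108" and "status -5" in Lustre console output.
      RC_RE = re.compile(r'(?:rc\s*=?\s*|status\s+)(-\d+)')

      def errno_name(code: int) -> str:
          # Map a negative kernel return code to its errno symbol, if known.
          return errno.errorcode.get(-code, "unknown")

      def main(paths):
          counts = Counter()
          for path in paths:
              with open(path, errors="replace") as fh:
                  for line in fh:
                      if "Lustre" not in line:
                          continue
                      for value in RC_RE.findall(line):
                          code = int(value)
                          if code < 0:
                              counts[code] += 1
          for code, n in counts.most_common():
              print(f"{n:8d}  rc = {code:5d}  ({errno_name(code)})")

      if __name__ == "__main__":
          main(sys.argv[1:] or ["/var/log/messages"])

      Run over the uncompressed juliet*/joss*/jmds* message files, this should show whether the failures cluster around connection loss and evictions (-107/-108), timeouts (-110) or I/O errors (-5), which helps distinguish network problems from server-side issues.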

      Attachments

        1. jmds1-messages.gz
          438 kB
        2. jmds1-messages-20230115.gz
          2.69 MB
        3. joss1-messages.gz
          415 kB
        4. joss1-messages-20230115.gz
          2.74 MB
        5. joss2-messages.gz
          443 kB
        6. joss2-messages-20230115.gz
          2.78 MB
        7. joss3-messages-20230115.gz
          2.74 MB
        8. joss4-messages-20230115.gz
          2.73 MB
        9. joss5-messages-20230115.gz
          2.73 MB
        10. joss6-messages-20230115.gz
          2.76 MB
        11. juliet1-messages.gz
          511 kB
        12. juliet1-messages-20230115.gz
          3.15 MB
        13. juliet1-messages-20230115 (1).gz
          3.15 MB
        14. juliet2-messages-20230115.gz
          1.79 MB
        15. juliet2-messages-20230115 (1).gz
          1.79 MB

          Activity

            pjones Peter Jones added a comment -

            The one tip that I have heard is to be cautious to not inadvertently reformat your OSTs but, basically, this should be a standard OS upgrade. Based on the showing at the Lustre BOF at SC22, it seems like a number of people have already navigated this upgrade so you could always poll lustre-discuss to see if other community members have any experiences to share.

            dneg Dneg (Inactive) added a comment -

            Quick question Peter: since we're upgrading from EL7 to EL8 as well as from 2.12 to 2.15, are there any special steps to take?

            dneg Dneg (Inactive) added a comment -

            Ok, thanks Peter

            pjones Peter Jones added a comment -

            Campbell

            There are no current plans to issue a 2.12.10. What Colin means is that the fix has been merged to the branch so that, if we ever did decide to do one, it would include this fix. Using the latest 2.15.x LTS release (2.15.2) would be the most expedient option.

            Peter

            dneg Dneg (Inactive) added a comment -

            Thanks Colin, looking at the Whamcloud public repos, the 'latest-2.12-release' link points to 2.12.9-1. Has this got the required patches, or should we wait for 2.12.10 (and when will that be released)?

            Regards,

            Campbell

            cfaber Colin Faber added a comment -

            Hi dneg,

            Based on what I'm seeing, this does look very similar to LU-14644, which has been addressed in 2.15.0 and 2.12.10. Can you try upgrading and see if you still experience the problem?

            dneg Dneg (Inactive) added a comment - edited

            A bit more information: this cluster was upgraded to 2.12.8_6_g5457c37 to address issues seen in LU-15915 and LU-16343. As a temporary mitigation, we lowered the lru_size from 10000 to 128. We set it back to 10000 just recently, and since then (around the 19th of January) evictions have started occurring as well. Logs attached.
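            For context, a sketch of how the LRU change above is typically applied and checked: the usual knob is the ldlm.namespaces.*.lru_size parameter, driven through lctl get_param / set_param. Treating that parameter as the "lru size" referred to above is our assumption, as is the helper below; run it as root on a client.

            #!/usr/bin/env python3
            # ldlm_lru_size.py -- sketch (ours, not from the ticket): query and
            # optionally set the LDLM lock LRU size on a Lustre client via lctl.
            import subprocess
            import sys

            def get_lru_sizes():
                # Return {namespace: lru_size} as printed by "lctl get_param".
                out = subprocess.run(
                    ["lctl", "get_param", "ldlm.namespaces.*.lru_size"],
                    capture_output=True, text=True, check=True,
                ).stdout
                sizes = {}
                for line in out.splitlines():
                    if "=" in line:
                        name, value = line.split("=", 1)
                        sizes[name.strip()] = int(value)
                return sizes

            def set_lru_size(value: int):
                # Apply one lru_size to every namespace (e.g. 128 or 10000).
                subprocess.run(
                    ["lctl", "set_param", f"ldlm.namespaces.*.lru_size={value}"],
                    check=True,
                )

            if __name__ == "__main__":
                if len(sys.argv) > 1:
                    set_lru_size(int(sys.argv[1]))
                for ns, size in sorted(get_lru_sizes().items()):
                    print(f"{ns} = {size}")

            Note that, as we understand it, a non-zero lru_size fixes the LRU at that size, while setting it to 0 re-enables dynamic LRU resizing; that difference may matter when comparing behaviour at 128 versus 10000.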

            dneg Dneg (Inactive) added a comment -

            Logs attached

            cfaber Colin Faber added a comment -

            Hi,

            Can you please attach full logs around this incident? Thank you!

            (dmesg / syslogs, etc)


            dneg Dneg (Inactive) added a comment -

            Version of Lustre is 2.12.8_6_g5457c37 on all OSSes, the MDS and clients (apart from one, which is running 2.12.6).

            People

              cfaber Colin Faber
              dneg Dneg (Inactive)
              Votes: 0
              Watchers: 3
