
kernel: obd_memory max: 1854996506, obd_memory current: 1854996506

Details

    • Bug
    • Resolution: Duplicate
    • Critical

    Description

      After upgrading to 2.15.2, the server hung with errors:

      kernel: obd_memory max: 1854996506, obd_memory current: 1854996506

      See the attached files for the full set of logs.
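For anyone scanning the attached logs for the same message, a minimal sketch of pulling the two counters out of the line follows. The regular expression and the helper name `parse_obd_memory` are illustrative assumptions, not part of any Lustre tool; the values are treated as opaque counters. In the line quoted above the current value equals the maximum, meaning the tracked allocation was at its high-water mark when the message was printed.

```python
import re

# Pattern written against the single message quoted in this ticket (an assumption;
# other Lustre versions may format the line differently).
OBD_MEMORY_RE = re.compile(r"obd_memory max:\s*(\d+),\s*obd_memory current:\s*(\d+)")


def parse_obd_memory(line):
    """Return (max, current) as integers, or None if the line does not match."""
    match = OBD_MEMORY_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = "kernel: obd_memory max: 1854996506, obd_memory current: 1854996506"
peak, current = parse_obd_memory(line)

# True when the counter is sitting at its high-water mark.
at_peak = current == peak
```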

      Attachments

        1. nbp13_hung
          70 kB
        2. stack.out
          80 kB
        3. stack.out2
          82 kB
        4. stack.out3
          80 kB

        Issue Links

          Activity

            [LU-16690] kernel: obd_memory max: 1854996506, obd_memory current: 1854996506
            pjones Peter Jones added a comment -

            Ok so in that case let's assume that this is fixed unless future evidence comes to light that this is not the case


            mhanafi Mahmoud Hanafi added a comment -

            Peter,

            I haven't been able to reproduce it on our test filesystem; I think it is too small. We may need to wait and test on a production filesystem during an extended dedicated time.

            pjones Peter Jones added a comment -

            Mahmoud

            The LU-16413 fix has been merged to b2_15 and will be in the upcoming 2.15.3 release. Have you tested the effectiveness of this release?

            Peter 

            dongyang Dongyang Li added a comment -

            Andreas,
            From the logs the kernel is 4.18.0-425.3.1.el8_lustre.x86_64.

            I think we are seeing the memory allocation issue because of LU-16413; that patch landed in master but is not in 2.15.2.
            Before that patch, the bio-integrity kernel patch is broken for 4.18 kernels: it removed the check in bio_integrity_prep() for whether the integrity payload is already allocated, and the osd calls bio_integrity_prep() twice, so we leak the integrity payload. Both problems are fixed in LU-16413.
            I will port LU-16413 to b2_15.
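
The leak mechanism described in this comment can be illustrated with a toy model. This is a sketch only: `Bio`, `PayloadPool`, and the two `prep_*` functions are hypothetical stand-ins written for illustration and are not the kernel's `struct bio` or the real `bio_integrity_prep()`. It shows why calling a prepare step twice leaks one payload per request when the "already allocated" check is missing, and why the guarded version does not.

```python
class Bio:
    """Toy stand-in for a block I/O request (not the kernel structure)."""

    def __init__(self):
        self.integrity_payload = None


class PayloadPool:
    """Counts live payloads so a leak shows up as a nonzero balance."""

    def __init__(self):
        self.live = 0

    def alloc(self):
        self.live += 1
        return object()

    def free(self, payload):
        if payload is not None:
            self.live -= 1


def prep_without_check(bio, pool):
    # Models the broken behaviour: always allocates, overwriting any
    # payload already attached, so the old one can never be freed.
    bio.integrity_payload = pool.alloc()


def prep_with_check(bio, pool):
    # Models the guarded behaviour: an existing payload is reused.
    if bio.integrity_payload is None:
        bio.integrity_payload = pool.alloc()


def complete(bio, pool):
    # Completion frees only the payload the request still points to.
    pool.free(bio.integrity_payload)
    bio.integrity_payload = None


def leaked_after_double_prep(prep, n_bios):
    """Run n_bios requests, calling prep twice on each, and count leaks."""
    pool = PayloadPool()
    for _ in range(n_bios):
        bio = Bio()
        prep(bio, pool)
        prep(bio, pool)  # the second call models the duplicate call from the osd
        complete(bio, pool)
    return pool.live
```

In this model the unguarded version leaves one payload behind per request, which matches the steadily growing allocation reported in the description; either restoring the check or removing the duplicate call brings the balance back to zero.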


            adilger Andreas Dilger added a comment -

            Mahmoud, which kernel is running on the servers for this system?

            mhanafi Mahmoud Hanafi added a comment -

            Lowering the number of threads helps, but eventually the service still hits the issue. However, the host that had ib_iser t10pi disabled was fine. So for now we have disabled t10pi and lowered the number of threads.

            I will try to see if I can get a reproducer to help debug the issue.

            People

              dongyang Dongyang Li
              mhanafi Mahmoud Hanafi