Lustre / LU-185

LBUG: (cl_page.c:1362:cl_page_completion()) !(pg->cp_flags & CPF_READ_COMPLETED) ASSERTION(0) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.1.0
    • Affects Version/s: Lustre 2.0.0
    • Labels: None
    • Severity: 3
    • Bugzilla ID: 19,352
    • Rank: 5056

    Description

      Hi,

      CEA runs dedicated client nodes that exchange files between two clusters. These nodes frequently crash with the following messages in the syslog:

      LustreError: 8142:0:(osc_request.c:773:osc_announce_cached()) dirty 1807 - 1807 > system dirty_max 8650752
      LustreError: 8142:0:(osc_request.c:773:osc_announce_cached()) Skipped 50 previous similar messages
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) page@ffff880e31124140[2 ffff8803e8022548:0 ^(null)_ffff880e311248c0
      3 0 1 (null) (null) 0x1]
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) page@ffff880e311248c0[1 ffff8803609a0508:0 ^ffff880e31124140_(null)
      3 0 1 (null) (null) 0x0]
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) vvp-page@ffff880f15628960(1:0:0) vm@ffffea00343cb960
      3800000000000821 3:0 ffff880e31124140 0 lru
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) lov-page@ffff880f9bcebb88
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) osc-page@ffff881047778db0: 1< 0x845fed 257 0 - - - >
      2< 0 0 0x0 0x108 | (null) ffff8808641887c8 ffff8802c39a60c0 ffffffffa0714c20 ffff881047778db0 > 3<
      - ffff880eca22b0e0 0 0 0 > 4< 0 0 8 2097152 - | - - - - > 5< - - - - | 0 - - | 0 - ->
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) end page@ffff880e31124140
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) !(pg->cp_flags & CPF_READ_COMPLETED)
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) ASSERTION(0) failed
      LustreError: 7389:0:(cl_page.c:1362:cl_page_completion()) LBUG
      Pid: 7389, comm: ptlrpcd-brw
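
When triaging logs like the one above, it can help to pull the PID and source location out of each LustreError line so that messages from the same thread and call site can be grouped. The following is a minimal sketch that assumes the prefix layout seen in this report, `LustreError: <pid>:<n>:(<file>:<line>:<function>())`. The struct and function names are hypothetical, and the second numeric field is stored without being interpreted.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of a LustreError prefix. */
struct lerr_loc {
    int  pid;        /* first number after "LustreError: " */
    int  second;     /* second number, not interpreted here */
    char file[64];   /* source file, e.g. "cl_page.c" */
    int  line;       /* source line number */
    char func[64];   /* function name without the trailing "()" */
};

/* Parse the prefix of one log line. Returns 0 on success, -1 if the
 * line does not match the assumed layout. */
int lerr_parse(const char *text, struct lerr_loc *out)
{
    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    int n = sscanf(text, "LustreError: %d:%d:(%63[^:]:%d:%63[^(]()",
                   &out->pid, &out->second, out->file,
                   &out->line, out->func);
    return n == 5 ? 0 : -1;
}
```

Lines that do not carry the prefix, such as the continuation lines of the page dump, are rejected with -1 and can be attached to the previous parsed line by the caller.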

      Analyzing the crash dump, we see the following stack:
      crash> bt
      PID: 7389 TASK: ffff88087aa22ed0 CPU: 3 COMMAND: "ptlrpcd-brw"
      #0 [ffff88087caeb838] machine_kexec at ffffffff8102e66b
      #1 [ffff88087caeb898] crash_kexec at ffffffff810a9b08
      #2 [ffff88087caeb968] panic at ffffffff8145212d
      #3 [ffff88087caeb9e8] lbug_with_loc at ffffffffa03b8eeb
      #4 [ffff88087caeba38] libcfs_assertion_failed at ffffffffa03c47d6
      #5 [ffff88087caeba88] cl_page_completion at ffffffffa047fc5a
      #6 [ffff88087caebb28] osc_completion at ffffffffa070eccf
      #7 [ffff88087caebba8] osc_ap_completion at ffffffffa06f79ce
      #8 [ffff88087caebc28] brw_interpret at ffffffffa0704759
      #9 [ffff88087caebcf8] ptlrpc_check_set at ffffffffa05573fa
      #10 [ffff88087caebdd8] ptlrpcd_check at ffffffffa058a840
      #11 [ffff88087caebe38] ptlrpcd at ffffffffa058ac93
      #12 [ffff88087caebf48] kernel_thread at ffffffff8100d1aa

      The crash dump also shows that the cl_page struct in question has only the CPF_READ_COMPLETED flag set.
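
To illustrate why a page that already carries the flag trips the assertion, here is a simplified model of a "completed at most once" check. It is not the Lustre source: the `toy_` names are hypothetical, and the function returns an error where the real code asserts. The point is that a second read completion on the same page finds the flag already set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a page flag such as CPF_READ_COMPLETED. */
enum toy_page_flags {
    TOY_READ_COMPLETED = 1 << 0
};

/* Hypothetical, heavily reduced page structure. */
struct toy_page {
    unsigned flags;
};

/* Record a read completion. Returns 0 the first time and -1 if the
 * page was already marked completed, which is the situation the
 * failed assertion in this report corresponds to. */
int toy_read_completion(struct toy_page *pg)
{
    assert(pg != NULL);
    if (pg->flags & TOY_READ_COMPLETED)
        return -1;                    /* completion delivered twice */
    pg->flags |= TOY_READ_COMPLETED;  /* first completion */
    return 0;
}
```

In this model, a page observed with only the completed flag set at the time of the check is consistent with a second read completion arriving for a page whose first read had already finished.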

      Searching the Lustre Bugzilla database for similar issues, I found bug 19352. It looks like exactly the same bug, but a fix for bug 19352 already landed in 2.0, and I have verified that our sources include that fix.

      At CEA, the problem seems to have started when the copy tool running on these nodes was modified to use O_DIRECT I/O.
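
For readers unfamiliar with O_DIRECT, the sketch below shows the general shape of a direct read on Linux: the file is opened with O_DIRECT and the buffer and transfer length are aligned. This is not the CEA copy tool. The function names are hypothetical and the 4096-byte alignment is an assumption; the required alignment depends on the filesystem and device.

```c
#define _GNU_SOURCE            /* exposes O_DIRECT in <fcntl.h> on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define DIO_ALIGN 4096         /* assumed alignment, see note above */

/* Round a length up to the next multiple of DIO_ALIGN. */
size_t dio_round_up(size_t len)
{
    return (len + DIO_ALIGN - 1) / DIO_ALIGN * DIO_ALIGN;
}

/* Allocate an aligned buffer large enough for len bytes.
 * Returns NULL on failure; release with free(). */
void *dio_alloc(size_t len)
{
    void *buf = NULL;
    assert(len > 0);
    if (posix_memalign(&buf, DIO_ALIGN, dio_round_up(len)) != 0)
        return NULL;
    return buf;
}

/* Read up to len bytes (rounded up to the alignment) from the start
 * of path, bypassing the page cache. buf must come from dio_alloc().
 * Returns the byte count from read(), or -1 on error. */
ssize_t dio_read(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, dio_round_up(len));
    close(fd);
    return n;
}
```

A test that mixes reads of this kind with ordinary buffered reads of the same file may be a reasonable starting point for a reproducer, given the timing described above.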

          People

            jay Jinshan Xiong (Inactive)
            sebastien.buisson Sebastien Buisson (Inactive)