LU-3580: Panic in ptlrpc when rerunning lustre-rsync-test/8 without remount

Details

    • Type: Bug
    • Resolution: Low Priority
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.5.0
    • Environment: 1 OSS (2 OSTs), 1 MDS, 1 Client (all running lustre-master build 1546), MDS and OSS using ZFS
    • Severity: 3
    • Rank (Obsolete): 9068

    Description

      I was running lustre-rsync-test test_8 repeatedly without umount/remount to reproduce LU-3573, when my MDS hit an LBUG:

      LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) ASSERTION( rs->rs_size >= rs_size ) failed: 
      LustreError: 48946:0:(sec_null.c:318:null_alloc_rs()) LBUG
      Kernel panic - not syncing: LBUG
      Pid: 48946, comm: mdt00_002 Tainted: P           ---------------    2.6.32-358.11.1.el6_lustre.g3b657b6.x86_64 #1
      Call Trace:
       [<ffffffff8150d8f8>] ? panic+0xa7/0x16f
       [<ffffffffa0629eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
       [<ffffffffa0979632>] ? null_alloc_rs+0x272/0x390 [ptlrpc]
       [<ffffffffa0967dd9>] ? sptlrpc_svc_alloc_rs+0x1d9/0x2a0 [ptlrpc]
       [<ffffffffa093d533>] ? lustre_pack_reply_v2+0x93/0x280 [ptlrpc]
       [<ffffffffa093d7ce>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
       [<ffffffffa093d921>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
       [<ffffffffa09654e3>] ? req_capsule_server_pack+0x53/0x100 [ptlrpc]
       [<ffffffffa0d37f1e>] ? mdt_get_info+0xae/0x19b0 [mdt]
       [<ffffffffa0d29fbd>] ? mdt_unpack_req_pack_rep+0x4d/0x4d0 [mdt]
       [<ffffffffa093e52c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
       [<ffffffffa0d33cf7>] ? mdt_handle_common+0x647/0x16d0 [mdt]
       [<ffffffffa0d6d155>] ? mds_regular_handle+0x15/0x20 [mdt]
       [<ffffffffa094d978>] ? ptlrpc_server_handle_request+0x398/0xc60 [ptlrpc]
       [<ffffffffa062a54e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
       [<ffffffffa063ba9f>] ? lc_watchdog_touch+0x6f/0x170 [libcfs]
       [<ffffffffa0944d99>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
       [<ffffffff81063310>] ? default_wake_function+0x0/0x20
       [<ffffffffa094ecfd>] ? ptlrpc_main+0xabd/0x1700 [ptlrpc]
       [<ffffffffa094e240>] ? ptlrpc_main+0x0/0x1700 [ptlrpc]
       [<ffffffff81096936>] ? kthread+0x96/0xa0
       [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
       [<ffffffff810968a0>] ? kthread+0x0/0xa0
       [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
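
      The failing check compares the size recorded when the reply state buffer was allocated (rs->rs_size) against the size now required for the reply being packed. Below is a minimal standalone sketch of that bookkeeping; the struct layout, sizes, and helper names are simplified stand-ins for ptlrpc_reply_state and null_alloc_rs(), not the real Lustre code.

      #include <assert.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Simplified stand-in for the preallocated reply state; the real
       * ptlrpc_reply_state carries much more, but the size bookkeeping
       * is what matters for this assertion. */
      struct reply_state {
              size_t rs_size;   /* bytes allocated for this reply state */
              char   msg[];     /* reply message payload follows */
      };

      /* Preallocation path: the buffer is sized for the reply expected
       * at the time it is allocated. */
      static struct reply_state *prealloc_rs(size_t msgsize)
      {
              size_t rs_size = sizeof(struct reply_state) + msgsize;
              struct reply_state *rs = calloc(1, rs_size);

              if (rs != NULL)
                      rs->rs_size = rs_size;
              return rs;
      }

      /* Reuse path: a preallocated reply state is only checked, never
       * grown - this models the condition reported as
       * ASSERTION( rs->rs_size >= rs_size ) in sec_null.c. */
      static void reuse_rs(struct reply_state *rs, size_t msgsize)
      {
              size_t rs_size = sizeof(struct reply_state) + msgsize;

              assert(rs->rs_size >= rs_size);
      }

      int main(void)
      {
              struct reply_state *rs = prealloc_rs(128);

              if (rs == NULL)
                      return 1;
              reuse_rs(rs, 128);   /* fits: same size as preallocated */
              reuse_rs(rs, 512);   /* aborts: the reply grew past rs_size */
              printf("not reached\n");
              free(rs);
              return 0;
      }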
      

          Activity


            adilger Andreas Dilger added a comment -

            Close old bug

            shadow Alexey Lyashkov added a comment -

            Oleg,

            This looks like a new bug in the sptlrpc code, and is not related to the MDC<>MDT exchange.
            The OSC has its own pool for requests, preallocated with messages, but it looks like some replies need more space than was set at preallocation time, or it is related to the early reply.
            As far as I can see, lustre_pack_reply may be called more than once - first for the early reply and then for the real reply - in which case we will have a different request format and size by the time we reach:

            rc = sptlrpc_svc_alloc_rs(req, msg_len);

            Do we have a crash dump?
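
            To illustrate the scenario described above, here is a small standalone model; the msg_len() formula, the 152/1024 buffer sizes, and the helper names are purely illustrative stand-ins (not lustre_msg_size_v2() or the real reply formats). It shows how packing the reply twice with different formats yields two different lengths for the sptlrpc_svc_alloc_rs() step, so a reply state kept from the first, smaller pack would no longer satisfy rs->rs_size >= rs_size on the second pack.

            #include <stdio.h>
            #include <stddef.h>

            /* Lustre message buffers are 8-byte aligned; round up. */
            static size_t round8(size_t len)
            {
                    return (len + 7) & ~(size_t)7;
            }

            /* Illustrative model of computing a packed message length
             * from a list of buffer sizes (a stand-in for what
             * lustre_msg_size_v2() does; the 32-byte header overhead
             * here is made up). */
            static size_t msg_len(const size_t *buflens, int count)
            {
                    size_t len = 32 + 4 * (size_t)count;
                    int i;

                    for (i = 0; i < count; i++)
                            len += round8(buflens[i]);
                    return round8(len);
            }

            int main(void)
            {
                    /* First pack: a minimal early reply (one small buffer). */
                    size_t early[] = { 152 };
                    /* Second pack: the real reply also carries a value
                     * buffer - sizes are illustrative only. */
                    size_t real[] = { 152, 1024 };
                    size_t first = msg_len(early, 1);
                    size_t second = msg_len(real, 2);

                    printf("first pack: %zu bytes, second pack: %zu bytes\n",
                           first, second);
                    if (second > first)
                            printf("a reply state preallocated for the first "
                                   "pack is too small for the second\n");
                    return 0;
            }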

            utopiabound Nathaniel Clark added a comment -

            Oleg,

            The OST wasn't down. lustre-rsync-test/8 builds a directory tree with createmany and some nested for loops for directories, and then does a lustre_rsync to a local directory (on the client). I had been running that in a loop to try to recreate the bug I was looking for when the MDT went down. It's pretty reproducible; you just have to keep the filesystem mounted between runs. I can reproduce it if you want cleaner logs.
            green Oleg Drokin added a comment -

            Shadow, I wonder if you have an opinion on this?

            There was a class of bugs in the past that you worked on where a missing OST led to some smaller allocations, and then everything came down once we realized we had more OSTs in the system.

            Nathaniel, why is the OST down?


            People

              Assignee: wc-triage WC Triage
              Reporter: utopiabound Nathaniel Clark
              Votes: 0
              Watchers: 5
