Lustre / LU-20158

BUG: unable to handle kernel paging request at 000000000000121a

Details

    • Bug
    • Resolution: Fixed
    • Medium
    • Lustre 2.18.0
    • None
    • None
    • 3

    Description

      The client was mounted over o2ib, then unmounted.

      Kernel modules were not reloaded.

      Discovery was disabled and peers were deleted manually.

      kfi and tcp networks were added to the client and server.

      Client re-mount over o2ib:

      mount -t lustre -o retry=2,network=o2ib 172.18.2.8@o2ib:172.18.2.7@o2ib:/lustre /mnt/lustre
      

      Client mount over kfi:

      mount -t lustre -o retry=2,network=kfi1 16@kfi1:/lustre /mnt/lustre-kfi1
      

      Client BUG:

        [168432.962201] Lustre: 771313:0:(client.c:131:ptlrpc_uuid_to_connection()) cannot find peer 172.18.2.7@o2ib!
        [168432.971854] LustreError: 771313:0:(ldlm_lib.c:573:client_obd_setup()) can't add initial connection
        [168432.980907] BUG: unable to handle kernel paging request at 000000000000121a
        [168432.987948] PGD 0 P4D 0
        [168432.990570] Oops: 0000 [#1] SMP NOPTI
        [168432.994315] CPU: 23 PID: 771313 Comm: llog_process_th Kdump: loaded Tainted: G           OE     -------- -  - 4.18.0-553.36.1.el8_10.x86_64 #1
        [168433.007165] Hardware name: Viking Enterprise Solutions VSSEP1EA/VSSEP1EA, BIOS 10.09.02 10/26/2020
        [168433.016196] RIP: 0010:mdc_device_free+0xb/0x160 [mdc]
        [168433.021367] Code: 48 c7 05 c4 ff 01 00 00 00 00 00 e8 0f f9 6c ff e9 46 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 53 48 8b 6e 28 <66> 83 bd 1a 12 00 00 00 75 20 48 89 f3 48 85 f6 75 48 31 ff e8 4c
        [168433.040191] RSP: 0018:ffffa5072d30fba8 EFLAGS: 00010282
        [168433.045502] RAX: 00000000ffffff9b RBX: ffffffffffffff9b RCX: 0000000000000000
        [168433.052711] RDX: 0000000000000000 RSI: ffff97607dc58880 RDI: ffffa5072d30fc18
        [168433.059922] RBP: 0000000000000000 R08: 0000000080000000 R09: 0000000000000000
        [168433.067140] R10: 00000000ffffff9b R11: ffff975f742e43fd R12: ffff975f39cb8000
        [168433.074349] R13: ffffa5072d30fc18 R14: 0000000000000000 R15: 0000000000000000
        [168433.081560] FS:  0000000000000000(0000) GS:ffff977ead2c0000(0000) knlGS:0000000000000000
        [168433.089731] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [168433.095556] CR2: 000000000000121a CR3: 0000002a9e210000 CR4: 0000000000350ee0
        [168433.102775] Call Trace:
        [168433.105317]  ? __die_body+0x1a/0x60
        [168433.108900]  ? no_context+0x1ba/0x3f0
        [168433.112653]  ? __bad_area_nosemaphore+0x157/0x180
        [168433.117444]  ? do_page_fault+0x37/0x12d
        [168433.121370]  ? page_fault+0x1e/0x30
        [168433.124950]  ? mdc_device_free+0xb/0x160 [mdc]
        [168433.129488]  mdc_device_alloc+0x1f5/0x240 [mdc]
        [168433.134111]  obd_setup+0x224/0x470 [obdclass]
        [168433.138738]  class_setup+0x5b7/0x760 [obdclass]
        [168433.143400]  class_process_config+0x120a/0x20b0 [obdclass]
        [168433.149023]  class_config_llog_handler+0x727/0x11c0 [obdclass]
        [168433.154986]  llog_process_thread+0xd79/0x1b30 [obdclass]
        [168433.160426]  ? llog_validate+0x370/0x370 [obdclass]
        [168433.165434]  llog_process_thread_daemonize+0x70/0x90 [obdclass]
        [168433.171482]  kthread+0x134/0x150
        [168433.174804]  ? set_kthread_struct+0x50/0x50
        [168433.179073]  ret_from_fork+0x35/0x40
      

      RCA from Sonnet:

      NULL Pointer Dereference Bug (Crash)
      
        The real kernel BUG is in mdc_device_alloc() (mdc_dev.c:1741): when mdc_setup() fails, it calls
        mdc_device_free(env, d) as cleanup, but d->ld_obd is still NULL at this point.
      
        The framework (obd_setup() in obd_class.h:584) only sets dev->ld_obd = obd after ldto_device_alloc() returns
        successfully. Since the error occurs inside mdc_device_alloc(), this assignment never happens.
      
        mdc_device_free() then does:
      
         struct obd_device *obd = lu->ld_obd;  // NULL
         struct client_obd *cli = &obd->u.cli; // NULL + offset
         LASSERT(cli->cl_mod_rpcs_in_flight == 0); // → fault at 0x0 + 0x121a = 0x121a
      
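        CR2=0x121a is exactly offsetof(struct obd_device, u.cli.cl_mod_rpcs_in_flight) with a NULL base pointer. The
        faulting instruction bytes in the Code: line (66 83 bd 1a 12 00 00 00) decode to cmpw $0x0,0x121a(%rbp), and
        RBP=0 in the register dump confirms the NULL base.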
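
The address arithmetic above can be checked with a small user-space sketch. The struct names, padding fields and sizes here are hypothetical, chosen only so that the field lands at the 0x121a offset quoted in the analysis. The real struct obd_device and struct client_obd layouts are different and are not reproduced.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Mock layouts for illustration only. The padding sizes are assumptions
 * picked so that the field sits at 0x121a; they do not describe the real
 * Lustre structures.
 */
struct mock_client_obd {
    char     cl_pad[0x100];
    uint16_t cl_mod_rpcs_in_flight;   /* 16-bit, consistent with the cmpw */
};

struct mock_obd_device {
    char obd_pad[0x111a];
    union {
        struct mock_client_obd cli;
    } u;
};

/*
 * Address the CPU would touch when reading the field through 'base'.
 * Nothing is dereferenced; this only computes base + offset.
 */
static uintptr_t field_address(uintptr_t base)
{
    return base + offsetof(struct mock_obd_device,
                           u.cli.cl_mod_rpcs_in_flight);
}
```

With a NULL base, field_address(0) is just the member offset, which is why CR2 holds a small address like 0x121a rather than 0.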
      
        osc_device_alloc() has the identical bug (osc/osc_dev.c:206).
      
        Fix
      
        In mdc_device_alloc() (and symmetrically osc_device_alloc()), set d->ld_obd = obd before calling mdc_setup()/
        osc_setup(), following the pattern of server-side drivers (lod, ofd, osp all set ld_obd in their own alloc
        functions):
      
         obd->obd_lu_dev = d;
         d->ld_obd = obd;          // add this
         rc = mdc_setup(obd, cfg);
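
The ordering issue can be modelled in user space with the minimal sketch below. All mock_* names are hypothetical stand-ins, not Lustre code. The free path records a NULL back-pointer instead of dereferencing it, so the sketch can show both orderings without crashing. The -ENETUNREACH return value is an assumption: it is picked because it equals -101, which appears to match the 0xffffff9b seen in RAX/RBX in the dump.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Mock types for illustration only; not the real Lustre definitions. */
struct mock_obd {
    unsigned short cl_mod_rpcs_in_flight;
};

struct mock_lu_device {
    struct mock_obd *ld_obd;   /* back-pointer the free path relies on */
};

/* Set when the free path sees a NULL back-pointer, i.e. at the point
 * where the kernel code would have faulted. */
static bool saw_null_obd;

static void mock_device_free(struct mock_lu_device *d)
{
    if (d->ld_obd == NULL)
        saw_null_obd = true;   /* the real code dereferences here */
    free(d);
}

/* Stand-in for mdc_setup(): fails on request, like a missing peer. */
static int mock_setup(bool fail)
{
    return fail ? -ENETUNREACH : 0;
}

/* early_backptr selects the proposed ordering (true) or the current
 * one (false). */
static struct mock_lu_device *mock_device_alloc(struct mock_obd *obd,
                                                bool fail_setup,
                                                bool early_backptr,
                                                int *rc)
{
    struct mock_lu_device *d = calloc(1, sizeof(*d));

    if (d == NULL) {
        *rc = -ENOMEM;
        return NULL;
    }
    if (early_backptr)
        d->ld_obd = obd;       /* the one-line change from the fix */

    *rc = mock_setup(fail_setup);
    if (*rc != 0) {
        mock_device_free(d);   /* cleanup on the error path */
        return NULL;
    }
    d->ld_obd = obd;           /* what the framework does after success */
    return d;
}
```

Calling mock_device_alloc() with fail_setup set and early_backptr clear leaves saw_null_obd set, mirroring the crash. With early_backptr set, the same failure is cleaned up without the free path ever seeing a NULL ld_obd.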
      

          People

            hornc Chris Horn