Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14283

sanity-sec test 52 crashes due to ‘divide error: 0000 [#1] SMP PTI’

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.14.0
    • Lustre 2.14.0
    • 3
    • 9223372036854775807

    Description

      sanity-sec test_52 crashes due to ‘divide error: 0000 1 SMP PTI’. We’ve seen this test crash five times for review-dne-ssk or review-dne-selinux-ssk patch testing. Two recent crashes happened while testing RHEL8.3 clients and servers:
      https://testing.whamcloud.com/test_sets/13b5e7cf-b908-4799-9d40-9ba4b21dbf00
      https://testing.whamcloud.com/test_sets/e4c617c1-30e1-4c44-8454-3a2a36ff1189

      Looking at the client console log, we see

      [37329.207145] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock,test_dummy_encryption,skpath=/tmp/test-framework-keys trevis-8vm4@tcp:/lustre /mnt/lustre
      [37331.900380] Lustre: 1600717:0:(keysetup.c:472:llcrypt_get_encryption_info()) inode 144115188193296385: setting policy filenames_encryption_mode to null
      [37331.901861] Lustre: 1600717:0:(keysetup.c:472:llcrypt_get_encryption_info()) Skipped 5 previous similar messages
      [37334.491809] divide error: 0000 [#1] SMP PTI
      [37334.492358] CPU: 1 PID: 1550060 Comm: ptlrpcd_00_00 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-240.1.1.el8_3.x86_64 #1
      [37334.493661] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      [37334.494315] RIP: 0010:osc_announce_cached.isra.24+0xd4/0x3b0 [osc]
      [37334.494970] Code: af d0 48 8b 43 50 48 39 ca 48 0f 42 d1 48 89 d1 48 c1 e1 0c f6 80 c5 02 00 00 10 74 1a 8b b3 f0 00 00 00 48 8d 44 32 ff 31 d2 <48> f7 f6 0f af 83 ec 00 00 00 48 01 c1 48 81 f9 ff ff ff 6f b8 ff
      [37334.496814] RSP: 0018:ffffa4c700833a78 EFLAGS: 00010246
      [37334.497361] RAX: 0000000000023fff RBX: ffff958f013085f0 RCX: 0000000024000000
      [37334.498104] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff958f013086e4
      [37334.498838] RBP: ffff958f1ba8c150 R08: 0000000000033c62 R09: ffff958f1ba8c248
      [37334.499564] R10: ffff958efd282880 R11: 0000000000002000 R12: ffff958f013086e4
      [37334.500297] R13: 0000000000000010 R14: 0000000000000002 R15: ffff958f013085f0
      [37334.501029] FS:  0000000000000000(0000) GS:ffff958f3fd00000(0000) knlGS:0000000000000000
      [37334.501848] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [37334.502442] CR2: 00007fc4a2b46000 CR3: 000000005a20a002 CR4: 00000000000606e0
      [37334.503184] Call Trace:
      [37334.503498]  osc_brw_prep_request+0xf2e/0x15b0 [osc]
      [37334.504040]  osc_build_rpc+0xb82/0x10b0 [osc]
      [37334.504514]  osc_check_rpcs+0x1703/0x1880 [osc]
      [37334.505029]  ? __switch_to_asm+0x41/0x70
      [37334.505454]  ? __switch_to_asm+0x35/0x70
      [37334.505879]  ? __switch_to_asm+0x35/0x70
      [37334.506299]  ? __switch_to_asm+0x41/0x70
      [37334.506728]  osc_io_unplug0+0xc0/0x110 [osc]
      [37334.507194]  brw_queue_work+0x2e/0xc0 [osc]
      [37334.507801]  work_interpreter+0x32/0x110 [ptlrpc]
      [37334.508337]  ptlrpc_check_set+0x51a/0x21b0 [ptlrpc]
      [37334.508883]  ? finish_task_switch+0x77/0x2a0
      [37334.509368]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
      [37334.509889]  ptlrpcd+0x458/0x4c0 [ptlrpc]
      [37334.510332]  ? finish_wait+0x80/0x80
      [37334.510751]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
      [37334.511262]  kthread+0x112/0x130
      [37334.511622]  ? kthread_flush_work_fn+0x10/0x10
      [37334.512106]  ret_from_fork+0x35/0x40
      

      Attachments

        Activity

          [LU-14283] sanity-sec test 52 crashes due to ‘divide error: 0000 [#1] SMP PTI’
          pjones Peter Jones added a comment -

          Landed for 2.14

          pjones Peter Jones added a comment - Landed for 2.14

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41225/
          Subject: LU-14283 osc: avoid crash if ocd reset
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 1ec58a2e6ef98bacbb806f1b141ef38cdefe2226

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41225/ Subject: LU-14283 osc: avoid crash if ocd reset Project: fs/lustre-release Branch: master Current Patch Set: Commit: 1ec58a2e6ef98bacbb806f1b141ef38cdefe2226

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41256/
          Subject: LU-14283 obdclass: connect vs disconnect race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: 4f689a30fb5d8707270dfc9df9329b32fec8440f

          gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/41256/ Subject: LU-14283 obdclass: connect vs disconnect race Project: fs/lustre-release Branch: master Current Patch Set: Commit: 4f689a30fb5d8707270dfc9df9329b32fec8440f

          Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41256
          Subject: LU-14283 obdclass: connect vs disconnect race
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 43d0981915600fa6030ccc4d496d2824745633c7

          gerrit Gerrit Updater added a comment - Wang Shilong (wshilong@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/41256 Subject: LU-14283 obdclass: connect vs disconnect race Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 43d0981915600fa6030ccc4d496d2824745633c7
          jhammond John Hammond added a comment -

          It seems very unlikely, but if if osd_connect_flags includes GRANT_PARAM but not GRANT then we could get here.

          jhammond John Hammond added a comment - It seems very unlikely, but if if osd_connect_flags includes GRANT_PARAM but not GRANT then we could get here.

          I did not find why we could hit this issue.

          Notice once @cl_max_extent_pages is inited ever, even reconnect won't clear it to be zero unless i miss something here, and there is a window
          it could be zero but that was protected by spinlock.

          so big possibility is it was zero initially...

          wshilong Wang Shilong (Inactive) added a comment - I did not find why we could hit this issue. Notice once @cl_max_extent_pages is inited ever, even reconnect won't clear it to be zero unless i miss something here, and there is a window it could be zero but that was protected by spinlock. so big possibility is it was zero initially...
          jhammond John Hammond added a comment -

          > The other question to ask is how can we end up in osc_announce_cached() if the OSC obd_device is not setup.

          I do not think this is possible.

          jhammond John Hammond added a comment - > The other question to ask is how can we end up in osc_announce_cached() if the OSC obd_device is not setup. I do not think this is possible.

          Thanks for this analysis, this is helpful.

          The messages like:

          [13483.414663] LustreError: 809674:0:(gss_cli_upcall.c:270:gss_do_ctx_init_rpc()) obd lustre-OST0007-osc-ffff99b9f9160800 not setup
          

          occur because the client tries to initialize its GSS context before the obd device is actually set up. This GSS initialization operation is triggered from userspace by lgss_keyring upon key request, so it can happen completely asynchronously. In any case, I would not tend to consider this is harmful.

          The other question to ask is how can we end up in osc_announce_cached() if the OSC obd_device is not setup.

          sebastien Sebastien Buisson added a comment - Thanks for this analysis, this is helpful. The messages like: [13483.414663] LustreError: 809674:0:(gss_cli_upcall.c:270:gss_do_ctx_init_rpc()) obd lustre-OST0007-osc-ffff99b9f9160800 not setup occur because the client tries to initialize its GSS context before the obd device is actually set up. This GSS initialization operation is triggered from userspace by lgss_keyring upon key request, so it can happen completely asynchronously. In any case, I would not tend to consider this is harmful. The other question to ask is how can we end up in osc_announce_cached() if the OSC obd_device is not setup.
          jhammond John Hammond added a comment -

          sebastien could you look again?

          See https://testing.whamcloud.com/test_sets/13b5e7cf-b908-4799-9d40-9ba4b21dbf00 and the client logs at https://testing.whamcloud.com/test_logs/d8e65e1b-ad1c-4b27-9318-07ebeab40ea2/show_text

          I see several messages of the form

          [13483.414663] LustreError: 809674:0:(gss_cli_upcall.c:270:gss_do_ctx_init_rpc()) obd lustre-OST0007-osc-ffff99b9f9160800 not setup
          

          The obd device found does not have the obd_set_up bit set. In obd_import_event() if that bit is not set then the event is ignored:

          static inline void obd_import_event(struct obd_device *obd,
                                              struct obd_import *imp,
                                              enum obd_import_event event)
          {
                  ENTRY;
                  if (!obd) {
                          CERROR("NULL device\n");
                          EXIT;
                          return;
                  }
          
                  if (obd->obd_set_up && OBP(obd, import_event))
                          OBP(obd, import_event)(obd, imp, event);
          
                  EXIT;
          }
          

          In osc_import_event() we have:

                  case IMP_EVENT_OCD: {
                          struct obd_connect_data *ocd = &imp->imp_connect_data;
          
          		if (ocd->ocd_connect_flags & OBD_CONNECT_GRANT)
                                  osc_init_grant(&obd->u.cli, ocd);
          
                          /* See bug 7198 */
                          if (ocd->ocd_connect_flags & OBD_CONNECT_REQPORTAL)
                                  imp->imp_client->cli_request_portal =OST_REQUEST_PORTAL;
          
          		rc = obd_notify_observer(obd, obd, OBD_NOTIFY_OCD);
                          break;
                  }
          

          Note:

          osc_init_grant() is responsible for initializing the cl_max_extent_pages field to a non zero value (according to the OCD data):

                          /* determine maximum extent size, in #pages */
                          size = (u64)ocd->ocd_grant_max_blks << ocd->ocd_grant_blkbits;
          		cli->cl_max_extent_pages = size >> PAGE_SHIFT;
          

          cl_max_extent_pages being zero is giving us the divide error in osc_announce_cached():

                                  /* take extent tax into account when asking for more
                                   * grant space */
                                  nrextents = (nrpages + cli->cl_max_extent_pages - 1)  /
                                               cli->cl_max_extent_pages;
          

          Could you determine why the OSC obd_device is not getting setup here?

          jhammond John Hammond added a comment - sebastien could you look again? See https://testing.whamcloud.com/test_sets/13b5e7cf-b908-4799-9d40-9ba4b21dbf00 and the client logs at https://testing.whamcloud.com/test_logs/d8e65e1b-ad1c-4b27-9318-07ebeab40ea2/show_text I see several messages of the form [13483.414663] LustreError: 809674:0:(gss_cli_upcall.c:270:gss_do_ctx_init_rpc()) obd lustre-OST0007-osc-ffff99b9f9160800 not setup The obd device found does not have the obd_set_up bit set. In obd_import_event() if that bit is not set then the event is ignored: static inline void obd_import_event(struct obd_device *obd, struct obd_import *imp, enum obd_import_event event) { ENTRY; if (!obd) { CERROR( "NULL device\n" ); EXIT; return ; } if (obd->obd_set_up && OBP(obd, import_event)) OBP(obd, import_event)(obd, imp, event); EXIT; } In osc_import_event() we have: case IMP_EVENT_OCD: { struct obd_connect_data *ocd = &imp->imp_connect_data; if (ocd->ocd_connect_flags & OBD_CONNECT_GRANT) osc_init_grant(&obd->u.cli, ocd); /* See bug 7198 */ if (ocd->ocd_connect_flags & OBD_CONNECT_REQPORTAL) imp->imp_client->cli_request_portal =OST_REQUEST_PORTAL; rc = obd_notify_observer(obd, obd, OBD_NOTIFY_OCD); break ; } Note: osc_init_grant() is responsible for initializing the cl_max_extent_pages field to a non zero value (according to the OCD data): /* determine maximum extent size, in #pages */ size = (u64)ocd->ocd_grant_max_blks << ocd->ocd_grant_blkbits; cli->cl_max_extent_pages = size >> PAGE_SHIFT; cl_max_extent_pages being zero is giving us the divide error in osc_announce_cached() : /* take extent tax into account when asking for more * grant space */ nrextents = (nrpages + cli->cl_max_extent_pages - 1) / cli->cl_max_extent_pages; Could you determine why the OSC obd_device is not getting setup here?

          The above patch is not really a fix for the root problem of (what appears to be) a case of the obd_connect_data being zeroed out during reconnect (or similar). It is only working around the crash to avoid taking out the client. Probably this patch needs to be reverted in order to reproduce the root cause.

          adilger Andreas Dilger added a comment - The above patch is not really a fix for the root problem of (what appears to be) a case of the obd_connect_data being zeroed out during reconnect (or similar). It is only working around the crash to avoid taking out the client. Probably this patch needs to be reverted in order to reproduce the root cause.

          People

            wshilong Wang Shilong (Inactive)
            jamesanunez James Nunez (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: