LU-17269: el9.3 crash conf-sanity test_41c Oops in class_setup()

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • Lustre 2.16.0
    • None
    • 3

    Description

      [ 3093.416284] Lustre: DEBUG MARKER: == conf-sanity test 41c: concurrent mounts of MDT/OST should all fail but one ========================================================== 19:54:14 (1699300454)
      ...
      [ 3149.141357] LustreError: 187855:0:(libcfs_fail.h:190:cfs_race()) cfs_race id 716 sleeping
      [ 3149.143276] LustreError: 187854:0:(libcfs_fail.h:201:cfs_race()) cfs_fail_race id 716 waking
      [ 3149.143494] LustreError: 187855:0:(libcfs_fail.h:199:cfs_race()) cfs_fail_race id 716 awake: rc=500
      [ 3149.143591] LustreError: 187855:0:(obd_config.c:696:class_setup()) Device 0 setup in progress (type osd-zfs)
      [ 3149.143660] LustreError: 187855:0:(obd_mount.c:213:lustre_start_simple()) lustre-MDT0000-osd setup error -17
      [ 3149.143731] LustreError: 187855:0:(tgt_mount.c:2183:server_fill_super()) Unable to start osd on lustre-mdt1/mdt1: -17
      [ 3149.143804] LustreError: 187855:0:(super25.c:188:lustre_fill_super()) llite: Unable to mount <unknown>: rc = -17
      [ 3149.143896] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
      [ 3149.144137] CPU: 0 PID: 187854 Comm: mount.lustre Tainted: G        W  O     --------- -  - 4.18.0 #2
      [ 3149.144266] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc36 04/01/2014
      [ 3149.144445] RIP: 0010:class_setup+0x610/0xad0 [obdclass]
      [ 3149.144519] Code: 05 61 f0 09 00 00 00 00 00 e8 2c 3a ea ff 31 d2 be 2f 02 00 00 48 c7 c7 10 3b 98 c0 e8 49 65 7a e2 e8 b4 ed c6 e2 48 8b 04 24 <48> 8b 40 28 48 83 f8 01 0f 84 8e 03 00 00 48 8b 04 24 48 8b 48 28
      [ 3149.144747] RSP: 0018:ffff9206ab77bae8 EFLAGS: 00010246
      [ 3149.144814] RAX: 6b6b6b6b6b6b6b6b RBX: ffff9206a6cf4600 RCX: 000000000002d000
      [ 3149.144912] RDX: 0000000000000000 RSI: 000000000000022f RDI: ffffffffc0983b10
      [ 3149.145018] RBP: ffff9206b42b0530 R08: ffffffffc07d5000 R09: ffffffffa3e0bbc0
      [ 3149.145117] R10: ffff9206ab77ba20 R11: ffff9206ad3457a3 R12: ffff9206b42b0110
      [ 3149.145233] R13: ffff9206b42b02b8 R14: ffff9206b42b0048 R15: 0000000000000000
      [ 3149.145334] FS:  00007f1f838808c0(0000) GS:ffff9206cfe00000(0000) knlGS:0000000000000000
      [ 3149.145434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 3149.145528] CR2: 0000000000667000 CR3: 00000001908f7003 CR4: 0000000000370eb0
      [ 3149.145634] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 3149.145736] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 3149.145835] Call Trace:
      [ 3149.146062]  ? libcfs_debug_msg+0x9be/0xb00 [libcfs]
      [ 3149.146380]  ? xas_load+0x8/0x80
      [ 3149.146452]  ? xas_find+0x173/0x1b0
      [ 3149.146854]  ? xa_find+0xae/0xe0
      [ 3149.146911]  ? do_raw_spin_unlock+0x44/0xc0
      [ 3149.146973]  ? _raw_spin_unlock+0x1a/0x30
      [ 3149.147061]  class_process_config+0x14fa/0x2e60 [obdclass]
      [ 3149.147154]  ? do_lcfg+0x15a/0x4b0 [obdclass]
      [ 3149.147247]  do_lcfg+0x223/0x4b0 [obdclass]
      [ 3149.147322]  lustre_start_simple+0x72/0x1c0 [obdclass]
      [ 3149.147471]  osd_start+0x565/0x7b0 [ptlrpc]
      [ 3149.147536]  ? kstrtou16+0x1b/0x40
      [ 3149.147607]  ? target_name2index+0x106/0x140 [obdclass]
      [ 3149.147721]  server_fill_super+0x327/0x1100 [ptlrpc]
      [ 3149.147814]  ? obd_zombie_barrier+0x36/0x90 [obdclass]
      [ 3149.147889]  ? debug_mutex_init+0x31/0x40
      [ 3149.147978]  lustre_fill_super+0x390/0x480 [lustre]
      [ 3149.148066]  ? lustre_mount+0x10/0x10 [lustre]
      [ 3149.148141]  mount_nodev+0x41/0x90
      

      In my opinion, this problem was introduced by commit c5e5060d950 ("LU-8802 obd: remove MAX_OBD_DEVICES"), where the name lookup and the xa_alloc() insertion are not done atomically:

      	if (class_name2dev(new_obd->obd_name) == -1) {
      		class_incref(new_obd, "obd_device_list", new_obd);
      		rc = xa_alloc(&obd_devs, &dev_no, new_obd,
      			      xa_limit_31b, GFP_ATOMIC);
      

      As a result, two threads can both pass the class_name2dev() check and create OBDs with the same name:

      00000020:00000080:0.0:1699293418.519360:0:185838:0:(genops.c:417:class_newdev()) Allocate new device lustre-OST0000-osd (00000000b8694366)
      00000020:00000080:1.0:1699293418.519360:0:185839:0:(genops.c:417:class_newdev()) Allocate new device lustre-OST0000-osd (00000000e7494c1a)
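
The check-then-insert race described above can be illustrated with a small userspace sketch. This is not the Lustre fix and not kernel code: it uses pthreads, and every name in it (dev_registry, registry_add_unique(), race_same_name(), MAX_DEVS) is hypothetical. It only shows the general idea of holding one lock across both the name lookup and the insertion, so that concurrent registrations of the same name cannot both succeed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEVS 16	/* hypothetical capacity, only for this sketch */
#define NAME_LEN 64

struct dev_registry {
	pthread_mutex_t lock;	/* protects names[] and count */
	char names[MAX_DEVS][NAME_LEN];
	int count;
};

static void registry_init(struct dev_registry *reg)
{
	pthread_mutex_init(&reg->lock, NULL);
	reg->count = 0;
}

/* Caller must hold reg->lock. Returns the index of name, or -1. */
static int registry_find_locked(const struct dev_registry *reg,
				const char *name)
{
	for (int i = 0; i < reg->count; i++)
		if (strcmp(reg->names[i], name) == 0)
			return i;
	return -1;
}

/*
 * Register name only if it is not present yet. The lookup and the
 * insertion happen under the same lock, so two threads cannot both
 * see "not found" and then both insert.
 * Returns the new index, or -1 if the name exists or the table is full.
 */
static int registry_add_unique(struct dev_registry *reg, const char *name)
{
	int idx = -1;

	pthread_mutex_lock(&reg->lock);
	if (registry_find_locked(reg, name) == -1 && reg->count < MAX_DEVS) {
		idx = reg->count++;
		snprintf(reg->names[idx], NAME_LEN, "%s", name);
	}
	pthread_mutex_unlock(&reg->lock);
	return idx;
}

struct worker_arg {
	struct dev_registry *reg;
	const char *name;
	int result;
};

static void *worker(void *p)
{
	struct worker_arg *arg = p;

	arg->result = registry_add_unique(arg->reg, arg->name);
	return NULL;
}

/* Start nthreads threads registering the same name; count the winners. */
static int race_same_name(struct dev_registry *reg, const char *name,
			  int nthreads)
{
	pthread_t tids[MAX_DEVS];
	struct worker_arg args[MAX_DEVS];
	int winners = 0;

	assert(nthreads > 0 && nthreads <= MAX_DEVS);
	for (int i = 0; i < nthreads; i++) {
		args[i] = (struct worker_arg){ reg, name, -1 };
		int rc = pthread_create(&tids[i], NULL, worker, &args[i]);
		assert(rc == 0);
	}
	for (int i = 0; i < nthreads; i++) {
		pthread_join(tids[i], NULL);
		if (args[i].result >= 0)
			winners++;
	}
	return winners;
}
```

In this sketch, calling race_same_name() with several threads and one name leaves exactly one entry in the registry, which mirrors what test 41c expects ("concurrent mounts of MDT/OST should all fail but one"). Without the lock spanning both steps, more than one thread could insert the same name.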
      

          Activity

            pjones Peter Jones added a comment -

            Merged for 2.16

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54747/
            Subject: LU-17269 obdclass: fix locking for class_register/deregister
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: e8d318054063c860f1e039890792ab25950eb8de

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/54744/
            Subject: LU-17269 tests: exclude conf-sanity/41c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: a24f55f500d8ff320225dc20e18278a58c37285b

            gerrit Gerrit Updater added a comment -

            "Timothy Day <timday@amazon.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54747
            Subject: LU-17269 obdclass: fix locking for class_register/deregister
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a85d87f892883c0258a978212e8518548384f7ec

            gerrit Gerrit Updater added a comment - - edited

            "Andreas Dilger <adilger@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54744
            Subject: LU-17269 tests: exclude conf-sanity/41c
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 209237cac446778bf2af2e0a53e02a8d6dca5305

            adilger Andreas Dilger added a comment - - edited

            timday, it looks like the master testing is crashing about 1/2 of runs (18/30 runs).

            I think this might relate to changing the review test sessions to run with two MDS nodes and 4 MDTs, or possibly some difference in alignment of the planets (or electrons in RAM) after the eclipse passing over the colo hosting the test cluster...

            Please add extra testing to confirm this intermittent failure is actually fixed:

            Test-Parameters: testlist=conf-sanity env=ONLY=41c,ONLY_REPEAT=20
            
            timday Tim Day added a comment -

            There's some serialization in https://review.whamcloud.com/c/fs/lustre-release/+/53606 that should fix this. I can split up that patch so that part can land faster. The other hash table stuff can wait.

            adilger Andreas Dilger added a comment -

            This has been failing daily in master lustre-review testing since 2024-04-10, whereas previously it failed only about once a week in "full" testing after branch landings:
            https://testing.whamcloud.com/search?horizon=15552000&status%5B%5D=CRASH&test_set_script_id=7f66aa20-3db2-11e0-80c0-52540025f9af&sub_test_script_id=553e5ade-1fe2-11e4-8610-5254006e85c2&source=sub_tests#redirect

            It looks like it is only failing with el9.3 servers.

            gerrit Gerrit Updater added a comment -

            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/53011
            Subject: LU-17269 obdclass: serialize obddev creation
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: ccbe5c2c6c442f021a3fda9e6f418a5052956897

            People

              timday Tim Day
              bzzz Alex Zhuravlev