[LU-11232] replay-ost-single test_0b: BUG: unable to handle kernel NULL pointer dereference Created: 09/Aug/18  Updated: 14/Aug/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.5
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-10573 mdt_destroy_export()) ASSERTION( list... Open
duplicates LU-9806 tgt_client_free()) ASSERTION( lut && ... Resolved
duplicates LU-10806 Hard crash when mounting DNE MDT Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/7c77283e-9882-11e8-b0aa-52540065bddc

test_0b failed with the following error:

trevis-9vm10 crashed during replay-ost-single test_0b

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-ost-single test_0b - trevis-9vm10 crashed during replay-ost-single test_0b



 Comments   
Comment by Sarah Liu [ 09/Aug/18 ]

server/client: SLES12sp3 2.10.5 rc1

OST dmesg

[  270.466705] Lustre: DEBUG MARKER: test -b /dev/lvm-Role_OSS/P4
[  270.504171] Lustre: DEBUG MARKER: e2label /dev/lvm-Role_OSS/P4
[  270.539602] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre-ost4; mount -t lustre   		                   /dev/lvm-Role_OSS/P4 /mnt/lustre-ost4
[  270.580165] LDISKFS-fs warning (device dm-3): ldiskfs_multi_mount_protect:328: MMP interval 42 higher than expected, please wait.
               
[  273.124433] Lustre: lustre-OST0001: Denying connection for new client de942e5a-e67a-c940-aec3-3a8bef12e248(at 10.9.4.98@tcp), waiting for 4 known clients (1 recovered, 0 in progress, and 0 evicted) to recover in 0:20
[  273.124438] Lustre: Skipped 21 previous similar messages
[  275.002105] Lustre: lustre-OST0002: Will be in recovery for at least 1:00, or until 4 clients reconnect
[  275.002167] Lustre: lustre-OST0002: Connection restored to 10.9.4.104@tcp (at 10.9.4.104@tcp)
[  279.319191] Lustre: lustre-OST0002: Connection restored to 5252abe8-f295-e986-07d7-b18acab9614a (at 10.9.4.101@tcp)
[  279.319198] Lustre: Skipped 1 previous similar message
[  288.125028] LustreError: 137-5: lustre-OST0003_UUID: not available for connect from 10.9.4.98@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[  288.125034] LustreError: Skipped 75 previous similar messages
[  293.887723] Lustre: lustre-OST0001: recovery is timed out, evict stale exports
[  293.887738] Lustre: lustre-OST0001: disconnecting 3 stale clients
[  293.939466] Lustre: lustre-OST0001: Recovery over after 1:06, of 4 clients 1 recovered and 3 were evicted.
[  294.285458] Lustre: lustre-OST0001: Connection restored to cdaa6359-d98e-8831-5ce1-1b42a13d2b24 (at 10.9.4.100@tcp)
[  313.518413] LDISKFS-fs (dm-3): recovery complete
[  313.530904] LDISKFS-fs (dm-3): mounted filesystem with ordered data mode. Opts: ,errors=remount-ro,no_mbcache,nodelalloc
[  313.545257] LustreError: 3564:0:(osd_scrub.c:2784:osd_scrub_setup()) dm-3: UUID has been changed from 664b5c1b-7333-88c0-43f0-1645592e9480 to 470cd461-31b1-a8d5-48b2-c736bed6d0db
[  313.551387] LDISKFS-fs error (device dm-3): ldiskfs_lookup:1968: inode #40972: comm OI_scrub: deleted inode referenced: 65549
[  313.567933] Aborting journal on device dm-3-8.
[  313.933723] LDISKFS-fs (dm-3): Remounting filesystem read-only
[  313.946278] LustreError: 3564:0:(llog.c:1313:llog_backup()) MGC10.9.4.104@tcp: failed to open backup logfile lustre-OST0003T: rc = -30
[  313.946284] LustreError: 3564:0:(mgc_request.c:1897:mgc_llog_local_copy()) MGC10.9.4.104@tcp: failed to copy remote log lustre-OST0003: rc = -30
[  313.952302] LustreError: 3579:0:(tgt_lastrcvd.c:1122:tgt_client_del()) lustre-OST0003: failed to update server data, skip client 6155c338-c467-e54e-4cd8-161d004528a0 zeroing, rc -30
[  313.954793] BUG: unable to handle kernel NULL pointer dereference at 0000000000000378
[  313.956197] IP: [<ffffffff81618ffc>] _raw_spin_lock+0xc/0x20
[  313.957212] PGD 800000007a65c067 PUD 373f7067 PMD 0 
[  313.958270] Oops: 0002 [#1] SMP 
[  313.959012] Modules linked in:
[  313.959459] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff88007be93040[0x0, 1, [0x1:0x0:0x0] hash exist]{
               
[  313.959463] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff88007be93090
               
[  313.959468] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff88007c44d500osd-ldiskfs-object@ffff88007c44d500(i:ffff88006867e750:81/3077371105)[plain]
               
[  313.959471] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff88007be93040
               
[  313.959554] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff8800657b17c0[0x0, 1, [0x200000003:0x0:0x0] hash exist]{
               
[  313.959557] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff8800657b1810
               
[  313.959561] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff8800374b9d00osd-ldiskfs-object@ffff8800374b9d00(i:ffff880068578bd8:80/615556883)[plain]
               
[  313.959564] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff8800657b17c0
               
[  313.959845] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff8800655af040[0x0, 1, [0xa:0x0:0x0] hash exist]{
               
[  313.959848] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff8800655af090
               
[  313.959852] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff8800374b9800osd-ldiskfs-object@ffff8800374b9800(i:ffff88006868b2c8:82/3077371139)[plain]
               
[  313.959854] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff8800655af040
               
[  313.959920] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff8800655af580[0x0, 1, [0xa:0x2:0x0] hash exist]{
               
[  313.959923] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff8800655af5d0
               
[  313.959929] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff8800370e3000osd-ldiskfs-object@ffff8800370e3000(i:ffff88006865ab58:83/3077371140)[plain]
               
[  313.959931] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff8800655af580
               
[  313.960365] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) header@ffff880065b4edc0[0x0, 1, [0x200000001:0x1017:0x0] hash exist]{
               
[  313.960368] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....local_storage@ffff880065b4ee10
               
[  313.960373] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) ....osd-ldiskfs@ffff88007b667800osd-ldiskfs-object@ffff88007b667800(i:ffff880068656208:12/3002225808)[plain]
               
[  313.960375] LustreError: 3579:0:(ofd_dev.c:251:ofd_stack_fini()) } header@ffff880065b4edc0
               
[  313.967885] LustreError: 3579:0:(obd_config.c:558:class_setup()) setup lustre-OST0003 failed (-30)
[  313.967889] LustreError: 3579:0:(obd_config.c:1682:class_config_llog_handler()) MGC10.9.4.104@tcp: cfg command failed: rc = -30
[  313.967893] Lustre:    cmd=cf003 0:lustre-OST0003  1:dev  2:0  3:f  
               
[  313.967952] LustreError: 15c-8: MGC10.9.4.104@tcp: The configuration from log 'lustre-OST0003' failed (-30). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
[  313.967954] LustreError: 3564:0:(obd_mount_server.c:1386:server_start_targets()) failed to start server lustre-OST0003: -30
[  313.967990] LustreError: 3564:0:(obd_mount_server.c:1879:server_fill_super()) Unable to start targets: -30
[  313.968004] LustreError: 3564:0:(obd_config.c:609:class_cleanup()) Device 12 not setup
[  313.994053]  osp(OEN)
[  313.994392]  ofd(OEN) lfsck(OEN) ost(OEN) mgc(OEN) osd_ldiskfs(OEN) ldiskfs(OEN) lquota(OEN) fid(OEN) fld(OEN) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) libcfs(OEN) sd_mod(E) iscsi_tcp(E) libiscsi_tcp(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) af_packet(E) iscsi_ibft(E) iscsi_boot_sysfs(E) rpcrdma(E) sunrpc(E) ib_isert(E) iscsi_target_mod(E) ib_iser(E) libiscsi(E) scsi_transport_iscsi(E) ib_srpt(E) target_core_mod(E) ib_srp(E) scsi_transport_srp(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) configfs(E) ib_cm(E) iw_cm(E) ib_core(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) jitterentropy_rng(E) drbg(E) ansi_cprng(E) aesni_intel(E) aes_x86_64(E) lrw(E) gf128mul(E) glue_helper(E)
[  314.004777]  ablk_helper(E) cryptd(E) 8139too(E) 8139cp(E) pcspkr(E) joydev(E) virtio_balloon(E) mii(E) i2c_piix4(E) processor(E) button(E) ext4(E) crc16(E) jbd2(E) mbcache(E) ata_generic(E) virtio_blk(E) ata_piix(E) floppy(E) serio_raw(E) ahci(E) libahci(E) libata(E) virtio_pci(E) virtio_ring(E) virtio(E) uhci_hcd(E) ehci_hcd(E) usbcore(E) usb_common(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) autofs4(E)
[  314.010930] Supported: No, Unsupported modules are loaded
[  314.011495] CPU: 1 PID: 2363 Comm: obd_zombid Tainted: G           OE   N  4.4.132-94.33_lustre-default #1
[  314.012447] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  314.013041] task: ffff88007a724680 ti: ffff88007be74000 task.ti: ffff88007be74000
[  314.013797] RIP: 0010:[<ffffffff81618ffc>]  [<ffffffff81618ffc>] _raw_spin_lock+0xc/0x20
[  314.014688] RSP: 0018:ffff88007be77dd0  EFLAGS: 00010246
[  314.015243] RAX: 0000000000000000 RBX: ffff880066b09c00 RCX: ffffea00018c57e0
[  314.015964] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000000000378
[  314.016692] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  314.017466] R10: 0000000000000000 R11: ffff88006317dbaf R12: 0000000000000378
[  314.018196] R13: ffff8800650f3008 R14: ffffffffa08b93a4 R15: 0000000000000000
[  314.018916] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[  314.019721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  314.020323] CR2: 0000000000000378 CR3: 000000007c61c000 CR4: 0000000000060670
[  314.021062] Stack:
[  314.021319]  ffffffffa0b59f5f ffffffffa0cdfab6 ffff880066b09c00 ffff880066b09c00
[  314.022326]  ffff880066b09c54 ffff880065028000 ffff880066b09c00 ffffffffa0ccb56a
[  314.023331]  ffff880066b09c00 ffff8800650f3008 ffff88007a724680 ffffffffa0817309
[  314.024338] Call Trace:
[  314.024777]  [<ffffffffa0b59f5f>] tgt_grant_discard+0x2f/0x180 [ptlrpc]
[  314.025483]  [<ffffffffa0ccb56a>] ofd_destroy_export+0x6a/0x160 [ofd]
[  314.026185]  [<ffffffffa0817309>] class_export_destroy+0xe9/0x470 [obdclass]
[  314.026928]  [<ffffffffa081fe9b>] obd_zombie_impexp_cull+0x37b/0x530 [obdclass]
[  314.027688]  [<ffffffffa08200a0>] obd_zombie_impexp_thread+0x50/0x190 [obdclass]
[  314.028463]  [<ffffffff8109eb89>] kthread+0xc9/0xe0
[  314.029000]  [<ffffffff81619805>] ret_from_fork+0x55/0x80
[  314.032089] DWARF2 unwinder stuck at ret_from_fork+0x55/0x80
[  314.032690] 
[  314.032902] Leftover inexact backtrace:
               
[  314.033465]  [<ffffffff8109eac0>] ? kthread_park+0x50/0x50
[  314.034036] Code: 66 66 66 90 8b 07 85 c0 74 03 31 c0 c3 ba 01 00 00 00 f0 0f b1 17 85 c0 75 f0 b0 01 c3 66 90 66 66 66 66 90 31 c0 ba 01 00 00 00 <f0> 0f b1 17 85 c0 75 02 f3 c3 89 c6 e9 4e 85 b7 ff 0f 1f 00 66 
[  314.040103] RIP  [<ffffffff81618ffc>] _raw_spin_lock+0xc/0x20
[  314.040775]  RSP <ffff88007be77dd0>
[  314.041161] CR2: 0000000000000378
Comment by Andreas Dilger [ 14/Aug/18 ]

It seems that this obd_zombie_impexp_cull() issue may be a duplicate of LU-9806, LU-10806, and LU-10573, which all appear to be related to problems with the zombie export list pointing to an export that is no longer present. This issue could be tracked under LU-10806 so that this ticket can address the OI Scrub problem.
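To illustrate the failure mode, here is a minimal userspace sketch (not Lustre source; all struct and field names are simplified assumptions) of the crash pattern: target setup failed with -30, so the export's target/grant state was never fully attached, and the zombie-export thread then takes a spinlock through a NULL pointer (the oops shows the faulting address 0x378, i.e. a field offset from NULL). The guarded variant below shows the kind of check that would avoid the dereference:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the real structures (assumed names). */
struct lu_target {
        int lut_grant_lock;              /* stands in for spinlock_t */
};

struct obd_export {
        struct lu_target *exp_target;    /* NULL after failed setup */
};

/* Guarded variant of the grant-discard path: bail out instead of
 * locking through a NULL target pointer. */
static int grant_discard(struct obd_export *exp)
{
        if (exp == NULL || exp->exp_target == NULL)
                return -ENODEV;          /* target already torn down */
        /* ... the real code would take the grant spinlock and walk
         * the export's grant lists here ... */
        return 0;
}
```

This is only a sketch of the defensive check; the actual fix tracked in the linked tickets concerns keeping the zombie export list from referencing exports whose target has already been cleaned up.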

Comment by Andreas Dilger [ 14/Aug/18 ]

There is a second issue happening here. The OI Scrub check at mount time detected that the target (dm-3, lustre-OST0003) was restored from backup, and then ran a full scrub on the filesystem.

osd_scrub_setup()) dm-3: UUID has been changed from 664b5c1b-7333-88c0-43f0-1645592e9480 to 470cd461-31b1-a8d5-48b2-c736bed6d0db

This caused the target to be remounted read-only because a deleted inode was referenced:

[  313.551387] LDISKFS-fs error (device dm-3): ldiskfs_lookup:1968: inode #40972: comm OI_scrub: deleted inode referenced: 65549
[  313.567933] Aborting journal on device dm-3-8.
[  313.933723] LDISKFS-fs (dm-3): Remounting filesystem read-only

It isn't clear whether there is a race condition between OI Scrub running and accessing a file that was deleted, or whether something is accessing a stale inode via the OI and unlinking it while OI Scrub is processing it. This might happen because LFSCK runs without taking full locks on the files, to avoid blocking other server threads that are accessing the filesystem.

If this is happening on a regular basis, it would potentially be useful to change this error message to include the parent name/inode so that we can see what kind of file it is (Lustre internal or part of the namespace). Failing that, running "debugfs -c -R 'ncheck <inode_number>' /dev/mdtdev" on the MDT filesystem after the failure would report the pathname, so long as we have access to it before it is reformatted (maybe as part of the test script).
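As a concrete example of the suggested lookup (device path is taken from the mount command in the OST dmesg above, and the inode numbers from the ldiskfs error line; this is an ops fragment to run on the failed node before reformat, not something verified here):

```shell
# Map the inode numbers from the ldiskfs_lookup error to pathnames on
# the affected target. -c opens the filesystem in catastrophic mode
# (read-only, without reading the bitmaps), which works even after the
# journal was aborted.
debugfs -c -R 'ncheck 40972 65549' /dev/lvm-Role_OSS/P4
```

If added to the test script's failure path, this would capture whether the referenced inode belongs to a Lustre-internal file or to the namespace.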

Generated at Sat Feb 10 02:42:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.