[LU-11809] conf-sanity test 28A hangs on file system mount
Created: 18/Dec/18  Updated: 13/Jan/20  Resolved: 05/Feb/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: ubuntu
Environment:

Ubuntu 18.04 clients/RHEL 7.6 servers


Issue Links:
Related
is related to LU-11803 sanity test 255c fails with 'Ladvise ... Resolved
is related to LU-13118 change client instance to respect ASLR Open
is related to LU-11834 config instance truncation due to buf... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

conf-sanity test_28A hangs because the client hits a problem while mounting the file system. We see this issue only when testing with Ubuntu 18.04 clients.

For the test suite hang at https://testing.whamcloud.com/test_sets/d8edfc48-fdd4-11e8-a97c-52540065bddc, the client test_log is empty. The Client 1 (vm1) console log shows the issue:

[ 2626.155389] Lustre: DEBUG MARKER: == conf-sanity test 28A: permanent parameter setting ================================================= 02:34:15 (1544495655)
[ 2626.772885] Lustre: Lustre: Build Version: 2.12.0_RC2
[ 2626.836890] LNet: Added LNI 10.9.4.224@tcp [8/256/0/180]
[ 2626.842963] LNet: Accept all, port 7988
[ 2628.417559] Lustre: 3169:0:(gss_svc_upcall.c:1199:gss_init_svc_upcall()) Init channel is not opened by lsvcgssd, following request might be dropped until lsvcgssd is active
[ 2628.419207] Key type lgssc registered
[ 2628.519146] Lustre: Echo OBD driver; http://www.lustre.org/
[ 2639.110774] Lustre: DEBUG MARKER: mkdir -p /mnt/lustre
[ 2639.121362] Lustre: DEBUG MARKER: mount -t lustre -o user_xattr,flock trevis-19vm4@tcp:/lustre /mnt/lustre
[ 2650.553084] Lustre: Mounted lustre-client
[ 2651.652270] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2651.665867] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2652.000321] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2652.011374] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2653.024019] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n llite.lustre-*.max_read_ahead_whole_mb
[ 2653.764392] BUG: unable to handle kernel paging request at 0000000080d48269
[ 2653.765275] IP: class_process_config+0x1cf8/0x27b0 [obdclass]
[ 2653.765911] PGD 0 P4D 0 
[ 2653.766218] Oops: 0000 [#1] SMP PTI
[ 2653.766619] Modules linked in: lustre(OE) obdecho(OE) mgc(OE) lov(OE) mdc(OE) osc(OE) lmv(OE) fid(OE) fld(OE) ptlrpc_gss(OE) ptlrpc(OE) obdclass(OE) ksocklnd(OE) lnet(OE) libcfs(OE) rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd input_leds joydev mac_hid serio_raw sch_fq_codel sunrpc ip_tables x_tables autofs4 psmouse virtio_blk floppy 8139too 8139cp mii pata_acpi i2c_piix4 [last unloaded: libcfs]
[ 2653.771057] CPU: 1 PID: 3729 Comm: llog_process_th Tainted: G        W  OE    4.15.0-32-generic #35-Ubuntu
[ 2653.772039] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 2653.772668] RIP: 0010:class_process_config+0x1cf8/0x27b0 [obdclass]
[ 2653.773327] RSP: 0018:ffffb4fd4247bc58 EFLAGS: 00010246
[ 2653.773892] RAX: 0000000080d47e61 RBX: ffff9d7f744c0880 RCX: 0000000000000000
[ 2653.774624] RDX: 0000000000000018 RSI: ffffffffc0768d04 RDI: ffff9d7f744c08c6
[ 2653.775363] RBP: ffffb4fd4247bd08 R08: 00000000ffffffff R09: 0000000000000024
[ 2653.776103] R10: ffffffffc07594b0 R11: f000000000000000 R12: ffffffffc0759680
[ 2653.776838] R13: ffffffffc0761b50 R14: 0000000000000000 R15: ffffb4fd4247bd40
[ 2653.777571] FS:  0000000000000000(0000) GS:ffff9d7fbfd00000(0000) knlGS:0000000000000000
[ 2653.778413] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2653.779026] CR2: 0000000080d48269 CR3: 0000000010a0a002 CR4: 00000000000606e0
[ 2653.779769] Call Trace:
[ 2653.780100]  ? libcfs_debug_msg+0x50/0x70 [libcfs]
[ 2653.780624]  ? libcfs_debug_msg+0x50/0x70 [libcfs]
[ 2653.781174]  class_config_llog_handler+0x7cb/0x14c0 [obdclass]
[ 2653.781828]  llog_process_thread+0x651/0x1580 [obdclass]
[ 2653.782413]  llog_process_thread_daemonize+0x9f/0xe0 [obdclass]
[ 2653.783071]  kthread+0x121/0x140
[ 2653.783461]  ? llog_backup+0x4d0/0x4d0 [obdclass]
[ 2653.783981]  ? kthread_create_worker_on_cpu+0x70/0x70
[ 2653.784534]  ret_from_fork+0x35/0x40
[ 2653.784951] Code: 8b 40 38 48 85 c0 0f 84 e5 08 00 00 48 89 da be 20 00 00 00 4c 89 ef e8 b7 41 ed c9 41 89 c4 e9 d9 f4 ff ff 48 8b 85 58 ff ff ff <48> 8b 80 08 04 00 00 48 8b 50 10 81 3a 03 bd ac bd 0f 85 87 06 
[ 2653.786833] RIP: class_process_config+0x1cf8/0x27b0 [obdclass] RSP: ffffb4fd4247bc58
[ 2653.787628] CR2: 0000000080d48269
[    0.439151] Kernel panic - not syncing: Out of memory and no killable processes...

Logs for other hangs are at:
https://testing.whamcloud.com/test_sets/c185cb70-f713-11e8-b67f-52540065bddc
https://testing.whamcloud.com/test_sets/b5f76e48-f778-11e8-b67f-52540065bddc
https://testing.whamcloud.com/test_sets/05386eea-fa11-11e8-8a18-52540065bddc



 Comments   
Comment by Peter Jones [ 18/Dec/18 ]

James

Not sure if this is related to the other tickets you've just picked up...

Peter

Comment by Andreas Dilger [ 18/Dec/18 ]

This looks to be related to LU-11803, but shows a different symptom - there is a BUG() in class_process_config(), which might be the root cause of why the /sys tunable registration is failing in LU-11803 and related tickets.

Comment by Gerrit Updater [ 20/Dec/18 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33900
Subject: LU-11809 llite: don't use %p to generate cfg_instance
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b211ce5864730223ef5efadd8706707a84a3909c

Comment by Li Xi [ 20/Dec/18 ]

The instance name of a Lustre client, like "$fsname-ffff88002738bc00", doesn't feel like a meaningful name. The pointer (ffff88002738bc00) looks like a random number, and it changes on every remount, so there is no consistent name for a Lustre client. This is inconvenient when writing scripts.

I am wondering whether the instance name can be changed to something like "$fsname-$mount_point". To avoid problems, we may need to replace "/" in the mount point with "-" or similar. But at least, for a given mount point, we could generate the client name easily.
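
The naming scheme proposed above could look roughly like the following. This is only an illustrative sketch in Python, not Lustre code: the helper name `instance_name` and the choice to strip leading and trailing slashes are assumptions made for the example.

```python
def instance_name(fsname: str, mount_point: str) -> str:
    """Hypothetical helper: build "$fsname-$mount_point" with "/" mapped to "-"."""
    # Assumption: strip leading/trailing slashes so that "/mnt/lustre/"
    # and "/mnt/lustre" produce the same name.
    flattened = mount_point.strip("/").replace("/", "-")
    return f"{fsname}-{flattened}"


print(instance_name("lustre", "/mnt/lustre"))
print(instance_name("testfs", "/mnt/testfs2/"))
```

Note that a plain "/" to "-" substitution is not reversible: under this sketch, "/mnt/a-b" and "/mnt/a/b" map to the same name, so the result is predictable for a given mount point but not guaranteed to be unique.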

Comment by Andreas Dilger [ 20/Dec/18 ]

The "ffff888..." part of the name is not meant to be meaningful; it is meant to be unique. If we use something like the mountpoint, what happens if e.g. there are two containers that both mount the same filesystem on the same mountpoint internally? What happens if the filesystem is moved to a new location with "mount --bind /mnt/lustre /mnt/testfs; umount /mnt/testfs"? I think it would be possible to use something other than the superblock pointer to be unique per client (e.g. a PRNG value stored in the superblock, a client mountpoint UUID, etc.), and something like ll_sb_uuid might make it easier in userspace to match the /sys/fs/lustre/llite entries to a specific mountpoint.

Another option is to include the client UUID in the mount options shown in /proc/mounts, like:

 10.0.2.15@tcp:/testfs /mnt/testfs2 lustre rw,seclabel,uuid=b893c52a-995a-d164-5dc6-46e03330e73e 0

though this might mean that we accept "uuid=" as a client mount option (which is maybe silently discarded, because we don't want users to be able to specify non-unique UUID values for the mounts, just print it to identify the mountpoint).
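
If the "uuid=" option were shown in /proc/mounts as suggested above, a script could recover it with ordinary field splitting. The sketch below is an illustration only: the "uuid=" option is the proposal from this comment, not an existing Lustre mount option, and the function name `client_uuid` is hypothetical.

```python
from typing import Optional


def client_uuid(mounts_line: str) -> Optional[str]:
    """Return the value of a (proposed) uuid= option on a lustre mount line."""
    # /proc/mounts fields: device, mountpoint, fstype, options, ...
    fields = mounts_line.split()
    if len(fields) < 4 or fields[2] != "lustre":
        return None
    for opt in fields[3].split(","):
        key, sep, value = opt.partition("=")
        if key == "uuid" and sep:
            return value
    return None


line = ("10.0.2.15@tcp:/testfs /mnt/testfs2 lustre "
        "rw,seclabel,uuid=b893c52a-995a-d164-5dc6-46e03330e73e 0")
print(client_uuid(line))
```

Because the options field is comma-separated and the UUID contains only dashes, the value can be extracted without any special quoting rules.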

Comment by James A Simmons [ 20/Dec/18 ]

I have been playing with a patch that uses the UUID. In my patch I removed the dashes so it prints it as one number. Should we keep the dashes?

Comment by Andreas Dilger [ 20/Dec/18 ]

I was thinking about the same. Unfortunately, the dashes make it hard to parse. In my current patch I change llapi_getname() from assuming a fixed 16-char instance number at the end to allowing everything after the last dash. As an alternative, I could explicitly scan for "clilov" and drop that from the middle of the device name? That would allow a "proper" UUID with multiple dashes to be appended, but it may also be problematic for other code that uses "strrchr('-')" to cut off the identifier.
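
The parsing trade-off described above can be illustrated with a small sketch. This is Python for illustration, not the llapi_getname() implementation; the function names are hypothetical, and the device-name layout "$fsname-clilov-$instance" is assumed from the discussion in this ticket.

```python
def instance_fixed16(devname: str) -> str:
    """Old assumption: the instance is always the last 16 characters."""
    return devname[-16:]


def instance_after_last_dash(devname: str) -> str:
    """More tolerant: take everything after the final dash."""
    return devname.rpartition("-")[2]


def name_without_clilov(devname: str) -> str:
    """Alternative: drop "clilov" from the middle, keeping dashes in the suffix."""
    head, sep, tail = devname.partition("-clilov-")
    return f"{head}-{tail}" if sep else devname


hex_dev = "testfs-clilov-ffff88002738bc00"
uuid_dev = "testfs-clilov-b893c52a-995a-d164-5dc6-46e03330e73e"

print(instance_fixed16(hex_dev))           # fixed-width hex suffix
print(instance_after_last_dash(hex_dev))   # same result for a dash-free suffix
print(instance_after_last_dash(uuid_dev))  # only the last UUID group survives
print(name_without_clilov(uuid_dev))       # full dashed UUID is preserved
```

With a dash-free hex suffix, the first two approaches agree. With a dashed UUID, cutting at the last dash keeps only the final group, which is the same hazard that applies to any other code using "strrchr('-')".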

In any case, that kind of change isn't something to be done at the last minute. For 2.12 we need to stick with a single hex number, and we can look at changing it in 2.13 when there is a much longer testing window. The only thing that might be changed in my current patch is to make llapi_getname() more tolerant of what we might introduce in the future.

Comment by Gerrit Updater [ 21/Dec/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33900/
Subject: LU-11809 llite: don't use %p to generate cfg_instance
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: cd294a12553a0f24096c98c2dc59f4b0ec4a5c14
