[LU-5159] Lustre MGS/MDT fails to start using initscripts using 2.4.2 based packages Created: 07/Jun/14  Updated: 29/Aug/14  Resolved: 29/Aug/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Prakash Surya (Inactive) Assignee: Hongchao Zhang
Resolution: Duplicate Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-1279 failure trying to mount two targets a... Resolved
Severity: 3
Rank (Obsolete): 14229

 Description   

I set up a small Lustre filesystem inside of a few VMs running our TOSS 2.2 packages, and the initscript is failing to mount the MGS and MDT when run after a reboot of the MGS. I think this might be a duplicate of LU-1279, so feel free to mark it a duplicate if that's the case.

-bash-4.1# dmesg -c > /dev/null
-bash-4.1# time /etc/init.d/lustre start
Mounting stotch-mds1/mgs0 on /mnt/lustre/local/stotch-MGS0000
Mounting stotch-mds1/mdt0 on /mnt/lustre/local/stotch-MDT0000
mount.lustre: mount stotch-mds1/mgs0 at /mnt/lustre/local/stotch-MGS0000 failed: No such device
Are the lustre modules loaded?
Check /etc/modprobe.conf and /proc/filesystems
mount.lustre: mount stotch-mds1/mdt0 at /mnt/lustre/local/stotch-MDT0000 failed: Input/output error
Is the MGS running?

real    7m34.545s
user    0m0.427s
sys     0m0.173s

-bash-4.1# mount
/dev/mapper/VolGroup-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/vda1 on /boot type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)

-bash-4.1# dmesg
LNet: HW CPU cores: 4, npartitions: 1
alg: No test for crc32 (crc32-table)
alg: No test for adler32 (adler32-zlib)
padlock: VIA PadLock Hash Engine not detected.
Lustre: Lustre: Build Version: 2.4.2-11chaos-11chaos--PRISTINE-2.6.32-431.17.2.1chaos.ch5.2.x86_64
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol RQF_FLD_QUERY
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_server_pack
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_client_get
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_queue_wait
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_fini
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_init
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_set
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol req_capsule_server_get
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_at_set_req_timeout
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_request_alloc_pack
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol RMF_FLD_OPC
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_request_set_replen
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol RMF_FLD_MDFLD
fld: gave up waiting for init of module ptlrpc.
fld: Unknown symbol ptlrpc_req_finished
LNet: Added LNI 192.168.2.90@tcp [8/256/0/180]
LNet: Accept secure, port 988
LustreError: 2927:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880068beb000 x1470206796890120/t0(0) o253->MGC192.168.2.90@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 2927:0:(obd_mount_server.c:1140:server_register_target()) stotch-MDT0000: error registering with the MGS: rc = -5 (not fatal)
LustreError: 2927:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880068beb000 x1470206796890124/t0(0) o101->MGC192.168.2.90@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 2927:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880068beb000 x1470206796890128/t0(0) o101->MGC192.168.2.90@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
LustreError: 15c-8: MGC192.168.2.90@tcp: The configuration from log 'stotch-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 2927:0:(obd_mount_server.c:1273:server_start_targets()) failed to start server stotch-MDT0000: -5
Lustre: stotch-MDT0000: Unable to start target: -5
LustreError: 2927:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) stotch-MDT0000-lwp-MDT0000: Can't end config log stotch-client.
LustreError: 2927:0:(obd_mount_server.c:1442:server_put_super()) stotch-MDT0000: failed to disconnect lwp. (rc=-2)
LustreError: 2927:0:(obd_mount_server.c:1472:server_put_super()) no obd stotch-MDT0000
Lustre: server umount stotch-MDT0000 complete
LustreError: 2927:0:(obd_mount.c:1290:lustre_fill_super()) Unable to mount  (-5)

-bash-4.1# rpm -qa | grep lustre
lustre-tools-llnl-1.6-1.ch5.2.x86_64
lustre-osd-ldiskfs-2.4.2-11chaos_2.6.32_431.17.2.1chaos.ch5.2.ch5.2.x86_64
lustre-modules-2.4.2-11chaos_2.6.32_431.17.2.1chaos.ch5.2.ch5.2.x86_64
lustre-osd-zfs-2.4.2-11chaos_2.6.32_431.17.2.1chaos.ch5.2.ch5.2.x86_64
lustre-debuginfo-2.4.2-11chaos_2.6.32_431.17.2.1chaos.ch5.2.ch5.2.x86_64
lustre-2.4.2-11chaos_2.6.32_431.17.2.1chaos.ch5.2.ch5.2.x86_64

-bash-4.1# cat /etc/ldev.conf 
stotch-mds1 - stotch-MGS0000 zfs:stotch-mds1/mgs0
stotch-mds1 - stotch-MDT0000 zfs:stotch-mds1/mdt0
stotch-oss1 - stotch-OST0000 zfs:stotch-oss1/ost0
stotch-oss2 - stotch-OST0001 zfs:stotch-oss2/ost0

Is this expected behavior? I assume not.

If I run the script a second time, everything mounts just fine (and much faster):

-bash-4.1# time /etc/init.d/lustre start
Mounting stotch-mds1/mgs0 on /mnt/lustre/local/stotch-MGS0000
Mounting stotch-mds1/mdt0 on /mnt/lustre/local/stotch-MDT0000

real    0m4.484s
user    0m0.439s
sys     0m0.228s

-bash-4.1# mount
/dev/mapper/VolGroup-lv_root on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/vda1 on /boot type ext4 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
stotch-mds1/mgs0 on /mnt/lustre/local/stotch-MGS0000 type lustre (rw)
stotch-mds1/mdt0 on /mnt/lustre/local/stotch-MDT0000 type lustre (rw)
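
For anyone reproducing this, the first failure above ("No such device", followed by the hint to check /proc/filesystems) can be tested for directly before attempting a mount. The following is only a minimal sketch, not part of the initscript: the helper name fs_registered and its optional file argument are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if filesystem type $1 is registered.
# /proc/filesystems lists one type per line, optionally preceded by "nodev",
# so the type name is always the last field on the line.
fs_registered() {
    # $1: filesystem type, $2: file to read (defaults to /proc/filesystems)
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}
```

A call such as `fs_registered lustre || echo "lustre is not registered yet"` before the first `/etc/init.d/lustre start` would show whether the modules had finished registering the filesystem type at that point.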


 Comments   
Comment by Peter Jones [ 08/Jun/14 ]

Hongchao

Could you please comment on this one?

Thanks

Peter

Comment by Hongchao Zhang [ 13/Jun/14 ]

I have reproduced this issue, and it is a duplicate of LU-1279. In that issue, "modprobe" treats a module whose state is "MODULE_STATE_COMING"
(in this case "lnet") as valid and continues loading the modules that follow. Those modules take the "module_mutex" and wait in "resolve_symbol"
for the previous module (lnet) to complete, that is, for its state to change to "MODULE_STATE_LIVE". However, the held mutex prevents lnet's
submodules (in this case the various klnd modules) from being loaded, so the load fails with "XXX1: gave up waiting for init of module XXX2. XXX1: Unknown symbol XXXXXX".
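
The module states mentioned above can be observed from userspace: /proc/modules reports a module in MODULE_STATE_COMING as "Loading" and one in MODULE_STATE_LIVE as "Live". The sketch below is a diagnostic aid only, not the LU-1279 fix. The helper names modules_not_live and wait_module_live, and their optional file arguments, are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "name state" for every module that is not "Live".
# Each /proc/modules line is: name size refcount deps state address
# where state is one of Live, Loading, or Unloading.
modules_not_live() {
    # $1: file to read (defaults to /proc/modules)
    awk '$5 != "Live" { print $1, $5 }' "${1:-/proc/modules}"
}

# Hypothetical helper: poll until the named module reports "Live",
# giving up after a number of one-second attempts.
wait_module_live() {
    # $1: module name, $2: max attempts (default 30),
    # $3: file to read (defaults to /proc/modules)
    mod=$1
    tries=${2:-30}
    file=${3:-/proc/modules}
    while [ "$tries" -gt 0 ]; do
        if awk -v m="$mod" '$1 == m && $5 == "Live" { ok = 1 } END { exit !ok }' "$file"; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

Running `modules_not_live` while the first `/etc/init.d/lustre start` is hanging would be one way to check which module is stuck in the "Loading" state.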

Comment by Peter Jones [ 29/Aug/14 ]

Closing as a duplicate. It looks like there is a recently-landed patch for LU-1279 to try out.
