Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.12.3
- Environment: Lustre OSS server running ZFS
- Severity: 3
Description
When restarting our production Lustre file system, we encountered this bug:
[407608.498637] LNetError: 72335:0:(o2iblnd_cb.c:3335:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[407608.509681] LNetError: 72335:0:(o2iblnd_cb.c:3410:kiblnd_check_conns()) Timed out RDMA with 10.10.32.102@o2ib2 (5): c: 3, oc: 0, rc: 7
[407608.526089] LustreError: 72335:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff8ca33db8a800
[407608.537667] LustreError: 72335:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff8ca33db8a800
[407608.549244] LustreError: 167072:0:(ldlm_lib.c:3259:target_bulk_io()) @@@ network error on bulk WRITE req@ffff8cabcfcba850 x1648066684855104/t0(0) o4->8d9c48a5-020d-9844-4aa4-57225c35d4e2@10.10.32.102@o2ib2:135/0 lens 608/448 e 0 to 0 dl 1573227610 ref 1 fl Interpret:/0/0 rc 0/0
[407608.576219] Lustre: f2-OST001d: Bulk IO write error with 8d9c48a5-020d-9844-4aa4-57225c35d4e2 (at 10.10.32.102@o2ib2), client will retry: rc = -110
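For reference, Lustre logs errors as negated Linux errno values: the status -103 in the server_bulk_callback lines above is -ECONNABORTED, and the rc = -110 in the bulk IO write error is -ETIMEDOUT. A minimal user-space sketch to decode such values (illustrative only, not Lustre code):

#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Decode negated errno values as they appear in the log lines above:
 * -103 -> ECONNABORTED ("Software caused connection abort"),
 * -110 -> ETIMEDOUT    ("Connection timed out"). */
int main(void)
{
    int rcs[] = { -103, -110 };
    for (size_t i = 0; i < sizeof(rcs) / sizeof(rcs[0]); i++)
        printf("rc = %d -> %s\n", rcs[i], strerror(-rcs[i]));
    return 0;
}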
Eventually we ended up seeing stack traces like this:
[423015.676012] [<ffffffff98d5d28b>] queued_spin_lock_slowpath+0xb/0xf
[423015.676017] [<ffffffff98d6b760>] _raw_spin_lock+0x20/0x30
[423015.676026] [<ffffffffc19ddf39>] ofd_intent_policy+0x1d9/0x920 [ofd]
[423015.676070] [<ffffffffc161dd26>] ldlm_lock_enqueue+0x366/0xa60 [ptlrpc]
[423015.676080] [<ffffffffc12f4033>] ? cfs_hash_bd_add_locked+0x63/0x80 [libcfs]
[423015.676085] [<ffffffffc12f77be>] ? cfs_hash_add+0xbe/0x1a0 [libcfs]
[423015.676107] [<ffffffffc1646587>] ldlm_handle_enqueue0+0xa47/0x15a0 [ptlrpc]
[423015.676130] [<ffffffffc166e6d0>] ? lustre_swab_ldlm_lock_desc+0x30/0x30 [ptlrpc]
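Our reading of this trace (an assumption on our part, not a confirmed diagnosis): service threads are stuck in queued_spin_lock_slowpath on a spinlock taken from ofd_intent_policy, i.e. heavy contention on a single lock in the OST intent/glimpse path. A minimal user-space sketch of that contention pattern, using pthread spinlocks in place of kernel ones (illustrative only, not OFD code):

#include <pthread.h>
#include <stdio.h>

/* Illustration only: many threads hammering one spinlock. Under this kind
 * of contention each acquirer busy-waits, which in kernel traces shows up
 * as queued_spin_lock_slowpath frames like the ones above. */
static pthread_spinlock_t lock;
static long counter;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_spin_lock(&lock);   /* contended acquire: spins until free */
        counter++;                  /* short critical section */
        pthread_spin_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[8];
    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    pthread_spin_destroy(&lock);
    return 0;
}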
This looks similar to the issues reported by NASA, but we are filing this ticket just to make sure.