[LU-5188] nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0 Created: 12/Jun/14  Updated: 03/Mar/15  Resolved: 27/Jun/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.3
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Mahmoud Hanafi Assignee: Niu Yawei (Inactive)
Resolution: Fixed Votes: 0
Labels: HB

Issue Links:
Related
is related to LU-5249 conf-sanity test_32a: NULL pointer in... Resolved
is related to LU-6269 Unable to mount /nobackupp8 Resolved
is related to LU-4345 failed to update accounting ZAP for user Resolved
is related to LU-5244 conf-sanity test_32b: osp_sync_thread... Closed
Severity: 3
Rank (Obsolete): 14392

 Description   

mounting mdt we start to see these errors then the system becomes unreponsive

nbp6-mds login: LDISKFS-fs (dm-1): recovery complete
LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts: 
LDISKFS-fs (dm-2): recovery complete
LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. quota=on. Opts: 
Lustre: nbp6-MDT0000: Not available for connect from 10.153.1.57@o2ib233 (not set up)
Lustre: nbp6-MDT0000: used disk, loading
Lustre: nbp6-MDT0000: Not available for connect from 10.153.0.76@o2ib233 (not set up)
Lustre: 2967:0:(mdt_handler.c:4969:mdt_process_config()) For interoperability, skip this mdt.group_upcall. It is obsolete.
Lustre: 2967:0:(mdt_handler.c:4969:mdt_process_config()) For interoperability, skip this mdt.quota_type. It is obsolete.
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 11-0: nbp6-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Lustre: nbp6-MDT0000: Will be in recovery for at least 5:00, or until 1083 clients reconnect
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST0032-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 212864 previous similar messages
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 229931 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 450953 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 445853 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 930108 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 925668 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 1897185 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 1898934 previous similar messages
LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST0032-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 3829999 previous similar messages
LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 3851279 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 7683067 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 7758692 previous similar messages
LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST0032-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 15494392 previous similar messages
LustreError: 3081:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 15461799 previous similar messages
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 31047376 previous similar messages
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 30898154 previous similar messages
INFO: task crond:3224 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
crond         D 0000000000000003     0  3224   2649 0x00000080
 ffff8805c5579d38 0000000000000086 ffff8805c5579d00 ffff8805c5579cfc
 0000000000000000 ffff88063fc24700 ffff880028296780 0000000000000500
 ffff880625ed85f8 ffff8805c5579fd8 000000000000fc40 ffff880625ed85f8
Call Trace:
 [<ffffffff81540545>] schedule_timeout+0x215/0x2e0
 [<ffffffff81346555>] ? extract_entropy+0xe5/0x140
 [<ffffffff815401c3>] wait_for_common+0x123/0x180
 [<ffffffff81063be0>] ? default_wake_function+0x0/0x20
 [<ffffffff815402dd>] wait_for_completion+0x1d/0x20
 [<ffffffff810921c8>] synchronize_sched+0x58/0x60
 [<ffffffff81092150>] ? wakeme_after_rcu+0x0/0x20
 [<ffffffff8122205c>] install_session_keyring_to_cred+0x6c/0xd0
 [<ffffffff812221f3>] join_session_keyring+0x133/0x160
 [<ffffffff810dbff7>] ? audit_syscall_entry+0x1d7/0x200
 [<ffffffff81220df8>] keyctl_join_session_keyring+0x38/0x70
 [<ffffffff81221a20>] sys_keyctl+0x170/0x190
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) can't send: -22
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) nbp6-OST002f-osc-MDT0000: invalid setattr record, lsr_valid:0
LustreError: 3069:0:(osp_sync.c:487:osp_sync_new_setattr_job()) Skipped 61763336 previous similar messages
LustreError: 3081:0:(osp_sync.c:797:osp_sync_process_queues()) Skipped 62092386 previous similar messages


 Comments   
Comment by Peter Jones [ 12/Jun/14 ]

Oleg is looking into this

Comment by Oleg Drokin [ 12/Jun/14 ]

Hm, it appears that this is a bug in LU-4345 patch that you have applied.

Patch is at http://review.whamcloud.com/10706 - this should help your immediate problem.

Niu: regarding the LU-4345 - it appears that while there is now handling for this case instead of LASSERT, it's pointless because we enter eternal retrying loop so possibly we want to have some more handling there.

Comment by Mahmoud Hanafi [ 12/Jun/14 ]

ok thanks will build and install the patch.

Comment by Peter Jones [ 12/Jun/14 ]

NIu

Could you please complete any follow up work on this ticket?

Thanks

Peter

Comment by Niu Yawei (Inactive) [ 13/Jun/14 ]

Niu: regarding the LU-4345 - it appears that while there is now handling for this case instead of LASSERT, it's pointless because we enter eternal retrying loop so possibly we want to have some more handling there.

Right, I think we'd return 0 instead of -EINVAL, so that we just print an error message, and osp sync process can continue processing on other records.

Comment by Andreas Dilger [ 16/Jun/14 ]

I think skipping the invalid record is OK, since LFSCK 2 should fix up the ownership of the OST objects on its next scan.

Comment by Mahmoud Hanafi [ 20/Jun/14 ]

The patched fixed our issue.

Comment by Andreas Dilger [ 20/Jun/14 ]

Patch landed to master for 2.6.0.

Comment by Di Wang [ 25/Jun/14 ]

This patch cause LU-5244.

It seems osp_sync_new_setattr_job return 0 but did not issue RPC, so we should decrease opd_syn_rpc_in_flight and opd_sync_rpc_in_progress in this case.

Comment by Di Wang [ 25/Jun/14 ]

http://review.whamcloud.com/10828

Comment by Jodi Levi (Inactive) [ 27/Jun/14 ]

Patch landed to Master.

Generated at Sat Feb 10 01:49:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.