[LU-10066] A potential bug on OSP setattr handling Created: 03/Oct/17  Updated: 25/Jul/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jinshan Xiong (Inactive) Assignee: Alex Zhuravlev
Resolution: Unresolved Votes: 0
Labels: patch

Attachments: File log.gz     HTML File p    
Issue Links:
Related
is related to LU-7251 reduce commit callbacks in OSP Resolved
is related to LU-10170 OSP device could miss wakeup from com... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807
Epic Link: FLR tech debt review

 Description   

I discovered a potential bug in OSP while working on FLR. For FLR to work properly, the MDT sends the layout version to OST objects via a setattr RPC. The same layout version is also sent to the client, which carries it in BRW write RPCs to make them legal. The symptom is that OSP writes the setattr record into the llog, but the corresponding osp-sync thread, which should pick the record up later, never does.
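
To make the record flow easier to follow, here is a minimal, self-contained model of the producer/consumer pattern involved, written with plain pthreads rather than the real Lustre primitives; every name in it (sync_queue, sync_add_rec, sync_thread) is hypothetical. The producer stands in for the thread that appends a setattr record to the llog and wakes the sync thread; the consumer stands in for the osp-syn thread that is supposed to pick the record up and send the RPC.

/* Minimal model of the OSP llog producer / osp-syn consumer pattern.
 * Plain C and pthreads; all names are hypothetical, not Lustre APIs. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct sync_queue {
        pthread_mutex_t lock;
        pthread_cond_t  waitq;   /* plays the role of opd_sync_waitq */
        int             pending; /* records written but not yet sent */
        bool            running;
};

/* Producer side: append a setattr record and wake the sync thread. */
static void sync_add_rec(struct sync_queue *q)
{
        pthread_mutex_lock(&q->lock);
        q->pending++;
        pthread_cond_signal(&q->waitq);
        pthread_mutex_unlock(&q->lock);
}

/* Consumer side: sleep until a record is available, then "send" it. */
static void *sync_thread(void *arg)
{
        struct sync_queue *q = arg;

        pthread_mutex_lock(&q->lock);
        for (;;) {
                while (q->running && q->pending == 0)
                        pthread_cond_wait(&q->waitq, &q->lock);
                if (q->pending > 0) {
                        q->pending--;
                        printf("sending setattr RPC\n");
                        continue;
                }
                if (!q->running)
                        break;
        }
        pthread_mutex_unlock(&q->lock);
        return NULL;
}

int main(void)
{
        struct sync_queue q = {
                .lock    = PTHREAD_MUTEX_INITIALIZER,
                .waitq   = PTHREAD_COND_INITIALIZER,
                .pending = 0,
                .running = true,
        };
        pthread_t tid;
        int i;

        pthread_create(&tid, NULL, sync_thread, &q);
        for (i = 0; i < 4; i++)  /* four OST objects, as in the log below */
                sync_add_rec(&q);

        pthread_mutex_lock(&q.lock);
        q.running = false;
        pthread_cond_signal(&q.waitq);
        pthread_mutex_unlock(&q.lock);
        pthread_join(tid, NULL);
        return 0;
}

The symptom reported here corresponds to the consumer in this model never leaving its wait even though a record has been queued.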

Attached are the log and a debug patch for this problem, but let me explain the log in a bit more detail.

00000004:80000000:9.0:1506717693.938247:0:11105:0:(lod_object.c:1152:lod_obj_stripe_attr_set_cb()) [0x100020000:0x1045:0x0]: set layout version: 630, comp_idx: 10
00000004:80000000:9.0:1506717693.938249:0:11105:0:(osp_sync.c:425:osp_sync_add_rec()) [0xd4:0x1:0x0]: set layout version to: 630, rc = 0
..
00000004:80000000:9.0:1506717693.938247:0:11105:0:(lod_object.c:1152:lod_obj_stripe_attr_set_cb()) [0x100020000:0x1045:0x0]: set layout version: 630, comp_idx: 10
00000004:80000000:9.0:1506717693.938249:0:11105:0:(osp_sync.c:425:osp_sync_add_rec()) [0xd4:0x1:0x0]: set layout version to: 630, rc = 0
..
00000004:80000000:9.0:1506717693.938310:0:11105:0:(lod_object.c:1152:lod_obj_stripe_attr_set_cb()) [0x100030000:0x1049:0x0]: set layout version: 630, comp_idx: 11
00000040:00100000:40.0:1506717693.938311:0:9452:0:(llog.c:208:llog_cancel_rec()) Canceling 6230 in log [0xd4:0x1:0x0]
00000004:80000000:9.0:1506717693.938312:0:11105:0:(osp_sync.c:425:osp_sync_add_rec()) [0xd5:0x1:0x0]: set layout version to: 630, rc = 0
..
00000004:80000000:9.0:1506717693.938390:0:11105:0:(lod_object.c:1152:lod_obj_stripe_attr_set_cb()) [0x100000000:0x104f:0x0]: set layout version: 630, comp_idx: 11
00000004:80000000:9.0:1506717693.938391:0:11105:0:(osp_sync.c:425:osp_sync_add_rec()) [0xd3:0x1:0x0]: set layout version to: 630, rc = 0

Thread 11105 wrote four records intended to set the layout version on four OSTs, and all of the writes were successful.

00000004:00100000:33.0:1506717694.004531:0:9454:0:(osp_sync.c:772:osp_sync_new_setattr_job()) @@@ lustre-OST0002-osc-MDT0000: [0x1045:0x0:0x0]: set layout version: 630
  req@ffff880797eec800 x1579668955411600/t0(0) o2->lustre-OST0002-osc-MDT0000@10.8.1.68@tcp:28/4 lens 560/432 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
00000004:00100000:69.0:1506717694.004532:0:9721:0:(osp_sync.c:772:osp_sync_new_setattr_job()) @@@ lustre-OST0003-osc-MDT0000: [0x1049:0x0:0x0]: set layout version: 630
  req@ffff880fdb70a400 x1579668955411616/t0(0) o2->lustre-OST0003-osc-MDT0000@10.8.1.68@tcp:28/4 lens 560/432 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1
00000004:00100000:40.0:1506717694.004534:0:9452:0:(osp_sync.c:772:osp_sync_new_setattr_job()) @@@ lustre-OST0001-osc-MDT0000: [0x1048:0x0:0x0]: set layout version: 630
  req@ffff880fbd6ef200 x1579668955411632/t0(0) o2->lustre-OST0001-osc-MDT0000@10.8.1.68@tcp:28/4 lens 560/432 e 0 to 0 dl 0 ref 1 fl New:/0/ffffffff rc 0/-1

Later, those log records were picked up by the corresponding sync threads: 9454, 9721, and 9452. I have verified that the OSTs received the setattr RPCs.

The problem is the last record, which was supposed to be sent to OST0000 but never was. That request should have been picked up by thread 9450:

[jinxiong@wolf-68 ~]$ ps ax | grep osp-syn
  9450 ?        S      0:53 [osp-syn-0-0]
  9452 ?        S      0:53 [osp-syn-1-0]
  9454 ?        S      0:54 [osp-syn-2-0]
  9721 ?        S      0:54 [osp-syn-3-0]

And the stack trace of thread 9450:

[562512.630879] osp-syn-0-0     S ffff8808594dc2d8     0  9450      2 0x00000080
[562512.639451]  ffff88084461f9b8 0000000000000046 ffff88082279de20 ffff88084461ffd8
[562512.648442]  ffff88084461ffd8 ffff88084461ffd8 ffff88082279de20 0000000000002000
[562512.657438]  ffff8808594dc000 ffff8808594dc2e8 ffff881051a8c058 ffff8808594dc2d8
[562512.666447] Call Trace:                                                     
[562512.669845]  [<ffffffff8168c969>] schedule+0x29/0x70                        
[562512.676115]  [<ffffffffa13edd11>] osp_sync_process_queues+0x1641/0x20e0 [osp]
[562512.684729]  [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20                 
[562512.691574]  [<ffffffffa0a03565>] llog_process_thread+0x5a5/0x1180 [obdclass]
[562512.700289]  [<ffffffffa13ec6d0>] ? osp_sync_thread+0x9d0/0x9d0 [osp]       
[562512.708235]  [<ffffffffa0a041fc>] llog_process_or_fork+0xbc/0x450 [obdclass]
[562512.716770]  [<ffffffffa0a0972a>] llog_cat_process_cb+0x20a/0x220 [obdclass]
[562512.725343]  [<ffffffffa0a03565>] llog_process_thread+0x5a5/0x1180 [obdclass]
[562512.734025]  [<ffffffff810cabe4>] ? select_task_rq_fair+0x584/0x720         
[562512.741645]  [<ffffffffa0a09520>] ? llog_cat_process_common+0x440/0x440 [obdclass]
[562512.750778]  [<ffffffffa0a041fc>] llog_process_or_fork+0xbc/0x450 [obdclass]
[562512.759354]  [<ffffffffa0a09520>] ? llog_cat_process_common+0x440/0x440 [obdclass]
[562512.768492]  [<ffffffffa0a086a9>] llog_cat_process_or_fork+0x199/0x2a0 [obdclass]
[562512.777520]  [<ffffffff810c54f2>] ? default_wake_function+0x12/0x20         
[562512.785286]  [<ffffffff810ba628>] ? __wake_up_common+0x58/0x90              
[562512.792479]  [<ffffffffa13ec6d0>] ? osp_sync_thread+0x9d0/0x9d0 [osp]       
[562512.800382]  [<ffffffffa0a087de>] llog_cat_process+0x2e/0x30 [obdclass]     
[562512.808440]  [<ffffffffa13ebefa>] osp_sync_thread+0x1fa/0x9d0 [osp]         
[562512.816175]  [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0                  
[562512.823041]  [<ffffffffa13ebd00>] ? osp_sync_process_committed+0x6c0/0x6c0 [osp]
[562512.832031]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0 
(gdb) l *(osp_sync_process_queues+0x1641)
0x17d41 is in osp_sync_process_queues (/home/jinxiong/work/flr/lustre/osp/osp_sync.c:1151).
1146			}
1147	
1148			if (d->opd_sync_last_processed_id == d->opd_sync_last_used_id)
1149				osp_sync_remove_from_tracker(d);
1150	
1151			l_wait_event(d->opd_sync_waitq,
1152				     !osp_sync_running(d) ||
1153				     osp_sync_can_process_new(d, rec) ||
1154				     !list_empty(&d->opd_sync_committed_there),
1155				     &lwi);

The thread was sitting there, ignoring the new record.

I suspect one of the following conditions was false, which kept the thread from processing it:

static inline int osp_sync_can_process_new(struct osp_device *d,
                                           struct llog_rec_hdr *rec)
{
        LASSERT(d);

        if (unlikely(atomic_read(&d->opd_sync_barrier) > 0))
                return 0;
        if (unlikely(osp_sync_in_flight_conflict(d, rec)))
                return 0;
        if (!osp_sync_rpcs_in_progress_low(d))
                return 0;
        if (!osp_sync_rpcs_in_flight_low(d))
                return 0;
        if (!d->opd_imp_connected)
                return 0;
        if (d->opd_sync_prev_done == 0)
                return 1;
        if (atomic_read(&d->opd_sync_changes) == 0)
                return 0;
        if (rec == NULL ||
            osp_sync_correct_id(d, rec) <= d->opd_sync_last_committed_id)
                return 1;
        return 0;
}
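
To narrow down which of these gates is failing, one option is to add a debug print to each early return; the sketch below does that with Lustre's standard CDEBUG macro. This is only an illustration of the approach, not the debug patch attached to this ticket.

static inline int osp_sync_can_process_new(struct osp_device *d,
                                           struct llog_rec_hdr *rec)
{
        LASSERT(d);

        if (unlikely(atomic_read(&d->opd_sync_barrier) > 0)) {
                CDEBUG(D_HA, "blocked: sync barrier is held\n");
                return 0;
        }
        if (unlikely(osp_sync_in_flight_conflict(d, rec))) {
                CDEBUG(D_HA, "blocked: conflicting RPC in flight\n");
                return 0;
        }
        if (!osp_sync_rpcs_in_progress_low(d)) {
                CDEBUG(D_HA, "blocked: too many RPCs in progress\n");
                return 0;
        }
        if (!osp_sync_rpcs_in_flight_low(d)) {
                CDEBUG(D_HA, "blocked: too many RPCs in flight\n");
                return 0;
        }
        if (!d->opd_imp_connected) {
                CDEBUG(D_HA, "blocked: import is not connected\n");
                return 0;
        }
        if (d->opd_sync_prev_done == 0)
                return 1;
        if (atomic_read(&d->opd_sync_changes) == 0) {
                CDEBUG(D_HA, "blocked: no pending changes counted\n");
                return 0;
        }
        if (rec == NULL ||
            osp_sync_correct_id(d, rec) <= d->opd_sync_last_committed_id)
                return 1;
        CDEBUG(D_HA, "blocked: record id is past last committed id\n");
        return 0;
}

With HA debugging enabled (lctl set_param debug=+ha) and the debug log dumped via lctl dk while the thread is stuck, the failing check should show up each time the thread is woken.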

This problem is fairly consistent: I can reproduce it after running sanity-pfl:200 for two days on average. Please help.



 Comments   
Comment by Jinshan Xiong (Inactive) [ 03/Oct/17 ]

I will leave the setup that reproduces this problem untouched. Please let me know if you want to look at it.

Comment by Gerrit Updater [ 10/Oct/17 ]

Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: https://review.whamcloud.com/29550
Subject: LU-10066 osp: error overflow handling for llog id
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: b2d621a48fd70abd4d119091a7d2bbfeb4dad865

Comment by Colin Faber [ 25/Jul/22 ]

Hi bzzz 

Can you take a look? Thank you!
