Today
- Spent the daytime writing a report (click to visit)
- Planned to write the aux_loss code in TFPose in the evening, but was too sleepy to push through.
Tomorrow
- Write the aux_loss code in TFPose.
- Plan: first learn the deconvolution module nn.ConvTranspose2d to upsample the feature maps, then work out how to generate the ground-truth heatmaps and compute the loss with MSE.
- Study the RLE code and write a short summary.
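The plan above can be sketched in PyTorch. This is a minimal illustration, not TFPose's actual head: the channel sizes, the number of joints, and the 4x upsampling factor are all assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the planned aux_loss: upsample decoder features with
# nn.ConvTranspose2d, then compare against Gaussian ground-truth heatmaps with MSE.
# All shapes and layer sizes below are illustrative assumptions.

def make_gaussian_heatmap(size, center, sigma=2.0):
    """Render one ground-truth heatmap with a Gaussian peak at `center` (x, y)."""
    ys = torch.arange(size[0]).float().unsqueeze(1)  # (H, 1) row coordinates
    xs = torch.arange(size[1]).float().unsqueeze(0)  # (1, W) column coordinates
    cx, cy = center
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

class AuxHeatmapHead(nn.Module):
    """Upsample features 4x (two stride-2 deconvolutions) and predict one heatmap per joint."""
    def __init__(self, in_ch=256, num_joints=17):
        super().__init__()
        # kernel=4, stride=2, padding=1 exactly doubles the spatial size each time
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 128, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, num_joints, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, feat):  # feat: (B, in_ch, H, W) -> (B, num_joints, 4H, 4W)
        return self.up(feat)

head = AuxHeatmapHead()
feat = torch.randn(2, 256, 16, 16)
pred = head(feat)  # (2, 17, 64, 64)
gt = make_gaussian_heatmap((64, 64), center=(32, 20)).expand(2, 17, 64, 64)
aux_loss = nn.functional.mse_loss(pred, gt)
```

With kernel 4, stride 2, and padding 1, each deconvolution maps H to (H-1)*2 - 2 + 4 = 2H, so a 16x16 feature map becomes 64x64 after two layers.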
No need to write tomorrow's plan; just list clearly the time spent each day on research-related tasks. Also, Sunday evenings are off by default, but this must not happen on weekdays, please note.
No need to create a new topic for each daily report; just reply under the previous day's report.
2022/12/19
Hours on research: 6 h
From 10 am to 11 am, 2 pm to 4 pm, and 7 pm to 10 pm: Aux_loss coding task (part of TFPose implementation); 70% of the work is done.
2022/12/20
Hours on research: 5 h
From 3 pm to 5 pm: Completed the Aux_loss coding task (part of TFPose implementation).
From 7 pm to 10 pm: Implemented the refinement of the predicted y values in the decoder, and the regression loss (part of TFPose implementation); 50% of the work is done.
Please submit the report in English, without Chinese.
How should I read "50% work done": is 50% of the TFPose implementation complete, or 50% of that day's planned task? If the latter, please do not conflate "actually completed" with "planned to complete"; report what you have actually finished.
2022/12/21
Hours on research: 5 h
From 3 pm to 5 pm, 8:30 pm to 11:30 pm: The module in the red box is completed (part of TFPose implementation).
After each decoder layer, an MLP head is added. The features output by the first decoder layer are used to roughly predict the keypoint coordinates, and the output of each subsequent decoder layer is used to predict an offset.
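The coarse-plus-offset scheme described above can be sketched as follows. The embedding dimension, number of decoder layers, sigmoid normalization, and head structure are all assumptions for illustration, not TFPose's actual design.

```python
import torch
import torch.nn as nn

# Illustrative sketch (names and sizes are assumptions): the first decoder
# layer's features predict coarse keypoint coordinates via an MLP, and each
# later decoder layer predicts an offset added to the running estimate.

class CoordRefiner(nn.Module):
    def __init__(self, dim=256, num_layers=4):
        super().__init__()
        # one small MLP head per decoder layer, mapping a query feature to (x, y)
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))
            for _ in range(num_layers)
        )

    def forward(self, layer_feats):
        # layer_feats: list of (B, num_joints, dim) tensors, one per decoder layer
        coords = self.heads[0](layer_feats[0]).sigmoid()  # coarse prediction in [0, 1]
        for head, feat in zip(self.heads[1:], layer_feats[1:]):
            coords = coords + head(feat)  # per-layer offset refinement
        return coords  # (B, num_joints, 2)

refiner = CoordRefiner()
feats = [torch.randn(2, 17, 256) for _ in range(4)]
out = refiner(feats)  # refined coordinates for 17 joints in a batch of 2
```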
2022/12/22
Hours on research: 5 h
From 3 pm to 5 pm: Adjusted the code according to the TFPose paper.
From 8:30 pm to 11:30 pm: Read the RLE source code.
2022/12/23
Hours on research: 3 h
From 8:00 pm to 11:00 pm:
Today I continued making small fixes to the TFPose code and found problems with the heatmap and predicted output images. I will continue debugging tomorrow, then run experiments on the server with a larger batch size and get results on COCO first.
2022/12/24
Hours on research: 4 h
From 10:00 am to 11:00 am: Submitted the TFPose training task to the server.
From 2:00 pm to 4:00 pm: Read the RLE source code and some related blog posts.
From 7:00 pm to 8:00 pm: Read the Swin Transformer paper.
Do not mix ‘todo’ and ‘done’. Have you found out the reasons?
Test TFPose asap just in case there are potential errors.
2022/12/25
Hours on research: 3 h
From 2:00 pm to 5:00 pm:
Continued testing TFPose; current problems:
① NaN values often appear during network training.
② After submitting to the server, bugs show up in the val part.
2022/12/26
Hours on research: 4 h
From 7:00 pm to 11:00 pm:
Continued testing TFPose; solved the problem that the program could not run on multiple GPUs, and fixed a newly discovered bug that crashed the model on the last batch of training.
Please show me the test results!
2022/12/27
Hours on research: 6 h
From 3:00 pm to 5:00 pm, 7:00 pm to 11:00 pm: debug TFPose
Under multi-GPU training, various bugs and shape-mismatch problems occur in the last batch of each epoch. The bug is triggered only after a full epoch of training, so it takes about two hours of debugging just to reach that point. I have thought of another solution to the current problem; if it still does not work, I will build a small dataset just for debugging.
About the experimental results, I will debug as soon as possible.
You should consider reducing the training steps to save some debugging time. Also, do not use the whole training data for debugging. Small training data can do the same thing!
2022/12/28
Hours on research: 6 h
From 3:00 pm to 5:00 pm, 7:00 pm to 11:00 pm: debug TFPose
After reducing the number of images in the dataset, I located the problem: the last batch of each epoch contains fewer images than batch_size, so the results of that batch cannot be merged across multiple GPUs. Under the current code logic I have tried many solutions without success, so my current approach is to drop the last incomplete batch (fewer than 16 images). I will keep looking for a proper fix, but for now I will set it aside and debug other parts of the code to speed up progress.
Next, I will debug the remaining code: the validation phase and the code that computes the evaluation metrics.
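The workaround of discarding the final incomplete batch is exactly what PyTorch's DataLoader exposes as the drop_last flag. A minimal sketch (the dataset and sizes are illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative dataset: 100 samples, batch_size 16, so the last batch
# would have only 100 % 16 = 4 samples unless it is dropped.
ds = TensorDataset(torch.randn(100, 3), torch.randn(100, 2))

# drop_last=True discards the final incomplete batch: 100 // 16 = 6 full batches.
loader = DataLoader(ds, batch_size=16, drop_last=True)
batches = [x.shape[0] for x, _ in loader]  # every entry is 16

# For comparison, the default keeps the undersized last batch.
loader_keep = DataLoader(ds, batch_size=16, drop_last=False)
sizes = [x.shape[0] for x, _ in loader_keep]  # last entry is 4
```

This guarantees every batch seen by the multi-GPU merge has exactly batch_size samples, at the cost of skipping a few images per epoch.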
2022/12/29
Hours on research: 3 h
From 8:00 pm to 11:00 pm: debug TFPose
I'm sorry, I'm exhausted, so I haven't made much progress.
What still needs to be done is debugging. To be honest, I didn't expect it to be this difficult; embedding the processed data and pipeline into the original code surfaced many errors I could not predict.
Debugging still covers the validation part and the evaluation-metric part. The validation part is very similar to the training part, but there are still problems I'm working on. As for the deadline, I hope it will be next Tuesday; during this period I also need to submit a report for a course project.
At tomorrow's weekly meeting, I plan to present the code I wrote.
2022/12/30
Hours on research: 4 h
From 7:00 pm to 11:00 pm: debug TFPose
Completed the debugging work on the validation part.
2022/12/31
Hours on research: 5 h
From 3:00 pm to 5:00 pm, 7:00 pm to 10:00 pm: debug TFPose
Located the place where the value suddenly increases, then limited (clamped) its magnitude; this solved the problem of the loss later becoming NaN.
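The clamping fix described above can be sketched as follows. The log-based loss here is an illustrative stand-in, not TFPose's actual loss; the point is only that bounding the value before the unstable operation keeps the loss finite.

```python
import torch

# Sketch of the fix: clamp the quantity that blows up before it reaches the
# unstable operation, so the loss stays finite instead of turning into NaN/inf.
# The loss function and eps value are assumptions for illustration.

def stable_log_loss(p, eps=1e-6):
    p = p.clamp(min=eps, max=1.0)  # limit the magnitude before log()
    return -torch.log(p).mean()

probs = torch.tensor([0.9, 0.5, 0.0])  # the 0.0 would make log() diverge
loss = stable_log_loss(probs)          # finite, thanks to the clamp
```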