Building a generally usable framework for computer vision competitions

Winter 2020

Machine Learning Engineer

As a Z by HP Data Science Global Ambassador, Qishen Ha's content is sponsored and he was provided with HP products.


Hi, my name is Qishen Ha. I currently work as a Machine Learning Engineer at LINE Corp. and am the 5th-ranked Kaggle Grandmaster in the world, focusing mainly on computer vision problems such as image classification, semantic segmentation, and object detection.


I am very honored to be a Z by HP Data Science Global Ambassador, and I am very grateful to Z by HP for giving me this opportunity and providing me with the HP Z8 G4 workstation and ZBook Studio. These machines have increased my competitiveness in Kaggle competitions.


Today I want to talk about the code framework I use for competitions, in particular computer vision competitions.

My Public Notebooks

Over the years I've made a number of CNN training notebooks publicly available, either as baseline models or as minimal versions of top solutions (the full versions are usually too large, in both code and training cost, to share).

If you have read any of these notebooks, you will see that although they use different data and train for different tasks, they all share the same basic structure: a generic framework. With such a framework, when we encounter a new competition we can train a new baseline model in the shortest possible time, and the code is also very easy to improve and maintain afterwards.


I will now summarize the framework I used in these notebooks and introduce each module. This is, of course, the framework I am used to, and I find it very easy to work with. If you already have a framework you are comfortable with, there is no need to copy mine exactly; just use it for ideas.

Introducing my framework

These are the basic generic modules in my framework:


  • Dataset
  • Augmentation
  • Model
  • Loss Function


I will now introduce them one by one.

Dataset defines how we read the data, how we pre-process it, and how we read and handle the labels. The most basic code structure of a dataset is quite short.

For a simple image classification task, we use cv2 to read the image into memory, pass it through augmentation and pre-processing, and finally return the processed image and label.


This is a very generic code style, and it requires only very minor modifications to adapt to an image segmentation task.

The key changes are to read a mask alongside the image, adjust the mask's data type and dimensions, and return it in place of row.label.


This way we can easily modify the dataset, read the data we want, pre-process it as we wish and so on.

You may have noticed that the dataset takes a parameter called "transforms", which holds the augmentation methods we will use; these methods are defined in the augmentation module.


A simple definition of augmentations might use only horizontal flip and resize for training, and resize alone for validation.


If we want to add more augmentation methods, we can simply append them to the training list.


Like this, we have added random rotation and blur to the training process.

Next, we define the structure of the model. Let's again take the simplest example, the model structure for an image classification task, as a reference.


Typically, in an image classification task, we create an ImageNet-pretrained model, such as EfficientNet, as a backbone, delete its original 1000-class linear layer (ImageNet is a 1000-class dataset), and add our own n-class linear layer.

If we want to add a dropout before the final FC layer, it is an equally small change.

Another common scenario is that the input image is not 3-channel RGB but has 4 channels or more. In this case we can replace the first convolution layer of the backbone with one that accepts the desired number of channels.

Here n_ch is the number of input channels. Written this way, we not only change the number of input channels to what we want, but also keep the ImageNet-pretrained weights for the first conv layer.

The easiest, and also the most common, way to define a loss function is to use a single standard criterion.
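For a multi-class classification task, that might simply be the following minimal sketch using PyTorch's built-in cross-entropy criterion:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def loss_fn(logits, targets):
    # logits: (batch, n_classes) raw model outputs; targets: (batch,) class ids
    return criterion(logits, targets)
```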

However, it is just as easy to extend it into something more complex.

In this loss function, we use cross-entropy loss for the first four outputs and BCE loss for the others, with a loss weight to balance the two. This makes the logic more complex but requires little code change. We used this loss to win first place in the RANZCR competition.
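A sketch of that idea; the exact output split and the 0.5 weight below are illustrative assumptions, not the competition's actual values:

```python
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
bce = nn.BCEWithLogitsLoss()
bce_weight = 0.5  # hypothetical weight balancing the two losses

def loss_fn(logits, ce_targets, bce_targets):
    # first four outputs: treated as a 4-class prediction, trained with CE
    loss_ce = ce(logits[:, :4], ce_targets)
    # remaining outputs: independent binary labels, trained with BCE
    loss_bce = bce(logits[:, 4:], bce_targets)
    return loss_ce + bce_weight * loss_bce
```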


These are the four basic modules of my framework, all designed to be very easy to extend. Combined, they form the framework that I use. When running experiments with it, I keep one notebook per experiment, which makes it easy to analyze and reproduce the results.


For more information, you can visit my Kaggle homepage and browse the notebooks I've shared; I'm sure you'll find more useful information in them.
