Abstrak/Abstract |
The Directorate General of Customs and Excise (DGCE), an Indonesian government agency under the Ministry of Finance, is responsible for ensuring that importers and exporters classify their declared goods according to the Harmonized System Code (HS Code). This study aims to find an optimal machine learning framework for classifying goods into their HS Code, given the challenges DGCE faces: goods descriptions written in mixed languages with inconsistent patterns, imbalanced multiclass HS Code labels, and additional categorical variables. Building on previous studies that proposed machine learning models to predict the HS Code from goods descriptions, this study makes improvements and adjustments to address these challenges. Preprocessing included handling abbreviations and misspellings, normalizing the varying patterns of goods descriptions, and translating Indonesian words into English. One-hot encoding (OHE) is applied to encode the nominal categorical variables. To build features from the goods descriptions, we use Term Frequency-Inverse Document Frequency (TF-IDF) combined with bigrams. In our experiments, Random Forest achieved an F1-score of 79.60% when classifying the first four digits of the HS Code, and Multinomial Naive Bayes (Multinomial NB) achieved an F1-score of 72.74% when classifying all digits of the HS Code. Compared to the baseline paper, those scores are 11.26% and 11.36% higher, respectively. |
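The feature construction described in the abstract (TF-IDF over unigrams and bigrams, concatenated with one-hot encoded categorical variables) can be illustrated with a minimal, dependency-free sketch. This is not the study's implementation: the sample records, the choice of country of origin as the categorical variable, the smoothed IDF formula, and all function names are assumptions made for illustration. In practice a library such as scikit-learn (`TfidfVectorizer(ngram_range=(1, 2))`, `OneHotEncoder`) would likely be used, and the resulting vectors would be passed to a classifier such as Random Forest or Multinomial NB.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    grams = list(tokens)
    for n in range(2, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def fit_tfidf(docs):
    """Learn a sorted vocabulary and smoothed IDF weights (assumed formula)."""
    doc_grams = [ngrams(d) for d in docs]
    # Document frequency: number of descriptions containing each n-gram.
    df = Counter(g for grams in doc_grams for g in set(grams))
    vocab = sorted(df)
    n_docs = len(docs)
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    return vocab, idf


def tfidf_vector(doc, vocab, idf):
    """Return an L2-normalised TF-IDF vector for one description."""
    counts = Counter(ngrams(doc))
    vec = [counts[g] * idf[g] for g in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def one_hot(value, categories):
    """Encode one categorical value as a 0/1 indicator vector."""
    return [1.0 if value == c else 0.0 for c in categories]


# Hypothetical records: (goods description, country of origin).
records = [
    ("cotton t-shirt men", "CN"),
    ("cotton fabric roll", "IN"),
    ("steel bolt m8", "CN"),
]
descriptions = [d for d, _ in records]
countries = sorted({c for _, c in records})

vocab, idf = fit_tfidf(descriptions)
# Each row is the text features followed by the categorical indicator columns.
features = [
    tfidf_vector(d, vocab, idf) + one_hot(c, countries) for d, c in records
]
```

Each row of `features` has one column per unigram or bigram in the vocabulary plus one column per category, so terms shared by many descriptions (here `cotton`) receive a lower IDF weight than terms that appear in only one.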