Abstract
Unlike explicit unethical expressions, implicit unethical expressions are difficult not only to select as training data but also to anticipate in their future production patterns. Therefore, to improve the ability of language models to detect implicit unethical expressions, research into the weaknesses of these models is essential. In this paper, we changed the notation of implicit unethical expressions (YaminJeongeum, alien words) and inserted positive elements (vocabulary, emojis) to induce changes in the model's predictions. We also designed additional experiments using YaminJeongeum, alien words, and emojis. As a result, we found that (1) emojis influence the language model's detection process more strongly than the text itself, and (2) the language model is vulnerable to certain input variations. We therefore constructed a fine-tuning dataset from the input variants to which the language model was vulnerable and fine-tuned the model, which led to a noticeable performance improvement. We conclude that training on more diverse types of data is critical for improving the ability of language models to detect unethical expressions. We hope that this study will stimulate further research on detecting implicit unethical expressions with language models.