Paper Details
Abstract
The cosmetics industry lacks personalized, science-backed advisory systems, particularly for low-resource languages such as Vietnamese. This paper addresses that gap by introducing and evaluating a novel two-stage Retrieval-Augmented Generation (RAG) system. The approach combines a high-recall Bi-Encoder Retriever with a high-precision Cross-Encoder Re-ranker, forming a robust pipeline for specialized information retrieval. To enable this research, we created the Vietnamese Cosmetics E-commerce Dataset (VCED), a new, publicly available corpus of 9,173 canonical products derived from 11,609 raw e-commerce listings via a rigorous "Funnel Strategy" for data cleaning and entity resolution. The system's core components are language-specific models fine-tuned for the Vietnamese context. Experimental results demonstrate the advantage of this specialization: the Retriever achieves 99.92% triplet accuracy and the Re-ranker reaches 99.74% Average Precision. Most critically, end-to-end evaluation confirms that the re-ranking stage is indispensable: its inclusion more than doubled the Mean Reciprocal Rank (MRR) to 0.585 and raised the Hits@1 score from zero to 0.473. Deployed and validated as a Facebook Messenger chatbot, this work establishes a new performance benchmark for domain-specific conversational AI in Vietnamese and provides a production-ready blueprint for applying advanced RAG architectures in non-English, low-resource environments.
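For readers unfamiliar with the two end-to-end metrics reported above, the following minimal sketch shows how MRR and Hits@k are commonly computed. It assumes exactly one relevant product per query, which the abstract does not confirm; the function names and the toy rankings are hypothetical and are not taken from the paper or from VCED.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], relevant: Hashable) -> float:
    """Return 1/rank of the relevant item (1-indexed), or 0.0 if it is absent."""
    for position, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Sequence[Sequence[Hashable]], gold: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, gold)) / len(gold)


def hits_at_k(
    runs: Sequence[Sequence[Hashable]], gold: Sequence[Hashable], k: int = 1
) -> float:
    """Fraction of queries whose relevant item appears in the top k results."""
    return sum(g in r[:k] for r, g in zip(runs, gold)) / len(gold)


# Toy data (invented for illustration): two queries, one relevant product each.
gold = ["p1", "p2"]
# Hypothetical retriever-only rankings: the relevant item is found but never first.
retriever_only = [["p9", "p1", "p4"], ["p7", "p8", "p2"]]
# Hypothetical rankings after re-ranking the same candidates.
reranked = [["p1", "p9", "p4"], ["p7", "p2", "p8"]]

for name, runs in [("retriever only", retriever_only), ("with re-ranker", reranked)]:
    mrr = mean_reciprocal_rank(runs, gold)
    h1 = hits_at_k(runs, gold, k=1)
    print(f"{name}: MRR={mrr:.3f}, Hits@1={h1:.3f}")
```

In this toy case the retriever alone places no relevant item first, so Hits@1 is zero even though MRR is nonzero. Reordering the same candidates is enough to raise both metrics, which mirrors the kind of effect the abstract attributes to the re-ranking stage.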