Developing a Synthetic Data Generation Framework with Structured Question Patterns for Fine-Tuning Domain-Specific LLMs: A Case Study on POLYCC LLM League 2025
Keywords:
Synthetic Data Generation, Large Language Models, LoRA, Knowledge Distillation, POLYCC
Abstract
The rapid evolution of Large Language Models (LLMs) has exposed a critical deficiency: general-purpose models cannot handle localized, niche, and institutional knowledge bases without a significant risk of hallucination. This research addresses the gap by proposing a novel Synthetic Data Generation Framework designed specifically for the Malaysian POLYCC (Polytechnics and Community Colleges) ecosystem. Focusing on Education in Polytechnics and Community Colleges, TVET Policy, Student Activities, TVET Madani, TeCC 4.0, and the Maker Market initiative, we developed a taxonomy of 60 structured question patterns to drive domain-specific specialization. Using Gemini as a high-fidelity "Teacher" model, we generated a balanced dataset of 3,600 question-response pairs. We then distilled this expertise into a Meta-Llama-3.2-3B "Student" model via AWS SageMaker JumpStart. Through an eight-stage incremental scaling experiment, we demonstrate that performance gains in fine-tuned models are non-linear and pattern-dependent. The model reached its "Strategic Peak" at Stage 5, achieving a 58% win rate over the baseline. These results indicate that, for effective domain specialization, the quality and logical structure of synthetic data, specifically rationale-based patterns, matter more than raw data volume. The framework provides a scalable blueprint for institutional AI deployment in the POLYCC LLM League 2025.
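As a rough illustration of the dataset arithmetic described in the abstract, the sketch below lays out a balanced grid of generation slots and splits it into eight cumulative training stages. Several details are assumptions made only for this example and are not stated in the abstract: that the 3,600 pairs are spread evenly over six topic areas and 60 patterns (10 pairs per topic-pattern cell), that each stage adds an equal share of 450 pairs, and that stages are drawn by a seeded shuffle. The helper names `build_slot_grid` and `cumulative_stages` are hypothetical.

```python
import random
from itertools import product

# Topic areas as listed in the abstract; treating them as six equal strata
# is an assumption of this sketch.
TOPICS = [
    "Education in Polytechnics and Community Colleges",
    "TVET Policy",
    "Student Activities",
    "TVET Madani",
    "TeCC 4.0",
    "Maker Market",
]
PATTERN_IDS = list(range(1, 61))  # 60 structured question patterns


def build_slot_grid(topics, pattern_ids, per_cell):
    """Return one slot per planned question-response pair."""
    return [
        {"topic": t, "pattern_id": p, "replica": r}
        for t, p, r in product(topics, pattern_ids, range(per_cell))
    ]


def cumulative_stages(items, n_stages, seed=0):
    """Shuffle once, then return n_stages nested subsets of growing size."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    step = len(shuffled) // n_stages
    return [shuffled[: step * (i + 1)] for i in range(n_stages)]


# per_cell=10 is an assumed value chosen so that 6 * 60 * 10 = 3,600.
grid = build_slot_grid(TOPICS, PATTERN_IDS, per_cell=10)
stages = cumulative_stages(grid, n_stages=8)

print(len(grid))                 # total planned pairs
print([len(s) for s in stages])  # cumulative stage sizes
```

Under these assumptions each pattern receives 60 slots overall, and every stage contains all data from the stage before it, so a peak at an intermediate stage can be read as a property of the added data rather than of a different sample.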

