Developing a Synthetic Data Generation Framework with Structured Question Patterns for Fine-Tuning Domain-Specific LLMs: A Case Study on POLYCC LLM League 2025

Norazuwa Salehudin; Norwahida Saamri; Beny Yusmar

Authors

Norazuwa Salehudin, Kolej Komuniti Temerloh
Norwahida Saamri, Kolej Komuniti Temerloh
Beny Yusmar, Kolej Komuniti Temerloh

Keywords:

Synthetic Data Generation, Large Language Models, LoRA, Knowledge Distillation, POLYCC

Abstract

The rapid evolution of Large Language Models (LLMs) has exposed a critical deficiency: general-purpose models cannot handle localized, niche, and institutional knowledge bases without a significant risk of hallucination. This research addresses that gap by proposing a novel Synthetic Data Generation Framework designed specifically for the Malaysian POLYCC (Polytechnics and Community Colleges) ecosystem. Focusing on Education in Polytechnics and Community Colleges, TVET Policy, Student Activities, TVET Madani, TeCC 4.0, and Maker Market initiatives, we developed a taxonomy of 60 structured question patterns to drive domain-specific specialization. Using Gemini as a high-fidelity "Teacher" model, we generated a balanced dataset of 3,600 question-response pairs. We then distilled this expertise into a Meta-Llama-3.2-3B "Student" model via AWS SageMaker JumpStart. Through an eight-stage incremental scaling experiment, we demonstrate that performance gains in fine-tuned models are non-linear and pattern-dependent. The results indicate that the model reached its Strategic Peak at Stage 5, achieving a 58% Win Rate over the baseline. This study confirms that, for effective domain specialization, the quality and logical structure of synthetic data, specifically rationale-based patterns, matter more than raw data volume, providing a scalable blueprint for institutional AI deployment in the POLYCC LLM League 2025.
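The dataset arithmetic in the abstract can be illustrated with a minimal sketch. Only the topic names, the 60 patterns, and the 3,600 pairs come from the abstract. The sketch assumes that "balanced" means every (topic, pattern) combination receives the same number of pairs and that every pattern applies to every topic; the paper may balance the data differently. The names `build_plan` and `win_rate` are hypothetical helpers, and the win-rate function is a plain ratio that says nothing about how the comparisons were judged.

```python
from collections import Counter
from itertools import product

# Topic areas as listed in the abstract.
DOMAINS = [
    "Education in Polytechnics and Community Colleges",
    "TVET Policy",
    "Student Activities",
    "TVET Madani",
    "TeCC 4.0",
    "Maker Market",
]

NUM_PATTERNS = 60    # structured question patterns (from the abstract)
TOTAL_PAIRS = 3600   # question-response pairs (from the abstract)


def build_plan(domains, num_patterns, total_pairs):
    """Return one (domain, pattern_id) slot per pair to be generated.

    Assumption: the total is split evenly over every
    (domain, pattern) cell. This is an illustration only.
    """
    cells = list(product(domains, range(1, num_patterns + 1)))
    per_cell, remainder = divmod(total_pairs, len(cells))
    if remainder:
        raise ValueError("total_pairs does not divide evenly across cells")
    return [cell for cell in cells for _ in range(per_cell)]


def win_rate(wins, comparisons):
    """Fraction of head-to-head comparisons won (hypothetical helper)."""
    return wins / comparisons


plan = build_plan(DOMAINS, NUM_PATTERNS, TOTAL_PAIRS)
pairs_per_domain = Counter(domain for domain, _ in plan)
pairs_per_pattern = Counter(pattern for _, pattern in plan)
```

Under these assumptions each (topic, pattern) cell holds 10 pairs, so each topic contributes 600 pairs and each pattern 60. A teacher-model prompt would then be issued once per slot in `plan`.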
