# Automated Lemma & Morph-Tagging Pipeline

## 📌 Overview
Manually tagging a corpus for lemmatization and grammatical features is a massive bottleneck in linguistic research. This project automates that process by targeting the most “impactful” words in a corpus and using Large Language Models (LLMs) to perform morphological analysis.
## ⚙️ The Pipeline Architecture
The system functions as a continuous ETL (Extract, Transform, Load) pipeline:
1. Frequency Analysis: Using Python, I generate a frequency list of all word forms in the corpus.
2. Pareto Filtering: To optimize API costs and processing time, the script filters out "long-tail" infrequent forms, keeping only the high-frequency forms that together cover 80% of total corpus volume.
3. Automated Trigger: A Google Apps Script runs on a time-based trigger to fetch untagged forms.
4. LLM Inference: The script calls an LLM API with a strict system prompt that constrains the model to return a structured JSON response.
5. Relational Storage: The response is parsed and saved into two synchronized tables:
   - Lemmas Table: stores the dictionary headword.
   - Forms Table: stores the specific inflected form, linked to its lemma along with its grammatical tags.
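The first two steps (frequency analysis and Pareto filtering) can be sketched in Python with pandas. The tokenization and the exact 80% cutoff rule are assumptions for illustration, not the project's exact code:

```python
from collections import Counter

import pandas as pd


def top_forms(tokens, coverage=0.80):
    """Return the word forms that together cover `coverage` of corpus volume."""
    freq = pd.Series(Counter(tokens)).sort_values(ascending=False)
    cum_share = freq.cumsum() / freq.sum()
    # Keep every form up to and including the one that crosses the threshold.
    cutoff = cum_share.searchsorted(coverage) + 1
    return freq.iloc[:cutoff]


# Toy corpus: 9 tokens; the 5 most frequent forms cover ~89% of volume.
tokens = "the cat sat on the mat the cat ran".split()
print(top_forms(tokens))
```

The long tail that falls below the cutoff is simply never sent to the API, which is where the cost savings come from.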
## 🛠 Tech Stack
- Automation: Google Apps Script (JavaScript-based).
- Intelligence: OpenAI/Gemini/Grok API (JSON Mode).
- Data Prep: Python (Pandas) for frequency distribution.
- Storage: Google Sheets / Relational Tables.
## 🧩 Key Challenges
- Challenge: Ensuring the LLM doesn’t “hallucinate” linguistic tags or return malformed strings.
- Solution: Implemented strict JSON schema validation and retry logic within the Apps Script, so every database entry maintains relational integrity.
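The validation-and-retry logic can be illustrated with a minimal Python sketch (the production version runs in Apps Script; the schema keys and the `call_llm` callable are hypothetical stand-ins):

```python
import json

# Hypothetical schema: the real prompt's field names are assumptions.
REQUIRED_KEYS = {"form", "lemma", "pos", "tags"}


def parse_llm_response(raw):
    """Validate one LLM response; raise ValueError on malformed output."""
    data = json.loads(raw)  # raises if the string is not valid JSON
    if not REQUIRED_KEYS.issubset(data):
        raise ValueError(f"missing keys: {REQUIRED_KEYS - set(data)}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    return data


def tag_with_retries(call_llm, form, max_attempts=3):
    """Re-query the model until it returns schema-conformant JSON."""
    for _ in range(max_attempts):
        try:
            return parse_llm_response(call_llm(form))
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            continue  # malformed output: ask again
    return None  # leave the form untagged for the next trigger run
```

Rejecting and retrying, rather than patching a malformed response, keeps the Lemmas and Forms tables from ever receiving a row that cannot be linked.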
## 📈 Results
- Efficiency: Automated a tagging workload that would typically take a human linguist months of manual entry.
- Coverage: Successfully captured the “core” of the language by prioritizing high-frequency tokens.
- Data Integrity: Created a clean, relational dataset ready for use in dictionary building or downstream NLP tasks.