Spoken Georgian Frequency Analysis
📌 Project Overview
Most frequency lists for the Georgian language are derived from formal sources like news articles or academic texts. This project aims to bridge the gap between “classroom Georgian” and “street Georgian” by creating a frequency corpus based on 601 YouTube videos across various genres.
⚙️ The Data Pipeline
The project uses a multi-stage extraction and cleaning process:
- Discovery & Filtering: Used Filmot to search YouTube specifically for Georgian-language content, filtering for metadata that indicates authentic, high-quality speech.
- Transcript Extraction: Leveraged the youtube-transcript-api library in Python to programmatically fetch the captions/transcripts for the identified 601-video dataset.
- Frequency Processing: A custom Python script cleans the raw text (removing timestamps, formatting, and non-alphabetic characters) and calculates the frequency distribution of tokens.
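The frequency-processing stage can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names `tokenize` and `frequency_list` are hypothetical, the input is assumed to be a list of raw transcript strings that have already been fetched, and stripping bracketed caption annotations such as `[მუსიკა]` is an assumed cleaning rule.

```python
import re
import unicodedata
from collections import Counter

# The 33 letters of the modern Georgian (Mkhedruli) alphabet: U+10D0 (ა) to U+10F0 (ჰ).
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10F0]+")

# Bracketed caption annotations such as [მუსიკა] ("music").
# Assumption: the project's real cleaning rules may differ.
BRACKETED = re.compile(r"\[[^\]]*\]")


def tokenize(raw_text):
    """Return the Georgian-letter tokens found in one raw transcript string."""
    text = unicodedata.normalize("NFC", raw_text)
    text = BRACKETED.sub(" ", text)
    # Matching only Georgian letters implicitly drops timestamps, digits,
    # punctuation, and Latin-script fragments.
    return GEORGIAN_WORD.findall(text)


def frequency_list(transcripts):
    """Count tokens across all transcripts, most frequent first."""
    counts = Counter()
    for raw in transcripts:
        counts.update(tokenize(raw))
    return counts.most_common()


if __name__ == "__main__":
    sample = ["00:01 და მე [მუსიკა] ვარ, და!", "და hello 123 მე"]
    for word, count in frequency_list(sample):
        print(word, count)
```

Matching on the alphabet range rather than deleting unwanted characters one class at a time keeps the cleaning step short, at the cost of discarding loanwords written in Latin script.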
📊 Impact & Insights
The resulting dataset represents the spoken lexicon more accurately than lists built from formal sources. It includes common filler words, conversational particles, and informal verbal forms, which are often missing from traditional dictionaries.
- Dataset Size: 601 videos.
- Primary Output: A prioritized “spoken” frequency list.
- Linguistic Value: Identifies the “High-Yield” vocabulary truly necessary for listening comprehension in non-formal environments.
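One way to make "High-Yield" concrete is to measure what share of all running tokens the top N words account for. The project does not state how it defines the term, so the sketch below is only an assumption; `cumulative_coverage` is a hypothetical helper that takes a `Counter` of token counts.

```python
from collections import Counter


def cumulative_coverage(counts, top_n):
    """Fraction of all token occurrences covered by the top_n most frequent words."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


if __name__ == "__main__":
    # Toy counts for illustration only, not figures from the dataset.
    counts = Counter({"და": 3, "მე": 2, "ვარ": 1})
    print(cumulative_coverage(counts, 1))  # 0.5
```

A learner-facing list could then be cut off at the smallest N that reaches a chosen coverage target.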
🛠 Technical Stack
- Python: Core processing and API interaction.
- Filmot: Advanced metadata filtering for video discovery.
- YouTube Transcript API: Automated data retrieval.