Spoken Georgian Frequency Analysis

📌 Project Overview

Most frequency lists for the Georgian language are derived from formal sources such as news articles or academic texts. This project aims to bridge the gap between “classroom Georgian” and “street Georgian” by building a word-frequency list from the transcripts of 601 YouTube videos across various genres.

⚙️ The Data Pipeline

The project uses a three-stage extraction and cleaning process:

Discovery & Filtering: Filmot is used to search YouTube for Georgian-language content, filtering on metadata that indicates authentic, high-quality speech.
Transcript Extraction: The youtube-transcript-api Python library programmatically fetches the captions for each of the 601 identified videos.
Frequency Processing: A custom Python script cleans the raw text (removing timestamps, formatting, and non-alphabetic characters) and computes the frequency distribution of the remaining tokens.
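
The extraction and counting stages can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. The helper names (fetch_transcript_text, tokenize, count_frequencies) are hypothetical, and treating a token as an unbroken run of Mkhedruli letters is an assumption about the cleaning rules. The fetch call follows the 1.x interface of youtube-transcript-api; older releases expose a different entry point (YouTubeTranscriptApi.get_transcript), so adjust it to the installed version.

```python
import re
from collections import Counter

# Assumption: a token is an unbroken run of Mkhedruli letters (U+10D0 to U+10FA).
# Timestamps, digits, punctuation, and Latin text are dropped as a side effect.
GEORGIAN_WORD = re.compile(r"[\u10D0-\u10FA]+")


def fetch_transcript_text(video_id: str) -> str:
    """Fetch the Georgian captions of one video as a single string.

    Hypothetical helper; assumes youtube-transcript-api 1.x is installed.
    """
    # Imported lazily so the counting code below runs without the package.
    from youtube_transcript_api import YouTubeTranscriptApi

    transcript = YouTubeTranscriptApi().fetch(video_id, languages=["ka"])
    return " ".join(snippet.text for snippet in transcript)


def tokenize(text: str) -> list[str]:
    """Return the Georgian-script tokens found in raw caption text."""
    return GEORGIAN_WORD.findall(text)


def count_frequencies(texts) -> Counter:
    """Count token occurrences across an iterable of transcript strings."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


if __name__ == "__main__":
    # Stand-in strings instead of real transcripts, so no network access is needed.
    sample = ["[00:01] კი, კი, რა თქმა უნდა!", "რა ხდება? კი"]
    for word, n in count_frequencies(sample).most_common(3):
        print(word, n)
```

In a full run, the list passed to count_frequencies would hold one fetched transcript per video ID, and most_common() with no argument would yield the complete ranked list.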

📊 Impact & Insights

The resulting dataset represents the spoken lexicon more accurately than lists built from formal sources. It captures common filler words, conversational particles, and informal verb forms, all of which are often missing from traditional dictionaries.

Dataset Size: 601 videos.
Primary Output: A prioritized “spoken” frequency list.
Linguistic Value: Identifies the high-yield vocabulary most needed for listening comprehension in informal settings.

🛠 Technical Stack

Python: Core processing and API interaction.
Filmot: Advanced metadata filtering for video discovery.
youtube-transcript-api: Automated transcript retrieval.
