Uzbek-English Pretrained Language Model (140M Parameters)

Collected a diverse Uzbek-English dataset and trained a BPE tokenizer from scratch (~62,016-token vocabulary). Built a modified decoder-only Transformer with 140.7M parameters and a 1024-token context window. Pre-trained on 2x NVIDIA A100 GPUs for 16.5 hours, reaching a cross-entropy loss of ~3.5. Instruction fine-tuning and downstream alignment are planned.

PyTorch · Transformers · BPE · A100 GPU · Pretraining
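The core of training a BPE tokenizer from scratch is a merge loop: start from character-level symbols and repeatedly fuse the most frequent adjacent pair. A minimal pure-Python sketch of that loop (the toy corpus, the `</w>` end-of-word marker, and the merge count here are illustrative, not the project's actual setup):

```python
import re
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Merge every standalone occurrence of the pair into one symbol."""
    # Lookarounds keep the match aligned to whole symbols, not substrings.
    pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    joined = "".join(pair)
    return {pat.sub(joined, word): freq for word, freq in words.items()}

def train_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of words."""
    # Start from character-level symbols plus an end-of-word marker.
    words = Counter(" ".join(list(w) + ["</w>"]) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(best, words)
    return merges
```

In practice a library such as Hugging Face `tokenizers` runs this loop at scale; the sketch only shows the algorithm that produces the ~62K-entry vocabulary.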

LLMs from Scratch — GPT-2 Implementation & Fine-Tuning

Full GPT-2 implementation from scratch: text preprocessing, positional encodings, multi-head self-attention, and pretraining with causal language modeling. Fine-tuned the model for spam classification and instruction-tuned it for prompt following.

PyTorch · GPT-2 · CLM · Instruction Tuning
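The piece that makes pretraining "causal" is the attention mask: each token may attend only to itself and earlier tokens. A single-head sketch of that masking, written in NumPy rather than PyTorch to stay dependency-light (shapes and weight matrices are illustrative):

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """Single-head causal self-attention: position t attends only to
    positions <= t, the masking used in GPT-style pretraining."""
    T, d = x.shape
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = (q @ k.T) / np.sqrt(W_k.shape[1])
    # Boolean mask over strictly-future positions (above the diagonal).
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Numerically stable softmax; exp(-inf) gives exactly zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In the full GPT-2 implementation this runs per head with learned projections; stacking heads and adding the output projection recovers standard multi-head attention.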

Multimodal Video Search & Interaction with RAG

Retrieval-augmented video Q&A system combining speech transcription, multimodal embeddings, and vector search to enable natural-language interaction with video content.

Whisper · LLaVA · BridgeTower · LanceDB · LangChain · Gradio
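At the heart of the RAG step is nearest-neighbor search over embedded video segments: embed the user's question, rank stored segment embeddings by similarity, and feed the top matches to the answering model. A toy stand-in for that step (in the actual project, multimodal embeddings and a vector store such as LanceDB play these roles; the segment ids and vectors below are made up):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_segments(query_embedding, index, top_k=2):
    """Rank indexed video segments by similarity to the query embedding
    and return the ids of the top_k matches (the retrieval step of RAG)."""
    ranked = sorted(index,
                    key=lambda item: cosine(query_embedding, item[1]),
                    reverse=True)
    return [segment_id for segment_id, _ in ranked[:top_k]]
```

A real vector database replaces the linear scan with an approximate index, but the ranking principle is the same.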

Neural Conversational Chatbot

Seq2seq chatbot with bidirectional GRU encoder and Luong global attention, trained on 220K+ Cornell Movie-Dialogs pairs.

PyTorch · Seq2Seq · GRU · Attention
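Luong global attention lets the decoder weight every encoder time step at each generation step. A minimal NumPy sketch of the "dot" scoring variant (the project itself is in PyTorch; shapes here are illustrative):

```python
import numpy as np

def luong_dot_attention(decoder_h, encoder_outputs):
    """Luong global attention with the 'dot' score: weight each encoder
    time step by its similarity to the current decoder hidden state,
    then form a context vector as the weighted sum."""
    scores = encoder_outputs @ decoder_h               # (T_enc,)
    weights = np.exp(scores - scores.max())            # stable softmax
    weights = weights / weights.sum()
    context = weights @ encoder_outputs                # (hidden,)
    return context, weights
```

The context vector is then concatenated with the decoder state to predict the next token; Luong's "general" and "concat" scores swap in a learned transform for the raw dot product.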