LLM Pipeline

November 2024

Scala, Apache Spark, AWS EMR, Hadoop MapReduce, DeepLearning4J, JTokkit

An end-to-end distributed training pipeline for Large Language Models (LLMs) using Scala, Apache Spark, and AWS EMR.

Motivation

Developed to explore distributed machine learning and large-scale text processing for LLM training.

Implemented Hadoop MapReduce and Spark jobs on AWS EMR
Applied BPE tokenization with custom mappers/reducers
Generated distributed word2vec-style embeddings with DeepLearning4J
Trained transformer architectures with parameter averaging
Deployed the pipeline with runtime monitoring of training loss and memory utilization
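
To make the parameter-averaging step concrete, here is a minimal sketch in plain Scala. It is not the project's code. It assumes a toy one-weight linear model in place of a transformer and uses ordinary collections in place of Spark partitions. All names (ParameterAveragingSketch, localStep, average, train) are hypothetical. Each round, every worker takes one gradient step on its own shard starting from the shared parameters, and the results are averaged into the next shared parameters.

```scala
// Hypothetical sketch of parameter averaging; plain Scala collections stand in
// for Spark partitions, and a one-weight linear model stands in for the network.
object ParameterAveragingSketch {
  type Params = Vector[Double]
  type Shard  = Seq[(Double, Double)] // (x, y) training pairs held by one worker

  /** One gradient-descent step for the model y = w * x under mean squared error. */
  def localStep(params: Params, shard: Shard, lr: Double): Params = {
    val w    = params(0)
    val grad = shard.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / shard.size
    Vector(w - lr * grad)
  }

  /** Element-wise mean of the parameter vectors returned by the workers. */
  def average(workerParams: Seq[Params]): Params = {
    require(workerParams.nonEmpty, "need at least one worker result")
    val n = workerParams.size.toDouble
    workerParams
      .reduce((a, b) => a.zip(b).map { case (p, q) => p + q })
      .map(_ / n)
  }

  /** Each round: every worker steps from the shared params, then results are averaged. */
  def train(shards: Seq[Shard], rounds: Int, lr: Double): Params =
    (1 to rounds).foldLeft(Vector(0.0): Params) { (global, _) =>
      average(shards.map(shard => localStep(global, shard, lr)))
    }

  def main(args: Array[String]): Unit = {
    // Two shards sampled from y = 3x; the averaged weight should approach 3.
    val shards: Seq[Shard] = Seq(
      Seq((1.0, 3.0), (2.0, 6.0)),
      Seq((3.0, 9.0), (4.0, 12.0))
    )
    val learned = train(shards, rounds = 50, lr = 0.02)
    println(f"learned weight: ${learned(0)}%.4f")
  }
}
```

In a Spark setting, the `shards.map(...)` call would correspond to work done per partition and `average` to the aggregation step on the driver. That mapping is an assumption about the general technique, not a description of this pipeline's internals.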