LLM Pipeline
November 2024
Scala, Apache Spark, AWS EMR, Hadoop MapReduce, DeepLearning4J, JTokkit
An end-to-end distributed training pipeline for Large Language Models (LLMs) using Scala, Apache Spark, and AWS EMR.
Motivation
Developed to explore distributed machine learning and large-scale text processing for LLM training.
Key Learnings
- Implemented MapReduce- and Spark-based processing jobs on AWS EMR
- Applied BPE tokenization with custom mappers/reducers
- Generated distributed word2vec-style embeddings with DeepLearning4J
- Trained transformer architectures with parameter averaging
- Deployed the pipeline with runtime monitoring of training loss and memory utilization
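
As a rough illustration of the parameter averaging mentioned above: each worker trains its own model replica on a data shard, and the replicas' parameters are then combined by an element-wise mean. The sketch below is plain Scala with no Spark or DeepLearning4J dependency; the `ParameterAveraging` object, the `Params` alias, and the flattened-vector representation are assumptions made for illustration, not the project's actual code.

```scala
// Minimal sketch of parameter averaging (illustrative only; names are hypothetical).
object ParameterAveraging {
  // Assumption: each replica's parameters are flattened into a single vector.
  type Params = Vector[Double]

  /** Element-wise mean of the parameter vectors reported by the workers. */
  def average(workerParams: Seq[Params]): Params = {
    require(workerParams.nonEmpty, "need at least one worker")
    val dim = workerParams.head.length
    require(workerParams.forall(_.length == dim),
      "all parameter vectors must have the same length")

    // Sum the vectors position by position, then divide by the worker count.
    val sums = workerParams.reduce { (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    }
    sums.map(_ / workerParams.size)
  }
}
```

In a distributed setting the same reduction would typically run over the workers' results after each training round, with the averaged vector broadcast back to every replica before the next round.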