
LLM Pipeline

November 2024

Scala · Apache Spark · AWS EMR · Hadoop MapReduce · DeepLearning4J · JTokkit

An end-to-end distributed training pipeline for Large Language Models (LLMs) using Scala, Apache Spark, and AWS EMR.

Motivation

Developed to explore distributed machine learning and large-scale text processing for LLM training.

Key Learnings

  • Implemented Hadoop MapReduce and Spark jobs and ran them on AWS EMR
  • Applied byte-pair encoding (BPE) tokenization with custom mappers/reducers
  • Generated distributed word2vec-style embeddings with DeepLearning4J
  • Trained transformer architectures with parameter averaging
  • Deployed the pipeline with runtime monitoring of training loss and memory utilization
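
As a rough illustration of the parameter averaging mentioned above, the sketch below shows the idea in plain Scala: each worker takes a local gradient step on its own data shard, and the resulting parameter vectors are averaged into the next global model. It is not the project's actual code. The object name, the one-weight toy model y = w * x, the shard layout, and the learning rate are all assumptions made for illustration, and ordinary Scala collections stand in for Spark partitions.

```scala
// Minimal sketch of parameter averaging (hypothetical names, toy model).
// Plain Scala collections stand in for the Spark workers of a real pipeline.
object ParameterAveragingSketch {
  type Params = Vector[Double]

  // One local gradient-descent step on a worker's shard.
  // Assumed toy model: y = w * x with mean squared error, so Params holds one weight.
  def localStep(params: Params, shard: Seq[(Double, Double)], lr: Double): Params = {
    val w = params(0)
    val grad = shard.map { case (x, y) => 2.0 * (w * x - y) * x }.sum / shard.size
    Vector(w - lr * grad)
  }

  // Element-wise mean of the parameter vectors returned by the workers.
  def average(workerParams: Seq[Params]): Params = {
    require(workerParams.nonEmpty, "need at least one worker result")
    val n = workerParams.size
    workerParams
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      .map(_ / n)
  }

  // Each round: send the global parameters to every shard, step locally, then average.
  def train(shards: Seq[Seq[(Double, Double)]], init: Params, rounds: Int, lr: Double): Params =
    (1 to rounds).foldLeft(init) { (global, _) =>
      average(shards.map(shard => localStep(global, shard, lr)))
    }

  def main(args: Array[String]): Unit = {
    // Two shards sampled from y = 2x; the averaged weight should approach 2.
    val shards = Seq(Seq((1.0, 2.0), (2.0, 4.0)), Seq((3.0, 6.0)))
    val trained = train(shards, init = Vector(0.0), rounds = 50, lr = 0.05)
    println(s"learned weight: ${trained(0)}")
  }
}
```

In a Spark setting the `shards.map(...)` step would correspond to work done on each partition and `average` to the aggregation on the driver; how the original pipeline wired this together is not stated here.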