Bazel Spark

Developing an automated setup tool for local PySpark projects

DATA SCIENCE

7/28/2024 · 1 min read

PySpark and Delta Lake are very powerful tools, but the requirement of working with them via Databricks can be restrictive for some projects. Although the two building blocks (PySpark and Delta Lake) are open source, their setup and configuration can be tricky to get right. This project aims to fix that using Bazel, an automated build and testing framework.
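
For context, here is a minimal sketch of the kind of local setup the tool automates: a PySpark session wired up for Delta Lake, with no Databricks involved. It assumes the pyspark and delta-spark packages are installed in the active environment (the configure_spark_with_delta_pip helper comes from delta-spark), and the table path is just an example.

    # Minimal local PySpark + Delta Lake session. Assumes `pyspark` and
    # `delta-spark` are installed (e.g. via conda/pip in the environment).
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = (
        SparkSession.builder.appName("local-delta-demo")
        .master("local[*]")  # run Spark locally on all available cores
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )
    # configure_spark_with_delta_pip pulls in the matching Delta Lake jars
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write and read back a tiny Delta table to verify the setup works
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-demo")
    spark.read.format("delta").load("/tmp/delta-demo").show()
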

Goals & Requirements

  • Streamlined, One-Click Setup

  • Environments set up using Anaconda

  • Multiple distinct environments can be supported on a single machine

  • Support for S3 data storage (see the sketch after this list)

  • Platform-Agnostic installation (Mac/Windows/Linux) -- [stretch goal]

  • UI for better UX -- [stretch goal]
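
To give a flavor of the S3 goal above, here is a hedged sketch of reading data from S3 with the s3a connector in a local session. The hadoop-aws version and the bucket path are placeholders, not part of the project: the jar version must match the Hadoop build your Spark distribution was compiled against, and credentials are picked up from the standard AWS chain (environment variables, profiles, etc.).

    # Illustrative only: reading data from S3 via the s3a connector.
    # The hadoop-aws coordinates below are an example; match them to
    # the Hadoop version bundled with your Spark install.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("s3-demo")
        .master("local[*]")
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
        .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
        .getOrCreate()
    )

    # `my-bucket/path` is a placeholder; point it at your own data.
    df = spark.read.parquet("s3a://my-bucket/path/")
    df.show()
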

As of Summer 2024, the workflow is still in development, but initial proofs of concept have been successful. Eventually I'd like to migrate this to a more user-friendly setup with a YAML file interface, or perhaps a graphical UI built with Panel.

There's not all that much to "show" as a result here, but give it a try!

GitHub Repo with Instructions