Setting up delta-spark on Mac using Multipass

Yuwei Sung
5 min read · Nov 30, 2024


I spent a bit of time searching for how to set up a local Delta Lake environment on a Mac and only found container-based solutions (starting from Spark images). A Spark image is a fast path to writing your first Spark job, but it hides the details of the environment setup, and I wanted a more “ground up” approach. Recently I found that Multipass from Canonical is a fast way to bring up a Linux virtual machine on an Apple M2 Mac. This article is the beginning of my Delta Lake journey.

First of all, it is pretty easy to set up Multipass on a Mac: brew away…

# install multipass 
brew install --cask multipass

Once Multipass is installed, you can use “launch” to set up a virtual machine. In my case, I specify an Ubuntu 24.04 image with 4 CPUs, 2GB of memory, a 10GB disk, and a local volume mount. You can find those launch options in the Multipass command-line help. After the instance is ready, you can shell into it like a docker container.

# launch 24.04
multipass launch 24.04 -c 4 -d 10G -m 2G --mount ~/git/delta:/data -n delta
# shell into ubuntu
multipass shell delta
ubuntu@delta:~$

After shelling into the instance, as with any fresh installation, you need to do some package upgrades. I also follow the Delta docs to 1) install the JDK and the Python venv package, and 2) download and untar the Spark binary.

# upgrade packages
ubuntu@delta:~$sudo apt upgrade -y
...
# install openjdk and python-venv
ubuntu@delta:~$sudo apt install -y default-jdk python3.12-venv
...
# download spark binary
ubuntu@delta:~$wget https://dlcdn.apache.org/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3.tgz
...
# untar spark binary
ubuntu@delta:~$tar zxvf spark-3.5.3-bin-hadoop3.tgz
...
# move binary to opt
ubuntu@delta:~$sudo mv spark-3.5.3-bin-hadoop3 /opt
# delete the tarball
ubuntu@delta:~$rm spark-3.5.3-bin-hadoop3.tgz
# symlink spark
ubuntu@delta:~$sudo ln -s /opt/spark-3.5.3-bin-hadoop3 /opt/spark

In most Linux distributions, Python is already installed and many system services depend on the system Python libraries. Installing pyspark or delta-spark directly into the system Python is not recommended, so I use a Python venv in the following setup steps.

# create venv
ubuntu@delta:~$python3 -m venv pyspark
# setup spark home, pyspark python path
ubuntu@delta:~$echo 'export SPARK_HOME=/opt/spark' >> .bashrc
ubuntu@delta:~$echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> .bashrc
ubuntu@delta:~$echo 'source ~/pyspark/bin/activate' >> .bashrc
ubuntu@delta:~$echo 'export PYSPARK_DRIVER_PYTHON=python' >> .bashrc
ubuntu@delta:~$echo 'export PYSPARK_PYTHON=~/pyspark/bin/python' >> .bashrc
ubuntu@delta:~$source .bashrc
# in pyspark venv
(pyspark) ubuntu@delta:~$
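
Before installing anything else, a quick sanity check confirms the venv and the exports above are in effect. This is just a sketch; the expected values simply follow from the .bashrc lines we added:

# confirm SPARK_HOME and the pyspark python path are set
(pyspark) ubuntu@delta:~$python -c 'import os; print(os.environ.get("SPARK_HOME"), os.environ.get("PYSPARK_PYTHON"))'
# should print /opt/spark and the venv python path set above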

Now it is safe to install delta-spark using pip. Note that you should always check the Delta vs. Spark version compatibility.

(pyspark) ubuntu@delta:~$ pip install delta-spark==3.2.1
Collecting delta-spark==3.2.1
Downloading delta_spark-3.2.1-py3-none-any.whl.metadata (1.9 kB)
Collecting pyspark<3.6.0,>=3.5.3 (from delta-spark==3.2.1)
Downloading pyspark-3.5.3.tar.gz (317.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 11.4 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting importlib-metadata>=1.0.0 (from delta-spark==3.2.1)
Downloading importlib_metadata-8.5.0-py3-none-any.whl.metadata (4.8 kB)
Collecting zipp>=3.20 (from importlib-metadata>=1.0.0->delta-spark==3.2.1)
Downloading zipp-3.21.0-py3-none-any.whl.metadata (3.7 kB)
Collecting py4j==0.10.9.7 (from pyspark<3.6.0,>=3.5.3->delta-spark==3.2.1)
Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)
Downloading delta_spark-3.2.1-py3-none-any.whl (21 kB)
Downloading importlib_metadata-8.5.0-py3-none-any.whl (26 kB)
Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 200.5/200.5 kB 16.2 MB/s eta 0:00:00
Downloading zipp-3.21.0-py3-none-any.whl (9.6 kB)
Building wheels for collected packages: pyspark
Building wheel for pyspark (pyproject.toml) ... done
Created wheel for pyspark: filename=pyspark-3.5.3-py2.py3-none-any.whl size=317840629 sha256=6ebe46664d2f3dad5c865a916608d71f7846fee6ab67b80278688aea8b173ade
Stored in directory: /home/ubuntu/.cache/pip/wheels/07/a0/a3/d24c94bf043ab5c7e38c30491199a2a11fef8d2584e6df7fb7
Successfully built pyspark
Installing collected packages: py4j, zipp, pyspark, importlib-metadata, delta-spark
Successfully installed delta-spark-3.2.1 importlib-metadata-8.5.0 py4j-0.10.9.7 pyspark-3.5.3 zipp-3.21.0
(pyspark) ubuntu@delta:~$
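
To double-check that the installed delta-spark and pyspark versions are a compatible pair, you can print them from the venv. This quick check uses only the Python standard library; the expected output matches the pip log above:

# double-check the installed pairing (delta-spark 3.2.x expects pyspark 3.5.x)
(pyspark) ubuntu@delta:~$python
>>> from importlib.metadata import version
>>> version("delta-spark"), version("pyspark")
('3.2.1', '3.5.3')
>>> exit()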

Following the Delta docs, we can run a shell test that writes a series of numbers to the data volume we mounted on the instance.

# run pyspark with delta
(pyspark) ubuntu@delta:~$pyspark --packages io.delta:delta-spark_2.12:3.2.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Python 3.12.3 (main, Nov 6 2024, 18:32:19) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
24/11/29 23:15:32 WARN Utils: Your hostname, delta resolves to a loopback address: 127.0.1.1; using 192.168.205.5 instead (on interface enp0s1)
24/11/29 23:15:32 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/opt/spark-3.5.3-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-34282855-89d1-4291-88db-1a52c20bdfe7;1.0
confs: [default]
found io.delta#delta-spark_2.12;3.2.0 in central
found io.delta#delta-storage;3.2.0 in central
found org.antlr#antlr4-runtime;4.9.3 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.2.0/delta-spark_2.12-3.2.0.jar ...
[SUCCESSFUL ] io.delta#delta-spark_2.12;3.2.0!delta-spark_2.12.jar (308ms)
downloading https://repo1.maven.org/maven2/io/delta/delta-storage/3.2.0/delta-storage-3.2.0.jar ...
[SUCCESSFUL ] io.delta#delta-storage;3.2.0!delta-storage.jar (38ms)
downloading https://repo1.maven.org/maven2/org/antlr/antlr4-runtime/4.9.3/antlr4-runtime-4.9.3.jar ...
[SUCCESSFUL ] org.antlr#antlr4-runtime;4.9.3!antlr4-runtime.jar (43ms)
:: resolution report :: resolve 2005ms :: artifacts dl 399ms
:: modules in use:
io.delta#delta-spark_2.12;3.2.0 from central in [default]
io.delta#delta-storage;3.2.0 from central in [default]
org.antlr#antlr4-runtime;4.9.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 3 | 3 | 0 || 3 | 3 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-34282855-89d1-4291-88db-1a52c20bdfe7
confs: [default]
3 artifacts copied, 0 already retrieved (6321kB/14ms)
24/11/29 23:15:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.3
/_/

Using Python version 3.12.3 (main, Nov 6 2024 18:32:19)
Spark context Web UI available at http://192.168.205.5:4040
Spark context available as 'sc' (master = local[*], app id = local-1732943736264).
SparkSession available as 'spark'.
>>> data = spark.range(0,10)
>>> data.write.format("delta").mode("overwrite").save("/data/testShell")
>>> exit()

After exiting the pyspark prompt, we can find the data saved in the instance volume mount (/data). The folder contains a _delta_log directory and four parquet files. The number of parquet files may vary, depending on how many CPUs you gave the instance.

# check the output
(pyspark) ubuntu@delta:~$ls /data/testShell/
_delta_log part-00002-5df0e7e7-40ef-4f2b-ad03-9ca7246cc5d3-c000.snappy.parquet
part-00000-d824cff5-439c-4915-bde0-ee23e41264e6-c000.snappy.parquet part-00003-23f80fa3-c2f2-4d1d-a30f-9f7f88212a55-c000.snappy.parquet
part-00001-6fd3e51f-a381-45c0-a91c-fcd42cf26d95-c000.snappy.parquet
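
We can also read the table back to verify the round trip. Below is a minimal sketch of a standalone script based on the Python example in the Delta docs; the file name read_test.py and the app name are my own choices. configure_spark_with_delta_pip wires in the Delta jars that match the pip-installed delta-spark, so there is no need to pin a --packages version by hand. Run it inside the venv with python read_test.py.

# read_test.py -- read back the table written by the shell test
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# build a Delta-enabled session; configure_spark_with_delta_pip adds the
# delta jars matching the pip-installed delta-spark version
builder = (
    SparkSession.builder.appName("read-testShell")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# load the delta table from the mounted volume and show the ten rows
df = spark.read.format("delta").load("/data/testShell")
df.show()
spark.stop()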

After the task is done, exit the instance shell and stop the instance. Next time, you can just run multipass start delta and shell back in. As a bonus, you can take snapshots in case something gets screwed up.

(pyspark) ubuntu@delta:~$ exit
logout
# back to my iTerm2
[~/git/delta]$ ls testShell
_delta_log part-00002-5df0e7e7-40ef-4f2b-ad03-9ca7246cc5d3-c000.snappy.parquet
part-00000-d824cff5-439c-4915-bde0-ee23e41264e6-c000.snappy.parquet part-00003-23f80fa3-c2f2-4d1d-a30f-9f7f88212a55-c000.snappy.parquet
part-00001-6fd3e51f-a381-45c0-a91c-fcd42cf26d95-c000.snappy.parquet
[~/git/delta]$ multipass stop delta
# take a snapshot
[~/git/delta]$ multipass snapshot delta -n spark-tutorial1
Snapshot taken: delta.spark-tutorial1
[~/git/delta]$

This article summarizes my local delta-spark environment setup. Keep learning.
