Scientific computing on Apple M1, part I: ASE and GPAW

(Update 13.2.2021: changed installation instructions to ASE and GPAW development versions from Git.
Update 17.2.2021: added benchmark for real DFT runs with GPAW.
Update 12.4.2021: added missing PAW setup installation instructions.)

I’ve long been an Apple enthusiast, and greatly enjoy what macOS offers: a sleek, modern GUI with extensive software support, coupled with a true terminal environment for my scientific computing work. Recently, Apple released their first Mac computers built on a novel in-house Apple silicon ARM architecture, with the first processor in the series dubbed the M1.

I got my hands on a first-gen M1 MacBook Pro, and though it is very early days for native Apple silicon support, much of the Python code that I use in my everyday work is already fully functional and running natively thanks to the Conda-forge Miniforge project.

In this post, which I expect will be the first in a series, I’ll share the code that got me running with a basic Python, scipy, and matplotlib environment. However, I immediately took it further, getting a working – and quite well-performing – installation of the Atomic Simulation Environment (ASE), used for building, manipulating and visualizing atomistic structure files, as well as a parallel installation of the density functional theory code GPAW.

Please chime in in the comments if you run into any problems, as this was a bit of a trial-and-error process and the software environment is evolving quickly. I did some initial benchmarks, and the M1 chip was running circles around the Intel Xeon W chip in my iMac Pro.

In future posts, I’ll try to get the Nion Swift microscope control software working, as well as our in-house transmission electron microscopy simulation code abTEM – hopefully, eventually with GPU-acceleration via Apple’s Metal APIs!

Installing ASE and GPAW


# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install conda
curl -O https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh

chmod u+x Miniforge3-MacOSX-arm64.sh
./Miniforge3-MacOSX-arm64.sh

conda config --set auto_activate_base false # Optional

# Create and activate conda environment
conda create --name scicomp python=3.8
source ~/.zshrc
conda activate scicomp

# Install required packages for ASE; pip does not work
conda install matplotlib
conda install scipy=1.5.3
conda install pillow

# Install ASE development version
git clone https://gitlab.com/ase/ase.git
cd ase
pip install --editable .

# Run ASE tests
conda install pytest
pytest --pyargs ase

# Start installing required packages for GPAW
brew install libxc

# Check that the paths match your system! (`brew --prefix libxc` prints the install prefix; the libxc version below may differ)
export C_INCLUDE_PATH=/opt/homebrew/Cellar/libxc/4.3.4_1/include
export LIBRARY_PATH=/opt/homebrew/Cellar/libxc/4.3.4_1/lib
export LD_LIBRARY_PATH=/opt/homebrew/Cellar/libxc/4.3.4_1/lib

brew install fftw

brew install open-mpi

brew install scalapack

# Point the GPAW build at the Homebrew OpenBLAS (install with `brew install openblas` if missing)
export LDFLAGS="-L/opt/homebrew/opt/openblas/lib"
export CPPFLAGS="-I/opt/homebrew/opt/openblas/include"

# Install GPAW development version from Git
git clone https://gitlab.com/gpaw/gpaw.git
cd gpaw
pip install --editable .

# Install PAW setups
cd ~
curl -O https://wiki.fysik.dtu.dk/gpaw-files/gpaw-setups-0.9.20000.tar.gz
tar -zxf gpaw-setups-0.9.20000.tar.gz
echo 'export GPAW_SETUP_PATH=~/gpaw-setups-0.9.20000' >> ~/.zprofile
source ~/.zprofile

# Test GPAW
conda install pytest-xdist
pytest -n 4 --pyargs gpaw # Running tests in parallel on 4 cores.

Benchmarking: ASE and GPAW tests

Here is some benchmarking for my iMac Pro desktop and the M1 Macbook Pro laptop.

The 2017 iMac Pro has a 10-core 3 GHz Intel Xeon W processor, 64 GB of 2666 MHz DDR4 memory, 10 MB of L2 cache, and 13.8 MB of L3 cache. The late-2020 MacBook Pro’s M1 processor has 4 high-performance 3.2 GHz ARM cores and 4 high-efficiency cores, 16 GB of LPDDR4X-4266 memory, and 12 + 4 MB of L2 cache.

As widely reported when the M1 came out, its raw Geekbench performance is pretty wild: the M1 is 56.5% faster than the iMac Pro in single-core performance, and only 23.6% slower in multicore against the 10-core iMac Pro model. The SSD in the M1 MacBook also appears to be faster than the state-of-the-art-for-2017 in the iMac Pro, but the difference isn’t quite as drastic. The difference in price, however, is: the iMac Pro costs about six times as much as the MacBook Pro (of course, with an excellent 27” screen, but still).

But how about real-life Python scientific computing performance?

Let’s start with running the tests for both ASE and GPAW (using either pytest or pytest-xdist, as shown above), the latter either on a single core, or in parallel on 4 cores (where the M1 has the maximum advantage), and on all cores (where the iMac’s better full multicore performance might come into play).

The times below are averages over three test runs, though I didn’t pay too much mind to what else was open on the systems. The MacBook was running on battery power, and the 8-core test did manage to get its fans running. The results are as follows:

Testsuite (cores)  | ASE (1) | GPAW (1) | GPAW (4) | GPAW (all)
iMac Pro (Xeon W)  | 176.9 s | 3079.7 s | 867.2 s  | 452.1 s
MacBook Pro (M1)   | 105.0 s | 1707.3 s | 469.7 s  | 389.1 s
M1 advantage       | 40.7%   | 44.6%    | 45.8%    | 13.9%
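For clarity, the “M1 advantage” row is simply the relative time saved by the M1, i.e. (t_iMac − t_M1) / t_iMac. A quick sketch of the computation, using the wall times from the table above:

```python
def advantage(t_imac: float, t_m1: float) -> float:
    """Relative time saved by the M1, as a percentage of the iMac Pro time."""
    return (t_imac - t_m1) / t_imac * 100

# (iMac Pro, M1) wall times in seconds, from the table above
times = {
    "ASE (1)": (176.9, 105.0),
    "GPAW (1)": (3079.7, 1707.3),
    "GPAW (4)": (867.2, 469.7),
    "GPAW (all)": (452.1, 389.1),
}

for name, (t_imac, t_m1) in times.items():
    print(f"{name}: {advantage(t_imac, t_m1):.1f}% faster on the M1")
```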

So no: not even 10 high-performance Xeon cores in the iMac Pro, against only 4 high-performance M1 cores in the MacBook Pro, brought the two systems to parity: the M1 MacBook Pro handily wins this comparison.

Next up: more demanding “real” DFT tests!

Benchmarking: DFT runs with GPAW

I then turned to some more demanding DFT runs, taken from the “big” test suite of GPAW, which is normally run weekly on a computing cluster. This allowed me to test more realistic workloads, and also to really push the thermal limits of the M1 MacBook.

The “PBE0” test is a hybrid exchange-correlation functional test running multiple small molecules in succession. The “h2_osc” and “na2_osc” tests run time-dependent DFT molecular dynamics simulations for a hydrogen molecule and a sodium dimer, respectively. Finally, the “CH4Au532” test relaxes a 24-atom gold slab surface with an adsorbed methane molecule.

The results are shown in the table below; note that the times are now given in hours.

Testsuite (cores)  | PBE0 (4) | h2_osc (4) | na2_osc (4) | na2_osc (all) | CH4Au532 (4) | CH4Au532 (all)
iMac Pro (Xeon W)  | 1.5 h    | 1.0 h      | 19.0 h      | 10.2 h        | 21.2 h       | 16.7 h
MacBook Pro (M1)   | 0.7 h    | 0.4 h      | 7.0 h       | 9.9 h         | 26.0 h       | 30.0 h
M1 advantage       | 52.8%    | 63.1%      | 62.9%       | 2.6%          | -22.8%       | -79.5%

The results were initially quite impressive: the M1 advantage was even greater in this test, with up to 63% advantage over the iMac and no sign of thermal performance throttling when running on 4 cores. Running on all cores for the na2_osc benchmarks evened things out, though the M1 still managed to slightly beat the 10-core Xeon W.

However, the last test was an exception: here the iMac clearly came out on top. More remarkably, the performance of the M1 system decreased when increasing the number of cores from 4 to 8. This was quite puzzling until I looked at the memory consumption: the matrices calculated in this example did not actually fit into the free memory on the M1, and the system was thus heavily swapping to disk.

For the 8-core run, the total swap volume was a cool 3.5 TB per core (no, that is not a typo)! Since each core had to store the same data in this particular calculation, adding more cores meant more swap, and thus worse performance. It’s actually rather stunning that the 4-core run was only 22.8% slower with this much of a handicap!
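To see why adding cores hurts here: when every MPI rank keeps its own full copy of the matrices, total memory demand grows linearly with the core count, and anything beyond physical RAM spills over into swap. A toy sketch of that scaling (the 6 GB per-rank working set below is a made-up illustration, not the actual GPAW figure):

```python
RAM_GB = 16.0  # M1 MacBook Pro unified memory

def swap_pressure(per_rank_gb: float, n_ranks: int, ram_gb: float = RAM_GB) -> float:
    """GB that no longer fit in RAM when each rank stores a full copy of the data."""
    return max(0.0, per_rank_gb * n_ranks - ram_gb)

# Hypothetical 6 GB working set per rank: more ranks means more replicated data,
# so the overflow (and hence swapping) grows with core count.
for n in (1, 4, 8):
    print(f"{n} ranks: {swap_pressure(6.0, n):.0f} GB over RAM")
```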

Based on this testing, it seems clear that more memory than is currently offered (16 GB) will be beneficial, but more importantly: rumoured upcoming Mac desktops with many more Apple silicon cores will be beasts for scientific computing!
