Jupyter at ACCRE
Jupyter notebooks have become a widely used data analysis platform in the last several years, primarily in academic research and data science. Notebooks provide programmers with the ability to combine code, documentation, analysis, and visualization inside a single document that is accessible from a web interface and therefore easy to share with colleagues.
ACCRE has recently deployed an instance of JupyterHub for hosting notebooks for researchers at VU and VUMC at jupyter.accre.vanderbilt.edu. It is available to all students, faculty, and staff at Vanderbilt University, as well as VUMC employees and members of the Compact Muon Solenoid (CMS) experiment. You do not need a traditional cluster account to use the Jupyter cluster.
The Jupyter cluster is supported by a Vanderbilt TIPs grant. For information on how it is being used and how it came to be, see our announcement post.
Which cluster is right for me?
If you don’t have an ACCRE account, if you are working with students without an ACCRE account (but have a VUnetID), or if you need to use Spark or HDFS, use our dedicated Jupyter cluster.
Keep in mind that the Jupyter cluster cannot access files you have on the traditional cluster (including your /home directory), nor can it access Lmod software packages or submit jobs. If you have an ACCRE account and need access to data or software on the cluster, use the traditional cluster instead. For ACCRE users who wish to use a Jupyter notebook and must access files or programs on the traditional cluster, we provide an option through the ACCRE Visualization Portal.
Our original big data cluster (available by SSH connection to
bigdata.accre.vanderbilt.edu) will remain operational and is available upon request.
To get started, simply point your web browser at https://jupyter.accre.vanderbilt.edu and log in with your VUnetID and ePassword.
Once you have logged in, you can easily create a notebook in Python 2, Python 3, or R (support for other languages may be added later) by clicking the “New” button on the top right of the page. You can also upload files or data, both of which will persist between login sessions: your data will not disappear when you log out and log back in.
The process for creating and running new notebooks is well-documented elsewhere (e.g. http://jupyter.org/). If you are not familiar with notebooks, we would recommend finding a few demos online to walk through before getting started. One of the big advantages of notebooks is their ease of use, so new users should be able to get up and running with basic functionality within an hour or two.
Installing Python Packages
When you create a new Python 2 or Python 3 notebook, you are actually launching something called a kernel to run it. Each programming language comes with its own kernel. The vanilla Python 2 and 3 kernels do not come with many packages pre-installed (the major exception is PySpark, which you can read more about below). If you would like to use other standard packages (e.g. NumPy, SciPy, Matplotlib, etc.), we recommend installing them into a virtual environment and then creating your own custom kernel that can be selected from the “New” button on the top right of your homepage. Note that Python-based kernels in Jupyter also come with support for shell magic commands.
Python packages can be easily installed from within a notebook. For Python 3, you can place the following lines in a single cell (it’s actually important to use a single cell to ensure the Python interpreter in your virtual environment is selected) and execute the cell to build your custom Python 3 kernel with the packages listed below installed via pip. Note that the name of the kernel for this example will appear as py3-data-science (you will probably select a different name):
%%bash
python3 -m virtualenv py3-data-science
source py3-data-science/bin/activate
pip3 install ipykernel numpy scipy matplotlib pandas scikit-learn hdfs
ipython kernel install --user --name=py3-data-science
After successfully running this cell (it may take several minutes to complete, and make sure you confirm there were no error messages in the cell’s output), you can refresh your JupyterHub homepage and select your custom kernel from the “New” drop-down menu.
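If you want to confirm from Python that the kernel was registered, you can check for its kernelspec directory. This is a minimal sketch, assuming the default --user install location on Linux (~/.local/share/jupyter); the kernel name below matches the example above:

```python
from pathlib import Path

# `ipython kernel install --user` places kernelspecs under the user Jupyter
# data directory, which on Linux defaults to ~/.local/share/jupyter.
kernel_name = "py3-data-science"
spec_dir = Path.home() / ".local" / "share" / "jupyter" / "kernels" / kernel_name

if spec_dir.exists():
    print(f"kernel '{kernel_name}' is installed at {spec_dir}")
else:
    print(f"kernel '{kernel_name}' not found; re-run the build cell and check its output for errors")
```

Refreshing the homepage is still required before the new kernel shows up in the “New” menu.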
The process is very similar for Python 2:
%%bash
python2 -m virtualenv py2-data-science
source py2-data-science/bin/activate
pip install ipykernel numpy scipy matplotlib pandas scikit-learn hdfs
ipython kernel install --user --name=py2-data-science
Installing R Packages
Unlike Python, for R we simply recommend installing the packages you need directly into your JupyterHub home directory, without building a new kernel or using a virtual environment. New packages should be installed from an R kernel, and can be accessed by selecting the R kernel in future login sessions. We recommend adding the following lines to the first cell in your R notebooks:
rversion = system('R --version | head -1 | cut -f3 -d " "', intern=TRUE)
mklocaldir = paste('mkdir -v -p ~/R/rlib-', rversion, sep="")
system(mklocaldir, intern=TRUE)
libdir = paste('~/R/rlib-', rversion, sep="")
.libPaths(libdir)
These lines will ensure that packages are installed to the correct location, even when ACCRE admins make newer versions of R available in the future. They will also ensure that your R search path is set up correctly so that the correct packages are loaded within your notebook. Once you have executed this cell, you can load packages that you have already installed, or install new packages in the usual way with install.packages().
Big Data Analysis with HDFS and Spark: Transferring Data to HDFS
HDFS (the Hadoop Distributed File System) is a file store designed for data resiliency and high parallel throughput. The ACCRE JupyterHub instance is connected to several hundred terabytes of HDFS-based storage. Large datasets (larger than a few hundred MB) that you would like to analyze on JupyterHub should be stored in HDFS.
Unlike a traditional filesystem, HDFS is a user space filesystem that is generally not accessed directly but rather through the Linux hdfs command or programmatically through an API (e.g. HDFS has many Python APIs). It is possible to mount HDFS as a directory that is more directly accessible as a traditional Linux filesystem, but at the moment it is not set up in this way. Your HDFS home directory is located at:
hdfs:///store/user/<VUNETID>
From a Python notebook, you can run the following command to see what is inside your HDFS home directory (it should be empty initially):
%%bash
hdfs dfs -ls hdfs:///store/user/<VUNETID>
<VUNETID> should be replaced by your unique VUnetID. The hdfs command supports many other Linux-like subcommands; documentation can be found in the Apache Hadoop FileSystem Shell guide.
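If you build these URIs in Python (for example, before passing them to one of the Python HDFS APIs mentioned above), a small helper can keep the paths consistent. This is just a hypothetical convenience function, not an ACCRE-provided utility:

```python
def hdfs_home(vunetid, *subpaths):
    """Build an HDFS URI under a user's home directory on the ACCRE HDFS store."""
    base = f"hdfs:///store/user/{vunetid}"
    tail = "/".join(subpaths)
    return f"{base}/{tail}" if tail else base

# hdfs_home("jdoe")                          -> "hdfs:///store/user/jdoe"
# hdfs_home("jdoe", "data", "big-data.csv")  -> "hdfs:///store/user/jdoe/data/big-data.csv"
```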
To transfer data to or from HDFS, you can use the “hdfs dfs -put” and “hdfs dfs -get” subcommands, respectively. The standard ACCRE cluster has Hadoop command line tools available via Lmod. While logged into the ACCRE cluster, simply type:
module load Hadoop
to load the hdfs command into your path. Files can then be transferred from ACCRE’s traditional Linux GPFS filesystem to HDFS like so:
hdfs dfs -put big-data.csv hdfs:///store/user/<VUNETID>
Note this must be run from a Linux shell on the ACCRE cluster, not from the Jupyter cluster or from within a notebook. Similarly, data can be downloaded to ACCRE’s GPFS from HDFS like so:
hdfs dfs -get hdfs:///store/user/<VUNETID>/big-data.csv .
To delete data, you can use:
hdfs dfs -rm -skipTrash hdfs:///store/user/<VUNETID>/big-data.csv
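If you drive these transfers from a Python script rather than typing them by hand, you can assemble the same put/get/rm command lines programmatically and hand them to subprocess. This is a sketch only; as above, <VUNETID> is a placeholder, and the commands must ultimately be executed from a shell on the traditional cluster after `module load Hadoop`:

```python
import subprocess  # only needed if you actually execute the commands

HDFS_HOME = "hdfs:///store/user/<VUNETID>"  # replace <VUNETID> with your own

def hdfs_cmd(action, *args):
    """Build an 'hdfs dfs' argument vector for a subcommand such as put, get, or rm."""
    return ["hdfs", "dfs", f"-{action}", *args]

# These mirror the three shell commands shown above:
put_cmd = hdfs_cmd("put", "big-data.csv", HDFS_HOME)
get_cmd = hdfs_cmd("get", f"{HDFS_HOME}/big-data.csv", ".")
rm_cmd = hdfs_cmd("rm", "-skipTrash", f"{HDFS_HOME}/big-data.csv")

# To run one, e.g.: subprocess.run(put_cmd, check=True)
```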
If you need assistance transferring your data to HDFS, please open a helpdesk ticket with us.