# Curriculum

#### Our courses are organized into three core sequences:

- **Computation**, which will focus on programming, data structures, computer systems, and methods.
- **Data Analysis**, which will focus on data exploration, analysis, prediction, inference, and algorithms.
- **Practice**, which aims to impart workplace skills, ethical standards, and awareness of the state of data science to date.

The Vanderbilt Master of Science in Data Science is a 4-semester, 16-course (48 credits) program, which includes the completion and presentation of a capstone project. Students will be trained in the three core sequences, gain practical experience, and sharpen workplace skills (teamwork, communication, leadership). Below is an example of course sequencing.

**Course Descriptions**

Students learn the foundations of effective software design and programming practice, how to program and evaluate a simulation, and how to apply modern resampling techniques in simulations in both R and Python. Students learn workflow solutions (e.g., Jupyter Notebooks, LaTeX, knitr, and Markdown reports) as well as version control and collaboration platforms (e.g., GitHub). Reproducible methods for programming and data processing are emphasized.

This course explores the ethos, ethics, and obligations of the modern data scientist. Modern data security and privacy vulnerabilities, as well as solutions, for individual-level data and the institutions from which the data are derived will be discussed. The history, ethics, and standards for human experimentation are reviewed. The legal landscape concerning data ownership and privacy will be surveyed.

This course will focus on an in-depth exploration of multiple (e.g., 6-8) case studies of modern data science applications. For half of the case studies, students in small groups will attempt their own solutions by re-analyzing data and comparing their findings with each other's and with the original solution. The other case studies will focus on high-profile data science solutions, in academia and business, of which two yielded generally positive outcomes (reproducible, impactful) and two did not (not reproducible, unethical). The processes of obtaining and processing the data, performing the analysis, and communicating the results will be discussed.

Students will work in small groups on real-world problems, apply their skills in a supervised environment where active learning is reinforced, and learn to make practical decisions. Students will learn how to process data and generate analysis reports and data summaries. Data will come from many sources, e.g., Kaggle competitions, VU/VUMC labs, and documented online data sets. Students will experience a goal-oriented teamwork environment and learn how to participate in and support teams as the primary data curator and data analyst.

Second-year students will lead the small groups formed in the *Data Science in Teams I* class. Here, second-year students take on the responsibility of team leader and presenter; they will learn supervisory skills by managing team members and taking responsibility for the final project. This course also allows student cohorts to interact in a formal setting.

This course covers database management systems, including relational databases, data architecture, and security. Topics include entity-relationship models and relational theory; storage and access of data; complex SQL queries; and non-relational databases, including NoSQL databases. Connections to Hadoop and MapReduce will be highlighted. Students are exposed to database architectures as time allows.

An applied and practical combination of discrete structures and computational algorithms that are relevant for data science applications and infrastructure. Topics include natural language processing, graph and network models, (stochastic) gradient descent, block coordinate descent, and (quasi-)Newton methods, along with an overview of more traditional topics such as sorting and searching, hashing, queues, trees, string processing, advanced data structures, recurrence relations, shortest paths, matching, and dynamic programming. The course will also cover streaming algorithms for computational statistics (e.g., Markov chain Monte Carlo and simulated annealing) and the stability of numerical algorithms.

This course will address key challenges that arise when working with big data and parallel processing. Practical techniques for storing, retrieving, and scaling are discussed. Topics include high-performance computing, parallel processing, commercial cloud architectures, and mapping of data science algorithms onto scalable computing platforms.

This course will teach students how to explore, summarize, and graph data (big and small). Topics include principles of perception, how to display data, scatterplots, histograms, boxplots, bar charts, dynamite plots, proper data summaries, dimensionality reduction, multidimensional scaling, and unsupervised learning algorithms, such as principal component analysis, k-means clustering, and nearest-neighbor algorithms.

This course covers the fundamentals of probability theory and statistical inference. Topics in probability include random variables, distributions, expectations, moments, Jensen’s inequality, the law of large numbers, and the central limit theorem. Topics in inference include maximum likelihood; point estimation (Bayesian, frequentist, and likelihood versions); hypothesis and significance testing; and resampling techniques. Complex mathematical proofs will be illustrated with computational solutions.

This is the first course in a sequence exploring statistical modeling and machine learning techniques. Both courses emphasize unifying and advanced concepts, such as prediction and calibration, classification and discrimination, optimism and cross-validation, resampling methods for model assessment, the evaluation of modeling assumptions, and the bias-variance trade-off. This first course focuses on regression, generalized linear models, regularized regression, support-vector machines and kernel methods, and simple neural networks.

This is the second course in a sequence exploring statistical modeling and machine learning techniques. Both courses emphasize unifying and high-level concepts, such as prediction and calibration, classification and discrimination, optimism and cross-validation, resampling methods for model assessment, the evaluation of modeling assumptions, and the bias-variance trade-off. This second course covers nonparametric regression, neural networks (convolutional and recurrent), deep learning, reinforcement learning, long short-term memory models, hidden Markov models, and Bayesian networks.

This course explores recent research on the analysis of social networks and on the models and algorithms used to abstract their properties and make predictions. Key topics covered in this course are graph models; network centrality measurements; computational methods for link prediction, clustering and classification on graphs, and network diffusion; and deep learning on graphs, including network embedding and graph neural network models and their applications. Prerequisites: DS 5440 and 5660. [3]

A structured environment in which students develop their capstone projects; get feedback from students, faculty, and industry mentors; learn how to construct a poster presentation; and practice oral presentations. Students will also learn how to set a timeline and work toward completion in a supervised environment.