CX4240: Introduction to Computational Data Analysis (2019 Summer)

Course Information

Instructor:
Mahdi Roozbahani
Head TA:
Wendi Ren (wren44@gatech.edu)
TA:
Aradhya Biswas (abiswas39@gatech.edu)
TA:
David Kartchner (david.kartchner@gatech.edu)

Course Overview

This course introduces techniques for computational data analysis, with an emphasis on machine learning algorithms and their applications to real-world data. We will investigate the following question: how can useful knowledge be extracted from data computationally for decision making and task support? We will focus on machine learning methods for computational data analysis, which are organized into three parts:

  1. Basic math for data science and machine learning

    • Linear algebra
    • Probability and statistics
    • Information theory
  2. Unsupervised machine learning for data exploration

    • Clustering analysis
    • Dimension reduction
    • Kernel density estimation
  3. Supervised learning for predictive data analysis

    • Tree-based models
    • Linear classification and regression
    • Neural networks

Prerequisites for this course include 1) basic knowledge of probability, statistics, and linear algebra; 2) basic programming experience in Python.

Schedule

GitHub Pages; YAML Configuration; NumPy Tutorial; Matplotlib Tutorial
Date Topic Assignment Due Readings
May 14, 2019 Course Overview; Math Basics: Linear Algebra Piazza Signup GT Honor Code
May 16, 2019 Math Basics: Linear Algebra Math Basics: Probability and Statistics Linear Algebra Review by Zico Kolter
May 21, 2019 Math Basics: Probability and Statistics Math Basics: Information Theory AS1 Out Probability Theory Review by Andrew Moore
May 23, 2019 Math Basics: Information Theory; Data Analysis Toolbox; Project Presentations Spring 2019; Project Presentations Spring 2019; Visual Information Theory by Chris Olah
May 28, 2019 Clustering Analysis and K-Means; Hierarchical Clustering Jupyter Notebook (K-Means and DBSCAN);
May 30, 2019 Density-Based Clustering Start working on Project Proposal and team members AS1 Due The Heilmeier Catechism; Project Examples; seaborn: statistical data visualization; Project hints
June 4, 2019 Gaussian Mixture Model AS2 Out
June 6, 2019 Evaluation of Clustering Algorithms
June 11, 2019 Density Estimation Proposal Due Python Notebook Example
June 13, 2019 Dimension Reduction Feature extraction using PCA; PCA for images; PCA as linear combination of features; PCA and Linear Discriminant Analysis;
June 18, 2019 Linear Regression AS2 Due Simple Linear Regression in Matrix Format
June 20, 2019 Regularization and Linear Regression AS3 Out
June 25, 2019 Naïve Bayes and Logistic Regression Evaluating Machine Learning Methods
June 27, 2019 Decision Tree and Random Forest Ensemble Learning and Random Forest
July 2, 2019 Support Vector Machine AS3 Due KKT and SVM
July 4, 2019 No Class (Independence Day)
July 9, 2019 Kernel Method \ SVM AS4 Out
July 11, 2019 Neural Networks and Deep learning (Forward pass and Back propagation); Class notes Back propagation numerical example More detailed introduction
July 16, 2019 Project Presentation All projects (GitHub links) should be submitted by Tuesday before 12 pm, July 16 Project Scoring Guidance
July 18, 2019 Project Presentation
July 23, 2019 Course Review AS4 Due
July 30, 2019 Final Exam: 2:40 PM - 5:30 PM

Office Hours and Questions

  • Office Hours:

    • Instructor: Thursdays 1:30-2:20pm (Klaus 1323)
    • David Office Hour: Mondays 02:00-03:00pm
    • Wendi Office Hour: Tuesdays 10:30-11:20am
    • Aradhya Office Hour: Wednesdays 10:30-11:20am
  • Piazza will be the main place for course discussions and announcements. If you have questions, please ask them on Piazza first because 1) other students may have the same question; 2) you will get help faster than by sending emails.

  • If there is something you would rather not discuss publicly on Piazza, you can use private messaging in Piazza.

Grading

  • Assignments (50%)

    • There will be four assignments. Each one is designed to test your understanding of the algorithms taught in class. Assignments will include both programming and written analysis.
    • You will need to submit all your assignments as ipynb files. In ipynb, you can use the markdown text editor. Here is a quick guideline on how to use markdown in ipynb.
    • All assignments follow the “no-late” policy. Assignments received after the due time will receive zero credit.
    • All students are expected to follow the Georgia Tech Academic Honor Code.
  • Project Proposal (5%)

    • A project proposal should be just one page
    • A project proposal should include:
      • Introduction/Background
      • Methods
      • Potential results
      • Discussion
      • At least three referenced papers
    • The proposal is a checkpoint to make sure you are working on a proper machine learning related project.
  • Project (20%)

    • You are expected to complete a project on computational data analysis with real-life data. Your project needs to be clear about 1) the data you are using; 2) the problem you are attempting to solve; 3) the method you are using; 4) the results and conclusion you attain.
    • You will need to turn in a GitHub page for your project, covering both the project report and the presentation. The two can be combined into one deliverable on the GitHub page. For the project presentation, you just need to scroll down on your GitHub page.
    • Each project needs to be completed in a team of 2-4 people. Team members need to clearly claim their contributions in the project report.
    • Each presentation cannot exceed N/A minutes. If your presentation takes more than N/A minutes, you will be asked to stop at the N/A minute mark. There will be N/A minutes for Q&A.
    • In addition to the TAs, three or more guest professors and PhD students will grade your presentations.
    • Refer to Project hints for your project's template, instructions for creating a GitHub page, and some general hints to improve the accuracy of your predictive model.
  • Class participation (5%)

    • Your class participation score will be graded based on attendance and possibly in-class quizzes.
    • Participation in class discussions (including asking relevant questions in class, volunteering to answer questions on Piazza) will be considered when determining your final grade. It will be especially useful when you are right on the edge of two letter grades.
  • Final Exam (20%)

    • The final exam will be at the assigned date/time for this class.
    • The final exam will be a written, open-book exam. No electronic material can be used except a calculator. Only paper material can be used in the exam (books, printed notes, etc.). It would be better if you prepare a one- or two-page cheat sheet for yourself (let's save some trees).
    • Again, there will be no make-up exams. You will get zero credit for your missed final exam.

Resources

Recommended books:

Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.