Mahdi Roozbahani

CS 4641/7641: Machine Learning (Fall 2019)

Course Information

Lecture time: Tuesdays and Thursdays, 9:30am-10:45am
Location: College of Business 100
Piazza: https://piazza.com/class/jzboy4ebm2c6my

Head TA:
Wendi Ren (wren44@gatech.edu)

TA:
Sashank Gondala (sgondala@gatech.edu)

TA:
Albert Liu (fangzhou.liu@gatech.edu)

TA:
Nitish R. Sontakke (nitishsontakke@gatech.edu)

Course Overview

This course introduces techniques in machine learning with an emphasis on algorithms and their applications to real-world data. We will investigate the following question: how can we computationally extract useful knowledge from data for decision making and task support? We will focus on machine learning methods, which are organized into three parts:

Basic math for data science and machine learning
- Linear algebra
- Probability and statistics
- Information theory
Unsupervised machine learning for data exploration
- Clustering analysis
- Dimension reduction
- Kernel density estimation
Supervised learning for predictive data analysis
- Tree-based models
- Linear classification and regression
- Neural networks

Prerequisites for this course include 1) basic knowledge of probability, statistics, and linear algebra; 2) basic programming experience in Python, especially with Jupyter Notebook.

Schedule

Date	Topic	Assignment	Due	Readings
August 20, 2019	Course Overview;	Piazza Signup		GT Honor Code
August 22, 2019	Math Basics: Linear Algebra			Linear Algebra Review by Zico Kolter
August 27, 2019	Math Basics: Probability and Statistics			Probability Theory Review by Andrew Moore
August 29, 2019	Math Basics: Information Theory;	AS1 Out		Visual Information Theory by Chris Olah
September 03, 2019	Data Analysis Toolbox - Part 1;			KKT for inequality constrained optimization; Project Presentations Summer 2019 - Part 1; Project Presentations Summer 2019 - Part 2;
September 05, 2019	Data Analysis Toolbox - Part 2; Project Information;			GitHub Pages; YAML Configuration; NumPy Tutorial; Matplotlib Tutorial; The Heilmeier Catechism; Project Examples; seaborn: statistical data visualization;
September 10, 2019	Clustering Analysis and K-Means;			Curse of dimensionality (Euclidean space example); Jupyter Notebook (Kmeans and DBSCAN);
September 12, 2019	Hierarchical Clustering		AS1 Due	Understanding the concept of Hierarchical clustering Technique;
September 17, 2019	Density-Based Clustering	Start working on Project Proposal		GitHub Student Application; Jupyter Notebook (Kmeans and DBSCAN); Overleaf for GT students, no kidding.
September 19, 2019	Gaussian Mixture Model	AS2 Out
September 24, 2019	Gaussian Mixture Model; Evaluation of Clustering Algorithms
September 26, 2019	Evaluation of Clustering Algorithms
October 01, 2019	Density Estimation		Proposal Due	KDE interactive visualization; KDE sampling; KDE SKLearn and sampling; Jupyter Notebook Kernel Density Example;
October 03, 2019	Dimension Reduction			Image reconstruction using PCA; Feature extraction using PCA; PCA for images; PCA as linear combination of features; PCA and Linear Discriminant Analysis;
October 08, 2019	Midterm Review		AS2 Due	Blue exam book: just $0.70 at Barnes & Noble GT;
October 10, 2019	Midterm Exam
October 15, 2019	No Class Fall Recess
October 17, 2019	Linear Regression	AS3 Out		Simple Linear Regression in Matrix Format; Adding Noise to Regression Predictors
October 22, 2019	Regularization and Linear Regression
October 24, 2019	Regularization and Linear Regression; Naïve Bayes and Logistic Regression
October 29, 2019	Naïve Bayes and Logistic Regression;
October 31, 2019	Decision Tree and Random Forest; Ensemble Learning and Random Forest;			Evaluating Machine Learning Methods
November 05, 2019	Support Vector Machine;			KKT and SVM
November 07, 2019	Kernel Method and SVM		AS3 Due
November 12, 2019	Neural Networks and Deep learning (Forward pass and Back propagation); Class notes	AS4 Out		NN Playground; The role of a hidden layer; Back propagation numerical example; More detailed introduction
November 14, 2019	Neural Networks and Deep learning;			CNN Live Demo; A guide to an efficient way to build CNN and optimize its hyper-parameters; Back Propagation in CNN; Transfer learning in CNN
November 19, 2019	Project Presentation		Updated: All projects (GitHub links) should be submitted on November 17 (Sunday), by 11:59 pm.	Project Scoring Guidance
November 21, 2019	Project Presentation
November 26, 2019	Project Presentation
November 28, 2019	No Class Thanksgiving Break
December 03, 2019	Course Review		AS4 Due
December 04, 2019	No Class Reading Period
December 09, 2019	Final Exam: 11:20 AM - 2:10 PM

Office Hours and Questions

Office Hours:
- Instructor: Thursdays 10:45-11:45am in the Business School lobby (access to my office is extremely hard)
- Wendi and Ruijia: Tuesdays 1:30-2:30pm
- Sashank and Nitish: Wednesdays 11:30am-12:30pm
- David and Roy: Wednesdays 3:00-4:00pm
- Yue and Albert: Thursdays 11:00am-12:00pm
- Jiahao and Xianda: Fridays 10:30-11:30am
- Tongtong and Nan: Fridays 1:30-2:30pm
- TA Office Hours location: Klaus building lobby on the first floor (next to room 1325)
Piazza will be the main place for course discussions and announcements. If you have questions, please ask them on Piazza first because 1) other students may have the same question; 2) you will get help much faster.
If it’s something you would rather not discuss publicly on Piazza, you can use Piazza's private messaging.

Grading

Assignments (50%)
- There will be four assignments. Each one is designed to test your understanding of the algorithms taught in class. Assignments will include both programming and written analysis.
- You will need to submit all your assignments as ipynb files. In an ipynb file, you can write text in markdown cells. Here is a quick guideline on how to use markdown in ipynb.
- All assignments follow the “no-late” policy. Assignments received after the due time will receive zero credit.
- Assignments include some bonus questions for undergrad students. Grad students are required to answer these questions, and for them the questions do not count as bonus points.
- All students are expected to follow the Georgia Tech Academic Honor Code.
Project Proposal (5%)
- A project proposal should be a one-page PDF (fewer than 500 words, single-spaced).
- A project proposal should include:
  - Introduction/Background
  - Methods
  - Potential results
  - Discussion
  - At least three references (preferably peer reviewed)
- The proposal serves as a checkpoint to make sure you are working on a proper machine-learning project.
Project (20%)
- You are expected to complete a project on machine learning with real-life data. Your project needs to be clear about 1) the data you are using; 2) the problem you are attempting to solve; 3) the method you are using; 4) the results and conclusions you reach.
- You will need to turn in a GitHub page for your project. The project presentation and report must be combined into one deliverable using a GitHub page. For the project presentation, you just need to scroll down your GitHub page as you present (make sure your images and graphs are clearly visible).
- Each project needs to be completed in a team of 5 people. You will form your team on your own; if you can't find a team, we will randomly assign you to one. Team members need to clearly state their contributions in the project report.
- Presentations cannot exceed 6 minutes. If your presentation runs longer, you will be asked to stop at the 6-minute mark. There will be 2 minutes for Q&A.
- There will be 6 judges who will grade your presentations.
- Refer to Project hints for your project's template on the GitHub page, and also some general hints to improve the accuracy of your predictive model.
- Refer to Project Scoring Guidance for what we expect from your presentation.
- If you are on a team of grad students, you are required to use both unsupervised and supervised learning in your project.
Class participation (5%)
- Your class participation score will be based on attendance and possibly in-class quizzes. For some lectures, I will bring my roll call sheet. I will announce beforehand in class, not on Piazza, that I am going to bring the roll call sheet to the next lecture.
- Participation in class discussions (including asking relevant questions in class, volunteering to answer questions on Piazza) will be considered when determining your final grade. It will be especially useful when you are right on the edge of two letter grades.
Midterm Exam (10%)
- The midterm exam will cover only the math basics (including probability) and unsupervised learning parts.
- The midterm exam will take place during the regular class time slot.
- The midterm exam will be a written, open-book exam. No electronic material can be used except a calculator. Only paper material can be used in the exam (books, printed notes, etc.). It would be better to prepare a one- or two-page cheatsheet for yourself.
- There will be no make-up exams. You will get zero credit for your missed midterm exam.
Final Exam (10%)
- The final exam will only cover the supervised learning part.
- The final exam will be held at the date and time assigned for this class.
- The final exam will be a written, open-book exam. No electronic material can be used except a calculator. Only paper material can be used in the exam (books, printed notes, etc.). It would be better to prepare a one- or two-page cheatsheet for yourself (let's save some trees).
- Again, there will be no make-up exams. You will get zero credit for your missed final exam.
Bonus points
- About bonus points: Bonus points will always be counted in the way that benefits your final grade. Specifically, if for some reason I need to curve the grades, bonus points will be applied to your grade after curving, not before.
- Undergrads and grads: Piazza provides statistics that measure how involved each student has been in Piazza activities, such as viewing posts, answering questions, and asking questions. Not only do we use these statistics for a minor part of the Class Participation score (5% in total), we will also use them to award bonus points. Bonus points will go to students who correctly answer other students' questions. At the end of the semester, we will define a minimum and maximum level of involvement across all students, and based on those, some students will receive at most 3% in bonus points. It is possible to receive less than 3% based on your Piazza activity.
- Undergrads: As you all know, the homework assignments have bonus points. The number of bonus points differs between homeworks. For example, HW 1 may have 30 bonus points, HW 2 may have 20 bonus points, and so on. If you receive all the bonus points on all your homeworks, we will add 5% to your final grade.
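As a rough illustration of how the weights and bonus points above combine, here is a minimal Python sketch. It is not an official grade calculator: the function name `final_grade`, the `curve` argument, and the sample scores are all made up for illustration, and the syllabus does not say how (or whether) grades will actually be curved. The only things taken from the text above are the component weights, the caps of 3% (Piazza) and 5% (undergrad homework bonus), and the rule that bonus points are added after curving.

```python
# Component weights from the Grading section (they sum to 100%).
WEIGHTS = {
    "assignments": 0.50,
    "proposal": 0.05,
    "project": 0.20,
    "participation": 0.05,
    "midterm": 0.10,
    "final": 0.10,
}


def final_grade(scores, curve=lambda g: g, piazza_bonus=0.0, hw_bonus=0.0):
    """Combine per-component scores (each in [0, 100]) into a final percentage.

    `curve` is a hypothetical placeholder (identity by default); the real
    curving rule, if any, is not specified in the syllabus.
    `piazza_bonus` is capped at 3 and `hw_bonus` (undergrads only) at 5.
    """
    if not 0.0 <= piazza_bonus <= 3.0:
        raise ValueError("Piazza bonus must be between 0 and 3 percentage points")
    if not 0.0 <= hw_bonus <= 5.0:
        raise ValueError("Homework bonus must be between 0 and 5 percentage points")

    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Bonus points are applied after curving, never before.
    return curve(weighted) + piazza_bonus + hw_bonus


# Made-up example scores for one undergrad student.
example = {
    "assignments": 90,
    "proposal": 100,
    "project": 85,
    "participation": 100,
    "midterm": 80,
    "final": 75,
}
print(round(final_grade(example, piazza_bonus=2.0, hw_bonus=5.0), 2))
```

Because the bonus is added after the curve, a student's bonus is never diluted or rescaled by whatever curving rule ends up being used.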

Dataset Ideas (may need API, or scraping) Thanks to Polo

Google Dataset Search
Google public datasets. Thanks Revant!
Kaggle public datasets
Awesome Public Datasets. Thanks Marcel Gwerder!
NYC Taxi data for 2013 (suggested by Chris Wong). 2013 Trip Data (11.0GB). 2013 Fare Data (7.7GB). Visualization for a day's trip. Thanks Jitesh.
Large datasets publicly available. Thanks Gopi!
Georgia Tech's campus data (has APIs): bus info, directory, building, T-square, room reservation, building facilities usage (e.g., electricity, lights, A/C, etc.), Oscar/course info/registration, etc.
Yahoo WebScope
Data.gov: U.S. Government's open data
IPEDS data: Postsecondary education data from the National Center for Education Statistics
Bureau of Labor Statistics data
Uber data: Anonymized data from over 2 billion trips
Freebase
Yelp
Microsoft Academic Graph
Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
Zillow: real estate listing site
Numerous graph datasets (large and small): SNAP, Konect
Movies data: IMDB
List of lists of datasets for recommendations.
Thanks Jon!
Million song dataset by Echo Nest.
It contains not only basic information about songs (artist, genre, year, length, etc.), but also some musical features (like tempo, pitch, key, brightness).
Thanks Minwei!
Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
Thanks Ding!
The Free 'Big Data' Sources Everyone Should Know
Quandl - a dataset search engine for time-series data.
Thanks Henry!
UCI also has a collection of links to various datasets sorted for various tasks (Classification, Regression, etc)
Thanks Vinodh!
Amazon AWS Public Data Sets (Thanks Jonathan!)
KDD Cup: annual competition in data mining, like Kaggle
Academic domain: Microsoft Academic Search, DBLP
Retrosheet: MLB statistics (Game/Play logs)
Classification datasets
Thanks Amish!
Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc).
Thanks Ryan!
Social trends (Thanks Jonathan!)
Beer data (Thanks Jonathan!). Website offline :( . Older version at web.archive.org
Academic torrents (terabytes) (Thanks Vaibhav!)
Article Search API from the New York Times (all the way back to 1851!) (Thanks Guido!)
Civil Engineering Dataset (Thanks Dr. Frost)
(Kayak: flight, hotel, car, etc.)
Data Science Initiative - Microsoft Research has various datasets and access to tools that can aid in data science research

Resources

Recommended books:

Learning from data, by Yaser S. Abu-Mostafa
Pattern recognition and machine learning, by Christopher Bishop
Machine learning, by Tom Mitchell
Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville

Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.