The primary objective of this course is to introduce you to broad classes of techniques and tools for analyzing text data using Natural Language Processing (NLP) algorithms. It emphasizes how to apply pre-processing, processing, and post-processing NLP techniques to analyze and develop NLP models.

Course Goals

  • Analyze NLP techniques and apply them to text data. The course is divided into three main categories: 
    • Pre-processing: Demonstrate how to clean and integrate text data
    • Processing: Apply NLP algorithms on your pre-processed data to perform different tasks
    • Post-processing: Evaluate your developed NLP models. 
  • Solve problems with real datasets 
  • Apply practical know-how (useful for jobs, research) through significant hands-on programming assignments 

Course Pre- and/or Co-Requisites

Review our "warnings" before taking this course.
You are expected to have some knowledge of Machine Learning, Linear Algebra, Optimization, Probability, and Statistics. We will go over some of the Machine Learning algorithms, but we may not be able to go through them in detail. The programming language for this class is Python (Python 3.^). At a minimum, you should know how to use NumPy and its matrix operations, as well as linear algebra, probability, and statistics.
Additional Prerequisites:
  • CSE 6040
  • CS 1301

Class Text

  • Required Readings: No required readings
  • Recommended Reading: Introduction to Natural Language Processing by Jacob Eisenstein

Announcements and Discussion

The fastest way to get help with homework assignments and quizzes is to post your questions on Ed. That way, not only can our TAs and instructor help, but your peers can too.

If you prefer that your question be addressed only to our TAs and the instructor, you can use the private post feature (i.e., check the "Individual Students(s) / Instructors(s)" radio box).

We welcome everyone to share their experiences in tackling issues and to help each other out, but please do not post your answers, as that may affect the learning experience of your fellow classmates.

For special cases such as failed submissions due to system errors, missing grades, failed file uploads, emergencies that prevent you from submitting, or personal issues, you can contact the staff using a private Ed post, but please make sure to contact us before the deliverable deadline.

Course Staff & Office Hours

TAs plan to hold office hours starting week 2, except on Georgia Tech holidays (e.g., Thanksgiving, MLK Day, spring break). Each office hour session will be run by at least one TA and is 1 hour long. See GT’s academic calendar for the full list of holidays (https://registrar.gatech.edu/calendar). We will spread the office hours across weekdays and across times of day. We will announce the office hour times.

We will hold office hours via Ed Chat Channel, where the TA running the office hour will be responsive. We will share information about how to join the appropriate Ed Chat Channel.

Please note that you are always welcome to ask questions on Ed. Office hours supplement Ed, and do not replace it.

Course Schedule

All dates used in this course have a time of 23:59 Anywhere on Earth (11:59 pm AoE). For example, a due date of "June 02" is the same as "June 02, 23:59 AoE". Convert the times to your local times using a Time Zone Converter.
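As a rough illustration of this conversion, the sketch below treats AoE as a fixed UTC-12 offset and converts a "June 02" due date to UTC and to US Eastern Daylight Time. The function name `aoe_deadline`, the year 2023, and the fixed UTC-4 offset for Eastern time are assumptions made for this example only; for your own location, a time zone converter or Python's `zoneinfo` module is the safer choice.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) corresponds to the UTC-12 offset.
AOE = timezone(timedelta(hours=-12), "AoE")

# Assumption for illustration: US Eastern Daylight Time as a fixed UTC-4 offset.
EDT = timezone(timedelta(hours=-4), "EDT")


def aoe_deadline(year: int, month: int, day: int) -> datetime:
    """Return the 23:59 AoE deadline for a due date as a timezone-aware datetime."""
    return datetime(year, month, day, 23, 59, tzinfo=AOE)


# Hypothetical example: a due date of "June 02" (the year 2023 is assumed).
deadline = aoe_deadline(2023, 6, 2)
deadline_utc = deadline.astimezone(timezone.utc)
deadline_edt = deadline.astimezone(EDT)

print(deadline_utc.strftime("%a %b %d, %H:%M %Z"))
print(deadline_edt.strftime("%a %b %d, %H:%M %Z"))
```

Under these assumptions, a Friday 23:59 AoE deadline falls on Saturday morning in US Eastern time.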


| Wk | Dates | Topics | Homework (HW) | Quizzes |
|----|-------|--------|---------------|---------|
| 1 | May 15-19 | Course introduction and Text Representation; Text data preprocessing: Normalization, lemmatization, stemming, stop words removal...; One hot encoding; BoW (frequency counting); TF-IDF | HW1 out: Fri, May 19 | Quiz 0 [Knowledge-base], out: Mon, May 15, due: Fri, May 19 |
| 2 | May 22-26 | Classification Introduction; Naive Bayes; Classification Model Evaluation: accuracy, precision, recall, confusion matrix; Logistic Regression; SVM; Perceptron | | Quiz 1 [week 1], out: Fri, May 19, due: Tue, May 23 |
| 3 | May 29-Jun 02 | Memorial Day (Official School Holiday); SVD (Dimensionality Reduction) + Co-occurrence embeddings; GloVe | HW1 due: Fri, Jun 02 (Sat, 07:59 ET); HW2 out: Fri, Jun 02 | Quiz 2 [week 2], out: Fri, May 26, due: Tue, May 30 |
| 4 | Jun 05-09 | Neural Network (fully connected); Word2vec: CBoW, Skip-Gram | | Quiz 3 [week 3], out: Fri, Jun 02, due: Tue, Jun 06 |
| 5 | Jun 12-16 | Focus on HW2 | HW2 due: Fri, Jun 16; HW3 out: Fri, Jun 16 | Quiz 4 [week 4], out: Fri, Jun 09, due: Tue, Jun 13 |
| 6 | Jun 19-23 | Juneteenth (Official School Holiday); CNN (use the chart, provide some explainability); RNN (quick overview as an intro to LSTM) | | Quiz 5 [week 2], out: Fri, Jun 16, due: Tue, Jun 20 |
| 7 | Jun 26-30 | LSTM and GRU; LSTM + Attention (Focus on Attention mechanism) | | Quiz 6 [week 6], out: Fri, Jun 23, due: Tue, Jun 27 |
| 8 | Jul 03-07 | Independence Day (Official School Holiday); Transformer models; Examples: BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) | HW3 due: Fri, Jul 07; HW4 out: Fri, Jul 07 | Quiz 7 [week 7], out: Fri, Jun 30, due: Wed, Jul 05 |
| 9 | Jul 10-14 | Sequence Labelling: POS Tagging; Sequence Labelling: NER | | Quiz 8 [week 8], out: Fri, Jul 07, due: Tue, Jul 11 |
| 10 | Jul 17-21 | Unsupervised Models; Topic Modeling (Latent Semantic Indexing, LDA (Latent Dirichlet Allocation)) | | Quiz 9 [week 9], out: Fri, Jul 14, due: Tue, Jul 18 |
| 11 | Jul 24-25 | Focus on HW4 | HW4 due: Tue, Jul 25 | Quiz 10 [week 10], out: Fri, Jul 21, due: Tue, Jul 25 |

This course can be tough: why?

WARNING! You are expected to have some knowledge of Machine Learning, Linear Algebra, Optimization, Probability, and Statistics. We will go over some of the Machine Learning algorithms, but we may not be able to go through them in detail. The programming language for this class is Python (Python 3.^). At a minimum, you should know very well how to use NumPy and its matrix operations, as well as linear algebra, probability, and statistics.

Minimum Computer Requirements

  • 8GB RAM (16GB recommended)
  • 512GB disk (SSD recommended). Some assignments use data files that are larger than a few GB, and some use virtual machines that can easily take up tens of GB.
  • Dual-core Core i5 (8th generation or better recommended)

Accessing Course Materials Outside of US

You may need to use Georgia Tech's VPN. We also recommend checking out some solutions that seem to be working well for OMS students in different countries.

Homework (85%)

We have 4 big assignments in total (subject to change). Visit this course's Gradescope site for the assignment documents. See the schedule table above for deliverable due dates.
  • [20%] HW1: Week 1 to Week 3 topics
  • [20%] HW2: Week 4 to Week 7 topics
  • [20%] HW3: Week 8 to Week 10 topics
  • [25%] HW4: Week 11 to end.
We do not release solutions for homework.
Can you release homework early? We understand that some students may prefer that homework assignments be released as soon as possible. Behind the scenes, our course staff work diligently to develop new questions, which means testing new datasets, new instructions, new auto graders, solution code, and more! Unfortunately, this means we likely cannot release assignments well in advance. We will release them as early as possible, hopefully some days before the scheduled release dates on our course schedule. When we release an assignment, we always announce it on Ed discussion.

Quiz (15%)

  • There will be 10 quizzes throughout the semester on Canvas, and all quizzes are mandatory, including the knowledge-base quiz (Quiz 0).
  • Each quiz is worth 1.5% of your final score.
  • Because taking all quizzes is mandatory, if you fail to take a quiz, you will lose 3% of your course score instead of 1.5%.
  • The topic of each quiz will coincide with the contents covered in lectures on the specified weeks.
  • Each quiz lasts seven minutes and has five multiple-choice questions. Check the course schedule table for when quizzes will be out and due. Any changes to quiz dates will be reflected on our course schedule page. Please make sure to check our class website before taking the quiz.
  • Quizzes measure your understanding of the topics and will consist mostly of conceptual questions.
  • Quiz answers will be released as soon as all our students, including our ODS students, have taken them. Please do not ask questions on Edstem about a quiz you have just taken before we release the answers.
  • Quiz questions are selected randomly from our question bank, which means that students will not all receive the same questions.
  • We use Honorlock for all quizzes in this course to enhance assessment integrity. Quizzes will be open book, open notes with browser activity restricted to Canvas.
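To make the grading arithmetic concrete, here is a minimal sketch that combines the homework weights (20%, 20%, 20%, 25%) with the quiz weights (1.5% each). The function name `course_score` and its inputs are hypothetical, and the handling of a missed quiz is an assumption: the sketch reads "losing 3% instead of 1.5%" as earning nothing for that quiz plus a further 1.5-point deduction. Treat it as an estimate, not the official calculation.

```python
# Weights taken from the syllabus: four homework assignments and ten quizzes.
HW_WEIGHTS = [20.0, 20.0, 20.0, 25.0]  # percent of course score for HW1..HW4
QUIZ_WEIGHT = 1.5                      # percent of course score per quiz
MISSED_QUIZ_LOSS = 3.0                 # percent lost for a quiz that is not taken


def course_score(hw_fractions, quiz_fractions):
    """Estimate the course score on a 0-100 scale.

    hw_fractions: four values in [0, 1], the fraction of points earned per homework.
    quiz_fractions: one entry per quiz, a value in [0, 1], or None if not taken.
    Assumption: a missed quiz earns 0 and costs an extra 1.5 points,
    so the total loss is 3 points instead of 1.5.
    """
    score = sum(w * f for w, f in zip(HW_WEIGHTS, hw_fractions))
    for q in quiz_fractions:
        if q is None:
            score -= MISSED_QUIZ_LOSS - QUIZ_WEIGHT
        else:
            score += QUIZ_WEIGHT * q
    return score


# Hypothetical student: 90% on every homework, 80% on nine quizzes, one quiz missed.
example = course_score([0.9] * 4, [0.8] * 9 + [None])
print(round(example, 2))
```

In this hypothetical case the homework contributes 76.5 points, the nine quizzes contribute 10.8, and the missed quiz removes a further 1.5.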

General Policy

  1. Class deliverables: All class deliverables will be handled via Gradescope, except quizzes, which will be on Canvas. The time offered to complete the course objectives is plentiful, and deadlines will not be extended under any circumstances. To ensure the class is fair for all students, you will receive zero credit for work submitted after the deadline. Regrade requests should be submitted directly on Gradescope within a defined period after grade publication (we will announce the period; it is usually one week, but it may change depending on our class schedule). Should you find yourself at an impasse with the TA responsible for your grading, feel free to contact the head TA or course instructor on Edstem.
  2. Edstem:
    • Edstem will be the main and only place for course discussions and announcements. If you have questions, please ask them on Edstem first because 1) other students may have the same question; 2) you will get help much faster.
    • If it’s something you do not want to discuss publicly on Edstem, you can use private messaging on Edstem.
    • Anytime you want to send a private message to just me on Edstem, please make sure to add our HEAD TAs too in case I miss your message.
    • Edstem GOOD questions
      • I don't understand this part of the lecture, can you explain it to me?
      • This certain part of the hw is not clear to me, would it be possible to explain that more?
      • I have a question about an algorithm...
      • I found an issue on the website, hw or the lectures, can you clarify ...
      • Any feedback, suggestions, ... would be greatly appreciated.
      • Most questions are good in general.
    • Edstem BAD questions
      • Can you debug my code? [our team will not do that. You need to be specific about your question]
      • Can you find where the problem is in my code?
  3. You must achieve an overall weighted average of 60% to pass the course.
  4. All deliverables will be graded by our TAs/Gradescope.
  5. When assigning course grades, we will start with the standard grade thresholds (90, 80, etc.). I may lower (and never raise) the thresholds (i.e., to your benefit). For example, I may use 88 instead of 90.
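The "lower, never raise" rule for grade thresholds can be sketched as follows. This is only an illustration: the function names are hypothetical, the mapping of 90 to "A" and 80 to "B" is an assumption, and the thresholds the syllabus leaves as "etc." are deliberately not filled in.

```python
def lower_cutoffs(standard, adjusted):
    """Apply adjusted cutoffs, but never raise one above its standard value."""
    return {
        letter: min(cutoff, adjusted.get(letter, cutoff))
        for letter, cutoff in standard.items()
    }


def letter_grade(score, cutoffs):
    """Return the letter with the highest cutoff that the score meets, or None."""
    for letter, cutoff in sorted(cutoffs.items(), key=lambda kv: kv[1], reverse=True):
        if score >= cutoff:
            return letter
    return None


# Assumption: only the two thresholds named in the syllabus; letters are illustrative.
STANDARD = {"A": 90.0, "B": 80.0}

# A proposed 88 for "A" is applied; a proposed 85 for "B" would be a raise, so it is ignored.
cutoffs = lower_cutoffs(STANDARD, {"A": 88.0, "B": 85.0})

print(letter_grade(88.5, STANDARD))  # grade under the standard thresholds
print(letter_grade(88.5, cutoffs))   # grade under the lowered thresholds
```

With these example numbers, a score of 88.5 moves from the second band to the first once the top threshold is lowered to 88.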

Plagiarism, Collaboration Policy, and Student Honor Code

  • All course participants (myself, teaching assistants, and learners) are expected to know and abide by the Georgia Tech Academic Honor Code.
  • Ethical behavior is extremely important in all facets of life.
  1. Plagiarism is a serious offense. You are responsible for completing your own work. You are not allowed to copy and paste, or paraphrase, or submit materials created or published by others, as if you created the materials. All materials submitted must be your own.
  2. You may discuss high-level ideas with other students at the "whiteboard" level (e.g., how cross validation works, use hashmap instead of array) and review any relevant materials online. However, each student must write up and submit his or her own answers.
  3. You must not post your code publicly (e.g., in a public GitHub repository), because a (future) student could copy your code. That student would obviously be violating the honor code, and you may also be implicated.
  4. All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures (e.g., reported to and directly handled by the Office of Student Integrity (OSI)). Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.

Late Policy and Due Dates

  • All homework and quiz deliverables are due at the times shown in the Course Schedule. The course offers NO LATE POLICY. Convert the times to your local times using a Time Zone Converter.
  • Every homework deliverable and every quiz deliverable comes with NO LATE POLICY.
  • Any deliverable submitted after the deadline defined in the class schedule will get 0 credit. We recommend that you submit your work at least a day before the deliverable deadline.
  • We will not consider late submission of any missing parts of a deliverable. To make sure you have submitted everything, download your submitted files to double check. If you're submitting large files, you are responsible for making sure they get uploaded to the system in time.
  • There are no penalties for medical reasons or emergencies. Should they arise, you MUST contact the Dean of Students office BEFORE contacting us. Doctor's notes, medical documentation, explanation of emergencies, etc. should be submitted to the Dean’s office. After their office receives the information, they will notify us on your behalf.
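One way to double check a downloaded submission is to compare its checksum with that of your local copy. The sketch below is a suggestion rather than a course requirement: the helper names are hypothetical, and it assumes the submission system returns your file byte for byte (it will report a mismatch if the platform repackages or renames file contents).

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading in chunks to handle large files."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(local_path, downloaded_path):
    """Return True if the two files have identical contents."""
    return sha256_of(local_path) == sha256_of(downloaded_path)
```

For example, `same_file_contents("hw1.zip", "downloads/hw1.zip")` (hypothetical file names) returns `True` only when the uploaded file matches the one on your machine.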

Timing Policy

  • The course videos follow a logical sequence that includes knowledge-building (quizzes) and experience-building (assignments).
  • Assignments should be completed by their due dates, in order for timely peer assessment. Peer assessments should also be completed by their due dates, to give timely feedback.
  • You will have access to the course content for the scheduled duration of the course.

Attendance Policy

  • This is a fully online course.
  • Log in on a regular basis to complete your work, so that you do not have to spend a lot of time reviewing and refreshing the content.

Netiquette

  • Netiquette refers to etiquette that is used when communicating on the Internet. Review the Ground Rules for Online Discussions. When you are communicating via email, discussion forums or synchronously (real-time), please use correct spelling, punctuation and grammar consistent with the academic environment and scholarship.
  • We expect all participants (learners, faculty, teaching assistants, staff) to interact respectfully. Learners who do not adhere to this guideline may be removed from the course.

Resources

Diversity and inclusion

Just as machine learning algorithms cannot accomplish complex tasks if trained on datasets of limited variability, our course cannot be successful without appreciating the diversity of our students. In this class we aim to create an environment where all voices are valued, respecting the diversity of gender, sexuality, age, socioeconomic status, ability, ethnicity, race, and culture. We always welcome suggestions that can help us achieve this goal. Additionally, if any of our class scheduled activities conflicts with religious events, please inform the instruction team so that we can make appropriate arrangements for you.

Students with disabilities: your access to this course is extremely important to us. The institute has policies regarding disability accommodation, which are administered through the Office of Disability Services: http://disabilityservices.gatech.edu. Please request your accommodation letter as early in the semester as possible, so that we have adequate time to arrange your approved academic accommodations. If you need a classroom accommodation, please make an appointment with the ADAPTS office (see http://www.adapts.gatech.edu).

Support Services

Academic support and personal support: Office of the Dean of Students, Counseling Center, Health Services, Women's Resource Center, LGBTQIA Resource Center, Veteran's Resource Center, Georgia Tech Police.

Recommended Reading

All content and course materials can be accessed online. There is no textbook for this course.

All Georgia Tech students have FREE access to https://www.oreilly.com, where you can find a huge number of highly rated and classic books (e.g., the "animal" books) from O'Reilly and Pearson covering a wide variety of computer science topics, including some of those listed below. Just log in with your official GT email address, e.g., jdoe3@gatech.edu.

Software engineering; become a better programmer and developer

Python

Data science, machine learning, data mining

Probability

Human Computation

How to manage multiple versions of Python packages?

To get started, we recommend the excellent article "Which Python package manager should you use?"

If you've decided to go with pyenv, I recommend "Managing Multiple Python Versions With pyenv".

If you use a Mac, we also recommend checking out "The right and wrong way to set Python 3 as default on a Mac".

Students in my research group said that Poetry seems to be fast replacing conda envs, and may even replace setuptools for PyPI packages in the future.

Course offerings and Registration

Auditing & Pass/Fail

Due to the large class size, we do not offer auditing or pass/fail options.

Cannot Register for This Course?

This course is currently offered to students in OMS Analytics, and it is taught as one class. Due to the large combined class sizes in the OMSA program, we are not able to increase the enrollment capacity. During the first week of class, there is usually a lot of movement in the course registration waitlist. Unfortunately, we do not have an estimate of the extent of the movement. We hope you will be able to enroll, and we will see you in class!

Acknowledgment & Related Classes

Many thanks to my colleagues for sharing some of their course materials.