working on sp2025 cs598.md
pyongjoo committed Jan 14, 2025
1 parent 13901fe commit 24a416d
Showing 1 changed file with 353 additions and 0 deletions.
353 changes: 353 additions & 0 deletions teaching/sp2025/cs598.md
@@ -0,0 +1,353 @@
---
layout: default
---

## CS 598: Hot Topics in Data Management

**Time:** 2:00 PM - 3:15 PM (Tue/Thu)
**Location:** 1310 Digital Computer Laboratory ([Google Map](https://maps.app.goo.gl/9G3fGhgtiVWJ5wWEA))
See this course in   [[Canvas]](https://canvas.illinois.edu/courses/54802)
  [[Course Explorer]](https://courses.illinois.edu/schedule/2025/spring/CS/598)
  [[Canvas SignUp Request]]()

**Instructor:** Yongjoo Park (https://yongjoopark.com):
<ul style="margin-top: -15px;">
<li>Email: [email protected]</li>
<li>Office Hours: After class in person at Rm 2114 Siebel for CS</li>
</ul>

<p style="margin-top: -15px;"> <b>TA 1:</b> Hanxi Fang ([email protected])</p>
<ul style="margin-top: -15px;">
<li>Office Hours: TBD</li>
</ul>

<p style="margin-top: -15px;"> <b>TA 2:</b> Talika Gupta ([email protected]) </p>
<ul style="margin-top: -15px;">
<li>Office Hours: TBD</li>
</ul>

<p style="margin-top: -15px;"> <b> Prerequisites </b> </p>
<ul style="margin-top: -15px;">
<li><b>Background:</b> CS411 (Database) or equivalent.</li>
<li><b>Programming:</b> Proficiency in at least one programming language such as C or Java. We do not provide programming practice; you should be able to write code independently.</li>
</ul>

<p style="margin-top: -15px;"> <b> Class Structure </b> </p>
<ul style="margin-top: -15px;">
<li>Overview by Instructor: 5 mins</li>
<li>Student Presentation (if any): 25 mins (including Q&A)</li>
<li>Instructor Presentation: 40 mins</li>
<li>Quiz: 5 mins</li>
</ul>

<p style="margin-top: -15px;"> <b>Your main workload as a student will be </b> </p>
<ul style="margin-top: -15px;">
<li>Understanding papers (take a quiz every class and two midterms)</li>
<li>Participating in group projects (3 MPs and 1 research)</li>
<li>(Optional) presenting a paper for up to 5% extra credit</li>
</ul>

{% tabs log %}


### Second Tab: Schedule

{% tab log schedule %}

[[Slides in Google Drive]](https://drive.google.com/drive/folders/1CvzTFRY0rHRASpMldEfSUBFKJ7Gu-thp?usp=sharing) &nbsp; &nbsp;
*Accessible to all Illinois people; Permission denied? See [this page](https://help.uillinois.edu/TDClient/42/UIUC/Requests/ServiceDet?ID=135).*


### Topic I: Analytical Database

#### Week 1: Distributed Processing I

8/27: Course Overview / [ComesAround by Stonebraker and Hellerstein](https://www-cs.stanford.edu/people/chrismre/cs345/rl/datamodel.pdf)
- Main question: *Why relational model?*

8/29: [MapReduce by Dean and Ghemawat](https://people.eecs.berkeley.edu/~bh/61a-pages/Volume2/mapreduce-osdi04.pdf)
- Main question: *How can we scale analytic operations?*
- *(read more: [ParallelDB by DeWitt and Gray](https://lipyeow.github.io/home/teaching/ics421/2010spr/dewittgray92.pdf))*
- ***Presentation sign-up (1/2): 8/30 8 AM, Canvas announcement***


#### Week 2: Distributed Processing II

9/3: [Distributed Join by Blanas et al.](https://web.archive.org/web/20110401200257id_/http://pages.cs.wisc.edu/~jignesh/publ/hadoopjoin.pdf)
- Main question: *What join algorithms for distributed DBMSs?*
- ***Project group sign-up starts (for both MPs and Research project)***

9/5: [Google File System by Ghemawat et al.](https://courses.cs.vt.edu/cs5204/fall05-gback/papers/p29-ghemawat.pdf)
- Main question: *What file systems for distributed DBMSs?*


#### Week 3: Columnar Layout

9/10: [C-Store by Stonebraker et al.](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf)
- Main question: *Different data layouts for analytical workloads?*
- Student presentation 1 (Early-bird bonus: 20%)

9/12: [Dremel](https://15799.courses.cs.cmu.edu/fall2013/static/papers/p330-melnik.pdf)
- Main question: *How can we represent nested data?*
- Student presentation 2 (Early-bird bonus: 20%)
- ***Project 1 (MP: Spark/HDFS) Out (due in three weeks: 10/2)***

#### Week 4: Sketching

9/17: [Monkey by Dayan and Idreos](https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf)
- Main question: *Basics and optimization of Bloom filters*
- Student presentation 3 (Early-bird bonus: 15%)

9/19: [HyperLogLog](https://dmtcs.episciences.org/3545/pdf) *(Read more: [HLL++](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40671.pdf) by Google)*
- Main question: *Memory-efficient counting of unique items*
- Student presentation 4 (Early-bird bonus: 15%)
- ***Project 4 (research) Proposal Out (Due 11/7)***


#### Week 5: Sampling

9/24: [BlinkDB](https://dsf.berkeley.edu/cs286/papers/blinkdb-eurosys2007.pdf)
- Student presentation 5 (Early-bird bonus: 10%)

9/26: [Online Aggregation by Hellerstein](https://dsf.berkeley.edu/papers/sigmod97-online.pdf) *(Read more: [DeepOLA by Park](https://arxiv.org/pdf/2303.04103))*
- Main question: *New query processing paradigm*
- Student presentation 6 (Early-bird bonus: 10%)
- ***Presentation sign-up (2/2): 9/28 8 AM, Canvas announcement***


#### Week 6: Cardinality Estimation

10/1: [SystemR at IBM](https://cs.brown.edu/courses/cs295-11/2006/AccessPath.pdf) *(Read more: [QuickSel by Park](https://yongjoopark.com/resources/quicksel_preprint.pdf))*
- Main question: *Basics of cardinality/selectivity estimation*
- Student presentation 7 (Early-bird bonus: 5%)

10/3: [Naru by Yang and Stoica](https://www.vldb.org/pvldb/vol13/p279-yang.pdf)
- Main question: *Deep learning for cardinality estimation*
- Student presentation 8 (Early-bird bonus: 5%)
- ***Project 1 Due by 10/2 11:59 PM Central***
- ***Project 2 (MP: DeepOLA) Out (due in three weeks: 10/23)***
- project topic: Deep Online Aggregation


#### Week 7: Compression

10/8: [Columnar Compression by Abadi and Madden](https://cs.brown.edu/courses/cs227/archives/2008/mitchpapers/required10.pdf)
- Main question: *Data compression techniques for DBMSs*
- Student presentation 9

10/10: <span style="color: red">**Midterm 1**</span> (in-person; Scope: Week 1-7)

- Online option available on 10/12 (Saturday) at 2 PM Central; 10% point reduction.
- To take the online version, you must request it by the end of 10/3 via email to the instructor Yongjoo Park OR you must have a very good reason (e.g., unforeseeable tragic events).

### Topic II: Data Science

#### Week 8: Dataframe

10/15: [Modin by Petersohn and Parameswaran](https://arxiv.org/pdf/2001.00888)
- Main question: *Scalable library for dataframe operations*
- Student presentation 10

10/17: [Ray by Moritz and Stoica](https://www.usenix.org/system/files/osdi18-moritz.pdf)
- Main question: *Distributed framework behind Modin and ChatGPT*
- Student presentation 11


#### Week 9: Scalable Data Science

10/22: [ElasticNotebook by Li and Park](https://billyzhaohengli.github.io/files/Elastic_Notebook_Class_Presentation.pdf)
- Main question: *Can we save Jupyter notebooks?*
- Student presentation 12

10/24: [Differential Privacy](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/sigmod115-mcsherry.pdf)
- Main question: *How can we share sensitive information without breaching privacy?*
- Student presentation 13
- ***Project 2 Due by 10/23 11:59 PM Central***
- ***Project 3 (MP: Ray/Pandas) Out (due in three weeks: 11/13)***


### Topic III: Transaction and System Tuning

#### Week 10: Database Transaction

10/29: [Snapshot Isolation](https://www.vldb.org/conf/2006/p715-daudjee.pdf)
- Main question: *Snapshot isolation vs Serializability*
- Student presentation 14

10/31: [Aria by Lu and Madden](https://dspace.mit.edu/bitstream/handle/1721.1/133983/3407790.3407808.pdf?sequence=2&isAllowed=y)
- Main question: *How do deterministic transactions work?*
- Student presentation 15

#### Week 11: Learning for DBMS

11/5: [Learned Index by Kraska](https://courses.cs.washington.edu/courses/csep544/21sp/papers/kraska-case-for-learned-index-sigmod-2018.pdf)
- Main question: *Machine learning for indexing*
- *(Read more: [AirIndex by Chockchowwat and Park](https://arxiv.org/pdf/2306.14395))*
- Student presentation 16

11/7: [OtterTune by Aken and Pavlo](https://15799.courses.cs.cmu.edu/spring2022/papers/06-knobs1/p1009-van-aken.pdf)
- Main question: *ML for System Optimization*
- Student presentation 17
- ***Project 4 (research) Proposal Due***


### Topic IV: NoSQL

#### Week 12: Key-Value Store

11/12: [LSM-Tree Slides by Dayan](https://cs-people.bu.edu/mathan/classes/Comp115-Spring2017/slides/Comp115-Guest-Lecture-LSM-Trees.pdf)
- Main question: *Basics of LSM tree: the data structure behind LevelDB/RocksDB*
- Student presentation 18

11/14: [PebblesDB](https://www.cs.utexas.edu/~vijay/papers/sosp17-pebblesdb.pdf)
- Main question: *Understand write amplification in LSM Tree*
- Student presentation 19
- ***Project 3 Due by 11/13 11:59 PM Central***


#### Week 13: Stream Processing

11/19: [Kafka -> Confluent](https://pages.cs.wisc.edu/~akella/CS744/F17/838-CloudPapers/Kafka.pdf)
- Main question: *Most well-known stream processing system*
- Student presentation 20

11/21: [Naiad: Timely Dataflow -> Materialize](https://www.cs.princeton.edu/courses/archive/fall22/cos418/papers/naiad.pdf)
- Main question: *A new alternative to stream processing*
- Student presentation 21


#### Fall Break

11/26: *Fall break*

11/28: *Fall break*

***MP Extra Credit Due by Nov 30, 2024 at 11:59:59 PM Central Time***

#### Week 14: Final Presentation and Midterm 2

12/3: Project Presentation I
- ***Project 4 (report; 4 pages) Due by 12/2 11:59 PM Central***
- Up to 12 groups will present this day; 5 mins per group

12/5: Project Presentation II
- The remaining groups will present this day; 5 mins per group

12/10: <span style="color: red">**Midterm 2**</span> (In-person; Scope: Week 8-13) *(Last day of instruction)*

- Online option available on 12/12 (Thursday) at 2 PM Central; 10% point reduction.
- To take the online version, you must request it by the end of 12/3 via email to the instructor Yongjoo Park OR you must have a very good reason (e.g., unforeseeable tragic events).


{% endtab %}


### Third Tab: Evaluation

{% tab log evaluation %}

### Overview

**Regular Components**

- Evaluating Other Student Presentations: 5%
- Quizzes: 11%
- Midterms: 40% (20% + 20%)
- Group Projects: 40%
- Participation: 4%

**Extra Credits**

- Student Presentation: 2% (min) - 5% (max); evaluation based on peers
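
As a rough illustration of how the weights above combine, here is a minimal Python sketch. The names `WEIGHTS` and `course_total` are hypothetical, each component is assumed to be scored on a 0-100 scale, and extra credit is assumed to be added on top as percentage points; the actual gradebook may compute this differently.

```python
# Weights (in percentage points) taken from the "Regular Components" list.
WEIGHTS = {
    "peer_evaluation": 5,
    "quizzes": 11,
    "midterm_1": 20,
    "midterm_2": 20,
    "group_projects": 40,
    "participation": 4,
}


def course_total(component_scores: dict[str, float], extra_credit: float = 0.0) -> float:
    """Combine per-component scores (each 0-100) into a course total.

    `extra_credit` is in percentage points (assumed to be added on top).
    """
    total = sum(WEIGHTS[name] * component_scores[name] / 100 for name in WEIGHTS)
    return total + extra_credit


perfect = {name: 100 for name in WEIGHTS}
print(course_total(perfect))                    # 100.0
print(course_total(perfect, extra_credit=5.0))  # 105.0
```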


### Evaluating Other Student Presentations (5%)

There are 21 student presentation slots. For each presentation, you should provide a peer evaluation using a Likert scale between 1 and 5. To receive full credit (5%), you must submit evaluations for at least 50% of the presentations (e.g., 50 if there are 100 presentations in total). If you submit fewer, you will receive partial credit (e.g., if you submit 25 evaluations among 100 total presentations, you will receive 25/50*5% = 2.5%).
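
The partial-credit rule can be sketched in Python as follows. The function name `peer_evaluation_credit` is hypothetical, and the sketch assumes the 50% threshold is applied without rounding (for 21 presentations that means 10.5, so 11 evaluations would earn full credit); the course staff may round differently.

```python
def peer_evaluation_credit(
    submitted: int,
    total_presentations: int,
    max_credit: float = 5.0,
    required_fraction: float = 0.5,
) -> float:
    """Return the percentage points earned for submitting peer evaluations."""
    if total_presentations <= 0 or submitted < 0:
        raise ValueError("counts must be non-negative and total must be positive")
    # Number of evaluations needed for full credit (assumed not rounded).
    required = required_fraction * total_presentations
    # Credit grows linearly up to the threshold and is capped at full credit.
    return max_credit * min(submitted / required, 1.0)


print(peer_evaluation_credit(25, 100))  # 2.5
print(peer_evaluation_credit(50, 100))  # 5.0
```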

Warning: If you submit constant scores (e.g., 5/5 for every presentation), they will be treated as "no submission" and your Participation score may be penalized.

[[Presentation Evaluation Form]](https://docs.google.com/forms/d/e/1FAIpQLSf411lC0kxC10edflAjNwqnTjIdvYtsk_qaCsoOOniyFuPlKw/viewform?usp=sf_link)


### Quizzes (11%)

In the middle of each class, there will be a short quiz. You will receive the full 11% by scoring 70/100 (or 70%) in the quiz portion. Specifically, your individual quiz scores will be summed and normalized to a maximum of 100 points (this is the raw score); the final score is then calculated as:

(final quiz score; max 100) = min(raw score / 70 * 100, 100)

For regular absences, no make-up quizzes will be provided (so come to class!). For Verified Absences, we will provide make-up quizzes. Sample quizzes will be provided in the first and second classes.

"70/100" is to accommodate unexpected absences (e.g., mild cold, too busy for other courses, family travel). No additional absences will be allowed except for extreme circumstances (e.g., seriously ill, contagious disease, tragic events); in these cases, please contact the main instructor directly to discuss and/or adjust your course workload.

Quiz Evaluation: You can receive up to two points in each quiz. If you submit an answer (in class), you will always get one point (even if your answer is wrong). If your answer is correct, you will receive one additional point, for two points in total for that quiz.
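
A minimal Python sketch of the quiz scoring described in this section, combining the per-quiz points (0 for no submission, 1 for a wrong answer, 2 for a correct one) with the formula above. The function names are hypothetical, and the sketch assumes every quiz carries the same two-point maximum.

```python
def raw_quiz_score(points_per_quiz: list[int], max_points_per_quiz: int = 2) -> float:
    """Sum the per-quiz points and normalize them to a 0-100 raw score."""
    if not points_per_quiz:
        raise ValueError("at least one quiz is required")
    max_total = max_points_per_quiz * len(points_per_quiz)
    return sum(points_per_quiz) * 100 / max_total


def final_quiz_score(raw_score: float) -> float:
    """Apply min(raw score / 70 * 100, 100) from the syllabus."""
    return min(raw_score / 70 * 100, 100)


# Ten quizzes: seven correct answers (2 points each) and three absences (0 points).
points = [2, 2, 2, 2, 2, 2, 2, 0, 0, 0]
raw = raw_quiz_score(points)
print(raw)                    # 70.0
print(final_quiz_score(raw))  # 100.0
```

In this example a student who misses three of ten quizzes but answers the rest correctly still reaches the full quiz score, which is the slack the 70/100 threshold is meant to provide.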

*Quiz scores will be made available immediately.*


### Midterms (40%: 20% + 20%)

We have two midterms: the first on 10/10 and the second on 12/10. Exam questions are based on the quizzes. Online make-up exams are offered for important events (e.g., family travel) two days after each exam day with a 10% relative point reduction (thus, perfect answers will receive 90 out of 100, if the total is 100). The point reduction may be removed if the reason is something serious and unavoidable (e.g., tragic events, serious illness).

*Your midterm scores will be released within one week. This delay is to accommodate the students who may be taking online exams.*


### Projects (40%): three MPs and one research project

Four students will form a group for the project. The goal of the projects is to learn well-known (modern) systems in the area of large-scale data management. Please read the following document to learn more about what tasks will be accomplished.

[Project Document Placeholder]()

Late submissions will receive a 20% point reduction every day. For example, if you submit within the first day after the deadline, you will receive up to 80/100. After five days, you will receive 0 points. We will check both report submission times (if required) and the last timestamps of code changes (e.g., on GitHub).
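
The late policy can be sketched in Python as below. This is only an illustration: the function name is hypothetical, any started day after the deadline is assumed to count as a full day, the deadline shown is a made-up date, and time zones are ignored.

```python
import math
from datetime import datetime, timedelta


def max_points_after_penalty(
    deadline: datetime, submitted: datetime, penalty_pct_per_day: int = 20
) -> int:
    """Return the maximum achievable score (out of 100) for a submission."""
    late_seconds = (submitted - deadline).total_seconds()
    if late_seconds <= 0:
        return 100  # on time: no reduction
    # Assumption: a partially elapsed day counts as one full late day.
    days_late = math.ceil(late_seconds / 86400)
    return max(0, 100 - penalty_pct_per_day * days_late)


deadline = datetime(2025, 1, 1, 23, 59)  # hypothetical deadline
print(max_points_after_penalty(deadline, deadline + timedelta(hours=3)))  # 80
print(max_points_after_penalty(deadline, deadline + timedelta(days=5)))   # 0
```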


### Participation (4%)

Your in-class participation will be evaluated based on your participation in questions/discussions. If you ask questions or participate in discussions either in class or **online**, claim your contribution using [this Google Form](https://docs.google.com/forms/d/e/1FAIpQLSfcqUumeq4O6cu9icC0n2D8Tsut_UnIGuSAvw3uPp6Cw_Df0w/viewform?usp=sf_link) within two days of the class. You should not falsely claim your contribution (e.g., by submitting forms without actually asking questions), which will be in violation of the code.


### Extra Credit: Student Presentation (5%)

**Presentation Scores**

You will receive extra credit based on the peer evaluations by other students. To ensure most students receive high-enough scores, the following (rather complex) scheme will be used.

(your presentation score; max 100) = \sum_i^N (normalized score from student-i) / N

where (normalized score from student-i) is obtained by normalizing all the scores submitted by student-i to have mean 93 and standard deviation 5. That is, even if student-i submits relatively bad scores (say 2/5 on average), his/her negative scores do not affect your overall score since those scores will be anyway normalized to have the same mean and standard deviation. The same logic applies to another student (say student-j) who tends to submit higher scores (say 4/5 on average) since his/her scores will also be normalized to have the same mean and standard deviation.
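
The normalization can be illustrated with a short Python sketch. The function names are hypothetical, and several details are assumptions: the population standard deviation is used, evaluators who gave constant scores are skipped (their standard deviation is zero), and the result is not clamped to 100.

```python
from statistics import mean, pstdev


def normalize_evaluator(
    scores: dict[str, float], target_mean: float = 93.0, target_std: float = 5.0
) -> dict[str, float]:
    """Rescale one evaluator's scores to the target mean and standard deviation."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return {}  # constant scores cannot be normalized; skip this evaluator
    return {
        presenter: target_mean + target_std * (score - mu) / sigma
        for presenter, score in scores.items()
    }


def presentation_scores(evaluations: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average the normalized scores each presenter received."""
    collected: dict[str, list[float]] = {}
    for scores in evaluations.values():
        for presenter, normalized in normalize_evaluator(scores).items():
            collected.setdefault(presenter, []).append(normalized)
    return {presenter: mean(vals) for presenter, vals in collected.items()}


# A harsh and a generous evaluator who rank the presenters identically.
evaluations = {
    "student_i": {"A": 1, "B": 2, "C": 3},
    "student_j": {"A": 3, "B": 4, "C": 5},
}
print(presentation_scores(evaluations))
```

Because both evaluators rank the presenters the same way, their normalized scores coincide, so the harsh evaluator does not pull anyone's result down.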

*Your presentation scores will be released only near the end of the semester
because of the normalization process.*

**Early-bird Bonus**

To encourage students to present papers covered in the earlier weeks of this course, we will give an additional bonus by scaling up your "presentation score" (by up to 20%). That is, if you present in the second week (Lecture No. 4), a scaling factor of 1.2 will be applied to your "presentation score", making your final score "presentation score" * 1.2. The scaling factor decreases over time. For more details, please see this Presentation Schedule (to view this document, you must be logged in using an Illinois account).
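
As a small Python sketch of the scaling, the table below uses the early-bird percentages listed next to student presentations 1-8 in the Schedule tab; later slots are assumed to carry no bonus. The names are hypothetical, and the sketch does not cap the scaled score, since the syllabus does not say whether a cap applies.

```python
# Early-bird bonus (percent) per student presentation slot, from the Schedule tab.
EARLY_BIRD_BONUS_PCT = {1: 20, 2: 20, 3: 15, 4: 15, 5: 10, 6: 10, 7: 5, 8: 5}


def scaled_presentation_score(presentation_score: float, slot: int) -> float:
    """Scale a presentation score by the early-bird bonus for its slot."""
    bonus_pct = EARLY_BIRD_BONUS_PCT.get(slot, 0)  # assumption: no bonus after slot 8
    return presentation_score * (100 + bonus_pct) / 100


print(scaled_presentation_score(90, slot=1))  # 108.0
print(scaled_presentation_score(90, slot=9))  # 90.0
```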


{% endtab %}


### Fourth Tab: Others

{% tab log others %}

[Academic Integrity Policy and Procedure](https://studentcode.illinois.edu/article1/part4/1-401/)

[Anti Racism and Inclusivity Statement](https://ws.engr.illinois.edu/sitemanager/getfile.asp?id=1595)

[Accessibility and Accommodations](https://oae.illinois.edu/our-services/accessibility-and-accommodations/)

[Religious Observances](https://odos.illinois.edu/resources/students/religious-observances)

[Sexual Misconduct Reporting Obligations](https://wecare.illinois.edu/faq/employees/#:~:text=Policy%20and%20Reporting-,The%20University%20of%20Illinois%20is%20committed%20to%20combating%20sexual%20misconduct,the%20University's%20Title%20IX%20Office.)


**Statement on CS CARES and CS Values and Code of Conduct**

All members of the Illinois Computer Science department - faculty, staff, and students - are expected to adhere to the CS Values and Code of Conduct. The CS CARES Committee is available to serve as a resource to help people who are concerned about or experience a potential violation of the Code. If you experience such issues, please contact the CS CARES Committee. The instructors of this course are also available for issues related to this class.

{% endtab %}


{% endtabs %}
