MA 384 Data Mining
Course Prerequisites
- MA 384 Prerequisite: CSSE 120 Intro to Software Development or equivalent course.
- CSSE 386 Prerequisite: CSSE 220 Object-Oriented Software Development
- MA 223 Engineering Statistics I or MA 381 Intro to Probability.
Course Topics
- Data Exploration, Preparation and Visualization
- Classification Analysis
- Cluster Analysis
- Elementary Natural Language Processing
- Student Project on an Advanced Topic
- Note: CSSE 386 has more advanced programming assignments.
Course Description
An introduction to data mining for large data sets, including data preparation, exploration, aggregation/reduction, and visualization. Elementary methods for classification analysis, cluster analysis, and natural language processing will be covered. Significant attention will be given to presenting and reporting data mining results.
Course Textbook: none
Three V’s of Big Data
Example (Volume)
Visualizing Friendships (Facebook)
"Visualizing data is like photography. Instead of starting with a blank canvas, you manipulate the lens used to present the data from a certain angle." Paul Butler, Facebook Intern
Facebook’s social network graph is large enough to reproduce a rough map of the world.
Example (Velocity)
Clicks in the United States in June 2011
The YouTube animation was created by Helmut Hissen. Green represents clicks from mobile devices and red represents clicks from non-mobile devices.
Example (Variety)
City of Chicago Data Portal
Cities are releasing a wide variety of data to the public.
Past Student Projects
- A Technical Analysis of Stock Trades Made by Members of Congress, Ben Adler, Thomas Bioren, Matthew Hart, Jackson Heil
- Predicting S&P 500 Movement Using Financial Asset Prices, Austin Vesich, Ian irschenbaum, Jake Grellman
- Chess Analytics, Avichal Jadeja, Chris Lardner
- Assessing Improvement in Collegiate Swimmers: Key Factors from High School to College, Tommaso Calviello, Emre Gunay, Vineet Ranade, Preksha Sarda, Blaise Swartwood
- Detecting Illicit Bitcoin Transactions, Graph Data Mining for Anti-Money Laundering, Ian Lemons, Aiden O’Neil, Ethan Pabbathi, Rhys Phelps
- Predicting Movie Success: A Data-Driven Approach, Simarjit Dhillon, Ben Joens, Justin O’Donnell, Kyle Wang
- Football Sports Betting Analysis, Manav Ahuja, Matthew Briscoe, Colin Decker, Ethan Huey
- Possible Factors Explaining the California-Texas Migration, Dylan McCain, Ryan Seidel, Larsen Morehouse, and Mark Worden
- What Factors Determine Movie Profitability? Aidan Frantz, Ethan Hutton, Devin Mehringer, Wesley Schuh
- Credit Card Approval and Loyalty Prediction Analysis, James Koh, Kevin Lin, Joshua Lowe, and Aditya Senthilvel (pdf)
- Cardiovascular Disease Assessment and Prediction, Mitch Boucher, Ariadna Duvall, Aiko Sherman, Manuella Shomba
- Historical Composition of the United States Senate and House of Representatives, Eric Bender, Evan Chung, Anthony Mui, Rahul Siripuram
- Behind the Goals: A Data-Driven Approach to the FIFA World Cup, Brian Beasley, Matteo Calviello, Caleb Mosteller, David Utsis (pdf)
- Analyzing NBA Game Data: Unraveling the Key Metrics for Victory, Wil Bell, Michael Trinh, Gurinder Vasanta
- Relationship Between San Francisco Crime and Housing, Swade Cirata, Marcus Henderson, Aidan Matthews, Chaitanya Singh
- What Makes a Hit Song? Chuwei Du, Josh Norris, Drew Kilner
- Predicting the Financial Stability of Banks, Daniel Gaull, Hanshuo Geng, William Hawkins, Muyao Zhong
- Academic Test Scores in NCAA Sports, Ethan Brown, Lucas Czarnecki, Grant Ripperda, Liam Waterbury, Harrison Wight
- Racial Demographics of Superfund Sites, Riya Bharamaraddi, Mike Bryant, Luke Ferderer, Jayden Foshee, Collin Morris
- Statistical Characteristics of Big Five Personality Traits, Olivia Davis, Nat Hurtig, Dalton Julian, Andrew Kosikowski, Andrew Orians
- Soccer Player Analysis, Qijun Jiang, Xianshun Jiang, Yuanyu Wang, Yunzhe Wei, Yujie Zhang
- Analysis of Trends in Steam Reviews, Ian Liu, Hunter Masur, Raf Qian, Simon Tian, Thomas Yang
- Analysis of Discord Conversations, Nathan Chen, Spencer Chubb, Emily Hart, Matthew Ragland
- Factors of the Median Income of Graduates by College, Jordan Ansari, Brock Buczkowski, Cade Parkhurst, Andrew Pascente, Adithya Ramji
- LA Rams: Analyzing the NFL’s best, Sangheon Choi, Kush Bhuwalka, Ken Zheng, Samvit Ram
- On Implementation of a Flat Tax Rate on Individual Income Tax in the United States, Luke McMahon, Josh Mestemacher, Evan Sellers, Michael Yager
- Housing Prices of Housing Types Across Regions in the United States, Jadon Brutcher, Adam Korinek, Avery Wagner, Grant Wyness
- Pandemic Crime Changes, Andre Battle, Nick Bohner, Aidan Mazany, Jake Wallis
- Youtube Dislikes, Rob Budak, Luke Cesario, Jonathan Moyers, Azzam Turkistani
- Covid19 Vaccination Adverse Reactions, Bowen Ding, Ao Liu, Nigel Nie
- Lichess, Tom Ahmed, Griffin Annis, Landon Bundy, Jackson Hajer, Christian Meinzen, Nick Von Bulow
- Toxic Comment Classification, Shannon Jin, Dylan Luttrell, Vidhu Naik, Connie Zhu
- Kickstarter Project, Joey Hatfield, Zach Kelly, Zackery Painter, Nick Pisciotta, Ried Tate
- Olympics Through the Ages, Max Chaplin, Abi Clayton, Bowen Lie, Ainsley Liu, Jake Milanowski, William Thesken
- Characteristics of Successful Movies, Shengjun Guan, Alex Ketcham, Aaryan Khatri, Andrea Wynn, Sean Xia, Will Yelton
- Premier League Analysis 1920, Yutong Chen, Mashengjun Li, Max Li, Jiadi Want, Travis Zheng
- Solar Panels, David Alba-Lopez, Jeremiah Wooten, Rachel Harness, Mory Chen
- DSL Modem Data Analysis, T.J. Ballard, Sybil Chen, Piotr Galas, Wendy Ju, Kristen McKellar
- U.S. Stock Exploration, Sam Dunaway, Aaron Glave
- Movie Data, Mohammed Ali, Derek Grayless, David Gruninger, Jiafan Lin, Caleb Schlundt
- Covid-19 Data Analysis, Tiantian Zhang, Ben Feaster, Howard Hu, Yu Xin Evian Wen
- Repeat Buyer Prediction, Augustine Cui, Doris Chen, Scott Sun, Wenxing Li, Xiangnan Chen
- Panopto Video Statistics, Jessica Myers, Katana Colledge, Brionna Slaughter, Jake Meister
- Netflix Digital Contents, Robin Li, Zijian Huang, Valerie Liu, Aurora Ouyang, Susie Seo, Siwei Xu
- Energy Production and Usage, Steven Feng, Lawrence Ko, Wenze Ma, Shiloh Musser, Darren Zhu
- Sports Data, Samuel Flickinger, Eric Kirby, Arjun Mahajan, Jared Petrisko, Anthony Schmidt
- Factors of Graduate School Admissions, Runzhe Gao, Frank Hu, Weite Li, Song Luo, Max Wang
- UFO Sightings, Alexander Boffo, Aditya Burle, Benjamin Goldstein, Michael Lake, Cehong Wang
- Google Books Ngrams, Ben Hall, William Mason, Aaron Michael, Stella Park, Wyatt Shafer
- Trending YouTube Video Statistics, Hussein Alawami, Tyler Bath, Tyson Clark, Sylvia Nees, Indresh Srivastava
- Opiate Overdosing, Brevin Lacy, Matthew Lyons, Kathi Munoz-Hofman, Neelie Shah
- Kickstarters, Stephen Crowell, Timmy D’Avello, Michelle Reese, Nate Schwindt
- Analyzing Trends in the Stock Market, Alexander Bradshaw, JaeJung Hyun, Jacob Petrisko, Abilash Raghuram
- Measuring Economic Distress and Disparity, Khalad Alfayez, Omar Fayoumi, Megan Hawksworth, Addi Reynolds, Seiji Takagi
- New York City Restaurant Inspections, Xiaomei Bi, Jocelynn Cheesebourough, Cambron Johnson, Jing Lin, Olivia Penry
- Aviation Accidents, Sonia Lai, Yiyu Ma, Dylan Scheumann, Xuechen Xie, Willis Yang
- Trending YouTube Video Statistics, Eric Chen, Lory Wang, Yiyuan Wang, Valentine Wu, Huirou Zou
- Stock Pump and Dump, Manoj Kurapati, Joshua Palamuttam, Isaac Austin, Rithvik Subramanya
- RoseCareer, Luke Wukusick, Kaiyu Xie, Chelsey Yin, Fred Lin, Christopher Nurrenberg, Jonah Reel
- College Swimming Data, Mary Petersen, Adam Baker, David Saadatnezhadi, Johann Ryan
- Global Terrorism Data Mining, Wesley Turner, Michael Crowell, Zachary Taylor, Alexander Granowski, Tucker Osman
- Advanced Computer Simulated Conflict Data, Kevin Lewis, Charlie Hersherger, Logan Smith, Anthony Grueninger
- ROSECRET, Qiuyun Li, Curtis Wang, Jerry Zheng, Mory Chen, Lansi Wang, Yicong Xie
- MOBA Game Analysis, Xiangbei Chen, Yifei Chen, Jiahao Chi, Jizhon Hang, Peicheng Tang, Yilun Wu
- Analyzing Song Lyrics and Chord Progressions, Wyatt Smith, Anne Boxeth, Jarret Alexander, Kennedy Schnieders, Zachary Thelen
- Plant Geneology, Charlie Gettys, Jenna Wohlpart, Nick Harrelson, Ishan Saraf, Anirudh Singh
- Flight Status, Ramsey Tomasi-Carr, Akanksha Chattopadyay, Nihaal George, Wesley Siebenthaler, Shinjun Yu
- Analyze and Visualize Traffic Throughput, Benjamin Brubaker, Songuy Wang, Alexander Wong
- Analyze Flights Status and Predict Delay and Cancellation, Tiancong Zhao
- Hearthstone Card Graph, Tyler Rarick
- League of Legend Ranked Game Analysis, Fangyuan Wang
- Magic the Gathering Card Analysis, Dalton Bush, John Fenoglio, John Hamilton
- Measuring the Effect of Weather on Electricity Generation for Renewables, Marc Schmitt, Daniel Verlaque
- Professional Sports Signing Predictions Based Off NCAA Stats, Lucas Weier, Eric Haug, Alexander Meyers
- Stock Correlator, Ryan Crafts, Christopher Knight, Ethan Peterson
- Stock Market and Social Platforms, Dustin George
- UFO Report Analysis in Past 20 Years, Wenkang Dang, Donglai Guo
- Video Game Sales, Yunuan Ding, Yuanqi Li
- Amazon Stock Price Analysis, Alexa Pieragowski, Emelye Wu
- Chess Game Analysis, Kieran Groble, Lewis Kelley, David Lam
- Graphical Data Mining using Diiagramr Library Backend by Pandas, Christian Nunnally
- Modeling Cryptocurrencies to a Behavioral Economic Model, Joseph Porter, William York
- Movie Recommender, Joseph Brown, Ding Nie, Avery Pratt
- Optimization of Electricity Grid – Analysis of Resources, Distribution and Consumption of Electricity, Bryan Gish, Caleb Hille, Joseph Novosel
- Pokemon Project, Fengyi Huang, Ming Lyu, Junyi Xiao
- Professions Across the Country and The Cost of Working There, Leo Betts, Jaron Goodman
- Protein Visualization, Krystal Yang, Sam Zhang, Fred Zhang
- Topic Interest Correlation to Asset Prices, Adit Survarna, Jack Wassom
- TV Show Quotes Analysis, Lance Dinh, Maya Holeman
- Twin Cities Metro Area Data, Kiana Caston, Joshua Richey
- World Input-Output Database, Ty Adams, Mariana Lane
- Yelp Mining, Justin Willoughby
- League of Data, Dax Earl, Mason Schneider, Aaron Golliver, Tayler Burns and Mark Hein
- Classification of Protein Folds and Families, Jonathan Taylor, Alexis Fink, Devon Timaeus, Giuliana Watson
- Geographic Analysis of Movie Preferences, L.E. Davey, David Galvez, Matthew Mercer, Henrik Sohlberg
- Classification of Galaxies, Man Chi Huen, Si Fi Faye Li
- Handwritten Digits Classification, Brent Austgen, Matt Spurr, Jake Schuenke, Tyler Shelton
- Prediction of Flight Delay Times at SFO, Andy Chen, Zhengyu Qin, Ted Samore
- Organic Foods Impacts and Trends, Brandon Cox, Davis Robinson, Fang Huang
- March Mining Madness, Dan Schepers, Matt Skorina
- Where Rose Goes, Elias White
- Mining Twitter for Meaningful Sentiment, Alex Crowley
- Transforming How We Diagnose Heart Disease, Alvin Ye, Kyle Daruwalla
- Handwritten Digit Recognition, Adam Finer, James Gibson, Johnathon Hein
- Steam Marketplace Analysis, Jacob Knispel, Nithin Perumal, Alec Tiefenthal, Matt Buckner
- An Analysis by Income of the 2013 Community Data Set from Kaggle, Jake Laird, Abby Mann, Zachary Haloski
- Lyric Generator, Christopher Lambert, Graham Fuller
- Chilean Government Income Analysis, Deven Dong, Yuzong Gao, Fang-Yen Lee
- Comparing Countries over Time, Anne Leonhard
- Tornado Trends in the United States, Megan Liebman
- Where and When Teleport is a Better Summoner Spell than Ignite in League of Legends, Andrew Ma
- How Data Mining Can Help You Do Better in Hearthstone, Jerry Qiu, An Hu
- Are Stock Clusters Meaningful, Dylan Vener, Daniel Mikhail
- Unrevealed Relationships Between Resources, Ruinan Zhang, Wenjun Kong, Zhihao Xue, Jiaren Wu
- Correlation between Stock Price and Trading Volume, Xiao Xin

