TJ Machine Learning Club
Making AI more accessible
Competition Instructions
Beginner Series Competitions
Neural Networks Competition Instructions
12/11/19 - Your job is to write the code to create a neural network, train it on the training data, and use it to predict the classes of the testing data. This time, we'll be using data from the famous MNIST dataset. MNIST has 28x28 pixel images of handwritten numerical digits, which means 784 different features, each representing one pixel of the image.
Your neural network should predict the handwritten digit, outputting a number from 0-9, based on these 784 input pixels. Each pixel is described by a single number from 0-255 representing its intensity (0 being a completely white pixel, 125 being a gray-ish pixel, and 255 being a fully black pixel).
Note: the numbers represented will probably contain relatively intense pixels near the center of the stroke, but have pixels fading in intensity on the borders of the stroke of the number.
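That 0-255 range matters in practice: a network usually trains more smoothly when its inputs are scaled down to [0, 1]. A minimal sketch with numpy (the pixel values here are made up):

```python
import numpy as np

# Made-up raw intensities for a few pixels, as they would appear in the csv
raw_pixels = np.array([0, 130, 255, 125], dtype=np.float64)

# Divide by 255 so every feature lies in [0, 1]
scaled = raw_pixels / 255.0
```

You would apply the same scaling to every feature column of both the training and testing data.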
The Data
With that in mind, the training data now looks like this:
class (digit), n1, n2, n3, n4 ... n784
0, 0, 0, 130, 255 ... 120
9, 0, 10, 101, 87 ... 9
3, 230, 0, 5, 35 ... 1
1, 90, 0, 0, 90 ... 23
1, 80, 70, 100, 5 ... 167
etc. (27629 lines)
and the testing data looks like this:
id, n1, n2, n3, n4 ... n784
1, 0, 0, 130, 255 ... 120
2, 0, 10, 101, 87 ... 9
3, 230, 0, 5, 35 ... 1
4, 90, 0, 0, 90 ... 23
5, 80, 70, 100, 5 ... 167
etc. (10000 lines)
Your end file should be a .csv like this:
id, number
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (10000 lines)
(For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".)
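Once you have one predicted digit per testing row, writing the submission file takes only a few lines with the standard csv module (the predictions below are placeholders, not real output):

```python
import csv

predictions = [7, 2, 1, 0]  # placeholder predictions, one per testing row

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "number"])          # header required by the format
    for i, pred in enumerate(predictions, 1):  # ids start at 1
        writer.writerow([i, pred])
```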
Guidelines
We highly recommend you open both the training and testing csv files in a program like Excel, which will help you modify columns of data and perform calculations quickly.
The Data I/O code from the decision tree competition may still be useful.
As always, use Python for this competition. The only external libraries allowed are Numpy and Pandas. (No Scikit-Learn.) This competition will end at 11:59 PM on Jan 15th, 2020. If it's your first time competing, check out how to participate under the Decision Trees Competition link on this page.
We've written a small shell. The shell has a network class, and each network is made up of a list of layers (which are a separate class). Each layer is designed to have its own vectors and matrices (biases and weights, etc.). You don't have to structure your network this way by any means, or have a layer class at all. Most of the time when people write neural networks from scratch, they only have a single network class.
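We haven't reproduced the shell here, but the layer-list idea looks roughly like this (forward pass only, with sigmoid activations; the class names, sizes, and initialization are illustrative, not the shell's actual code):

```python
import numpy as np

class Layer:
    """One fully connected layer that owns its own weights and biases."""
    def __init__(self, n_in, n_out, rng):
        self.weights = rng.standard_normal((n_in, n_out)) * 0.01
        self.biases = np.zeros(n_out)

    def forward(self, x):
        # Sigmoid activation applied to the affine transform
        z = x @ self.weights + self.biases
        return 1.0 / (1.0 + np.exp(-z))

class Network:
    """A network is just an ordered list of layers."""
    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [Layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

net = Network([784, 30, 10])        # 784 pixels in, 10 digit scores out
out = net.forward(np.zeros(784))    # out has shape (10,)
```

Training (backpropagation) is the part you'd add on top; each Layer keeping its own matrices makes it easy to store per-layer gradients the same way.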
Support Vector Machines Competition Instructions
10/23/19 - This competition is a harder version of the Random Forests competition.
Your job is to write the code to create an SVM, train it on the training data, and use it to predict the classes of the testing data.
The goal is to classify survival of passengers on the Titanic, and the data we are using are from actual passengers on the ship.
Your SVM is supposed to classify whether a passenger survived, RIP (0) or Survived (1), based on 11 different metrics (features).
The purpose of this contest is not to test your ability to write an SVM. Instead, it mostly tests:
- Your ability to use Scikit-Learn
- Your ability to work with real-world data
The second is far more important (and difficult) than the first.
The Data
With that in mind, the training data now looks like this:
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (636 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (255 lines)
Your end file should be a .csv like this:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (255 lines)
(For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".)
Guidelines
There are missing data points. Some features are not useful. The difficult part of this contest is formatting the data given, determining which features are useful, which ones should be trained on, and how to deal with the missing data.
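As one illustration of both steps (the column names here are made up, not the contest's actual ones), Pandas keeps the cleanup short:

```python
import pandas as pd
import numpy as np

# Made-up frame mimicking Titanic-style data, with one missing age
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 8.05],
    "name": ["A", "B", "C"],  # a column you might judge uninformative
})

# One common strategy: fill missing numeric values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Then drop the columns you decide not to train on
df = df.drop(columns=["name"])
```

Median imputation is only one option; you could also drop rows with missing values, or fill with the mean. Which choice works best is part of the contest.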
You are allowed to (and should) use Scikit-Learn to create your SVM. Scikit has detailed instructions on how to write an SVM using the library here.
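The basic SVC workflow is fit-then-predict. A toy sketch (the data is fabricated, and the kernel and C values are just the defaults to start tuning from):

```python
from sklearn.svm import SVC
import numpy as np

# Fabricated stand-in for cleaned Titanic features; class follows column 2
X_train = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y_train = np.array([1, 1, 0, 0])

clf = SVC(kernel="rbf", C=1.0)  # tune kernel and C on the real data
clf.fit(X_train, y_train)

preds = clf.predict(np.array([[0, 1], [1, 0]]))
```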
We highly recommend you open both the training and testing csv files in a program like Excel, which will help you modify columns of data and perform calculations quickly.
The Data I/O code from the decision tree competition may still be useful.
This Jupyter notebook, which applies an SVM to the trivial problem of differentiating Greek yogurt from regular yogurt, might also be helpful.
As always, only use Python for this competition. The only external libraries allowed are Scikit-Learn, Numpy, and Pandas. This competition will end at 11:59 PM on Tuesday, November 5th. If it's your first time competing, check out how to participate under the Decision Trees Competition link on this page.
Random Forests Competition Instructions
10/20/19 - Welcome to the (surprise) second competition of the year!
Your job is to write the code to create a random forest, train it on the training data, and use it to predict the classes of the testing data.
The goal is to classify survival of passengers on the Titanic, and the data we are using are from actual passengers on the ship.
Your random forest is supposed to classify whether a passenger survived, RIP (0) or Survived (1), based on 7 different metrics (features).
The Data
The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
etc. (500 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
etc. (214 lines)
Your end file should be a .csv like this:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (214 lines)
(For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".)
Guidelines
No external libraries are allowed other than numpy. The competition (here) closes at 11:59 PM on 10/29/19. The leaderboard calculations, Python resources, and participation steps apply to this competition too.
Shell code is available here, should you want it. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data. Good luck!
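If you want a picture of the overall mechanics, here is a compressed sketch using depth-1 trees (decision stumps) instead of full trees; it is enough to show the two ideas that make a forest, bootstrap sampling and majority voting. The real shell and your real trees will differ:

```python
import numpy as np

def train_stump(X, y, rng):
    """Fit a depth-1 tree: pick a random feature, then its best threshold."""
    f = rng.integers(X.shape[1])
    best_t, best_acc, best_flip = 0.0, -1.0, False
    for t in np.unique(X[:, f]):
        acc = np.mean((X[:, f] > t).astype(int) == y)
        if acc > best_acc:                       # predict 1 above threshold
            best_t, best_acc, best_flip = t, acc, False
        if 1 - acc > best_acc:                   # or the flipped rule
            best_t, best_acc, best_flip = t, 1 - acc, True
    return f, best_t, best_flip

def stump_predict(stump, X):
    f, t, flip = stump
    pred = (X[:, f] > t).astype(int)
    return 1 - pred if flip else pred

def train_forest(X, y, n_trees=25, seed=0):
    """Each stump sees its own bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(len(X), size=len(X))  # bootstrap sample
        forest.append(train_stump(X[idx], y[idx], rng))
    return forest

def forest_predict(forest, X):
    votes = np.array([stump_predict(s, X) for s in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote

# Toy data: both features separate the two classes cleanly
X = np.array([[1, 8], [2, 7], [1, 9], [2, 6],
              [5, 2], [6, 1], [7, 3], [8, 2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
forest = train_forest(X, y)
```

Swapping the stump for a proper decision tree (split recursively, not just once) turns this into the real algorithm; the bootstrap and voting code stays the same.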
Footnote
The features correspond to:
pclass - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
sex - Sex: 0 = Male, 1 = Female
Age - Age in years
sibsp - # of siblings / spouses aboard the Titanic
parch - # of parents / children aboard the Titanic
fare - passenger fare
embarked - Port of Embarkation: 0 = Southampton, 1 = Cherbourg, 2 = Queenstown
survival - Survival: 0 = No, 1 = Yes
though this isn't necessary to know for the competition.
Decision Trees Competition
10/2/19 - Welcome to the first contest of the year! Your job is to write the code to create a decision tree, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify breast cancer data. The data we are using is actual data from breast cancer patients. Your decision tree is supposed to classify the type of breast cancer they have (benign (0) or malignant (1)), based on 9 different metrics (features).
The Data
The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
etc. (533 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
etc. (174 lines)
Your end goal is to create a file which looks like:
id, class
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (174 lines)
For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".
Basic shell code is available here. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data.
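Reading the csv with numpy and splitting off the class column is a line each. A self-contained sketch using an inline stand-in for the real file (the real training file has 9 feature columns, not 3):

```python
import numpy as np
from io import StringIO

# Tiny stand-in for the training csv; swap StringIO for the real filename
csv_text = "f1,f2,f3,class\n1,4,2,0\n5,3,8,1\n7,7,7,1\n"

data = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1)
X_train = data[:, :-1]  # all feature columns
y_train = data[:, -1]   # class column: 0 = benign, 1 = malignant
```

The testing file is read the same way, except there is no class column to split off.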
How to participate
Our contests will be held on Kaggle, using Kaggle InClass. This allows us to upload data and competition instructions, as well as impose submission deadlines. It also ranks submissions automatically! To participate:
- Create a Kaggle account by clicking "sign up" in the top right.
- Click on this link (the competition link).
- Download the training and testing data.
- Download the I/O Code.
- Write your algorithm and train it on the training data.
- Then, run it on the testing data, creating a submission file with your predictions in the format shown in the sample submission file.
- Upload your submission file and see your results!
- Tweak your code, repeating steps 5-8 to improve your accuracy and move up the leaderboard.
- Send your code to tjmachinelearning@gmail.com whenever you're done.
Some More Guidelines:
- We recommend that you use Python. Everything we do is in Python this year, and it's the main language for machine learning in general. Don't use any external libraries other than numpy and pandas; both packages are very useful for data processing and manipulation, so it's a good idea to learn them if you can. If you're unfamiliar with Python, you can use whichever language you want for this competition, but future competitions will be much more difficult without it. If it's your first time using Python, check out these links for a jumpstart: Google's Python course, LearnPython.org, and The Python Guru.
- The competition ends at 11:59:00 PM, October 15th. For this competition, we're giving you an extra week beyond what we normally do, so be sure to try it out!
- The leaderboard on Kaggle that you can see is the Public Leaderboard, which is your accuracy for 50% of the testing data. Your final rankings will be based on the Private leaderboard, which is based on the other 50% of the testing data and will become public as soon as the competition ends. This is to prevent you from just writing a decision tree that overfits the testing data, which defeats the purpose.
Advanced Series Competitions
Cloud Identification Competition Instructions
10/2/19 - Our advanced members can compete in the Cloud Identification competition. Detailed rules and instructions can be found on the Kaggle competition page. Basically, the goal is to submit a model in your team that is able to classify cloud patterns in the sky.
As a starting point, check out this public notebook which will help you manipulate the images in the dataset.
Guidelines
- At least one member from each team should be present at the advanced group every meeting.
- Have a working submission (score > 0) on Kaggle by 10/16. (This can be the .csv from the sample Kaggle notebook.)
- Improve on previous submission to a higher score by 11/13.
- Final deadline for competition is 11/18.
Points Breakdown
- Having at least one member of the team present at every club meeting will score one point.
- Meeting the 10/16 and 11/13 submission deadlines will score one point each.
- Overall competition scores will be determined by competition rank, relative to other teams in the club:
1st place - 5 pts
2nd place - 4 pts
etc.