TJ Machine Learning Club
Making AI more accessible
Competition Instructions
Beginner Series Competitions
Neural Networks Competition Instructions
12/11/19 - Your job is to write the code to create a neural network, train it on the training data, and use it to predict the classes of the testing data. This time, we'll be using data from the famous MNIST dataset. MNIST has 28x28 pixel images of handwritten numerical digits, which means 784 different features, each representing one pixel of the image.
Your neural network should predict the handwritten digit, outputting a number from 0-9, based on these 784 input pixels. Each pixel is described by a single number from 0-255 representing its intensity (0 being a completely white pixel, 125 being a gray-ish pixel, and 255 being a fully black pixel).
Note: the numbers represented will probably contain relatively intense pixels near the center of the stroke, but have pixels fading in intensity on the borders of the stroke of the number.
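That 0-255 range matters in practice: a network usually trains more smoothly when its inputs are scaled down to [0, 1]. A minimal sketch with numpy (the pixel values here are made up):

```python
import numpy as np

# Made-up raw intensities for a few pixels, as they would appear in the csv
raw_pixels = np.array([0, 130, 255, 125], dtype=np.float64)

# Divide by 255 so every feature lies in [0, 1]
scaled = raw_pixels / 255.0
```

You would apply the same scaling to every feature column of both the training and testing data.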
The Data
With that in mind, the training data now looks like this:
class (digit), n1, n2, n3, n4 ... n784
0, 0, 0, 130, 255 ... 120
9, 0, 10, 101, 87 ... 9
3, 230, 0, 5, 35 ... 1
1, 90, 0, 0, 90 ... 23
1, 80, 70, 100, 5 ... 167
etc. (27629 lines)
and the testing data looks like this:
id, n1, n2, n3, n4 ... n784
1, 0, 0, 130, 255 ... 120
2, 0, 10, 101, 87 ... 9
3, 230, 0, 5, 35 ... 1
4, 90, 0, 0, 90 ... 23
5, 80, 70, 100, 5 ... 167
etc. (10000 lines)
Your end file should be a .csv like this:
id, number
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (10000 lines)
(For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".)
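Once you have one predicted digit per testing row, writing the submission file takes only a few lines with the standard csv module (the predictions below are placeholders, not real output):

```python
import csv

predictions = [7, 2, 1, 0]  # placeholder predictions, one per testing row

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "number"])          # header required by the format
    for i, pred in enumerate(predictions, 1):  # ids start at 1
        writer.writerow([i, pred])
```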
Guidelines
We highly recommend you open both the training and testing csv files in a program like Excel, which will help you modify columns of data and perform calculations quickly.
The Data I/O code from the decision tree competition may still be useful.
As always, use Python for this competition. The only external libraries allowed are Numpy and Pandas. (No Scikit-Learn.) This competition will end at 11:59 PM on Jan 15th, 2020. If it's your first time competing, check out how to participate under the Decision Trees Competition link on this page.
We've written a small shell. The shell has a network class, and each network is made up of a list of layers (which are a separate class). Each layer is designed to have its own vectors and matrices (biases and weights, etc.). You don't have to structure your network this way by any means, or have a layer class at all. Most of the time when people write neural networks from scratch, they only have a single network class.
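We haven't reproduced the shell here, but the layer-list idea looks roughly like this (forward pass only, with sigmoid activations; the class names, sizes, and initialization are illustrative, not the shell's actual code):

```python
import numpy as np

class Layer:
    """One fully connected layer that owns its own weights and biases."""
    def __init__(self, n_in, n_out, rng):
        self.weights = rng.standard_normal((n_in, n_out)) * 0.01
        self.biases = np.zeros(n_out)

    def forward(self, x):
        # Sigmoid activation applied to the affine transform
        z = x @ self.weights + self.biases
        return 1.0 / (1.0 + np.exp(-z))

class Network:
    """A network is just an ordered list of layers."""
    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        self.layers = [Layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
        return x

net = Network([784, 30, 10])        # 784 pixels in, 10 digit scores out
out = net.forward(np.zeros(784))    # out has shape (10,)
```

Training (backpropagation) is the part you'd add on top; each Layer keeping its own matrices makes it easy to store per-layer gradients the same way.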
Support Vector Machines Competition Instructions
10/23/19 - This competition is a harder version of the Random Forests competition.
Your job is to write the code to create an SVM, train it on the training data, and use it to predict the classes of the testing data.
The goal is to classify survival of passengers on the Titanic, and the data we are using are from actual passengers on the ship.
Your SVM is supposed to classify whether a passenger survived, RIP (0) or Survived (1), based on 11 different metrics (features).
The purpose of this contest is not to test your ability to write an SVM. Instead, it mostly tests:
- Your ability to use Scikit-Learn
- Your ability to work with real-world data
The second is far more important (and difficult) than the first.
The Data
With that in mind, the training data now looks like this:
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, survival, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (636 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, feature 10, feature 11
etc. (255 lines)
Your end file should be a .csv like this:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (255 lines)
(For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".)
Guidelines
There are missing data points. Some features are not useful. The difficult part of this contest is formatting the data given, determining which features are useful, which ones should be trained on, and how to deal with the missing data.
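As one illustration of both steps (the column names here are made up, not the contest's actual ones), Pandas keeps the cleanup short:

```python
import pandas as pd
import numpy as np

# Made-up frame mimicking Titanic-style data, with one missing age
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 8.05],
    "name": ["A", "B", "C"],  # a column you might judge uninformative
})

# One common strategy: fill missing numeric values with the column median
df["age"] = df["age"].fillna(df["age"].median())

# Then drop the columns you decide not to train on
df = df.drop(columns=["name"])
```

Median imputation is only one option; you could also drop rows with missing values, or fill with the mean. Which choice works best is part of the contest.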
You are allowed to (and should) use Scikit-Learn to create your SVM. Scikit has detailed instructions on how to write an SVM using the library here.
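The basic SVC workflow is fit-then-predict. A toy sketch (the data is fabricated, and the kernel and C values are just the defaults to start tuning from):

```python
from sklearn.svm import SVC
import numpy as np

# Fabricated stand-in for cleaned Titanic features; class follows column 2
X_train = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y_train = np.array([1, 1, 0, 0])

clf = SVC(kernel="rbf", C=1.0)  # tune kernel and C on the real data
clf.fit(X_train, y_train)

preds = clf.predict(np.array([[0, 1], [1, 0]]))
```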
We highly recommend you open both the training and testing csv files in a program like Excel, which will help you modify columns of data and perform calculations quickly.
The Data I/O code from the decision tree competition may still be useful.
This Jupyter notebook, which applies an SVM to the trivial problem of differentiating Greek yogurt from regular yogurt, might also be helpful.
As always, only use Python for this competition. The only external libraries allowed are Scikit-Learn, Numpy, and Pandas. This competition will end at 11:59 PM on Tuesday, November 5th. If it's your first time competing, check out how to participate under the Decision Trees Competition link on this page.
Random Forests Competition Instructions
10/20/19 - Welcome to the (surprise) second competition of the year!
Your job is to write the code to create a random forest, train it on the training data, and use it to predict the classes of the testing data.
The goal is to classify survival of passengers on the Titanic, and the data we are using are from actual passengers on the ship.
Your random forest is supposed to classify whether a passenger survived, RIP (0) or Survived (1), based on 7 different metrics (features).
The Data
The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, survival
etc. (500 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7
etc. (214 lines)
Your end file should be a .csv like this:
id, solution
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (214 lines)
(For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".)
Guidelines
No external libraries are allowed other than numpy. The competition (here) closes at 11:59 PM on 10/29/19. The leaderboard calculations, Python resources, and participation steps apply to this competition too.
Shell code is available here, should you want it. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data. Good luck!
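If you want a picture of the overall mechanics, here is a compressed sketch using depth-1 trees (decision stumps) instead of full trees; it is enough to show the two ideas that make a forest, bootstrap sampling and majority voting. The real shell and your real trees will differ:

```python
import numpy as np

def train_stump(X, y, rng):
    """Fit a depth-1 tree: pick a random feature, then its best threshold."""
    f = rng.integers(X.shape[1])
    best_t, best_acc, best_flip = 0.0, -1.0, False
    for t in np.unique(X[:, f]):
        acc = np.mean((X[:, f] > t).astype(int) == y)
        if acc > best_acc:                       # predict 1 above threshold
            best_t, best_acc, best_flip = t, acc, False
        if 1 - acc > best_acc:                   # or the flipped rule
            best_t, best_acc, best_flip = t, 1 - acc, True
    return f, best_t, best_flip

def stump_predict(stump, X):
    f, t, flip = stump
    pred = (X[:, f] > t).astype(int)
    return 1 - pred if flip else pred

def train_forest(X, y, n_trees=25, seed=0):
    """Each stump sees its own bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(len(X), size=len(X))  # bootstrap sample
        forest.append(train_stump(X[idx], y[idx], rng))
    return forest

def forest_predict(forest, X):
    votes = np.array([stump_predict(s, X) for s in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)  # majority vote

# Toy data: both features separate the two classes cleanly
X = np.array([[1, 8], [2, 7], [1, 9], [2, 6],
              [5, 2], [6, 1], [7, 3], [8, 2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
forest = train_forest(X, y)
```

Swapping the stump for a proper decision tree (split recursively, not just once) turns this into the real algorithm; the bootstrap and voting code stays the same.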
Footnote
The features correspond to:
pclass - Ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd
sex - Sex: 0 = Male, 1 = Female
Age - Age in years
sibsp - # of siblings / spouses aboard the Titanic
parch - # of parents / children aboard the Titanic
fare - passenger fare
embarked - Port of Embarkation: 0 = Southampton, 1 = Cherbourg, 2 = Queenstown
survival - Survival: 0 = No, 1 = Yes
though this isn't necessary to know for the competition.
Decision Trees Competition
10/2/19 - Welcome to the first contest of the year! Your job is to write the code to create a decision tree, train it on the training data, and use it to predict the classes of the testing data. We are trying to classify breast cancer data. The data we are using is actual data from breast cancer patients. Your decision tree is supposed to classify the type of breast cancer they have (benign (0) or malignant (1)), based on 9 different metrics (features).
The Data
The training data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9, class
etc. (533 lines)
and the testing data looks like this:
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
feature 1, feature 2, feature 3, feature 4, feature 5, feature 6, feature 7, feature 8, feature 9
etc. (174 lines)
Your end goal is to create a file which looks like:
id, class
1, predicted_class
2, predicted_class
3, predicted_class
4, predicted_class
etc. (174 lines)
For every set of features in line N of the testing data, your submission file should contain a line "N, predicted_class".
Basic shell code is available here. The Data I/O code from the decision tree lecture will still be useful, but will need to be adapted for this data.
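Reading the csv with numpy and splitting off the class column is a line each. A self-contained sketch using an inline stand-in for the real file (the real training file has 9 feature columns, not 3):

```python
import numpy as np
from io import StringIO

# Tiny stand-in for the training csv; swap StringIO for the real filename
csv_text = "f1,f2,f3,class\n1,4,2,0\n5,3,8,1\n7,7,7,1\n"

data = np.genfromtxt(StringIO(csv_text), delimiter=",", skip_header=1)
X_train = data[:, :-1]  # all feature columns
y_train = data[:, -1]   # class column: 0 = benign, 1 = malignant
```

The testing file is read the same way, except there is no class column to split off.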
How to participate
Our contests will be held on Kaggle, using Kaggle InClass. This allows us to upload data and competition instructions, as well as impose submission deadlines. It also ranks submissions automatically! To participate:
- Create a Kaggle account by clicking "sign up" in the top right.
- Click on this link (the competition link).
- Download the training and testing data.
- Download the I/O Code.
- Write your algorithm and train it on the training data.
- Then, run it on the testing data, creating a submission file with your predictions in the format shown in the sample submission file.
- Upload your submission file and see your results!
- Tweak your code, repeating steps 5-8 to improve your accuracy and move up the leaderboard.
- Send your code to tjmachinelearning@gmail.com whenever you're done.
Some More Guidelines:
- We recommend that you use Python. Everything we do is in Python this year, and it's the main language for machine learning in general. Don't use any external libraries other than numpy and pandas; both packages are very useful for data processing and manipulation, so it's a good idea to learn them if you can. If you're unfamiliar with Python, you can use whichever language you want for this competition, but future competitions will be much more difficult without it. If it's your first time using Python, check out these links for a jumpstart: Google's Python course, LearnPython.org, and The Python Guru.
- The competition ends at 11:59:00 PM, October 15th. For this competition, we're giving you an extra week beyond what we normally do, so be sure to try it out!
- The leaderboard on Kaggle that you can see is the Public Leaderboard, which is your accuracy for 50% of the testing data. Your final rankings will be based on the Private leaderboard, which is based on the other 50% of the testing data and will become public as soon as the competition ends. This is to prevent you from just writing a decision tree that overfits the testing data, which defeats the purpose.
Advanced Series Competitions
Cloud Identification Competition Instructions
10/2/19 - Our advanced members can compete in the Cloud Identification competition. Detailed rules and instructions can be found on the Kaggle competition page. Basically, the goal is to submit a model in your team that is able to classify cloud patterns in the sky.
As a starting point, check out this public notebook which will help you manipulate the images in the dataset.
Guidelines
- At least one member from each team should be present at the advanced group every meeting.
- Have a working submission (score > 0) on Kaggle by 10/16. (This can be the .csv from the sample Kaggle notebook.)
- Improve on previous submission to a higher score by 11/13.
- Final deadline for competition is 11/18.
Points Breakdown
- Having at least one member of the team present at every club meeting will score one point.
- Meeting the 10/16 and 11/13 submission deadlines will score one point each.
- Overall competition scores will be determined by competition rank, relative to other teams in the club:
1st place - 5 pts
2nd place - 4 pts
etc.