Outline:

Regression and Classification

  • Regression returns a numeric value:
    • given a new value on the X axis, approximate the corresponding Y value
  • Classification returns a state:
    • positive or negative, yes or no, dog or cat
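
A minimal sketch of the difference, using toy data and a pass mark that are invented here for illustration. The regression function returns a number for a new x, while the classification function returns a state:

```python
# Toy data, invented for illustration: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [2.0, 4.0, 6.0, 8.0]

# Regression: fit a least-squares line, then read off y for a new x.
mean_x = sum(hours) / len(hours)
mean_y = sum(scores) / len(scores)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, scores)) / sum(
    (x - mean_x) ** 2 for x in hours
)
intercept = mean_y - slope * mean_x


def predict_score(x):
    """Regression: returns a numeric value."""
    return slope * x + intercept


def predict_pass(x, threshold=5.0):
    """Classification: returns a state (the pass mark of 5.0 is hypothetical)."""
    return "yes" if predict_score(x) >= threshold else "no"


print(predict_score(5.0))
print(predict_pass(2.0))
```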

regress-class

Testing

  • How well is my model doing?
  • How to find a model that generalizes well?
  • Never use your testing data for training.

Which model is better?: which-better

Split data to train and test: train-test.png

Test results: test.png

Code:

from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing; train only on the remaining 75%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
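
To make the idea behind the split concrete, here is a rough pure-Python sketch. It is not scikit-learn's implementation; the helper name `simple_train_test_split`, the seed, and the toy data are assumptions made for illustration:

```python
import random


def simple_train_test_split(X, y, test_size=0.25, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 20 samples with one feature each.
X = [[i] for i in range(20)]
y = [2 * i for i in range(20)]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y)
print(len(X_train), len(X_test))
```

No sample ends up in both sets, which is what keeps the test score an honest estimate.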

Confusion Matrix

  • How well is my model doing?

Example:

  • Blue points are actually positive, red points are actually negative
  • Points above the line are predicted positive, points below are predicted negative

confusion

Then:

  • True Positive: 6, True Negative: 5, False Positive: 2, False Negative: 1
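
The four counts can be tallied directly from labels. In this sketch the label lists are hypothetical, arranged only so that they reproduce the counts above (1 = positive, 0 = negative):

```python
# Hypothetical labels arranged to reproduce the counts above.
y_true = [1] * 6 + [0] * 5 + [0] * 2 + [1] * 1
y_pred = [1] * 6 + [0] * 5 + [1] * 2 + [0] * 1


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


print(confusion_counts(y_true, y_pred))  # (6, 5, 2, 1)
```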

Accuracy

  • One way to measure how good a model is
  • accuracy = (True Positive + True Negative) / total

Example:

accuracy

Code:

from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
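
As a quick check of the formula by hand, the counts from the confusion matrix example (6, 5, 2, 1) can be plugged in directly:

```python
# Counts taken from the confusion matrix example.
tp, tn, fp, fn = 6, 5, 2, 1

total = tp + tn + fp + fn        # 14 points in all
accuracy = (tp + tn) / total     # correctly classified / total

print(round(accuracy, 3))  # 0.786
```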

Regression Metrics

Mean Absolute Error

  • Average of the absolute distances from the points to the line.

mean-absolute-error

Code:

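from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression

# Fit a line to the data.
model = LinearRegression()
model.fit(X, y)

# Predict on the same points the model was fitted on.
guesses = model.predict(X)

# Average absolute distance between the true values and the predictions.
error = mean_absolute_error(y, guesses)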
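
For intuition, the same metric can be written out by hand. The helper name and the toy values below are made up for illustration:

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |true - predicted| over all points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, invented for illustration.
y = [3.0, 5.0, 7.0]
guesses = [2.5, 5.0, 8.0]

# (0.5 + 0.0 + 1.0) / 3
print(mean_absolute_error_by_hand(y, guesses))  # 0.5
```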

Mean Squared Error

  • Average of the squared distances from the points to the line.

mean-squared-error

Code:

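from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression

# Fit a line to the data.
model = LinearRegression()
model.fit(X, y)

# Predict on the same points the model was fitted on.
guesses = model.predict(X)

# Average squared distance between the true values and the predictions.
error = mean_squared_error(y, guesses)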
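
Squaring makes large misses count for more than small ones. The sketch below uses invented toy predictions: both sets have the same mean absolute error, but the one with a single large miss has a higher mean squared error:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, invented for illustration.
y_true = [0.0, 0.0, 0.0, 0.0]
pred_even = [1.0, 1.0, 1.0, 1.0]   # four small misses
pred_spiky = [0.0, 0.0, 0.0, 4.0]  # one large miss

print(mae(y_true, pred_even), mse(y_true, pred_even))    # 1.0 1.0
print(mae(y_true, pred_spiky), mse(y_true, pred_spiky))  # 1.0 4.0
```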

R2 score

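  • The R2 score compares our model to the simplest possible model.
  • What is the simplest possible model that fits a bunch of points?
    • Take the average of all the values and draw a horizontal line through it.
  • R2 = 1 - (squared error of our model) / (squared error of the simple model)
    • close to 1: our model is much better than the simple model; close to 0: it is no better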

r2-score

Code:

from sklearn.metrics import r2_score

y_true = [1, 2, 4]
y_pred = [1.3, 2.5, 3.7]

r2_score(y_true, y_pred)
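
The same number can be worked out by hand with the standard definition, R2 = 1 - SS_res / SS_tot, where SS_tot is the squared error of the horizontal line through the mean. The variable names below are chosen for illustration:

```python
y_true = [1, 2, 4]
y_pred = [1.3, 2.5, 3.7]

# The "simplest possible model": always predict the mean of y_true.
mean_y = sum(y_true) / len(y_true)

# Squared error of our model vs. squared error of the mean line.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # 0.908
```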

Types of Errors

  • Oversimplify the problem: underfitting
    • Error due to bias
  • Overcomplicate the problem: overfitting
    • Error due to variance

types.png

Underfitting: underfitting.png

Overfitting: overfitting.png

Tradeoff: tradeoff.png

Model Complexity Graph

  • Plots training error and cross-validation error against model complexity to detect underfitting and overfitting.

Training models example: models.png

Cross-Validation data:

  • Used for making decisions about the model.

cross-validation.png

Errors of each training model: graph-example.png

Real-life model complexity graph:

  • on left side underfit
  • on right side overfit

graph-general.png
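
A rough sketch of how the numbers behind such a graph could be produced. Everything here is an assumption made for illustration: the synthetic noisy sine data, polynomial degree as the measure of complexity, and the even/odd split into training and cross-validation points. It uses NumPy rather than scikit-learn to stay short:

```python
import numpy as np

# Synthetic data, invented for illustration: a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Even-indexed points train the model; odd-indexed points form the cross-validation set.
x_train, y_train = x[::2], y[::2]
x_cv, y_cv = x[1::2], y[1::2]


def mse(y_true, y_hat):
    """Mean squared error between two arrays."""
    return float(np.mean((y_true - y_hat) ** 2))


degrees = [1, 2, 3, 4, 5, 6]
train_errors, cv_errors = [], []
for degree in degrees:
    # Polynomial degree plays the role of model complexity.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_errors.append(mse(y_train, np.polyval(coeffs, x_train)))
    cv_errors.append(mse(y_cv, np.polyval(coeffs, x_cv)))

for degree, tr, cv in zip(degrees, train_errors, cv_errors):
    print(f"degree={degree}  train={tr:.3f}  cv={cv:.3f}")

# Pick the complexity with the lowest cross-validation error, not the lowest training error.
best_degree = degrees[int(np.argmin(cv_errors))]
print("best degree by cross-validation:", best_degree)
```

Training error keeps falling as the degree grows, so the model is chosen from the cross-validation error, and the test set stays untouched.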

K-Fold Cross Validation

  • A method to recycle our data
  • Is there a way to avoid throwing away useful data on the test set?
    • Break our data into K buckets.
    • Then train our model K times.
    • Each time, use a different bucket as the testing set and the remaining points as the training set.
    • Then average the results to get a final model.
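
The bucket procedure above can be sketched in plain Python. This is not scikit-learn's implementation; the helper name `k_fold_indices` and the seed are assumptions for illustration:

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per bucket."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Bucket i takes every k-th shuffled index starting at position i.
    buckets = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = buckets[i]
        train_idx = [j for b, bucket in enumerate(buckets) if b != i for j in bucket]
        yield train_idx, test_idx


# 12 samples broken into 3 buckets.
folds = list(k_fold_indices(12, 3))
for train_idx, test_idx in folds:
    print(sorted(train_idx), sorted(test_idx))
```

Every sample is used for testing exactly once and for training K - 1 times.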

Code:

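from sklearn.model_selection import KFold

# KFold takes the number of folds; the data itself is passed to split().
kf = KFold(n_splits=3, shuffle=True)

# X is the feature array (12 samples in this example).
for train_indices, test_indices in kf.split(X):
    print(train_indices, test_indices)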