Introduction to data science using Fisher's Iris data and scikit-learn

Steps to get started

  1. Open Google Colab and load the notebook using the repository link
  2. If you do not see a dialog, select File > Open Notebook
  3. Select "GitHub", enter the repository link https://github.com/Data-and-Design-Lab/DS-Iris-Flower-dataset and press Enter
  4. Select the notebook to get started
  5. Explore the notebook

Or

  1. Clone the repository to your machine with git clone
  2. Install required packages (see tools section)
  3. Explore the notebook

Introduction

The iris is a beautiful flower, but the genus has 260-300 species (Wikipedia). We want to create an application that takes observed features of a flower and tells us which species it is.

Fig.: The three species in the Iris dataset (source)

We will work with the Iris flower dataset, also known as Fisher's Iris dataset. Ronald Fisher (1890–1962), a British statistician and biologist, introduced this multivariate dataset in his 1936 paper (Fisher, 1936), where he used it to show how linear discriminant analysis can be applied to taxonomic problems (see more in Wikipedia). Machine learning textbooks often use this dataset to introduce new concepts. Similarly, we will use it to train a classification model. The dataset contains 5 columns and 150 rows, 50 for each of 3 Iris species (preview in UCI ML repository). The columns are:

  • Sepal length: Number / cm
  • Sepal width: Number / cm
  • Petal length: Number / cm
  • Petal width: Number / cm
  • Species: Text / "Versicolor", "Setosa" or "Virginica"
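
| row | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|-----|-------------------|------------------|-------------------|------------------|-----------|
| 0   | 5.1 | 3.5 | 1.4 | 0.2 | setosa    |
| 1   | 4.9 | 3   | 1.4 | 0.2 | setosa    |
| 2   | 4.7 | 3.2 | 1.3 | 0.2 | setosa    |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 | setosa    |
| 4   | 5   | 3.6 | 1.4 | 0.2 | setosa    |
| ... | ... | ... | ... | ... | ...       |
| 145 | 6.7 | 3   | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5   | 1.9 | virginica |
| 147 | 6.5 | 3   | 5.2 | 2   | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3   | 5.1 | 1.8 | virginica |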
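
A preview like this can be produced with a few lines of pandas. The sketch below assumes the copy of the dataset bundled with scikit-learn (`load_iris`); the notebook may load the data from a different source, and the variable names `iris` and `df` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the copy of the Iris data that ships with scikit-learn (assumed source).
iris = load_iris()

# Four measurement columns, named e.g. "sepal length (cm)".
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets (0, 1, 2) to their species names.
df["species"] = iris.target_names[iris.target]

print(df.shape)                      # rows x columns
print(df.head())                     # first five rows
print(df["species"].value_counts())  # rows per species
```

Keeping the species as text rather than integer codes makes later group-by summaries and plots easier to read.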

To differentiate and classify the species, we will perform the following analyses using data science and machine learning tools.

Analysis

Descriptive

  • Simple statistics: mean, standard deviation, median, etc.
  • Visualization
  • Clustering with k-means
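
As a rough illustration of the descriptive steps, the sketch below computes summary statistics with pandas and clusters the measurements with k-means. It assumes the scikit-learn copy of the data; `n_clusters=3` is chosen only because the dataset has 3 species, and `n_init` and `random_state` are arbitrary values fixed for reproducibility, not settings taken from the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Simple statistics: describe() reports count, mean, std, min, quartiles, max.
# The "50%" row is the median.
summary = X.describe()
print(summary.loc[["mean", "std", "50%"]])

# k-means with k=3 (assumed, matching the number of species).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# Cross-tabulate cluster labels against the true species to see how they align.
species = pd.Series(iris.target_names[iris.target], name="species")
table = pd.crosstab(species, pd.Series(clusters, name="cluster"))
print(table)
```

k-means never sees the species labels, so the cluster numbers are arbitrary; the cross-tabulation is what shows whether each cluster corresponds to one species.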

Predictive

  • Classification models: artificial neural network (multi-layer perceptron, MLP), support vector machine (SVM), decision tree and random forest
  • Model evaluation using confusion matrix
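
The predictive steps might look roughly like the sketch below, which trains the four model types with scikit-learn and prints a confusion matrix for each. The 70/30 split, the feature scaling for the MLP and SVM, and all hyperparameters are assumptions made for illustration; the notebook may use different settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; stratify keeps the species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative settings only. MLP and SVM are sensitive to feature scale,
# so they are wrapped in a pipeline that standardizes the inputs first.
models = {
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Rows are true species, columns are predicted species.
    cm = confusion_matrix(y_test, y_pred)
    results[name] = (accuracy_score(y_test, y_pred), cm)
    print(f"{name}: accuracy = {results[name][0]:.3f}")
    print(cm)
```

Off-diagonal entries in each matrix count misclassified flowers, which shows which pairs of species a model confuses rather than just how often it is wrong.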

Tools for analysis

We will be using the following Python (v3.7) packages:

  1. pandas for data analysis
  2. scikit-learn (imported as sklearn) for training models
  3. Seaborn for visualization
  4. Yellowbrick for ML visualization
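
Before running the notebook locally, it can help to check that these packages are importable. The helper below is a hypothetical convenience, not part of the repository; it uses the import names (`sklearn` rather than the distribution name `scikit-learn`).

```python
import importlib.util

# Import names for the packages listed above.
REQUIRED = ["pandas", "sklearn", "seaborn", "yellowbrick"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any package reported as missing can then be installed with your usual package manager before opening the notebook.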

Exercise

In 45 minutes, create a 4-slide presentation with one slide on each of the following:

  • What is the problem
  • Challenges
  • Key findings
  • Model accuracy