Introduction to data science using Fisher's Iris data and scikit-learn
- Open Google Colab; a dialog for opening a notebook should appear
- If you do not see a dialog, select File > Open notebook
- Select "GitHub", enter the repository link https://github.com/Data-and-Design-Lab/DS-Iris-Flower-dataset, and press Enter
- Select the notebook to get started
- Explore the notebook
Or
- Git clone the repository to your machine
- Install required packages (see tools section)
- Explore the notebook
Iris is a beautiful flower, and there are an estimated 260-300 species (Wikipedia). We want to create an application that observes features of a flower and tells us which species it is.
Fig.: The three species in the Iris dataset (source)
We will be working with the Iris flower dataset, also known as Fisher's Iris dataset. Ronald Fisher (1890 – 1962), a British statistician and biologist, introduced this multivariate dataset in his 1936 paper (Fisher, 1936). He used the data to show how linear discriminant analysis can be applied to taxonomic problems (see more on Wikipedia). Machine learning textbooks often use this dataset to introduce different concepts; similarly, we will use it to train a classification model. The dataset contains 5 columns and 150 rows (50 per species) covering 3 species of Iris flower (preview in the UCI ML Repository). The columns are:
- Sepal length: Number / cm
- Sepal width: Number / cm
- Petal length: Number / cm
- Petal width: Number / cm
- Species: Text / "Versicolor", "Setosa" or "Virginica"
row | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
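The preview above can also be reproduced outside the notebook. Below is a minimal sketch that loads scikit-learn's built-in copy of the dataset into a pandas DataFrame (assuming scikit-learn >= 0.23 for the `as_frame=True` option):

```python
# Load the Iris dataset bundled with scikit-learn into a pandas DataFrame
# and reproduce the preview above.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)                  # data and target as pandas objects
df = iris.frame.drop(columns="target")           # keep the four measurement columns
df["species"] = iris.target_names[iris.target]   # map 0/1/2 to setosa/versicolor/virginica

print(df.head())   # first 5 rows (setosa)
print(df.tail())   # last 5 rows (virginica)
print(df.shape)    # (150, 5)
```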
To differentiate and classify the three species, we will perform the following analyses using data science and machine learning tools; a short code sketch follows each list.
Descriptive
- Simple statistics: mean, standard deviation, median, etc.
- Visualization
- Clustering with k-means
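A minimal sketch of the descriptive steps, assuming the `df` DataFrame from the loading sketch above (matplotlib, pulled in by seaborn, is used only to display the plot):

```python
# Descriptive analysis: summary statistics, a pair plot, and k-means clustering.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Simple statistics: mean, standard deviation, median (50%), quartiles
print(df.describe())

# Visualization: pairwise scatter plots coloured by species
sns.pairplot(df, hue="species")
plt.show()

# Clustering: k-means with k=3 on the four measurements
features = df.drop(columns="species")
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(features)

# Compare the unsupervised clusters against the known species labels
print(pd.crosstab(df["species"], clusters))
```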
Predictive
- Classification models: artificial neural network (multi-layer perceptron), support vector machine (SVM), decision tree, and random forest
- Model evaluation using confusion matrix
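And a minimal sketch of the predictive steps using scikit-learn defaults (the notebook may use different splits or hyperparameters); it again assumes `df` from the loading sketch:

```python
# Predictive analysis: train four classifiers and evaluate each with a confusion matrix.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X = df[["sepal length (cm)", "sepal width (cm)",
        "petal length (cm)", "petal width (cm)"]]
y = df["species"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
    "SVM": SVC(random_state=42),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(y_test, y_pred), 3))
    print(confusion_matrix(y_test, y_pred))  # rows: true species, columns: predicted
```

Yellowbrick's confusion matrix visualizer can also be used to render the same matrices graphically.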
We will be using the following Python (v3.7) packages (a quick environment check follows the list):
- pandas for data analysis
- sklearn (scikit-learn) for training models
- Seaborn for visualization
- Yellowbrick for ML visualization
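As a quick environment check, the sketch below imports each package and prints its version; any package that fails to import can be installed with pip:

```python
# Verify that the required packages are installed and print their versions.
import sys
import pandas
import sklearn
import seaborn
import yellowbrick

print("Python:", sys.version.split()[0])
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("seaborn:", seaborn.__version__)
print("Yellowbrick:", yellowbrick.__version__)
```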
In 45 minutes, create a presentation (4 slides total) covering:
- The problem being solved
- Challenges
- Key findings
- Model accuracy