Introduction to data science using Fisher's Iris data and scikit-learn
- Open Google Colab; a dialog for opening a notebook should appear
- If you do not see a dialog, select File > Open notebook
- Select "GitHub", enter the repository link https://github.com/Data-and-Design-Lab/DS-Iris-Flower-dataset, and press Enter
- Select the notebook to get started
- Explore the notebook
Or
- Git clone the repository to your machine
- Install required packages (see tools section)
- Explore the notebook
Iris is a beautiful flower, and there are an estimated 260-300 species (Wikipedia). We want to create an application that observes features of a flower and tells us which species it is.
Fig.: The three species in the Iris dataset (source)
We will be working with the Iris flower dataset, also known as Fisher's Iris dataset. Ronald Fisher (1890 – 1962), a British statistician and biologist, introduced this multivariate dataset in his 1936 paper (Fisher, 1936). He used the data to show how linear discriminant analysis can be applied to taxonomic problems (see more on Wikipedia). Machine learning textbooks often use this dataset to introduce different concepts; similarly, we will use it to train a classification model. The dataset contains 5 columns and 150 rows (50 per species) covering 3 species of Iris flower (preview in the UCI ML Repository). The columns are:
- Sepal length: Number / cm
- Sepal width: Number / cm
- Petal length: Number / cm
- Petal width: Number / cm
- Species: Text / "Versicolor", "Setosa" or "Virginica"
row | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
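The preview above can also be reproduced outside the notebook. Below is a minimal sketch that loads scikit-learn's built-in copy of the dataset into a pandas DataFrame (assuming scikit-learn >= 0.23 for the `as_frame=True` option):

```python
# Load the Iris dataset bundled with scikit-learn into a pandas DataFrame
# and reproduce the preview above.
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)                  # data and target as pandas objects
df = iris.frame.drop(columns="target")           # keep the four measurement columns
df["species"] = iris.target_names[iris.target]   # map 0/1/2 to setosa/versicolor/virginica

print(df.head())   # first 5 rows (setosa)
print(df.tail())   # last 5 rows (virginica)
print(df.shape)    # (150, 5)
```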
To differentiate and classify the three species, we will perform the following analyses using data science and machine learning tools; a short code sketch follows each list.
Descriptive
- Simple statistics: mean, standard deviation, median, etc.
- Visualization
- Clustering with k-means
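A minimal sketch of the descriptive steps, assuming the `df` DataFrame from the loading sketch above (matplotlib, pulled in by seaborn, is used only to display the plot):

```python
# Descriptive analysis: summary statistics, a pair plot, and k-means clustering.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Simple statistics: mean, standard deviation, median (50%), quartiles
print(df.describe())

# Visualization: pairwise scatter plots coloured by species
sns.pairplot(df, hue="species")
plt.show()

# Clustering: k-means with k=3 on the four measurements
features = df.drop(columns="species")
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(features)

# Compare the unsupervised clusters against the known species labels
print(pd.crosstab(df["species"], clusters))
```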
Predictive
- Classification models: artificial neural network (multi-layer perceptron), support vector machine (SVM), decision tree, and random forest
- Model evaluation using confusion matrix
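And a minimal sketch of the predictive steps using scikit-learn defaults (the notebook may use different splits or hyperparameters); it again assumes `df` from the loading sketch:

```python
# Predictive analysis: train four classifiers and evaluate each with a confusion matrix.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X = df[["sepal length (cm)", "sepal width (cm)",
        "petal length (cm)", "petal width (cm)"]]
y = df["species"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

models = {
    "MLP": MLPClassifier(max_iter=1000, random_state=42),
    "SVM": SVC(random_state=42),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(y_test, y_pred), 3))
    print(confusion_matrix(y_test, y_pred))  # rows: true species, columns: predicted
```

Yellowbrick's confusion matrix visualizer can also be used to render the same matrices graphically.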
We will be using the following Python (v3.7) packages (a quick environment check follows the list):
- pandas for data analysis
- sklearn (scikit-learn) for training models
- Seaborn for visualization
- Yellowbrick for ML visualization
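As a quick environment check, the sketch below imports each package and prints its version; any package that fails to import can be installed with pip:

```python
# Verify that the required packages are installed and print their versions.
import sys
import pandas
import sklearn
import seaborn
import yellowbrick

print("Python:", sys.version.split()[0])
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
print("seaborn:", seaborn.__version__)
print("Yellowbrick:", yellowbrick.__version__)
```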
In 45 minutes, create a presentation (4 slides total) covering:
- The problem being solved
- Challenges
- Key findings
- Model accuracy