This repository contains the lab assignments and projects for the DS & BDA (Data Science and Big Data Analytics) Lab course, which is part of the third-year Information Technology curriculum for the 2019 batch.
- Group A: Assignments based on Hadoop
- Group B: Assignments based on Data Analytics using Python
- Group C: Model Implementation
- Usage
- Requirements
- License
- Contact
Group A: Assignments based on Hadoop
Assignment 1: Single node/Multiple node Hadoop Installation
This assignment involves the installation of Hadoop on either a single node or multiple nodes. The instructions provided here will guide you through the installation process.
Assignment 2: MapReduce Log Processing
This assignment involves designing a distributed application using MapReduce (Java) to process a log file from a system. The goal is to identify the users who have logged in for the maximum period on the system.
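The assignment itself is written in Java. As a language-neutral illustration of the map and reduce phases, here is a minimal Python sketch. It assumes a hypothetical log format of `user,login_time,logout_time` with ISO timestamps; the actual log file may be laid out differently, and the function names are illustrative only.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log lines: user,login_time,logout_time (assumed format).
LOG_LINES = [
    "alice,2023-01-01T09:00:00,2023-01-01T11:30:00",
    "bob,2023-01-01T10:00:00,2023-01-01T10:45:00",
    "alice,2023-01-02T08:00:00,2023-01-02T09:00:00",
    "carol,2023-01-02T12:00:00,2023-01-02T15:30:00",
]


def map_phase(line):
    """Emit one (user, session_seconds) pair per log line."""
    user, login, logout = line.strip().split(",")
    duration = datetime.fromisoformat(logout) - datetime.fromisoformat(login)
    return user, int(duration.total_seconds())


def reduce_phase(pairs):
    """Sum session durations per user, as a reducer would do per key."""
    totals = defaultdict(int)
    for user, seconds in pairs:
        totals[user] += seconds
    return dict(totals)


def users_with_max_login(lines):
    """Return every user tied for the longest total login time."""
    totals = reduce_phase(map_phase(line) for line in lines)
    longest = max(totals.values())
    return sorted(u for u, s in totals.items() if s == longest), longest


top_users, top_seconds = users_with_max_login(LOG_LINES)
print(top_users, top_seconds)
```

The sample data contains a tie on purpose, so the sketch returns all users who share the maximum rather than a single name.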
Assignment 3: Flight Information System using HiveQL
This assignment uses HiveQL to build a flight information system. It covers the following tasks:
- Create, drop, and alter database tables.
- Create an external Hive table.
- Load data into a table, insert new values and fields, and join tables with Hive.
- Create an index on the flight information table.
- Find the average departure delay per day in 2008.
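The last task is a grouped aggregation. The sketch below shows the shape of such a query, run against an in-memory SQLite database from Python rather than against Hive, so it only illustrates the `GROUP BY` logic. The table name `flights` and the columns `year`, `month`, `day`, and `dep_delay` are assumptions, not the schema used in the assignment.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (year INTEGER, month INTEGER, day INTEGER, dep_delay REAL)"
)
rows = [
    (2008, 1, 1, 10.0),
    (2008, 1, 1, 20.0),
    (2008, 1, 2, 5.0),
    (2007, 12, 31, 99.0),  # excluded by the WHERE clause
]
conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)

# Average departure delay for each calendar day of 2008.
query = """
    SELECT month, day, AVG(dep_delay) AS avg_dep_delay
    FROM flights
    WHERE year = 2008
    GROUP BY month, day
    ORDER BY month, day
"""
avg_delays = conn.execute(query).fetchall()
conn.close()
print(avg_delays)
```

Here "per day" is read as per calendar date; grouping by day of the month alone would only need `GROUP BY day`.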
Group B: Assignments based on Data Analytics using Python
Assignment 1: Facebook Metrics Data And Adult Data Analysis
In this assignment, we work with the Facebook metrics dataset and perform the following operations:
- Create data subsets: We create subsets of the dataset based on specific criteria or filters.
- Merge Data: We merge multiple datasets together based on common columns or keys.
- Sort Data: We sort the data based on one or more columns in ascending or descending order.
- Transposing Data: We transpose the data to interchange rows and columns.
- Shape and Reshape Data: We reshape the data to convert it into a different structure or format.
- Visualize the data: We use Python libraries such as Matplotlib and Seaborn to plot graphs and visualize the data.
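The subset, merge, sort, transpose, and reshape steps above can be sketched with pandas as follows. The two small frames and their column names (`post_id`, `type`, `likes`, `shares`) are made up for illustration and are not the columns of the actual dataset.

```python
import pandas as pd

# Small made-up frames; column names are illustrative, not the real dataset's.
posts = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "type": ["Photo", "Status", "Photo", "Link"],
    "likes": [120, 45, 300, 10],
})
shares = pd.DataFrame({"post_id": [1, 2, 3, 4], "shares": [12, 3, 40, 1]})

# Subset: keep only photo posts.
photos = posts[posts["type"] == "Photo"]

# Merge: join the two frames on their common key.
merged = posts.merge(shares, on="post_id", how="inner")

# Sort: order by likes, highest first.
by_likes = merged.sort_values("likes", ascending=False)

# Transpose: swap rows and columns.
transposed = merged.set_index("post_id").T

# Reshape: wide to long format, one row per (post, metric) pair.
long_form = merged.melt(
    id_vars=["post_id", "type"], var_name="metric", value_name="value"
)

print(by_likes)
print(long_form.shape)
```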
Assignment 2: Air Quality and Heart Diseases Data Analysis
In this assignment, we work with the Air Quality and Heart Diseases datasets and perform the following operations:
- Data Cleaning: We clean the datasets by handling missing values, outliers, and inconsistent data.
- Data Integration: We integrate multiple datasets into a single dataset based on common attributes or keys.
- Data Transformation: We transform the data by applying mathematical or statistical operations, feature scaling, or encoding categorical variables.
- Error Correcting: We correct errors in the data, such as fixing inconsistent values or resolving data quality issues.
- Data Model Building: We build predictive models or analyze patterns in the data using machine learning algorithms or statistical techniques.
- Visualize the data: We use Python libraries such as Matplotlib and Seaborn to plot graphs and visualize the data.
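A few of the cleaning and transformation steps above might look like this in pandas. The records, the column names (`age`, `chol`, `sex`), and the specific choices (median imputation, a 1.5 × IQR cap, min-max scaling) are assumptions picked for illustration; the assignment notebooks may handle these differently.

```python
import pandas as pd

# Made-up records; column names are illustrative only.
raw = pd.DataFrame({
    "age": [54.0, None, 61.0, 47.0],
    "chol": [230.0, 250.0, 900.0, 210.0],  # 900 looks like an outlier
    "sex": ["M", "f", "F", "m"],           # inconsistent capitalization
})

clean = raw.copy()

# Data cleaning: fill the missing age with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Error correcting: normalize inconsistent category labels.
clean["sex"] = clean["sex"].str.upper()

# Outlier handling: cap cholesterol at 1.5 * IQR above the third quartile.
q1, q3 = clean["chol"].quantile([0.25, 0.75])
clean["chol"] = clean["chol"].clip(upper=q3 + 1.5 * (q3 - q1))

# Data transformation: min-max scale age into [0, 1], then one-hot encode sex.
age_range = clean["age"].max() - clean["age"].min()
clean["age_scaled"] = (clean["age"] - clean["age"].min()) / age_range
encoded = pd.get_dummies(clean, columns=["sex"])

print(encoded)
```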
Assignment 4: Data Visualization On Dataset (Heart, Tips, AirQuality)
Visualize the Heart, Tips, and Air Quality datasets by plotting graphs with the Python libraries Matplotlib and Seaborn.
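As a rough idea of what such a plot involves, the sketch below draws a scatter plot and a histogram with Matplotlib. The values are made up in the style of the Tips dataset rather than loaded from it, and the `Agg` backend is chosen only so the script runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt

# Made-up values in the style of the Tips dataset (illustrative only).
total_bill = [10.3, 16.9, 21.0, 23.7, 24.6, 35.3]
tip = [1.7, 1.0, 3.5, 3.3, 3.6, 5.0]

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Left panel: relationship between two numeric columns.
ax_scatter.scatter(total_bill, tip)
ax_scatter.set_xlabel("total_bill")
ax_scatter.set_ylabel("tip")
ax_scatter.set_title("Tip vs. total bill")

# Right panel: distribution of a single numeric column.
ax_hist.hist(total_bill, bins=3)
ax_hist.set_xlabel("total_bill")
ax_hist.set_title("Distribution of total bill")

fig.tight_layout()

# Render to an in-memory PNG; fig.savefig("plot.png") would write a file instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
print(len(png_bytes) > 0)
```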
Assignment 5: Data Visualization using Tableau
In this assignment, we work with the Adult and Iris datasets and perform the following data visualization operations using Tableau:
- 1D (Linear) Data Visualization: We visualize data along a single dimension using techniques such as bar charts, histograms, or box plots.
- 2D (Planar) Data Visualization: We visualize data in two dimensions using techniques such as scatter plots, bubble charts, or heatmaps.
- 3D (Volumetric) Data Visualization: We visualize data in three dimensions using techniques such as 3D scatter plots, surface plots, or volume rendering.
- Temporal Data Visualization: We visualize data over time using techniques such as line graphs, area charts, or time series plots.
- Multidimensional Data Visualization: We visualize data with more than three dimensions using techniques such as parallel coordinates, radar charts, or trellis plots.
- Tree/Hierarchical Data Visualization: We visualize hierarchical or tree-structured data using techniques such as tree maps, sunburst charts, or dendrograms.
- Network Data Visualization: We visualize network or graph data using techniques such as node-link diagrams, force-directed layouts, or chord diagrams.
Group C: Model Implementation
Assignment 1: Web Scraping
Create a review scraper in Python for any e-commerce website that fetches real-time comments, reviews, ratings, comment tags, and customer names.
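The parsing half of such a scraper can be sketched with only the standard library. The HTML snippet and its class names (`review`, `customer`, `rating`, `comment`) are hypothetical; a real site has its own markup, and downloading the page is left out here.

```python
from html.parser import HTMLParser

# Hypothetical review markup; real sites use their own structure and class names.
SAMPLE_HTML = """
<div class="review">
  <span class="customer">Asha</span>
  <span class="rating">4</span>
  <p class="comment">Good value for money.</p>
</div>
<div class="review">
  <span class="customer">Ravi</span>
  <span class="rating">2</span>
  <p class="comment">Stopped working after a week.</p>
</div>
"""

FIELDS = {"customer", "rating", "comment"}


class ReviewParser(HTMLParser):
    """Collect one dict per <div class="review"> block."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._field = None  # field whose text is currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "review":
            self.reviews.append({})  # start a new review record
        elif css_class in FIELDS and self.reviews:
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self.reviews[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None  # stop capturing once the element closes


parser = ReviewParser()
parser.feed(SAMPLE_HTML)
reviews = parser.reviews
print(reviews)
```

In practice the same parser would be fed the HTML of a fetched product page, with the class names adjusted to match that site.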
Usage
Each assignment has its own folder containing the files and code needed for that assignment. Feel free to explore the code and datasets provided.
Requirements
To run the Python code in these assignments, you need Python installed along with the libraries listed in each assignment's files. The Group A assignments additionally require Hadoop and Hive, plus Java for the MapReduce assignment. The Tableau assignment requires Tableau to be installed on your machine.
License
This project is licensed under the MIT License. Feel free to use the code and materials for educational purposes or personal projects.
Contact
If you have any questions or suggestions, feel free to get in touch by email:
- Ranjeet - contact [dot] ranjeetkumbhar [at] gmail [dot] com