ADBMS & R Programming - Practical Assignments
This repository contains the solution code for Assignments 1 through 10, covering R basics, data pre-processing, data partitioning (SQL), association rule mining, regression, classification, clustering, analytical SQL queries, ORDBMS features, and Pentaho ETL.
Assignment No. 1 – Basics of R
Q1. Working Directory and Help
# Create folders
dir.create("ADBMS Practicals")
dir.create("ADBMS Practicals/R Introduction")
# Get current working directory
getwd()
# Set working directory
setwd("ADBMS Practicals/R Introduction")
# Verify working directory
getwd()
# List files
list.files()
dir()
# Help and examples
?mean
help(mean)
example(mean)
help.start()
Q2. Variable Types and Comparisons
# Create variable Z
Z <- 25
# Check class and type
class(Z)
typeof(Z)
# Print value
print(Z)
# Trigonometric operations
x <- 1:10
y <- atan(1/x)
z <- 1/tan(y)
# Print values
x
y
z
# Comparisons
x == z
identical(x, z)
all.equal(x, z)
all.equal(x, z, tolerance = 0)
# Class and type of special values
class(Inf); typeof(Inf)
class(NaN); typeof(NaN)
class(NA); typeof(NA)
class(""); typeof("")
Q3. Vector Operations
# Create vectors
Vect_Y <- c(2, 4, 6, 8)
Vect_Z <- c(1, 3, 5, 7)
# Length of vector
length(Vect_Z)
# Multiply vector
Vect_Z * 3
# Add vectors
Vect_Y + Vect_Z
# cbind and rbind
cbind(Vect_Y, Vect_Z)
rbind(Vect_Y, Vect_Z)
# Named vector
fruits <- c(apple = 10, banana = 5, mango = 7)
names(fruits)
fruits["banana"]
Q4. Data Manipulation (Temperature)
# Weekly temperature data
temp_week <- c(29, 31, 28, 35, 32, 30, 33)
# Average temperature
mean(temp_week)
# Highest and lowest temperature
max(temp_week)
min(temp_week)
# Temperatures above average
temp_week[temp_week > mean(temp_week)]
# Replace values below 30
temp_week[temp_week < 30] <- 30
temp_week
Q5. Matrix Operations
# Create matrices
A <- matrix(1:16, nrow = 4, ncol = 4)
B <- matrix(1:8, nrow = 4, ncol = 2)
# Row and column sums
rowSums(A)
colSums(A)
# Matrix multiplication
C <- A %*% B
C
# Replace diagonal with zero
diag(A) <- 0
A
# Transpose of matrix B
t(B)
# Determinant
det(A)
# det(B) # (Determinant not defined for non-square matrix)
Q6. Dataframe Basics
# Create dataframe
employee <- data.frame(
empid = c(101, 102, 103, 104),
empname = c("Rahul", "Sneha", "Amit", "Priya"),
salary = c(45000, 52000, 48000, 50000),
start_date = as.Date(c("2022-05-01", "2023-06-15", "2021-07-20", "2024-01-10"))
)
# Display data
employee
# Display empname and salary
employee[, c("empname", "salary")]
# Extract first two rows
employee[1:2, ]
# Extract specific rows and columns
employee[c(1, 3), c(2, 3)]
Q7. Factors
# Create rating vector
rating <- c("satisfied", "not satisfied", "highly satisfied", "partly satisfied", "OK")
# Factor without order
Rating_factor1 <- factor(rating)
Rating_factor1
# Factor with order
Rating_factor2 <- factor(rating,
levels = c("not satisfied", "partly satisfied", "OK", "satisfied", "highly satisfied"),
ordered = TRUE
)
Rating_factor2
Q8. Lists
# Create list
lst <- list(
seq_1_to_10 = 1:10,
months = month.abb,
identity_matrix = diag(3)
)
# Print list
lst
# Structure of list
str(lst)
# Access month "Mar"
lst$months[3]
Q9. Read/Write CSV Files
# Write to CSV
write.csv(employee, "employee.csv", row.names = FALSE)
# Read from CSV
employee_csv <- read.csv("employee.csv")
employee_csv
Q10. Read/Write Excel Files
# Load libraries
library(readxl)
library(writexl)
# Write Excel file
write_xlsx(employee, "employee.xlsx")
# Read Excel file
employee_excel <- read_excel("employee.xlsx")
employee_excel
Q11. Read Input from Console
# Read input from console
a1 <- readline(prompt = "Enter a value: ")
print(a1)
Q12. Read Multiple Inputs from Console
# Read 10 inputs
y <- numeric(10)
for (i in 1:10) {
y[i] <- as.numeric(readline(prompt = paste("Enter number", i, ": ")))
}
print(y)
Q13. Attach/Detach Dataframes
# Attach dataframe
attach(employee)
# Access columns
empname
salary
# Detach dataframe
detach(employee)
# Access after detach (will give error)
# empname
Q14. Workspace Management
# Objects before removal
ls()
# Remove objects
rm(rating, a1, y)
# Objects after removal
ls()
Assignment No. 2 – Data Pre-Processing
Q1. Load mtcars dataset
# Load dataset into dataframe
cars <- mtcars
# Display first 10 rows
head(cars, 10)
# Display last 5 rows
tail(cars, 5)
# Display column names
colnames(cars)
# Display summary
summary(cars)
# Display total rows and columns
dim(cars)
nrow(cars)
ncol(cars)
# Display 20th and 25th rows
cars[c(20, 25), ]
# Display hp value of 15th row
cars$hp[15]
Q2. Convert pH levels into an ordered factor
# Create pH vector
pH <- c("acidic", "neutral", "alkaline", "neutral")
# Convert to ordered factor
pH_factor <- factor(
pH,
levels = c("acidic", "neutral", "alkaline"),
ordered = TRUE
)
pH_factor
Q3. Dataframe operations using dplyr
# Install and load dplyr (install only once)
# install.packages("dplyr")
library(dplyr)
# Create dataframe
students1 <- data.frame(
ID = 1:10,
Name = c("Asha", "Ravi", "Meera", "Kiran", "Priya", "Vikram", "Latha", "Arjun", "Sneha", "Ramesh"),
Age = c(20, 21, 22, 20, 23, 24, 21, 22, 23, 20),
Gender = c("F", "M", "F", "M", "F", "M", "F", "M", "F", "M"),
Marks = c(85, 78, 92, 65, 88, 70, 95, 80, 90, 60),
Department = c("CS", "Math", "CS", "Physics", "Math", "CS", "Physics", "Math", "CS", "Physics")
)
# Select Name and Marks
students1 %>% select(Name, Marks)
# Filter marks > 80
students1 %>% filter(Marks > 80)
# Arrange by marks (descending)
students1 %>% arrange(desc(Marks))
# Rename Marks to Score
students1 %>% rename(Score = Marks)
# Distinct departments
students1 %>% distinct(Department)
# Average marks
students1 %>% summarise(Average_Marks = mean(Marks))
# Average marks by department
students1 %>%
group_by(Department) %>%
summarise(Average_Marks = mean(Marks))
# Count students per department
students1 %>%
group_by(Department) %>%
summarise(Student_Count = n())
# Highest marks in each department
students1 %>%
group_by(Department) %>%
slice_max(Marks, n = 1)
# Average marks by gender
students1 %>%
group_by(Gender) %>%
summarise(Average_Marks = mean(Marks))
# Arrange by department then marks
students1 %>% arrange(Department, desc(Marks))
# Combined operations
students1 %>%
filter(Marks > 70) %>%
group_by(Gender) %>%
summarise(Average_Marks = mean(Marks))
Q4. Handle missing values using Hmisc library
# Install and load Hmisc (install only once)
# install.packages("Hmisc")
library(Hmisc)
# Create dataframe
students2 <- data.frame(
ID = 1:8,
Name = c("Asha", "Ravi", "Meera", "Kiran", "Priya", "Vikram", "Sneha", "Ramesh"),
Age = c(20, 21, NA, 20, 23, 24, NA, 22),
Marks = c(85, 78, 92, NA, 88, 70, 90, NA),
Gender = c("F", "M", "F", "M", NA, "M", "F", "M")
)
# Count missing values
colSums(is.na(students2))
# Replace missing Age with mean
students2$Age[is.na(students2$Age)] <- mean(students2$Age, na.rm = TRUE)
# Replace missing Marks with median
students2$Marks[is.na(students2$Marks)] <- median(students2$Marks, na.rm = TRUE)
# Note: the remaining Marks/Age replacements are alternative strategies.
# After the mean/median fills above, those columns have no NAs left,
# so these lines are no-ops here and are shown for reference.
# (Alternative) Replace missing Marks with a constant value
students2$Marks[is.na(students2$Marks)] <- 60
# Replace missing Gender with "Unknown"
students2$Gender[is.na(students2$Gender)] <- "Unknown"
# (Alternative) Replace missing Marks with random observed values
students2$Marks[is.na(students2$Marks)] <- sample(
students2$Marks[!is.na(students2$Marks)],
sum(is.na(students2$Marks)),
replace = TRUE
)
# (Alternative) Replace missing Age with minimum value
students2$Age[is.na(students2$Age)] <- min(students2$Age, na.rm = TRUE)
# Create new dataset students3
students3 <- students2
# Replace numeric NA with mean
num_cols <- sapply(students3, is.numeric)
students3[num_cols] <- lapply(students3[num_cols], function(x) {
x[is.na(x)] <- mean(x, na.rm = TRUE)
x
})
# Replace character NA with "Missing"
char_cols <- sapply(students3, is.character)
students3[char_cols] <- lapply(students3[char_cols], function(x) {
x[is.na(x)] <- "Missing"
x
})
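Since the question names the Hmisc library, its `impute()` helper can perform the same fills more compactly. A minimal sketch, assuming Hmisc is installed and the columns still contain NAs (`as.numeric()` strips the special "impute" class that Hmisc attaches):

```r
library(Hmisc)

# impute() fills NAs; imputed entries are flagged with * when printed
students2$Age   <- as.numeric(impute(students2$Age, mean))    # mean imputation
students2$Marks <- as.numeric(impute(students2$Marks, median)) # median imputation
students2$Marks <- as.numeric(impute(students2$Marks, 60))     # constant imputation
```

`impute()` defaults to the median when no function or constant is supplied.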
Assignment No. 3 – Data Partitioning
Q1. Range Partitioning (emp2527_range)
-- Create range partitioned table
CREATE TABLE emp2527_range (
eid NUMBER(5),
ename VARCHAR2(30),
salary NUMBER(10)
)
PARTITION BY RANGE (salary) (
PARTITION p1 VALUES LESS THAN (5000),
PARTITION p2 VALUES LESS THAN (15000),
PARTITION p3 VALUES LESS THAN (20000)
);
-- Insert records
INSERT INTO emp2527_range VALUES (101, 'Amit', 3000);
INSERT INTO emp2527_range VALUES (102, 'Neha', 4500);
INSERT INTO emp2527_range VALUES (103, 'Ravi', 8000);
INSERT INTO emp2527_range VALUES (104, 'Priya', 12000);
INSERT INTO emp2527_range VALUES (105, 'Suman', 16000);
INSERT INTO emp2527_range VALUES (106, 'Raj', 18000);
COMMIT;
-- Display data from table
SELECT * FROM emp2527_range;
-- Display data from each partition
SELECT * FROM emp2527_range PARTITION (p1);
SELECT * FROM emp2527_range PARTITION (p2);
SELECT * FROM emp2527_range PARTITION (p3);
-- Split partition p2
ALTER TABLE emp2527_range SPLIT PARTITION p2 AT (10000) INTO (PARTITION p21, PARTITION p22);
-- Add new partition p4
ALTER TABLE emp2527_range ADD PARTITION p4 VALUES LESS THAN (MAXVALUE);
-- Merge partitions p21 and p22 back into p2
ALTER TABLE emp2527_range MERGE PARTITIONS p21, p22 INTO PARTITION p2;
-- Drop partition p4
ALTER TABLE emp2527_range DROP PARTITION p4;
Q2. List Partitioning (customer_list2527)
-- Create list partitioned table
CREATE TABLE customer_list2527 (
cust_id NUMBER(5),
cust_name VARCHAR2(30),
cust_city VARCHAR2(20),
balance NUMBER(10)
)
PARTITION BY LIST (cust_city) (
PARTITION North VALUES ('Delhi', 'Lucknow', 'Chandigarh'),
PARTITION South VALUES ('Chennai', 'Bangalore', 'Hyderabad'),
PARTITION East VALUES ('Bhimapur', 'Kolkata', 'Dispur', 'Gangtok'),
PARTITION West VALUES ('Mumbai', 'Goa', 'Nasik', 'Nagpur'),
PARTITION other_values VALUES (DEFAULT)
);
-- Insert records
INSERT INTO customer_list2527 VALUES (201, 'Arjun', 'Delhi', 12000);
INSERT INTO customer_list2527 VALUES (202, 'Meera', 'Lucknow', 7000);
INSERT INTO customer_list2527 VALUES (203, 'Rahul', 'Chennai', 9500);
INSERT INTO customer_list2527 VALUES (204, 'Kiran', 'Bangalore', 18000);
INSERT INTO customer_list2527 VALUES (205, 'Sanjay', 'Mumbai', 5000);
INSERT INTO customer_list2527 VALUES (206, 'Neel', 'Goa', 15000);
INSERT INTO customer_list2527 VALUES (207, 'Rekha', 'Kolkata', 9000);
INSERT INTO customer_list2527 VALUES (208, 'Vinay', 'Dispur', 25000);
INSERT INTO customer_list2527 VALUES (209, 'Tarun', 'Jaipur', 8000); -- (Goes to default)
COMMIT;
-- Display data from table
SELECT * FROM customer_list2527;
-- Display data from each partition
SELECT * FROM customer_list2527 PARTITION (North);
SELECT * FROM customer_list2527 PARTITION (South);
SELECT * FROM customer_list2527 PARTITION (East);
SELECT * FROM customer_list2527 PARTITION (West);
SELECT * FROM customer_list2527 PARTITION (other_values);
-- Display customers from West region with balance < 10000
SELECT * FROM customer_list2527 PARTITION (West) WHERE balance < 10000;
-- Add a new city to East partition
ALTER TABLE customer_list2527 MODIFY PARTITION East ADD VALUES ('Patna');
Q3. Hash Partitioning (student_hash2527)
-- Create hash partitioned table
CREATE TABLE student_hash2527 (
stud_id NUMBER(5),
stud_name VARCHAR2(30),
stud_city VARCHAR2(20),
balance NUMBER(10),
course_id NUMBER(3)
)
PARTITION BY HASH (course_id) PARTITIONS 4;
-- Insert records
INSERT INTO student_hash2527 VALUES (301, 'Rita', 'Delhi', 8000, 101);
INSERT INTO student_hash2527 VALUES (302, 'Manoj', 'Mumbai', 12000, 102);
INSERT INTO student_hash2527 VALUES (303, 'Anita', 'Chennai', 7000, 103);
INSERT INTO student_hash2527 VALUES (304, 'Suresh', 'Kolkata', 9000, 104);
INSERT INTO student_hash2527 VALUES (305, 'Rohan', 'Pune', 6000, 105);
INSERT INTO student_hash2527 VALUES (306, 'Preeti', 'Jaipur', 15000, 106);
COMMIT;
-- Display all records
SELECT * FROM student_hash2527;
-- Display table name and partition names
SELECT table_name, partition_name FROM user_tab_partitions WHERE table_name = 'STUDENT_HASH2527';
-- Display data from each partition (Replace SYS_Pxxx with actual names)
-- SELECT * FROM student_hash2527 PARTITION (SYS_Pxxx);
-- SELECT * FROM student_hash2527 PARTITION (SYS_Pyyy);
Assignment No. 4 – Association Rule Mining
Q1. Use weather.csv dataset
# Install arules package (install only once)
# install.packages("arules")
library(arules)
# Read the dataset
weather <- read.csv("weather.csv")
# Convert selected columns to factor
weather[, c(1, 2, 3, 5)] <- lapply(weather[, c(1, 2, 3, 5)], as.factor)
# Convert dataframe to transactions
w_trans <- as(weather, "transactions")
# Apply Apriori Algorithm
# (i) Default parameter settings
rules_default <- apriori(w_trans)
inspect(rules_default)
# (ii) Support = 0.4
rules_supp <- apriori(w_trans, parameter = list(supp = 0.4))
inspect(rules_supp)
# (iii) Confidence = 1
rules_conf <- apriori(w_trans, parameter = list(conf = 1))
inspect(rules_conf)
# (iv) Confidence = 1, minlen = 2, maxlen = 3
rules_conf_len <- apriori(
w_trans,
parameter = list(conf = 1, minlen = 2, maxlen = 3)
)
inspect(rules_conf_len)
# (v) Support = 0.2, minlen = 2, maxlen = 3
rules_supp_len <- apriori(
w_trans,
parameter = list(supp = 0.2, minlen = 2, maxlen = 3)
)
inspect(rules_supp_len)
Q2. Use Adult dataset
# Load library
library(arules)
# Load Adult dataset
data("Adult")
a <- Adult
# Display first 6 transactions
inspect(head(a, 6))
# Generate rules with default settings
rules_adult <- apriori(a)
# Display first 6 rules
inspect(head(rules_adult, 6))
# Display last 6 rules
inspect(tail(rules_adult, 6))
# Display summary of rules
summary(rules_adult)
# Confidence = 0.9, Support = 0.5, minlen = 4
rules_high_conf <- apriori(
a,
parameter = list(conf = 0.9, supp = 0.5, minlen = 4)
)
inspect(rules_high_conf)
# Support = 0.6, minlen = 4
rules_high_supp <- apriori(
a,
parameter = list(supp = 0.6, minlen = 4)
)
inspect(rules_high_supp)
Q3. Use Groceries dataset
# Load Groceries dataset
data("Groceries")
gro <- Groceries
# Display first 3 transactions
inspect(gro[1:3])
# Apply Apriori algorithm
rules_gro <- apriori(
gro,
parameter = list(supp = 0.0015, conf = 0.5, maxlen = 5)
)
inspect(rules_gro)
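With low support thresholds, `apriori()` on Groceries can generate thousands of rules, so it helps to look at the strongest ones first. A small follow-up using the standard arules `sort()` and `inspect()` functions:

```r
library(arules)

# Show the 5 rules with the highest lift
inspect(head(sort(rules_gro, by = "lift"), 5))

# Or the 5 rules with the highest confidence
inspect(head(sort(rules_gro, by = "confidence"), 5))
```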
Assignment No. 5 – Regression
Q1. House Price Prediction using Linear Regression
# Create dataframe
house <- data.frame(
SalesPrice = c(75.0, 60.5, 82.3, 55.0, 90.0, 68.7, 48.3, 72.5, 65.0, 85.4),
SqFt = c(20, 18, 25, 15, 27, 22, 16, 21, 19, 24),
Stories = c(2, 1, 2, 1, 3, 2, 1, 2, 2, 3),
Bathrooms = c(3, 2, 3, 2, 4, 3, 1, 2, 2, 3),
Age = c(15, 10, 8, 20, 5, 12, 30, 18, 25, 7)
)
# Simple linear regression using SqFt
model1 <- lm(SalesPrice ~ SqFt, data = house)
# Summary of simple regression
summary(model1)
# Plot the model (requires ggplot2)
# install.packages("ggplot2")
library(ggplot2)
ggplot(house, aes(x = SqFt, y = SalesPrice)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(
title = "Sales Price vs Square Feet",
x = "Square Feet (hundreds)",
y = "Sales Price (Lakhs)"
)
# Multiple linear regression using all predictors
model2 <- lm(SalesPrice ~ SqFt + Stories + Bathrooms + Age, data = house)
# Summary of multiple regression
summary(model2)
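Once fitted, the multiple regression model can be used for prediction with `predict()`. A minimal sketch; the predictor values for the new house below are made up for illustration:

```r
# Predict the sales price of a hypothetical new house
new_house <- data.frame(SqFt = 23, Stories = 2, Bathrooms = 3, Age = 10)
predict(model2, newdata = new_house)

# Same prediction with a 95% prediction interval
predict(model2, newdata = new_house, interval = "prediction")
```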
Q2. Air Quality Prediction using Linear Regression
# Load dataset
data("airquality")
# Display summary
summary(airquality)
# Remove missing values
air <- na.omit(airquality)
# Simple linear regression (Ozone ~ Solar.R)
model3 <- lm(Ozone ~ Solar.R, data = air)
# Summary of model
summary(model3)
# Plot the model (ggplot2 required)
library(ggplot2)
ggplot(air, aes(x = Solar.R, y = Ozone)) +
geom_point() +
geom_smooth(method = "lm", se = TRUE) +
labs(
title = "Ozone vs Solar Radiation",
x = "Solar Radiation",
y = "Ozone Level"
)
# Multiple linear regression using multiple predictors
model4 <- lm(Ozone ~ Wind + Solar.R + Temp, data = air)
# Summary of multiple regression model
summary(model4)
Assignment No. 6 – Classification
Q1. Classification using Naïve Bayes (fruits_data.csv)
Objective: Classify fruits based on their features using the Naïve Bayes algorithm.
# Read dataset
fruits <- read.csv("fruits_data.csv")
# Convert class variable to factor
fruits$Fruit <- as.factor(fruits$Fruit)
# Display structure and dimensions
str(fruits)
dim(fruits)
# Split dataset into training and test set (60:40)
set.seed(123)
index <- sample(nrow(fruits), floor(nrow(fruits) * 0.6))
train_fruits <- fruits[index, ]
test_fruits <- fruits[-index, ]
# Prepare training data
xtrain <- train_fruits[, -which(names(train_fruits) == "Fruit")]
ytrain <- train_fruits$Fruit
# Train Naïve Bayes model
library(caret)
nbmodel <- train(
xtrain,
ytrain,
method = "nb",
trControl = trainControl(method = "cv", number = 10)
)
# Predict on test data
xtest <- test_fruits[, -which(names(test_fruits) == "Fruit")]
ytest <- test_fruits$Fruit
pred_nb <- predict(nbmodel$finalModel, xtest)
# Confusion matrix
tab_nb <- table(pred_nb$class, ytest)
# Accuracy function
acc <- function(x) {
sum(diag(x)) / sum(x)
}
acc(tab_nb)
confusionMatrix(tab_nb)
Q2. Classification using K-Nearest Neighbors (Default dataset)
Objective: Predict default status using KNN with varying values of k.
# Load required libraries
library(ISLR)
library(class)
# Load dataset
d <- Default
# Display structure and summary
str(d)
summary(d)
# Remove student column
d <- d[, !(names(d) == "student")]
# Display dimensions and column names
dim(d)
colnames(d)
# Normalize numeric columns
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
d_norm <- d
d_norm[, 2:3] <- as.data.frame(lapply(d[, 2:3], nor))
# Set seed
set.seed(142)
# Split dataset (75:25)
index <- sample(nrow(d_norm), floor(nrow(d_norm) * 0.75))
train_d <- d_norm[index, ]
test_d <- d_norm[-index, ]
train_X <- as.matrix(train_d[, -1])
test_X <- as.matrix(test_d[, -1])
train_Y <- as.factor(d[index, 1])
test_Y <- as.factor(d[-index, 1])
# KNN with k = 1
m1 <- knn(train_X, test_X, cl = train_Y, k = 1)
tab_m1 <- table(m1, test_Y)
# KNN with k = 3
m2 <- knn(train_X, test_X, cl = train_Y, k = 3)
tab_m2 <- table(m2, test_Y)
# KNN with k = 21
m3 <- knn(train_X, test_X, cl = train_Y, k = 21)
tab_m3 <- table(m3, test_Y)
# Accuracy function
acc <- function(x) {
sum(diag(x)) / sum(x)
}
acc(tab_m1)
acc(tab_m2)
acc(tab_m3)
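Rather than checking k = 1, 3, and 21 by hand, the same comparison can be run in a loop. A sketch, assuming the `train_X`/`test_X`/`train_Y`/`test_Y` objects created above:

```r
library(class)

# Accuracy for a range of odd k values
ks <- seq(1, 21, by = 2)
accs <- sapply(ks, function(k) {
  pred <- knn(train_X, test_X, cl = train_Y, k = k)
  sum(diag(table(pred, test_Y))) / length(test_Y)
})
data.frame(k = ks, accuracy = accs)
```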
Q3. Classification using ID3 Algorithm (mobile_purchase_data.csv)
Objective: Build a decision tree to predict mobile purchases.
# Load libraries
library(rpart)
library(rpart.plot)
library(caret)
# Read dataset
mobile <- read.csv("mobile_purchase_data.csv")
# Display summary
summary(mobile)
# Convert class variable to factor
mobile$purchases_mobile <- as.factor(mobile$purchases_mobile)
# Set seed
set.seed(123)
# Split dataset (80:20)
index <- sample(nrow(mobile), floor(nrow(mobile) * 0.8))
train_mob <- mobile[index, ]
test_mob <- mobile[-index, ]
# Display dimensions
dim(train_mob)
dim(test_mob)
# Build decision tree (rpart implements CART; used here for the ID3-style exercise)
mob_tree <- rpart(
purchases_mobile ~ .,
data = train_mob,
method = "class"
)
# Plot tree
rpart.plot(
mob_tree,
type = 4,
extra = 101,
fallen.leaves = TRUE,
main = "Mobile Purchase Decision Tree"
)
# Predict test data
pred_mob <- predict(mob_tree, test_mob, type = "class")
# Confusion matrix
tab_mob <- table(pred_mob, test_mob$purchases_mobile)
# Accuracy
acc(tab_mob)
confusionMatrix(tab_mob)
Q4. Classification using C4.5 Algorithm (wdbc.csv)
Objective: Use the J48 (C4.5) algorithm to classify breast cancer diagnosis.
# Load required libraries
library(dplyr)
library(RWeka)
library(partykit)
library(caret)
# Read dataset
wdbc <- read.csv("wdbc.csv")
# Display summary
summary(wdbc)
# Convert Diagnosis to factor
wdbc$Diagnosis <- as.factor(wdbc$Diagnosis)
# Set seed
set.seed(123)
# Split dataset (70:30)
train_wdbc <- wdbc %>% sample_frac(0.7)
test_wdbc <- setdiff(wdbc, train_wdbc)
# Display dimensions
dim(train_wdbc)
dim(test_wdbc)
# Build C4.5 (J48) model
j48_model <- J48(Diagnosis ~ ., data = train_wdbc)
# Display model summary
summary(j48_model)
# Plot decision tree
plot(as.party(j48_model))
# Predict test data
pred_wdbc <- predict(j48_model, test_wdbc, type = "class")
# Confusion matrix
tab_wdbc <- table(pred_wdbc, test_wdbc$Diagnosis)
# Accuracy
acc(tab_wdbc)
confusionMatrix(tab_wdbc)
Assignment No. 7 – Clustering
Q1. Agglomerative Clustering (PlantGrowth)
Objective: Perform hierarchical clustering using complete and average linkage methods on the weight feature.
# a) Store the dataset
data("PlantGrowth")
plant <- PlantGrowth
# b) Structure and dimensions
str(plant)
dim(plant)
# Distance using only weight
d <- dist(plant$weight)
# -------- Complete Linkage --------
# c) Cluster using complete method
hc_c <- hclust(d, method = "complete")
# d) Summary of cluster
summary(hc_c)
# e) Plot dendrogram
plot(hc_c, main = "Complete Linkage Dendrogram")
# f) Dendrogram with 3 clusters (green)
rect.hclust(hc_c, k = 3, border = "green")
# g) Cut tree into 3 clusters
cutree(hc_c, k = 3)
# -------- Average Linkage --------
# h) Cluster using average method
hc_a <- hclust(d, method = "average")
summary(hc_a)
plot(hc_a, main = "Average Linkage Dendrogram")
rect.hclust(hc_a, k = 3, border = "green")
cutree(hc_a, k = 3)
Q2. K-Means Clustering (PlantGrowth)
Objective: Group plants into 3 clusters based on weight using K-Means.
# a) Store the dataset
plant <- PlantGrowth
# b) Structure and dimensions
str(plant)
dim(plant)
# c) K-means clustering (3 clusters using weight)
set.seed(10) # (For reproducibility)
clust <- kmeans(plant$weight, centers = 3)
# d) Summary of cluster
print(clust)
# e) Cluster centers
clust$centers
# f) Cluster sizes
clust$size
# g) Predicted clusters
clust_f <- as.factor(clust$cluster)
# Add cluster column to dataframe
plant$cluster <- clust_f
# h) Plot clusters using ggplot2
library(ggplot2)
ggplot(plant, aes(x = weight, y = cluster, color = cluster)) +
geom_point(size = 3) +
ggtitle("K-Means Clustering on PlantGrowth (Weight Feature)") +
theme_minimal()
Assignment No. 8 – Analytical Queries (SQL)
Objective: Use advanced SQL OLAP functions like ROLLUP, CUBE, and Window Functions (RANK, DENSE_RANK, LEAD, LAG).
Q1. Create Sales table and insert records
-- Create table
CREATE TABLE Sales (
Year NUMBER(4),
Region VARCHAR2(20),
Department VARCHAR2(30),
Profit NUMBER(10)
);
-- Insert records
INSERT INTO Sales VALUES (1996, 'Central', 'Pen_sales', 34000);
INSERT INTO Sales VALUES (1996, 'Central', 'Pen_sales', 41000);
INSERT INTO Sales VALUES (1996, 'Central', 'Book_sales', 74000);
INSERT INTO Sales VALUES (1996, 'East', 'Pen_sales', 89000);
INSERT INTO Sales VALUES (1996, 'East', 'Book_sales', 100000);
INSERT INTO Sales VALUES (1996, 'East', 'Book_sales', 15000);
INSERT INTO Sales VALUES (1996, 'West', 'Pen_sales', 87000);
INSERT INTO Sales VALUES (1996, 'West', 'Book_sales', 86000);
INSERT INTO Sales VALUES (1997, 'Central', 'Pen_sales', 82000);
INSERT INTO Sales VALUES (1997, 'Central', 'Book_sales', 85000);
INSERT INTO Sales VALUES (1997, 'East', 'Pen_sales', 100000);
INSERT INTO Sales VALUES (1997, 'East', 'Book_sales', 137000);
INSERT INTO Sales VALUES (1997, 'West', 'Pen_sales', 96000);
INSERT INTO Sales VALUES (1997, 'West', 'Book_sales', 97000);
COMMIT;
Q2. Total profit for each year in each region (ROLLUP)
SELECT
Year, Region,
SUM(Profit) AS TotalProfit
FROM Sales
GROUP BY ROLLUP (Year, Region)
ORDER BY Year, Region;
Q3. Average profit for Year, Region, Dept (ROLLUP)
SELECT
Year, Region, Department,
AVG(Profit) AS AvgProfit
FROM Sales
GROUP BY ROLLUP (Year, Region, Department)
ORDER BY Year, Region, Department;
Q4. Average profit for all combinations of Year and Region (CUBE)
SELECT
Year, Region,
AVG(Profit) AS AvgProfit
FROM Sales
GROUP BY CUBE (Year, Region)
ORDER BY Year, Region;
Q5. Total profit for all combinations (CUBE)
SELECT
Year, Region, Department,
SUM(Profit) AS TotalProfit
FROM Sales
GROUP BY CUBE (Year, Region, Department)
ORDER BY Year, Region, Department;
Q6. Rank all records based on profit
SELECT
Year, Region, Department, Profit,
RANK() OVER (ORDER BY Profit DESC) AS ProfitRank
FROM Sales
ORDER BY Profit DESC;
Q7. Rank based on profit (Partition by Department)
SELECT
Year, Region, Department, Profit,
RANK() OVER (
PARTITION BY Department
ORDER BY Profit DESC
) AS DeptRank
FROM Sales
ORDER BY Department, Profit DESC;
Q8. Rank based on profit (Partition by Region)
SELECT
Year, Region, Department, Profit,
RANK() OVER (
PARTITION BY Region
ORDER BY Profit DESC
) AS RegionRank
FROM Sales
ORDER BY Region, Profit DESC;
Q9. Dense Rank based on profit for each region
SELECT
Year, Region, Department, Profit,
DENSE_RANK() OVER (
PARTITION BY Region
ORDER BY Profit DESC
) AS RegionDenseRank
FROM Sales
ORDER BY Region, Profit DESC;
Q10. Next 2 profits (LEAD)
SELECT
Year, Region, Department, Profit,
LEAD(Profit, 1, 11111) OVER (
PARTITION BY Department ORDER BY Year, Region
) AS NextProfit1,
LEAD(Profit, 2, 11111) OVER (
PARTITION BY Department ORDER BY Year, Region
) AS NextProfit2
FROM Sales
ORDER BY Department, Year, Region;
Q11. Previous 3 profits (LAG)
SELECT
Year, Region, Department, Profit,
LAG(Profit, 1, 0) OVER (
PARTITION BY Department ORDER BY Year, Region
) AS PrevProfit1,
LAG(Profit, 2, 0) OVER (
PARTITION BY Department ORDER BY Year, Region
) AS PrevProfit2,
LAG(Profit, 3, 0) OVER (
PARTITION BY Department ORDER BY Year, Region
) AS PrevProfit3
FROM Sales
ORDER BY Department, Year, Region;
Q12. Most and least profitable department (Overall)
SELECT
Department,
SUM(Profit) AS TotalProfit,
MAX(SUM(Profit)) OVER () AS MostProfitable,
MIN(SUM(Profit)) OVER () AS LeastProfitable
FROM Sales
GROUP BY Department
ORDER BY TotalProfit DESC;
Q13. Most and least profitable department (Region-wise)
SELECT
Region, Department,
SUM(Profit) AS TotalProfit,
MAX(SUM(Profit)) OVER (PARTITION BY Region) AS MostProfitableInRegion,
MIN(SUM(Profit)) OVER (PARTITION BY Region) AS LeastProfitableInRegion
FROM Sales
GROUP BY Region, Department
ORDER BY Region, TotalProfit DESC;
Q14. Total profit per region (full window: unbounded preceding to unbounded following)
SELECT
Year, Region, Department, Profit,
SUM(Profit) OVER (
PARTITION BY Region
ORDER BY Year
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS RunningTotal
FROM Sales
ORDER BY Region, Year;
Q15. Region-wise total profit (Window: 3 Preceding, 1 Following)
SELECT
Year, Region, Department, Profit,
SUM(Profit) OVER (
PARTITION BY Region
ORDER BY Year
ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING
) AS WindowTotal
FROM Sales
ORDER BY Region, Year;
Q16. Region-wise total profit (Window: 2 Preceding only)
SELECT
Year, Region, Department, Profit,
SUM(Profit) OVER (
PARTITION BY Region
ORDER BY Year
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS WindowTotal
FROM Sales
ORDER BY Region, Year;
Q17. Region-wise running total (All Previous)
SELECT
Year, Region, Department, Profit,
SUM(Profit) OVER (
PARTITION BY Region
ORDER BY Year
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS RunningTotal
FROM Sales
ORDER BY Region, Year;
Assignment No. 9 – ORDBMS (Oracle)
Objective: Implement Object-Relational features including Object Types, Nested Objects, Member Functions, and Inheritance.
Q1. Object Types for Customer
-- a. Create object types and table
CREATE OR REPLACE TYPE address_ty AS OBJECT (
street VARCHAR2(100),
city VARCHAR2(50),
state VARCHAR2(50),
zipcode VARCHAR2(10)
);
/
CREATE OR REPLACE TYPE person_ty AS OBJECT (
name VARCHAR2(50),
address address_ty
);
/
CREATE TABLE Customer1 (
custid NUMBER PRIMARY KEY,
custdetails person_ty
);
-- Insert records
INSERT INTO Customer1 VALUES (
101, person_ty('Raj', address_ty('MG Road', 'Pune', 'Maharashtra', '411001'))
);
INSERT INTO Customer1 VALUES (
102, person_ty('Priya', address_ty('Gokhle Road', 'Pune', 'Maharashtra', '411004'))
);
INSERT INTO Customer1 VALUES (
103, person_ty('Amit', address_ty('Curry Road', 'Mumbai', 'Maharashtra', '400706'))
);
INSERT INTO Customer1 VALUES (
104, person_ty('Sneha', address_ty('Bright Road', 'Bangalore', 'Karnataka', '560001'))
);
INSERT INTO Customer1 VALUES (
105, person_ty('Vikram Singh', address_ty('Bandra West', 'Mumbai', 'Maharashtra', '400706'))
);
COMMIT;
-- b. Display customer id and person name
SELECT c.custid, c.custdetails.name AS person_name FROM Customer1 c;
-- c. Display customer details belonging to city 'Pune'
SELECT
c.custid,
c.custdetails.name AS person_name,
c.custdetails.address.street AS street,
c.custdetails.address.city AS city,
c.custdetails.address.state AS state,
c.custdetails.address.zipcode AS zipcode
FROM Customer1 c
WHERE c.custdetails.address.city = 'Pune';
-- d. Display count of customers state-wise
SELECT c.custdetails.address.state AS state, COUNT(*) AS customer_count
FROM Customer1 c
GROUP BY c.custdetails.address.state
ORDER BY customer_count DESC;
-- e. Display city and customer name
SELECT c.custdetails.address.city AS city, c.custdetails.name AS customer_name FROM Customer1 c;
-- f. Find all customers with zipcode '400706'
SELECT
c.custid,
c.custdetails.name AS customer_name,
c.custdetails.address.street AS street,
c.custdetails.address.city AS city,
c.custdetails.address.zipcode AS zipcode
FROM Customer1 c
WHERE c.custdetails.address.zipcode = '400706';
-- g. Display count of customers city-wise
SELECT c.custdetails.address.city AS city, COUNT(*) AS customer_count
FROM Customer1 c
GROUP BY c.custdetails.address.city
ORDER BY customer_count DESC;
Q2. Doctor Type with Member Function
-- a. Create object type and table
CREATE OR REPLACE TYPE Doctor_type AS OBJECT (
DoctorId NUMBER,
Dname VARCHAR2(50),
TelNo VARCHAR2(15),
Specialization VARCHAR2(50),
MEMBER FUNCTION GetCharges(nhours NUMBER) RETURN NUMBER
);
/
CREATE OR REPLACE TYPE BODY Doctor_type AS
MEMBER FUNCTION GetCharges(nhours NUMBER) RETURN NUMBER IS
BEGIN
RETURN nhours * 300;
END GetCharges;
END;
/
CREATE TABLE Doctor OF Doctor_type;
-- b. Insert records
INSERT INTO Doctor VALUES (Doctor_type(1, 'Dr. Smith', '555-0101', 'Cardiology'));
INSERT INTO Doctor VALUES (Doctor_type(2, 'Dr. Johnson', '555-0102', 'Neurology'));
INSERT INTO Doctor VALUES (Doctor_type(3, 'Dr. Williams', '555-0103', 'Orthopedics'));
INSERT INTO Doctor VALUES (Doctor_type(4, 'Dr. Brown', '555-0104', 'Pediatrics'));
INSERT INTO Doctor VALUES (Doctor_type(5, 'Dr. Davis', '555-0105', 'Dermatology'));
COMMIT;
-- c. Display details and charges for 12 hours
SELECT DoctorId, Dname, D.GetCharges(12) AS Charges FROM Doctor D;
Q3. Object Inheritance Hierarchy
-- a. Create base type and subtypes
CREATE OR REPLACE TYPE emp_type AS OBJECT (
ename VARCHAR2(50),
addr VARCHAR2(50),
salary NUMBER,
jobtitle VARCHAR2(50)
) NOT FINAL;
/
CREATE OR REPLACE TYPE developer UNDER emp_type (
prog_lang VARCHAR2(50)
) NOT FINAL;
/
CREATE OR REPLACE TYPE manager UNDER emp_type (
no_of_subordinates NUMBER
);
/
CREATE OR REPLACE TYPE programmer UNDER developer (
no_of_projects NUMBER
);
/
CREATE TABLE employee OF emp_type;
-- b. Insert records
INSERT INTO employee VALUES(manager('Ram', 'Nerul', 80000, 'Manager', 5));
INSERT INTO employee VALUES(emp_type('John', 'Thane', 25000, 'Accountant'));
INSERT INTO employee VALUES(developer('Sita', 'Belapur', 72000, 'Developer', 'Java'));
INSERT INTO employee VALUES(developer('Joseph', 'Airoli', 72000, 'Developer', 'Python'));
INSERT INTO employee VALUES(programmer('Geeta', 'Thane', 100000, 'Developer', 'Solidity', 2));
COMMIT;
-- c. Display all records
SELECT * FROM employee;
-- d. Display only developer records
SELECT * FROM employee e WHERE TREAT(VALUE(e) AS developer) IS NOT NULL;
-- e. Display programming language of developers and programmers
SELECT ename, TREAT(VALUE(e) AS developer).prog_lang AS prog_lang
FROM employee e
WHERE VALUE(e) IS OF (developer, programmer);
-- f. Display names and subordinates for managers
SELECT ename, TREAT(VALUE(e) AS manager).no_of_subordinates AS no_of_subordinates
FROM employee e
WHERE VALUE(e) IS OF (manager);
-- g. Display programmers and projects
SELECT ename, TREAT(VALUE(e) AS programmer).no_of_projects AS no_of_projects
FROM employee e
WHERE VALUE(e) IS OF (programmer);
-- h. Retrieve employees earning > 50000
SELECT * FROM employee WHERE salary > 50000;
-- i. Update salary of all managers by 10%
UPDATE employee e SET e.salary = e.salary * 1.10 WHERE VALUE(e) IS OF (manager);
COMMIT;
Assignment No. 10 – Pentaho (ETL)
Objective: Perform Extract, Transform, and Load (ETL) operations using Pentaho Data Integration (PDI / Spoon).
Q1. Sort CSV and Load
Step 1: Use CSV File Input step to select the source CSV file (fields: empno, FullName, deptno, deptname).
Step 2: Connect to Sort Rows step.
Add Sort field: deptno (Ascending).
Add Sort field: empno (Descending).
Step 3: Connect to Table Output. Configure database connection and set target table to sorted.
Q2. Split Excel Name
Step 1: Use Microsoft Excel Input step (fields: roll_no, Full_name, date_of_birth).
Step 2: Connect to Split Fields step.
Field to split: Full_name.
Delimiter: (Space).
New fields: First_Name, Last_Name.
Step 3: Connect to Table Output to load data.
Q3. Generate Sequence
Step 1: Use Data Grid to create 10 manual rows (fields: pname, qty).
Step 2: Connect to Add Sequence step.
Name of value: pno.
Start at: 101.
Increment by: 2.
Step 3: Use Select Values to verify the stream.
Q4. Calculate Total Cost
Step 1: Use Data Grid (fields: order_no, order_quantity, item_name, unit_price).
Step 2: Connect to Calculator step.
New field: total_cost.
Calculation: Field A * Field B (Select order_quantity and unit_price).
Step 3: Connect to Table Output (target table: order).
Q5. Calculate Average Marks
Step 1: Use Data Grid (fields: roll_no, mark1, mark2, mark3).
Step 2: Connect to Calculator step.
total_marks = mark1 + mark2 + mark3 (Sum of 3 fields).
average = total_marks / 3 (Division with constant).
Step 3: Connect to Table Output (target table: output).
Q6. Split Employee ID from Excel
Step 1: Use Microsoft Excel Input (Example: A_109).
Step 2: Connect to Split Fields.
Delimiter: _.
New fields: emp_code, emp_number.
Step 3: Connect to Table Output.
Q7. Sort Sales Table (from DB)
Step 1: Use Table Input (Write SQL SELECT * FROM Sales).
Step 2: Connect to Sort Rows.
Sort by region (Descending).
Sort by profit (Ascending).
Step 3: Connect to Table Output (target table: sorted_sales).
Q8. Merge Join (Branch & Borrow)
Step 1: Use Table Input for branch table.
Step 2: Use Table Input for borrow table.
Step 3: Connect both to separate Sort Rows steps (Sort both by bname).
Step 4: Connect both Sort steps to Merge Join.
Join Type: INNER.
Key field: bname.
Step 5: Connect to Table Output (target table: branch_load).
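The Sort Rows + Merge Join combination above is equivalent to an ordinary inner join in SQL. Assuming `branch` and `borrow` share the `bname` column, the stream loaded into `branch_load` corresponds to:

```sql
SELECT b.*, r.*
FROM branch b
INNER JOIN borrow r
  ON b.bname = r.bname
ORDER BY b.bname;
```

PDI's Merge Join requires both inputs pre-sorted on the key, which is why the two Sort Rows steps come first.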
Q9. Number Range (Grades)
Step 1: Use Data Grid (field: percentage).
Step 2: Connect to Number Range step.
Input field: percentage.
Output field: result.
Ranges:
Upper Bound: 45 -> Value: Fail
Upper Bound: 60 -> Value: Second Class
Upper Bound: 80 -> Value: First Class
Upper Bound: 90 -> Value: Distinction
Upper Bound: 101 -> Value: Outstanding
Step 3: Connect to Table Output.
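The Number Range mapping above corresponds to a CASE expression in SQL. A sketch, assuming a hypothetical `marks_table` with a `percentage` column (each upper bound is treated as exclusive, matching the ranges configured in the step):

```sql
SELECT percentage,
       CASE
         WHEN percentage < 45  THEN 'Fail'
         WHEN percentage < 60  THEN 'Second Class'
         WHEN percentage < 80  THEN 'First Class'
         WHEN percentage < 90  THEN 'Distinction'
         ELSE 'Outstanding'
       END AS result
FROM marks_table;
```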
Q10. String Padding
Step 1: Use Data Grid (fields: ENAME, CNAME).
Step 2: Connect to String Operations.
Field ENAME: Padding Right, Pad Char *, Length 8.
Field CNAME: Padding Left, Pad Char #, Length 4.
Step 3: Use Select Values to verify.
Q11. Filter Rows (Pass/Fail)
Step 1: Use Data Grid (fields: marks, result_status).
Step 2: Connect to Filter Rows.
Condition: marks <= 100 AND result_status = 'pass'.
Step 3: Connect "True" output to Text File Output (File 1).
Step 4: Connect "False" output to Text File Output (File 2).
Q12. Validate Excel Data
Step 1: Use Microsoft Excel Input (fields: age, country).
Step 2: Connect to Filter Rows.
Condition: age > 18 AND country IN ('India', 'Singapore').
Step 3: Connect "True" output to Text File Output (name: valid, delimiter: %).
Step 4: Connect "False" output to Text File Output (name: invalid, delimiter: :).