
 Data Analysis & Statistical Methods

Lec. 1: Introduction to R & Basic Operations
1. Basic Arithmetic Operations
Question: "Calculate 5 plus 7, then multiply the result by 3"
Command: (5 + 7) * 3
Output: 36
2. Variable Assignment
Question: "Store the value 15 in a variable called 'temperature' and then display it"
Command: temperature <- 15 then print(temperature)
Output: 15
3. Vector Creation
Question: "Create a vector containing the numbers 2, 4, 6, 8, 10"
Command: numbers <- c(2, 4, 6, 8, 10)
4. Vector Indexing
Question: "From the vector [5, 3, 7, 1], extract the third element"
Command: x <- c(5, 3, 7, 1) then x[3]
Output: 7
5. Matrix Creation
Question: "Create a 2x3 matrix with numbers 1 through 6, filled by rows"
Command: matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
Output:
text
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Lec. 2: Data Import & Programming Basics
6. Importing CSV Data
Question: "Import a CSV file named 'data.csv' from the desktop"
Command: read.csv("C:/Users/Desktop/data.csv")
Output: (Displays the imported data frame)
Interpretation: The CSV file is successfully loaded into R as a data frame for analysis.
7. Counting Observations
Question: "How many temperature readings are in this vector: [3.0, 1.7, 1.4, 2.1, 1.6,
1.0, 0.8, 1.9, 3.8, 4.8]?"
Command: length(c(3.0, 1.7, 1.4, 2.1, 1.6, 1.0, 0.8, 1.9, 3.8, 4.8))
Output: 10
8. Random Number Generation
Question: "Generate 5 random numbers from a normal distribution with mean 50
and standard deviation 10"
Command: rnorm(5, mean=50, sd=10)
Output: 52.34 48.91 61.23 45.67 49.82 (example output)
9. For Loop Example
Question: "Use a for loop to print each element of the vector [24, 28, 30, 29, 50]"
Command:
r
x <- c(24, 28, 30, 29, 50)
for (i in 1:5) {
  print(x[i])
}
Output:
text
[1] 24
[1] 28
[1] 30
[1] 29
[1] 50
Lec. 3: Statistical Analysis & Data Summary
10. Calculating Mean
Question: "Find the average of these test scores: [85, 92, 78, 96, 88]"
Command: mean(c(85, 92, 78, 96, 88))
Output: 87.8
11. Trimmed Mean
Question: "Calculate a 20% trimmed mean for this data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]"
Command: mean(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100), trim=0.2)
Output: 5.5
Interpretation: The trimmed mean removes the lowest and highest 20% of values,
reducing the influence of the outlier (100).
12. Finding Median
Question: "What is the median of these ages: [22, 25, 30, 35, 40]?"
Command: median(c(22, 25, 30, 35, 40))
Output: 30
13. Calculating Quartiles
Question: "Find the quartiles for this data: [10, 15, 20, 25, 30, 35, 40]"
Command: quantile(c(10, 15, 20, 25, 30, 35, 40))
Output:
text
0% 25% 50% 75% 100%
10.0 17.5 25.0 32.5 40.0
Interpretation: Q1=17.5 (25% below this), Q2=25 (median), Q3=32.5 (75% below
this).
14. Correlation Analysis
Question: "Calculate the correlation between hours studied [2, 3, 5, 7, 9] and exam
scores [65, 70, 80, 85, 90]"
Command: cor(c(2, 3, 5, 7, 9), c(65, 70, 80, 85, 90))
Output: 0.985
Interpretation: There's a very strong positive correlation (0.985), indicating that more
study hours are associated with higher exam scores.
15. Comprehensive Data Summary
Question: "Get a complete statistical summary for the weight data: [150, 160, 165,
170, 175, 180, 190]"
Command: summary(c(150, 160, 165, 170, 175, 180, 190))
Output:
text
Min. 1st Qu. Median Mean 3rd Qu. Max.
150.0 162.5 170.0 170.0 177.5 190.0
Lec. 4: Data Transformations & Diagnostics
1. Square Root Transformation
Question: "Apply square root transformation to the value 16"
Command: sqrt(16)
Output: 4.
2. Reciprocal Transformation
Question: "Find the reciprocal of 8"
Command: 1/8
Output: 0.125
3. Logit Transformation
Question: "Calculate the logit transformation for a percentage value of 75%"
Command: log(75/(100-75))
Output: 1.098612
Interpretation: The logit transformation converts percentages (0-100) to a
continuous scale from -∞ to +∞, making them suitable for linear modeling.
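As a small illustration of the idea above (not part of the original command), the sketch below applies the logit to a few example proportions and then inverts it; the vector p is only illustrative:
r
p <- c(10, 50, 75, 90) / 100      # example percentages converted to proportions
logit_p <- log(p / (1 - p))       # logit transformation
logit_p
exp(logit_p) / (1 + exp(logit_p)) # inverse logit recovers the original proportions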
4. Arcsine Transformation
Question: "Apply arc-sine transformation to a proportion of 0.25"
Command: asin(sqrt(0.25))
Output: 0.5235988 (radians)
5. Box-Cox Transformation (λ=0.5)
Question: "Apply Box-Cox transformation with λ=0.5 to the value 9"
Command: (9^0.5 - 1)/0.5
Output: 4
Interpretation: Box-Cox transformation with λ=0.5 approximates the square root
transformation, helping to normalize data distributions.
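For reference, a minimal helper that applies the usual two-case Box-Cox definition (the function name boxcox_value is just illustrative):
r
# (y^lambda - 1) / lambda for lambda != 0, and log(y) when lambda = 0
boxcox_value <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
boxcox_value(9, 0.5)  # 4, matching the hand calculation above
boxcox_value(9, 0)    # log(9) at the lambda = 0 boundary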
6. Outlier Detection - Quartiles
Question: "Find the first quartile (Q1) of this data: [9, 156, 163, 166, 171, 176, 180,
1872]"
Command: quantile(c(9,156,163,166,171,176,180,1872), 0.25)
Output: 161.25
Interpretation: Q1 (161.25) represents the 25th percentile - 25% of values fall below
this point.
7. Outlier Detection - IQR Calculation
Question: "Calculate the Interquartile Range (IQR) for the data: [9, 156, 163, 166, 171,
176, 180, 1872]"
Command: IQR(c(9,156,163,166,171,176,180,1872))
Output: 15.75
8. Outlier Detection - Lower Fence
Question: "Calculate the lower fence for outlier detection using Q1=164.5 and
IQR=12.5"Command: 164.5 - (1.5 * 12.5)
Output: 145.75
Interpretation: Values below the lower fence (145.75) are considered outliers in this
dataset.
9. Outlier Detection - Upper Fence
Question: "Calculate the upper fence for outlier detection using Q3=177 and
IQR=12.5"
Command: 177 + (1.5 * 12.5)
Output: 195.75
Interpretation: Values above the upper fence (195.75) are considered outliers in this
dataset.
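The four outlier-detection steps above can also be run as one short script; a minimal sketch, using R's default (type-7) quantiles:
r
x <- c(9, 156, 163, 166, 171, 176, 180, 1872)
q1 <- quantile(x, 0.25)                 # first quartile
q3 <- quantile(x, 0.75)                 # third quartile
iqr <- q3 - q1                          # interquartile range
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
x[x < lower_fence | x > upper_fence]    # values flagged as outliers (here 9 and 1872)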
10. Moving Average - Data Length
Question: "How many years of attendance data are available: [5761, 6148, 6783,
7445, 7405, 11450, 11224, 11703, 11890, 12380, 12181, 12557]?"
Command: length(c(5761,6148,6783,7445,7405,11450,11224,11703,11890,12380,12181,12557))
Output: 12
11. Three-Year Moving Average
Question: "Calculate the 3-year moving average for years 1993-1995 with attendance
[5761, 6148, 6783]"
Command: mean(c(5761, 6148, 6783))
Output: 6230.667
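To extend the single three-year average above to the whole series, one option is stats::filter with equal weights; this is a sketch, and the centred alignment is a choice rather than part of the original example:
r
attendance <- c(5761, 6148, 6783, 7445, 7405, 11450,
                11224, 11703, 11890, 12380, 12181, 12557)
# Centred 3-year moving average; the first and last entries are NA
stats::filter(attendance, rep(1/3, 3), sides = 2)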
Lec. 5: Data Visualization & Statistical Graphics
1. Boxplot Creation
Question: "Create a boxplot for 50 random values from a normal distribution with
mean 30 and standard deviation 5"
Command: boxplot(rnorm(50, mean=30, sd=5), ylab="Data")
Output: (Visual boxplot showing median, quartiles, and potential outliers)
Interpretation: The boxplot displays the distribution's center, spread, and symmetry.
It helps identify the median (center line), IQR (box), and potential outliers (points
outside whiskers).
2. Colored Boxplot
Question: "Make a red colored boxplot with title 'Boxplot of data'"
Command: boxplot(rnorm(50, mean=30, sd=5), ylab="Data", main="Boxplot of data", col="red")
Output: (Red colored boxplot with title)
Interpretation: Adding color and titles makes plots more informative and publication
ready.
3. Horizontal Boxplot
Question: "Create a horizontal green boxplot"
Command: boxplot(rnorm(50, mean=30, sd=5), xlab="Data", main="Boxplot of data", horizontal=TRUE, col="green")
Output: (Horizontal green boxplot)
Interpretation: Horizontal boxplots are useful when comparing multiple groups or
when category names are long.
4. Stem and Leaf Plot
Question: "Create a stem and leaf plot for the data: 36,46,46,47,48,50,51,52,53,54"
Command: stem(c(36,46,46,47,48,50,51,52,53,54))
Output:
text
The decimal point is 1 digit(s) to the right of the |
3 | 6
4 | 6678
5 | 01234
Interpretation: Stem and leaf displays show the actual data values while organizing
them. Each number is split into stem (tens digit) and leaf (units digit).
5. Basic Histogram
Question: "Create a histogram for 10,000 random values from standard normal
distribution"
Command: hist(rnorm(10000, mean=0, sd=1))
Output: (Bell-shaped histogram)
Interpretation: The histogram shows the distribution shape - in this case,
approximately normal (bell-shaped) as expected from normal random generation.
6. Enhanced Histogram
Question: "Create a skyblue histogram with custom axes and title"
Command:
r
x1 <- rnorm(10000, mean=0, sd=1)
hist(x1, col="skyblue", xlim=c(-6,6), xlab="x variable",
ylab="Frequency", main="Histogram of Random Variables")
Output: (Professional-looking histogram with skyblue bars and proper labels)
Interpretation: Customizing colors, labels, and limits makes histograms more
readable and suitable for reports.
7. Density Plot
Question: "Create a density plot overlay on a histogram"
Command:
r
x1 <- rnorm(10000, mean=0, sd=1)
hist(x1, freq=FALSE, col="lightgray", main="Density Plot")
lines(density(x1), col="red", lwd=2)
Output: (Histogram with smooth red density curve)
Interpretation: Density plots provide a smoothed version of the histogram, better
showing the underlying distribution shape.
8. QQ Plot for Normality Check
Question: "Create a QQ plot to check if 100 values from N(100,8) follow normal
distribution"
Command:
r
y <- rnorm(100, mean=100, sd=8)
qqnorm(y)
qqline(y)
Output: (Scatter plot comparing sample quantiles to theoretical normal quantiles)
Interpretation: If points follow the straight line, the data is approximately normal.
Deviations indicate departures from normality.
9. QQ Plot for Non-Normal Data
Question: "Create a QQ plot for chi-square distributed data with 3 degrees of
freedom"
Command:
r
y <- rchisq(1000, df=3)
qqnorm(y)
qqline(y)
Output: (QQ plot showing curved pattern)
Interpretation: The curved pattern indicates the data doesn't follow a normal
distribution, which is expected for chi-square data.
10. Skewness Interpretation from Boxplot
Question: "How would you interpret a boxplot where the median is closer to Q1 than
Q3?"
Command: (Visual interpretation of boxplot shape)
Output: (No R code - conceptual understanding)
Interpretation: When the median is closer to Q1 (first quartile), it indicates positive
(right) skewness, meaning the data has a longer tail on the right side.
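A quick way to see this pattern in R is to simulate right-skewed data; the exponential draws below are only an illustrative choice:
r
set.seed(1)                    # reproducible example
skewed <- rexp(200, rate = 1)  # right-skewed sample
boxplot(skewed, main = "Right-Skewed Data")
summary(skewed)                # median lies closer to the 1st quartile than to the 3rd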
11. Distribution Shape from Histogram
Question: "What does a histogram with most data on the left and long tail on the
right indicate?"
Command: (Visual interpretation)
Output: (No R code - conceptual)
Interpretation: This indicates right-skewed (positively skewed) distribution, common
with income data or reaction times where most values are low but some are very
high.
12. Color Customization
Question: "Change the color of a plot to hexadecimal code #FF5733"
Command: hist(rnorm(100), col="#FF5733")
Output: (Histogram with specific orange color)
Interpretation: Using hexadecimal color codes allows precise color control for
professional publications and brand consistency.
Lec. 6: Advanced Data Visualization
1. Pie Chart Creation
Question: "Create a pie chart showing market share for products A(15%), B(25%),
C(35%), D(25%)"
Command:
r
sales <- c(15, 25, 35, 25)
products <- c("A", "B", "C", "D")
pie(sales, labels = products)
Output: (Circular chart with four slices labeled A, B, C, D)
Interpretation: Product C has the largest market share (35%), while Product A has
the smallest (15%). Pie charts are best for showing parts of a whole.
2. Bar Plot Basics
Question: "Create a bar plot showing monthly sales: Jan(100), Feb(150), Mar(200),
Apr(180)"
Command:
r
monthly_sales <- c(100, 150, 200, 180)
months <- c("Jan", "Feb", "Mar", "Apr")
barplot(monthly_sales, names.arg = months)
Output: (Four vertical bars with month labels)
Interpretation: March had the highest sales (200), showing an increasing trend from
January to March, with a slight drop in April.
3. Enhanced Bar Plot
Question: "Create a blue bar plot with title 'Monthly Sales' and labeled axes"
Command:
r
barplot(monthly_sales, names.arg = months,
col = "blue", main = "Monthly Sales",
xlab = "Months", ylab = "Sales Amount")
Output: (Professional blue bar chart with proper titles and labels)
Interpretation: The enhanced bar plot clearly shows sales patterns over time, making
it suitable for business presentations.
4. Scatter Plot Creation
Question: "Create a scatter plot showing relationship between study hours [2,3,5,7,9]
and exam scores [65,70,80,85,90]"
Command:
r
hours <- c(2,3,5,7,9)
scores <- c(65,70,80,85,90)
plot(hours, scores)
Output: (Scatter plot with points showing positive trend)
Interpretation: There appears to be a strong positive relationship - as study hours
increase, exam scores also increase.
5. Customized Scatter Plot
Question: "Create a scatter plot with red triangles (pch=17) of size 2"
Command:
r
plot(hours, scores, pch=17, col="red", cex=2,
xlab="Study Hours", ylab="Exam Scores",
main="Study Hours vs Exam Scores")
Output: (Scatter plot with large red triangles)
Interpretation: The customized plot clearly shows the positive correlation. Larger,
colored points make the relationship more visible.
6. Line Graph
Question: "Create a line graph showing temperature trend: Day1-20°, Day2-22°,
Day3-25°, Day4-23°"
Command:
r
days <- 1:4
temp <- c(20,22,25,23)
plot(days, temp, type="l")
Output: (Line connecting the temperature points)
Interpretation: The line graph shows temperature increasing to a peak on Day 3,
then decreasing slightly on Day 4, clearly displaying the trend over time.
7. Enhanced Line Graph
Question: "Create a blue line graph with points and custom axes"
Command:
r
plot(days, temp, type="b", col="blue", lwd=2,
xlab="Day", ylab="Temperature (°C)",
main="Daily Temperature Trend",
xlim=c(1,4), ylim=c(18,26))
Output: (Blue line with points, properly labeled)
Interpretation: The enhanced line graph shows both the overall trend and individual
data points, making it easy to see daily fluctuations.
8. Multiple Boxplots
Question: "Create side-by-side boxplots for test scores of three classes: ClassA
[65,70,75,80,85], ClassB [70,72,78,82,88], ClassC [60,65,70,75,90]"
Command:
r
classA <- c(65,70,75,80,85)
classB <- c(70,72,78,82,88)
classC <- c(60,65,70,75,90)
scores_df <- data.frame(ClassA=classA, ClassB=classB, ClassC=classC)
boxplot(scores_df)
Output: (Three boxplots side by side)
Interpretation: Class B has the highest median score and the least variability, while Class
C has the widest spread, with values reaching as high as 90.
9. Colored Multiple Boxplots
Question: "Create multiple boxplots with different colors for each class"
Command:
r
boxplot(scores_df, col=c("lightblue", "lightgreen", "pink"),
main="Test Scores by Class",
ylab="Scores", xlab="Class")
Output: (Color-coded boxplots for easy comparison)
Interpretation: The colored boxplots make it easy to compare distributions across
classes. Class A (blue) shows consistent performance, Class B (green) has higher
scores with less variability.
10. Scatter Plot Point Types
Question: "Show different point characters (pch) available in R"
Command:
r
# Demonstration of different pch values
plot(1:5, 1:5, pch=1:5, cex=2, xlim=c(0,6))
text(1:5, 0.8, labels=1:5)
Output: (Plot showing circles, triangles, plus signs, etc.)
Interpretation: Different pch values (1-25) allow customization of point symbols in
scatter plots, useful for distinguishing groups in multivariate data.
11. Correlation Strength Interpretation
Question: "How would you interpret a scatter plot where points form a tight,
upward-sloping pattern?"
Command: (Visual interpretation)
Output: (No R code - conceptual)
Interpretation: A tight, upward-sloping pattern indicates a strong positive
correlation, meaning as one variable increases, the other consistently increases in a
predictable manner.
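This can be illustrated with simulated data (the numbers below are arbitrary): a tight, upward-sloping cloud produces a correlation close to +1.
r
set.seed(2)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 2)  # small noise gives a tight, upward-sloping pattern
plot(x, y)
cor(x, y)                       # close to +1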
12. Trend Line Addition
Question: "Add a trend line to a scatter plot"
Command:
r
plot(hours, scores)
abline(lm(scores ~ hours), col="red")
Output: (Scatter plot with red regression line)
Interpretation: The trend line shows the average relationship between variables. A
steep upward slope confirms the strong positive correlation observed in the scatter
plot.
Lec. 7: Statistical Learning & Linear Regression
1. Simple Linear Regression Model Fitting
Question: "Fit a linear regression model to predict heating cost (Y) from outside
temperature (X) using the given data"
Command:
r
X <- c(24,47,50,61,74,85,89,91,95,90,99)
Y <- c(10,19,20,20,20,23,23,29,29,32,40)
model <- lm(Y ~ X)
model
Output:
text
Call:
lm(formula = Y ~ X)
Coefficients:
(Intercept) X
3.3875 0.2829
Interpretation: The fitted regression equation is Ŷ = 3.3875 + 0.2829X. This means
for every 1-degree increase in temperature, heating cost increases by $0.28.
2. Extracting Model Coefficients
Question: "Extract the intercept and slope coefficients from the fitted model"
Command: coef(model)
Output: (Intercept) = 3.3875, X = 0.2829
Interpretation: The intercept (3.3875) represents the base heating cost when
temperature is 0°C, and the slope (0.2829) shows the rate of change in heating cost
per degree temperature.
3. Getting Fitted Values
Question: "Get the predicted heating costs for all observed temperatures"
Command: fitted(model)
Output: 10.17719, 16.68398, 17.53269, 20.64463, 24.32238, 27.43432,
28.56593, 29.13174, 30.26335, 28.84883, 31.39497
Interpretation: These are the heating cost values predicted by our regression model
for each observed temperature.
4. Creating Scatter Plot with Regression Line
Question: "Create a scatter plot of temperature vs heating cost with the regression
line"
Command:
r
plot(X, Y, xlab="Temperature", ylab="Heating Cost", main="Regression Line")
abline(model, col="red", lwd=2)
Output: (Scatter plot with red regression line)
Interpretation: The plot visually shows the positive relationship between
temperature and heating cost, with the red line representing the best-fit linear
relationship.
5. Getting Model Summary
Question: "Get detailed summary of the regression model"
Command: summary(model)
Output: (Detailed output with coefficients, R-squared, F-statistic, p-values)
Interpretation: The summary provides comprehensive statistics including coefficient
significance, model fit measures, and overall model significance.
6. Extracting R-squared Value
Question: "What percentage of variation in heating cost is explained by
temperature?"
Command: summary(model)$r.squared
Output: 0.7397
Interpretation: 73.97% of the variation in heating cost is explained by temperature,
indicating a strong relationship.
7. Extracting Adjusted R-squared
Question: "Get the adjusted R-squared value"
Command: summary(model)$adj.r.squared
Output: 0.7108
Interpretation: The adjusted R-squared (71.08%) accounts for the number of
predictors, providing a more conservative measure of model fit.
8. Checking Model Significance
Question: "Is the overall regression model statistically significant?"
Command: Look at F-statistic p-value in summary(model)
Output: F-statistic: 25.58, p-value: 0.0006833
Interpretation: With p-value < 0.05, the model is statistically significant, meaning
temperature has a significant impact on heating cost.
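If you prefer to extract the F-statistic and its p-value programmatically rather than reading the printed summary, one possible approach is:
r
fstat <- summary(model)$fstatistic   # named vector: value, numdf, dendf
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # overall p-value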
9. Making Predictions with New Data
Question: "Predict heating cost when temperature is 30 degrees"
Command: predict(model, data.frame(X=30))
Output: 11.87462
Interpretation: When the outside temperature is 30°C, the predicted heating cost is
$11.87.
10. Checking Residuals
Question: "Get the residuals (errors) of the model"
Command: residuals(model)
Output: -0.17719, 2.31602, 2.46731, -0.64463, -4.32238, -4.43432, -5.56593, -0.13174, -1.26335, 3.15117, 8.60503
Interpretation: Residuals show the difference between actual and predicted values.
Positive residuals indicate underestimation, negative indicate overestimation.
11. Coefficient Significance Test
Question: "Is the temperature coefficient statistically significant?"
Command: Check p-value for X in summary(model)$coefficients
Output: p-value: 0.000683
Interpretation: With p-value < 0.05, the temperature coefficient is statistically
significant, confirming temperature's real effect on heating cost.
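To pull the slope's p-value directly from the coefficient table instead of reading it off the summary, a minimal sketch:
r
coef_table <- summary(model)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
coef_table["X", "Pr(>|t|)"]                # p-value for the temperature coefficient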
12. Supervised Learning Concept
Question: "What type of supervised learning problem is this?"
Command: (Conceptual - no R code)
Output: Regression problem
Interpretation: Since the response variable (heating cost) is numerical/continuous,
this is a regression problem in supervised learning.
Lec. 8: Regression Diagnostics
1. Checking Linear Relationship
Question: "Create a scatter plot to check if there's a linear relationship between
temperature (X) and heating cost (Y)"
Command:
r
X <- c(24,47,50,61,74,85,89,91,95,90,99)
Y <- c(10,19,20,20,20,23,23,29,29,32,40)
plot(X, Y, xlab="Temperature", ylab="Heating Cost")
Output: (Scatter plot showing data points)
Interpretation: The scatter plot helps visualize the relationship. A linear pattern
suggests linearity assumption is met, while curved patterns indicate potential issues.
2. Verifying Mean of Errors is Zero
Question: "Verify that the mean of regression errors is approximately zero"
Command:
r
model <- lm(Y ~ X)
errors <- residuals(model)
mean(errors)
Output: -1.850372e-15 (effectively zero)
Interpretation: The mean of errors is effectively zero (due to floating-point
precision), which satisfies this regression assumption.
3. Checking Homoscedasticity
Question: "Create a residual plot to check for constant error variance
(homoscedasticity)"
Command:
r
fitted_values <- fitted(model)
plot(fitted_values, errors, xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")
Output: (Residuals vs fitted values plot)
Interpretation: If points are randomly scattered around the red line with no pattern
(like a fan shape), homoscedasticity assumption is met. Patterns suggest
heteroscedasticity.
4. Normality Check - Histogram
Question: "Check if residuals follow normal distribution using histogram"
Command:
r
hist(errors, xlab="Residuals", main="Histogram of Residuals")
Output: (Histogram of residual distribution)
Interpretation: A bell-shaped histogram suggests normal distribution of errors.
Skewed distributions indicate violation of normality assumption.
5. Normality Check - Density Plot
Question: "Create a density plot overlay to check normality"
Command:
r
hist(errors, freq=FALSE, main="Density Plot of Residuals", xlab="Residuals")
lines(density(errors), col="red", lwd=2)
Output: (Histogram with smooth density curve)
Interpretation: The red density curve should follow the normal bell shape. Deviations
from this shape indicate non-normal errors.
6. Normality Check - QQ Plot
Question: "Use QQ plot to check if residuals follow normal distribution"
Command:
r
qqnorm(errors)
qqline(errors, col="red")
Output: (Quantile-Quantile plot)
Interpretation: If points follow the red line closely, residuals are normally distributed.
Points deviating from the line suggest non-normality.
7. Outlier Detection - Boxplot
Question: "Detect outliers in residuals using boxplot"
Command: boxplot(errors, main="Boxplot of Residuals")
Output: (Boxplot showing median, quartiles, and potential outliers)
Interpretation: Points outside the whiskers are potential outliers that may unduly
influence the regression results.
8. Extracting Residuals
Question: "Extract residuals from the fitted regression model"
Command: residuals(model)
Output: -0.17719, 2.31602, 2.46731, -0.64463, -4.32238, -4.43432, -5.56593, -0.13174, -1.26335, 3.15117, 8.60503
Interpretation: These values show how much each actual observation differs from its
predicted value. Large absolute values indicate poor predictions.
9. Sum of Residuals Check
Question: "Verify that the sum of residuals equals zero"
Command: sum(residuals(model))
Output: -2.035409e-15 (effectively zero)
Interpretation: The sum of residuals is effectively zero, which is a mathematical
property of ordinary least squares regression.
10. Residual Pattern Analysis
Question: "Are there any patterns in the residuals that suggest model problems?"
Command: (Visual analysis of residual plot)
Output: (No code - interpretation of plot)
Interpretation: Random scatter indicates good model fit. Patterns (like curves,
funnels, or trends) suggest missing variables, non-linearity, or heteroscedasticity.
11. Normality Test - Statistical
Question: "Perform Shapiro-Wilk test for normality of residuals"
Command: shapiro.test(errors)
Output: W = 0.912, p-value = 0.265
Interpretation: With p-value > 0.05, we cannot reject the null hypothesis of
normality, suggesting residuals are approximately normally distributed.
12. Multicollinearity Check (Conceptual)
Question: "What diagnostic checks multicollinearity in multiple regression?"
Command: (Conceptual - for multiple regression)
Output: Variance Inflation Factor (VIF)
Interpretation: In multiple regression, VIF > 10 indicates high multicollinearity, which
can make coefficient estimates unstable.
Lec. 9: Weighted Least Squares (WLS)
1. Reading Salary Data
Question: "Read the salary data from CSV file and extract starting and mid-career
salary columns"
Command:
r
salary_data <- read.csv("SALARY.csv")
X <- salary_data$Starting.Salary
Y <- salary_data$MidCareer.Salary
head(salary_data)
Output: (Displays first few rows of the dataset)
text
Starting.Salary MidCareer.Salary
1 56700 117000
2 51400 91100
3 46300 88800
4 41500 88000
5 39200 87100
6 39000 87000
Interpretation: The data shows starting salaries and corresponding mid-career
salaries for college graduates, which we'll use for regression analysis.
2. Initial Scatter Plot
Question: "Create a scatter plot to examine the relationship between starting and
mid-career salaries"
Command: plot(X, Y, xlab="Starting Salary", ylab="Mid-Career Salary")
Output: (Scatter plot showing data points)
Interpretation: The plot shows a positive relationship but potential non-linearity and
heteroscedasticity (increasing spread as starting salary increases).
3. Ordinary Least Squares (OLS) Regression
Question: "Fit a simple linear regression model using OLS"
Command:
r
model_ols <- lm(Y ~ X)
summary(model_ols)
Output: (OLS regression output with coefficients, R-squared, etc.)
Interpretation: The OLS model provides initial estimates, but we need to check if
assumptions are violated.
4. Residual Analysis for OLS
Question: "Plot residuals against fitted values to check for heteroscedasticity"
Command:
r
residuals_ols <- residuals(model_ols)
fitted_ols <- fitted(model_ols)
plot(fitted_ols, residuals_ols, xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")
Output: (Residual plot showing funnel shape)
Interpretation: The funnel pattern indicates heteroscedasticity - variance increases
with fitted values, violating OLS assumptions.
5. Weighted Least Squares (WLS) Implementation
Question: "Fit WLS model using weights = 1/X to address heteroscedasticity"
Command:
r
model_wls <- lm(Y ~ X, weights = 1/X)
summary(model_wls)
Output:
text
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3386.441 2915.000 -1.162 0.246
X 1.831 0.070 26.094 <2e-16 ***
Interpretation: WLS gives more efficient estimates. The slope (1.831) is highly
significant (p < 0.001), confirming that higher starting salaries predict higher mid-career
salaries.
6. WLS Coefficient Interpretation
Question: "Interpret the WLS regression coefficient for starting salary"
Command: (Interpretation of output from previous command)
Output: Slope coefficient = 1.831
Interpretation: For every $1 increase in starting salary, mid-career salary increases by
$1.83 on average. This relationship is statistically significant.
7. Prediction with WLS Model
Question: "Predict mid-career salary for a $10,000 increase in starting salary"
Command:
r
prediction <- 1.831 * 10000
prediction
Output: 18310
Interpretation: A $10,000 increase in starting salary predicts an increase of about $18,310 in
mid-career salary based on the WLS model; the change depends only on the slope, not the intercept.
8. Confidence Intervals for WLS Coefficients
Question: "Get 95% confidence intervals for WLS coefficients"
Command: confint(model_wls)
Output:
text
2.5 % 97.5 %
(Intercept) -9110.628 2337.746
X 1.693 1.969
Interpretation: We're 95% confident that the true slope coefficient lies between
1.693 and 1.969, meaning each $1 increase in starting salary increases mid-career
salary by $1.69 to $1.97.
9. Comparing OLS and WLS Residuals
Question: "Compare the residual patterns between OLS and WLS"
Command:
r
residuals_wls <- residuals(model_wls)
fitted_wls <- fitted(model_wls)
plot(fitted_wls, residuals_wls, xlab="Fitted Values (WLS)", ylab="Residuals (WLS)")
abline(h=0, col="red")
Output: (WLS residual plot)
Interpretation: The WLS residuals should show more constant variance
(homoscedasticity) compared to the OLS residual plot, indicating better model fit.
10. Weight Calculation Concept
Question: "Why use weights = 1/X for this salary data?"
Command: (Conceptual explanation)
Output: No R code - theoretical understanding
Interpretation: We use weights = 1/X because variance appears to increase with X.
This gives less weight to observations with higher variance (higher starting salaries)
and more weight to more precise observations.
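One informal way to check the assumption behind weights = 1/X is to plot the absolute OLS residuals against X; a roughly increasing spread supports variance growing with X. This diagnostic is a suggested addition, not part of the original analysis:
r
# If Var(error) grows with X, |residual| should tend to increase with X
plot(X, abs(residuals(model_ols)),
     xlab = "Starting Salary", ylab = "|OLS residual|",
     main = "Residual spread vs. X")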
11. Hypothesis Testing in WLS
Question: "Test if starting salary has a significant positive effect on mid-career salary"
Command: Look at t-value and p-value for X in summary(model_wls)
Output: t = 26.094, p < 2e-16
Interpretation: With t-value = 26.094 and p-value < 0.001, we reject the null
hypothesis and conclude that starting salary has a statistically significant positive
effect on mid-career salary.
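The t-value and p-value can also be extracted directly from the WLS coefficient table; a minimal sketch:
r
coef_wls <- summary(model_wls)$coefficients
coef_wls["X", c("t value", "Pr(>|t|)")]  # slope t-statistic and p-value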
12. R-squared Comparison
Question: "Compare R-squared values between OLS and WLS models"
Command: Compare summary(model_ols)$r.squared and summary(model_wls)$r.squared
Output: OLS R² vs WLS R²
Interpretation: While R-squared may not be directly comparable between OLS and WLS, the primary benefit of WLS is obtaining more efficient (reliable) parameter
estimates when heteroscedasticity is present.
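One way to print the two R-squared values side by side (keeping in mind that the WLS value is a weighted R-squared):
r
c(OLS = summary(model_ols)$r.squared,
  WLS = summary(model_wls)$r.squared)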
Lec. 10: Multiple Linear Regression
1. Data Input for Multiple Regression
Question: "Create a data frame with production data containing one response
variable (Y) and five predictors (X1-X5)"
Command:
r
Y <- c(18.6,19.3,19.9,20.7,20.7,20.3,20.8,21.4,22,22.6,23.2,23.8,24.4,25.1,25.8,26.5,27.2,27.9)
X1 <- c(23.3,24,24.8,25.5,26.3,27.3,28.2,29,29.9,30.8,31.7,23.7,33.7,34.6,35.6,36.6,37.7,38.8)
X2 <- c(22.4,22.8,23.3,23.8,24.2,29.6,30.7,31.8,33,34.3,35.6,36.9,38.3,39.7,41.2,42.8,44.4,46.1)
X3 <- c(49.1,50.9,52.8,54.7,56.7,53.8,55.2,56.7,58.3,59.9,61.5,63.1,64.9,66.6,68.4,70.3,72.2,74.1)
X4 <- c(24.2,24.4,24.6,24.7,24.9,26.5,26.8,27.1,27.4,27.8,28.1,28.4,28.8,29.1,29.4,29.8,30.1,30.5)
X5 <- c(292.4,330,346.1,352.6,372,433.8,477,518,562,610,663,721,785,855,932,1016,1108,1210)
production_data <- data.frame(Y, X1, X2, X3, X4, X5)
head(production_data)
Output:
text
Y X1 X2 X3 X4 X5
1 18.6 23.3 22.4 49.1 24.2 292.4
2 19.3 24.0 22.8 50.9 24.4 330.0
3 19.9 24.8 23.3 52.8 24.6 346.1
4 20.7 25.5 23.8 54.7 24.7 352.6
5 20.7 26.3 24.2 56.7 24.9 372.0
6 20.3 27.3 29.6 53.8 26.5 433.8
Interpretation: The data frame contains production output (Y) and five potential
predictor variables for multiple regression analysis.
2. Checking Linear Relationships
Question: "Create scatter plots to check linear relationships between Y and each
predictor variable"
Command:
r
par(mfrow=c(2,3))
plot(X1, Y, main="Y vs X1")
plot(X2, Y, main="Y vs X2")
plot(X3, Y, main="Y vs X3")
plot(X4, Y, main="Y vs X4")
plot(X5, Y, main="Y vs X5")
par(mfrow=c(1,1))
Output: (Multiple scatter plots in a 2x3 grid)
Interpretation: Visual inspection shows positive linear relationships between Y and
most predictors, supporting the linearity assumption for multiple regression.
3. Fitting Multiple Linear Regression Model
Question: "Fit a multiple linear regression model with Y as response and X1-X5 as
predictors"
Command:
r
mlr_model <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data=production_data)
mlr_model
Output:
text
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = production_data)
Coefficients:
(Intercept) X1 X2 X3 X4 X5
-1.652630 0.092837 0.045452 0.006836 0.076773 0.000026
Interpretation: The multiple regression equation is: Ŷ = -1.6526 + 0.0928X1 +
0.0455X2 + 0.0068X3 + 0.0768X4 + 0.000026X5
4. Comprehensive Model Summary
Question: "Get detailed summary of the multiple regression model including R
squared and p-values"
Command: summary(mlr_model)
Output: (Detailed output with coefficients, standard errors, t-values, p-values, R-squared, F-statistic)
Interpretation: The summary provides complete diagnostic information including
which predictors are statistically significant and overall model fit measures.
5. Extracting R-squared Value
Question: "What percentage of variation in production output is explained by all
predictors?"
Command: summary(mlr_model)$r.squared
Output: 0.9989
Interpretation: 99.89% of the variation in production output (Y) is explained by the
five predictor variables, indicating an excellent model fit.
6. Checking Multicollinearity - Correlation Matrix
Question: "Check correlation between predictor variables to detect multicollinearity"
Command:
r
predictors <- production_data[,2:6]
cor(predictors)
Output: (5x5 correlation matrix showing relationships between X1-X5)
Interpretation: High correlations (close to 1 or -1) between predictors indicate
potential multicollinearity problems.
7. Checking Multicollinearity - VIF Calculation
Question: "Calculate Variance Inflation Factor (VIF) to quantify multicollinearity"
Command:
r
install.packages("car")
library(car)
vif(mlr_model)
Output: VIF values for each predictor
Interpretation: VIF > 10 indicates serious multicollinearity. Values between 1-5
suggest moderate correlation, while values close to 1 indicate no multicollinearity.
8. Confidence Intervals for Coefficients
Question: "Get 95% confidence intervals for all regression coefficients"
Command: confint(mlr_model)
Output:
text
2.5 % 97.5 %
(Intercept) -2.32079333 -0.98446608
X1 0.07567697 0.10999703
X2 0.02408476 0.06681882
X3 -0.00592657 0.01959824
X4 -0.00752606 0.16107231
X5 -0.00047775 0.00052975
Interpretation: We can be 95% confident that the true coefficient for X1 lies between
0.0757 and 0.1100. Intervals containing zero (like X3, X4, X5) suggest those predictors
may not be statistically significant.
9. Making Predictions with New Data
Question: "Predict production output when X1=30, X2=35, X3=60, X4=28, X5=700"
Command:
r
new_data <- data.frame(X1=30, X2=35, X3=60, X4=28, X5=700)
predict(mlr_model, newdata=new_data)
Output: 22.18267
Interpretation: The predicted production output for the given input values is 22.18
units.
10. Model Significance Test
Question: "Is the overall multiple regression model statistically significant?"
Command: Check F-statistic p-value in summary(mlr_model)
Output: F-statistic and p-value
Interpretation: A significant F-statistic (p < 0.05) indicates that the model explains a
significant amount of variation in Y beyond what would be expected by chance.
11. Individual Coefficient Significance
Question: "Which predictors are statistically significant in the model?"
Command: Check p-values for each coefficient
in summary(mlr_model)$coefficients
Output: p-values for each predictor
Interpretation: Predictors with p-values < 0.05 are statistically significant. Non-significant
predictors may be candidates for removal from the model.
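A short sketch for listing the p-values and picking out the significant predictors from the coefficient table (the 0.05 cutoff is the conventional choice used above):
r
coef_tab <- summary(mlr_model)$coefficients
coef_tab[, "Pr(>|t|)"]                             # p-value for every term
rownames(coef_tab)[coef_tab[, "Pr(>|t|)"] < 0.05]  # names of the significant terms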
12. Multiple Regression Concept
Question: "What is the key difference between simple and multiple linear
regression?"
Command: (Conceptual - no R code)
Output: Theoretical understanding
Interpretation: Multiple regression considers multiple predictors simultaneously,
allowing us to understand the unique contribution of each variable while controlling
for others, unlike simple regression which only considers one predictor at a time.
Lec. 11: Polynomial Regression
1. Data Input for Polynomial Regression
Question: "Create vectors for X and Y data: X = [0,1,2,3,4,5,6,7,8,9], Y =
[1,1.5,1.8,2.3,2.9,4,5,6.5,8,10]"
Command:
r
X <- c(0,1,2,3,4,5,6,7,8,9)
Y <- c(1,1.5,1.8,2.3,2.9,4,5,6.5,8,10)
Output: (Vectors stored in memory)
Interpretation: The data shows a clear non-linear relationship where Y increases at
an accelerating rate as X increases, making it suitable for polynomial regression.
2. Checking Relationship with Scatter Plot
Question: "Create a scatter plot to visualize the relationship between X and Y"
Command: plot(X, Y, main="Scatter Plot of X vs Y")
Output: (Scatter plot showing curved pattern)
Interpretation: The scatter plot reveals a clear non-linear (curved) relationship,
indicating that polynomial regression would be more appropriate than simple linear
regression.
3. Quadratic Polynomial Regression (Degree 2)
Question: "Fit a quadratic polynomial model: Ŷ = a + b₁X + b₂X²"
Command:
r
X_sq <- X^2
model_quad <- lm(Y ~ X + X_sq)
model_quad
Output:
text
Coefficients:
(Intercept) X X_sq
1.19455 0.03758 0.10303
Interpretation: The quadratic model is Ŷ = 1.195 + 0.038X + 0.103X². The positive X²
coefficient (0.103) confirms the upward-curving relationship.
4. Cubic Polynomial Regression (Degree 3)
Question: "Fit a cubic polynomial model: Ŷ = a + b₁X + b₂X² + b₃X³"
Command:
r
X_cub <- X^3
model_cubic <- lm(Y ~ X + X_sq + X_cub)
model_cubic
Output:
text
Coefficients:
(Intercept) X X_sq X_cub
1.069231 0.266822 0.035897 0.004973
Interpretation: The cubic model is Ŷ = 1.069 + 0.267X + 0.036X² + 0.005X³, adding a
cubic term to capture more complex curvature.
5. Model Comparison using R-squared
Question: "Compare quadratic and cubic models using R-squared values"
Command:
r
quad_r2 <- summary(model_quad)$r.squared
cubic_r2 <- summary(model_cubic)$r.squared
c(quad_r2, cubic_r2)
Output: 0.9984 0.9993
Interpretation: The cubic model has higher R-squared (99.93% vs 99.84%), indicating
it explains slightly more variation in Y, but both models fit very well.
6. Quadratic Model with Deviation Form
Question: "Fit quadratic model using deviation form: Ŷ = b₀ + b₁(X-mean(X)) + b₂(X
mean(X))²"
Command:r
D <- X - mean(X)
D_sq <- D^2
model_dev_quad <- lm(Y ~ D + D_sq)
model_dev_quad
Output:
text
Coefficients:
(Intercept) D D_sq
3.450000 0.964848 0.103030
Interpretation: The deviation form model is Ŷ = 3.45 + 0.965(X-4.5) + 0.103(X-4.5)².
This form centers the data around the mean, which can improve numerical stability.
7. Cubic Model with Deviation Form
Question: "Fit cubic model using deviation form: Ŷ = b₀ + b₁(X-mean(X)) + b₂(X
mean(X))² + b₃(X-mean(X))³"
Command:
r
D_cub <- D^3
model_dev_cubic <- lm(Y ~ D + D_sq + D_cub)
model_dev_cubic
Output:
text
Coefficients:
(Intercept) D D_sq D_cub
3.450000 0.891997 0.103030 0.004973
Interpretation: The cubic deviation model is Ŷ = 3.45 + 0.892(X-4.5) + 0.103(X-4.5)²
+ 0.005(X-4.5)³, providing the same fit as the regular cubic but with centered
predictors.
8. Coefficient Significance Check
Question: "Check which coefficients are statistically significant in the cubic model"
Command: summary(model_cubic)$coefficients
Output: (Coefficient table with p-values)
Interpretation: The cubic term (X³) has p-value = 0.0296 < 0.05, indicating it's
statistically significant and justifies using the more complex cubic model over
quadratic.
9. Residual Analysis for Model Selection
Question: "Compare residuals of quadratic vs cubic models to choose the better fit"
Command:
r
quad_resid <- resid(model_quad)
cubic_resid <- resid(model_cubic)
c(mean(quad_resid), mean(cubic_resid))
Output: -2.775558e-17 -1.387779e-17 (both effectively zero)
Interpretation: Both models have residuals centered around zero, but the cubic
model typically shows smaller residual variation, indicating better fit.
10. Making Predictions with Polynomial Model
Question: "Predict Y when X=5.5 using the cubic polynomial model"
Command:
r
new_X <- 5.5
prediction <- predict(model_cubic, data.frame(X=5.5, X_sq=5.5^2, X_cub=5.5^3))
prediction
Output: 4.45
Interpretation: When X=5.5, the predicted Y value is approximately 4.45 based on the cubic
polynomial model.
11. Visualizing Polynomial Fit
Question: "Create a plot showing data points with fitted quadratic and cubic curves"
Command:
r
plot(X, Y, main="Polynomial Regression Fit")
curve(1.19455 + 0.03758*x + 0.10303*x^2, add=TRUE, col="red", lwd=2)
curve(1.069231 + 0.266822*x + 0.035897*x^2 + 0.004973*x^3, add=TRUE, col="blue", lwd=2)
legend("topleft", legend=c("Quadratic", "Cubic"), col=c("red", "blue"), lwd=2)
Output: (Scatter plot with red quadratic curve and blue cubic curve)
Interpretation: The plot visually shows how both polynomial curves fit the data, with
the cubic curve (blue) potentially fitting the curvature better, especially at higher X
values.
12. Polynomial Regression Concept
Question: "When should you use polynomial regression instead of linear regression?"
Command: (Conceptual - no R code)
Output: Theoretical understanding
Interpretation: Use polynomial regression when the relationship between variables
shows curvature (non-linearity) that can't be captured by a straight line, such as
accelerating growth, diminishing returns, or U-shaped relationships.
Lec. 12: Regularization Techniques
1. Loading Built-in Dataset
Question: "Load the mtcars dataset and view its structure"
Command:
r
data(mtcars)
head(mtcars)
Output:
text
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Interpretation: The mtcars dataset contains information about 32 cars with 11
variables including mpg (miles per gallon) as the response variable and other car
characteristics as predictors.
2. Variable Extraction
Question: "Extract mpg as response variable and other variables as predictors"
Command:
r
y <- mtcars$mpg
x1 <- mtcars$cyl
x2 <- mtcars$disp
x3 <- mtcars$hp
x4 <- mtcars$drat
x5 <- mtcars$wt
x6 <- mtcars$qsec
x7 <- mtcars$vs
x8 <- mtcars$am
x9 <- mtcars$gear
x10 <- mtcars$carb
Output: (Variables stored in memory)
Interpretation: Successfully extracted the response variable (mpg) and 10 predictor
variables for regression analysis.
3. Correlation Analysis for Multicollinearity
Question: "Check correlation between predictor variables to detect multicollinearity"
Command:
r
predictors <- data.frame(x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
cor_matrix <- cor(predictors)
round(cor_matrix, 3)
Output: (10x10 correlation matrix showing relationships between predictors)
Interpretation: High correlations (close to 1 or -1) between variables like cyl, disp,
and hp indicate potential multicollinearity problems.
4. VIF Analysis for Multicollinearity
Question: "Calculate Variance Inflation Factor (VIF) to quantify multicollinearity"
Command:
r
library(car)
mlr_model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)
vif_values <- vif(mlr_model)
vif_values
Output: VIF values for each predictor variable
Interpretation: VIF > 10 indicates serious multicollinearity. Variables with high VIF
values should be considered for removal or regularization.
5. Data Preparation for Ridge Regression
Question: "Prepare scaled predictor matrix for ridge regression"
Command:
r
X_matrix <- as.matrix(predictors)
X_scaled <- scale(X_matrix)
y_vector <- as.vector(y)
Output: (Scaled predictor matrix and response vector)
Interpretation: Scaling predictors ensures all variables are on comparable scales,
which is important for ridge regression penalty terms.
6. Finding Optimal Ridge Parameter (Lambda)
Question: "Find optimal lambda value for ridge regression using cross-validation"
Command:
r
library(glmnet)
lambda_seq <- 10^seq(5, -2, length = 100)
ridge_cv <- cv.glmnet(X_scaled, y_vector, alpha = 0, lambda = lambda_seq)
best_lambda <- ridge_cv$lambda.min
best_lambda
Output: 15.84893 (example optimal lambda value)
Interpretation: The optimal lambda (15.85) minimizes cross-validation error,
balancing bias and variance in the ridge regression model.
7. Fitting Ridge Regression Model
Question: "Fit ridge regression model with optimal lambda value"
Command:
r
ridge_model <- glmnet(X_scaled, y_vector, alpha = 0, lambda = best_lambda)
coef(ridge_model)
Output: Ridge regression coefficients for all predictors
Interpretation: Ridge regression shrinks coefficients toward zero but doesn't
eliminate any variables completely, helping reduce multicollinearity effects.
8. Making Predictions with Ridge Model
Question: "Make predictions using the fitted ridge regression model"
Command:
r
y_pred <- predict(ridge_model, newx = X_scaled)
head(y_pred)
Output: Predicted mpg values for all cars in the dataset
Interpretation: These are the fitted values from the ridge regression model, which
should be more stable than OLS when multicollinearity is present.
9. Calculating R-squared for Model Accuracy
Question: "Calculate R-squared to measure model accuracy"
Command:
r
library(MLmetrics)
R2_Score(y_pred, y_vector)
Output: 0.8619
Interpretation: The ridge regression model explains 86.19% of the variance in mpg,
indicating good model fit despite regularization.
10. Calculating RMSE for Prediction Error
Question: "Calculate Root Mean Squared Error (RMSE) to measure prediction
accuracy"
Command: RMSE(y_pred, y_vector)
Output: 2.204
Interpretation: The average prediction error is 2.204 mpg, meaning predictions are
typically within ±2.2 mpg of actual values.
11. Comparing with OLS Regression
Question: "Compare ridge regression coefficients with OLS coefficients"
Command:
r
ols_coef <- coef(lm(y ~ X_scaled))
ridge_coef <- as.vector(coef(ridge_model))
comparison <- data.frame(OLS = ols_coef, Ridge = ridge_coef)
comparison
Output: Side-by-side comparison of OLS and ridge coefficients
Interpretation: Ridge coefficients are shrunk toward zero compared to OLS, reducing
their variance and making them more stable.
12. Regularization Concept
Question: "What is the main purpose of ridge regression?"
Command: (Conceptual - no R code)
Output: Theoretical understanding
Interpretation: Ridge regression addresses multicollinearity by adding a penalty
(lambda) to the regression coefficients, shrinking them toward zero to reduce
variance and improve model stability, at the cost of introducing some bias.
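To see the shrinkage effect directly, one can refit the model at a few lambda values and compare the coefficient columns; this sketch reuses X_scaled and y_vector from earlier, and the lambda values are arbitrary:
r
library(glmnet)
lambdas <- c(100, 1, 0.01)                                # large to small penalty
shrink_demo <- glmnet(X_scaled, y_vector, alpha = 0, lambda = lambdas)
coef(shrink_demo)  # one column per lambda; coefficients move toward zero as lambda grows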