
 Data Analysis & Statistical Methods

Lec. 1: Introduction to R & Basic Operations
1. Basic Arithmetic Operations
Question: "Calculate 5 plus 7, then multiply the result by 3"
Command: (5 + 7) * 3
Output: 36
2. Variable Assignment
Question: "Store the value 15 in a variable called 'temperature' and then display it"
Command: temperature <- 15 then print(temperature)
Output: 15
3. Vector Creation
Question: "Create a vector containing the numbers 2, 4, 6, 8, 10"
Command: numbers <- c(2, 4, 6, 8, 10)
4. Vector Indexing
Question: "From the vector [5, 3, 7, 1], extract the third element"
Command: x <- c(5, 3, 7, 1) then x[3]
Output: 7
5. Matrix Creation
Question: "Create a 2x3 matrix with numbers 1 through 6, filled by rows"
Command: matrix(1:6, nrow=2, ncol=3, byrow=TRUE)
Output:
text
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
Lec. 2: Data Import & Programming Basics
6. Importing CSV Data
Question: "Import a CSV file named 'data.csv' from the desktop"
Command: read.csv("C:/Users/Desktop/data.csv")
Output: (Displays the imported data frame)
Interpretation: The CSV file is successfully loaded into R as a data frame for analysis.
7. Counting Observations
Question: "How many temperature readings are in this vector: [3.0, 1.7, 1.4, 2.1, 1.6,
1.0, 0.8, 1.9, 3.8, 4.8]?"
Command: length(c(3.0, 1.7, 1.4, 2.1, 1.6, 1.0, 0.8, 1.9, 3.8, 4.8))
Output: 10
8. Random Number Generation
Question: "Generate 5 random numbers from a normal distribution with mean 50
and standard deviation 10"
Command: rnorm(5, mean=50, sd=10)
Output: 52.34 48.91 61.23 45.67 49.82 (example output)
9. For Loop Example
Question: "Use a for loop to print each element of the vector [24, 28, 30, 29, 50]"
Command:
r
x <- c(24, 28, 30, 29, 50)
for (i in 1:5) {
  print(x[i])
}
Output:
text
[1] 24
[1] 28
[1] 30
[1] 29
[1] 50
Lec. 3: Statistical Analysis & Data Summary
10. Calculating Mean
Question: "Find the average of these test scores: [85, 92, 78, 96, 88]"
Command: mean(c(85, 92, 78, 96, 88))
Output: 87.8
11. Trimmed Mean
Question: "Calculate a 20% trimmed mean for this data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]"
Command: mean(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100), trim=0.2)
Output: 5.5
Interpretation: The trimmed mean removes the lowest and highest 20% of values,
reducing the influence of the outlier (100).
12. Finding Median
Question: "What is the median of these ages: [22, 25, 30, 35, 40]?"
Command: median(c(22, 25, 30, 35, 40))
Output: 30
13. Calculating Quartiles
Question: "Find the quartiles for this data: [10, 15, 20, 25, 30, 35, 40]"
Command: quantile(c(10, 15, 20, 25, 30, 35, 40))
Output:
text
0% 25% 50% 75% 100%
10.0 17.5 25.0 32.5 40.0
Interpretation: Q1=17.5 (25% below this), Q2=25 (median), Q3=32.5 (75% below
this).
14. Correlation Analysis
Question: "Calculate the correlation between hours studied [2, 3, 5, 7, 9] and exam
scores [65, 70, 80, 85, 90]"
Command: cor(c(2, 3, 5, 7, 9), c(65, 70, 80, 85, 90))
Output: 0.985
Interpretation: There's a very strong positive correlation (0.985), indicating that more
study hours are associated with higher exam scores.
15. Comprehensive Data Summary
Question: "Get a complete statistical summary for the weight data: [150, 160, 165,
170, 175, 180, 190]"
Command: summary(c(150, 160, 165, 170, 175, 180, 190))
Output:
text
Min. 1st Qu. Median Mean 3rd Qu. Max.
150.0 162.5 170.0 170.0 177.5 190.0
Lec. 4: Data Transformations & Diagnostics
1. Square Root Transformation
Question: "Apply square root transformation to the value 16"
Command: sqrt(16)
Output: 4.
2. Reciprocal Transformation
Question: "Find the reciprocal of 8"
Command: 1/8
Output: 0.125
3. Logit Transformation
Question: "Calculate the logit transformation for a percentage value of 75%"
Command: log(75/(100-75))
Output: 1.098612
Interpretation: The logit transformation converts percentages (0-100) to a
continuous scale from -∞ to +∞, making them suitable for linear modeling.
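As a small illustration of the idea above (not part of the original command), the sketch below applies the logit to a few example proportions and then inverts it; the vector p is only illustrative:
r
p <- c(10, 50, 75, 90) / 100      # example percentages converted to proportions
logit_p <- log(p / (1 - p))       # logit transformation
logit_p
exp(logit_p) / (1 + exp(logit_p)) # inverse logit recovers the original proportions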
4. Arcsine Transformation
Question: "Apply arc-sine transformation to a proportion of 0.25"
Command: asin(sqrt(0.25))
Output: 0.5235988 (radians)
5. Box-Cox Transformation (λ=0.5)
Question: "Apply Box-Cox transformation with λ=0.5 to the value 9"
Command: (9^0.5 - 1)/0.5
Output: 4
Interpretation: Box-Cox transformation with λ=0.5 approximates the square root
transformation, helping to normalize data distributions.
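For reference, a minimal helper that applies the usual two-case Box-Cox definition (the function name boxcox_value is just illustrative):
r
# (y^lambda - 1) / lambda for lambda != 0, and log(y) when lambda = 0
boxcox_value <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
boxcox_value(9, 0.5)  # 4, matching the hand calculation above
boxcox_value(9, 0)    # log(9) at the lambda = 0 boundary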
6. Outlier Detection - Quartiles
Question: "Find the first quartile (Q1) of this data: [9, 156, 163, 166, 171, 176, 180,
1872]"
Command: quantile(c(9,156,163,166,171,176,180,1872), 0.25)
Output: 161.25
Interpretation: Q1 (161.25) represents the 25th percentile - 25% of values fall below
this point.
7. Outlier Detection - IQR Calculation
Question: "Calculate the Interquartile Range (IQR) for the data: [9, 156, 163, 166, 171,
176, 180, 1872]"
Command: IQR(c(9,156,163,166,171,176,180,1872))
Output: 15.75
8. Outlier Detection - Lower Fence
Question: "Calculate the lower fence for outlier detection using Q1=164.5 and
IQR=12.5"Command: 164.5 - (1.5 * 12.5)
Output: 145.75
Interpretation: Values below the lower fence (145.75) are considered outliers in this
dataset.
9. Outlier Detection - Upper Fence
Question: "Calculate the upper fence for outlier detection using Q3=177 and
IQR=12.5"
Command: 177 + (1.5 * 12.5)
Output: 195.75
Interpretation: Values above the upper fence (195.75) are considered outliers in this
dataset.
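The four outlier-detection steps above can also be run as one short script; a minimal sketch, using R's default (type-7) quantiles:
r
x <- c(9, 156, 163, 166, 171, 176, 180, 1872)
q1 <- quantile(x, 0.25)                 # first quartile
q3 <- quantile(x, 0.75)                 # third quartile
iqr <- q3 - q1                          # interquartile range
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
x[x < lower_fence | x > upper_fence]    # values flagged as outliers (here 9 and 1872)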
10. Moving Average - Data Length
Question: "How many years of attendance data are available: [5761, 6148, 6783,
7445, 7405, 11450, 11224, 11703, 11890, 12380, 12181, 12557]?"
Command: length(c(5761,6148,6783,7445,7405,11450,11224,11703,11890,12380,12181,12557))
Output: 12
11. Three-Year Moving Average
Question: "Calculate the 3-year moving average for years 1993-1995 with attendance
[5761, 6148, 6783]"
Command: mean(c(5761, 6148, 6783))
Output: 6230.667
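To extend the single three-year average above to the whole series, one option is stats::filter with equal weights; this is a sketch, and the centred alignment is a choice rather than part of the original example:
r
attendance <- c(5761, 6148, 6783, 7445, 7405, 11450,
                11224, 11703, 11890, 12380, 12181, 12557)
# Centred 3-year moving average; the first and last entries are NA
stats::filter(attendance, rep(1/3, 3), sides = 2)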
Lec. 5: Data Visualization & Statistical Graphics
1. Boxplot Creation
Question: "Create a boxplot for 50 random values from a normal distribution with
mean 30 and standard deviation 5"
Command: boxplot(rnorm(50, mean=30, sd=5), ylab="Data")
Output: (Visual boxplot showing median, quartiles, and potential outliers)
Interpretation: The boxplot displays the distribution's center, spread, and symmetry.
It helps identify the median (center line), IQR (box), and potential outliers (points
outside whiskers).
2. Colored Boxplot
Question: "Make a red colored boxplot with title 'Boxplot of data'"
Command: boxplot(rnorm(50, mean=30, sd=5), ylab="Data", main="Boxplot of data", col="red")
Output: (Red colored boxplot with title)
Interpretation: Adding color and titles makes plots more informative and publication
ready.
3. Horizontal Boxplot
Question: "Create a horizontal green boxplot"
Command: boxplot(rnorm(50, mean=30, sd=5), xlab="Data", main="Boxplot of data", horizontal=TRUE, col="green")
Output: (Horizontal green boxplot)
Interpretation: Horizontal boxplots are useful when comparing multiple groups or
when category names are long.
4. Stem and Leaf Plot
Question: "Create a stem and leaf plot for the data: 36,46,46,47,48,50,51,52,53,54"
Command: stem(c(36,46,46,47,48,50,51,52,53,54))
Output:
text
The decimal point is 1 digit(s) to the right of the |
3 | 6
4 | 6678
5 | 01234
Interpretation: Stem and leaf displays show the actual data values while organizing
them. Each number is split into stem (tens digit) and leaf (units digit).
5. Basic Histogram
Question: "Create a histogram for 10,000 random values from standard normal
distribution"
Command: hist(rnorm(10000, mean=0, sd=1))
Output: (Bell-shaped histogram)
Interpretation: The histogram shows the distribution shape - in this case,
approximately normal (bell-shaped) as expected from normal random generation.
6. Enhanced Histogram
Question: "Create a skyblue histogram with custom axes and title"
Command:
r
x1 <- rnorm(10000, mean=0, sd=1)
hist(x1, col="skyblue", xlim=c(-6,6), xlab="x variable",
ylab="Frequency", main="Histogram of Random Variables")
Output: (Professional-looking histogram with skyblue bars and proper labels)
Interpretation: Customizing colors, labels, and limits makes histograms more
readable and suitable for reports.
7. Density Plot
Question: "Create a density plot overlay on a histogram"
Command:
r
x1 <- rnorm(10000, mean=0, sd=1)
hist(x1, freq=FALSE, col="lightgray", main="Density Plot")
lines(density(x1), col="red", lwd=2)
Output: (Histogram with smooth red density curve)
Interpretation: Density plots provide a smoothed version of the histogram, better
showing the underlying distribution shape.
8. QQ Plot for Normality Check
Question: "Create a QQ plot to check if 100 values from N(100,8) follow normal
distribution"
Command:
r
y <- rnorm(100, mean=100, sd=8)
qqnorm(y)
qqline(y)
Output: (Scatter plot comparing sample quantiles to theoretical normal quantiles)
Interpretation: If points follow the straight line, the data is approximately normal.
Deviations indicate departures from normality.
9. QQ Plot for Non-Normal Data
Question: "Create a QQ plot for chi-square distributed data with 3 degrees of
freedom"
Command:
r
y <- rchisq(1000, df=3)
qqnorm(y)
qqline(y)
Output: (QQ plot showing curved pattern)
Interpretation: The curved pattern indicates the data doesn't follow a normal
distribution, which is expected for chi-square data.
10. Skewness Interpretation from Boxplot
Question: "How would you interpret a boxplot where the median is closer to Q1 than
Q3?"
Command: (Visual interpretation of boxplot shape)
Output: (No R code - conceptual understanding)
Interpretation: When the median is closer to Q1 (first quartile), it indicates positive
(right) skewness, meaning the data has a longer tail on the right side.
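A quick way to see this pattern in R is to simulate right-skewed data; the exponential draws below are only an illustrative choice:
r
set.seed(1)                    # reproducible example
skewed <- rexp(200, rate = 1)  # right-skewed sample
boxplot(skewed, main = "Right-Skewed Data")
summary(skewed)                # median lies closer to the 1st quartile than to the 3rd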
11. Distribution Shape from Histogram
Question: "What does a histogram with most data on the left and long tail on the
right indicate?"
Command: (Visual interpretation)
Output: (No R code - conceptual)
Interpretation: This indicates right-skewed (positively skewed) distribution, common
with income data or reaction times where most values are low but some are very
high.
12. Color Customization
Question: "Change the color of a plot to hexadecimal code #FF5733"
Command: hist(rnorm(100), col="#FF5733")
Output: (Histogram with specific orange color)
Interpretation: Using hexadecimal color codes allows precise color control for
professional publications and brand consistency.
Lec. 6: Advanced Data Visualization
1. Pie Chart Creation
Question: "Create a pie chart showing market share for products A(15%), B(25%),
C(35%), D(25%)"
Command:
r
sales <- c(15, 25, 35, 25)
products <- c("A", "B", "C", "D")
pie(sales, labels = products)
Output: (Circular chart with four slices labeled A, B, C, D)
Interpretation: Product C has the largest market share (35%), while Product A has
the smallest (15%). Pie charts are best for showing parts of a whole.
2. Bar Plot Basics
Question: "Create a bar plot showing monthly sales: Jan(100), Feb(150), Mar(200),
Apr(180)"
Command:
r
monthly_sales <- c(100, 150, 200, 180)
months <- c("Jan", "Feb", "Mar", "Apr")
barplot(monthly_sales, names.arg = months)
Output: (Four vertical bars with month labels)
Interpretation: March had the highest sales (200), showing an increasing trend from
January to March, with a slight drop in April.
3. Enhanced Bar Plot
Question: "Create a blue bar plot with title 'Monthly Sales' and labeled axes"
Command:
r
barplot(monthly_sales, names.arg = months,
col = "blue", main = "Monthly Sales",
xlab = "Months", ylab = "Sales Amount")
Output: (Professional blue bar chart with proper titles and labels)
Interpretation: The enhanced bar plot clearly shows sales patterns over time, making
it suitable for business presentations.
4. Scatter Plot Creation
Question: "Create a scatter plot showing relationship between study hours [2,3,5,7,9]
and exam scores [65,70,80,85,90]"
Command:
r
hours <- c(2,3,5,7,9)
scores <- c(65,70,80,85,90)
plot(hours, scores)
Output: (Scatter plot with points showing positive trend)
Interpretation: There appears to be a strong positive relationship - as study hours
increase, exam scores also increase.
5. Customized Scatter Plot
Question: "Create a scatter plot with red triangles (pch=17) of size 2"
Command:
r
plot(hours, scores, pch=17, col="red", cex=2,
xlab="Study Hours", ylab="Exam Scores",
main="Study Hours vs Exam Scores")
Output: (Scatter plot with large red triangles)
Interpretation: The customized plot clearly shows the positive correlation. Larger,
colored points make the relationship more visible.
6. Line Graph
Question: "Create a line graph showing temperature trend: Day1-20°, Day2-22°,
Day3-25°, Day4-23°"
Command:
r
days <- 1:4
temp <- c(20,22,25,23)
plot(days, temp, type="l")
Output: (Line connecting the temperature points)
Interpretation: The line graph shows temperature increasing to a peak on Day 3,
then decreasing slightly on Day 4, clearly displaying the trend over time.
7. Enhanced Line Graph
Question: "Create a blue line graph with points and custom axes"
Command:
r
plot(days, temp, type="b", col="blue", lwd=2,
xlab="Day", ylab="Temperature (°C)",
main="Daily Temperature Trend",
xlim=c(1,4), ylim=c(18,26))
Output: (Blue line with points, properly labeled)
Interpretation: The enhanced line graph shows both the overall trend and individual
data points, making it easy to see daily fluctuations.
8. Multiple Boxplots
Question: "Create side-by-side boxplots for test scores of three classes: ClassA
[65,70,75,80,85], ClassB [70,72,78,82,88], ClassC [60,65,70,75,90]"
Command:
r
classA <- c(65,70,75,80,85)
classB <- c(70,72,78,82,88)
classC <- c(60,65,70,75,90)
scores_df <- data.frame(ClassA=classA, ClassB=classB, ClassC=classC)
boxplot(scores_df)
Output: (Three boxplots side by side)
Interpretation: Class B has the highest median score and the least variability, while Class
C has the widest spread, with values reaching as high as 90.
9. Colored Multiple Boxplots
Question: "Create multiple boxplots with different colors for each class"
Command:
r
boxplot(scores_df, col=c("lightblue", "lightgreen", "pink"),
main="Test Scores by Class",
ylab="Scores", xlab="Class")
Output: (Color-coded boxplots for easy comparison)
Interpretation: The colored boxplots make it easy to compare distributions across
classes. Class A (blue) shows consistent performance, Class B (green) has higher
scores with less variability.
10. Scatter Plot Point Types
Question: "Show different point characters (pch) available in R"
Command:
r
# Demonstration of different pch values
plot(1:5, 1:5, pch=1:5, cex=2, xlim=c(0,6))
text(1:5, 0.8, labels=1:5)
Output: (Plot showing circles, triangles, plus signs, etc.)
Interpretation: Different pch values (1-25) allow customization of point symbols in
scatter plots, useful for distinguishing groups in multivariate data.
11. Correlation Strength Interpretation
Question: "How would you interpret a scatter plot where points form a tight,
upward-sloping pattern?"
Command: (Visual interpretation)
Output: (No R code - conceptual)
Interpretation: A tight, upward-sloping pattern indicates a strong positive
correlation, meaning as one variable increases, the other consistently increases in a
predictable manner.
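This can be illustrated with simulated data (the numbers below are arbitrary): a tight, upward-sloping cloud produces a correlation close to +1.
r
set.seed(2)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 2)  # small noise gives a tight, upward-sloping pattern
plot(x, y)
cor(x, y)                       # close to +1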
12. Trend Line Addition
Question: "Add a trend line to a scatter plot"
Command:
r
plot(hours, scores)
abline(lm(scores ~ hours), col="red")
Output: (Scatter plot with red regression line)
Interpretation: The trend line shows the average relationship between variables. A
steep upward slope confirms the strong positive correlation observed in the scatter
plot.
Lec. 7: Statistical Learning & Linear Regression
1. Simple Linear Regression Model Fitting
Question: "Fit a linear regression model to predict heating cost (Y) from outside
temperature (X) using the given data"
Command:
r
X <- c(24,47,50,61,74,85,89,91,95,90,99)
Y <- c(10,19,20,20,20,23,23,29,29,32,40)
model <- lm(Y ~ X)
model
Output:
text
Call:
lm(formula = Y ~ X)
Coefficients:
(Intercept) X
3.3875 0.2829
Interpretation: The fitted regression equation is Ŷ = 3.3875 + 0.2829X. This means
for every 1-degree increase in temperature, heating cost increases by $0.28.
2. Extracting Model Coefficients
Question: "Extract the intercept and slope coefficients from the fitted model"
Command: coef(model)
Output: (Intercept) = 3.3875, X = 0.2829
Interpretation: The intercept (3.3875) represents the base heating cost when
temperature is 0°C, and the slope (0.2829) shows the rate of change in heating cost
per degree temperature.
3. Getting Fitted Values
Question: "Get the predicted heating costs for all observed temperatures"
Command: fitted(model)
Output: 10.17719, 16.68398, 17.53269, 20.64463, 24.32238, 27.43432,
28.56593, 29.13174, 30.26335, 28.84883, 31.39497
Interpretation: These are the heating cost values predicted by our regression model
for each observed temperature.
4. Creating Scatter Plot with Regression Line
Question: "Create a scatter plot of temperature vs heating cost with the regression
line"
Command:
r
plot(X, Y, xlab="Temperature", ylab="Heating Cost", main="Regression Line")
abline(model, col="red", lwd=2)
Output: (Scatter plot with red regression line)
Interpretation: The plot visually shows the positive relationship between
temperature and heating cost, with the red line representing the best-fit linear
relationship.
5. Getting Model Summary
Question: "Get detailed summary of the regression model"
Command: summary(model)
Output: (Detailed output with coefficients, R-squared, F-statistic, p-values)
Interpretation: The summary provides comprehensive statistics including coefficient
significance, model fit measures, and overall model significance.
6. Extracting R-squared Value
Question: "What percentage of variation in heating cost is explained by
temperature?"
Command: summary(model)$r.squared
Output: 0.7397
Interpretation: 73.97% of the variation in heating cost is explained by temperature,
indicating a strong relationship.
7. Extracting Adjusted R-squared
Question: "Get the adjusted R-squared value"
Command: summary(model)$adj.r.squared
Output: 0.7108
Interpretation: The adjusted R-squared (71.08%) accounts for the number of
predictors, providing a more conservative measure of model fit.
8. Checking Model Significance
Question: "Is the overall regression model statistically significant?"
Command: Look at F-statistic p-value in summary(model)
Output: F-statistic: 25.58, p-value: 0.0006833
Interpretation: With p-value < 0.05, the model is statistically significant, meaning
temperature has a significant impact on heating cost.
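If you prefer to extract the F-statistic and its p-value programmatically rather than reading the printed summary, one possible approach is:
r
fstat <- summary(model)$fstatistic   # named vector: value, numdf, dendf
fstat
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)  # overall p-value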
9. Making Predictions with New Data
Question: "Predict heating cost when temperature is 30 degrees"
Command: predict(model, data.frame(X=30))
Output: 11.87462
Interpretation: When the outside temperature is 30°C, the predicted heating cost is
$11.87.
10. Checking Residuals
Question: "Get the residuals (errors) of the model"
Command: residuals(model)
Output: -0.17719, 2.31602, 2.46731, -0.64463, -4.32238, -4.43432, -5.56593, -0.13174, -1.26335, 3.15117, 8.60503
Interpretation: Residuals show the difference between actual and predicted values.
Positive residuals indicate underestimation, negative indicate overestimation.
11. Coefficient Significance Test
Question: "Is the temperature coefficient statistically significant?"
Command: Check p-value for X in summary(model)$coefficients
Output: p-value: 0.000683
Interpretation: With p-value < 0.05, the temperature coefficient is statistically
significant, confirming temperature's real effect on heating cost.
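To pull the slope's p-value directly from the coefficient table instead of reading it off the summary, a minimal sketch:
r
coef_table <- summary(model)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
coef_table["X", "Pr(>|t|)"]                # p-value for the temperature coefficient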
12. Supervised Learning Concept
Question: "What type of supervised learning problem is this?"
Command: (Conceptual - no R code)
Output: Regression problem
Interpretation: Since the response variable (heating cost) is numerical/continuous,
this is a regression problem in supervised learning.
Lec. 8: Regression Diagnostics
1. Checking Linear Relationship
Question: "Create a scatter plot to check if there's a linear relationship between
temperature (X) and heating cost (Y)"
Command:
r
X <- c(24,47,50,61,74,85,89,91,95,90,99)
Y <- c(10,19,20,20,20,23,23,29,29,32,40)
plot(X, Y, xlab="Temperature", ylab="Heating Cost")
Output: (Scatter plot showing data points)
Interpretation: The scatter plot helps visualize the relationship. A linear pattern
suggests linearity assumption is met, while curved patterns indicate potential issues.
2. Verifying Mean of Errors is Zero
Question: "Verify that the mean of regression errors is approximately zero"
Command:
r
model <- lm(Y ~ X)
errors <- residuals(model)
mean(errors)
Output: -1.850372e-15 (effectively zero)
Interpretation: The mean of errors is effectively zero (due to floating-point
precision), which satisfies this regression assumption.
3. Checking Homoscedasticity
Question: "Create a residual plot to check for constant error variance
(homoscedasticity)"
Command:
r
fitted_values <- fitted(model)
plot(fitted_values, errors, xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")
Output: (Residuals vs fitted values plot)
Interpretation: If points are randomly scattered around the red line with no pattern
(like a fan shape), homoscedasticity assumption is met. Patterns suggest
heteroscedasticity.
4. Normality Check - Histogram
Question: "Check if residuals follow normal distribution using histogram"
Command:
r
hist(errors, xlab="Residuals", main="Histogram of Residuals")
Output: (Histogram of residual distribution)
Interpretation: A bell-shaped histogram suggests normal distribution of errors.
Skewed distributions indicate violation of normality assumption.
5. Normality Check - Density Plot
Question: "Create a density plot overlay to check normality"
Command:
r
hist(errors, freq=FALSE, main="Density Plot of Residuals", xlab="Residuals")
lines(density(errors), col="red", lwd=2)
Output: (Histogram with smooth density curve)
Interpretation: The red density curve should follow the normal bell shape. Deviations
from this shape indicate non-normal errors.
6. Normality Check - QQ Plot
Question: "Use QQ plot to check if residuals follow normal distribution"
Command:
r
qqnorm(errors)
qqline(errors, col="red")
Output: (Quantile-Quantile plot)
Interpretation: If points follow the red line closely, residuals are normally distributed.
Points deviating from the line suggest non-normality.
7. Outlier Detection - Boxplot
Question: "Detect outliers in residuals using boxplot"
Command: boxplot(errors, main="Boxplot of Residuals")
Output: (Boxplot showing median, quartiles, and potential outliers)
Interpretation: Points outside the whiskers are potential outliers that may unduly
influence the regression results.
8. Extracting Residuals
Question: "Extract residuals from the fitted regression model"
Command: residuals(model)
Output: -0.17719, 2.31602, 2.46731, -0.64463, -4.32238, -4.43432, -5.56593, -0.13174, -1.26335, 3.15117, 8.60503
Interpretation: These values show how much each actual observation differs from its
predicted value. Large absolute values indicate poor predictions.
9. Sum of Residuals Check
Question: "Verify that the sum of residuals equals zero"
Command: sum(residuals(model))
Output: -2.035409e-15 (effectively zero)
Interpretation: The sum of residuals is effectively zero, which is a mathematical
property of ordinary least squares regression.
10. Residual Pattern Analysis
Question: "Are there any patterns in the residuals that suggest model problems?"
Command: (Visual analysis of residual plot)
Output: (No code - interpretation of plot)
Interpretation: Random scatter indicates good model fit. Patterns (like curves,
funnels, or trends) suggest missing variables, non-linearity, or heteroscedasticity.
11. Normality Test - Statistical
Question: "Perform Shapiro-Wilk test for normality of residuals"
Command: shapiro.test(errors)
Output: W = 0.912, p-value = 0.265
Interpretation: With p-value > 0.05, we cannot reject the null hypothesis of
normality, suggesting residuals are approximately normally distributed.
12. Multicollinearity Check (Conceptual)
Question: "What diagnostic checks multicollinearity in multiple regression?"
Command: (Conceptual - for multiple regression)
Output: Variance Inflation Factor (VIF)
Interpretation: In multiple regression, VIF > 10 indicates high multicollinearity, which
can make coefficient estimates unstable.
Lec. 9: Weighted Least Squares (WLS)
1. Reading Salary Data
Question: "Read the salary data from CSV file and extract starting and mid-career
salary columns"
Command:
r
salary_data <- read.csv("SALARY.csv")
X <- salary_data$Starting.Salary
Y <- salary_data$MidCareer.Salary
head(salary_data)
Output: (Displays first few rows of the dataset)
text
Starting.Salary MidCareer.Salary
1 56700 117000
2 51400 91100
3 46300 88800
4 41500 88000
5 39200 87100
6 39000 87000
Interpretation: The data shows starting salaries and corresponding mid-career
salaries for college graduates, which we'll use for regression analysis.
2. Initial Scatter Plot
Question: "Create a scatter plot to examine the relationship between starting and
mid-career salaries"
Command: plot(X, Y, xlab="Starting Salary", ylab="Mid-Career Salary")
Output: (Scatter plot showing data points)
Interpretation: The plot shows a positive relationship but potential non-linearity and
heteroscedasticity (increasing spread as starting salary increases).
3. Ordinary Least Squares (OLS) Regression
Question: "Fit a simple linear regression model using OLS"
Command:
r
model_ols <- lm(Y ~ X)
summary(model_ols)
Output: (OLS regression output with coefficients, R-squared, etc.)
Interpretation: The OLS model provides initial estimates, but we need to check if
assumptions are violated.
4. Residual Analysis for OLS
Question: "Plot residuals against fitted values to check for heteroscedasticity"
Command:
r
residuals_ols <- residuals(model_ols)
fitted_ols <- fitted(model_ols)
plot(fitted_ols, residuals_ols, xlab="Fitted Values", ylab="Residuals")
abline(h=0, col="red")
Output: (Residual plot showing funnel shape)
Interpretation: The funnel pattern indicates heteroscedasticity - variance increases
with fitted values, violating OLS assumptions.
5. Weighted Least Squares (WLS) Implementation
Question: "Fit WLS model using weights = 1/X to address heteroscedasticity"
Command:
r
model_wls <- lm(Y ~ X, weights = 1/X)
summary(model_wls)
Output:
text
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3386.441 2915.000 -1.162 0.246
X 1.831 0.070 26.094 <2e-16 ***
Interpretation: WLS gives more efficient estimates. The slope (1.831) is highly
significant (p < 0.001), confirming that higher starting salaries predict higher mid-career
salaries.
6. WLS Coefficient Interpretation
Question: "Interpret the WLS regression coefficient for starting salary"
Command: (Interpretation of output from previous command)
Output: Slope coefficient = 1.831
Interpretation: For every $1 increase in starting salary, mid-career salary increases by
$1.83 on average. This relationship is statistically significant.
7. Prediction with WLS Model
Question: "Predict mid-career salary for a $10,000 increase in starting salary"
Command:
r
prediction <- 1.831 * 10000
prediction
Output: 18310
Interpretation: A $10,000 increase in starting salary predicts an increase of about $18,310 in
mid-career salary based on the WLS model; the change depends only on the slope, not the intercept.
8. Confidence Intervals for WLS Coefficients
Question: "Get 95% confidence intervals for WLS coefficients"
Command: confint(model_wls)
Output:
text
2.5 % 97.5 %
(Intercept) -9110.628 2337.746
X 1.693 1.969
Interpretation: We're 95% confident that the true slope coefficient lies between
1.693 and 1.969, meaning each $1 increase in starting salary increases mid-career
salary by $1.69 to $1.97.
9. Comparing OLS and WLS Residuals
Question: "Compare the residual patterns between OLS and WLS"
Command:
r
residuals_wls <- residuals(model_wls)
fitted_wls <- fitted(model_wls)
plot(fitted_wls, residuals_wls, xlab="Fitted Values (WLS)", ylab="Residuals (WLS)")
abline(h=0, col="red")
Output: (WLS residual plot)
Interpretation: The WLS residuals should show more constant variance
(homoscedasticity) compared to the OLS residual plot, indicating better model fit.
10. Weight Calculation Concept
Question: "Why use weights = 1/X for this salary data?"
Command: (Conceptual explanation)
Output: No R code - theoretical understanding
Interpretation: We use weights = 1/X because variance appears to increase with X.
This gives less weight to observations with higher variance (higher starting salaries)
and more weight to more precise observations.
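One informal way to check the assumption behind weights = 1/X is to plot the absolute OLS residuals against X; a roughly increasing spread supports variance growing with X. This diagnostic is a suggested addition, not part of the original analysis:
r
# If Var(error) grows with X, |residual| should tend to increase with X
plot(X, abs(residuals(model_ols)),
     xlab = "Starting Salary", ylab = "|OLS residual|",
     main = "Residual spread vs. X")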
11. Hypothesis Testing in WLS
Question: "Test if starting salary has a significant positive effect on mid-career salary"
Command: Look at t-value and p-value for X in summary(model_wls)
Output: t = 26.094, p < 2e-16
Interpretation: With t-value = 26.094 and p-value < 0.001, we reject the null
hypothesis and conclude that starting salary has a statistically significant positive
effect on mid-career salary.
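The t-value and p-value can also be extracted directly from the WLS coefficient table; a minimal sketch:
r
coef_wls <- summary(model_wls)$coefficients
coef_wls["X", c("t value", "Pr(>|t|)")]  # slope t-statistic and p-value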
12. R-squared Comparison
Question: "Compare R-squared values between OLS and WLS models"
Command: Compare summary(model_ols)$r.squared and summary(model_wls)$r.squared
Output: OLS R² vs WLS R²
Interpretation: While R-squared may not be directly comparable between OLS and WLS, the primary benefit of WLS is obtaining more efficient (reliable) parameter
estimates when heteroscedasticity is present.
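One way to print the two R-squared values side by side (keeping in mind that the WLS value is a weighted R-squared):
r
c(OLS = summary(model_ols)$r.squared,
  WLS = summary(model_wls)$r.squared)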
Lec. 10: Multiple Linear Regression
1. Data Input for Multiple Regression
Question: "Create a data frame with production data containing one response
variable (Y) and five predictors (X1-X5)"
Command:
r
Y <- c(18.6,19.3,19.9,20.7,20.7,20.3,20.8,21.4,22,22.6,23.2,23.8,24.4,25.1,25.8,26.5,27.2,27.9)
X1 <- c(23.3,24,24.8,25.5,26.3,27.3,28.2,29,29.9,30.8,31.7,23.7,33.7,34.6,35.6,36.6,37.7,38.8)
X2 <- c(22.4,22.8,23.3,23.8,24.2,29.6,30.7,31.8,33,34.3,35.6,36.9,38.3,39.7,41.2,42.8,44.4,46.1)
X3 <- c(49.1,50.9,52.8,54.7,56.7,53.8,55.2,56.7,58.3,59.9,61.5,63.1,64.9,66.6,68.4,70.3,72.2,74.1)
X4 <- c(24.2,24.4,24.6,24.7,24.9,26.5,26.8,27.1,27.4,27.8,28.1,28.4,28.8,29.1,29.4,29.8,30.1,30.5)
X5 <- c(292.4,330,346.1,352.6,372,433.8,477,518,562,610,663,721,785,855,932,1016,1108,1210)
production_data <- data.frame(Y, X1, X2, X3, X4, X5)
head(production_data)
Output:
text
Y X1 X2 X3 X4 X5
1 18.6 23.3 22.4 49.1 24.2 292.4
2 19.3 24.0 22.8 50.9 24.4 330.0
3 19.9 24.8 23.3 52.8 24.6 346.1
4 20.7 25.5 23.8 54.7 24.7 352.6
5 20.7 26.3 24.2 56.7 24.9 372.0
6 20.3 27.3 29.6 53.8 26.5 433.8
Interpretation: The data frame contains production output (Y) and five potential
predictor variables for multiple regression analysis.
2. Checking Linear Relationships
Question: "Create scatter plots to check linear relationships between Y and each
predictor variable"
Command:
r
par(mfrow=c(2,3))
plot(X1, Y, main="Y vs X1")
plot(X2, Y, main="Y vs X2")
plot(X3, Y, main="Y vs X3")
plot(X4, Y, main="Y vs X4")
plot(X5, Y, main="Y vs X5")
par(mfrow=c(1,1))
Output: (Multiple scatter plots in a 2x3 grid)
Interpretation: Visual inspection shows positive linear relationships between Y and
most predictors, supporting the linearity assumption for multiple regression.
3. Fitting Multiple Linear Regression Model
Question: "Fit a multiple linear regression model with Y as response and X1-X5 as
predictors"
Command:
r
mlr_model <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data=production_data)
mlr_model
Output:
text
Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5, data = production_data)
Coefficients:
(Intercept) X1 X2 X3 X4 X5
-1.652630 0.092837 0.045452 0.006836 0.076773 0.000026
Interpretation: The multiple regression equation is: Ŷ = -1.6526 + 0.0928X1 +
0.0455X2 + 0.0068X3 + 0.0768X4 + 0.000026X5
4. Comprehensive Model Summary
Question: "Get detailed summary of the multiple regression model including R
squared and p-values"
Command: summary(mlr_model)
Output: (Detailed output with coefficients, standard errors, t-values, p-values, R-squared, F-statistic)
Interpretation: The summary provides complete diagnostic information including
which predictors are statistically significant and overall model fit measures.
5. Extracting R-squared Value
Question: "What percentage of variation in production output is explained by all
predictors?"
Command: summary(mlr_model)$r.squared
Output: 0.9989
Interpretation: 99.89% of the variation in production output (Y) is explained by the
five predictor variables, indicating an excellent model fit.
6. Checking Multicollinearity - Correlation Matrix
Question: "Check correlation between predictor variables to detect multicollinearity"
Command:
r
predictors <- production_data[,2:6]
cor(predictors)
Output: (5x5 correlation matrix showing relationships between X1-X5)
Interpretation: High correlations (close to 1 or -1) between predictors indicate
potential multicollinearity problems.
7. Checking Multicollinearity - VIF Calculation
Question: "Calculate Variance Inflation Factor (VIF) to quantify multicollinearity"
Command:
r
install.packages("car")
library(car)
vif(mlr_model)
Output: VIF values for each predictor
Interpretation: VIF > 10 indicates serious multicollinearity. Values between 1-5
suggest moderate correlation, while values close to 1 indicate no multicollinearity.
8. Confidence Intervals for Coefficients
Question: "Get 95% confidence intervals for all regression coefficients"
Command: confint(mlr_model)
Output:
text
2.5 % 97.5 %
(Intercept) -2.32079333 -0.98446608
X1 0.07567697 0.10999703
X2 0.02408476 0.06681882
X3 -0.00592657 0.01959824
X4 -0.00752606 0.16107231
X5 -0.00047775 0.00052975
Interpretation: We can be 95% confident that the true coefficient for X1 lies between
0.0757 and 0.1100. Intervals containing zero (like X3, X4, X5) suggest those predictors
may not be statistically significant.
9. Making Predictions with New Data
Question: "Predict production output when X1=30, X2=35, X3=60, X4=28, X5=700"
Command:
r
new_data <- data.frame(X1=30, X2=35, X3=60, X4=28, X5=700)
predict(mlr_model, newdata=new_data)
Output: 22.18267
Interpretation: The predicted production output for the given input values is 22.18
units.
10. Model Significance Test
Question: "Is the overall multiple regression model statistically significant?"
Command: Check F-statistic p-value in summary(mlr_model)
Output: F-statistic and p-value
Interpretation: A significant F-statistic (p < 0.05) indicates that the model explains a
significant amount of variation in Y beyond what would be expected by chance.
11. Individual Coefficient Significance
Question: "Which predictors are statistically significant in the model?"
Command: Check p-values for each coefficient
in summary(mlr_model)$coefficients
Output: p-values for each predictor
Interpretation: Predictors with p-values < 0.05 are statistically significant. Non-significant
predictors may be candidates for removal from the model.
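A short sketch for listing the p-values and picking out the significant predictors from the coefficient table (the 0.05 cutoff is the conventional choice used above):
r
coef_tab <- summary(mlr_model)$coefficients
coef_tab[, "Pr(>|t|)"]                             # p-value for every term
rownames(coef_tab)[coef_tab[, "Pr(>|t|)"] < 0.05]  # names of the significant terms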
12. Multiple Regression Concept
Question: "What is the key difference between simple and multiple linear
regression?"
Command: (Conceptual - no R code)
Output: Theoretical understanding
Interpretation: Multiple regression considers multiple predictors simultaneously,
allowing us to understand the unique contribution of each variable while controlling
for others, unlike simple regression which only considers one predictor at a time.
Lec. 11: Polynomial Regression
1. Data Input for Polynomial Regression
Question: "Create vectors for X and Y data: X = [0,1,2,3,4,5,6,7,8,9], Y =
[1,1.5,1.8,2.3,2.9,4,5,6.5,8,10]"
Command:
r
X <- c(0,1,2,3,4,5,6,7,8,9)
Y <- c(1,1.5,1.8,2.3,2.9,4,5,6.5,8,10)
Output: (Vectors stored in memory)
Interpretation: The data shows a clear non-linear relationship where Y increases at
an accelerating rate as X increases, making it suitable for polynomial regression.
2. Checking Relationship with Scatter Plot
Question: "Create a scatter plot to visualize the relationship between X and Y"
Command: plot(X, Y, main="Scatter Plot of X vs Y")
Output: (Scatter plot showing curved pattern)
Interpretation: The scatter plot reveals a clear non-linear (curved) relationship,
indicating that polynomial regression would be more appropriate than simple linear
regression.
3. Quadratic Polynomial Regression (Degree 2)
Question: "Fit a quadratic polynomial model: Ŷ = a + b₁X + b₂X²"
Command:
r
X_sq <- X^2
model_quad <- lm(Y ~ X + X_sq)
model_quad
Output:
text
Coefficients:
(Intercept) X X_sq
1.19455 0.03758 0.10303
Interpretation: The quadratic model is Ŷ = 1.195 + 0.038X + 0.103X². The positive X²
coefficient (0.103) confirms the upward-curving relationship.
4. Cubic Polynomial Regression (Degree 3)
Question: "Fit a cubic polynomial model: Ŷ = a + b₁X + b₂X² + b₃X³"
Command:
r
X_cub <- X^3
model_cubic <- lm(Y ~ X + X_sq + X_cub)
model_cubic
Output:
text
Coefficients:
(Intercept) X X_sq X_cub
1.069231 0.266822 0.035897 0.004973
Interpretation: The cubic model is Ŷ = 1.069 + 0.267X + 0.036X² + 0.005X³, adding a
cubic term to capture more complex curvature.
5. Model Comparison using R-squared
Question: "Compare quadratic and cubic models using R-squared values"
Command:
r
quad_r2 <- summary(model_quad)$r.squared
cubic_r2 <- summary(model_cubic)$r.squared
c(quad_r2, cubic_r2)
Output: 0.9984 0.9993
Interpretation: The cubic model has higher R-squared (99.93% vs 99.84%), indicating
it explains slightly more variation in Y, but both models fit very well.
6. Quadratic Model with Deviation Form
Question: "Fit quadratic model using deviation form: Ŷ = b₀ + b₁(X-mean(X)) + b₂(X
mean(X))²"
Command:r
D <- X - mean(X)
D_sq <- D^2
model_dev_quad <- lm(Y ~ D + D_sq)
model_dev_quad
Output:
text
Coefficients:
(Intercept) D D_sq
3.450000 0.964848 0.103030
Interpretation: The deviation form model is Ŷ = 3.45 + 0.965(X-4.5) + 0.103(X-4.5)².
This form centers the data around the mean, which can improve numerical stability.
7. Cubic Model with Deviation Form
Question: "Fit cubic model using deviation form: Ŷ = b₀ + b₁(X-mean(X)) + b₂(X
mean(X))² + b₃(X-mean(X))³"
Command:
r
D_cub <- D^3
model_dev_cubic <- lm(Y ~ D + D_sq + D_cub)
model_dev_cubic
Output:
text
Coefficients:
(Intercept) D D_sq D_cub
3.450000 0.891997 0.103030 0.004973
Interpretation: The cubic deviation model is Ŷ = 3.45 + 0.892(X-4.5) + 0.103(X-4.5)²
+ 0.005(X-4.5)³, providing the same fit as the regular cubic but with centered
predictors.
8. Coefficient Significance Check
Question: "Check which coefficients are statistically significant in the cubic model"
Command: summary(model_cubic)$coefficients
Output: (Coefficient table with p-values)
Interpretation: The cubic term (X³) has p-value = 0.0296 < 0.05, indicating it's
statistically significant and justifies using the more complex cubic model over
quadratic.
9. Residual Analysis for Model Selection
Question: "Compare residuals of quadratic vs cubic models to choose the better fit"
Command:
r
quad_resid <- resid(model_quad)
cubic_resid <- resid(model_cubic)
c(mean(quad_resid), mean(cubic_resid))
Output: -2.775558e-17 -1.387779e-17 (both effectively zero)
Interpretation: Both models have residuals centered around zero, but the cubic
model typically shows smaller residual variation, indicating better fit.
10. Making Predictions with Polynomial Model
Question: "Predict Y when X=5.5 using the cubic polynomial model"
Command:
r
new_X <- 5.5
prediction <- predict(model_cubic, data.frame(X=5.5, X_sq=5.5^2, X_cub=5.5^3))
prediction
Output: 4.45
Interpretation: When X=5.5, the predicted Y value is approximately 4.45 based on the cubic
polynomial model.
11. Visualizing Polynomial Fit
Question: "Create a plot showing data points with fitted quadratic and cubic curves"
Command:
r
plot(X, Y, main="Polynomial Regression Fit")
curve(1.19455 + 0.03758*x + 0.10303*x^2, add=TRUE, col="red", lwd=2)
curve(1.069231 + 0.266822*x + 0.035897*x^2 + 0.004973*x^3, add=TRUE, col="blue", lwd=2)
legend("topleft", legend=c("Quadratic", "Cubic"), col=c("red", "blue"), lwd=2)
Output: (Scatter plot with red quadratic curve and blue cubic curve)
Interpretation: The plot visually shows how both polynomial curves fit the data, with
the cubic curve (blue) potentially fitting the curvature better, especially at higher X
values.
12. Polynomial Regression Concept
Question: "When should you use polynomial regression instead of linear regression?"
Command: (Conceptual - no R code)
Output: Theoretical understanding
Interpretation: Use polynomial regression when the relationship between variables
shows curvature (non-linearity) that can't be captured by a straight line, such as
accelerating growth, diminishing returns, or U-shaped relationships.
Lec. 12: Regularization Techniques
1. Loading Built-in Dataset
Question: "Load the mtcars dataset and view its structure"
Command:
r
data(mtcars)
head(mtcars)
Output:
text
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Interpretation: The mtcars dataset contains information about 32 cars with 11
variables including mpg (miles per gallon) as the response variable and other car
characteristics as predictors.
2. Variable Extraction
Question: "Extract mpg as response variable and other variables as predictors"
Command:
r
y <- mtcars$mpg
x1 <- mtcars$cyl
x2 <- mtcars$disp
x3 <- mtcars$hp
x4 <- mtcars$drat
x5 <- mtcars$wt
x6 <- mtcars$qsec
x7 <- mtcars$vs
x8 <- mtcars$am
x9 <- mtcars$gear
x10 <- mtcars$carb
Output: (Variables stored in memory)
Interpretation: Successfully extracted the response variable (mpg) and 10 predictor
variables for regression analysis.
3. Correlation Analysis for Multicollinearity
Question: "Check correlation between predictor variables to detect multicollinearity"
Command:
r
predictors <- data.frame(x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)
cor_matrix <- cor(predictors)
round(cor_matrix, 3)
Output: (10x10 correlation matrix showing relationships between predictors)
Interpretation: High correlations (close to 1 or -1) between variables like cyl, disp,
and hp indicate potential multicollinearity problems.
4. VIF Analysis for Multicollinearity
Question: "Calculate Variance Inflation Factor (VIF) to quantify multicollinearity"
Command:
r
library(car)
mlr_model <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10)
vif_values <- vif(mlr_model)
vif_values
Output: VIF values for each predictor variable
Interpretation: VIF > 10 indicates serious multicollinearity. Variables with high VIF
values should be considered for removal or regularization.
5. Data Preparation for Ridge Regression
Question: "Prepare scaled predictor matrix for ridge regression"
Command:
r
X_matrix <- as.matrix(predictors)
X_scaled <- scale(X_matrix)
y_vector <- as.vector(y)
Output: (Scaled predictor matrix and response vector)
Interpretation: Scaling predictors ensures all variables are on comparable scales,
which is important for ridge regression penalty terms.
6. Finding Optimal Ridge Parameter (Lambda)
Question: "Find optimal lambda value for ridge regression using cross-validation"
Command:
r
library(glmnet)
lambda_seq <- 10^seq(5, -2, length = 100)
ridge_cv <- cv.glmnet(X_scaled, y_vector, alpha = 0, lambda = lambda_seq)
best_lambda <- ridge_cv$lambda.min
best_lambda
Output: 15.84893 (example optimal lambda value)
Interpretation: The optimal lambda (15.85) minimizes cross-validation error,
balancing bias and variance in the ridge regression model.
7. Fitting Ridge Regression Model
Question: "Fit ridge regression model with optimal lambda value"
Command:
r
ridge_model <- glmnet(X_scaled, y_vector, alpha = 0, lambda = best_lambda)
coef(ridge_model)
Output: Ridge regression coefficients for all predictors
Interpretation: Ridge regression shrinks coefficients toward zero but doesn't
eliminate any variables completely, helping reduce multicollinearity effects.
8. Making Predictions with Ridge Model
Question: "Make predictions using the fitted ridge regression model"
Command:
r
y_pred <- predict(ridge_model, newx = X_scaled)
head(y_pred)
Output: Predicted mpg values for all cars in the dataset
Interpretation: These are the fitted values from the ridge regression model, which
should be more stable than OLS when multicollinearity is present.
9. Calculating R-squared for Model Accuracy
Question: "Calculate R-squared to measure model accuracy"
Command:
r
library(MLmetrics)
R2_Score(y_pred, y_vector)
Output: 0.8619
Interpretation: The ridge regression model explains 86.19% of the variance in mpg,
indicating good model fit despite regularization.
10. Calculating RMSE for Prediction Error
Question: "Calculate Root Mean Squared Error (RMSE) to measure prediction
accuracy"
Command: RMSE(y_pred, y_vector)
Output: 2.204
Interpretation: The average prediction error is 2.204 mpg, meaning predictions are
typically within ±2.2 mpg of actual values.
11. Comparing with OLS Regression
Question: "Compare ridge regression coefficients with OLS coefficients"
Command:
r
ols_coef <- coef(lm(y ~ X_scaled))
ridge_coef <- as.vector(coef(ridge_model))
comparison <- data.frame(OLS = ols_coef, Ridge = ridge_coef)
comparison
Output: Side-by-side comparison of OLS and ridge coefficients
Interpretation: Ridge coefficients are shrunk toward zero compared to OLS, reducing
their variance and making them more stable.
12. Regularization Concept
Question: "What is the main purpose of ridge regression?"
Command: (Conceptual - no R code)
Output: Theoretical understanding
Interpretation: Ridge regression addresses multicollinearity by adding a penalty
(lambda) to the regression coefficients, shrinking them toward zero to reduce
variance and improve model stability, at the cost of introducing some bias.
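To see the shrinkage effect directly, one can refit the model at a few lambda values and compare the coefficient columns; this sketch reuses X_scaled and y_vector from earlier, and the lambda values are arbitrary:
r
library(glmnet)
lambdas <- c(100, 1, 0.01)                                # large to small penalty
shrink_demo <- glmnet(X_scaled, y_vector, alpha = 0, lambda = lambdas)
coef(shrink_demo)  # one column per lambda; coefficients move toward zero as lambda grows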