Here are instructions on how to use R for statistics:

R Instructions

Introductory Instructions:

Download R
- http://www.r-project.org/ or http://sourceforge.net/projects/rportable/
- NOTE: The two folders (R-Portable and work) and the R-Start.exe link MUST be in the root folder of the USB drive.
Start R-portable
- Double-click R-Start.exe
- NOTE: This will automatically load custom functions that will be used during the course.
Find information about any functions
- ?functionname
Load saved work
- load("name.Rdata")
Load custom function
- source("functionname.R")
Load dataset in .csv file located in working directory (loads ALL variables and data and stores into newvarname)
- newvarname = load.data("filename.csv")
Identify all variables in dataset
- names(newvarname)
Load one variable from dataset and store into new variable
- singlevarname = newvarname$var
Load data into new variable
- singlevarname = c(x1, x2, x3, etc) where x1, x2, x3, etc. are quantitative data; or
- singlevarname = c("d1", "d2", "d3", etc) where "d1", "d2", "d3", etc. are qualitative data
Repeat function several times
- do(n)*function(arguments) where n is the number of repetitions
Make comments within code
- # this is a comment
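
A minimal sketch of the loading steps above. load.data() appears to be one of the custom course functions, so this example assumes base R's read.csv() as a stand-in; the file contents and the column names (height, group) are made up for illustration.

```r
# Write a tiny made-up dataset to a temporary .csv file
csvfile = tempfile(fileext = ".csv")
write.csv(data.frame(height = c(150, 160, 170),
                     group  = c("a", "b", "a")),
          csvfile, row.names = FALSE)

# Load ALL variables and data into newvarname (base R stand-in for load.data)
newvarname = read.csv(csvfile)

# Identify all variables in the dataset
names(newvarname)

# Load one variable from the dataset and store it into a new variable
singlevarname = newvarname$height
singlevarname
```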

Section 1.3: Simple Random Sampling

sample(singlevarname, n, replace=FALSE) where n is sample size
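
A short sketch of a simple random sample without replacement. The population (the integers 1 to 100), the sample size, and the seed are made-up values chosen only so the draw can be repeated.

```r
set.seed(1)                # arbitrary seed so the same sample is drawn each time
singlevarname = 1:100      # made-up population of 100 values
n = 10                     # sample size

# Draw n values without replacement, so no value can appear twice
srs = sample(singlevarname, n, replace=FALSE)
srs
```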

Section 1.4: Other Effective Sampling Methods

Systematic sample
- systematic(singlevarname, n) where n is sample size
Stratified sample
- stratified(varname, strataVarname, n) where varname is variable containing the population, strataVarname is the variable containing the levels (i.e., strata), and n is sample size

Section 2.1: Organizing Qualitative Data

Construct frequency distribution
- table(singlevarname)
Construct relative frequency distribution
- table(singlevarname)/length(singlevarname)
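
As an illustration of the two distributions above, here is a small sketch using made-up qualitative data (colors); the object names freq and relfreq are hypothetical.

```r
# Made-up qualitative data
singlevarname = c("red", "blue", "red", "green", "red", "blue")

# Frequency distribution: count of each category
freq = table(singlevarname)
freq

# Relative frequency distribution: each count divided by the number of observations
relfreq = freq/length(singlevarname)
relfreq

# The relative frequencies should add to 1
sum(relfreq)
```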

Section 2.1: Organizing Qualitative Data: The Popular Displays

Construct frequency bar graph – note: the title and labels can be written in any order
- barplot(table(singlevarname), xlab="variable name", ylab="frequency", main="Long Descriptive Title")
Construct relative frequency bar graph – note: the title and labels can be written in any order
- barplot(table(singlevarname)/length(singlevarname), xlab="variable name", ylab="relative frequency", main="Long Descriptive Title")
Construct Pareto chart – note: the title and labels can be written in any order
- pareto.chart(table(singlevarname), xlab="variable name", ylab="relative frequency", main="Long Descriptive Title")
Construct side-by-side frequency bar graph – note: the title and labels can be written in any order
- barplot(xtabs(~ levelvariable + singlevarname), xlab="variable name", ylab="frequency", main="Long Descriptive Title", beside = TRUE, legend.text = levels(levelvariable), col = c("colorname1", "colorname2", etc.))

Section 2.2: Organizing Quantitative Data: The Popular Displays

Construct histogram – note: the title and labels can be written in any order
- hist(singlevarname, xlab="variable name", main="Long Descriptive Title")
Construct relative frequency histogram – note: the title and labels can be written in any order
- hist(singlevarname, freq=FALSE, xlab="variable name", ylab="relative frequency", main="Long Descriptive Title")
Construct histogram with different starting value and classwidth – note: the title and labels can be written in any order
- hist(singlevarname, xlab="variable name", main="Long Descriptive Title", breaks = c(b1, b2, b3, b4, etc)) where b1, b2, b3, etc. are lower/upper class limits
Construct stemplot
- stem(singlevarname, scale=value) note: use scale=10 to move decimal place one place to left in stemplot; use scale=100 to move decimal place two places to left in stemplot; use scale=0.1 to move decimal place one place to right in stemplot; etc.
Construct back-n-back stemplot
- stem.leaf.backback(var1, var2, back.to.back=T, unit=value, m=1) note: use unit=10 to move decimal place one place to left in stemplot; use unit=100 to move decimal place two places to left in stemplot; use unit=0.1 to move decimal place one place to right in stemplot; etc.

Section 3.1: Measures of Central Tendency

[Arithmetic] Mean
- sum(singlevarname)/length(singlevarname); or
- mean(singlevarname)
Median
- median(singlevarname)

Section 3.2: Measures of Dispersion [a.k.a., Spread]

Range
- max(singlevarname) - min(singlevarname)
Population Variance, σ²
- sum((mean(singlevarname)-singlevarname)^2)/length(singlevarname)
Sample Variance, s²
- var(singlevarname)
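Population Standard Deviation, σ
- sqrt(sum((mean(singlevarname)-singlevarname)^2)/length(singlevarname)); or
- sigma(singlevarname) – note: sigma() as used here is presumably one of the custom course functions; base R's own sigma() is meant for fitted models, not raw data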
Sample Standard Deviation [a.k.a., sd], s
- sqrt(var(singlevarname)); or
- sd(singlevarname)
Determine if data is normally-distributed
- qqnorm(singlevarname)
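
The difference between the population and sample formulas above can be seen on a small made-up dataset. This sketch uses only base R; the names popvar and popsd are hypothetical.

```r
# Made-up data with mean 5
singlevarname = c(2, 4, 4, 4, 5, 5, 7, 9)

# Population variance: divide the sum of squared deviations by N
popvar = sum((mean(singlevarname)-singlevarname)^2)/length(singlevarname)
popvar          # 4

# Population standard deviation
popsd = sqrt(popvar)
popsd           # 2

# Sample variance and standard deviation: var() and sd() divide by n - 1
var(singlevarname)
sd(singlevarname)

# The two variances differ only by the factor (n - 1)/n
n = length(singlevarname)
var(singlevarname)*(n - 1)/n
```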

Section 3.4: Measures of Positions and Outliers

k^th percentile (assuming normally-distributed data)
- qnorm(k, mean(singlevarname), sd(singlevarname)) where k is a decimal (e.g., 0.90 for the 90^th percentile)
z-score – note: the xvalue is a number not a function here
- z = (xvalue - mean(singlevarname))/sd(singlevarname)
Q₁
- quantile(singlevarname, 0.25, type=2)
Q₃
- quantile(singlevarname, 0.75, type=2)
Interquartile range, IQR
- quantile(singlevarname, 0.75, type=2) - quantile(singlevarname, 0.25, type=2); or
- IQR(singlevarname, type=2)
Lower fence for outliers
- lowerfence = quantile(singlevarname, 0.25, type=2) - 1.5*IQR(singlevarname, type=2)
Upper fence for outliers
- upperfence = quantile(singlevarname, 0.75, type=2) + 1.5*IQR(singlevarname, type=2)
Identify outliers below lower fence
- singlevarname[singlevarname < lowerfence]
Identify outliers above upper fence
- singlevarname[singlevarname > upperfence]
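
The quartile and fence steps above can be chained together as in the following sketch. The data are made up, with one deliberately extreme value.

```r
# Made-up data; 50 is far from the rest
singlevarname = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# Quartiles and IQR using the same type=2 rule as above
q1 = quantile(singlevarname, 0.25, type=2)
q3 = quantile(singlevarname, 0.75, type=2)
iqr = IQR(singlevarname, type=2)

# Fences sit 1.5 IQRs outside the quartiles
lowerfence = q1 - 1.5*iqr
upperfence = q3 + 1.5*iqr

# Values outside the fences are flagged as outliers
singlevarname[singlevarname < lowerfence]
singlevarname[singlevarname > upperfence]
```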

Section 3.5: The Five Number Summary and Boxplots

Five-number summary – note: mean is not a measure within the five-number summary
- fivenum(singlevarname)
Boxplot
- boxplot(singlevarname, ylab="variable name", main="Long Descriptive Title") – note: use the argument horizontal=TRUE to change the orientation of the boxplot – make sure that you also switch the labels
Multiple boxplots comparing different levels in variable
- boxplot(singlevarname ~ levels, data=newvarname, xlab="description of levels", ylab="variable name", main="Long Descriptive Title") where newvarname is the dataset containing both singlevarname and levels
Graph histogram and boxplot concurrently for same variable – note: after graphing close graphics window to reset window
- attach(newvarname)
  par(mfrow=c(2,1))
  hist(singlevarname, xlab="variable name", ylab="frequency", main="Long Descriptive Title")
  boxplot(singlevarname, xlab="variable name", horizontal=TRUE)
  detach(newvarname)

Section 4.1: Scatter Diagram and Correlation

Scatterplot
- plot(xvarname, yvarname, xlab="x variable name", ylab="y variable name", main="Long Descriptive Title")
Pearson linear correlation coefficient, r
- cor(xvarname, yvarname, method="pearson", use="complete.obs") – note: use="complete.obs" removes any 'pair' of observations that has missing value(s)

Section 4.2: Least-Squares Regression

Least-squares regression line (a.k.a., linear model)
- lm(formula = yvarname ~ xvarname) – note: gives Intercept and coefficient of xvarname
Plot scatterplot with least-squares regression line
- plot(xvarname, yvarname, xlab="x variable name", ylab="y variable name", main="Long Descriptive Title")
  abline(lm(formula = yvarname ~ xvarname))

Section 4.3: Coefficient of Determination

Coefficient of determination, r²
- (cor(xvarname, yvarname, method="pearson"))^2
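
To tie Sections 4.1 to 4.3 together, this sketch computes r, the regression line, and r² for a small made-up dataset. The name fit is hypothetical.

```r
# Made-up paired data
xvarname = c(1, 2, 3, 4, 5)
yvarname = c(2, 4, 5, 4, 5)

# Pearson linear correlation coefficient, r
r = cor(xvarname, yvarname, method="pearson")
r

# Least-squares regression line: coef() returns the intercept and the slope
fit = lm(formula = yvarname ~ xvarname)
coef(fit)

# Coefficient of determination: r squared matches the R-squared reported by lm
r^2
summary(fit)$r.squared
```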

Section 6.1: Discrete Random Variables

Graph discrete probability histogram
- Store x-values into variable: x = c(x1, x2, x3)
- Store probabilities, p, into variable: p = c(p1, p2, p3)
- barplot(p, names.arg=x, xlab="variable name", ylab="probability", main="Long Descriptive Title")
Compute mean (expected value) of a discrete random variable
- Store x-values into variable: x = c(x1, x2, x3)
- Store probabilities, p, into variable: p = c(p1, p2, p3)
- sum(x*p)
Compute variance of a discrete random variable
- Store x-values into variable: x = c(x1, x2, x3)
- Store probabilities, p, into variable: p = c(p1, p2, p3)
- sum((x-sum(x*p))^2*p)
Compute standard deviation of a discrete random variable
- Store x-values into variable: x = c(x1, x2, x3)
- Store probabilities, p, into variable: p = c(p1, p2, p3)
- sqrt(sum((x-sum(x*p))^2*p))
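
A worked sketch of the three computations above, using a made-up distribution (number of heads in two fair coin tosses). The names mu, variance, and stdev are hypothetical.

```r
# x-values and their probabilities (the probabilities must add to 1)
x = c(0, 1, 2)
p = c(0.25, 0.5, 0.25)

# Mean (expected value): each value weighted by its probability
mu = sum(x*p)
mu              # 1

# Variance: squared deviations from the mean, weighted by probability
variance = sum((x-sum(x*p))^2*p)
variance        # 0.5

# Standard deviation: square root of the variance
stdev = sqrt(variance)
stdev
```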

Section 6.2: The Binomial Probability Distribution

The number of combinations of r objects from n total objects
- choose(n, r)
Binomial probability of k successes from n trials with P(success) = p – note: order is not important
- dbinom(k, prob=p, size=n)
Cumulative binomial probability of a ≤ k ≤ b successes from n trials with P(success) = p
- sum(dbinom(a:b, prob=p, size=n))
Plot binomial probability distribution of 0 ≤ k ≤ n successes from n trials with P(success) = p
- barplot(dbinom(0:n, prob=p, size=n), names.arg=0:n)
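
As a quick check of the binomial commands above, this sketch uses made-up values (n = 4 trials, p = 0.5). The comparison with pbinom() is an addition here, not part of the original list.

```r
n = 4       # number of trials
p = 0.5     # P(success)

# Number of ways to choose 2 successes out of 4 trials
choose(n, 2)                          # 6

# P(exactly 2 successes) = 6 * 0.5^4
dbinom(2, prob=p, size=n)             # 0.375

# P(1 <= k <= 3) by summing individual probabilities
between = sum(dbinom(1:3, prob=p, size=n))
between                               # 0.875

# Same probability from the cumulative function: P(k <= 3) - P(k <= 0)
pbinom(3, size=n, prob=p) - pbinom(0, size=n, prob=p)
```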

Section 7.1: Properties of the Normal Distribution

Plot the normal curve with mean = m and sd = s
- x = seq(-20,20,by=0.1)
  y = dnorm(x, mean=m, sd=s)
  plot(x, y)
Find area under the normal curve with mean = m and sd = s to the left of x
- pnorm(x, mean=m, sd=s)
Find area under the normal curve with mean = m and sd = s to the right of x
- pnorm(x, mean=m, sd=s, lower.tail=FALSE)
Find area under the normal curve with mean = m and sd = s between a ≤ x ≤ b
- pnorm(b, mean=m, sd=s) - pnorm(a, mean=m, sd=s)
Plot a relative frequency histogram with Normal Curve
- Store variable into x: x = singlevarname
- Construct relative frequency histogram:
  hist(x, freq=FALSE, xlab="variable name", ylab="relative frequency", main="Long Descriptive Title")
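- Graph normal curve: curve(dnorm(x, mean=mean(singlevarname), sd=sd(singlevarname)), add=TRUE) – note: inside curve(), x stands for the plotting positions, so the mean and sd must be computed from singlevarname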

Section 7.2: The Standard Normal Distribution

Find z_α given area = α to the right of z under a normal curve with mean = m and sd = s
- qnorm(α, mean=m, sd=s, lower.tail=FALSE)
Find the area [a.k.a., proportion, percentage, or probability] under the standard normal curve for z ≤ a
- pnorm(a)
Find the area [a.k.a., proportion, percentage, or probability] under the standard normal curve for z ≥ a
- pnorm(a, lower.tail=FALSE)
Find the area [a.k.a., proportion, percentage, or probability] under the standard normal curve for a ≤ z ≤ b
- pnorm(b) - pnorm(a)

Section 7.3: Applications of the Normal Distribution

Find the area [a.k.a., proportion, percentage, or probability] under the normal curve with mean = m and sd = s for x ≤ a
- pnorm(a, mean=m, sd=s)
Find the area [a.k.a., proportion, percentage, or probability] under the normal curve with mean = m and sd = s for x ≥ a
- pnorm(a, mean=m, sd=s, lower.tail=FALSE)
Find the area [a.k.a., proportion, percentage, or probability] under the normal curve with mean = m and sd = s for a ≤ x ≤ b
- pnorm(b, mean=m, sd=s) - pnorm(a, mean=m, sd=s)
Find the x^th percentile for normally-distributed variable with mean = m and sd = s
- qnorm(x, mean=m, sd=s) where x is a decimal
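
The area and percentile commands above are inverses of each other, which this sketch illustrates. The mean, sd, and cut-off values are made up.

```r
m = 100     # made-up mean
s = 15      # made-up standard deviation

# Area to the left and to the right of 115; the two areas add to 1
left  = pnorm(115, mean=m, sd=s)
right = pnorm(115, mean=m, sd=s, lower.tail=FALSE)
left + right

# Area between 85 and 115 (within one sd of the mean): about 0.68
between = pnorm(115, mean=m, sd=s) - pnorm(85, mean=m, sd=s)
between

# 90th percentile, then back to the area to its left
x90 = qnorm(0.90, mean=m, sd=s)
x90
pnorm(x90, mean=m, sd=s)
```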

Section 7.4: Assessing Normality

qqnorm(singlevarname)

Section 9.1: The Logic in Constructing Confidence Intervals for a Population Mean

How to find one-sample z confidence interval (using sample mean and σ)
- zInterval(mean=samplemean, sd=population_sd, n=samplesize, conf.level=C-level) where C-Level is confidence level as a decimal; the arguments may be used in any order
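
zInterval() appears to be a custom course function, so its internals are not shown here. As a sketch, and assuming it uses the standard formula (sample mean ± z × σ/√n), the same interval can be computed in base R. All numbers below are made up.

```r
samplemean = 50       # made-up sample mean
population_sd = 10    # made-up population standard deviation (sigma)
samplesize = 25
conf.level = 0.95

# Critical value z for the chosen confidence level
alpha = 1 - conf.level
zcrit = qnorm(1 - alpha/2)

# Margin of error and the two endpoints of the interval
margin = zcrit*population_sd/sqrt(samplesize)
zint = c(samplemean - margin, samplemean + margin)
zint
```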

Section 9.2: Confidence Intervals for a Population Mean

How to find critical value, t_α
- qt(1 - α, df) where α is area to the right; df = n - 1
How to find one-sample t confidence interval (using sample mean and s)
- load tInterval.R using source("tInterval.R")
- then tInterval(mean=samplemean, sd=sample_sd, n=samplesize, conf.level=C-Level) where sample_sd is the sample standard deviation, s; C-Level is confidence level as a decimal; the arguments may be used in any order
How to find one-sample t confidence interval (using data)
- t.test(data1, mu=mu_0, conf.level=C-Level) where data1 is variable containing data; mu_0 is population mean used in hypotheses; and C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]
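
The interval reported by t.test() can be checked by hand with qt(), as in this sketch. The data are made up, and byhand and fromtest are hypothetical names.

```r
# Made-up sample data
data1 = c(12, 15, 11, 14, 13, 16, 10, 13)
n = length(data1)
conf.level = 0.95
alpha = 1 - conf.level

# Critical value t with df = n - 1 and area alpha/2 to the right
tcrit = qt(1 - alpha/2, df=n - 1)

# Interval by hand: mean +/- t * s/sqrt(n)
margin = tcrit*sd(data1)/sqrt(n)
byhand = c(mean(data1) - margin, mean(data1) + margin)
byhand

# Interval from t.test(); conf.int holds the two endpoints
fromtest = t.test(data1, conf.level=conf.level)$conf.int
fromtest
```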

Section 9.3: Confidence Intervals for a Population Proportion

How to find a one-sample confidence interval for population proportion using z-statistic
- prop.test(x=number_of_successes, n=samplesize, p=p_0, conf.level=C-Level, alternative="two.sided") where x is the number of successes; n is sample size; p_0 is population proportion used in hypotheses; and C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]

Section 10.3: Hypothesis Test for a Population Mean – Population Standard Deviation Unknown

How to perform one-sample hypothesis test for population mean using t-statistic (using data)
- t.test(data1, mu=mu_0, conf.level=C-Level, alternative="two.sided") where data1 is variable containing data; mu_0 is population mean used in hypotheses; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]; alternative is symbol used in alternate hypothesis: "less" for <, "greater" for >, "two.sided" for ≠

Section 10.4: Hypothesis Test for a Population Proportion

How to perform one-sample hypothesis test for population proportion using z-statistic
- prop.test(x=number_of_successes, n=samplesize, p=p_0, conf.level=C-Level, alternative="two.sided") where x is the number of successes; n is sample size; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]; alternative is symbol used in alternate hypothesis: "less" for <, "greater" for >, "two.sided" for ≠
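
One detail worth knowing: prop.test() reports a chi-square statistic and applies a continuity correction by default. The sketch below, with made-up counts, turns the correction off (correct=FALSE) and compares the result with a z-statistic computed by hand; the names z and res are hypothetical.

```r
number_of_successes = 45    # made-up count
samplesize = 100
p_0 = 0.5                   # proportion stated in the null hypothesis

# z-statistic by hand: (p-hat - p_0) / sqrt(p_0*(1 - p_0)/n)
phat = number_of_successes/samplesize
z = (phat - p_0)/sqrt(p_0*(1 - p_0)/samplesize)
z

# prop.test without continuity correction
res = prop.test(x=number_of_successes, n=samplesize, p=p_0,
                alternative="two.sided", correct=FALSE)

# Its X-squared statistic equals z^2, and the two-sided p-values agree
res$statistic
z^2
res$p.value
2*pnorm(-abs(z))
```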

Section 11.1: Inferences about Two Means: Dependent Samples

How to perform hypothesis test for matched-pairs design using t-statistic (using data)
- t.test(data1, data2, paired=TRUE, conf.level=C-Level, alternative="two.sided") where data1 and data2 are variables containing data; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]; alternative is symbol used in alternate hypothesis: "less" for <, "greater" for >, "two.sided" for ≠
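
A matched-pairs t-test is the same as a one-sample t-test on the differences, which the following sketch demonstrates with made-up before/after measurements.

```r
# Made-up matched pairs (same subjects measured twice)
data1 = c(200, 190, 210, 205, 195)
data2 = c(195, 188, 200, 204, 190)

# Matched-pairs test
paired = t.test(data1, data2, paired=TRUE, alternative="two.sided")

# One-sample test on the differences, with null mean difference 0
onesample = t.test(data1 - data2, mu=0, alternative="two.sided")

# Both give the same t-statistic and p-value
paired$statistic
onesample$statistic
paired$p.value
onesample$p.value
```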

Section 11.2: Inferences about Two Means: Independent Samples

How to find two-sample t confidence interval (using data)
- t.test(data1, data2, conf.level=C-Level) where data1 and data2 are variables containing data; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]
How to find pooled two-sample t confidence interval (using data)
- t.test(data1, data2, conf.level=C-Level, var.equal=TRUE) where data1 and data2 are variables containing data; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]
How to perform two-sample t-test (using data)
- t.test(data1, data2, conf.level=C-Level, alternative="two.sided") where data1 and data2 are variables containing data; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]; alternative is symbol used in alternate hypothesis: "less" for <, "greater" for >, "two.sided" for ≠
How to perform pooled two-sample t-test (using data)
- t.test(data1, data2, conf.level=C-Level, alternative="two.sided", var.equal=TRUE) where data1 and data2 are variables containing data; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]; alternative is symbol used in alternate hypothesis: "less" for <, "greater" for >, "two.sided" for ≠
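
The pooled and unpooled versions differ mainly in their degrees of freedom. This sketch, on made-up data, runs both; welch and pooled are hypothetical names.

```r
# Made-up independent samples
data1 = c(5.1, 4.9, 6.2, 5.8, 6.0)
data2 = c(4.2, 4.8, 5.0, 4.1, 4.6)

# Default two-sample t-test (does not assume equal variances)
welch = t.test(data1, data2, alternative="two.sided")

# Pooled two-sample t-test (assumes equal variances)
pooled = t.test(data1, data2, alternative="two.sided", var.equal=TRUE)

# Pooled df is n1 + n2 - 2; the unpooled df is never larger
pooled$parameter
welch$parameter
```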

Section 11.3: Two-Sample Hypothesis Test for Population Proportions, using z-statistic

How to perform two-sample hypothesis test for population proportions using z-statistic (using data)
- prop.test(x = c(x1,x2), n=c(n1,n2), conf.level=C-Level, alternative="two.sided") where x1 and x2 are the number of successes; where n1 and n2 are sample sizes; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]; alternative is symbol used in alternate hypothesis: "less" for <, "greater" for >, "two.sided" for ≠
How to find the confidence interval for the difference between two population proportions using z-statistic (using data)
- prop.test(x = c(x1,x2), n=c(n1,n2), conf.level=C-Level, alternative="two.sided") where x1 and x2 are the number of successes; where n1 and n2 are sample sizes; C-Level is confidence level as a decimal [i.e., 1 – α/2 or 1 – α]

Section 12.1: Goodness-of-Fit Test

How to find the Χ² critical value
- qchisq(1 - alpha, df) where alpha is area to the right
Χ² goodness of fit test
- chisq.test(c(x1,x2,x3),p=c(p1,p2,p3)) where x1, x2, x3, etc. are data; p1, p2, p3, etc., are the respective probabilities that add to 1
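
A small worked sketch of the goodness-of-fit test, using made-up counts and hypothesized probabilities. The names observed, p, and res are hypothetical.

```r
observed = c(20, 30, 50)        # made-up observed counts
p = c(0.25, 0.25, 0.5)          # hypothesized probabilities (add to 1)

res = chisq.test(observed, p=p)

# Expected counts are n*p = 25, 25, 50
res$expected

# Test statistic: sum of (observed - expected)^2/expected = 1 + 1 + 0 = 2
res$statistic

# Degrees of freedom: number of categories minus 1
res$parameter

# p-value: area to the right of the statistic
res$p.value
```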

Section 12.2: Tests for Independence and the Homogeneity of Proportions

Χ² test of independence
- Set-up matrix (called data frame) containing counts
  - rowvariable1 = c(c11, c12, c13) where c11, c12, c13, etc. are counts in the first row of matrix;
  - rowvariable2 = c(c21, c22, c23) where c21, c22, c23, etc. are counts in the second row of matrix;
  - rowvariable3 = c(c31, c32, c33) where c31, c32, c33, etc. are counts in the third row of matrix;
  - etc.
- chisq.test(data.frame(rowvariable1, rowvariable2, rowvariable3))
Χ² test for homogeneity
- Set-up variables containing counts
  - variable.fair = c(c11, c12, c13) where c11, c12, c13, etc. are counts in the first row of matrix;
  - variable.bias = c(c21, c22, c23) where c21, c22, c23, etc. are counts in the second row of matrix;
- chisq.test(rbind(variable.fair, variable.bias))
- for expected counts only, use: chisq.test(rbind(variable.fair, variable.bias))$expected
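
The following sketch runs the test on a made-up 2 × 3 table of counts and pulls out the expected counts, which follow the rule (row total × column total)/grand total.

```r
# Made-up counts, one variable per row of the table
variable.fair = c(10, 20, 30)
variable.bias = c(20, 20, 20)

counts = rbind(variable.fair, variable.bias)
res = chisq.test(counts)

# Expected counts: both rows total 60 and the columns total 30, 40, 50
res$expected

# Same expected counts computed by hand
outer(rowSums(counts), colSums(counts))/sum(counts)

# Test statistic and degrees of freedom, (rows - 1)*(columns - 1)
res$statistic
res$parameter
```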

Section x.x: One-Way Analysis of Variance (ANOVA)

How to perform hypothesis test on three or more population means using ANOVA
- Set-up variables containing sample values
  - variable1 = c(c11, c12, c13) where c11, c12, c13, etc. are values in sample 1;
  - variable2 = c(c21, c22, c23) where c21, c22, c23, etc. are values in sample 2;
  - variable3 = c(c31, c32, c33) where c31, c32, c33, etc. are values in sample 3;
  - etc.
- Store all variables in a new variable
  - samplesvariable = data.frame(variable1, variable2, variable3)
- Store samples variable as a variable with factors describing values
  - samplesvariable = stack(samplesvariable)
- oneway.test(values ~ ind, data=samplesvariable, var.equal=TRUE)
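
The steps above can be sketched end to end with three made-up samples of equal size. Note that data.frame() needs the samples to have the same length; the name res is hypothetical.

```r
# Made-up samples
variable1 = c(1, 2, 3)
variable2 = c(2, 3, 4)
variable3 = c(5, 6, 7)

# Store all samples together, then stack them into two columns:
# values (the data) and ind (a factor naming the sample each value came from)
samplesvariable = data.frame(variable1, variable2, variable3)
samplesvariable = stack(samplesvariable)
names(samplesvariable)

# One-way ANOVA assuming equal variances
res = oneway.test(values ~ ind, data=samplesvariable, var.equal=TRUE)

# F = (SSB/2)/(SSW/6) = (26/2)/(6/6) = 13, with 2 and 6 degrees of freedom
res$statistic
res$parameter
res$p.value
```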
