统计学速记 Statistics Crash Course

Levels of measurement

There are four basic levels: nominal, ordinal, interval, and ratio.

Nominal 明目:(定性)只为属性,无论高低贵贱

A variable measured on a “nominal” scale is a variable that does not really have any evaluative distinction. One value is really not any greater than another. A good example of a nominal variable is sex (or gender). Information in a data set on sex is usually coded as 0 or 1, 1 indicating male and 0 indicating female (or the other way around–0 for male, 1 for female). 1 in this case is an arbitrary value and it is not any greater or better than 0. There is only a nominal difference between 0 and 1. With nominal variables, there is a qualitative difference between values, not a quantitative one.

Ordinal 次序:(定性)等级比较,程度区分。评等、教育程度。

Something measured on an “ordinal” scale does have an evaluative connotation. One value is greater or larger or better than the other. Product A is preferred over product B, and therefore A receives a value of 1 and B receives a value of 2. Another example might be rating your job satisfaction on a scale from 1 to 10, with 10 representing complete satisfaction. With ordinal scales, we only know that 2 is better than 1 or 10 is better than 9; we do not know by how much. It may vary. The distance between 1 and 2 maybe shorter than between 9 and 10.

Interval 等距:(定量)每个值之间等距,不可以乘除。一般认为0~10的满意度表是interval的。

A variable measured on an interval scale gives information about more or betterness as ordinal scales do, but interval variables have an equal distance between each value. The distance between 1 and 2 is equal to the distance between 9 and 10. Temperature using Celsius or Fahrenheit is a good example, there is the exact same difference between 100 degrees and 90 as there is between 42 and 32.

Ratio 等比:(定量)最低为0(绝对0点),可以乘除。价格、年龄、高度、分数。

Something measured on a ratio scale has the same properties that an interval scale has except, with a ratio scaling, there is an absolute zero point. Temperature measured in Kelvin is an example. There is no value possible below 0 degrees Kelvin, it is absolute zero. Weight is another example, 0 lbs. is a meaningful absence of weight. Your bank account balance is another. Although you can have a negative or positive account balance, there is a definite and nonarbitrary meaning of an account balance of 0.

Terminologies

measurement of variation in the data

  • average deviation from the mean: the average of these deviations always sums up to 0.
  • average absolute deviation: take the absolute value of each deviation and calculate the average

trimmed mean: A trimmed mean is calculated to minimize the distorting effect of outliers. A 10% trimmed mean is calculated by discarding the highest and lowest 10% of the observations, and then computing the mean of the remaining 80%.

skewness: Negative, or left-skewed refers to a longer or fatter tail on the left side of the distribution, while positive, or right-skewed, refers to a longer or fatter tail on the right.

kurtosis: 峰度,亦称尖度,在统计学中衡量实数随机变量概率分布的峰态。峰度高就意味着方差增大是由低频度的大于或小于平均值的极端差值引起的。

image

Quantile, percentiles

  • The median, for example, is the 50th percentile (middle value).
  • The 75th percentile (upper quartile) is 75% or three quarters of the
  • The 25th percentile (lower quartile) is 25% or one quarter of the way up this rank order.

boundaries:6.7的boundaries是[6.65, 6.75)

variance 方差:离差的平方的加总的平均——适用于数据的全集

image

or 在数据全集的一个样本数据里,用n-1除开,用s^2表示“sample variance样本方差”。

image

standard deviation 标准差:variance的平方根

image

covariance协方差:两个自变量共同作用于因变量。covariance是正的,表示x,y同向变化;负的,反向变化;如为0,表示两者没有关系。不同的covariance之间无法直接相互比较。

针对样本数据,covariance Sxy:

image

correlation相关性:针对不同的covariance之间无法直接相互比较的情况,进而采用correlation,其特点是其数值在[-1,1]区间。R为0不代表x,y没有任何关系——要查看scatter plot

image

coefficient系数:如correlation coefficient “R”是一个系数

sampling error 抽样误差:difference between the sample mean X-bar and the population mean μ.

Std. Dev. of X-bar 样本数据的标准差:The standard deviation of the sample mean (X-bar) is denoted as follows, where σ is the standard deviation of the population X, divided by the square root of the sample size n.

image

there is a 95% probability that the value of the sample mean will be within the interval:

image

Confidence interval 置信区间: 上述公式体现出的区间,如95%可信度的置信区间为

image

Z统计和t统计:

image

Standard error 标准误差:The estimated standard deviation is called the standard error.

image

Data Visualization

  • bar chart
  • histogram
  • time series graphs
  • scatter plot (with ‘jitter’ option)
  • box plot
image

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top arrow