Levels of measurement
There are four basic levels: nominal, ordinal, interval, and ratio.
Nominal (qualitative): values are labels for an attribute only, with no ranking of higher or lower.
A variable measured on a “nominal” scale is a variable that does not really have any evaluative distinction. One value is really not any greater than another. A good example of a nominal variable is sex (or gender). Information in a data set on sex is usually coded as 0 or 1, 1 indicating male and 0 indicating female (or the other way around–0 for male, 1 for female). 1 in this case is an arbitrary value and it is not any greater or better than 0. There is only a nominal difference between 0 and 1. With nominal variables, there is a qualitative difference between values, not a quantitative one.
Ordinal (qualitative): values can be ranked and compared by degree. Examples: ratings, education level.
Something measured on an “ordinal” scale does have an evaluative connotation. One value is greater or larger or better than the other. Product A is preferred over product B, and therefore A receives a value of 1 and B receives a value of 2. Another example might be rating your job satisfaction on a scale from 1 to 10, with 10 representing complete satisfaction. With ordinal scales, we only know that 2 is better than 1 or 10 is better than 9; we do not know by how much. It may vary. The distance between 1 and 2 may be shorter than between 9 and 10.
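Interval (quantitative): adjacent values are equally spaced, but multiplication and division are not meaningful. A 0-10 satisfaction scale is generally treated as interval.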
A variable measured on an interval scale gives information about order (more or better) as ordinal scales do, but interval variables also have an equal distance between each value. The distance between 1 and 2 is equal to the distance between 9 and 10. Temperature in Celsius or Fahrenheit is a good example: there is exactly the same difference between 100 degrees and 90 degrees as there is between 42 degrees and 32 degrees.
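Ratio (quantitative): the scale has an absolute zero point (typically the lowest value), so multiplication and division are meaningful. Examples: price, age, height, scores.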
Something measured on a ratio scale has the same properties that an interval scale has, except that with ratio scaling there is an absolute zero point. Temperature measured in kelvin is an example: no value below 0 K is possible, because it is absolute zero. Weight is another example: 0 lbs. is a meaningful absence of weight. Your bank account balance is another. Although you can have a negative or positive account balance, there is a definite and nonarbitrary meaning of an account balance of 0.
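To see why ratios need an absolute zero, a small sketch can compare the same two temperatures on both scales. The temperatures (20 and 10 degrees Celsius) and the helper name `celsius_to_kelvin` are illustrative assumptions, not taken from the text above.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvin."""
    return c + 273.15

# Two hypothetical temperatures, first in Celsius, then in kelvin.
a_c, b_c = 20.0, 10.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

# Differences agree on both scales (interval property).
diff_c = a_c - b_c
diff_k = a_k - b_k

# Ratios do not agree: the Celsius zero is arbitrary, the kelvin zero is absolute.
ratio_c = a_c / b_c
ratio_k = a_k / b_k

print("difference:", diff_c, "vs", round(diff_k, 6))
print("ratio:", ratio_c, "vs", round(ratio_k, 4))
```

The Celsius ratio of 2 suggests "twice as hot", but the kelvin ratio is only about 1.035, which is why ratio statements are reserved for ratio scales.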
Terminologies
measurement of variation in the data
- average deviation from the mean: the deviations from the mean always sum to 0, so their average is always 0 and cannot measure spread.
- average absolute deviation: take the absolute value of each deviation and calculate the average
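The two measures above can be sketched in a few lines. The data values are made up for illustration.

```python
# Hypothetical data set with mean 5.
data = [2, 4, 4, 5, 10]
mean = sum(data) / len(data)

# Signed deviations from the mean cancel out.
deviations = [x - mean for x in data]
avg_dev = sum(deviations) / len(data)

# Taking absolute values first gives a usable measure of spread.
avg_abs_dev = sum(abs(d) for d in deviations) / len(data)

print(avg_dev)      # 0.0
print(avg_abs_dev)  # 2.0
```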
trimmed mean: A trimmed mean is calculated to minimize the distorting effect of outliers. A 10% trimmed mean is calculated by discarding the highest and lowest 10% of the observations, and then computing the mean of the remaining 80%.
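A minimal sketch of the trimmed mean, assuming the trimming proportion is applied to each tail and the number of dropped observations is rounded down. The function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(values, proportion):
    """Mean after discarding `proportion` of the observations from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations dropped per tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one large outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
plain = sum(data) / len(data)
trimmed = trimmed_mean(data, 0.10)  # drops 1 and 100

print(plain)    # 14.5
print(trimmed)  # 5.5
```

The single outlier pulls the plain mean far above most of the data, while the 10% trimmed mean stays near the middle.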
skewness: Negative, or left-skewed refers to a longer or fatter tail on the left side of the distribution, while positive, or right-skewed, refers to a longer or fatter tail on the right.
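kurtosis: also called peakedness, kurtosis measures how peaked the probability distribution of a real-valued random variable is. High kurtosis means that more of the variance comes from infrequent extreme deviations above or below the mean.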
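Skewness and kurtosis can be sketched from central moments. This uses the simple population (divide by n) definitions, which is only one of several conventions; the function names and data sets are illustrative assumptions.

```python
def central_moment(values, k):
    """k-th central moment, dividing by the number of values."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n

def skewness(values):
    """Third standardized moment: negative = left-skewed, positive = right-skewed."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth standardized moment (a normal distribution has the value 3)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 10]

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric), kurtosis(heavy_tailed))
```

In `heavy_tailed`, all of the variance comes from two rare extreme values, so its kurtosis is well above that of the evenly spread `symmetric` data.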
Quantile, percentiles
- The median, for example, is the 50th percentile (middle value).
- The 75th percentile (upper quartile) is 75%, or three quarters, of the way up the rank order of the data.
- The 25th percentile (lower quartile) is 25%, or one quarter, of the way up this rank order.
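The quartiles can be sketched with Python's standard `statistics` module. Interpolation conventions differ between tools; the `"inclusive"` method is used here as an assumption, and the data are hypothetical.

```python
from statistics import median, quantiles

# Hypothetical data, already in rank order.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
print(median(data))  # the 50th percentile equals the median
```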
boundaries: the boundaries of a reported value of 6.7 are [6.65, 6.75), that is, the range of exact values that round to 6.7.
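As a rough sketch, the boundaries of a reported value can be derived from its number of decimal places. The helper name `boundaries` is hypothetical, and the value is passed as a string so that `Decimal` keeps the reported precision exactly.

```python
from decimal import Decimal

def boundaries(reported):
    """Return (lower, upper) for a reported value: lower inclusive, upper exclusive."""
    value = Decimal(reported)
    # One unit in the last reported decimal place, e.g. 0.1 for "6.7".
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    half_unit = unit / 2
    return value - half_unit, value + half_unit

low, high = boundaries("6.7")
print(low, high)
```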
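variance: the average of the squared deviations from the mean. This form, which divides by the number of values, applies to the full population of data.
or, for a sample drawn from the population, divide by n-1 instead; the result is written s^2 and called the sample variance.
standard deviation: the square root of the variance.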
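The population and sample versions differ only in the divisor. The sketch below computes both by hand and checks them against the standard `statistics` module; the data are hypothetical.

```python
import math
from statistics import pvariance, variance, pstdev, stdev

# Hypothetical data with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)

population_var = sum_sq / n        # treat data as the full population
sample_var = sum_sq / (n - 1)      # treat data as a sample (s^2)

print(population_var, math.sqrt(population_var))
print(sample_var, math.sqrt(sample_var))

# The library functions use the same two divisors.
print(pvariance(data), pstdev(data))
print(variance(data), stdev(data))
```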
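covariance: a measure of how two variables x and y vary together. A positive covariance means x and y tend to change in the same direction; a negative covariance means they change in opposite directions; a covariance of 0 means there is no linear relationship between them. Covariances of different pairs of variables cannot be compared directly with each other, because the value depends on the units of measurement.
For sample data, the covariance is S_xy = sum of (x_i - x-bar)(y_i - y-bar) over all observations, divided by n-1.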
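correlation: because covariances cannot be compared directly with each other, correlation is used instead; its value always lies in the interval [-1, 1]. An R of 0 does not mean that x and y are unrelated; check the scatter plot.
coefficient: for example, the correlation coefficient "R" is a coefficient.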
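A short sketch of sample covariance and correlation, written by hand to keep the n-1 divisor visible. The function names and all data are illustrative assumptions. It shows two points from the definitions above: covariance changes with the units while correlation does not, and R can be 0 even when y is completely determined by x.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance S_xy with the n-1 divisor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Correlation coefficient R: covariance divided by both standard deviations."""
    sx = math.sqrt(sample_covariance(xs, xs))
    sy = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sx * sy)

# Hypothetical heights and weights; the same heights in two units.
heights_m = [1.5, 1.6, 1.7, 1.8, 1.9]
heights_cm = [h * 100 for h in heights_m]
weights_kg = [50, 58, 63, 72, 80]

cov_m = sample_covariance(heights_m, weights_kg)
cov_cm = sample_covariance(heights_cm, weights_kg)
r_m = correlation(heights_m, weights_kg)
r_cm = correlation(heights_cm, weights_kg)
print(cov_m, cov_cm)  # differ by a factor of 100
print(r_m, r_cm)      # essentially identical

# A perfect but nonlinear relationship: y = x squared.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r_curve = correlation(xs, ys)
print(r_curve)
```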
sampling error: the difference between the sample mean X-bar and the population mean μ.
Std. Dev. of X-bar (standard deviation of the sample mean): the standard deviation of the sample mean X-bar is σ / sqrt(n), that is, the standard deviation σ of the population X divided by the square root of the sample size n.
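The relation between the spread of X-bar and the population standard deviation can be checked exactly on a tiny example by listing every possible sample. The sketch assumes a hypothetical population of six equally likely values (a fair die) and samples of size 2 drawn with replacement.

```python
import math
from itertools import product
from statistics import pstdev

# Hypothetical population: the faces of a fair die.
population = [1, 2, 3, 4, 5, 6]
n = 2

# Every equally likely sample of size n, drawn with replacement.
sample_means = [sum(s) / n for s in product(population, repeat=n)]

observed = pstdev(sample_means)                  # spread of X-bar itself
predicted = pstdev(population) / math.sqrt(n)    # sigma / sqrt(n)

print(observed, predicted)
```

The two printed values agree up to floating-point rounding, because the enumeration covers all 36 samples rather than a random subset.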
When X-bar is approximately normally distributed, there is a 95% probability that the value of the sample mean will be within the interval μ ± 1.96 σ / sqrt(n).
Confidence interval: the interval given by the formula above; for example, the 95% confidence interval for μ is X-bar ± 1.96 σ / sqrt(n).
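A minimal sketch of a confidence interval for the mean, assuming the population standard deviation σ is known and X-bar is approximately normal. The function name `z_confidence_interval`, the sample, and the value σ = 3 are hypothetical.

```python
import math
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence=0.95):
    """Interval X-bar +/- z * sigma / sqrt(n) for a known population sigma."""
    n = len(sample)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / math.sqrt(n)
    centre = mean(sample)
    return centre - half_width, centre + half_width

# Hypothetical sample of 9 measurements with mean 100; sigma assumed to be 3.
sample = [98, 102, 101, 99, 100, 103, 97, 100, 100]
low, high = z_confidence_interval(sample, sigma=3)
print(round(low, 2), round(high, 2))
```

With n = 9 and σ = 3 the standard deviation of X-bar is 1, so the interval is roughly the sample mean plus or minus 1.96.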
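Z statistic and t statistic: z = (X-bar - μ) / (σ / sqrt(n)) when σ is known; t = (X-bar - μ) / (s / sqrt(n)) when σ is replaced by the sample standard deviation s.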
Standard error: the estimated standard deviation of a statistic such as X-bar is called the standard error.
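When σ is unknown, the sample standard deviation s stands in for it. The sketch below computes the standard error of the mean and the resulting t statistic; the function names, the data, and the hypothesized mean of 4 are illustrative assumptions.

```python
import math
from statistics import mean, stdev

def standard_error(sample):
    """Estimated standard deviation of the sample mean: s / sqrt(n)."""
    return stdev(sample) / math.sqrt(len(sample))

def t_statistic(sample, mu0):
    """Distance of the sample mean from mu0, in standard-error units."""
    return (mean(sample) - mu0) / standard_error(sample)

# Hypothetical sample with mean 5; mu0 = 4 is the value being tested.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error(sample)
t = t_statistic(sample, mu0=4)
print(round(se, 4), round(t, 4))
```

The t statistic would be compared against a t distribution with n-1 degrees of freedom (7 here) rather than the normal distribution.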
Data Visualization
- bar chart
- histogram
- time series graphs
- scatter plot (with ‘jitter’ option)
- box plot