Basic Terminology#

There are four key concepts that we need to understand in order to grasp the fundamentals of statistics:

Data: Refers to measurements or, more generally, to documented observations gathered through an experiment or phenomenon.
Experimental unit (or unit of analysis): In statistics, this refers to the individual element or entity that is the subject of analysis in a statistical study. These units serve as the fundamental components from which data is collected and analyzed.
Response variable: A characteristic of interest observed in the experimental unit, which is susceptible to being quantified or recorded in any form, not necessarily in numerical terms.
Statistical population: The complete set of values that a response variable takes across all experimental units under study.

1. Population vs. Sample#

Ideally, we would observe every unit of study. In practice, this is rarely possible, so we distinguish between a population and a sample:

Population: The entire group of individuals or observations that you want to study. Studying the entire population is often impractical due to its size and complexity.
Sample: A subset of the population selected for analysis. Samples are used to make inferences about the population, provided they are representative.

Example:

Population: All customers of an e-commerce website.
Sample: 1,000 customers selected randomly from the website’s database.

Why Use a Sample?#

For practical reasons:

Studying the entire population is often impractical or impossible.
Accessibility issues may prevent studying the entire population.
Sampling is faster, more cost-effective, and easier to analyze.

The size of the population relative to the sample is not always the most critical factor; what matters most is that the sample is selected randomly and is representative of the population to ensure valid conclusions.
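The idea that a modest random sample can stand in for a large population can be sketched with a short simulation. The population below is simulated (hypothetical order values for 100,000 customers, with made-up mean and spread), and only the standard library is used, so the numbers illustrate the principle rather than any real dataset.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: order values for 100,000 customers (simulated).
population = [random.gauss(500, 120) for _ in range(100_000)]

# Simple random sample of 1,000 customers, drawn without replacement.
sample = random.sample(population, 1_000)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {pop_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Difference:      {abs(sample_mean - pop_mean):.2f}")
```

Although the sample covers only 1% of the simulated population, its mean typically lands within a few units of the population mean, because every customer had the same chance of being selected.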

📝 Exercises for Practice#

2. Bias and Bias Types#

Bias refers to systematic errors in data collection, analysis, interpretation, or presentation that can lead to inaccurate conclusions. It can occur at any stage of a statistical study and can significantly impact the validity of results.

Why is Bias Important?#

Bias can distort findings and lead to incorrect conclusions.
Understanding bias helps ensure the reliability and credibility of research.
Identifying and minimizing bias improves the quality of decision-making based on data.

Common Types of Bias#

Selection Bias:
Occurs when the sample chosen for analysis does not accurately represent the target population. This can result in over- or under-representation of certain groups.
- Example: Conducting a survey only among social media users to assess overall public opinion.
- Self-Selection Bias:
  A special type of selection bias that happens when individuals choose to participate in a study based on their interest, motivation, or other personal characteristics.
  - Example: An online survey about health habits might attract health-conscious individuals, leading to skewed results.
Sampling Bias:
A subset of selection bias where certain members of the population are more likely to be included in the sample than others.
- Example: Conducting a phone survey that excludes people without landlines.
Measurement Bias (or Information Bias):
Arises when data collection methods systematically misrepresent the actual values.
- Example: A faulty scale consistently under-reporting weight measurements.
Response Bias:
Occurs when participants provide inaccurate or false responses due to social desirability, misunderstanding, or survey design.
- Example: People underreporting alcohol consumption in surveys.
Confirmation Bias:
The tendency to interpret data in a way that confirms pre-existing beliefs or hypotheses.
- Example: A researcher focusing only on data that supports their hypothesis and ignoring contradictory evidence.
Observer Bias:
Happens when researchers subconsciously influence the outcome of an experiment based on their expectations.
- Example: A doctor interpreting a patient’s symptoms based on their preconceived notion of the diagnosis.
Publication Bias:
The tendency for journals to publish only positive or significant results, leading to a skewed understanding of research outcomes.
- Example: Studies showing no effect of a treatment are less likely to be published.
Survivorship Bias:
Occurs when only successful cases are considered, ignoring those that failed or were excluded.
- Example: Studying successful companies without analyzing those that went bankrupt.
Recall Bias:
When participants fail to accurately remember past events, leading to skewed data.
- Example: Patients inaccurately recalling their diet habits in a nutritional study.

How to Minimize Bias#

Use random sampling to ensure representativeness.
Apply blinding techniques in experiments to reduce observer bias.
Use standardized data collection methods to minimize measurement errors.
Conduct pilot studies to identify potential biases early.
Be aware of personal biases and use objective analysis methods.

Example Scenario#

Imagine a company conducting a customer satisfaction survey but only selecting participants from their loyalty program.

Potential Bias: Selection bias (loyal customers may have a more positive perception).
Solution: Randomly sample customers from all segments to get a balanced view.
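The loyalty-program scenario can be imitated with simulated data. The sketch below assumes, purely for illustration, that 30% of customers are loyalty members and that members rate their satisfaction about two points higher on a 1 to 10 scale; none of these figures come from a real survey.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

def clip(value, low=1.0, high=10.0):
    """Keep a simulated rating inside the 1-10 scale."""
    return max(low, min(high, value))

# Hypothetical customer base: (is_loyalty_member, satisfaction_score).
customers = []
for _ in range(10_000):
    loyal = random.random() < 0.3          # assumed 30% loyalty members
    center = 8.0 if loyal else 6.0         # assumed difference in satisfaction
    customers.append((loyal, clip(random.gauss(center, 1.0))))

all_scores = [score for _, score in customers]
loyal_scores = [score for loyal, score in customers if loyal]

true_mean = statistics.mean(all_scores)
biased_mean = statistics.mean(random.sample(loyal_scores, 500))  # loyalty members only
random_mean = statistics.mean(random.sample(all_scores, 500))    # all segments

print(f"All customers:       {true_mean:.2f}")
print(f"Loyalty-only sample: {biased_mean:.2f}")
print(f"Random sample:       {random_mean:.2f}")
```

Under these assumptions the loyalty-only sample overstates satisfaction by more than a point, while the random sample stays close to the value for all customers.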

📝 Exercises for Practice#

Python Usage Examples#

Python can help identify and address bias in datasets through statistical analysis and visualization.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample biased dataset (survey from social media users)
data = {
    "Age": [18, 22, 25, 35, 45, 50, 60, 65],
    "Income": [20000, 25000, 30000, 50000, 70000, 75000, 80000, 85000],
    "Survey_Source": ["Social Media", "Social Media", "Social Media", "Direct Mail", "Direct Mail", "Direct Mail", "Newspaper", "Newspaper"]
}

df = pd.DataFrame(data)

# Check representation by source
sns.countplot(
    x="Survey_Source",
    data=df
)
plt.title("Survey Respondents by Source")
plt.show()

# Analyze potential bias by income levels
sns.boxplot(
    x="Survey_Source",
    y="Income",
    data=df
)
plt.title("Income Distribution by Survey Source")
plt.show()

# Possible remedy: resampling or weighting (requires scikit-learn)
from sklearn.utils import resample

# Draw 3 rows (with replacement) from the non-social-media respondents
# so that both groups contribute the same number of rows
df_balanced = resample(
    df[df["Survey_Source"] != "Social Media"],
    replace=True,
    n_samples=3,
    random_state=42
)
df_final = pd.concat(
    [df[df["Survey_Source"] == "Social Media"],
     df_balanced]
)

# Count respondents per source after rebalancing
pd.DataFrame(df_final["Survey_Source"].value_counts())

[Figure: count plot, "Survey Respondents by Source"]

[Figure: box plot, "Income Distribution by Survey Source"]

Tasks to Try:

Analyze the impact of selection bias by introducing another source of respondents.
Use random sampling methods to reduce bias in a given dataset.
Perform hypothesis testing to check for differences across biased and unbiased groups.

# your answers here

3. Descriptive vs. Inferential statistics#

Statistics can be used to either describe a population or sample (descriptive statistics) or to make generalizations and predictions about a population based on a sample (inferential statistics). The differences are as follows:

1. Definition and Purpose#

Aspect	Descriptive statistics	Inferential statistics
Definition	Summarizes and describes the main features of a dataset.	Uses sample data to make inferences or predictions about a larger population.
Purpose	To organize, summarize, and present data in a clear and understandable way.	To make inferences and draw conclusions about a population from a sample using probability theory.
Focus	Presentation and visualization of data.	Analyzing data to make predictions or test hypotheses.

2. Techniques Used#

Aspect	Descriptive statistics	Inferential statistics
Techniques	Measures of central tendency (mean, median, mode), dispersion (range, variance, standard deviation), and visualization (charts, tables).	Estimation (confidence intervals), hypothesis testing (t-tests, chi-square tests), and regression analysis.
Example Methods	Frequency distributions, percentages, histograms, pie charts.	t-tests, ANOVA, regression analysis, and p-values.

3. Data Scope#

Aspect	Descriptive statistics	Inferential statistics
Scope	Describes observed data only (e.g., survey results).	Draws conclusions beyond observed data to a larger population.
Use Case	Summarizing exam scores for a class.	Predicting the performance of all students based on a sampled group.
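The contrast in the tables above can be sketched on one small dataset. The exam scores below are hypothetical, and the inferential step uses a percentile bootstrap, which is just one of several possible ways to build an interval; treat it as an illustration rather than a recommended procedure.

```python
import random
import statistics

random.seed(3)  # reproducible resampling

# Hypothetical exam scores for a sample of 30 students.
scores = [72, 85, 90, 66, 78, 88, 95, 70, 81, 77,
          84, 69, 91, 74, 80, 86, 73, 79, 83, 76,
          88, 92, 68, 75, 82, 87, 71, 89, 78, 80]

# Descriptive statistics: summarize the observed data only.
sample_mean = statistics.mean(scores)
sample_median = statistics.median(scores)
sample_sd = statistics.stdev(scores)
print(f"mean={sample_mean:.2f}, median={sample_median}, sd={sample_sd:.2f}")

# Inferential statistics: estimate a range for the population mean
# by resampling the data with replacement (percentile bootstrap).
n_boot = 5000
boot_means = sorted(
    statistics.mean(random.choices(scores, k=len(scores)))
    for _ in range(n_boot)
)
ci_low = boot_means[int(0.025 * n_boot)]
ci_high = boot_means[int(0.975 * n_boot)]
print(f"Approximate 95% interval for the population mean: "
      f"({ci_low:.2f}, {ci_high:.2f})")
```

The first print statement describes these 30 students only; the interval is a statement about the wider group they were drawn from, and it is only meaningful if the sample is representative.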

Example#

Parameter: The population mean (μ) or population standard deviation (σ).
Statistic: The sample mean (x̄) or sample standard deviation (s).

Why do we use it?#

Studying the entire population is often impractical or impossible.
Sampling is faster, more cost-effective, and easier to analyze.

📝 Exercises for Practice#

4. Parameters vs. Statistics#

To better understand inferential statistics, it is important to distinguish between a parameter and a statistic.

1. Definition and Differences#

Aspect	Parameter	Statistic
Definition	A numerical value that describes a characteristic of an entire population.	A numerical value that describes a characteristic of a sample, which is a subset of the population.
Scope	Refers to the whole population (e.g., all customers of a company).	Based on a subset of the population (e.g., a survey of 1,000 customers).
Symbol	Usually Greek letters (e.g., $\mu$ for mean, $\sigma$ for standard deviation); the population proportion is commonly written $p$ or $\pi$.	Latin letters (e.g., $\bar{x}$ for sample mean, $s$ for sample standard deviation, $\hat{p}$ for sample proportion).
Variability	Fixed and constant (but often unknown).	Varies from sample to sample (used to estimate parameters).
Example	The average height of all adults in a country ($\mu$).	The average height of a surveyed group of 1,000 adults ($\bar{x}$).

2. Key Concepts#

Parameters:#

Describe the entire population, which is usually impractical or impossible to measure directly.
They are fixed values but often unknown, and we rely on estimates from sample data.

Statistics:#

Derived from sample data and used to estimate the unknown parameters.
Since samples vary, statistics are subject to sampling variability, which leads to the need for confidence intervals and hypothesis testing.

3. Relationship Between Parameters and Statistics#

Inferential statistics aim to use sample statistics to estimate population parameters.

Example:

The sample mean ($\bar{x}$) can be used as an estimate of the population mean ($\mu$).
The larger and more representative the sample, the closer the statistic tends to be to the true parameter; a large but biased sample can still miss it.
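Sampling variability can be made visible by drawing many samples from the same population and watching how the sample mean moves. The population of heights below is simulated, and the sample sizes (10 and 1,000) and the number of repetitions are arbitrary choices for this sketch.

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Hypothetical population: heights in cm for 50,000 adults (simulated).
population = [random.uniform(150, 200) for _ in range(50_000)]
mu = statistics.mean(population)  # the parameter: fixed, usually unknown

def sample_means(n, repeats=200):
    """Return the sample mean of `repeats` independent random samples of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(repeats)]

small = sample_means(10)
large = sample_means(1000)

print(f"Population mean (parameter): {mu:.2f}")
print(f"Spread of sample means, n=10:   {statistics.stdev(small):.2f}")
print(f"Spread of sample means, n=1000: {statistics.stdev(large):.2f}")
```

The parameter never changes, but each sample gives a slightly different statistic, and the spread of those statistics shrinks as the sample size grows.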

4. Practical Example#

Suppose you want to know the average income of all employees in a company:

Parameter ($\mu$): The true average income of all employees (e.g., $50,000 per year).
Statistic ($\bar{x}$): The average income based on a sample of 100 employees (e.g., $48,500 per year).

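If the sample is random and representative, we can estimate that the population average income is close to the sample value of $48,500; the gap between this estimate and the true $50,000 reflects sampling variability.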

5. Importance of the Distinction#

Knowing whether you’re dealing with a statistic or a parameter helps in:

Understanding the reliability of conclusions.
Designing proper sampling methods.
Calculating confidence intervals and margins of error.
Avoiding biases in data interpretation.
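As a worked illustration of a margin of error, take the income example with $\bar{x} = 48{,}500$ and $n = 100$. The sample standard deviation is not given in the example, so $s = 8{,}000$ is assumed here purely for illustration, together with a normal approximation ($z_{0.975} \approx 1.96$):

$$
\text{ME} = z_{0.975}\,\frac{s}{\sqrt{n}} = 1.96 \times \frac{8{,}000}{\sqrt{100}} = 1{,}568
$$

$$
\bar{x} \pm \text{ME} = 48{,}500 \pm 1{,}568 \;\Rightarrow\; (46{,}932,\ 50{,}068)
$$

Under these assumptions the approximate 95% confidence interval contains the parameter value of 50,000 used in the example, even though the statistic itself is 1,500 lower.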

📝 Exercises for Practice#

5. Types of Data#

Data can be classified into various types based on its nature. Understanding these types helps in selecting the appropriate statistical techniques for analysis.

1. Categorical Data#

Categorical data represents characteristics that describe attributes or qualities rather than numerical values.

Key Features: Categories with no inherent numerical value. Purely nominal categories also have no order; ordered categories are treated as ordinal data (see below).
Examples: Gender (Male, Female), Eye Color (Brown, Blue, Green), Car Brands (Toyota, Ford, BMW).

2. Numerical Data#

Numerical data represents quantifiable measurements and can be further divided into:

Discrete:
Represents countable values with no intermediate values.
- Examples: Number of students in a class, number of cars in a parking lot.
Continuous:
Represents measurable values that can take any value within a range.
- Examples: Height, Weight, Temperature.

3. Ordinal Data#

Ordinal data represents categories with a meaningful order; however, the differences between values are not necessarily equal or meaningful.

Key Features: Order matters, but exact differences are not quantifiable.
Examples: Education Level (High School, Bachelor’s, Master’s, PhD), Customer Satisfaction (Low, Medium, High), Movie Ratings (1 star, 2 stars, etc.).

4. Interval Data#

Interval data consists of numerical values with equal intervals between them but lacks a true zero point.

Key Features: Allows meaningful comparison of differences, but not ratios.
Examples: Temperature in Celsius or Fahrenheit, IQ scores, Dates in a calendar.

5. Ratio Data#

Ratio data is similar to interval data but includes a true zero point, making ratio comparisons meaningful.

Key Features: Allows comparisons such as “twice as much.”
Examples: Age, Income, Weight, Height, Distance.
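The difference between interval and ratio data can be checked with a small calculation. The sketch below compares two temperatures in Celsius (interval scale) and in Kelvin (ratio scale, with a true zero); the two temperature values are arbitrary.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (true zero at 0 K)."""
    return celsius + 273.15

cool, warm = 10.0, 20.0  # degrees Celsius

# Differences are meaningful on an interval scale and survive the conversion.
diff_c = warm - cool
diff_k = celsius_to_kelvin(warm) - celsius_to_kelvin(cool)

# Ratios depend on where zero sits, so the Celsius ratio is not meaningful.
ratio_c = warm / cool
ratio_k = celsius_to_kelvin(warm) / celsius_to_kelvin(cool)

print(f"Difference: {diff_c:.1f} C vs {diff_k:.1f} K")
print(f"Ratio in Celsius: {ratio_c:.3f}")
print(f"Ratio in Kelvin:  {ratio_k:.3f}")
```

The 10-degree difference is the same on both scales, but "twice as hot" only appears in Celsius; on the Kelvin scale 20 °C is about 3.5% warmer than 10 °C.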

6. Time Series Data#

Time series data consists of observations collected at successive time intervals, often used for trend analysis and forecasting.

Key Features: Data is recorded over time, usually at consistent intervals.
Examples: Daily stock prices, monthly sales data, hourly weather records, annual revenue figures.

📝 Exercises for Practice#

Python Usage Examples#

You can analyze and classify data types using Python. Try the following code snippets:

import pandas as pd
import numpy as np

# Sample Data
data = {
    "Gender": ["Male", "Female", "Female", "Male"],  # Categorical
    "Age": [25, 30, 22, 28],  # Ratio
    "Education_Level": ["High School", "Bachelor's", "Master's", "PhD"],  # Ordinal
    "Temperature_C": [36.5, 37.2, 36.8, 37.0],  # Interval
    "Monthly_Income": [4000, 5500, 3200, 5000],  # Ratio
    "Customer_Rating": [3, 5, 4, 2],  # Ordinal
    "Sales_Date": pd.date_range(start="2023-01-01", periods=4, freq='ME')  # Time Series; 'ME' (month end) needs pandas >= 2.2, use 'M' on older versions
}

df = pd.DataFrame(data)

# Check the data types pandas inferred for each column
print("We can also check data types:\n", df.dtypes)

# Convert ordinal data to ordered category
education_levels = ["High School", "Bachelor's", "Master's", "PhD"]
df["Education_Level"] = pd.Categorical(df["Education_Level"], categories=education_levels, ordered=True)

# Display data
print(df)

# Plot time series data
print("Or plot time series data:")
df.plot(
    x="Sales_Date",
    y="Monthly_Income",
    kind="line",
    title="Monthly Income Over Time"
)

   Gender  Age Education_Level  Temperature_C  Monthly_Income  Customer_Rating Sales_Date
0    Male   25     High School           36.5            4000                3 2023-01-31
1  Female   30      Bachelor's           37.2            5500                5 2023-02-28
2  Female   22        Master's           36.8            3200                4 2023-03-31
3    Male   28             PhD           37.0            5000                2 2023-04-30

[Figure: line plot, "Monthly Income Over Time" (x-axis: Sales_Date)]

Tasks to Try:

Modify the dataset by adding discrete and continuous numerical data.
Write code to filter categorical or numerical data from the DataFrame.
Use descriptive statistics functions to summarize the numerical data.
Create visualizations for different data types using Matplotlib or Seaborn.

# Your answers here

6. Levels of measurement#

Understanding the levels of measurement is crucial for selecting the appropriate statistical methods and analyses. Different data types allow for different types of statistical operations and interpretations.

Nominal: Categories with no inherent order; used for classification or labeling purposes.
- Examples: Gender (Male, Female), Eye color (Brown, Blue, Green), Car brands (Toyota, Ford, BMW).
Ordinal: Ordered categories, but the differences between them are not uniform or meaningful.
- Examples: Education levels (High School, Bachelor’s, Master’s, PhD), Customer satisfaction (Low, Medium, High).
Interval: Ordered and measurable intervals, but with no true zero point.
- Examples: Temperature in Celsius or Fahrenheit, IQ scores, Dates on a calendar.
Ratio: Ordered, measurable intervals with a true zero point, allowing for meaningful ratio comparisons.
- Examples: Age, Income, Weight, Height.

Example: Classifying Data#

Suppose you are analyzing data from a fitness app that tracks user information:

Nominal: User’s favorite workout type (Yoga, Cardio, Strength Training)
Ordinal: Fitness level (Beginner, Intermediate, Advanced)
Interval: Temperature during the workout (in Celsius)
Ratio: Number of calories burned during the session

Why is it important?#

Different levels of measurement allow for different types of statistical operations.
- Example: You cannot calculate a mean for nominal data (e.g., colors: “red,” “blue”) but can for interval or ratio data (e.g., temperature, height).
Helps in selecting the appropriate statistical test.
Correctly identifying the level of measurement of each variable is essential for applying the correct statistical techniques.
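A rough rule of thumb is that the mode suits nominal data, the median suits ordinal data, and the mean suits interval or ratio data. The sketch below applies each summary to small hypothetical fitness-app values using only the standard library; the ordinal median is taken on the rank positions of the categories.

```python
import statistics

# Hypothetical fitness-app records.
workout_types = ["Yoga", "Cardio", "Strength Training", "Yoga"]        # nominal
fitness_levels = ["Beginner", "Advanced", "Intermediate",
                  "Intermediate", "Beginner"]                          # ordinal
calories_burned = [250, 500, 400, 300]                                 # ratio

# Nominal: only counting is meaningful, so report the most frequent category.
mode_workout = statistics.mode(workout_types)

# Ordinal: order is meaningful, so take the median of the category ranks.
level_order = ["Beginner", "Intermediate", "Advanced"]
ranks = sorted(level_order.index(level) for level in fitness_levels)
median_level = level_order[statistics.median_low(ranks)]

# Ratio: arithmetic is meaningful, so the mean is a valid summary.
mean_calories = statistics.mean(calories_burned)

print("Most common workout:", mode_workout)     # Yoga
print("Median fitness level:", median_level)    # Intermediate
print("Mean calories burned:", mean_calories)   # 362.5
```

Averaging the workout types would have no meaning, and averaging the fitness levels would silently assume the steps between levels are equal, which ordinal data does not guarantee.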

📝 Exercises for Practice#

Python Usage Examples#

You can classify and analyze different types of data using Python. Below is an example:

import pandas as pd

# Sample dataset with different levels of measurement
data = {
    "Workout_Type": ["Yoga", "Cardio", "Strength Training", "Yoga"],  # Nominal
    "Satisfaction_Level": ["Low", "Medium", "High", "Medium"],  # Ordinal
    "Calories_Burned": [250, 500, 400, 300],  # Ratio
    "Temperature": [22.5, 24.0, 23.5, 25.0]  # Interval
}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert ordinal data to an ordered category
satisfaction_levels = ["Low", "Medium", "High"]
df["Satisfaction_Level"] = pd.Categorical(df["Satisfaction_Level"], categories=satisfaction_levels, ordered=True)

# Display data types
print(df.dtypes)

# Basic descriptive analysis
print("Average calories burned:", df["Calories_Burned"].mean())

# Visualizing numerical data
df.plot(
    x="Workout_Type",
    y="Calories_Burned",
    kind="bar",
    title="Calories Burned per Workout Type"
)

Workout_Type            object
Satisfaction_Level    category
Calories_Burned          int64
Temperature            float64
dtype: object
Average calories burned: 362.5

[Figure: bar chart, "Calories Burned per Workout Type" (x-axis: Workout_Type)]

Tasks to Try:

Modify the dataset by adding another nominal and ratio variable.
Convert the “Workout_Type” column into categorical data.
Perform statistical calculations on ratio and interval data.
Create a scatter plot comparing calories burned and temperature.

7. Variables#

Variables are fundamental elements in research and data analysis, representing characteristics that can change or vary across different observations.

Independent variable:
The variable that is manipulated or categorized to observe its effect on the dependent variable.
- Examples: Amount of study time (in hours), type of fertilizer used, different marketing strategies.
Dependent variable:
The variable that is measured or observed to assess the effect of the independent variable.
- Examples: Exam scores, plant growth, sales revenue.
Confounding variable: A confounding variable is a third variable that affects both the independent variable (IV) and the dependent variable (DV), potentially distorting or masking the true relationship between them. In other words, a confounder creates a false association or hides a real effect, leading to misinterpretation of causal relationships in statistical analysis.
- Example: Ice Cream & Drowning Deaths
  - Observation: Higher ice cream sales are associated with more drowning deaths.
  - Does this mean eating ice cream causes drowning? ❌ No.
  - Confounding Variable: Temperature (Season)
    - When it’s hot, ice cream sales increase.
    - When it’s hot, more people go swimming, leading to more drownings.
    - The true cause of drowning is not ice cream sales, but the hot weather.

Why Do Confounding Variables Matter?#

They Can Create False Causal Relationships
- If confounders are ignored, you may mistakenly conclude that one variable affects another.
They Can Hide True Relationships
- A confounder might make an actual relationship appear weaker or nonexistent.
They Affect Experiment & Study Validity
- In scientific studies, failing to account for confounders leads to biased results.

How to Handle Confounding Variables?#

Randomization: In experiments, randomly assigning participants helps distribute confounders evenly.
Statistical Control: Use regression analysis to adjust for confounders.
Matching: Ensure groups being compared have similar levels of the confounding variable.
Stratification: Analyze data within subgroups where the confounder is constant.
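Stratification can be sketched with simulated data for the ice cream example. In the simulation below, temperature drives both ice cream sales and drownings, and the two have no direct link; all coefficients and noise levels are invented for illustration. The correlation is computed overall and then within one-degree temperature bands.

```python
import math
import random
from collections import defaultdict

random.seed(7)  # reproducible illustration

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Simulated days: temperature affects both variables (assumed coefficients).
rows = []
for _ in range(5000):
    temp = random.uniform(10, 35)
    ice_cream = 20 * temp + random.gauss(0, 40)
    drownings = 0.5 * temp + random.gauss(0, 1.5)
    rows.append((temp, ice_cream, drownings))

overall_r = pearson([r[1] for r in rows], [r[2] for r in rows])

# Stratify by one-degree temperature bands, where the confounder is nearly constant.
strata = defaultdict(list)
for temp, ice_cream, drownings in rows:
    strata[int(temp)].append((ice_cream, drownings))

within = [
    pearson([p[0] for p in points], [p[1] for p in points])
    for points in strata.values()
    if len(points) >= 30  # skip bands with too few days
]
mean_within_r = sum(within) / len(within)

print(f"Overall correlation:            {overall_r:.2f}")
print(f"Mean within-stratum correlation: {mean_within_r:.2f}")
```

The strong overall correlation largely disappears once days with similar temperatures are compared with each other, which is what we expect when the confounder explains the association.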

Example in research:#

If we are studying the effect of study time on exam performance:

Independent Variable: Study time (in hours).
Dependent Variable: Exam scores.

Why is it important?#

Understanding the relationship between variables helps in designing experiments and analyzing results effectively.
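Identifying independent and dependent variables is a first step toward studying causality; establishing it also requires an appropriate study design (e.g., randomization) and control of confounders.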
Incorrect classification can lead to false assumptions.
- Example: Saying customer satisfaction (DV) influences product price (IV) would be incorrect. Price influences satisfaction, not the other way around.
Helps in designing experiments and A/B testing:
- In clinical trials: New drug (IV) -> Blood pressure reduction (DV).
- In business analytics: Discount offered (IV) -> Customer purchase behavior (DV).
Confounding variables are hidden influencers that can mislead analysis. Recognizing and controlling for them is crucial for accurate research, data science, and business decisions.

📝 Exercises for Practice#

Python Usage Examples#

You can analyze and visualize relationships between independent and dependent variables using Python. Try the following code snippets:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset: Study time (independent) vs. Exam scores (dependent)
data = {
    "Study_Hours": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Exam_Scores": [50, 55, 60, 65, 70, 75, 80, 85, 90, 95]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Visualize the relationship using a scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x=df["Study_Hours"], y=df["Exam_Scores"])
plt.title("Relationship Between Study Hours and Exam Scores")
plt.xlabel("Study Hours (Independent Variable)")
plt.ylabel("Exam Scores (Dependent Variable)")
plt.show()

# Calculate correlation between variables
correlation = df["Study_Hours"].corr(df["Exam_Scores"])
print(f"Correlation coefficient: {correlation:.2f}")

[Figure: scatter plot, "Relationship Between Study Hours and Exam Scores"]

Correlation coefficient: 1.00

Tasks to Try:

Modify the dataset to include more variables, such as “Sleep Hours” and “Social Media Usage.”
Perform linear regression to analyze the relationship between independent and dependent variables.
Calculate summary statistics (mean, median, standard deviation) for each variable.
Use Python to identify potential outliers in the data.

# Your code here

8. Correlation vs. causation#

Understanding the difference between correlation and causation is crucial in data analysis to avoid incorrect conclusions.

Correlation:
A statistical measure that describes the strength and direction of a relationship between two variables, but it does not imply one causes the other.
- Examples:
  - Ice cream sales and drowning incidents (both increase in summer, but one does not cause the other).
  - Hours of study and exam scores (positive correlation but other factors may be involved).
Causation:
A cause-and-effect relationship where a change in one variable directly results in a change in another.
- Examples:
  - Smoking causes lung cancer (proven through controlled studies).
  - Exercise reduces the risk of heart disease.

Why is it important?#

Correlation helps in identifying relationships between variables but requires further analysis to determine causation.
Understanding causation is essential for decision-making in fields like medicine, economics, and business.
Misinterpreting correlation as causation can lead to faulty conclusions and ineffective policies.

Key Differences:#

Aspect	Correlation	Causation
Definition	Measures the relationship between two variables.	Indicates a direct cause-and-effect relationship.
Directionality	Symmetric; says nothing about which variable, if either, influences the other.	Implies a directional influence from cause to effect.
Example	More sleep correlates with better productivity.	Proper sleep causes improved cognitive function.

Common Pitfalls:#

Spurious Correlation: When two variables appear related but are actually influenced by a third factor.
Reverse Causation: When causation is mistakenly assumed in the wrong direction.
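One way to probe a suspected spurious correlation is a partial correlation, which measures the association between two variables after removing the linear effect of a third. The sketch below uses the standard first-order formula on simulated data in which temperature drives both ice cream sales and drownings; the generating coefficients are assumptions made for illustration.

```python
import math
import random

random.seed(11)  # reproducible illustration

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def partial_corr(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simulated data: temperature influences both variables, with noise.
temperature = [random.uniform(10, 35) for _ in range(3000)]
ice_cream = [20 * t + random.gauss(0, 40) for t in temperature]
drownings = [0.5 * t + random.gauss(0, 1.5) for t in temperature]

raw_r = pearson(ice_cream, drownings)
adjusted_r = partial_corr(ice_cream, drownings, temperature)

print(f"Raw correlation:                  {raw_r:.2f}")
print(f"Partial correlation (given temp): {adjusted_r:.2f}")
```

A high raw correlation that drops to roughly zero after adjusting for temperature is consistent with a spurious relationship. Note that the formula breaks down when the variables are perfectly correlated with the third factor, because the denominator becomes zero.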

📝 Exercises for Practice#

Python Usage Examples#

Python can be used to analyze and visualize relationships between variables. Below is an example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Sample dataset
data = {
    "Ice_Cream_Sales": [100, 200, 300, 400, 500, 600, 700],
    "Drowning_Incidents": [5, 10, 15, 20, 25, 30, 35],
    "Temperature": [20, 25, 30, 35, 40, 45, 50]
}

df = pd.DataFrame(data)

# Scatter plot showing correlation
plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Ice_Cream_Sales"], y=df["Drowning_Incidents"])
plt.title("Correlation between Ice Cream Sales and Drowning Incidents")
plt.xlabel("Ice Cream Sales")
plt.ylabel("Drowning Incidents")
plt.show()

# Calculate correlation matrix
correlation_matrix = df.corr()
print("Correlation Matrix:\n", correlation_matrix)

# Plot the suspected confounder (temperature) against the outcome.
# Note: this only visualizes the association; it does not statistically control for temperature.
plt.figure(figsize=(8,5))
sns.scatterplot(x=df["Temperature"], y=df["Drowning_Incidents"])
plt.title("Temperature vs Drowning Incidents")
plt.xlabel("Temperature")
plt.ylabel("Drowning Incidents")
plt.show()

[Figure: scatter plot, "Correlation between Ice Cream Sales and Drowning Incidents"]

Correlation Matrix:
                     Ice_Cream_Sales  Drowning_Incidents  Temperature
Ice_Cream_Sales                 1.0                 1.0          1.0
Drowning_Incidents              1.0                 1.0          1.0
Temperature                     1.0                 1.0          1.0

[Figure: scatter plot, "Temperature vs Drowning Incidents"]

Tasks to Try:

Add another variable such as “Sunscreen Sales” and analyze its correlation with ice cream sales and drowning incidents.
Use linear regression to explore relationships between variables.
Identify potential confounding variables using Python statistical tools.
Create a visualization to show how correlation does not imply causation.

# Your code here

9. Summary:#

In this notebook, we covered the fundamental concepts necessary to understand statistics, including:

Data and Experimental Units: Defined how data is collected and analyzed in statistical studies.
Population vs. Sample: Explained why working with samples is often more practical than studying entire populations.
Bias and Its Types: Discussed different types of bias (e.g., selection bias, response bias) and how they can distort statistical analysis.
Descriptive vs. Inferential Statistics: Highlighted their differences and roles in summarizing data versus making predictions.
Parameters vs. Statistics: Distinguished fixed (usually unknown) population parameters from the sample statistics used to estimate them.
Types of Data: Differentiated between categorical, numerical, ordinal, interval, ratio, and time series data.
Levels of Measurement: Explained why understanding measurement scales is essential for choosing the right statistical methods.
Variables in Statistics: Covered independent, dependent, and confounding variables and their significance in research.
Correlation vs. Causation: Explained why an association between two variables does not by itself establish a cause-and-effect relationship.

By mastering these foundational concepts, you are now better prepared to explore more advanced statistical methods and applications.

10. Next steps:#

Now that we have a solid understanding of basic statistical terminology, the next section will focus on Descriptive Statistics, where we will:

Learn about measures of central tendency (mean, median, and mode).
Explore variability measurements (range, variance, and standard deviation).
Understand skewness and kurtosis and how they help describe data distributions.
Use data visualization techniques to summarize and interpret datasets effectively.

These concepts will allow us to summarize datasets meaningfully and serve as a foundation for further statistical analysis.
