Introduction

While pursuing my degree in Computer Science at Hanze (Hanzehogeschool Groningen) after a two-year sabbatical, I wrote this abstract summary of the theory taught in the course ‘Data Analysis’.

This should cover most of the contents of Chapters 1, 3, 4 and 5 of the book ‘Business Analytics’ by James R. Evans. Credit where credit is due.

Chapter 1

Business Analytics is the use of data, information technology, statistics, quantitative methods, and models to gain insight into business operations and make better, fact-based decisions.

BA sits at the center of three overlapping fields:

  • Computer Science
  • Business & Management
  • Math & Statistics

Three kinds of analytics

Descriptive analytics: the use of data to understand past and current business performance and make informed decisions

Predictive analytics: predict the future by examining historical data, detecting patterns or relationships in these data, and then extrapolating these relationships forward in time.

Prescriptive analytics: identify the best alternatives to minimize or maximize some objective

Data Reliability and Validity

Data Reliability:

Definition: Data reliability refers to the consistency and repeatability of data measurements. In other words, it assesses whether the same results would be obtained if the same data were collected and analyzed multiple times under similar conditions.

Example: If you weigh an object using a scale multiple times and get the same measurement each time, the data is considered reliable because it consistently produces the same result.

Importance: Reliable data ensures that findings and conclusions drawn from analysis are trustworthy and not merely due to chance or inconsistency in measurement.

Data Validity:

Definition: Data validity refers to the accuracy and truthfulness of data. It assesses whether the data accurately represents the phenomenon it is intended to measure or describe.

Example: If you are measuring the temperature of a liquid using a thermometer, data validity would mean that the thermometer is accurately measuring the temperature of the liquid and not producing incorrect readings due to calibration errors or other factors.

Importance: Valid data ensures that conclusions drawn from analysis are based on accurate representations of the real-world phenomena being studied, thus increasing the credibility and usefulness of the findings.

Three forms of Models

The book describes three ways to represent a model:

  1. Verbal model - represented in text (either written or spoken).
  2. Visual model - represented visually, such as a graph or diagram.
  3. Mathematical model - represented as a function or a set of equations.

A model is used to process input data and produce output data that can be interpreted further.
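
For example, a simple total cost model (fixed cost plus variable cost per unit) can be written down directly as a function. A minimal Python sketch, with the cost figures invented for illustration:

  def total_cost(quantity, fixed_cost=50_000, unit_cost=125):
      """Mathematical model: TC(q) = fixed cost + unit cost * q."""
      return fixed_cost + unit_cost * quantity

  print(total_cost(1500))  # 237500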

Descriptive models explain behavior and allow users to evaluate potential decisions by asking “what-if?” questions.

Predictive models focus on what will happen in the future. Many predictive models are developed by analyzing historical data and assuming that the past is representative of the future.

Prescriptive models help decision makers identify the best solution to a decision problem.

  • Optimization - finding values of decision variables that minimize (or maximize) something such as cost (or profit)
  • Objective function - the equation that minimizes (or maximizes) the quantity of interest
  • Optimal solution - values of the decision variables at the minimum (or maximum) point
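
As a minimal sketch of these three terms (the cost function is hypothetical, not one of the book's examples), SciPy can search for the optimal solution of an objective function over one decision variable:

  from scipy.optimize import minimize_scalar

  # Hypothetical objective function: total cost as a function of order quantity q.
  def cost(q):
      return 0.5 * q**2 - 40 * q + 1000

  # Optimization: find the value of the decision variable q that minimizes cost(q).
  result = minimize_scalar(cost, bounds=(0, 100), method="bounded")
  print(result.x, result.fun)  # optimal solution q ~= 40, minimum cost ~= 200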

Assumptions

Assumptions are made to:

  • simplify a model and make it more tractable; that is, able to be easily analyzed or solved.
  • better characterize historical data or past observations. The task of the modeler is to select or build an appropriate model that best represents the behavior of the real situation.

Uncertainty and Risk

Uncertainty is imperfect knowledge of what will happen in the future. Risk is associated with the consequences of what actually happens.

“To try to eliminate risk in business enterprise is futile. Risk is inherent in the commitment of present resources to future expectations. Indeed, economic progress can be defined as the ability to take greater risks. The attempt to eliminate risks, even the attempt to minimize them, can only make them irrational and unbearable. It can only result in the greatest risk of all: rigidity.” - Peter Drucker

Problem Solving with Analytics

Recognizing a problem

Problems exist when there is a gap between what is happening and what we think should be happening.

Defining the problem

Clearly defining the problem is not a trivial task. Complexity increases when the following occur:

  • large number of courses of action
  • the problem belongs to a group and not an individual
  • competing objectives
  • external groups are affected
  • problem owner and problem solver are not the same person
  • time limitations exist

Structuring the problem

  • Stating goals and objectives
  • Characterizing the possible decisions
  • Identifying any constraints or restrictions

Analyzing the problem

Analytics plays a major role. Analysis involves some sort of experimentation or solution process, such as evaluating different scenarios, analyzing risks associated with various decision alternatives, finding a solution that meets certain goals, or determining an optimal solution.

Interpreting results and making a decision

Models cannot capture every detail of the real problem. Managers must understand the limitations of models and their underlying assumptions and often incorporate judgment into making a decision.

Implementing the solution

Translate the results of the model back to the real world. Requires providing adequate resources, motivating employees, eliminating resistance to change, modifying organizational policies, and developing trust.

Chapter 3

Data visualization

Data visualization - the process of displaying data (often in large quantities) in a meaningful fashion to provide insights that will support better decisions. Data visualization improves decision-making, provides managers with better analysis capabilities that reduce reliance on IT professionals, and improves collaboration and information sharing.

Tabular Data Analysis: looking at a table. Visual Data Analysis: looking at a graph.

Column and Bar Charts

Excel distinguishes between vertical and horizontal bar charts, calling the former column charts and the latter bar charts.

  • A clustered column chart compares values across categories using vertical rectangles;
  • A stacked column chart displays the contribution of each value to the total by stacking the rectangles;
  • A 100% stacked column chart compares the percentage that each value contributes to a total.

Column and bar charts are useful for comparing categorical or ordinal data, for illustrating differences between sets of values, and for showing proportions or percentages of a whole.

Line Charts

Line charts provide a useful means for displaying data over time. You may plot multiple data series in line charts; however, they can be difficult to interpret if the magnitude of the data values differs greatly. In that case, it would be advisable to create separate charts for each data series.

Pie Charts

NEVER USE PIE CHARTS. It is difficult to compare relative sizes of areas.

Area Charts

An area chart combines the features of a pie chart with those of line charts. Area charts present more information than pie or line charts alone but may clutter the observer’s mind with too many details if too many data series are used; thus, they should be used with care.

Scatter Charts (Scatter plots)

Scatter charts show the relationship between two variables. To construct a scatter chart, we need observations that consist of pairs of variables.

Orbit Charts

An orbit chart is a scatter chart in which the points are connected in sequence, such as over time. Orbit charts show the “path” that the data take over time, often showing some unusual patterns that can provide unique insights.

Bubble Charts

A bubble chart is a type of scatter chart in which the size of the data marker corresponds to the value of a third variable; consequently, it is a way to plot three variables in two dimensions.

Combination Charts (Combo Charts)

Often, we wish to display multiple data series on the same chart using different chart types. Excel 2016 for Windows provides a Combo Chart option for constructing such a combination chart; in Excel 2016 for Mac, it must be done manually. We can also plot a second data series on a secondary axis; this is particularly useful when the scales differ greatly.
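
Outside of Excel, the same idea is easy to sketch in matplotlib; a minimal example (all figures invented) draws one data series as columns and a second, differently scaled series as a line on a secondary axis:

  import matplotlib.pyplot as plt

  months = ["Jan", "Feb", "Mar", "Apr"]
  sales = [120, 135, 150, 160]       # units sold (hypothetical)
  margin = [0.22, 0.25, 0.21, 0.27]  # profit margin (hypothetical)

  fig, ax1 = plt.subplots()
  ax1.bar(months, sales, color="steelblue")  # column chart for the first series
  ax1.set_ylabel("Sales")

  ax2 = ax1.twinx()  # secondary axis for the second, differently scaled series
  ax2.plot(months, margin, color="darkorange", marker="o")  # line chart for margin
  ax2.set_ylabel("Margin")

  plt.show()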

Radar Charts

Radar charts show multiple metrics on a spider-web-like set of axes. This is a useful chart to compare survey data from one time period to another or to compare performance of different entities such as factories, companies, and so on using the same criteria.

Stock Charts

A stock chart allows you to plot stock prices, such as daily high, low, and close values.

Geographic Data

Many applications of business analytics involve geographic data. Visualizing geographic data can highlight key data relationships, identify trends, and uncover business opportunities. In addition, it can often help to spot data errors and help end users understand solutions, thus increasing the likelihood of acceptance of decision models. Companies like Nike use geographic data and information systems for visualizing where products are being distributed and how that relates to demographic and sales information. This information is vital to marketing strategies.

Other Excel Data Visualization Tools

  • Data bars
  • Color scales
  • Icon sets
  • Sparklines

Sparklines

Sparklines are graphics that summarize a row or column of data in a single cell. Excel has three types of sparklines: line, column, and win/loss. Line sparklines are clearly useful for time-series data. Column sparklines are more appropriate for categorical data. Win/loss sparklines are useful for data that move up or down over time.

Chapter 4

Statistics

Statistics, as defined by David Hand, past president of the Royal Statistical Society in the UK, is both the science of uncertainty and the technology of extracting information from data.

  • Statistics involves collecting, organizing, analyzing, interpreting, and presenting data.
  • A statistic is a summary measure of data. Descriptive statistics refers to methods of describing and summarizing data using tabular, visual, and quantitative techniques.

Metrics and Data Classification

  • Metric - a unit of measurement that provides a way to objectively quantify performance.
  • Measurement - the act of obtaining data associated with a metric.
  • Measures - numerical values associated with a metric.

Discrete variable: one that is derived from counting something.

Continuous variable: one that is based on a continuous scale of measurement.

Any metric involving dollars, length, time, volume, or weight, for example, is continuous.

Measurement scales

  • Categorical (nominal) data - sorted into categories according to specified characteristics.
  • Ordinal data - can be ordered or ranked according to some relationship to one another.
  • Interval data - ordinal, but with constant differences between observations and arbitrary zero points.
  • Ratio data - continuous, with a natural zero.

Measures of Location:

  • Mode: The mode is the observation that occurs most frequently in a dataset.
  • Median: The median is the middle value when data is arranged from least to greatest.
  • Mean: The mean is the average value of a dataset.
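
A minimal sketch of these three measures using Python's standard library (the data are invented):

  import statistics

  data = [4, 5, 5, 6, 7, 7, 7, 9, 12]

  print(statistics.mode(data))    # 7 - the most frequent observation
  print(statistics.median(data))  # 7 - the middle value of the sorted data
  print(statistics.mean(data))    # 6.888... - the arithmetic average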

Measures of Dispersion:

  • Range: The range is the difference between the maximum and minimum values in a dataset.
  • Interquartile Range: The interquartile range is the difference between the first and third quartiles, focusing on the middle 50% of the data.
  • Variance: Variance measures the average of the squared deviations from the mean.
  • Standard Deviation: Standard deviation is the square root of the variance, providing a practical measure of dispersion.
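
Continuing with the same invented data, the dispersion measures can be computed in the same way (statistics.quantiles needs Python 3.8+; the sample versions of variance and standard deviation are used here):

  import statistics

  data = [4, 5, 5, 6, 7, 7, 7, 9, 12]

  data_range = max(data) - min(data)            # maximum minus minimum
  q1, q2, q3 = statistics.quantiles(data, n=4)  # the three quartile cut points
  iqr = q3 - q1                                 # spread of the middle 50%
  variance = statistics.variance(data)          # sample variance (divides by n - 1)
  stdev = statistics.stdev(data)                # square root of the variance

  print(data_range, iqr, variance, stdev)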

Population and Sample:

  • Population: All items of interest for a particular decision or investigation.
  • Sample: A subset of the population used to draw valid inferences.

Measures of Shape:

  • Skewness: Skewness describes the lack of symmetry in data, with positively skewed and negatively skewed distributions.
  • Coefficient of Skewness: The coefficient of skewness indicates the degree of skewness in data, with positive values for right-skewed data and negative values for left-skewed data.
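
A minimal sketch of the coefficient of skewness in one common form, using population moments (invented data again; libraries such as SciPy offer ready-made versions):

  import statistics

  data = [4, 5, 5, 6, 7, 7, 7, 9, 12]

  mean = statistics.fmean(data)
  sigma = statistics.pstdev(data)  # population standard deviation

  # Average cubed deviation from the mean, divided by sigma cubed.
  cs = sum((x - mean) ** 3 for x in data) / len(data) / sigma ** 3
  print(cs)  # positive -> right-skewed, negative -> left-skewed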

Measures of Association:

  • Correlation: Correlation measures the linear relationship between two variables, independent of units of measurement.
  • Correlation Coefficient: The correlation coefficient ranges between -1 and 1, indicating the strength and direction of the relationship between variables.
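
A minimal sketch (statistics.correlation needs Python 3.10+; the paired observations are invented):

  import statistics

  hours_studied = [2, 3, 5, 6, 8]
  exam_score = [55, 60, 70, 72, 85]

  # Pearson correlation coefficient, always between -1 and 1.
  r = statistics.correlation(hours_studied, exam_score)
  print(r)  # close to 1 -> strong positive linear relationship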

Variability:

  • Empirical Rules: For data that are approximately normally distributed, about 68% of observations fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
  • Chebyshev’s Theorem: For any distribution, at least 1 - 1/k² of the values fall within k standard deviations of the mean.
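
Chebyshev’s bound is easy to evaluate directly; a quick sketch:

  def chebyshev_bound(k):
      """At least this proportion of values lies within k standard deviations of the mean."""
      return 1 - 1 / k**2

  print(chebyshev_bound(2))  # 0.75 -> at least 75% within 2 standard deviations
  print(chebyshev_bound(3))  # 0.888... -> at least ~88.9% within 3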

Statistical Thinking in Business Decisions:

  • Statistical thinking is a philosophy based on understanding and reducing variation in processes to improve performance.
  • Work in organizations occurs through interconnected processes, and understanding variation is crucial for decision-making.

Descriptive Statistics in Business:

  • Descriptive statistics involve analyzing data to understand the characteristics of a dataset.
  • Measures of location, dispersion, and association help in interpreting data effectively.

Business Analytics:

  • Business analytics involves using statistical methods to analyze data and make informed decisions.
  • Understanding measurement scales, population vs. sample, and statistical thinking is essential for effective business analytics.

By focusing on measures of location, dispersion, and statistical thinking, businesses can gain valuable insights from data analysis to improve processes and decision-making.

Chapter 5

Definition

Probability is the likelihood that an outcome occurs. Probabilities are expressed as values between 0 and 1.

Three definitions:

  • Classical definition: probabilities can be deduced from theoretical arguments
  • Relative frequency definition: probabilities are based on empirical data
  • Subjective definition: probabilities are based on judgment and experience

Relationship with frequencies

When the data are sorted and, for each value, its relative frequency (number of matches / total number of rows) is accumulated, the running total gives the probability of observing a value at or below that point, anywhere from the minimum to the maximum of the range.
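
A minimal sketch of this accumulation (the data are invented):

  from collections import Counter

  data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 7]
  total = len(data)

  cumulative = 0.0
  for value, count in sorted(Counter(data).items()):
      cumulative += count / total  # accumulate the relative frequencies
      print(value, cumulative)     # probability of observing a value <= this one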

Probability Distributions

A probability distribution is a characterization of the possible values that a random variable may assume, along with the probability of assuming these values. It is typically based on empirical data, that is, on earlier, trusted observations or research.

We could simply specify a probability distribution using subjective values and expert judgment. This is often done in creating decision models for phenomena for which we have no historical data.

Cumulative Distribution Function

F(x) = P(X <= x) = P(x_min) + … + P(x)

that is, the sum of the probabilities of all values from the minimum up to and including x.
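
The same idea for a discrete distribution specified directly; a minimal sketch with invented probabilities:

  # Hypothetical discrete distribution: value -> probability.
  distribution = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

  def cdf(x):
      """F(x) = P(X <= x): sum the probabilities of all values up to and including x."""
      return sum(p for value, p in distribution.items() if value <= x)

  print(cdf(1))  # 0.4
  print(cdf(3))  # 1.0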

Expected Value

The expected value of a random variable corresponds to the notion of the mean, or average, for a sample.

The expected value is calculated by summing, for each possible value, the product of that value and its probability: E[X] = Σ x_i · P(x_i).

Notes:

The expected value is a “long-run average” and is appropriate for decisions that occur on a repeated basis. For one-time decisions, however, you need to consider the downside risk and the upside potential of the decision.

Variance

Variance is a weighted average of the squared deviations from the expected value: Var[X] = Σ (x_i - E[X])² · P(x_i).
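
A minimal sketch computing both the expected value and the variance of a hypothetical discrete distribution:

  # Hypothetical discrete distribution: value -> probability.
  distribution = {0: 0.1, 1: 0.3, 2: 0.4, 3: 0.2}

  # E[X]: each value weighted by its probability.
  expected = sum(x * p for x, p in distribution.items())

  # Var[X]: squared deviations from E[X], weighted by their probabilities.
  variance = sum((x - expected) ** 2 * p for x, p in distribution.items())

  print(expected, variance)  # approximately 1.7 and 0.81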

Normal distribution

Properties:

  • Symmetric
  • Mean = Median = Mode
  • Range of X is unbounded.
  • Empirical rules apply

Z-values

The z-value expresses how far a particular value lies from the mean of a normal distribution, measured in units of standard deviations: z = (x - mean) / StDev.
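
A minimal sketch (the distribution parameters are invented):

  # Hypothetical normal distribution of exam scores.
  mean, stdev = 70.0, 8.0

  def z_value(x):
      """How many standard deviations x lies from the mean."""
      return (x - mean) / stdev

  print(z_value(86))  # 2.0 -> two standard deviations above the mean
  print(z_value(62))  # -1.0 -> one standard deviation below the mean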

Data fitting

Data fitting is the process of finding a mathematical function that best matches a set of data points.

Overfitting occurs when a model captures noise in the data rather than the underlying pattern, resulting in poor performance on new data. It’s like fitting a suit too closely—while it might look great on the training data, it won’t generalize well to unseen data.

Balancing fitting and overfitting is crucial for creating accurate and reliable models.
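
A minimal sketch with NumPy, using generated noisy linear data: a degree-1 fit captures the underlying pattern, while a very high-degree fit chases the noise and generalizes poorly:

  import numpy as np

  rng = np.random.default_rng(42)
  x = np.linspace(0, 10, 20)
  y = 2 * x + 1 + rng.normal(0, 2, size=x.size)  # linear trend plus noise

  good_fit = np.polyfit(x, y, deg=1)   # matches the underlying pattern
  overfit = np.polyfit(x, y, deg=15)   # chases the noise (NumPy may warn about conditioning)

  x_new = 12.0  # a point outside the training data
  print(np.polyval(good_fit, x_new))   # close to the true value 2 * 12 + 1 = 25
  print(np.polyval(overfit, x_new))    # typically far off: poor generalization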

ANOVA

Assumptions

  • Independent samples
  • Well modelled by a normal distribution
  • Variances of the various groups are equal

If not all assumptions (requirements) are met, don’t use ANOVA; use a non-parametric test instead (Kruskal-Wallis).
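
A minimal sketch of both tests with SciPy (the group samples are invented); f_oneway returns the F- and P-values discussed next:

  from scipy.stats import f_oneway, kruskal

  # Hypothetical measurements from three independent groups.
  group_a = [23, 25, 21, 24, 26]
  group_b = [30, 28, 31, 29, 32]
  group_c = [22, 24, 23, 25, 21]

  # One-way ANOVA: assumes normality and equal variances.
  f_stat, p_value = f_oneway(group_a, group_b, group_c)
  print(f_stat, p_value)

  # Non-parametric alternative when the assumptions don't hold.
  h_stat, p_value = kruskal(group_a, group_b, group_c)
  print(h_stat, p_value)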

F- and P-values

  • F-value: The F-value is a statistic that is calculated by taking the ratio of two variances. In ANOVA, it represents the ratio of the variance between groups to the variance within groups. In regression analysis, it represents the ratio of the variance explained by the model to the unexplained variance.

  • P-value: The P-value, or probability value, is a measure used to determine the statistical significance of the F-value. It indicates the probability of obtaining an F-value as extreme as, or more extreme than, the one observed in the data, under the assumption that the null hypothesis is true. In other words, it tells you the likelihood of observing the results if there is no real effect or relationship.

In summary, the F-value tells you whether there is a significant difference or relationship between groups or variables, while the P-value tells you how likely it is that the observed F-value occurred by chance. A low P-value (typically below a predetermined threshold, often 0.05) suggests that the results are statistically significant and that the null hypothesis can be rejected in favor of the alternative hypothesis.