













Chapter 3 – Focus 2002 & GSHE data overview, selection and pre-treatment
3.1 - Introduction
Prior to the
application of the statistical analysis techniques described above to the
Focus 2002 and GSHE data, it is necessary to address research questions 6 and 7
(Table 1.1). In addition to answering these research
questions, the purpose of this chapter is to:
·
Justify and record the decision as to which AstraZeneca nations and sites are to
be included in the scope of this project.
·
Explain the pre-treatment that has been applied to the Focus 2002 and GSHE data
prior to the analysis detailed within Chapters 4 and 5.
·
Provide a better understanding of the underlying structure of the Focus 2002
response data.
3.2 – The AstraZeneca Focus 2002 survey and its ability to measure
organisational culture
Section
2.3 outlined the work of Flin et al [60], who analysed the
themes contained within eighteen publicly available industrial-based safety
climate surveys. They concluded
that the range of culture dimensions surveyed could be distilled into five
themes, namely, ‘Management/Supervision’, ‘Safety system’, ‘Risk’,
‘Work pressure’ and ‘Competence’.
A subjective comparison of the Focus 2002 survey questions against
these ‘thematic factors’ was performed, and the results are detailed in Table 3.1.
Because the comparison is subjective, it is
recognised that others performing it may obtain different results.
Table
3.1 indicates that the AstraZeneca Focus 2002 survey contains attitudinal
questions that address all of the Flin et al [60] thematic factors.
The comparison therefore suggests that the responses to the attitudinal
questions contained within the Focus 2002 survey may be able to measure
AstraZeneca’s organisational culture.
The
Focus 2002 survey addresses four themes not covered by Flin et al [60],
namely My Job, Team,
Our Company, and Pay and Benefits.
These additional attitudinal questions provide an opportunity to
explore whether any of these factors influence, or are related to, SHE performance.
Exploring those cultural factors not directly associated with safety
would also address the recommendations of Coyle et al [31].
The subjective comparison addresses research question 6 listed in Table
1.1.
The
information contained within the AstraZeneca Focus 2002 survey will allow
investigations into the psychological ‘safety climate’ element of
Cooper’s [26] model of safety
culture. However, the
‘behavioural’ and ‘safety management system’ elements of his model are
not addressed within the Focus 2002 survey.
Behavioural elements can be inferred from a number of
sources currently available within AstraZeneca; these include:
·
The
percentage of Focus survey questionnaires returned;
·
Financial
under- or over-spends;
·
The
findings of behavioural based safety audits;
·
Sickness
rates;
·
Perceived
degree of over- or under-reporting of accidents, incidents etc.
Information
pertaining to ‘safety management system’ elements may also be inferred from
SHE and other management system audit findings.
Although
it is practicable to obtain ‘behavioural’ and ‘safety management’
information for AstraZeneca, it was decided to exclude them from the scope of
Chapters 4 and 5. The principal
reason for exclusion is a desire to establish a base ‘safety climate’
model with the minimum number of confounding variables and factors.
Rows: Flin et al’s [60] five cultural themes. Columns 2 to 12: AstraZeneca Focus 2002 factors.
| Flin et al’s [60] Five Cultural Themes | SHE | Leadership | Communication And Feedback | Diversity | Innovation | Learning & Development | My Job | Team | Our Company | Pay & Benefits | Work-Life Balance |
| Management/Supervision | 41, 48 | 9a, 9b, 20a, 20b, 20c, 20d, 22, 57, 58 | 3, 11, 16, 32, 38a, 38b, 38c, 53a, 53b, 53c | 6, 18, 29, 37, 42 | 10a, 10c, 10e, 10f | | 2, 13, 21, 44 | 1, 12 | | 40, 59 | 4, 15, 27a, 27b, 27c, 54 |
| Safety System | 17 | | | | | | | | | | |
| Risk | 5, 17, 28, 36, 41, 48 | | | | | | | | 45 | | |
| Work Pressure | 17 | | 32 | | 10a, 10b | | 44 | | | | |
| Competence | | | | | | 14, 24, 26, 35, 39, 47 | | | | | |
| Focus 2002 Questions Not Covered By Flin et al Thematic Factors | None | None | None | None | 10d | None | 7, 31, 49, 52 | 8a, 8b, 8c, 23, 33, 46, 50 | 10g, 19, 30a, 30b, 43, 51, 60 | 25, 34, 55a, 55b | None |
Table 3.1 – Comparison of the Focus 2002 survey factors with the Flin et al [60]
safety thematic factors
Note – A blank cell within Table 3.1 indicates no match between the Focus 2002
factors and the Flin et al factors
3.3 – Definition of project scope and metrics
3.3.1 - Introduction
The
purpose of this section is to record the rationale behind the selection of the
AstraZeneca sites and the GSHE lagging SHE performance indicators to be
included within the scope of Chapters 4 and 5.
3.3.2 - Selection of nations to be included within the project scope
It was considered
desirable to examine the data from more than one country to allow
investigation into how organisational culture and its relationship to lagging
SHE performance indicators varies between nations. The scope of this
research was restricted to include data only from the United Kingdom (UK),
Sweden (SE) and United States (US) for the following reasons:
·
There is a large data set from each of the three countries and
collectively they represent 57% of the Focus 2002 response data.
·
The SHE statistics reported by these countries are known to be
robust, as demonstrated by local and corporate audit data (there is less
confidence in the accuracy of data from some other territories).
·
There are distinct national cultures in the three countries.
·
Each territory supports the full range of AstraZeneca activities
(for example, manufacturing, research and development, marketing) and is
therefore representative of the Company as a whole.
Appendix 7 details the
number of respondents for each of the 10 UK, 11 SE and 6 US sites selected.
3.3.3 - Selection of lagging SHE performance indicators
The
desirable features of lagging SHE performance indicators are discussed in
Section 2.5. Appendix 3 details
the reported 2002 SHE performance indicators for all of the sites and
functions in the three countries for which data are available from the
Focus 2002 survey. The following
observations and comments regarding the 2002 GSHE lagging SHE performance
indicators are made.
AstraZeneca
minor injuries: The use of
minor injury rates is not favoured because, as explained in Section 2.5, they
are likely to be under-reported compared with significant injuries.
This view is reinforced by the fact that only 10 of the 27 sites
reported minor injuries. This is in contrast to significant injuries where 25 of the
27 sites reported at least one significant injury.
Additionally, where data are
available, the expected ratio of serious to minor injuries of 1:10
[10]
is not seen. This brings into
question the robustness of the number of reported minor injuries.
There is more confidence in the reporting data for serious injuries
because there is generally also a legal requirement to report.
The minor injury data are therefore not utilised in this project.
Contractor
minor and significant injuries: The
use of contractor injury information was rejected because only AstraZeneca
personnel took part in the Focus 2002 survey.
Although it is foreseeable that the actions and omissions of
contractors may affect AstraZeneca personnel, the relationship to
AstraZeneca’s organisational culture cannot be assessed using the available
data. It is, however, interesting
to note that the ratio between reported minor and significant injuries mirrors
the AstraZeneca data. This again
brings into question the validity of the number of minor injuries reported.
Non-Injury
Information: These data are
rejected because, for any given GSHE reported criterion, the majority of the
sites reported a zero value during 2002.
AstraZeneca
Significant Injuries: Significant
injury reports appear to be the most useful lagging SHE performance indicator
because:
·
There is often an associated legal requirement to
report these injuries.
·
Significant injuries normally involve medical
treatment and are hence more visible.
·
All but two of the sites reported at least one
significant injury.
·
The number of significant injuries reported shows a
large degree of variation between sites.
The
significant injury data are therefore chosen as the preferred lagging SHE
performance indicator to be included within the scope of Chapters 4 and 5.
3.3.4 – Selection of Focus 2002 metrics
Section
2.8.2 provided an overview of the Focus 2002 data set. A number of different metrics are able to describe the
characteristics and distribution of the Focus 2002 data.
Metrics include the mean, standard deviation, kurtosis and skewness.
A comprehensive list and explanation of distribution metrics can be
found in Price
[133]
and the internet glossary website Risk Glossary.Com
[142]
. The categorical nature and limited range of possible responses
to each Focus 2002 question need to be taken into consideration when choosing
a metric to describe the data. The
Focus 2002 questions have a maximum of five and a minimum of two possible
responses. Distribution metrics
such as standard deviation and kurtosis do not provide meaningful information
for categorical data as the number of categories approaches two.
The usefulness of distribution metrics such as skewness and kurtosis to
describe the Focus 2002 data with a maximum of five categories is
questionable. The choice of
metric to describe the Focus 2002 data should ideally reflect the distribution
of organisational culture and the hypothesised accident causation model within
AstraZeneca. Conceptually, two
extreme cases are possible. The
first case is where the culture is uniform or homogeneous and all personnel
contribute to, and are equally at risk from, a negative SHE outcome such as an
injury. The second case is where
the organisational culture is heterogeneous and a minority of individuals
cause the negative SHE outcomes. The
use of the arithmetic mean Focus 2002 question response will be most
applicable to the first case. Distribution
metrics that are not dependent upon a normal distribution, such as kurtosis,
may be more appropriate for the second case. The literature review
in Chapter 2 identified that the mean response appears to be the only metric
used in previous research attempting to correlate metrics of organisational
culture with lagging SHE performance indicators [41, 58, 88, 154].
A subjective insight into the degree of AstraZeneca’s organisational
cultural homogeneity may arguably be obtained by inspection of the
distribution of Focus 2002 question responses.
Question response distributions with a low standard deviation may be
indicative of a homogeneous culture. Question
responses that show more than one peak in the response distribution may be
indicative of a heterogeneous organisational culture.
The separate peaks may be indicative of sub-cultures.
The presence of more than one response peak will change a number of
distribution metrics. Compared
with a normally distributed response, a multi-peak response distribution will
have a larger standard deviation and may have a lower kurtosis.
Information regarding the Focus 2002 response distributions is provided
within Section 3.5.3. The question ‘Are the majority of
accidents caused by a minority of individuals, either being exposed to a
higher level of risk or due to having an ‘undesirable’ culture?’ is
potentially difficult to answer. Information
regarding the working environment and organisational culture would need to be
obtained for those who are involved with undesirable SHE outcomes.
This information would then need to be compared with the same
information obtained from those personnel who are not involved with accidents.
This information could then be used to identify to what extent culture
and the degree of exposure to risk are related to undesirable SHE outcomes.
The identity of Focus 2002 participants was kept confidential.
Due to the confidentiality of the responses it is not possible to
answer the above question as part of this project.
Chapters
4 and 5 are concerned with two principal objectives, namely:
·
Establishment of the relationship between the Focus
2002 data and SHE lagging performance indicators; and
·
Identification of those organisational factors that
discriminate nations and sites with differing SHE performance.
It would
appear that the mean site response to the Focus 2002 questions has the
potential to address these research objectives.
The use of the mean does have limitations in that it is unable to
provide an insight into the relationship between cultural heterogeneity and
lagging SHE performance indicators.
Based upon the above arguments, the mean and standard deviation have
been chosen as the metrics to be used to represent the Focus 2002 data in the
subsequent analysis. The mean is
chosen because it represents the average site attitude of the respondents.
The standard deviation has been chosen because it represents the degree
of spread of responses to a particular question and may therefore be related
to the degree of organisational heterogeneity.
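The claim made earlier in this section, that a multi-peak response distribution has a larger standard deviation and may have a lower kurtosis than a single-peak one, can be illustrated with a minimal sketch. The response counts below are hypothetical and are not Focus 2002 data; the kurtosis used is the population excess kurtosis, which is an assumption since the text does not state which definition was intended.

```python
from statistics import mean, pstdev

def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardised moment minus 3."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("kurtosis is undefined when all responses are identical")
    return sum(((v - m) / s) ** 4 for v in values) / len(values) - 3.0

def expand(counts):
    """Turn {response option: count} into a flat list of responses."""
    return [option for option, n in counts.items() for _ in range(n)]

# Hypothetical response counts on a five-point scale (illustration only)
single_peak = expand({1: 5, 2: 20, 3: 50, 4: 20, 5: 5})
two_peaks = expand({1: 40, 2: 10, 3: 0, 4: 10, 5: 40})

for label, data in (("single peak", single_peak), ("two peaks", two_peaks)):
    print(label, round(pstdev(data), 3), round(excess_kurtosis(data), 3))
```

With these illustrative counts both distributions share the same mean, so the mean alone cannot distinguish them, whereas the spread and the kurtosis can.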
3.4 - Data pre-treatment
3.4.1 - Introduction
As received
by the author, the Focus 2002 and GSHE lagging SHE performance indicator data
were not in a suitable format to use directly.
This section summarises the pre-treatment that was applied to the data
sets prior to the work detailed in Chapters 4 and 5.
3.4.2 - Focus 2002 data pre-treatment
The
Focus 2002 database was obtained as a data file from AstraZeneca GSHE
Department. The database
contained the 41,779 individual responses from all of the survey participants.
The file consisted of 41,779 rows by 139 columns of data.
Each row of the file contained the responses from individual survey
participants. The columns of the
file recorded the information contained within the Focus 2002 questionnaire.
All of the demographic data associated with each respondent were coded.
For example, columns 1 to 7 inclusive of the data file related to the
‘sequence number’. The
sequence number was a unique respondent identifier.
Columns 8 and 9 related to a Global location code etc.
The data file was imported into Microsoft Excel 2000 (version 9.0.6926
– SP3) and saved as a workbook file. During
the file import the ‘fixed width - text input’ tool within Excel was used
to merge appropriate columns of data together.
For example, columns 1 to 7 were merged to form one column of seven
digit data within the Excel worksheet. The
data within the resultant Excel workbook were write-protected and saved for
use in the analysis detailed in Chapters 4 and 5.
The robustness of the transformation was checked by visually cross
checking several rows of the data file with the corresponding rows in the
resultant Excel worksheet. The
transformation was found to be successful with no observed errors or
omissions.
3.4.3 - GSHE annual report data pre-treatment
Section
2.10.3 highlighted the importance of using accident rates rather than numbers
of accidents when comparing the SHE performance of several sites.
The number of significant injuries occurring at each UK, SE and US site
during 2002 was obtained directly from the GSHE 2002 annual site reports.
Equation 1 was adapted to reflect the number of hours AstraZeneca
employees work per annum to give the significant-injury rate as per Dodsworth [46].
The AstraZeneca injury frequency rate equation is given in
Equation 5.
(5)
This
calculation was used to enable benchmarking with sites other than those in the
UK, SE and US. The 100,000 figure
within Equation 5 represents the approximate number of hours a person will
work during a lifetime of employment. The
1450 figure within Equation 5 represents the number of contractual hours a
person within AstraZeneca works in a year.
The AstraZeneca injury frequency rate therefore represents the
approximate number of injuries an employee would be expected to sustain during a
working life. As defined within
Equation 5, the injury frequency rate has no units and is therefore not a
true rate. Equation 5 was
entered into Excel 2000 (version 9.0.6926 – SP3) and used to calculate the
significant-injury frequency rates for all UK, SE and US sites.
The resultant significant-injury frequency rates are reproduced in
Appendix 7.
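The calculation described in this section can be sketched as follows. The body of Equation 5 is not reproduced in this text, so the form used here is an assumption inferred from the description: the number of significant injuries multiplied by 100,000 and divided by the hours worked, with hours worked taken as the number of employees multiplied by 1450. The function name and the site figures are hypothetical.

```python
LIFETIME_HOURS = 100_000  # approximate hours worked in a lifetime of employment (from the text)
HOURS_PER_YEAR = 1_450    # contractual hours per AstraZeneca employee per year (from the text)

def significant_injury_frequency_rate(injuries, employees):
    """Assumed form of Equation 5: injuries x 100,000 / (employees x 1450)."""
    if employees <= 0:
        raise ValueError("employees must be positive")
    hours_worked = employees * HOURS_PER_YEAR  # assumption: every employee works the contractual hours
    return injuries * LIFETIME_HOURS / hours_worked

# Hypothetical site figures: (significant injuries in the year, number of employees)
sites = {"SITE_A": (3, 1200), "SITE_B": (0, 450)}
rates = {name: significant_injury_frequency_rate(i, n) for name, (i, n) in sites.items()}
print(rates)
```

Dividing by hours worked rather than comparing raw injury counts is what allows sites of different sizes to be compared on the same basis.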
3.4.4 - Calculation of the mean Focus 2002 responses
The
arithmetic mean responses to each of the Focus 2002 questions were calculated
for each UK, SE and US site. The
calculations were performed using the ‘AVERAGE’ tool within Excel 2000
(version 9.0.6926 – SP3). The
arithmetic mean responses for the UK, SE and US sites are reproduced in
Appendices 8.1 to 8.3.
3.4.5 – Calculation of the standard deviation of the Focus 2002 responses
The
standard deviation of the Focus 2002 question responses was calculated for
each UK, SE and US site using the ‘STDEV’ tool within Excel 2000 (version
9.0.6926-SP3). The resultant
standard deviations are reproduced in Appendices 8.4 to 8.6.
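For readers without access to the original workbook, the site-level mean and standard deviation calculations of Sections 3.4.4 and 3.4.5 can be approximated with the sketch below. It assumes the sample (n - 1) standard deviation, which is what the Excel ‘STDEV’ function computes, and it skips missing responses as Excel does for empty cells. The site codes, question labels and responses are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical respondent rows: (site code, {question: coded response})
rows = [
    ("UK1", {"q5": 4, "q17": 2}),
    ("UK1", {"q5": 5, "q17": 3}),
    ("UK1", {"q5": 3, "q17": None}),  # None marks a missing response
    ("SE1", {"q5": 2, "q17": 4}),
    ("SE1", {"q5": 4, "q17": 4}),
]

def site_summaries(rows):
    """Return {(site, question): (mean, sample standard deviation)}."""
    grouped = defaultdict(list)
    for site, answers in rows:
        for question, value in answers.items():
            if value is not None:  # skip missing responses
                grouped[(site, question)].append(value)
    return {
        key: (mean(values), stdev(values) if len(values) > 1 else float("nan"))
        for key, values in grouped.items()
    }

summaries = site_summaries(rows)
for key in sorted(summaries):
    print(key, summaries[key])
```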
3.5 - Understanding the structure of the Focus 2002 survey data
3.5.1 - Introduction
Section
2.9.3.2 explained the PLS modelling strategy.
The first step in the PLS modelling strategy is to carry out a
preliminary statistical analysis, which is performed for two reasons.
Firstly, it confirms that the data are suitable and sufficient for
further analysis. Secondly, it
provides a priori knowledge regarding the underlying structure of the
data that may in turn influence the selection and application of PLS, PLS-DA
and SIMCA modelling parameters.
The purpose of this section is to summarise the preliminary statistical
analysis that was performed on the Focus 2002 question responses.
The following statistics were calculated:
·
The percentage of missing responses within the Focus
2002 data.
·
The distribution of responses to the Focus 2002
questions for the UK, US and SE nations.
·
The (Pearson) correlations between the Focus 2002
mean responses for all UK, US and SE sites.
·
For each UK, US and SE site, the (Pearson)
correlation coefficients between AstraZeneca significant-injury frequency
rate and the Focus 2002 question response mean and standard deviations.
3.5.2 – Calculation of the incidence of missing responses within the
Focus 2002 data
It is expected that
survey respondents will not answer all of the questions asked.
Possible reasons for not answering a question include not understanding
the question, not having sufficient time to complete the survey and the
question not being applicable to the respondent.
If unknown to the researcher, or ignored, missing responses have the
potential to significantly influence the results of any statistical analysis
performed. The incidence of
missing question responses is important for two reasons.
Firstly, it can introduce statistical bias and secondly, it may
indicate that a particular group or cluster of respondents has difficulty with,
or cannot relate to, the question being asked.
It is therefore essential to be aware of the incidence of missing
responses prior to statistical analysis.
The incidence of missing responses for the UK, SE and US site data sets
was calculated for each Focus 2002 question using the ‘workset –
statistics’ option within SIMCA P+. The
SIMCA P+ output is reproduced in Appendix 9. The average incidence of
missing responses was found to be less than 0.5% for the UK and US and less
than 0.7% for SE. Questions 8b,
8c and 9a had the highest percentages of missing responses, at 2.8%, 2.9% and
1.2% respectively. All other questions
had less than 1% missing responses.
Based upon these figures the author considers that all of the Focus
2002 questions are suitable for inclusion in the subsequent analysis.
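The missing-response check was performed in SIMCA P+, but the same quantity can be computed directly from the respondent data. The following is a minimal sketch, not the SIMCA P+ procedure; the question labels, the responses and the 1% reporting threshold used for flagging are illustrative assumptions.

```python
def percent_missing(rows, questions):
    """Percentage of respondents with no answer recorded, per question."""
    total = len(rows)
    result = {}
    for question in questions:
        missing = sum(1 for answers in rows if answers.get(question) is None)
        result[question] = 100.0 * missing / total if total else 0.0
    return result

# Hypothetical responses; None or an absent key marks a missing answer
rows = [
    {"q8b": 3, "q9a": 1},
    {"q8b": None, "q9a": 2},
    {"q8b": 4},
    {"q8b": 5, "q9a": 2},
]

missing = percent_missing(rows, ["q8b", "q9a"])
flagged = [q for q, pct in missing.items() if pct > 1.0]  # assumed reporting threshold of 1%
print(missing, flagged)
```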
3.5.3 – Focus 2002 response distribution
Sections 2.9.2.2 and 2.9.3.2 explained that both PCA and
PLS are best able to model data that are approximately
normally distributed. The purpose of this section is to describe the work that was
performed to understand the distribution of Focus 2002 question responses. The
‘DCOUNT’ function within Excel 2000 (version 9.0.6926-SP3) was used to
count the number of times each Focus 2002 question response option was
selected in the UK, SE and US nations.
The resultant response counts were converted into
percentages to allow inter-nation comparisons to be made.
The percentage response information for each question was imported into
Microsoft PowerPoint 2000 (version 9.0.6620-SP3) and grouped histograms were
drawn. The response distribution
histograms for the UK, SE and US sites are reproduced in Appendix 10.
Examination of the response distribution figures
exemplifies the difficulty of visually identifying trends
and relationships within multivariate data sets. With that caveat, the
following subjective observations are made. The US gave the highest
level of response to 48 out of the 82 Focus 2002 questions. This may indicate
the presence of fewer sub-cultures compared
to the other two nations or that US respondents may be more polarised in their
views. Examples of the
Focus 2002 response distributions are provided in Figure 3.1. The response
distribution for question 5 is an example where
the responses are not able to discriminate the SE, UK and US sites, whereas
question 30b is able to discriminate at least one nation’s sites from the
other two. The inclusion of
questions that discriminate national cultures may introduce unwanted variance
into a model correlating responses with lagging SHE performance indicators on
a global scale. The response
distributions for question 42 are approximately normally distributed. The
response distributions for question 47 are skewed.
The response distributions for question 56 indicate two distinct peaks.
The presence of two distinct peaks may indicate the presence of
sub-cultures. Inspection of
the distribution histograms indicates that PCA and PLS should be well able to
model a significant proportion of the Focus 2002 question responses as they
are approximately normally distributed. PCA
and PLS analysis of non-normally distributed question responses should be
possible after distribution transformation (Section 2.9.3.2 explained the
process of transformation of non-normally distributed data, prior to
analysis).
Figure 3.1 – Example Focus 2002 survey response distributions
(histogram series: United Kingdom, United States, Sweden)
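The count-to-percentage conversion described in this section, together with a simple check for more than one response peak, can be sketched as follows. This is an illustration rather than the Excel ‘DCOUNT’ workflow; the five-point coding, the example responses and the peak-counting rule (a peak is an option whose percentage exceeds both neighbours) are assumptions.

```python
from collections import Counter

def response_percentages(responses, options=(1, 2, 3, 4, 5)):
    """Percentage of valid responses falling on each response option."""
    valid = [r for r in responses if r in options]
    counts = Counter(valid)
    n = len(valid)
    return {opt: (100.0 * counts[opt] / n if n else 0.0) for opt in options}

def count_peaks(percentages):
    """Count options whose percentage is strictly higher than both neighbours."""
    padded = [-1.0] + list(percentages.values()) + [-1.0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )

# Hypothetical coded answers to one question, by nation (None = missing)
by_nation = {
    "UK": [1, 2, 2, 3, 3, 3, 4, 4, 5, None],
    "US": [1, 1, 1, 2, 5, 5, 5, 5],
}

distribution = {nation: response_percentages(r) for nation, r in by_nation.items()}
for nation, pcts in distribution.items():
    print(nation, {k: round(v, 1) for k, v in pcts.items()}, "peaks:", count_peaks(pcts))
```

Working in percentages rather than counts is what makes nations with different numbers of respondents directly comparable.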
3.5.4 - Focus 2002 question response and significant-injury frequency correlations
The literature
review detailed in Chapter 2 indicated that previous research has correlated
the responses to single climate questions with lagging SHE performance
indicators. Calculation of a
(Pearson) correlation matrix of Focus 2002 question responses and the AstraZeneca
significant-injury frequency rate (SIFR) allows:
·
The rapid identification of those Focus 2002 question responses
that correlate with SIFR.
·
Identification of question responses that are highly correlated
with one another.
A priori
knowledge of those Focus 2002 question responses that correlate with SIFR
is useful before detailed analysis begins. A
model based upon these questions will potentially be more able to predict SIFR
performance compared with a model that has been built without a priori
knowledge. Focus 2002 question
responses that are highly correlated with one another may indicate that those
questions are measuring the same latent construct.
If problems are encountered in subsequent PLS, PCA or PLS-DA analysis,
the correlation information can be used to simplify the data set by removal of
one or more of the question responses that correlate with a question remaining
within the PLS, PCA or PLS-DA model. The
following correlations were calculated for the UK, SE and US sites:
·
Pearson correlation coefficients between the arithmetic mean
Focus 2002 question responses and site SIFR.
·
Pearson correlation coefficients between the standard deviation
of each Focus 2002 question response and site SIFR.
As will be discussed
later, the correlation coefficients between the standard deviation of the
Focus 2002 questions and site SIFR were calculated to investigate the
relationship between the spread of question responses and SIFR.
Pearson correlation matrices were calculated using the ‘tools-data
analysis - correlation function’ within Microsoft Excel 2000 (version
9.0.6926-SP3). The resultant
correlation coefficients are reproduced in Appendix 11.
In Section 3.2 it was subjectively asserted that Focus 2002 questions
17, 28, 36, 41, 45 were associated with either Flin et al’s [60] ‘safety
system’ or ‘risk’ cultural factors.
These two factors can arguably be associated with ‘safety’.
Based upon the above arguments the following observations are made:
·
25 UK, 17 SE and 16 US Pearson correlation coefficients are
significant.
·
14 of the significant US Pearson correlation coefficients are
negative.
·
All of the UK Pearson correlation coefficients are positive.
·
Focus 2002 questions 24 and 55a are (Pearson) correlated with SIFR
above the level of significance in all three countries.
It is noted that the Pearson correlation coefficient for the UK and SE
is positive, whereas for the US it is
negative.
·
0 UK, 0 US and 2 SE Focus 2002 responses that significantly
(Pearson) correlate with SIFR are related to ‘safety’.
·
A number of Focus 2002 question responses that positively
correlate with SIFR in the UK or SE are found to correlate negatively with
SIFR in the US.
The following observations are
made for the correlation matrices based upon the Focus 2002 question response
standard deviations:
·
10 UK, 14 SE and 17 US Pearson correlation coefficients are
significant.
·
Only Focus 2002 question number 22 is (Pearson) correlated to
SIFR above the level of significance in the UK, SE and US.
Visual examination of
the UK, US and SE correlation matrices indicates that a high proportion of the
question responses are strongly correlated with at least one other response.
A high correlation coefficient between two or more question responses
may be indicative of the questions measuring the same underlying theoretical
construct. The existence of high
inter-response correlations can be advantageous in reducing the complexity of
multivariate models. For instance, if several variables correlate with one another
then there may be an opportunity to choose one to represent all of them.
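The correlation step can be sketched without Excel as shown below. The sketch computes a Pearson coefficient between site mean responses and site SIFR, and judges significance at 95% confidence with the usual t test on n - 2 degrees of freedom. The text does not state how significance was assessed, so the test is an assumption, and the site values are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient of two equal-length series."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length series with at least 3 points")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return float("inf") if r > 0 else float("-inf")
    return r * sqrt((n - 2) / (1.0 - r * r))

# Hypothetical site-level data: mean response to one question, and site SIFR
mean_response = [3.1, 3.4, 3.6, 3.9, 4.2, 4.4]
sifr = [0.42, 0.35, 0.33, 0.25, 0.20, 0.12]

r = pearson_r(mean_response, sifr)
t = t_statistic(r, len(sifr))
T_CRIT_95_DF4 = 2.776  # two-tailed 5% critical value of Student's t for 4 degrees of freedom
significant = abs(t) > T_CRIT_95_DF4
print(round(r, 3), round(t, 2), significant)
```

With only a handful of sites per nation the degrees of freedom are small, so a coefficient must be large in magnitude before it passes the significance test.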
3.5.5 – Calculation of the incidence of missing Focus 2002 question responses
It is important to
understand, for each site, the percentage of personnel taking part in the
survey prior to subsequent analysis. If,
for any one site, the number of respondents is low, the corresponding
arithmetic mean Focus 2002 responses will not be sufficiently representative
of the entire site. Based upon
the subjective opinion of the author, it was decided that only sites with an
average question response rate of above an arbitrarily set threshold of 30%
would be selected for subsequent analysis.
The percentage of AstraZeneca
personnel responding to the Focus 2002 survey was calculated for each UK, SE
and US site. The results are
given in Appendix 7.
With the exception of site
UK9, all UK sites were found to have response percentages of between 41% and
73%. According to the information
provided to the author, only 2% of site UK9 responded to the Focus 2002
survey. It is not known if the
low response rate is genuine or is a result of an error in the reporting
database. Regardless of whether
the UK9 response rate is genuine, a value of 2% is insufficient to represent
the site. Site UK9 was therefore
removed from the scope of the analysis detailed in Chapters 4 and 5.
With the exception of sites
US3 and SE8, all US and SE sites were found to have a response rate of between
51% and 96%. The number of
responses for sites US3 and SE8 was found to be greater than the respective
site AstraZeneca survey populations. It
is unlikely that the total site population is incorrect.
The most likely source of error is the Focus 2002 site coding, detailed
in Section 2.8.2. An error in the
site coding information could allow respondents from other sites to be
included within site US3 and SE8 data. For
this reason, sites US3 and SE8 were removed from the scope of the analysis
detailed in Chapters 4 and 5.
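The site screening rule applied in this section can be sketched as follows: a site is retained only if its response rate is above the 30% threshold, and it is excluded if more responses were recorded than there are personnel, since that indicates a coding error. The site names and counts are hypothetical.

```python
MIN_RATE = 30.0   # threshold set by the author (percent)
MAX_RATE = 100.0  # more responses than staff signals a site coding error

def screen_sites(sites):
    """Split sites into retained and excluded using their survey response rate."""
    retained, excluded = {}, {}
    for name, (responses, population) in sites.items():
        rate = 100.0 * responses / population
        if MIN_RATE < rate <= MAX_RATE:
            retained[name] = rate
        else:
            excluded[name] = rate
    return retained, excluded

# Hypothetical counts: (number of responses, site survey population)
sites = {"SITE_A": (410, 1000), "SITE_B": (20, 1000), "SITE_C": (660, 600)}
retained, excluded = screen_sites(sites)
print(retained, excluded)
```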
3.5.6 - Principal Component Analysis
3.5.6.1 - Introduction
Section
2.9.2.2 provided an overview of how principal component analysis (PCA) can be
used to identify latent constructs and simplify the complex nature of
multivariate and megavariate data sets.
This section describes the application of PCA to the Focus 2002
question response data. PCA
modelling was performed for the following reasons:
·
To better understand the underlying structure of the responses
to the Focus 2002 questions.
·
To obtain principal component information that will be entered
into subsequent SIMCA models.
The data were organised into
two groups. The first group
involved the PCA modelling of only those responses to Focus 2002 questions
that correlated with AstraZeneca SIFRs for each nation.
The second group involved the PCA modelling of all of the Focus 2002
question responses. The principal
reason for splitting the data into two groups was to discover the underlying
structure of those questions known to (Pearson) correlate with SIFR.
3.5.6.2 – Principal Component Analysis – method
Two PCA models were
created for each national data set. The
X block data for the first model consisted of those responses to the Focus
2002 questions identified in Section 3.5.4 as being (Pearson) correlated with
the AstraZeneca SIFR. The X block
data for the second model consisted of the responses to all of the Focus 2002
questions. The mean Focus 2002
responses and AstraZeneca significant-injury frequency rates were entered into
SIMCA P+ (version 10.0.4.0). The
mean responses to the Focus 2002 questions were used as the X block data for
both models. The six resultant
models are summarised in Table 3.2.
| Model Name | Nation | Focus 2002 Question Responses Included In The X block Data |
| PCA-UK1 | UK | For all UK sites: those Focus 2002 question responses that (Pearson) correlated with SIFR above the level of statistical significance (at 95% confidence). |
| PCA-UK2 | UK | All UK site mean responses to the Focus 2002 survey. |
| PCA-SE1 | SE | For all SE sites: those Focus 2002 question responses that (Pearson) correlated with SIFR above the level of statistical significance (at 95% confidence). |
| PCA-SE2 | SE | All SE site mean responses to the Focus 2002 survey. |
| PCA-US1 | US | For all US sites: those Focus 2002 question responses that (Pearson) correlated with SIFR above the level of statistical significance (at 95% confidence). |
| PCA-US2 | US | All US site mean responses to the Focus 2002 survey. |
Table 3.2 – The Focus 2002 PCA model inputs
The default mean
centering and scaling option was selected within SIMCA P+.
Models PCA-UK2, PCA-SE2 and PCA-US2 were optimised by inspecting
the question response R² and Q² values and
successively removing those responses that had a negative Q² or R²
value, a Q² value of less than 0.5, or an R²/Q²
ratio of less than 0.7. After each
removal of a question response, the model was re-run and the new R² and
Q² plots were inspected. The
process was repeated until all of the Q² values were above 0.5 and
the R²/Q² value was above 0.7.
It is important to stress that the goal of model optimisation was to
achieve satisfactorily high R² and Q² values while
avoiding over-optimisation, i.e. the removal of too many question
responses simply to drive ever-increasing R² and Q²
values. Models PCA-UK1, PCA-SE1 and PCA-US1 were not optimised as they
are based upon a priori information regarding which question responses
correlate with SIFR.
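The optimisation procedure described above can be sketched as a simple loop. This is not SIMCA P+ functionality: `fit` is a hypothetical callback standing in for refitting the PCA model and reading off each variable's R² and Q², and the choice to remove the failing response with the lowest Q² first is an assumption, since the text does not state the order of removal. The thresholds are those given in the text.

```python
def optimise(variables, fit, min_q2=0.5, min_ratio=0.7, min_vars=2):
    """Remove one variable at a time until every remaining variable passes.

    `fit` is a hypothetical callback: given a list of variable names it refits
    the model and returns {name: (r2, q2)} for those variables.
    """
    kept = list(variables)
    while len(kept) > min_vars:
        stats = fit(kept)
        failing = [
            v for v in kept
            if stats[v][0] < 0                        # negative R2
            or stats[v][1] < min_q2                   # negative or low Q2
            or stats[v][0] / stats[v][1] < min_ratio  # R2/Q2 ratio, as stated in the text
        ]
        if not failing:
            break
        worst = min(failing, key=lambda v: stats[v][1])  # assumption: drop lowest Q2 first
        kept.remove(worst)
    return kept

def fake_fit(names):
    """Stand-in for a real model fit; returns fixed, made-up (R2, Q2) pairs."""
    table = {"q1": (0.9, 0.8), "q2": (0.6, 0.2), "q3": (0.7, 0.6), "q4": (0.3, -0.1)}
    return {n: table[n] for n in names}

kept = optimise(["q1", "q2", "q3", "q4"], fake_fit)
print(kept)
```

The `min_vars` floor is a hypothetical guard against the over-optimisation the text warns about: it stops the loop from stripping the model down to a trivially small set of responses.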
3.5.6.3 – Principal Component Analysis – results
The graphical outputs
from the PCA modelling are reproduced in Appendix 12.
Table 3.3 summarises the results.
A description of the columns in Table 3.3 follows:
Header column 1: PCA model name.
Header column 2: The number of principal components in the final model.
Header column 3: The cumulative model R2 and Q2 values.
Header column 4: The values of R2 and Q2 for the first principal component.
Header column 5: The cumulative R2 and Q2 values for the first two principal components.
Header column 6: The cumulative R2 and Q2 values for the first three principal components.
Header column 7: The cumulative R2 and Q2 values for the first four principal components.
Header column 8: The question responses that are retained in the final model.
| PCA Model Name | No Of Comp | Model Cumulative R2 | Model Cumulative Q2 | 1st Principal Comp R2 | 1st Principal Comp Q2 | 2nd Principal Comp R2 | 2nd Principal Comp Q2 | 3rd Principal Comp R2 | 3rd Principal Comp Q2 | 4th Principal Comp R2 | 4th Principal Comp Q2 | Focus 2002 Question Responses Retained In Model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PCA-UK1 | 1 | 0.731 | 0.556 | 0.731 | 0.556 | na | na | na | na | na | na | 7, 8a, 8b, 8c, 10f, 12, 16, 20a, 20c, 20d, 20e, 22, 23, 24, 25, 27b, 37, 38a, 38b, 40, 47, 53a, 53b, 53c, 55a. |
| PCA-UK2 | 4 | 0.956 | 0.628 | 0.5587 | 0.1163 | 0.782 | 0.295 | 0.8997 | 0.402 | 0.956 | 0.628 | 1, 3, 7, 8a, 8c, 9b, 13, 19, 20a, 20c, 20e, 20f, 22, 25, 26, 27a, 28, 29, 30a, 33, 34, 35, 37, 38a, 40, 42, 47, 50, 52, 53a, 53b, 55a, 57, 58, 59, 60. |
| PCA-SE1 | 1 | 0.837 | 0.764 | 0.837 | 0.764 | na | na | na | na | na | na | 5, 14, 17, 21, 23, 24, 25, 26, 28, 33, 34, 35, 44, 48, 55a, 55b, 58. |
| PCA-SE2 | 4 | 0.956 | 0.803 | 0.802 | 0.726 | 0.871 | 0.699 | 0.924 | 0.752 | 0.956 | 0.803 | 1, 3, 4, 5, 6, 7, 8a, 8b, 8c, 9a, 9b, 10a, 10c, 10d, 10e, 10f, 10g, 11, 12, 13, 14, 15, 16, 18, 19, 20a, 20b, 20c, 20d, 20e, 20f, 23, 24, 25, 26, 27a, 27b, 27c, 29, 30a, 30b, 31, 32, 33, 34, 35, 36, 37, 38a, 38b, 38c, 40, 42, 43, 46, 47, 50, 51, 52, 53a, 53b, 53c, 55a, 55b, 56, 57, 58, 59. |
| PCA-US1 | 1 | 0.911 | 0.859 | 0.911 | 0.859 | na | na | na | na | na | na | 3, 4, 11, 20a, 20e, 22, 24, 30b, 38a, 38b, 38c, 53a, 53b, 53c, 55a, 57. |
| PCA-US2 | 2 | 0.928 | 0.740 | 0.737 | 0.447 | 0.928 | 0.740 | na | na | na | na | 1, 2, 4, 5, 6, 7, 8a, 8b, 8c, 9a, 9b, 10c, 10d, 10g, 11, 12, 13, 14, 16, 18, 20a, 20b, 20c, 20d, 20e, 20f, 22, 24, 25, 27a, 27b, 27c, 29, 30a, 30b, 31, 33, 34, 35, 37, 38a, 38b, 38c, 39, 40, 42, 43, 46, 47, 52, 53a, 53b, 53c, 58. |
Table 3.3 – Summary of the PCA results
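To make the R2 figures in Table 3.3 more concrete, the following Python sketch shows one way the R2 of a first principal component can be obtained from a sites-by-questions matrix. It assumes that "mean centering and scaling" means scaling each column to unit variance, uses simple power iteration rather than the algorithm inside SIMCA P+, and does not attempt the cross-validation needed for Q2. The small data matrix is hypothetical.

```python
import math


def autoscale(X):
    """Mean-centre each column and scale it to unit variance."""
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]


def first_component_r2(X, iterations=500):
    """Fraction of total variance captured by the first principal component."""
    Z = autoscale(X)
    n, k = len(Z), len(Z[0])
    # Correlation matrix of the autoscaled data.
    C = [[sum(Z[r][i] * Z[r][j] for r in range(n)) / (n - 1)
          for j in range(k)] for i in range(k)]
    # Power iteration for the leading eigenvector; the uneven start vector
    # reduces the chance of starting orthogonal to it.
    v = [1.0 + 0.1 * i for i in range(k)]
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(C[i][j] * v[j] for j in range(k))
                     for i in range(k))
    return eigenvalue / k  # the trace of a correlation matrix equals k


# Hypothetical site means: 5 sites (rows) by 3 questions (columns).
X = [
    [4.1, 3.9, 2.0],
    [3.8, 3.7, 2.4],
    [3.5, 3.2, 2.9],
    [3.1, 3.0, 3.3],
    [2.8, 2.6, 3.9],
]
r2_first = first_component_r2(X)
print(round(r2_first, 3))
```

Because the three hypothetical questions move almost in lockstep across sites, a single component captures nearly all of the variance; survey data with several independent themes would need more components to reach a comparable cumulative R2.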
3.5.6.4 – Principal Component Analysis – conclusions
All of the
PCA models built had cumulative R2 values in excess of 0.7 and Q2/R2
values in excess of 0.5. Such
results were not anticipated, given the inherently variable nature of human
survey data.
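The two thresholds quoted above can be checked directly against the cumulative values in Table 3.3. The short Python sketch below does this; the figures are copied from the table, while the variable names are purely illustrative.

```python
# Cumulative (R2, Q2) for each final model, copied from Table 3.3.
models = {
    "PCA-UK1": (0.731, 0.556),
    "PCA-UK2": (0.956, 0.628),
    "PCA-SE1": (0.837, 0.764),
    "PCA-SE2": (0.956, 0.803),
    "PCA-US1": (0.911, 0.859),
    "PCA-US2": (0.928, 0.740),
}

# Q2/R2 indicates how much of the fitted variation survives cross-validation.
ratios = {name: q2 / r2 for name, (r2, q2) in models.items()}

for name, (r2, q2) in models.items():
    print(f"{name}: R2={r2:.3f}  Q2={q2:.3f}  Q2/R2={ratios[name]:.2f}")

all_pass = all(r2 > 0.7 and ratios[name] > 0.5
               for name, (r2, _) in models.items())
print("All models pass:", all_pass)
```

On these figures PCA-UK1 has the lowest cumulative R2 (0.731) and PCA-UK2 the lowest Q2/R2 ratio (about 0.66), so the UK models sit closest to the two thresholds.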
The
number of principal components in the final models varies between 1
and 4. Each principal
component may be envisaged as a dimension of organisational culture. Using the
graphical PCA outputs reproduced in Appendix 12, it was relatively
straightforward to identify and group those questions that load most highly
on each principal component. This analysis is
detailed in Chapters 4 and 5.
As expected,
the greatest amount of variation is accounted for by the first principal
component in each model. After
optimisation, models PCA-UK2, PCA-SE2 and PCA-US2 retained 36, 68 and 54
questions respectively.
The large number of remaining questions suggests that the PCA
models were not over-optimised. As
one would expect, not all of the questions in the models built from responses
known to correlate with SIFR are retained in the final models that
start with all of the Focus 2002 questions.
For example, responses to Focus 2002 questions 10f, 16, 20d, 23, 24,
27b, 38b and 53c in model PCA-UK1 are not retained in model PCA-UK2. This
result is expected for two reasons. Firstly, model
PCA-UK1 was not optimised: those
question responses in model PCA-UK1 that do not appear in model PCA-UK2
could not be modelled as well as the responses retained within
model PCA-UK2. Secondly, model
PCA-UK2 was built to best model the data;
no a priori information regarding how the question responses
relate to SIFR was entered into the model.
Models PCA-UK2, PCA-SE2 and PCA-US2 are all suitable for use in the
national discrimination SIMCA analysis detailed in Chapter 5. Inspection
of the PCA score scatter plots in Appendix 12 indicates:
·
Sites UK2 and UK6 differ from the other UK sites.
·
Site SE1 differs from the other SE sites.
·
Site US4 differs from the other US sites.
Appendix 7 details the
SIFR data for the UK, SE and US sites. Cross-referencing
the above observations with the SIFR of each site shows that sites UK2,
SE1 and US4 have significantly higher SIFR values than the other sites in
the same territory. It can be concluded that the PCA models are
measuring an organisational factor that is related to site SIFR performance.
One would expect this to be the case with models PCA-UK1, PCA-SE1 and
PCA-US1, as these models are based on question responses known to (Pearson)
correlate with SIFR. The ability
of models PCA-UK2, PCA-SE2 and PCA-US2 to discriminate between the better and
poorer SIFR performing sites without this a priori information is
surprising.
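The overlap between the a priori model PCA-UK1 and the optimised model PCA-UK2 discussed in this section can be listed directly from the question responses in Table 3.3. The Python sketch below is illustrative only; the question labels are copied from the table and the variable names are arbitrary.

```python
# Question responses retained in each UK model, copied from Table 3.3.
uk1 = ("7 8a 8b 8c 10f 12 16 20a 20c 20d 20e 22 23 24 25 27b "
       "37 38a 38b 40 47 53a 53b 53c 55a").split()
uk2 = ("1 3 7 8a 8c 9b 13 19 20a 20c 20e 20f 22 25 26 27a 28 29 30a "
       "33 34 35 37 38a 40 42 47 50 52 53a 53b 55a 57 58 59 60").split()

retained = set(uk2)
shared = [q for q in uk1 if q in retained]       # in both models
dropped = [q for q in uk1 if q not in retained]  # in PCA-UK1 only

print(f"PCA-UK1: {len(uk1)} responses, PCA-UK2: {len(uk2)} responses")
print(f"Shared: {len(shared)}")
print("In PCA-UK1 but not PCA-UK2:", ", ".join(dropped))
```

The same comparison can be repeated for the SE and US pairs by substituting the corresponding rows of Table 3.3.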
3.6 – Summary and conclusions
Section 3.2 compared the
Focus 2002 questions with organisational culture factors identified by
previous culture surveys. The
comparison suggested that the Focus 2002 survey may be capable of measuring
AstraZeneca’s organisational culture.

Section 3.3 explained the rationale
behind restricting the scope of the further analysis to the UK, SE and US
nations. The 2002 GSHE annual
accident reports were also examined with a view to identifying lagging SHE
performance indicators that could be used in the analysis detailed within
Chapters 4 and 5. Examination of
the data indicated that the number of reported significant injuries
was the most appropriate metric to use. The
site means and standard deviations were selected as the metrics to represent
the Focus 2002 responses.

Section
3.4 explained how the Focus 2002 and GSHE significant-injury information was
pre-treated and analysed. The
transformation of the raw Focus 2002 data into a data set ready for analysis
and the method of converting the number of significant injuries into
significant-injury frequency rates were described. The mean and the standard
deviation of each Focus 2002
question response were calculated for all UK, SE and US sites.

Section 3.5 explored the structure of the Focus 2002
data. The incidence of missing
responses was calculated and found to be satisfactorily low.
The percentage of personnel taking part in the Focus 2002 survey was
calculated for each UK, SE and US site. The
percentage responses for sites UK9, SE8 and US3 were found to be anomalous,
and these sites were therefore removed from the scope of the
analysis detailed within Chapters 4 and 5.
The response distribution histograms for each of the Focus 2002
questions were plotted. Inspection
of the histograms indicated that the majority of the responses are positively
or negatively skewed. A number of
the histograms showed two distinct peaks that may be indicative of the
presence of sub-cultures. Given
the range of response distributions within the data, one can conclude that
a single distribution metric is unlikely to be suitable to
represent all of the question responses.

Pearson correlation coefficients were calculated between the Focus 2002
survey responses and the SIFRs for the UK, SE and US sites. A significant
number of question responses were found to be
strongly correlated with SIFR for UK, SE and US sites. The presence of a
number of question responses that correlate with SIFR indicates that
a good predictive PLS model based upon them is highly likely to be
achievable. The joint
UK, SE and US correlation matrix indicated weak (Pearson) correlations with
SIFR at the national level. Given
this, the ability to build a single PLS model to predict SIFR performance in
each of the three nations is questionable.

Based upon the results of the
(Pearson) correlation matrices, three separate models will be built for each
group of UK, SE and US sites. The
first model will include only those mean question survey responses that are
known to (Pearson) correlate with SIFRs.
The second model will include all of the mean question survey
responses. The third model will
include all of the standard deviation question survey responses.
Given the a priori information, the first approach is expected to
yield a robust, well-predicting PLS model. Comparison of the predictive
ability of the first two PLS
models will give insight into how much SIFR predictive ability is contained in
the survey responses with low (Pearson) correlations that would otherwise be
ignored in typical OVAT analysis. The
third modelling approach will provide an insight into the usefulness of
question response metrics other than the mean.

Section 3.5.6 summarised the PCA that was performed
on the mean Focus 2002 question responses for the UK, SE and US sites.
The ability to produce PCA models for each group of UK, SE and US sites
suggested that PLS modelling of the Focus 2002 question responses was likely
to be possible.
|