A surrogate machine learning model for advanced gas-cooled reactor graphite core safety analysis

A surrogate machine learning model was developed with the aim of predicting seismic graphite core displacements from crack configurations for the advanced gas-cooled reactor. The model was trained on a dataset generated by a software package which simulates the behaviour of the graphite core during a severe earthquake. Several machine learning techniques, such as the use of convolutional neural networks, were identified as highly applicable to this particular problem. Through the development of the model, several observations and insights were garnered which may be of interest from a graphite core analysis and safety perspective. The best performing model was capable of making 95% of test set predictions within a 20 percentage point margin of the ground truth.


Introduction
Commercial nuclear power in the UK is dominated by the Advanced Gas-cooled Reactor (AGR) (Nonbøl, 1996), a design which combines the use of structural graphite with carbon dioxide coolant. This design is distinct from most reactors in operation internationally. As they have been in operation for many decades, the graphite cores of the AGRs are starting to exhibit age-related phenomena with the potential to impede commercial operation or safety. Due to the combined international novelty and advanced age of the reactors, a significant amount of domestic safety analysis is performed, involving all parts of AGR design and operation.
The generation of data for UK nuclear safety, including AGR graphite, has traditionally relied on hard computing techniques or physical experiments. These include complex, deterministic calculations involving thousands or even millions of equations and parameters based on Newtonian and nuclear physics as well as known material properties.
Using techniques from the field of machine learning, a surrogate model can be produced which yields results comparable to traditional techniques (Nyshadham et al., 2019; Asteris et al., 2021), but at a reduced computational cost. The use of machine learning surrogates is widespread across engineering, and there has been some progress within the AGR safety sector to exploit machine learning. However, these works are limited in scope to their own niche areas. For example, the two previously mentioned works produce surrogate models that yield accurate results in their particular fields. Despite being in a similar area of study (material property prediction), each of these works uses significantly different techniques to the other, highlighting the level of optimisation and specialisation required.
The research detailed here builds on existing works through the production of a machine learning surrogate model with a large number of inputs and outputs. In contrast to previous works combining machine learning and AGR safety (such as Dihoru et al., 2018), the model produced through this research directly predicts outputs from inputs. This enhances the practical application of the surrogate model, as engineering assessments use the relationship between input and output to make safety determinations. In addition, this work builds on the methods of previous works in the field of surrogate ML modelling of AGRs by applying new methods, including convolutional neural networks which exploit physical relationships within the data.
The remaining sections of this paper are organised as follows. Section 2 provides background information on nuclear reactor design in the UK, engineering models and machine learning. Section 2.5 highlights relevant existing research works. Section 3 describes the format of the dataset and the framework around it. Section 4 outlines the methodology by which ML surrogate models were developed. Section 5 presents the results of an experimental model production process. Finally, Section 6 concludes the paper with an overview and discussion of possible further work.

Advanced Gas-cooled Reactors (AGRs)
The UK nuclear power sector is dominated by the advanced gas-cooled reactor (AGR). There are 14 AGR reactors spread across 7 UK power stations. This design differs from most reactors around the world in that it uses graphite as a neutron moderator and carbon dioxide gas as coolant. This is in contrast to the more common pressurised water reactor (PWR) and boiling water reactor (BWR), both of which use light water as both neutron moderator and coolant. Although other reactors worldwide have used graphite or gas, there has not been the same level of international collaborative research for the AGR as exists for light water reactors. As a result, most AGR safety analysis must be performed domestically, leading to a significant requirement for computation.
The AGR graphite core consists of stacked graphite bricks assembled in an interlocking 3-dimensional assembly (shown in Fig. 1). There are two principal brick types as shown in Fig. 2: (1) large bore bricks for the insertion of fuel assemblies (blue in Fig. 3); and (2) interstitial bricks which provide structural support (red in Fig. 3), some of which have a small bore to allow the insertion of a control rod. The bricks are linked and held in position by rectangular graphite keys. An individual stack of bricks the height of the core is known as a channel, with the type of channel dictated by the type of brick in that channel (fuel or interstitial).
The primary safety concern regarding the AGR graphite core is the cracking of the fuel bricks. Given enough cracks, the ability to insert fuel or control rods could be impeded due to movement of the fuel bricks or adjacent interstitial bricks. The mechanism by which the cracks occur is well understood. The intense radiation within the core catalyses a chemical reaction between the graphite and CO2 coolant which results in a reduction in the mass and volume of the bricks. This in turn results in a strain differential, which leads to a stress concentration and cracking at the root of the keyway. The growth of these cracks results in the splitting of the brick, first into a "C" shape (singly cracked) and then into two halves (doubly cracked; see Fig. 4). At any given time, it is difficult to ascertain which bricks are exhibiting cracks, due to material and irradiation variability as well as the inaccessible nature of the core internals. Forecasting the probable locations at which cracked bricks will occur is an area of ongoing research (McNally et al., 2020).
The UK Office for Nuclear Regulation (ONR) requires that the AGR operator (EDF Energy) demonstrate that the ability to control the reactor (i.e. insert control rods) will not be threatened under given circumstances, e.g. an infrequent earthquake or other serious event.

Traditional engineering safety assessments
The traditional approach used to ensure the safe condition of the AGR graphite core involves the production and analysis of computationally expensive deterministic models, as well as expensive experimental tests.
Examples include the computational model Parmec (Koziara, 2019) and the physical Multi-Layer Array (MLA) (Dihoru et al., 2014) at the University of Bristol. Both of these represent a simplified version of the true AGR graphite core, scaling the geometry to one-quarter of the true volume and reducing the number of channels and levels. Both can be configured to simulate a once-in-10,000-year earthquake and its effects.
As mentioned at the end of subsection 2.1, it is difficult to ascertain where the cracks are (or where they will occur). In lieu of actual crack positions, the traditional approach, including that with Parmec, is to generate a random distribution of cracks, represented by a 3-dimensional tensor which reflects the structure of the AGR core, with the position of each element corresponding to the spatial position of a brick.
Each example case exhibits a unique configuration of cracks generated using a procedural process (Van Der Linden et al., 2013) from a random number seed (Avaroglu et al., 2014). A visual representation of one such example case is shown in Fig. 5. Example cases can be generated quickly and with low computational cost using industry standard tools (usually less than 1 min per instance). These cases can be used as input configurations to engineering models such as Parmec. In contrast, generating the outputs of these models is computationally or monetarily expensive, requiring hours or even days to process or configure each instance.
These models are used to forecast the response of the reactor to an earthquake given a particular configuration of cracked fuel bricks. Different configurations may affect the response of adjacent interstitial bricks (recall that these are used to allow the insertion of control rods). During an earthquake, angular or translational movement (Fig. 6) of interstitial bricks may act to effectively reduce the clearance between the fuel or control rod and the surrounding bore wall. Fig. 7 illustrates an example involving a control rod channel: the relative displacement of the control rod insertion point can be calculated as a function of the movement of the bricks. The path of the control rod must not be obstructed, and so, as a simple criterion, the relative displacement must be less than the difference between the bore radius and the control rod radius.
A single iteration of the aforementioned models, summarised in Fig. 8, would give very little information on its own. With multiple iterations of a model, each with a different randomised state of the input configuration, a stochastic understanding of the problem space can be built. For example, with several hundred iterations, a histogram can be plotted, as shown in Fig. 9. A statistical distribution, such as the Normal distribution, is then fitted to the frequency of results. Using this distribution, it can be determined what percentage of results lie above an acceptable threshold (for instance, the bore radius minus the control rod radius; see Fig. 7). Using these statistics, determinations can be made of the probability of certain serious events occurring, which feed into safety decision making.
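As an illustration of this statistical step, the sketch below fits a Normal distribution to a set of peak displacement results and computes the probability of exceeding a threshold. The data and the threshold value are placeholders, not values from this work.

```python
import numpy as np
from scipy import stats

# Placeholder results standing in for peak displacements (mm) from several
# hundred randomised crack-configuration runs of the engineering model.
rng = np.random.default_rng(0)
displacements = rng.normal(loc=5.0, scale=2.0, size=500)

# Fit a Normal distribution to the results, as described above.
mu, sigma = stats.norm.fit(displacements)

# Probability that a result exceeds an acceptance threshold, e.g. the
# bore radius minus the control rod radius (value here is hypothetical).
threshold = 10.0
p_exceed = stats.norm.sf(threshold, loc=mu, scale=sigma)
print(f"P(displacement > {threshold} mm) = {p_exceed:.4f}")
```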

Surrogate machine learning models
For any given natural phenomenon, it is possible to develop a physical model that describes it mathematically and can approximate its behaviour. Such phenomena range from a simple decaying wave to highly complex systems such as turbulent fluid movement. Such models are unlikely to ever be a perfect representation of the natural phenomenon, as there may be too many variables and factors to ever account for them all. As a result, there will inevitably be some disparity, though the model is likely to be accurate enough to provide data of practical use.
From data generated by such mathematical models, it is possible to train machine learning models to produce equivalent outputs from the same inputs (Cozad et al., 2014; Liang et al., 2018). This is effectively an additional layer of approximation on top of an already approximate model.
What is the motivation for doing this? Given a mathematical model of a phenomenon, why develop and use a surrogate model using machine learning which provides inferior results? In a real world case, such as nuclear reactor core safety, such a model may be highly complex, involving thousands of parameters and requiring significant computational expense. A machine learning model, on the other hand, once trained, is computationally cheap to use, being just a series of matrix operations. The production of such a machine learning surrogate can also be seen as an exploration of the data space, i.e. it is likely that, through the process of model training and refinement, insights into relationships between variables will be discovered.

Several recent research works develop surrogate models using machine learning, including (Liang et al., 2018), which uses neural networks and principal component analysis (PCA) (Dunteman, 1989) to predict the behaviour of a finite element model for the medical sector. In that work, a 3D finite element model is encoded as a vector to serve as input to a neural network so as to predict stress distributions under loading. Other research works, including (Nyshadham et al., 2019) and (Asteris et al., 2021), produce machine learning surrogates for the engineering sectors of materials and concrete properties, respectively. Both works utilise neural networks, with the latter using a hybrid approach combining several techniques into an ensemble method.

Neural networks
Differing neural network architectures with varying hyper-parameters (Gurney, 1997) can be employed in the design of a machine learning surrogate model. The architecture of a neural network consists of nodes at which a mathematical operation is performed. The nodal output is passed through an activation function which transforms it in a non-linear manner. Commonly employed activation functions include the rectified linear unit (ReLU) (Hara et al., 2015) and the softplus function (Zheng et al., 2015).
The design of a neural network usually features multiple cascading layers, with the output of each layer feeding through as input to the subsequent layer. For each instance vector passed into the neural network, a prediction is calculated and output by the final layer. Repeating this process for an entire dataset yields a matrix known as the model predictions (Ŷ). Performance is measured by comparing the predictions of the model with the original ground-truth values. This calculation, known as a loss function, is used to optimise model parameters through the back-propagation algorithm (Hecht-Nielsen, 1992) over a number of training cycles (epochs). Various loss functions are available, with common examples being the mean squared error (MSE), mean absolute error (MAE) and the Huber loss (Huber, 1964). At the end of training, another loss value can be calculated using the test set (N_t) to evaluate the overall general performance of a given model and compare different approaches.
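For reference, the Huber loss used later in this work is defined on a residual a = y − ŷ with a threshold parameter δ as

$$
L_\delta(a) =
\begin{cases}
\tfrac{1}{2}a^2, & |a| \le \delta, \\
\delta\left(|a| - \tfrac{1}{2}\delta\right), & \text{otherwise,}
\end{cases}
$$

behaving quadratically (like MSE) for small residuals and linearly (like MAE) for large ones, which reduces sensitivity to outliers.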
The back-propagation algorithm utilises a learning rate, a user-defined value that determines how quickly the model weights are adjusted. A learning rate value must be chosen so as not to be too high or too low: either extreme will prevent an optimal solution from being attained. The learning rate may be adjusted during training. A common problem encountered during the training of neural networks is overfitting (Hawkins, 2004). This phenomenon occurs when a model fits so closely to a training dataset that it does not perform well on new data it has not encountered during training, i.e. the model does not generalise well. A significantly lower training loss relative to the test loss indicates overfitting, whereas similar values for the two losses indicate good general performance of the model. To avoid overfitting, several regularisation techniques are employed. These include adding a term to the loss function that penalises large weights, and the randomised dropout of nodes (Srivastava et al., 2014).
Several configurations of neural networks are available. A common choice is the fully connected or dense neural network (DNN), in which the output of every node in a layer feeds into every node in the subsequent layer, leading to a 'dense' architecture. This configuration was employed in the aforementioned research papers. An alternative configuration is the convolutional neural network (CNN). CNNs attempt to capture local signals in data by use of a sliding window which passes over the input space. The application of CNNs may be effective in our work as there are similarities between the data format for AGR graphite core analysis and image recognition, with the presence and identification of local arrangements important in both areas. CNNs have been utilised in several areas of image analysis, including cell detection in medicine (Xie et al., 2015) and depth estimation in photographs (Li et al., 2015). Within the engineering field, they have been employed to predict the remaining useful life of components (Babu et al., 2016).

Research contribution and motivation
Previous research works have applied machine learning to AGR graphite core seismic analysis. In (Dihoru et al., 2018), a method is developed which uses data from a physical model of the AGR reactor to train a neural network. The data is generated using a quarter-scale recreation of the AGR graphite core, including intact and cracked fuel bricks, interstitial bricks and the core restraint. This model is placed on a seismic shaker rig as described in (Dihoru et al., 2017) in order to simulate earthquakes, during which a series of sensors record displacement at various locations in real time. By arranging the cracked and uncracked fuel brick components within the physical model, different cracking configurations can be employed. Five such core configurations were produced and subjected to a simulated severe earthquake, each with the acceleration profile shown in Fig. 10.
The cracking configurations were generated using the procedural process mentioned in subsection 2.2. A simulation involving each configuration was repeated multiple times to account for noise and data measurement inaccuracy.
A stated research objective of that work was to produce a trained model capable of predicting displacements for a select number of bricks in the upper layer from selected earthquake acceleration inputs. This objective was fulfilled to a large degree, as good test accuracy was demonstrated. Hence, a model was produced which acts as a surrogate for a physical AGR graphite core seismic model, at least for select inputs and outputs. A secondary research objective was to use such a neural network to inform experimental setup and configuration, although this does not appear to have been achieved.
Despite achieving good model accuracy, this research has some limitations. Physical models are expensive and time consuming to set up, and so data for only five configurations was generated and included in the dataset used in the aforementioned research. With nearly two thousand fuel bricks within the physical model, each of which can be configured as cracked or uncracked, the five configurations used in that study represent only a tiny proportion of all possible configurations. Consequently, relationships between channel displacements and crack configurations could not be established using this dataset.
In contrast to (Dihoru et al., 2017), the research for this study uses data from a computational model. Such models perform numerical analysis to calculate displacements from inputs such as crack configuration. The computational model used in this research work, Parmec, does not require the same expensive set up and execution process as physical models, yet produces comparable data (displacement history throughout the earthquake). As a result, data for a larger number of configurations can be generated, and it is not subject to the noise and inaccuracy issues that physical models are. Hence, a larger dataset can be produced, allowing the creation of a machine learning model that predicts graphite core seismic response as a function of crack configuration, i.e. brick displacements are output from the model based on the location of cracks.
As stated in subsection 2.2, the generation of safety analysis data in this field is resource expensive, even when considering computational rather than physical analysis. We considered in that subsection the enormous size of the problem space, including all of the permutations of possible inputs and the large amount of output data generated per instance. Given the combination of these two factors, exploring even a small portion of the problem space is implausible, even with considerable computational resources. In subsection 2.3 we observed that, by contrast, a machine learning model, once trained, can generate data at very low computational cost and in considerably less time. Therefore, a surrogate machine learning model in this field would have advantages beyond simple cost savings. Assuming a fixed computational resource and a more efficient ML surrogate model, a larger proportion of the problem space can be explored.
Further, through the process of developing the model, it may be possible to gather insights into the underlying nature of the engineering problem. For instance, we may be able to identify input configurations that produce high model accuracy, and we may identify cases which are difficult for the model to predict. Both may yield benefits beyond ML model development, such as informing future computational or physical model experimental set up. In summary, this research will employ computational, rather than physical, models. The reduced set up and execution time will allow the production of a wider range of dataset examples, as well as the elimination of noise and inaccuracy. The production of a larger dataset may allow us to model the relationship between a wider range of inputs and outputs, including crack configurations. We note that the production of such a model would yield computational savings which could be invested into a wider search of the problem space. Further, the process of model development may yield insights into the underlying nature of the data and the underlying problem space. Using these insights, we may be able to inform future experimentation and analysis, both physical and computational.

Data generation and preparation
As mentioned in subsection 2.3, the production of a surrogate model requires the training of a machine learning architecture on data from the base model. This dataset must include both the model inputs and the outputs which we seek to predict. The following section details the production of a dataset for this purpose and its manipulation for the purposes of the research problem at hand.

Data generation using engineering model
Approximately 8300 whole core calculations were performed using the Parmec software package.
The input to the Parmec engineering model is a tensor representing 1988 AGR fuel bricks and their cracking status: 1 or −1 for cracked or uncracked, respectively. The cracking status for the full set of fuel bricks in a Parmec example can be formed into a vector 1988 elements long (henceforth referred to by the notation BF). Each traditional engineering calculation (Fig. 8) was produced using a randomly generated crack configuration input, each of which produces a displacement history throughout the earthquake for every interstitial brick within the core. All cases have the same percentage of cracks within the core: 40% was selected.
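The sketch below illustrates how such a seeded, reproducible input vector might be generated; the actual procedural process used alongside Parmec is more involved and is not reproduced here.

```python
import numpy as np

BF = 1988               # number of fuel bricks in the model
CRACK_FRACTION = 0.40   # all cases in this work use 40% cracked bricks

def crack_configuration(seed: int) -> np.ndarray:
    """Return a reproducible crack vector: 1 = cracked, -1 = uncracked."""
    rng = np.random.default_rng(seed)
    config = -np.ones(BF, dtype=int)
    cracked = rng.choice(BF, size=int(CRACK_FRACTION * BF), replace=False)
    config[cracked] = 1
    return config

bf = crack_configuration(seed=42)
print((bf == 1).mean())  # crack fraction, ~0.40
```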
Recall from subsection 2.2 that the Parmec model is used to simulate an earthquake (Fig. 10). This includes a settle down period before and after the earthquake proper. For 271 regular time intervals (TI) throughout the earthquake simulation, the model outputs a data-file. This file contains translation and rotation data for all bricks within the reactor core. There are 4173 interstitial bricks (henceforth referred to by the notation BI), each of which has 6 output metrics (OM), comprising displacement along and rotation about all three Cartesian axes. This results in nearly 6.8 million outputs per Parmec case (4173 bricks × 6 metrics × 271 time intervals ≈ 6.8 million). The full list of inputs and outputs to the Parmec model is summarised in Table 1.

Dataset generation framework
A software library was developed which streamlines the production of Parmec cases and provides an interface to efficiently access and prepare the data into datasets. This library consists of packages written in Python (Deitel, 2002) and the Linux Bash command language. It makes full use of the object oriented programming (OOP) paradigm, structuring instances and datasets into classes which can be interfaced with easily by the user.
The Parmec case generation tool allows batch generation of instances, each traceable to a pseudo-random number seed (Blum et al., 1982). This allows the creation of a large and reproducible dataset. Should the dataset be lost, corrupted or need to be temporarily deleted, it can be exactly reproduced using the original seeds.
In their base state, the output files generated by Parmec are difficult to interpret or use in a practical manner. For each of the Parmec cases mentioned previously, the framework allows the generation of a data object in the OOP paradigm (Meyer, 1997). For this purpose a class was written in Python, an object of which is instantiated from the input and output files of a single Parmec case. The Parmec class has interface methods which allow simple access and parsing of the information about that case. For example, should the user want the number of cracks in a particular level of the core, or the displacement of the brick at the central most position in the core, this can be achieved in a single method call.
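The sketch below illustrates the kind of interface described; the class and method names are hypothetical and do not reflect the library's actual (unpublished) API.

```python
import numpy as np

class ParmecCase:
    """Illustrative wrapper for a single Parmec case.

    Attribute and method names here are hypothetical sketches of the kind
    of interface described in the text, not the library's actual API.
    """

    def __init__(self, cracks: np.ndarray, outputs: np.ndarray):
        self.cracks = cracks      # length-1988 vector of +/-1 (the BF input)
        self.outputs = outputs    # shape (271, 4173, 6) = (TI, BI, OM)

    def cracks_in_level(self, level: int) -> int:
        """Count the cracked fuel bricks in one of the 7 core levels (1-7)."""
        return int((self.cracks.reshape(7, -1)[level - 1] == 1).sum())

    def displacement(self, brick: int, metric: int, time_index: int) -> float:
        """A single output metric for one interstitial brick at one interval."""
        return float(self.outputs[time_index, brick, metric])

# Usage with toy arrays of the correct shapes:
case = ParmecCase(np.random.choice([-1, 1], size=1988),
                  np.zeros((271, 4173, 6)))
print(case.cracks_in_level(7))
```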
An additional layer of abstraction is provided by further classes which aggregate all of the Parmec case objects in a batch of cases. This then allows the generation of datasets for use in visual analysis or machine learning.
This framework provides an interface which allows machine learning tasks and experiments to be set up with ease. The framework allows the quick selection of inputs and outputs, including feature selection (Gurney, 1997). The parameters of the machine learning model can be selected or configured. It also automates the production of performance graphs and tables to allow the comparison of various approaches. This Python package was built using the existing machine learning frameworks Keras and Tensorflow (Géron, 2019), as well as the NumPy (Harris et al., 2020) and Matplotlib (Bisong, 2019) Python libraries.

Dataset selection and preparation
With 8300 Parmec examples, approximately 75% of the dataset was randomly partitioned to become the training set (N) with around 6300 instances, with the remaining 25% (2000 samples) forming the test set (N_t).
With 6300 training samples and 1988 input parameters (cracked or uncracked AGR bricks), this gives a matrix of dimensions N by BF (6300 × 1988); hence our feature matrix is defined as X ∈ {−1, 1}^(N×BF). This matrix serves as the base input features to the machine learning models described in the following section.
As explained in subsection 3.2, each Parmec case contains approximately 6.8 million output parameters. For the purpose of this work, a process of narrowing down this large data space into a usable format was carried out. As this is a preliminary investigation into the application of surrogate models to this area of research, the data space was simplified so as to provide a feasible demonstration.
Of the six output metrics, a single displacement metric was chosen (translation along the horizontal axis in the direction of the applied earthquake acceleration) at a single time frame (time frame 48, approximately 3.8 s into the simulated earthquake). Referring back to Fig. 10, it can be seen that the earthquake is well underway at this point. The data from later time-steps was found to have a greater level of homogeneity and less variation between instances.
The output space was further narrowed to include data for just a single brick at a time, i.e. a single label value for each instance. This selection of outputs allows the research problem to be expressed as single-label regression (as opposed to multi-label regression, where we would attempt to predict multiple outputs per instance).
Taking the dataset as a whole, the model labels can be represented by a vector of length 8300, split into 6300 training samples (Y) with the remainder forming the test set (Y_t). In its raw format, this is a vector of real numbers representing brick displacements in millimetres (mm). Values range between approximately ±10 mm, corresponding to the extremes in the negative and positive horizontal direction, respectively. Prior to use in machine learning model training and evaluation, the label set was normalised to the interval 0 to 1, so as to improve suitability and performance (Liao and Carneiro, 2016).
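A minimal sketch of this normalisation step is shown below; whether the scaling statistics were computed from the training set alone is not stated in the text, so the sketch assumes the standard leakage-free choice of using the training set's extremes.

```python
import numpy as np

def min_max_normalise(y_train: np.ndarray, y_test: np.ndarray):
    """Scale labels to [0, 1] using the training set's extremes only,
    so that no information leaks from the test set into training."""
    lo, hi = y_train.min(), y_train.max()
    return (y_train - lo) / (hi - lo), (y_test - lo) / (hi - lo)

# Toy labels standing in for displacements in mm, roughly in [-10, 10].
rng = np.random.default_rng(1)
y = rng.normal(0.0, 3.0, size=8300)
y_train_n, y_test_n = min_max_normalise(y[:6300], y[6300:])
```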
The brick chosen was the one located at the uppermost level at the centre of the core. The position of this brick is shown in the lower part of Fig. 11. This brick is at a particularly important position from an engineering perspective. On average, the closer a channel is to the centre of the core, the higher its translational displacement (Fig. 12) relative to the surrounding structure. The upper level may also be considered of greater importance than those lower down, as it is the initial point of entry for a control rod.
The distribution of horizontal displacement at time frame 48 for all 8300 instances is shown as a histogram in the upper part of Fig. 11. It can be seen that the labels approximate a Gaussian distribution, but with the modal value to the left of the mean and median. The data also has a long tail on the right hand side, i.e. there are a number of outliers at the upper end of the distribution.

Traditional methods
We began with the use of traditional machine learning models, otherwise known as shallow methods. The methods used were: linear regression, Huber regression, support vector machines (Smola and Schölkopf, 2004) and decision tree regression (Navada et al., 2011). Each model type was optimised using the training set (N), followed by evaluation against the test set (N_t).

Neural networks
The remainder of this section discusses the development of experiments involving neural networks. Two types of neural network architecture were employed: dense neural networks (also known as fully connected networks) and convolutional neural networks.
Each experimental configuration of parameters was evaluated by training a model using 10-fold cross-validation (Refaeilzadeh et al., 2009). To account for the stochastic nature of model training caused by the random initialisation of model weights, the training process was repeated 32 times for each cross-validation fold. At the start of training, a learning rate of 5.00E-04 was employed, with this being programmatically halved each time a performance plateau was detected (no validation loss improvement for 50 epochs). Early stopping (Yao et al., 2007) was applied when the learning rate dropped below 1.00E-05, with training terminated and the parameters retained from the point at which the lowest validation loss was achieved. Each optimised model was then evaluated against the test set of 2000 samples. A graphical representation of this process is seen in Fig. 13.
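This schedule can be approximated with stock Keras callbacks, as sketched below. The exact rule of terminating once the learning rate falls below 1.00E-05 would require a small custom callback, so the early-stopping patience here is an assumption, as are the toy model and data.

```python
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real feature matrix and labels.
x_train, y_train = np.random.rand(100, 1988), np.random.rand(100)
x_val, y_val = np.random.rand(20, 1988), np.random.rand(20)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1988,)),
    tf.keras.layers.Dense(64, activation="tanh"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer=tf.keras.optimizers.Nadam(learning_rate=5e-4),
              loss=tf.keras.losses.Huber())

# Halve the learning rate after 50 epochs without validation improvement
# (floored at 1e-5), and stop shortly after progress ceases, restoring
# the best weights seen during training.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                                         patience=50, min_lr=1e-5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=100,
                                     restore_best_weights=True),
]
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=10_000, callbacks=callbacks, verbose=0)
```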
The next subsection discusses the general process of hyper-parameter optimisation during model design. The subsection after that discusses differing data input encoding shapes and the model architectures they require. Finally in this section, we discuss the impact of feature selection, i.e. reducing the input feature space provided to the model.

General hyper-parameters
As mentioned in subsection 2.4, several hyper-parameters must be chosen for a neural network. Optimal hyper-parameter values were chosen through an experimental process.
The first parameter chosen was the optimiser. After evaluating several optimiser algorithms, including RMSprop, Adagrad and SGD, the Adam optimiser with Nesterov momentum (Dozat, 2016) was chosen as it provided the best model performance.
Three additional model hyper-parameters were also refined: nodal architecture, activation functions and regularisation. Starting with the architecture, this was developed through a trial and error method. Beginning with a simple structure of a single hidden layer with a width of four nodes, this was then expanded and adjusted experimentally, using the testing loss for each configuration as a measure of its performance. Guidance was taken from previous works as to which architectures may be most effective, including (Fernandez et al., 2017), which describes four distinct architectures that were used as a starting point for the experiments performed here.
In tandem with the nodal architecture, the activation functions were chosen experimentally. At first, the same activation function was applied to each layer (bar the output layer). Three commonly employed activation functions were tested: sigmoid, the rectified linear unit (ReLU) (Hara et al., 2015) and the hyperbolic tangent function (Kalman and Kwasny, 1992). Also tested was the less common softplus function (Zheng et al., 2015), which has been used in other surrogate model works, including (Liang et al., 2018). In addition to a model design that uses the same activation function in all layers, alternative configurations were tested. Initially, a method employing non-linear activations on every other layer was employed, with the alternating layers simply outputting the linear combination of inputs and weights. Inspiration for this approach was taken from (Ahn and Lee, 2020), where non-linear activations were only placed on every 4th layer of the neural network architecture. This was further modified to an architecture employing two different non-linear activation functions in the same model, with the greatest success observed when alternating softmax and tanh. The final model, demonstrating the optimal architecture, is shown in Fig. 14.
The regularisation parameter was also optimised experimentally. The most effective technique was found to be dropout (Srivastava et al., 2014), in which the outputs of randomly selected nodes are negated. It was determined experimentally that a dropout parameter of 0.4 was optimal, i.e. the outputs of 40% of the nodes in each layer were randomly dropped.
Table 2
Summary of the Input Encoding Experiment. For each encoding format, the total size of the input is the same; only the shape changes. Additionally, the output label in each case is the same shape. Note also that the 1D encoding requires a model containing only dense layers, whereas 2D and 3D encoding warrants a mix of convolutional and dense layers.

The experimental models were trained using several loss functions: mean squared error (MSE), mean absolute error (MAE) and the Huber loss (Huber, 1964). By inter-comparing each loss function, it was found that Huber was the optimal loss function for training the model.
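The sketch below shows a model in the spirit of these choices: alternating non-linear activations as described above, a dropout rate of 0.4, the Nadam optimiser and the Huber loss. The layer widths are illustrative only; the actual optimised architecture is given in Fig. 14.

```python
import tensorflow as tf

def build_surrogate(n_features: int = 1988) -> tf.keras.Model:
    """Dense surrogate sketch: alternating activations and 0.4 dropout."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = inputs
    for i, width in enumerate([512, 256, 128, 64, 32]):   # widths illustrative
        activation = "softmax" if i % 2 == 0 else "tanh"  # alternate as described
        x = tf.keras.layers.Dense(width, activation=activation)(x)
        x = tf.keras.layers.Dropout(0.4)(x)  # 40% of node outputs dropped
    outputs = tf.keras.layers.Dense(1)(x)    # linear output for regression
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Nadam(),
                  loss=tf.keras.losses.Huber())
    return model
```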

Neural network input encoding
Three alternative configurations of neural network were employed, each requiring a differing data encoding format, as outlined in the following subsections. One of these configurations employs only dense neural network architecture, with the other two employing convolutional neural network architecture. An experiment was designed to compare each of these encoding formats in turn (Table 2).
In each case, the number of input features is the same: 1988 (the number of fuel bricks). Similarly, there is a single output (the displacement of the central interstitial brick in the top level). In each case, a similar model architecture is used, with notable exceptions such as the use of only dense layers or the inclusion of convolutional layers (see Table 2).

DNN -1D
A dense neural network with 1D encoding is employed. In this configuration, the data is encoded in the base format mentioned in subsection 3.3, where each instance is input to the model as a vector. In the DNN, the output of each layer is fed into all nodes of the subsequent layer.

Fig. 15. Convolution on a 2-dimensional Input Space Using a 3×3 Filter. The input space is a top-down cross-sectional slice of a 3-dimensional tensor. This represents the distribution of cracks in a single level of the AGR reactor for a single example. The corners and edges have been padded with zeros to maintain a regular shape. The convolutional operation has been performed at select locations. Lower-right: the patch from the input space is the perfect inverse of the filter, resulting in the lowest possible output value (−9). Left: the input patch is a perfect match of the filter, resulting in the maximum output (9). Top-right: a middling example where the input patch has only a slight resemblance to the filter. Outside of the image in the top right-hand corner, the mathematical operation between a data patch and filter is shown: both are flattened into a vector, followed by a dot product calculation.

CNN -2D
A convolutional neural network architecture with 2D encoding is employed. In this format a 2-dimensional filter is applied at select patches on a 2-dimensional input space (Fig. 15). To facilitate this, the 1988-element input feature vector is reshaped into a matrix. The dimensions of this matrix are 7 by 284: the bricks per fuel channel (BFC) by the number of fuel channels (NFC), respectively. Hence, in this encoding format, an individual instance has a feature matrix of the form x_i ∈ {−1, 1}^(BFC×NFC). This format is analogous to CNNs used for image recognition which use a grey-scale bitmap as input (Bui et al., 2016).
The architecture produced during the DNN optimisation was kept the same. However, the first 5 layers now became convolutional layers, with the latter two remaining fully connected layers. This introduced an additional parameter to be optimised: the size of the convolutional window (or kernel). A window size of 3 by 3 was experimentally found to produce the best result, as in the sketch below.
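A minimal sketch of this encoding and its first convolutional layer follows; the filter count is illustrative, and the crack vector is a random stand-in.

```python
import numpy as np
import tensorflow as tf

BFC, NFC = 7, 284  # bricks per fuel channel x number of fuel channels

# Toy crack vector reshaped from its flat 1988-element form to 2D.
bf = np.random.choice([-1, 1], size=BFC * NFC).astype("float32")
x2d = bf.reshape(1, BFC, NFC, 1)  # (batch, rows, cols, channels)

# First convolutional layer with the experimentally chosen 3x3 kernel.
conv = tf.keras.layers.Conv2D(filters=16, kernel_size=(3, 3), padding="same")
feature_maps = conv(x2d)
print(feature_maps.shape)  # (1, 7, 284, 16)
```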

CNN -3D
A convolutional neural network architecture with 3D encoding is employed. This arrangement reflects the true positional relationships within the actual core. The bricks are stacked in 7 levels (BFC). At its widest width and breadth, the core is 18 channels across. The tensor has been padded at the edges and corners with zeros to make it a regular shape and to keep the input shape the same size in subsequent layers (Dwarampudi and Reddy, 2019). Consequently, the core width (CW) and core breadth (CB) both have a value of 22 due to a double layer of zeros on each side. Therefore the dimensions of this input tensor are bricks per fuel channel (BFC) by core width (CW) by core breadth (CB): 7 by 22 by 22, i.e. the input features for each instance are of the form x_i ∈ {−1, 0, 1}^(BFC×CW×CB). This input shape can again be used by a CNN with a 2-dimensional architecture. However, with a 3-dimensional encoding, each core level is used as a separate input channel. In this format, the filter has multiple channels, one for each of the channels in the input space (see Fig. 16). This is analogous to CNNs used for image recognition, where colour data (usually red, green and blue) are represented by different input channels (Albawi et al., 2017).
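A sketch of the padding step is shown below; for simplicity the toy per-level grids are square and ignore the circular footprint of the real core, which would also leave zeros at the corners.

```python
import numpy as np

BFC, CORE_WIDTH = 7, 18  # levels, and widest extent of the core in channels

# Toy per-level crack grids: +/-1 inside the core, 0 outside it.
levels = np.random.choice([-1, 1], size=(BFC, CORE_WIDTH, CORE_WIDTH))

# Pad each level with a double layer of zeros on every side:
# 18 + 2 + 2 = 22, giving an input tensor of shape (7, 22, 22).
x3d = np.pad(levels, pad_width=((0, 0), (2, 2), (2, 2)),
             mode="constant", constant_values=0)
print(x3d.shape)  # (7, 22, 22)
```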

Feature selection
In Section 3, it was mentioned that the input to Parmec is a vector 1988 elements long, each element representing the cracking status of one brick in the model. So far in this section, we have discussed the various dimensional encoding formats available, including a 3-dimensional encoding which represents the true physical arrangement of the core. We have the choice to provide this information to the model in its entirety, or to reduce the scope of these inputs in some way. Through a process of feature selection (Gurney, 1997), we can actively choose only the features most relevant to the predicted output by monitoring the model performance as the scope is adjusted. Beyond model performance optimisation, the process of feature selection allows insights into the data, including relationships between inputs and outputs, what information is relevant and so on. As mentioned in subsection 3.2, the framework developed for this research work allows the selection of features programmatically.
The proposed method for feature selection is to split the core in different ways and then perform experiments using only the resulting features. Two differing strategies are discussed in the following sections.

Regional selection
The core features are split into concentric sections as shown in Fig. 17. Recall that the value we aim to predict is for the central channel (Fig. 11). Hence each concentric region is progressively further removed from the position of interest. It can be hypothesised that the data for regions closer to the central position are most important to model performance, with importance falling as we move further away. Testing this hypothesis is the purpose of this experiment.
Starting with the greater core region (i.e. the whole feature set), a model is trained and evaluated. The feature set is then reduced by incrementally removing channel regions, with a new model trained on the reduced set each time. By repeating this process until only the inner channel region remains, and evaluating the trained model at each stage, we can choose the optimal features for inclusion. A sketch of this ablation loop is given below.
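In the sketch below, the region masks and the train_and_evaluate helper are hypothetical stand-ins for the framework's actual machinery; the real channel-to-region assignments follow Fig. 17.

```python
import numpy as np

def regional_ablation(X, Y, region_masks, order, train_and_evaluate):
    """Train on progressively smaller concentric feature sets.

    region_masks: dict mapping region name -> boolean mask over the 284
    fuel channels (hypothetical). order: region names from outermost to
    innermost. train_and_evaluate: stand-in for the cross-validated
    training loop described in subsection 4.2.
    """
    losses = {}
    for i in range(len(order)):
        kept = order[i:]  # drop the outermost remaining region each pass
        mask = np.any([region_masks[r] for r in kept], axis=0)
        # X has shape (N, BFC, NFC); keep only channels in retained regions.
        losses[" + ".join(kept)] = train_and_evaluate(X[:, :, mask], Y)
    return losses
```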

Level selection
In addition to feature selection based on the core regions, the feature set was also varied in size by varying the number of core levels included (refer back to Fig. 1). Recall from subsection 3.3 that the brick for which we are attempting to predict displacement values is in the top layer. Similar to the experiment in the previous subsection, it can be conjectured that the lower levels of the core are less important than those near the top (closer to the point of interest). This again was determined experimentally. Starting with the lowest level (level 1), data for incrementally higher core levels was removed from the feature set, training and evaluating the model each time.

Traditional methods
As mentioned in subsection 4.1, the experimental processes for this work began with the use of traditional machine learning methods. The results are summarised in Table 3. As can be seen, the linear regression method produces the best test loss and therefore the best performing model. This is followed by the Huber and support vector regression methods. A decision tree regression model can fit the training data almost perfectly (a training loss of zero). However, the decision tree regression model also performs worst on the testing set, suggesting overfitting.

Input encoding
Table 4 summarises the results of the experiment outlined in subsection 4.4, with the best result achieved with each encoding format shown.
From the results, it can be seen that a CNN employing 3D data encoding produces the best performance. In turn, it can also be seen that a CNN employing 2D encoding produces a lower overall loss value than a DNN employing 1D encoding. See subsection 5.4 for further discussion of these results.

Feature selection
Recall from subsection 4.5 that two strategies were proposed for experimental feature selection: regional and level selection.
Table 5 summarises the results for the regional feature selection experiment (subsection 4.5.1). Each row represents a model trained using the data for channels inclusively within the region indicated. For example, Main Core (green in Fig. 17) includes data for itself and the regions internal to it (central and inner) but not the region beyond it (greater core).
Table 6 summarises the results for the level feature selection experiment (subsection 4.5.2). Each row represents a model trained using the data for channels inclusively within the levels indicated.
From Table 5, it can be seen that including features from the whole core produces the best result. From Table 6, it can be seen that the optimal result is achieved when including levels 5-7. See subsection 5.4 for further discussion of these results.

Analysis and discussion
Looking at the final model architecture (Fig. 14), it can be seen that considerable complexity must be employed in terms of model parameters to produce the optimal result. This suggests that the relationships between inputs and outputs are complex. This is to be expected, as the Parmec model is itself highly complex, involving thousands of equations and parameters.
From Table 4, it can be seen that CNN architectures produce superior results to DNNs. Encoding the input features in a 3-dimensional format produces the best results. In addition, neural network models perform better than traditional methods, as seen in Table 3. This suggests that true physical relationships within the data, i.e. local 3-dimensional configurations of cracks, have a causal relationship with displacements.
During the feature extraction experiment, it is noteworthy that the best result was achieved when including features representing all core regions, including the most distant, outer regions far removed from the location of the central channel. It was conjectured earlier that only the channels most local to the position of interest (the centre) would be of importance to predicting displacement. However, from Table 5 it can be seen that model performance drops sharply when excluding data for the main core channels. This suggests that cracking beyond the local region has a causal effect on the central channel's displacement. This phenomenon is particularly interesting when considered along with the fact that the optimal convolutional window size is 3×3 (a relatively small window size by CNN standards), which suggests small patterns are most important.

Fig. 18. Visual Summary of the Regional Feature Extraction Experiment. Top: the concentric core regions as described in Fig. 17, with two inclusive regions delineated which represent the features included in the models corresponding to the mid and bottom images. Mid: the distribution of predictions made by a model trained on features representing the central two regions (red and yellow in the image above). It can be seen that the distribution of predictions in the upper end of the histogram is similar to that of the ground truth. However, this model makes few predictions in the lower region of the distribution, effectively overestimating these results. Bottom: when including all regions in the feature set, superior model performance is seen. The shape of the distribution for the model predictions closely fits that of the ground truth.
Looking more closely at the results of this experiment, Fig. 18 shows the distribution of predictions made by models trained on the channels within the central core and the full core. It can be seen that a model that excludes the main and greater core channels has difficulty making predictions in the lower part of the prediction spectrum. Perhaps the cracking status of bricks beyond the central core has some slackening or tightening effect on the centre of the core. The exact reason for this phenomenon will require further study.
Examining the results of the experiment which excludes features by core level (Table 6), it can be seen that the best overall result is achieved when including features for the top three levels (5-7). Note also that a similar performance is achieved with all arrangements except when only the top two levels were included (6-7), hence sensitivity is low. From these results, we can conclude that at least the top three levels have a discernible impact on the displacement of the central core channel.
Fig. 19 shows the test predictions for the best performing model (3D CNN trained on features for levels 5-7) plotted against the ground truth. It can be seen that 94.8% of the test predictions fall within a margin of 20% of the ground truth in absolute terms. Of the cases that fall outside this margin, most occur fairly close to the boundary. Using the same process as was used to generate this plot, we can identify the types of case for which it is difficult for the model to predict accurate values, and generate cases accordingly. From a cursory examination of the cases outside of the 20% margin in Fig. 19, it was observed that they contained on average 5% fewer cracks in the top core level than the wider dataset.
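A minimal sketch of the within-margin metric used here, assuming labels and predictions normalised to [0, 1]:

```python
import numpy as np

def within_margin(y_true: np.ndarray, y_pred: np.ndarray,
                  margin: float = 0.20) -> float:
    """Fraction of predictions within an absolute margin of the ground
    truth. With labels normalised to [0, 1], a margin of 0.20 corresponds
    to the 20-percentage-point band shown in Fig. 19."""
    return float(np.mean(np.abs(y_pred - y_true) <= margin))
```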
To further evaluate the performance of the model, we can separate the plot shown in Fig. 19 into positive and negative values based on chosen thresholds. Fig. 20 shows the predictions and ground truth separated based on three chosen thresholds. Taking the mean as the positive/negative separator (top), it can be seen that the model performs well at predicting which region a given instance is in. However, the model performs less well at predicting values for cases at the very extremes of the data continuum (thresholds of 0.2 and 0.6). Nevertheless, the 'false' examples are predicted close to the boundary in all three graph configurations. In general, it can be seen that cases at the lower and upper ends of the data range are predicted as closer to the mean than the ground truth. Looking at the distribution of the dataset (bottom), a possible explanation is that it is highly 'biased' towards the region close to the mean, i.e. the model has many fewer cases to learn from at the ends of the distribution.
Following the training process, all models discussed in this section generate output values in less than one second per instance. Using the same hardware, the equivalent data generation time would be over 2 h using the original Parmec software.
To summarise the observations made:

1. Model Complexity: the optimal surrogate model architecture exhibits considerable depth and width, as well as requiring considerable refinement and tuning. This suggests a complex and non-trivial relationship between inputs and outputs.

2. Input Encoding: optimal performance was seen when including true physical relationships in the input features, i.e. a 3-dimensional encoding which represents the actual structure of the graphite core.

3. Feature Selection (Regional): when selecting input features regionally by dividing the graphite core into radial segments, it was observed that optimal model performance was achieved when including data for all regions. This suggests that the cracking status of even those bricks far away from the position of interest is important to the prediction of displacements.

4. Feature Selection (Level): when selecting input features vertically from inclusive ranges of core levels, it was observed that optimal model performance was achieved when including data for the top three levels. Further, little performance sensitivity was observed when reducing the levels included in the feature dataset.

5. Difficult to Predict Cases: cases were identified that are difficult for the model to predict accurately. Through examining these cases, some common characteristics were observed. These observations may help to design further machine learning experiments or inform engineering analysis.

6. Dataset Bias: the model has lower prediction accuracy near the extremes of the data range. This can be explained by the fact that the dataset is biased towards the central region, meaning the model has fewer examples from the extremes to learn from.

Conclusion
A surrogate machine learning model was developed which aims to predict seismic graphite core displacements from crack configurations for the advanced gas-cooled reactor.The model was trained on a dataset generated by a software package which simulates the behaviour of the graphite core during a severe earthquake.
The main motivation behind this research was to increase the computational efficiency of seismic displacement data generation in order to allow a wider search of the vast problem space. Following training, the models generate output values in less than one second per instance. Using the same hardware, the equivalent data generation time would be over 2 h using the original Parmec software.
The best performing model was capable of making 95% of test set predictions within a 20 percentage point margin of the ground truth. Although this result demonstrates progress towards the development of an efficient and accurate machine learning surrogate, the high standards within the nuclear safety field mean that it is unlikely to be at a stage where it could be commercially deployed. Nevertheless, machine learning techniques were identified as highly applicable to this particular problem. For example, convolutional neural networks were identified as more effective than the other techniques available. Additionally, insights into optimal hyper-parameters and inputs were garnered. These observations may guide and support further development and progress in this area.
Going forward, further machine learning surrogate models for this problem may seek to expand the input and output space. The inputs could be expanded to include a range of earthquake accelerations and fuel brick crack fractions other than 40%. The outputs could be expanded to include a wider range of bricks and time points during the earthquake.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Huw Jones reports financial support was provided by EDF Energy Nuclear Generation Ltd. Huw Jones reports a relationship with Jacobs UK Limited London that includes: employment.

Fig. 3 .
Fig. 3. A Top Down View of the AGR Reactor Core.

Fig. 4 .
Fig. 4. Doubly Cracked AGR Brick (Neighbour, 2007).

Fig. 5 .
Fig. 5. Data-point Representation of Cracking Throughout a Single Core Instance. Each data-point represents a fuel brick. Yellow and black data points represent uncracked and cracked bricks, respectively.

Fig. 6 .
Fig. 6. Two Types of Brick Response to Seismic Displacement. Left: an entire stack of bricks rotating around a central point; individual bricks in a stack may rotate relative to one another. Right: translational displacement, where bricks move along a Cartesian axis. Neither response is exclusive, and bricks can move in any combination of rotational and translational displacement modes.

Fig. 7 .
Fig. 7. Relative Displacement of Control Rod Position. This is a top-down view of Fig. 6. If the bricks have moved due to seismic displacement, then the relative position of control rod insertion is changed.

Fig. 9 .
Fig. 9. Statistical Safety Analysis. By running multiple random crack core instances through the engineering model and extracting the displacement data (as shown in Figs. 6 and 7), we can build a distribution as shown. By using statistical analysis techniques, we can determine what percentage of cases are likely to fall outside a certain margin. From such analysis, it may be possible to make probabilistic judgements which may feed into safety decision making or further analyses.

Fig. 10 .
Fig. 10. History of Acceleration During a Severe Earthquake. Note the settling period before and after the earthquake proper. The true earthquake history includes an acceleration value every 100th of a second. This plot shows a sample every 10th of a second, which is the interval at which the Parmec model outputs displacement data. Also shown is a vertical marker, indicating the selected time frame.

Fig. 11 .
Fig. 11. The position in the core that the labels represent (bottom); the distribution of the 8300 values in the label set (top).

Fig. 12 .
Fig. 12. Channel Summation for the Entire Dataset at Time Frame 48. For each channel, the Parmec outputs in the west-east direction were summed for every instance in the dataset. The contrast from blue to red represents low to high values, respectively. Each channel data point effectively represents the average value relative to the others in the core. Considering the inputs were generated randomly, it is noteworthy that the outputs display such regularity and symmetry.

Fig. 13 .
Fig. 13. Example of Model Training and Validation with Early Stopping and Model Saving. As can be seen, the best model performance is produced at epoch 244, where a validation loss of 0.97 is obtained. After training for another 50 epochs, training is terminated, as in this case the minimum learning rate has already been reached. The optimal model weights are then retained to evaluate against the test set.

Fig. 14 .
Fig. 14. The Final Architecture of the Model.

Fig. 16 .
Fig. 16. Convolution on a Single Instance Encoded in a 3-dimensional Format Using an Example Multi-Channel Filter. Top: a single instance encoded with each core level represented as a separate channel. Left: a single filter from the first layer of a CNN; the planar dimensions are smaller than the input space (3 × 3), but the number of channels is equal to the number in the input space (7). The filter performs the mathematical operation shown in Fig. 15 on each channel. The sum of all channel operations forms a single value in the output feature map for this filter. Right: the feature map for the filter. If the aforementioned mathematical operation is performed across the entire input space, we get the feature map shown here. If this process is performed for multiple filters, several feature maps are produced, each of which represents a channel input for the next convolutional layer. Alternatively, the feature map can be flattened so as to form the input for a dense layer.

Fig. 17 .
Fig. 17. The Core Separated into Regions for the Purpose of Feature Selection. The method of this experiment is discussed in subsection 4.5.1, with the results given in Table 5.

Fig. 19 .
Fig. 19. Model Predictions against Ground Truth for the Test Set. A 'perfect' model would produce values on the solid black line, i.e. exactly the ground truth values. Lines representing error margins of 10 and 20 absolute percentage points of deviation are shown. Data-points are coloured according to their deviation from the ground truth. The testing mean squared error for this model was 9e-3, with a mean absolute error of 7.4e-3. The blue line represents a linear fit with the equation ŷ = 0.58x + 0.16. The R² value is 0.53.

Table 1
Summary of the Inputs and Outputs of the Parmec Engineering Model for AGR Graphite Core Seismic Analysis.

Table 3
Summary of the Test Results from the Traditional Methods Experiment (subsection 4.1). Values are in mean squared error (MSE).

Table 4
Summary of the Test Results from the Input Encoding Experiment (subsection 4.4). Values are in mean squared error (MSE). Each part of the experiment was repeated 32 times for each cross-validation fold (320 total), with starting weights initialised randomly each time. The weights were stored at the optimal point during training (lowest validation loss). Each saved model was then evaluated against the test set of 2000 samples, with the lowest and mean values reported in this table.

Table 5
Summary of the Test Results from the Regional Feature Selection Experiment (subsection 4.5.1). Values are in mean squared error (MSE). Each part of the experiment was repeated 32 times for each cross-validation fold (320 total), with starting weights initialised randomly each time. The weights were stored at the optimal point during training (lowest validation loss). Each saved model was then evaluated against the test set of 2000 samples, with the lowest and mean values reported in this table.

Table 6
Summary of the Test Results from the Level Feature Selection Experiment (subsection 4.5.2). Values are in mean squared error (MSE). Each part of the experiment was repeated 32 times for each cross-validation fold (320 total), with starting weights initialised randomly each time. The weights were stored at the optimal point during training (lowest validation loss). Each saved model was then evaluated against the test set of 2000 samples, with the lowest and mean values reported in this table.