Machine learning for the classification of serial electron diffraction patterns: synthetic data

Gorelik, T.E.; Gorelik, E.

doi:10.1107/S2053273325005327

short communications

FOUNDATIONS
ADVANCES

ISSN: 2053-2733

Volume 81 | Part 5 | September 2025

https://doi.org/10.1107/S2053273325005327

Open access

Tatiana E. Gorelik ^a ^* and Evgeny Gorelik ^b

^aErnst Ruska-Centre for Microscopy and Spectroscopy with Electrons, Forschungszentrum Jülich, Jülich, 52428, Germany, and ^bTechnische Universität Berlin, Straße des 17. Juni, 135, Berlin, 10623, Germany
^*Correspondence e-mail: [email protected]

Edited by P. M. Dominiak, University of Warsaw, Poland (Received 28 February 2025; accepted 14 June 2025; online 7 July 2025)

Serial electron crystallography faces a fundamental challenge due to the flat Ewald sphere resulting from the short electron wavelength, leading to limited 3D information in individual patterns. Recently, an algorithm for unit-cell determination from zonal electron diffraction patterns (GM algorithm) [Miehe (1997). Ber. Dtsch. Miner. Ges. Beih. z. Eur. J. Miner. 9, 250; Gorelik et al. (2025). Acta Cryst. A81, 124–136] was introduced in the context of serial electron crystallography. This algorithm requires the extraction of 2D zonal patterns from the complete serial dataset. Here, we present a machine learning approach for pattern sorting and apply it initially to simulated electron diffraction patterns.

Keywords: serial electron crystallography; machine learning.

Crystal structure determination becomes challenging when the crystal size is smaller than a micrometre. To address this challenge, serial crystallography was developed (Chapman et al., 2011), a technique in which thousands or even millions of diffraction snapshots are collected from a large number of individual crystals. Bragg intensity measurements from each diffraction pattern are then combined into a single merged dataset, which is used for structure determination.

Serial crystallography was originally developed for X-ray free-electron lasers (XFELs), where each X-ray pulse destroys the crystal. It has since been adapted for synchrotron light sources, where it is particularly useful for time-resolved experiments. Today, serial crystallography is primarily used for protein structure determination (Spence, 2017).

Recently, a small-molecule crystal structure was determined using serial XFEL crystallography (Takaba et al., 2023). The small unit-cell parameters of the compound resulted in a lower density of reflections in reciprocal space compared with proteins, leading to fewer diffraction spots per frame. As a result, the unit-cell parameters could not be determined directly from the data and were instead supplied from a complementary 3D electron diffraction (3D ED) (Gemmi et al., 2019) measurement (Takaba et al., 2023).

A serial crystallography experiment produces a massive amount of data, with only a small fraction containing diffraction patterns from crystals that can be used for crystallographic analysis. In this context, the primary task of data processing is to sort the patterns into no-hit (to be discarded) and hit patterns (to be further processed). This categorization task can be performed either by statistical methods – by detecting Bragg peaks in a pattern and discarding those with fewer than a threshold number of peaks – or, increasingly, by machine learning techniques, based on different architectures, aiming to recognize entire patterns as hits or no-hits in a manner similar to human visual perception (Ke et al., 2018; Nawaz et al., 2023; Rahmani et al., 2023; Rahmani et al., 2024). Meanwhile, large labelled experimental and synthetic datasets (Souza et al., 2019) have become available for training, facilitating further developments in this direction.

The feasibility of serial electron crystallography has already been demonstrated in a series of studies (Smeets et al., 2018; Bücker et al., 2020; Plana-Ruiz et al., 2023; Hogan-Lamarre et al., 2024). Serial electron crystallography offers several advantages over synchrotron-based methods. Compared with large synchrotron facilities, transmission electron microscopes are more accessible and cost-effective. The technique also provides greater flexibility in experimental geometry, as data can be collected in either transmission electron microscopy (TEM) or scanning TEM (STEM) mode, enabling targeted crystal selection and significantly reducing the number of no-hit patterns. In practice, this makes the pre-categorization of electron diffraction patterns into hits and no-hits nearly obsolete. Additionally, sample preparation and handling are considerably easier.

However, these benefits come with a particular challenge: the much shorter wavelength of electrons compared with X-rays results in an almost flat Ewald sphere. As a consequence, diffraction patterns appear nearly flat, making unit-cell determination from serial electron data even more difficult. All studies on serial electron crystallography reported so far have used unit-cell parameters obtained from other sources – either the unit cell was already known, or it was determined using a complementary technique such as 3D ED.
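The flatness can be put into numbers with a short sketch that computes how far the Ewald sphere rises above the zero-order reciprocal-lattice plane at a given distance from the origin. The electron wavelength is the value used in this work; the X-ray wavelength of 1.5418 Å (Cu Kα) and the function name `ewald_sphere_lift` are assumptions chosen purely for comparison.

```python
import math

def ewald_sphere_lift(wavelength, g):
    """Distance (in Å^-1) between the Ewald sphere and the zero-order
    reciprocal-lattice plane at an in-plane distance g (Å^-1) from the
    origin, for a beam normal to that plane."""
    radius = 1.0 / wavelength          # Ewald sphere radius, Å^-1
    if g > radius:
        raise ValueError("g lies outside the projection of the Ewald sphere")
    return radius - math.sqrt(radius**2 - g**2)

ELECTRON_WAVELENGTH = 0.0251   # Å, 200 kV electrons (value used in the text)
XRAY_WAVELENGTH = 1.5418       # Å, Cu K-alpha; assumed here for comparison only

for g in (0.25, 0.5):
    lift_e = ewald_sphere_lift(ELECTRON_WAVELENGTH, g)
    lift_x = ewald_sphere_lift(XRAY_WAVELENGTH, g)
    print(f"g = {g:.2f} Å^-1: electrons {lift_e:.4f} Å^-1, X-rays {lift_x:.4f} Å^-1")
```

At g = 0.5 Å⁻¹ the sphere has risen by roughly 0.003 Å⁻¹ for electrons, compared with about 0.24 Å⁻¹ for the assumed X-ray wavelength, which is why a single electron diffraction pattern samples an almost planar section of reciprocal space.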

Electron diffraction patterns recorded in a serial experiment can be classified into two groups: zonal patterns and 3D patterns, which represent sections through multiple Laue zones. Zonal patterns contain only a 2D net of reflections and were extensively used in the early days of electron crystallography, e.g. to determine the symmetry of materials (Steeds & Vincent, 1983). By definition, these patterns do not contain any 3D information. In contrast, randomly oriented 3D sections of reciprocal space do carry 3D information, and these patterns gained prominence with the development of the 3D ED method (Gemmi et al., 2019).

Recently, the GM algorithm – originally developed in the 1990s for unit-cell determination from zonal ED data – was introduced in the context of serial electron crystallography (Miehe, 1997; Gorelik et al., 2025). Interestingly, a similar algorithm was developed by Belletti et al. (2000), suggesting that this topic was actively explored decades ago, well before the advent of serial crystallography.

The GM algorithm has the potential to enable fully ab initio structure determination for serial electron diffraction data without requiring complementary information on unit-cell parameters. It takes a few zonal patterns as input, making the automatic extraction of zonal patterns from a full serial electron crystallography dataset (potentially containing thousands of patterns) a crucial task.

To address this challenge, we explored the use of machine learning classification techniques. The fundamental differences in the geometry of electron and X-ray diffraction patterns imply that existing labelled training data for X-rays cannot be directly applied to electron diffraction. Therefore, we simulated randomly oriented electron diffraction patterns to mimic the outcome of a serial electron diffraction experiment.

Electron diffraction (ED) patterns were simulated using the following parameters:

(i) The unit-cell volume, which determines the reflection density in reciprocal space. A random unit cell was generated with lattice parameters within a specified range. We varied these ranges to produce both nearly isotropic unit cells (with similar unit-cell lengths) and strongly anisotropic unit cells, ensuring coverage of all possible cases. Initially, we focused only on triclinic cells with unit-cell volumes of 500, 700 and 1000 Å³.

(ii) An electron wavelength of 0.0251 Å, corresponding to 200 kV electrons.

(iii) ED data resolution of 1 Å⁻¹.

(iv) An excitation error of 0.01 Å⁻¹.
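The wavelength in (ii) follows from the relativistic de Broglie relation. The sketch below is a plain check of that number rather than part of the simulation code used in this work; the function name is an assumption.

```python
import math

H = 6.62607015e-34          # Planck constant, J s
M0 = 9.1093837015e-31       # electron rest mass, kg
E = 1.602176634e-19         # elementary charge, C
C = 299792458.0             # speed of light, m/s

def electron_wavelength_angstrom(voltage_volts):
    """Relativistic de Broglie wavelength of an electron accelerated
    through the given potential, returned in Å."""
    energy = E * voltage_volts                       # kinetic energy, J
    # Relativistic momentum: p^2 = 2 m0 E (1 + E / (2 m0 c^2))
    momentum = math.sqrt(2.0 * M0 * energy * (1.0 + energy / (2.0 * M0 * C**2)))
    return H / momentum * 1e10                       # m -> Å

wavelength = electron_wavelength_angstrom(200e3)
print(f"{wavelength:.4f} Å")   # prints "0.0251 Å"
```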

For a given set of unit-cell parameters, 1000 patterns were calculated, each representing a different lattice orientation. These orientations were generated using the Fibonacci sphere method. The patterns were produced as black-and-white images, with a slightly enlarged central spot in the middle of the pattern (Fig. 1) representing the primary beam. All reflections had uniform size and intensity.
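The simulation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the MATLAB code used in this work: the Fibonacci points are taken as beam directions, the resolution limit is interpreted as |g| ≤ 1 Å⁻¹, and a reflection counts as excited when its radial distance from the Ewald sphere does not exceed the excitation error. All function names are hypothetical, and rendering the reflections as an image is omitted.

```python
import math
from itertools import product

def fibonacci_sphere(n):
    """Return n nearly uniformly distributed unit vectors (beam directions)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    directions = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n        # evenly spaced heights in (-1, 1)
        r = math.sqrt(1.0 - z * z)           # radius of the circle at height z
        phi = i * golden_angle
        directions.append((r * math.cos(phi), r * math.sin(phi), z))
    return directions

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def direct_basis(a, b, c, alpha, beta, gamma):
    """Cartesian direct-lattice vectors (Å) from cell lengths and angles (degrees)."""
    al, be, ga = (math.radians(x) for x in (alpha, beta, gamma))
    cx = c * math.cos(be)
    cy = c * (math.cos(al) - math.cos(be) * math.cos(ga)) / math.sin(ga)
    cz = math.sqrt(c * c - cx * cx - cy * cy)
    return ((a, 0.0, 0.0), (b * math.cos(ga), b * math.sin(ga), 0.0), (cx, cy, cz))

def reciprocal_basis(direct):
    """Reciprocal vectors a*, b*, c* (Å^-1) of the given direct basis."""
    va, vb, vc = direct
    volume = dot(va, cross(vb, vc))
    return tuple(tuple(x / volume for x in cross(p, q))
                 for p, q in ((vb, vc), (vc, va), (va, vb)))

def excited_reflections(direct, beam, wavelength=0.0251, g_max=1.0, s_max=0.01):
    """List the hkl indices excited for a unit beam direction (assumed criterion)."""
    recip = reciprocal_basis(direct)
    k0 = tuple(x / wavelength for x in beam)             # incident wavevector
    # |h| = |g . a| <= g_max * |a|, and likewise for k and l
    limits = [int(g_max * math.sqrt(dot(v, v))) for v in direct]
    hits = []
    for h, k, l in product(*(range(-m, m + 1) for m in limits)):
        if (h, k, l) == (0, 0, 0):
            continue
        g = tuple(h * recip[0][i] + k * recip[1][i] + l * recip[2][i] for i in range(3))
        if dot(g, g) > g_max ** 2:                       # outside resolution limit
            continue
        k_out = tuple(k0[i] + g[i] for i in range(3))
        s = math.sqrt(dot(k_out, k_out)) - 1.0 / wavelength  # radial distance from sphere
        if abs(s) <= s_max:
            hits.append((h, k, l))
    return hits

# Unit cell taken from the caption of Fig. 1
cell = direct_basis(5.6786, 10.4648, 20.4682, 109.7914, 106.5805, 105.8206)
for beam in fibonacci_sphere(1000)[:3]:
    print(len(excited_reflections(cell, beam)))
```

With the beam along the direct c axis of this cell, the sketch returns only hk0 reflections, i.e. a purely zonal pattern, because the first-order Laue zone is not reached within the resolution limit.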

Figure 1
Typical patterns from the training data: (a) and (b) 3DLaueIntersections, (c) and (d) 2DZone. Unit-cell parameters: a = 5.6786, b = 10.4648, c = 20.4682 Å, α = 109.7914°, β = 106.5805°, γ = 105.8206°.

For training data, we manually labelled the patterns, classifying them into 2DZone and 3DLaueIntersections categories. Fig. 1 shows typical representatives of these classes. In total, 4000 patterns were used for training.

While labelling the patterns into 2DZone and 3DLaueIntersections classes, we observed that the fraction of zonal patterns was very low. For a unit-cell volume of 1000 Å³, an electron wavelength of 0.0251 Å, a data resolution of 1 Å and an excitation error of 0.01 Å⁻¹, only 3% of the patterns fell into the 2DZone category. It is important to note that we applied a very strict criterion for the 2DZone class – the pattern had to contain only reflections forming a visible 2D net, with no additional reflections beyond it.

For training, we used the Faster R-CNN architecture (Ren et al., 2015) with a ResNet-50 backbone (He et al., 2016) pre-trained on the COCO dataset (Lin et al., 2014). Although newer architectures, such as transformer-based models (Dosovitskiy et al., 2021), are available and can outperform our chosen network, the Faster R-CNN model was sufficient for differentiating between two classes in our synthetic binary data. Moreover, it was readily available in the PyTorch model zoo and required minimal modification for adaptation to our experiment.

Training was conducted using the PyTorch framework with GPU acceleration on an RTX 2080, over 10 epochs, selecting the best-performing model based on validation set performance. Images were downscaled to 254 × 254 pixels, and the number of output classes was set to 2 (2DZone and 3DLaueIntersections), replacing the 91 classes of the original COCO dataset. All labelled data were split into training and validation sets in an 80:20 ratio.
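An 80:20 split of labelled data can be sketched as below. The text specifies only the ratio; splitting each class separately (stratification), so that the rare 2DZone patterns appear in both subsets, is an assumption of this sketch, as are the function name and the toy label list with 3% zonal patterns.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into training and validation sets,
    keeping the class proportions the same in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return sorted(train), sorted(val)

# Toy labels (illustrative only): 3% zonal patterns out of 1000
labels = ["2DZone"] * 30 + ["3DLaueIntersections"] * 970
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))   # prints "800 200"
```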

Because the neural network was pre-trained on the COCO dataset and therefore already provides general-purpose image features, fine-tuning it for the task of 2DZone detection required little training time. Running the training pipeline on a dataset of approximately 1000 samples for 7 epochs took around 10 min on an RTX 2080 GPU. Inference on the same device reaches a throughput of approximately 10 samples per second.

One issue encountered in the initial training iterations was a significant class imbalance. Since the number of zonal samples was much smaller than that of 3DLaueIntersections samples, the neural network achieved high overall accuracy by classifying all samples as 3DLaueIntersections, effectively neglecting the misclassification of zonal samples. To address this imbalance, we implemented a class-balanced sampling strategy for training, where the probability of selecting a sample was inversely proportional to the class size. This approach significantly improved prediction performance, ultimately achieving 100% accuracy on our validation dataset.
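The sampling rule can be illustrated with a short standard-library sketch in which each sample receives a weight inversely proportional to the size of its class. The toy label list and the function name are assumptions; in PyTorch, weights of this kind are commonly passed to `torch.utils.data.WeightedRandomSampler`, although the text does not state which mechanism was used here.

```python
import random
from collections import Counter

def balanced_sample_weights(labels):
    """Weight each sample by 1 / (size of its class), so that every
    class has the same total probability of being drawn."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

# Toy labels (illustrative only): 3% zonal patterns out of 1000
labels = ["2DZone"] * 30 + ["3DLaueIntersections"] * 970
weights = balanced_sample_weights(labels)

# Draw a training stream with replacement according to the weights
rng = random.Random(0)
drawn = rng.choices(range(len(labels)), weights=weights, k=10000)
fraction_zonal = sum(labels[i] == "2DZone" for i in drawn) / len(drawn)
print(f"fraction of 2DZone samples drawn: {fraction_zonal:.2f}")
```

Although only 3% of the toy samples are zonal, about half of the drawn samples belong to the 2DZone class, so the network can no longer reach high accuracy by predicting the majority class alone.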

The result of the sorting process was the assignment of each pattern to one of the two classes, along with a confidence level. A confidence of 1 generally indicated a clear case of a 2DZone or 3DLaueIntersections pattern.

Interesting cases arose when the confidence level was lower than 1. In most instances, these patterns were close to a 2D zone but contained a few additional reflections from a different Laue layer, exhibiting a hybrid behaviour (Fig. 2). In principle, such patterns could still be used for the GM algorithm by extracting vectors from the dominant 2D pattern.

Figure 2
Members of the 2DZone class with confidence values less than 1. Monoclinic unit cell with cell parameters a = 3.8155, b = 10.0452, c = 13.7425 Å, β = 107.8717°: (a) confidence 0.9906, (b) 0.9999. Monoclinic unit cell with cell parameters a = 3.7061, b = 12.0249, c = 11.7474 Å, β = 103.1097°: (c) confidence 0.8651, (d) 0.8948. Triclinic unit cell with cell parameters a = 9.7733, b = 11.979, c = 9.6421 Å, α = 91.1645°, β = 113.204°, γ = 102.2199°: (e) confidence 0.7382, (f) 0.9408. Triclinic unit cell with cell parameters a = 6.4633, b = 9.4541, c = 19.4332 Å, α = 104.2276°, β = 118.6144°, γ = 90.5482°: (g) confidence 0.7575, (h) 0.6471.

We trained the model on four triclinic unit cells with a volume of 1000 Å³, then tested its performance on smaller unit cells with volumes of 500 and 700 Å³, achieving very good results. Finally, we generated unit cells with a monoclinic metric (a = 3.8155, b = 10.0452, c = 13.7425 Å, β = 107.8717°; a = 3.7061, b = 12.0249, c = 11.7474 Å, β = 103.1097°) and introduced extinctions along the b axis, corresponding to the effect of a 2₁ screw axis. Despite not being trained on data with extinctions, the model classified these patterns correctly.
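The extinction rule for a 2₁ screw axis along b is that 0k0 reflections are present only for even k. A minimal sketch of such a filter, with a hypothetical function name and an arbitrary list of test indices, is:

```python
def allowed_by_screw_axis_b(hkl):
    """Reflection condition for a 2_1 screw axis along b:
    0k0 reflections are present only when k is even."""
    h, k, l = hkl
    return not (h == 0 and l == 0 and k % 2 != 0)

# Arbitrary indices chosen for illustration
reflections = [(0, 1, 0), (0, 2, 0), (0, -3, 0), (1, 1, 0), (0, 0, 1)]
kept = [hkl for hkl in reflections if allowed_by_screw_axis_b(hkl)]
print(kept)   # prints "[(0, 2, 0), (1, 1, 0), (0, 0, 1)]"
```

Only the odd-order reflections on the 0k0 row are removed; all other reflections, including those with odd k elsewhere in reciprocal space, are unaffected.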

The success of the classification procedure was remarkable. This may be because we used simulated data, without the uncertainties inherent in experimental ED patterns, such as detector noise and gain, and potential data pathologies, including no-hit patterns, multiple crystals in a pattern, crystal mosaicity and errors in peak position determination. Nevertheless, we demonstrated that the proposed architecture effectively sorts patterns into two distinct classes. This approach can be easily expanded to include additional categories, such as no-hit patterns and multiple overlying crystals, which can be handled separately in subsequent serial data processing steps.

In summary, we present a machine learning approach to classify simulated ED patterns into two categories: 2DZone or 3DLaueIntersections. Patterns assigned to the 2DZone class can subsequently be used for unit-cell determination using the GM algorithm (Miehe, 1997; Gorelik et al., 2025). While machine learning techniques have already been applied in X-ray crystallography to categorize patterns into hits and no-hits (Ke et al., 2018; Souza et al., 2019; Nawaz et al., 2023; Rahmani et al., 2023; Rahmani et al., 2024), we take a step further by analysing the internal structure and features of the patterns themselves.

In general, two strategies can be pursued in the future for sorting experimental ED patterns: (i) training models directly on experimental data, or (ii) extracting peak positions from experimental patterns – an inherent step in serial crystallography data processing – and then generating semi-synthetic patterns based on these peaks for classification. The latter approach could help decouple the influence of the detector from the data during training, although in practice it shifts the consideration of detector characteristics to the peak extraction step. Future investigations will determine which approach proves more robust, efficient, and practical to implement.

Acknowledgements

Open access funding enabled and organized by Projekt DEAL.

Data availability

The code for training and running inference on the model is available at: https://github.com/EvgenyGorelik/geometric-diffraction-analysis. Simulated electron diffraction patterns used in this study (training data and sorted patterns) as well as the relevant MATLAB codes for data generation and sorting are available at: https://doi.org/10.5281/zenodo.14925441.

References

Belletti, D., Calestani, G., Gemmi, M. & Migliori, A. (2000). Ultramicroscopy 81, 57–65.
Bücker, R., Hogan-Lamarre, P., Mehrabi, P., Schulz, E. C., Bultema, L. A., Gevorkov, Y., Brehm, W., Yefanov, O., Oberthür, D., Kassier, G. H. & Dwayne Miller, R. J. (2020). Nat. Commun. 11, 996.
Chapman, H. N., Fromme, P., Barty, A., White, T. A., Kirian, R. A., Aquila, A., Hunter, M. S., Schulz, J., DePonte, D. P., Weierstall, U., Doak, R. B., Maia, F. R., Martin, A. V., Schlichting, I., Lomb, L., Coppola, N., Shoeman, R. L., Epp, S. W., Hartmann, R., Rolles, D., Rudenko, A., Foucar, L., Kimmel, N., Weidenspointner, G., Holl, P., Liang, M., Barthelmess, M., Caleman, C., Boutet, S., Bogan, M. J., Krzywinski, J., Bostedt, C., Bajt, S., Gumprecht, L., Rudek, B., Erk, B., Schmidt, C., Hömke, A., Reich, C., Pietschner, D., Strüder, L., Hauser, G., Gorke, H., Ullrich, J., Herrmann, S., Schaller, G., Schopper, F., Soltau, H., Kühnel, K. U., Messerschmidt, M., Bozek, J. D., Hau-Riege, S. P., Frank, M., Hampton, C. Y., Sierra, R. G., Starodub, D., Williams, G. J., Hajdu, J., Timneanu, N., Seibert, M. M., Andreasson, J., Rocker, A., Jönsson, O., Svenda, M., Stern, S., Nass, K., Andritschke, R., Schröter, C. D., Krasniqi, F., Bott, M., Schmidt, K. E., Wang, X., Grotjohann, I., Holton, J. M., Barends, T. R., Neutze, R., Marchesini, S., Fromme, R., Schorb, S., Rupp, D., Adolph, M., Gorkhover, T., Andersson, I., Hirsemann, H., Potdevin, G., Graafsma, H., Nilsson, B. & Spence, J. C. H. (2011). Nature 470, 73–77.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J. & Houlsby, N. (2021). An image is worth 16x16 words: transformers for image recognition at scale. ICLR proceedings, https://openreview.net/forum?id=YicbFdNTTy.
Gemmi, M., Mugnaioli, E., Gorelik, T. E., Kolb, U., Palatinus, L., Boullay, P., Hovmöller, S. & Abrahams, J. P. (2019). ACS Cent. Sci. 5, 1315–1329.
Gorelik, T. E., Miehe, G., Bücker, R. & Yoshida, K. (2025). Acta Cryst. A81, 124–136.
He, K., Zhang, X., Ren, S. & Sun, J. (2016). 2016 IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, pp. 770–778.
Hogan-Lamarre, P., Luo, Y., Bücker, R., Miller, R. J. D. & Zou, X. (2024). IUCrJ 11, 62–72.
Ke, T.-W., Brewster, A. S., Yu, S. X., Ushizima, D., Yang, C. & Sauter, N. K. (2018). J. Synchrotron Rad. 25, 655–670.
Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P. & Zitnick, C. L. (2014). Computer vision – ECCV 2014: 13th European conference, Zurich, Switzerland, September 6–12, Proceedings, Part V, pp. 740–755. Springer International Publishing.
Miehe, G. (1997). Ber. Dtsch. Miner. Ges. Beih. z. Eur. J. Miner. 9, 250.
Nawaz, S., Rahmani, V., Pennicard, D., Setty, S. P. R., Klaudel, B. & Graafsma, H. (2023). J. Appl. Cryst. 56, 1494–1504.
Plana-Ruiz, S., Gómez-Pérez, A., Budayova-Spano, M., Foley, D. L., Portillo-Serra, J., Rauch, E., Grivas, E., Housset, D., Das, P. P., Taheri, M. L., Nicolopoulos, S. & Ling, W. L. (2023). ACS Nano 17, 24802–24813.
Rahmani, V., Nawaz, S., Pennicard, D. & Graafsma, H. (2024). J. Appl. Cryst. 57, 413–430.
Rahmani, V., Nawaz, S., Pennicard, D., Setty, S. P. R. & Graafsma, H. (2023). J. Appl. Cryst. 56, 200–213.
Ren, S., He, K., Girshick, R. & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (NIPS 2015).
Smeets, S., Zou, X. & Wan, W. (2018). J. Appl. Cryst. 51, 1262–1273.
Souza, A., Oliveira, L. B., Hollatz, S., Feldman, M., Olukotun, K., Holton, J. M., Cohen, A. E. & Nardi, L. (2019). DiffraNet: automatic classification of serial crystallography diffraction patterns, ICLR 2019 conference, https://openreview.net/forum?id=BkfxKj09Km.
Spence, J. C. H. (2017). IUCrJ 4, 322–339.
Steeds, J. W. & Vincent, R. (1983). J. Appl. Cryst. 16, 317–324.
Takaba, K., Maki-Yonekura, S., Inoue, I., Tono, K., Hamaguchi, T., Kawakami, K., Naitow, H., Ishikawa, T., Yabashi, M. & Yonekura, K. (2023). Nat. Chem. 15, 491–497.

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.
