Dr. Balázs Bálint, Research Fellow at the Szeged Biological Research Centre, member of the “Lendület” (‘Impetus’) Research Group of Fungus Genomics and Evolution, and HPC expert of the Governmental Agency for IT Development. Number of scientific publications: 61; number of citations: 1791. Key research field: comparative genomics.

How did you get to know supercomputing?

Through the Research Group of Fungus Genomics and Evolution of the Szeged Biological Research Centre, which works in the field of comparative genomics under the leadership of László Nagy. The administration of their cluster, which is an “ante-room” to supercomputers, has become my responsibility. Today, this cluster comprises 11 machines and represents the first line of the bioinformatics work we do. In 2018, with the help of the Joint Genome Institute, we received machine time on a supercomputer ranked high on the TOP500 list, at NERSC (the National Energy Research Scientific Computing Center) in California. This was absolutely necessary because we wanted to prove that our comparative genomics program can also be used with extremely large amounts of data, even with a thousand genomes. Our third leg is KIFÜ’s supercomputer infrastructure, but when we tested it, we couldn’t find a solution because the sequence alignment program we wanted to use failed to run on it. The arrival of Komondor will definitely change that.

What was your first supercomputer work and experience?

For me, there is no sharp boundary between a plain computer and a supercomputer because they share the same basics. With this approach, my starting point was a resource-intensive task emerging from biology. We wanted to assemble the genomic sequence of a bacterium, which was a highly challenging task at the time. This is where I had the chance to work with bioinformatics, programming, and Linux, which have been my daily routine ever since. What is exciting for me is that I basically use the same tool set from the comfort of my home – so it’s like a driving license valid for more than one vehicle category. I used HPC for the first time in 2018, as part of the sequence alignment task related to a thousand genomes.

What do you use supercomputers for?

For evolutionary research. To simplify greatly, the catalogue of a living organism is its genome, which describes the “components” present in that organism. The information content of the hereditary material can also be catalogued at the level of proteins. Therefore, when the base sequence of the hereditary material is read through and the protein-encoding sections within it are identified, all proteins occurring in a given organism can be assigned to it. These catalogues help us better understand the evolutionary history of living organisms and protein families. For this purpose, we used public databases to collect more than a thousand eukaryotes with at least 2000 and at most 5000 proteins per species, and grouped the proteins on the basis of their mutual similarities. Next, a phylogenetic tree can be drawn for each protein family based on the sequences of the proteins classified as belonging to it. A phylogenetic tree of the species themselves is also prepared, showing the connections between them. Together, these data paint an evolutionary story, which tells us about events that occurred a very long time ago and led to the biological diversity we see today.
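As a minimal sketch of the species-selection step mentioned above – under the simplifying assumption that it means keeping species whose protein catalogue falls within the stated size range – the idea could look like this. The directory layout, file extension and function names are hypothetical, not the group’s actual pipeline; only the 2000–5000 bounds come from the description.

```python
# Hypothetical sketch: keep species whose proteome size falls within the bounds
# mentioned in the interview. One FASTA proteome file per species is assumed.
from pathlib import Path

MIN_PROTEINS = 2000   # lower bound from the description above
MAX_PROTEINS = 5000   # upper bound from the description above

def count_proteins(fasta_path: Path) -> int:
    """Count sequences in a FASTA file by counting header lines."""
    with fasta_path.open() as handle:
        return sum(1 for line in handle if line.startswith(">"))

def select_species(proteome_dir: str) -> list[str]:
    """Return the names of species whose proteome size is within the chosen bounds."""
    selected = []
    for fasta in sorted(Path(proteome_dir).glob("*.faa")):
        n = count_proteins(fasta)
        if MIN_PROTEINS <= n <= MAX_PROTEINS:
            selected.append(fasta.stem)
    return selected

if __name__ == "__main__":
    print(select_species("proteomes"))
```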

Can you tell us a concrete example?

There are proteins playing various regulatory roles that are very frequent in animals but are not present in fungi, for example. Interesting questions include why this is so, and when a function was lost or emerged during evolutionary history. If a set of proteins responsible for a specific ontogenetic step in animals is completely missing from fungi, how do fungi solve the same task? It is very interesting to see how an animal cell becomes “specialised” for a specific task: for example, how it becomes an epithelial cell. It undergoes a well-regulated developmental process to achieve this. As it advances along the road to becoming an epithelial cell, one thing is definitively decided already at the outset: it cannot become a muscle cell or a liver cell. DNA methylation ensures that functions which are no longer needed cannot be switched back on later. This regulatory system is completely missing from fungi. Yet they have their own ontogenesis, various cell types appear, and different parts can be distinguished within a fungus. One exciting question is whether we find something else that has taken over this task where such a key regulation is absent. We may hope to find answers by learning more about the evolutionary history of the protein families of fungi. We expect that families of prominent importance during fungal ontogenesis will be more prevalent in fungi than others. We try to assign certain visible characteristics to the underlying genetic, genomic and proteomic differences. Since the data amounts are huge, that’s where supercomputers come into the picture.

How can your results be utilised?

Most of my work is basic research – studying evolutionary problems. In practical terms, one goal may be to achieve higher yields in fungi, to develop fungus varieties that grow an individually shaped sporophore and are therefore easier to market, or to develop sporeless fungi, for example. Fungi may also help if you want to produce alcohol for energy generation at the lowest possible cost and have ample timber as raw material, but the lignin content of the timber is not at all easy to degrade.

Do you have a project or result that would have been unfeasible without a supercomputer?

The entire project was like this. What we are currently working on could in principle be done without a supercomputer, but it is far from being computable on our own cluster. However, new generations of equipment keep arriving with giant strides, and it may happen that tasks for which we use supercomputers today will be solvable on a desktop computer in ten years.

Is there an alternative to supercomputers in your field of science? Can such tasks be solved in other ways?

Not yet. In our case, computing in the cloud is highly cumbersome because we wouldn’t be able to migrate the amount of data that we need to work with. Data migration between us and our partner in the US or KIFÜ’s supercomputers also takes days. Therefore, uploading to the cloud would take weeks or months. What’s more, these cooperation efforts are favourable in terms of finances too – our basic research project could not be operated under market conditions and we couldn’t afford to rent such huge capacities.

Is every argument on HPC’s side?

I think HPC is a technical sport showcasing the cutting-edge technologies of its time. These machines are terribly expensive and become obsolete within not much more than a few years. At the same time, the world of HPC is an elite club that is worth joining; otherwise, we fall behind. It opens up research areas and enables technologies – including machine learning and the ability to work with huge amounts of data – that would otherwise be inconceivable. By today, data analysis has become a key item on the agenda. The Human Genome Project was launched at the end of the 1990s; the original plans envisioned more than ten years of work, astronomical budgets, and conventional automated sequencers. In comparison, today you can order the reading of your own complete genome for the price of a better cell phone, and the results arrive within days. Structuring such enormous amounts of data and extracting the relevant information will be the task of the period to come – and we will rely heavily on supercomputers for this.

Isn’t all information relevant?

In my current work, 160 to 170 of the downloaded genomes show high levels of contamination. The cork tree is a good example – the hereditary material was purified from the leaves of a famous old tree, it was processed, and the reference genome was published. On closer inspection, however, it turned out that two genomes had been mixed together, because the sampled tree was infected by some Ascomycota. The authors did not notice this with the naked eye, so the published cork tree genome is actually a mixed fungus/plant genome. My current job is to find and fix such errors.

How should one think of this in practice?

I evaluate every protein with the computer and try to map them in the protein space – to see which proteins are related. The reference database comprises 280 million proteins, and we align the 18 million proteins of interest to us, corresponding to the 1000 genomes, against this background. Contamination pops out because a suspicious ‘plant’ protein will be more similar to fungal proteins than to plant ones. You can also see that such strangely behaving proteins consistently indicate a rather narrow range of fungi as their closest relatives. I use these rules to write the cleaning program. One benefit of this project is giving feedback to the scientific community, because using the data in their current form could lead to rather bizarre outcomes. It’s the same with bacteria – a great number of samples are contaminated with them, and this is also not something you notice immediately.
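The flagging rule described here can be illustrated with a short sketch: a protein from a plant genome whose best hits in the reference background point almost exclusively to another kingdom is treated as a contamination suspect. The tab-separated input format, file names, threshold and function names below are hypothetical; the real cleaning program and its rules are more involved.

```python
# Hypothetical sketch of the contamination-flagging idea described above.
# Input is assumed to be a pre-filtered best-hit table with three columns:
# protein_id <TAB> hit_kingdom <TAB> bit_score
import csv
from collections import defaultdict

def flag_suspects(hits_tsv: str, expected_kingdom: str, min_fraction: float = 0.9):
    """Return protein IDs whose top hits overwhelmingly point to another kingdom."""
    kingdom_counts = defaultdict(lambda: defaultdict(int))
    with open(hits_tsv) as handle:
        for protein_id, hit_kingdom, _score in csv.reader(handle, delimiter="\t"):
            kingdom_counts[protein_id][hit_kingdom] += 1

    suspects = []
    for protein_id, counts in kingdom_counts.items():
        total = sum(counts.values())
        foreign = total - counts.get(expected_kingdom, 0)
        if total > 0 and foreign / total >= min_fraction:
            suspects.append(protein_id)
    return suspects

# Example (hypothetical file): plant proteins aligning almost only to fungi.
# print(flag_suspects("cork_tree_best_hits.tsv", expected_kingdom="Viridiplantae"))
```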

How often do you encounter contaminated genomes?

Unfortunately, very often. Cleaning is especially difficult when the same bacterium contaminates several related species. A good example is the Drosophila (fruit fly) genus, one of the most popular model animals of modern genetics. When the same contaminant protein is present in more than one fruit fly genome, the wrongly labelled proteins mutually confirm each other. In such cases, it is difficult to notice that we are in fact dealing with contamination. The huge amounts of data are therefore a blessing on the one hand, because you can ask questions that were unanswerable before; on the other hand, you shouldn’t immediately believe everything you see.

Do you also need to write a cleaning program for that?

I don’t know the history of how those 1000 genomes were generated, so before I can start working with them, I need to check every one of them, protein by protein. Three or four of the 15 Drosophila genomes are contaminated with bacteria because no great emphasis was placed on cleaning those out during the genome sequencing project. It is like having two puzzles mixed up while having the box of only one of them. There will be pieces that don’t fit the picture no matter what. If we try to understand the phylogenesis of fruit flies based on contaminated data, we are lost.

Have supercomputers ever surprised you?

Yes. Programs and methods that we routinely use on smaller scales are difficult to force into the frameworks of HPC. HPC offers amazing computing capacity if we manage to feed in the data in a way that lets its great potential manifest. You obviously need parallel data processing, and you must plan very precisely how the data are subdivided and how the program is built. This was one of the biggest surprises – the amount of attention it requires. If I run a general bioinformatics program code without changes, I will not be efficient enough and may run out of the available time allocation despite having many processors and enormous RAM. Another surprise was to see how efficient a tool a supercomputer is. During the analysis of the 1000 genomes, one of the most difficult tasks was Markov clustering: we grouped 18 million proteins based on 70 billion pieces of protein similarity data. Performing this task would be practically unfeasible on our own cluster, yet the same task was completed in barely two hours in the supercomputer environment!
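To give a feel for what Markov clustering does with such similarity data, here is a toy, in-memory sketch of the expansion–inflation idea. The analysis itself used dedicated MCL software at a vastly larger scale; the matrix, parameter values and helper function below are purely illustrative.

```python
# Toy illustration of Markov clustering (MCL) on a small similarity matrix.
# Not the software used in the project, and not meant for 18 million proteins.
import numpy as np

def markov_cluster(similarity: np.ndarray, inflation: float = 2.0,
                   max_iter: int = 100, tol: float = 1e-6) -> list[set[int]]:
    """Cluster nodes of a symmetric, non-negative similarity matrix with MCL."""
    n = similarity.shape[0]
    M = similarity.astype(float) + np.eye(n)          # add self-loops
    M /= M.sum(axis=0, keepdims=True)                 # make columns stochastic

    for _ in range(max_iter):
        expanded = M @ M                              # expansion: random-walk step
        inflated = expanded ** inflation              # inflation: boost strong flows
        inflated /= inflated.sum(axis=0, keepdims=True)
        converged = np.abs(inflated - M).max() < tol
        M = inflated
        if converged:
            break

    # After convergence, "attractor" rows keep non-zero mass; the columns each
    # attractor row covers form one cluster (a protein family in this setting).
    clusters = []
    for row in range(n):
        members = set(np.nonzero(M[row] > tol)[0])
        if members and members not in clusters:
            clusters.append(members)
    return clusters

if __name__ == "__main__":
    # Two obvious groups, {0, 1, 2} and {3, 4}, connected only weakly.
    S = np.array([[0, 5, 4, 0, 0],
                  [5, 0, 6, 0, 0],
                  [4, 6, 0, 1, 0],
                  [0, 0, 1, 0, 7],
                  [0, 0, 0, 7, 0]], dtype=float)
    print(markov_cluster(S))
```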

How can you interpret the results you obtain?

The outcome of Markov clustering is an easily readable text file listing families and the identifiers of the proteins classified into those families. This is now easy to work with. The analysis step itself – the way the program generates a protein relations graph from the similarity data and then identifies protein families by walking through its nodes – is still complete “black magic” to me.
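Reading such an output back into a workable structure is straightforward. The sketch below assumes, hypothetically, one family per line with whitespace-separated protein identifiers; the exact layout of the real file may differ.

```python
# Hypothetical reader for a clustering output with one family per line.
def read_families(path: str) -> dict[str, list[str]]:
    """Map generated family names to the protein identifiers they contain."""
    families = {}
    with open(path) as handle:
        for index, line in enumerate(handle):
            members = line.split()
            if members:
                families[f"family_{index:06d}"] = members
    return families

# Example: inspect family sizes to see which families dominate.
# sizes = {name: len(ids) for name, ids in read_families("families.txt").items()}
```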

To what extent do you think the use of supercomputers is challenging?

Very much. Processing biological data on a computer is the cruellest torture, because sometimes I have a single large file and sometimes I have 200,000 small files. There are parts that you cannot parallelise at all, and parts that are splendidly suited to it. These completely different substeps, with different potential for parallelisation, come one after the other, and concatenating them in a way that avoids wasting computing capacity is a terribly difficult task. Another challenge is that it is astonishingly easy to lose data. We process data in batches, and some batches may get lost without us noticing. If you don’t check yourself at every point, the end result may easily become distorted. Data format conflicts are also very frequent: one source marks the end of a protein with an asterisk, another uses no “stop” marks at all. The asterisk is useful because it confirms that we aren’t dealing with a protein fragment, but there are analysis programs which “see stars” when such asterisks appear and will not even run. The greatest challenge in using supercomputers is to do it efficiently.
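Two of the sanity checks mentioned here – making sure no batch silently went missing, and stripping the trailing asterisk stop mark that some sources append – could look roughly like the sketch below. File names, the batch naming scheme and the counts are hypothetical.

```python
# Hypothetical sketch of the batch-completeness check and the '*' normalisation
# described above; paths and naming conventions are illustrative only.
from pathlib import Path

def check_batches(batch_dir: str, expected: int) -> list[int]:
    """Return the indices of batches that are missing from batch_dir."""
    present = {int(p.stem.split("_")[-1]) for p in Path(batch_dir).glob("batch_*.fasta")}
    return sorted(set(range(expected)) - present)

def strip_stop_marks(in_fasta: str, out_fasta: str) -> None:
    """Copy a FASTA file, removing a trailing '*' from each sequence line."""
    with open(in_fasta) as src, open(out_fasta, "w") as dst:
        for line in src:
            if not line.startswith(">"):
                line = line.rstrip("\n").rstrip("*") + "\n"
            dst.write(line)

# missing = check_batches("batches", expected=200000)
# if missing:
#     raise RuntimeError(f"lost batches: {missing[:10]} ...")
```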

How fast can someone acquire the necessary skills?

It is like a foreign language. You acquire the basic vocabulary very fast, but speaking the language of the supercomputer like a native remains a distant goal.

For whom would you recommend the use of supercomputers?

Supercomputers run Linux, so it is important to have some Linux background. The converse is also true: if you want to work with supercomputers, you will not be able to avoid Linux. Therefore, I would recommend supercomputers to those who are able to learn new things and are motivated to work in a text-based environment. Supercomputers are very far from comfortable graphical solutions. Making them beautiful or comfortable is not among the goals – everything is about efficiency. Still, I would recommend them to anyone with a problem that justifies the use of HPC.

What makes an efficient supercomputer user?

Thorough knowledge of the problem, the opportunities, and the characteristics of the machine. You will also need advice from the experts providing support services, for example, Dr. Attila Fekete (Senior User Support Expert of the HPC Competence Centre – editor’s comment). You need to find your way through the folders on the machine, you need to know the programs that have already been prepared for you, you need to know how to install a program if necessary, and you need to know the structure of the machine. The key is good planning.

Why did you apply to join the experts of the HPC Competence Centre?

There are two reasons why I, too, wanted to add to the number of HPC KK experts by the end of 2022. One of them is the picture that formed in my head during local system administration: a picture of the problems users keep facing and of the bottlenecks. Using a supercomputer is like nurturing a plant. It needs various nutrients, and if any of them is in short supply, the plant will not grow well. In a supercomputer environment, all parameters have to be developed in harmony with each other. The tests, benchmarks and the diagnostic, analytical approach that I acquired on our own cluster may come in handy when operating Komondor and providing user support. The other reason is to learn. This group of experts is an elite club of interesting members with various backgrounds and useful experiences. We can learn a lot from each other as well.

What is the most important task of the Competence Centre?

I would name two inseparable tasks: on the one hand, to offer as many people as possible the chance to learn to use supercomputers, and on the other, to help them learn. We need to train professionals who will be able to use these expensive and valuable resources efficiently. In terms of hardware, Komondor is not behind the leading supercomputers of the world – this is a huge opportunity to seize, but fruitful work requires proper operation.

To what extent do you perceive the development of supercomputers?

Absolutely. New generations of processors with much higher performance are released one after the other, and you can pack much more memory into a computing unit – these are key advances. I worked with two generations at our partner in the US: a machine called ‘Edison’ was being retired at the time we joined the project. Cori was our flagship; it still exists but will soon be replaced by a machine called Perlmutter. There comes a time when it is simply not worth paying the electricity bill of an older machine, because a new one provides higher computing capacity at an operating cost that is negligible compared to that of its predecessor. Hungary shows the same trend: with the arrival of Komondor, a system that has become obsolete will be switched off.

How do you see the future of supercomputers?

It is inseparable from the future of information technology. Development trends affect all levels, including supercomputers. I expect further increases in performance in any case, and I am curious about the future of quantum computing. The big question is when quantum computers will become widespread and what set of new tools they will open up – tools we cannot even hope for today.