How Vision Transformer Helps Scientists Identify Galaxies


Recently, Joshua, Song-Mao, Hung-Jin, Olivia, and I published a paper on arXiv. The paper was accepted by the NeurIPS 2021 Machine Learning and the Physical Sciences Workshop, and we will present it at a poster session on 2021/12/13. Feel free to visit our poster and chat with us!

In our paper, we use the Vision Transformer (ViT) to classify galaxies by how they look. ViT is a new tool in the deep learning community, and we are the first group to apply it in astronomy. In this post, I will explain why we care about how galaxies look, what ViT is, and what we learned.

Why do we care about how galaxies look?

How a galaxy looks to the naked eye

A galaxy is one type of deep-sky object. The only galaxy you can see with the naked eye is the Andromeda Galaxy (M31), and observing it this way is not a great experience. First, we need a clear night sky and great eyesight (or a great pair of glasses). Next, we should be able to find the constellations Cassiopeia and Andromeda. Finally, after locating the Andromeda Galaxy, we only see a tiny blurred clump. That's it. This is how a galaxy looks to the naked eye. Thank Galileo for pointing a telescope at the sky 400 years ago. Otherwise, we would only have one classification label for galaxies: the blurred clump.

How galaxies look through telescopes

With powerful telescopes, we can resolve details in a galaxy's appearance. Some examples are in Figure 1. Their names reflect how they look, and their appearance serves as a first criterion for classification. This method sounds superficial and even offensive, since we are always taught not to judge things by their appearance. But this is how scientists classify them. Disclaimer: I don't mean scientists are superficial or offensive. They are great people, or at least most of them are.

What information we learn from their appearance

Why do scientists use a galaxy's appearance to classify it? The reason is that galaxies with similar appearances have similar properties. For example, elliptical galaxies are older, whereas spiral galaxies are younger. This indicates that how a galaxy looks is a useful label: upon seeing a galaxy, we can already have a basic understanding of it, such as its age and metal abundance.

How we distinguish different galaxies

A naive answer is: by the human eye. Indeed, in the early days scientists used their own eyes, aka professional human inspection, to distinguish different galaxies. But as datasets grew to 10,000 or even 100,000 images, this approach became impossible. So how did scientists solve this problem? Find volunteers to help! This is the core idea of the Galaxy Zoo project. They invited interested people, aka citizen astronomers, to do this visual classification. Each galaxy image is inspected by many people, and the label with the most votes wins. The Galaxy Zoo project turned out to be successful. It provides scientists with many labeled images (and free labor, I think) and gives the general public opportunities to participate in scientific projects.

Even though the human-eye method has helped us in most galaxy classification tasks, we have to admit it is inaccurate and inefficient. An accurate automated algorithm would be the best candidate here.

One possible solution can be found in the deep learning community, which has spent much time searching for exactly this: an accurate automated classification algorithm. One famous example is the Convolutional Neural Network (CNN). Hopefully, we will have NPR in the deep learning field someday. In our work, we use a new tool for visual classification, the Vision Transformer (ViT).

What is Vision Transformer (ViT)?

Vision Transformer — A new deep learning tool for visual classification

The Vision Transformer is one variant of the Transformer architecture. Initially, people used Transformers to solve problems in natural language processing (NLP); one typical example is letting a machine do real-time translation between languages. Transformers have been a great success in NLP. Then, Google first successfully applied this architecture to visual classification problems and called the result the Vision Transformer (ViT). This success triggered a wave of ViT research in the deep learning community. All these Transformer-like tools share the same underlying mechanism: attention.
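For the curious, here is a minimal sketch (not our paper's code) of the step that lets a Transformer consume an image: ViT cuts the image into fixed-size patches and turns each patch into a token, much like the words fed to an NLP Transformer. The sizes below follow the original ViT paper and are illustrative only.

```python
# Toy sketch of ViT's patch embedding with PyTorch: split a 224x224 image
# into 16x16 patches and project each patch to a 768-dim token.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                         # one RGB image
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)  # one patch per conv window
tokens = patch_embed(image).flatten(2).transpose(1, 2)      # shape (1, 196, 768)
print(tokens.shape)                                         # 196 patch tokens
```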

Attention Mechanism — The core of Vision Transformer

“May I have your attention please.” This is the most common airport announcement opening. It plays an essential role in helping people catch important information.

Imagine that you are in a crowded airport, such as Los Angeles International Airport (the worst airport, in my opinion). You are stuck in the Starbucks waiting line. Sounds and information surround you. Two undergraduates in front of you are chatting about how they screwed up their finals. The Starbucks clerk is calling the name “Tim” for the third time. The baby behind you starts to cry. You are exhausted from the information bombardment. In this situation, you can easily miss the last call for your flight. That “may I have your attention please” helps you refocus and pay attention to the information that follows.

The attention mechanism in the Vision Transformer does a similar job. It tells ViT which parts of the image are significant for classification. ViT then pays attention to those parts and classifies the image.
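Under the hood, this "attention" is a weighted average: every image patch scores how relevant every other patch is, and the scores become mixing weights. Below is a minimal, hedged sketch of the scaled dot-product attention at the heart of ViT, written in plain PyTorch (an illustration, not our paper's code).

```python
# Minimal scaled dot-product attention, the building block of ViT.
# q, k, v are the query/key/value projections of the patch tokens.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # patch-to-patch relevance
    weights = F.softmax(scores, dim=-1)          # rows sum to 1: "where to look"
    return weights @ v, weights                  # mixed values + attention map

tokens = torch.randn(196, 768)        # toy tokens: 196 patches, 768 dims each
out, attn_map = attention(tokens, tokens, tokens)
print(attn_map.shape)                 # (196, 196): attention between patches
```

The attention map is also what makes the model inspectable: we can look at which patches the model attends to when it classifies a galaxy.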

How ViT learns which parts of an image are significant

You need to feed ViT a lot of images and let it learn by itself. I know this sounds irresponsible, like some advisors in academia. However, given enough data, ViT can do a great job at image classification and even “beat” CNNs, aka the state of the art in the visual recognition field.
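In practice, "feeding ViT a lot of images" is an ordinary supervised training loop. Here is a hedged sketch using the timm library; the number of classes and the data loader are placeholders for illustration, not the exact setup from our paper.

```python
# Hedged sketch: fine-tune a pretrained ViT on labeled galaxy images.
# NUM_CLASSES and train_loader are placeholders, not our paper's values.
import timm
import torch

NUM_CLASSES = 8  # placeholder: substitute your galaxy label count
model = timm.create_model('vit_base_patch16_224', pretrained=True,
                          num_classes=NUM_CLASSES)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

def train_one_epoch(model, train_loader):
    model.train()
    for images, labels in train_loader:   # images: (B, 3, 224, 224)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```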

I feel this is an inspirational story, especially for those graduate students with irresponsible advisors. You are gonna survive and learn something novel.

What have we learned in this project?

We are the first group to use the Vision Transformer for galaxy classification. For comparison, we also use a CNN model (ResNet-50) as a baseline.

The best accuracy of our ViT model is 80.55%, whereas the accuracy of the CNN model is 85.12%. Even though our ViT does not beat the CNN overall, we investigate some interesting cases: galaxies that ViT classifies correctly but the CNN gets wrong.

We find that ViT reaches higher classification accuracy on smaller and fainter galaxies. Note that these galaxies are more challenging to classify, since their images are noisier.
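To make this kind of comparison concrete, here is a hedged sketch of how one can score the two models on the harder subset only; the magnitude cut and the toy arrays are illustrative placeholders, not values from our paper.

```python
# Hedged sketch: compare ViT vs CNN accuracy on faint galaxies only.
# All arrays and the magnitude threshold below are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 8, size=1000)       # toy ground-truth classes
vit_preds = rng.integers(0, 8, size=1000)    # toy ViT predictions
cnn_preds = rng.integers(0, 8, size=1000)    # toy CNN predictions
magnitude = rng.uniform(14, 20, size=1000)   # toy apparent magnitudes

faint = magnitude > 18                       # larger magnitude = fainter galaxy
print("ViT accuracy on faint galaxies:", (vit_preds[faint] == labels[faint]).mean())
print("CNN accuracy on faint galaxies:", (cnn_preds[faint] == labels[faint]).mean())
```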

This difference might come from the attention mechanism in the ViT model. More work is required before we can draw firm conclusions. However, these preliminary results give us confidence in future applications of the Vision Transformer in scientific fields. We can use ViT to revisit cases already examined by CNNs, and maybe more interesting results will pop up, giving us a deeper understanding of nature.