CosmoFlow

A representation learning model for dark matter studies

Sidharth Kannan, Tian Qiu, Carolina Cuesta-Lazaro, Haewon Jeong

CosmoFlow is a representation learning model for dark matter simulation data. It employs flow matching, a state-of-the-art generative modeling approach, to learn low-dimensional representations of high-dimensional dark matter simulations. Our representations are 32x smaller than the original simulation data, but can still be used to produce high-fidelity reconstructions of the full data. They also encode important scientific information, such as the cosmological parameters. Finally, we show that the latent space of our model can be designed so that different segments of the representation correspond to features at different cosmological scales, which makes the representation more interpretable.

ABOUT THE MODEL

Cosmology has a big data problem.


State-of-the-art early universe simulations like AbacusSummit can produce petabyte-scale datasets, making analysis and interpretation of the data extremely expensive and time-consuming. Representation learning provides a path forward: it enables the construction of compact, low-dimensional representations of the data that still encode the important scientific information.

CosmoFlow uses flow-matching to learn compact representations.

CosmoFlow employs an encoder-decoder architecture. A ResNet-based encoder produces a low-dimensional representation of the data, which we call the compressed field. The compressed field conditions the UNet-based decoder, which uses it to reconstruct the original data. Models trained with flow matching, like CosmoFlow, operate similarly to diffusion models: they start from a sample of Gaussian noise and iteratively denoise it until a clean sample is produced.

This image shows an example of the generation process. We start from pure noise, and then the model iteratively refines the sample until we get a clean image out.
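The generation loop described above can be sketched in a few lines. This is a minimal illustration and not CosmoFlow's actual sampler: it assumes a simple fixed-step Euler integrator, and the names `euler_sample` and `toy_velocity` are hypothetical. The toy velocity field stands in for the trained decoder and uses the target itself as the condition, so that the example runs without a trained model.

```python
import numpy as np

def euler_sample(velocity_fn, condition, shape, num_steps=50, seed=0):
    """Integrate dx/dt = v(x, t, condition) from t=0 (noise) to t=1 (sample)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # One refinement step: move the sample along the predicted velocity.
        x = x + dt * velocity_fn(x, t, condition)
    return x

def toy_velocity(x, t, condition):
    # Hypothetical stand-in for a trained, conditioned decoder network.
    # For a straight-line path from noise to a single target, the velocity
    # that points at the target is (target - x) / (1 - t).
    return (condition - x) / (1.0 - t)

target = np.linspace(-1.0, 1.0, 16).reshape(4, 4)
sample = euler_sample(toy_velocity, target, target.shape)
```

With a trained model, `velocity_fn` would instead be the decoder's prediction given the current sample, the time step, and the compressed field.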

CosmoFlow’s representations are designed to be interpretable.

As illustrated in the image above, flow matching models reconstruct a sample in multiple steps. First, the large-scale features are constructed, and then the finer details are filled in. We take advantage of this behavior to structure the compressed field. During both training and inference, segments of the compressed field are masked out depending on the flow matching time step. The compressed field starts entirely masked and is progressively unmasked as time advances. The model therefore learns that the channel added at a particular time corresponds most strongly to features at the scale being reconstructed at that time step.
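One way to picture the time-dependent masking is the sketch below. The text does not specify the exact schedule, so this assumes a linear one in which the number of visible channels grows in proportion to the time step, and it assumes that masking means zeroing a channel. The function name `mask_compressed_field` and the `(channels, height, width)` layout are hypothetical.

```python
import numpy as np

def mask_compressed_field(z, t):
    """Zero out the channels of z that are not yet revealed at time t.

    z: array of shape (channels, height, width)
    t: flow matching time in [0, 1]; 0 means fully masked, 1 fully visible.
    Assumes a linear unmasking schedule (an illustrative choice).
    """
    num_channels = z.shape[0]
    t = min(max(float(t), 0.0), 1.0)
    num_visible = int(np.floor(t * num_channels))
    mask = np.zeros(num_channels, dtype=z.dtype)
    mask[:num_visible] = 1.0  # earliest channels are revealed first
    return z * mask[:, None, None]

z = np.ones((4, 2, 2))
fully_masked = mask_compressed_field(z, 0.0)
half_visible = mask_compressed_field(z, 0.5)
fully_visible = mask_compressed_field(z, 1.0)
```

Under this schedule, the channels revealed earliest are only ever seen alongside the coarse early stages of generation, which is the mechanism the paragraph above relies on to tie channels to scales.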

By imbuing the channels of the learned representations with a semantic meaning (they correspond to features at different scales), we can individually manipulate the simulation data to emphasize features at different scales.

By manipulating individual channels in the compressed field, we are able to emphasize features at different scales. For example, by manipulating the ‘low frequency’ portion of the compressed field, the image becomes more smoothed out.

ABOUT THE EVALUATION


What can the model be used for?

To be useful, our compressed fields need to encode enough information to be usable for scientific tasks. The first task we test is reconstruction: can the compressed field be used to reconstruct the original field with high fidelity? Yes! An example is shown below. The reconstructions are highly realistic and capture both the large-scale structure and the fine details present in the original image. The reconstruction is not perfect, and in particular it deviates from the original image in some of the fine details, but it remains very close to the original overall.

This image shows a comparison between the original simulated data (left) and CosmoFlow’s reconstruction, performed with a compressed field 32x smaller.

CosmoFlow’s compressed fields also encode physical information about the simulations. A common benchmark in representation learning for cosmology is to test how much information the compressed representations retain about the cosmological parameters. The two key parameters related to dark matter are (a) the percentage of the universe’s energy content that is matter, and (b) the standard deviation of the matter distribution in the early universe. Loosely, these correspond to asking how much matter there is relative to other forms of energy, and how clumpy that matter is. We train a convolutional neural network (CNN) to predict these parameters from the raw simulation data, and a feed-forward network to predict them from CosmoFlow’s compressed fields. The CNN estimates (a) with 4.96% relative error and (b) with 2.94% relative error, while the feed-forward network trained on the compressed fields achieves comparable relative errors of 5.24% and 4.03%, respectively. This demonstrates that the representations CosmoFlow learns encode meaningful scientific information, even though they are learned in a fully unsupervised fashion.
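For readers unfamiliar with the metric, the sketch below shows one common definition of relative error: the mean of the absolute error divided by the true value. The text does not state the exact formula used in the evaluation, so this definition is an assumption, and the numbers are illustrative values chosen for the example, not results from CosmoFlow.

```python
import numpy as np

def mean_relative_error(predicted_values, true_values):
    """Mean of |predicted - true| / |true| over all samples."""
    predicted_values = np.asarray(predicted_values, dtype=float)
    true_values = np.asarray(true_values, dtype=float)
    return float(np.mean(np.abs(predicted_values - true_values) / np.abs(true_values)))

# Illustrative values only: every prediction is off by exactly 5%.
true_values = np.array([0.30, 0.25, 0.40])
predicted_values = np.array([0.315, 0.2375, 0.42])
error = mean_relative_error(predicted_values, true_values)
```

Here `error` comes out at about 0.05, that is, a 5% relative error.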

Finally, CosmoFlow can be used to generate new synthetic data at different cosmological parameter values. Running simulations like these is very expensive, and every time a sample is needed at a new cosmological parameter value, the entire simulation must be run again. We show that the compressed fields can be used to generate samples at new values of the cosmological parameters, without the need for expensive simulations.

This image demonstrates CosmoFlow’s interpolation capabilities. The leftmost and rightmost images are simulation data that start at the same initial conditions but use different values for the cosmological parameters. This results in the same general features, but different smoothness, size, and frequency of the fine details. The second from the left and second from the right are CosmoFlow’s reconstructions of the simulation data. Finally, the middle image is formed by averaging the compressed fields corresponding to both simulation data samples, and generating with that as a condition. This results in an “intermediate” field, which is comparable to a new simulation at the average values of the cosmological parameters used for the two simulations.
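The averaging step behind the middle image can be sketched as follows. The function name `interpolate_compressed_fields` and the array shapes are hypothetical, and the `weight` parameter is an added generalization: the experiment described above uses a plain average, which corresponds to `weight=0.5`. The random arrays stand in for compressed fields that would come from the encoder.

```python
import numpy as np

def interpolate_compressed_fields(z_a, z_b, weight=0.5):
    """Blend two compressed fields; weight=0.5 gives their plain average."""
    if z_a.shape != z_b.shape:
        raise ValueError("compressed fields must have the same shape")
    return (1.0 - weight) * z_a + weight * z_b

rng = np.random.default_rng(0)
# Stand-ins for the encoder outputs of two simulations (hypothetical shape).
z_a = rng.standard_normal((4, 8, 8))
z_b = rng.standard_normal((4, 8, 8))
z_mid = interpolate_compressed_fields(z_a, z_b)
```

The blended field `z_mid` would then be passed to the decoder as the condition, in the same way as a compressed field produced by the encoder.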

In short, CosmoFlow’s representations are able to compactly represent high dimensional dark matter simulation data, and can be used for a range of scientifically relevant downstream tasks.

Contributions


Sidharth Kannan¹, Tian Qiu¹, Carolina Cuesta-Lazaro², Haewon Jeong¹

¹ University of California, Santa Barbara

² NSF Institute for Artificial Intelligence and Fundamental Interactions

Contact


For any questions, please contact the authors at skannan@ucsb.edu