
Conversation

@aglavic (Collaborator) commented Jan 27, 2026

@jochenstahn asked me about an option to add extra newlines after certain datasets to facilitate 2D plotting with Gnuplot of blocks of datasets (e.g. a time series stored as separate files that contain spin-up and spin-down data at the same time). This made me think about a possible generic solution for higher dimensional data (off-specular / GISANS).

I've come up with a proposal of how we could handle such a situation:

  • A dataset with more than one dimension per column is stored as a flattened version (C-style indexing, as it is numpy's default).
  • In such a case the original dimensions are stored in the optional "data_shape" attribute of the header.
  • When written to .ort, the last dimension is taken as a "row", and after every row a comment string is written, with an optional spacer.
  • This keeps backward compatibility when reading such files.
  • 2D data can, out of the box, be plotted with Gnuplot.
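The flattening scheme above can be sketched in numpy. Note that the "data_shape" attribute and the per-row comment are part of this proposal (not current orsopy behaviour), and the exact spacer text here is made up:

```python
import numpy as np

# Minimal sketch of the proposed flattened storage: a 2D column
# (e.g. 3 time slices x 4 Qz points) is stored flat, C-style.
qz = np.linspace(0.01, 0.04, 4)        # Qz grid (last dimension)
data = np.arange(12.0).reshape(3, 4)   # 2D column; header would carry data_shape: [3, 4]

flat = data.ravel(order="C")           # C-style flattening, numpy's default

# On .ort export the last dimension forms one "row"; after every row a
# comment/spacer line is written so e.g. Gnuplot can split the blocks:
lines = []
for i, row in enumerate(flat.reshape(data.shape)):
    lines.extend(f"{q:.4f} {v:.4f}" for q, v in zip(qz, row))
    lines.append(f"# end of row {i}")  # hypothetical spacer comment
```

A reader that ignores "data_shape" simply sees one long 1D dataset with comment lines, which is why backward compatibility is preserved.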

In addition to this treatment, I've added some convenience functions to the OrsoDataset class that allow iterating over columns and indexing it for data, in which case the data is automatically reconstructed to its original shape. I haven't tested it yet, but I think the NeXus version should work automatically.
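A minimal sketch of what such a reconstruction could look like (the `reconstruct` helper and its signature are hypothetical, not the actual orsopy API):

```python
import numpy as np

# Hypothetical sketch: a flattened column plus the optional "data_shape"
# header attribute recover the original multidimensional array.
def reconstruct(column, data_shape=None):
    """Return the column in its original shape, if data_shape was stored."""
    col = np.asarray(column)
    if data_shape is None:
        return col                   # plain 1D dataset, backward compatible
    return col.reshape(data_shape)   # C-order, matching the flattening
```

Programs that don't care about the higher dimensionality can keep using the flat column unchanged.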

A second part is an optional spacer that can be added when writing text files. The class just inserts a string between datasets, but could be sub-classed for more custom behavior. Typical applications would be additional newlines or a nice separator comment making the file more readable. It is only relevant for the text file export and is lost on read.
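A sketch of how such a spacer class could look (the `Spacer`/`CommentSpacer` names and the `text()` method are invented for illustration, not the actual implementation):

```python
# Hypothetical sketch of the proposed spacer: a class that yields a string
# inserted between datasets on text export, sub-classable for custom output.
class Spacer:
    def text(self, index):
        return "\n"  # default: one extra blank line between datasets

class CommentSpacer(Spacer):
    """Write a readable separator comment instead of just newlines."""
    def text(self, index):
        return f"\n# ---- end of dataset {index} ----\n\n"
```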

I'll add some examples and extra tests, but wanted to get some opinions first.

Example File: test.ort

@bmaranville (Contributor)

For the second part, I was under the impression we already allowed the user to specify a separator between datasets, which could be multiple newlines if you want. Why do we need anything additional?

:param data_separator: Optional string of newline characters to separate multiple datasets.

@bmaranville (Contributor)

For the first part, I am not quite sure what the motivation is. I thought the main benefit of the text format was a human-readable format that could also easily be ingested by common programs like Excel or Origin or numpy, without modification?

I can see making a format that gnuplot can natively read, but that already seems like a niche use case and if a bespoke reader library is required (in order to read the custom data_shape header item and interpret and apply it), then why are we bothering to write a text format at all? What is the benefit?

@andyfaff (Contributor)

There are several things to unpick here.

time based datasets

Time-sliced datasets should ideally be put into separate datasets. There's no guarantee that each time slice would have the same number of data points or identical Q-values.
Putting them into a single dataset with an extra time column doesn't give a feeling of robustness.
I would say that different polarisation channels fall into this basket as well.

multidimensional datasets

<TLDR: can additional columns after Q/R/dR/dQ be multidimensional, and be stored in different HDF datasets?>

There's a high chance multidimensional datasets such as GISANS are not rectangular: whilst a detector image has a pixel grid, the Qz/Qy values probably aren't linearly spaced. For datasets like that, if you're saving in an ORT file it might be better to add columns, i.e. the regular four plus the extra axes:

Qz, R, dR, dQ [, Qy].

You just have to have as many rows as you do pixels.
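As a sketch of this layout, one row per detector pixel with Qy as an extra coordinate column after the regular four (the grid sizes and the dR/dQ values here are made up):

```python
import numpy as np

# Hypothetical detector grid: 3 Qy values x 4 Qz values.
ny, nz = 3, 4
qy, qz = np.meshgrid(np.linspace(-0.02, 0.02, ny),
                     np.linspace(0.01, 0.04, nz), indexing="ij")
R = np.exp(-1e3 * (qz**2 + qy**2))               # made-up reflectivity map

# Flatten everything to 1D columns: Qz, R, dR, dQ, Qy,
# with as many rows as there are pixels.
table = np.column_stack([qz.ravel(), R.ravel(),
                         0.1 * R.ravel(),         # dR (illustrative)
                         np.full(ny * nz, 1e-4),  # dQ (illustrative)
                         qy.ravel()])
```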

It might be a better path to transition to ORB, rather than ORT, at this point. I believe, @bmaranville correct me if I'm wrong, that each column doesn't have to have one dimension; they can be multidimensional? If not, then I'd like for that to be worked on.

For higher dimensions, binary files are probably better than text-based formats. It's much easier to load/save multidimensional arrays to HDF (npy) than to come up with a list of rules for how they should be stored in text.

Multidimensional arrays are what is needed for sophisticated resolution smearing kernels. Here each Q point has an associated probability distribution, i.e. kernel_q, p(kernel_q), so two arrays that have M points each. The mean of that probability distribution is Qz, and the standard deviation is dQz. I've been writing those to my own HDF file, but I'd like to do that with ORB.

This means that for a dataset with N points each of the columns has shape:

  • Qz, (N,)
  • R, (N,)
  • dR, (N,)
  • dQz, (N,) # Gaussian approximation to the resolution function
  • kernel_qz, (N, M)
  • kernel_pqz, (N, M)

I reckon it's easier to do this in ORB than ORT.
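As a sketch of the binary route (using numpy's npz container as a stand-in here; this is not the ORB format itself, and the array contents are made up):

```python
import io
import numpy as np

# Binary storage keeps the (N,) columns and the (N, M) kernel arrays
# side by side, with no flattening rules needed.
N, M = 50, 21
qz = np.linspace(0.01, 0.3, N)
buf = io.BytesIO()
np.savez(buf,
         Qz=qz, R=np.exp(-40 * qz),
         dR=0.01 * np.exp(-40 * qz), dQz=np.full(N, 1e-3),
         kernel_qz=np.zeros((N, M)),   # support points of each kernel
         kernel_pqz=np.zeros((N, M)))  # probability at those points
buf.seek(0)
loaded = np.load(buf)
```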

@andyfaff (Contributor)

It might also be nice if the ORB file could have every column being multidimensional.

@aglavic (Collaborator, Author) commented Feb 2, 2026

<TLDR: Why not add a convenience capability that does not reduce generality/compatibility?>

Just a bit of my background thoughts on the two points from above.

  1. Why do we need an extra spacer: Yes, we have a spacer between datasets. But if we want to indicate the end of a sequence, a manual extra spacer is needed (e.g. there is a time series in the first 100 datasets for spin-up and another 100 datasets for spin-down). With the extra spacer Gnuplot knows how to split the sequences and plot them on separate grids. It is purely for the text output and does not alter the validity/compatibility of the files.
  2. Why do we need a facility for higher dimensional data in text:
    • I think the .ort should be able to store the same data as .orb, if possible and if the user wants to. For regularly gridded data (e.g. time series, series of measurements with the same angles, off-specular, GISANS) it is more convenient than pure columns with the right coordinates (it often plots better, too; I remember using the unique(columns) trick a lot to find the dimensions of existing text data). It is straightforward to store that in the NeXus format, just in that higher dimension. Yes, it requires extra treatment, either by using orsopy or by reshaping the data, but in Python that is just one extra line, and it would only be needed if the program actually uses the higher dimensionality. In other cases the file reads like a single dataset without issues, so there is no incompatibility problem.
    • @bmaranville I think it makes the files more readable, as it gives clear separations after the end of a stride of one dimension. Compatibility with Excel et al. is not affected.
    • I don't quite get the point that higher dimensional datasets are likely not rectangular. Yes, the Qz/Qy/Qx values are probably not on a regular grid, but that doesn't mean the data structure is not a matrix (i.e. the number of Qz-points per Qy-point is identical). Each of the columns has this higher dimension, so every pixel has its own coordinate; the shape just assigns an order to the pixels. For most off-specular and GISANS datasets I would expect such a structure (e.g. GISANS given by the pixel grid of the detector, off-specular using the same number of lambda-points for each detector pixel, or in monochromatic mode each angle using the same pixels). In the odd remaining cases, you just revert to using one-dimensional columns.
    • @andyfaff your example of a dataset with a higher number of dimensions for only a subset of columns seems to me like a special case that benefits mostly from the NeXus file type. Otherwise you could split those into M columns.
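For illustration, the unique(columns) trick mentioned above, which a stored "data_shape" would make unnecessary (grid sizes here are made up):

```python
import numpy as np

# Regularly gridded data stored as flat coordinate columns: the grid
# dimensions can be guessed back by counting unique coordinate values.
qy_grid = np.linspace(-0.02, 0.02, 3)
qz_grid = np.linspace(0.01, 0.04, 4)
qy_col, qz_col = [a.ravel() for a in
                  np.meshgrid(qy_grid, qz_grid, indexing="ij")]

ny, nz = len(np.unique(qy_col)), len(np.unique(qz_col))
shape = (ny, nz)  # with a "data_shape" header this guessing is not needed
```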

As said before, for me this is an improvement without negative side effects. What we could discuss is whether we want to flatten the columns only on export and keep them in their original shape in the Orso object. This has the advantage that the shape is always clear, but it would break existing program integrations that expect a column to be 1D.
