REST-for-Physics  v2.3
Rare Event Searches ToolKit for Physics
Data Structures | Public Member Functions | Protected Member Functions | Private Member Functions | Private Attributes
TRestDataSet Class Reference

Detailed Description

It allows to group a number of runs that satisfy given metadata conditions.

This class allows to make a selection of ROOT data files that fulfill certain metadata conditions allowing to create a group of files that define a particular dataset. The files will be searched in a relative or absolute path that is given together the filePattern parameter.

Basic file selection

We will be able to define the dates range where files will be accepted, using startTime and endTime parameters. The run start time and end time stored inside TRestRun will be evaluated to decide if the file should be considered.

A summary of the basic parameters follows:

Metadata rules

We may add rules for any metadata class existing inside our ROOT datafiles. For such, we use the filter key where we define the metadata member name where we want to evaluate the rules. We need to define the metadata field where we specify the class name or metadata user given name, together with the metadata member we want to access, the metadata member must be named using the coventions defined inside the methods TRestRun::ReplaceMetadataMember and TRestRun::ReplaceMetadataMembers.

Three optional fields can be used to apply the rule:

Example of metadata rule:

<filter metadata="TRestRun::fRunTag" contains="Baby" />

Branches (or observables) selection

Once the files that fulfill the given dates, filename pattern and metadata rules have been identified, the initialization will produce an instance to a ROOT::RDataFrame and an instance to a ROOT::TTree that will give access to the unified analysis tree. The available columns or branches at those instances will be defined by the user inside this metadata class, through the special keywords observables and processObservables.

Their use can be seen in the following example:

<TRestDataSet name="DummySet">
<parameter name="startTime" value = "2022/04/28 00:00" />
<parameter name="endTime" value = "2022/04/28 23:59" />
<parameter name="filePattern" value="test*.root"/>
<filter metadata="TRestRun::fRunTag" contains="Baby" />
// Will add to the final tree only the specific observables
<observables list="g4Ana_totalEdep:hitsAna_energy" />
// Will apply a cut to the observables
<cut name="c1" variable="rawAna_NumberOfGoodSignals" condition=">1" />
<cut name="c1" variable="rawAna_NumberOfGoodSignals" condition="<100" />
// Will add all the observables from the process `rawAna`
<processObservables list="rate:rawAna" />
A class to help on cuts definitions. To be used with TRestAnalysisTree.
Definition: TRestCut.h:31
It allows to group a number of runs that satisfy given metadata conditions.
Definition: TRestDataSet.h:34

Basic usage

The basic usage of this class is by loading the metadata class as any other metadata class. After initialization, the user will get access to the internal RDataFrame and TTree data members, as shown in the following example:

restRoot
[0] TRestDataSet d("dataset");
[1] d.GenerateDataSet();
[2] d.GetTree()->GetEntries()
[3] d.GetDataFrame().GetColumnNames()

We can then use our favorite ROOT::RDataFrame or TTree methods.

Exporting datasets

On top of performing a compilation of runs to construct a dataset and access the data in a unified way, we may also save the generated dataset to disk. This feature might be used to generate easier to handle data compilations that have been extracted from an official data repository.

Different output formats are supported:

Example 1 Generate DataSet from config file:

restRoot
[0] TRestDataSet d("dataset", "dataSetName");
[1] d.GenerateDataSet();
[2] d.Export("mydataset.csv");
[3] d.Export("mydataset.root");

Example 2 Import existing DataSet:

restRoot
[0] TRestDataSet d();
[1] d.Import("myDataSet.root");
[2] d.GetTree()->GetEntries()

Relevant quantities

Sometimes we will be willing that our dataset contains few variables that are extremelly meaningful for the data compilation, and that will be required for further calculations or for the proper interpretation of the data. The key <quantity will allow the user to define relevant quantities that will be stored together with the dataset. These quantitites must be extracted from existing metadata members that are present at the original files. There are different fields allowed inside, such as: name, metadata, strategy and description.

Example:

<quantity name="Nsim" metadata="[TRestProcessRunner::fEventsToProcess]"
strategy="accumulate" description="The total number of simulated events."/>

The name field will be the user given name of the quantity. The metadata field inside the <quantity definition will allow to include a metadata member or a calculation based on a formula where metadata members intervine. The method TRestRun::ReplaceMetadataMembers is the responsible to translate the given metadata formula into a numeric value, check the documentation inside that method to find out the proper format of metadata members inside this field.

There are also different strategies for extracting the quantity value, which are defined by the user using the field strategy, the different options available are:

Adding a new column based on relevant quantities

Using the method TRestDataSet::Define method we can implement a formula based on column names and relevant quantities. Then, the relevant quantities will be sustituted by their dataset value.

dataset.GetColumnNames()
dataset.Define("newColumnName", "QuantityName * column1" )
dataset.GetColumnNames()
dataset.GetDataFrame().Display({"column1", "newColumnName"})->Print();
void Print()
Implementing TObject::Print() method.

Adding a new column on-the-fly during the dataset generation

It is also possible to add new column definitions inside the RML so that the new column will be already pre-generated and included in the new dataset when we invoke TRestDataSet::GenerateDataSet.

We can use a valid mathematical expression where we may include any already existing column or observable, and/or any relevant quantity we introduced in our dataset definition.

We may use the addColumn keyword to add new columns as follows:

<addColumn name="NormalFlux" expression="SolarFlux*GeneratorArea/Nsim" />

where SolarFlux,GeneratorArea and Nsim are the given names of the relevant quantities inside the dataset.


REST-for-Physics - Software for Rare Event Searches Toolkit

History of developments:

2022-November: First implementation of TRestDataSet Javier Galan

Author
: Javier Galan (javie.nosp@m.r.ga.nosp@m.lan.l.nosp@m.acar.nosp@m.ra@ce.nosp@m.rn.c.nosp@m.h)

Definition at line 34 of file TRestDataSet.h.

#include <TRestDataSet.h>

Inheritance diagram for TRestDataSet:
TRestMetadata

Data Structures

struct  RelevantQuantity
 

Public Member Functions

 ClassDefOverride (TRestDataSet, 5)
 
ROOT::RDF::RNode DefineColumn (const std::string &columnName, const std::string &formula)
 This function will add a new column to the RDataFrame using the same scheme as the usual RDF::Define method, but it will on top of that evaluate the values of any relevant quantities used. More...
 
void Export (const std::string &filename)
 It will generate an output file with the dataset compilation. Only the selected branches and the files that fulfill the metadata filter conditions will be included. More...
 
void GenerateDataSet ()
 This function generates the data frame with the filelist and column names (or observables) that have been defined by the user. More...
 
auto GetCut () const
 
ROOT::RDF::RNode GetDataFrame () const
 Gives access to the RDataFrame. More...
 
auto GetEndTime () const
 
auto GetFilePattern () const
 
std::vector< std::string > GetFileSelection ()
 It returns a list of the files that have been finally selected. More...
 
auto GetFileSelection () const
 
auto GetFilterContains () const
 
auto GetFilterEndTime () const
 
auto GetFilterEqualsTo () const
 
auto GetFilterGreaterThan () const
 
auto GetFilterLowerThan () const
 
auto GetFilterMetadata () const
 
auto GetFilterStartTime () const
 
size_t GetNumberOfBranches ()
 Number of variables (or observables) More...
 
size_t GetNumberOfColumns ()
 Number of variables (or observables) More...
 
auto GetObservablesList () const
 
auto GetProcessObservablesList () const
 
auto GetQuantity () const
 
auto GetStartTime () const
 
Double_t GetTotalTimeInSeconds () const
 It returns the accumulated run time in seconds. More...
 
TTree * GetTree () const
 Gives access to the tree. More...
 
void Import (const std::string &fileName)
 This function imports metadata from a root file it import metadata info from the previous dataSet while it opens the analysis tree. More...
 
void Import (std::vector< std::string > fileNames)
 This function initializes the chained tree and the RDataFrame using as input several root files that should contain TRestDataSet metadata information. The values of the first dataset will be considered to be stored in this new instance. More...
 
void Initialize () override
 This function initialize different parameters from the TRestDataSet. More...
 
auto IsMergedDataSet () const
 
ROOT::RDF::RNode MakeCut (const TRestCut *cut)
 This function apply a TRestCut to the dataframe and returns a dataframe with the applied cuts. Note that the cuts are not applied directly to the dataframe on TRestDataSet, to do so you should do fDataSet = MakeCut(fCut);. More...
 
TRestDataSetoperator= (TRestDataSet &dS)
 Operator to copy TRestDataSet metadata. More...
 
void PrintMetadata () override
 Prints on screen the information about the metadata members of TRestDataSet. More...
 
void SetDataFrame (const ROOT::RDF::RNode &dS)
 
void SetFilePattern (const std::string &pattern)
 
void SetObservablesList (const std::vector< std::string > &obsList)
 
void SetQuantity (const std::map< std::string, RelevantQuantity > &quantity)
 
 TRestDataSet ()
 Default constructor. More...
 
 TRestDataSet (const char *cfgFileName, const std::string &name="")
 Constructor loading data from a config file. More...
 
 ~TRestDataSet ()
 Default destructor. More...
 

Protected Member Functions

virtual std::vector< std::string > FileSelection ()
 Function to determine the filenames that satisfy the dataset conditions. More...
 

Private Member Functions

void InitFromConfigFile () override
 Initialization of specific TRestDataSet members through an RML file. More...
 

Private Attributes

std::vector< std::pair< std::string, std::string > > fColumnNameExpressions
 A list of new columns together with its corresponding expressions added to the dataset. More...
 
TRestCutfCut = nullptr
 Parameter cuts over the selected dataset. More...
 
ROOT::RDF::RNode fDataSet = ROOT::RDataFrame(0)
 The resulting RDF::RNode object after initialization. More...
 
Double_t fEndTime = REST_StringHelper::StringToTimeStamp(fFilterStartTime)
 TimeStamp for the end time of the last file. More...
 
std::string fFilePattern = ""
 A glob file pattern that must be satisfied by all files. More...
 
std::vector< std::string > fFileSelection
 A list populated by the FileSelection method using the conditions of the dataset. More...
 
std::vector< std::string > fFilterContains
 If not empty it will check if the metadata member contains the string. More...
 
std::string fFilterEndTime = "3000/12/31"
 All the selected runs will have an ending date before fEndTime. More...
 
std::vector< Double_t > fFilterEqualsTo
 If the corresponding element is not empty it will check if the metadata member is equal. More...
 
std::vector< Double_t > fFilterGreaterThan
 If the corresponding element is not empty it will check if the metadata member is greater. More...
 
std::vector< Double_t > fFilterLowerThan
 If the corresponding element is not empty it will check if the metadata member is lower. More...
 
std::vector< std::string > fFilterMetadata
 A list of metadata members where filters will be applied. More...
 
std::string fFilterStartTime = "2000/01/01"
 All the selected runs will have a starting date after fStartTime. More...
 
std::vector< std::string > fImportedFiles
 The list of dataset files imported. More...
 
Bool_t fMergedDataset = false
 It keeps track if the generated dataset is a pure dataset or a merged one. More...
 
std::vector< std::string > fObservablesList
 It contains a list of the observables that will be added to the final tree or exported file. More...
 
std::vector< std::string > fProcessObservablesList
 It contains a list of the process where all observables should be added. More...
 
std::map< std::string, RelevantQuantityfQuantity
 The properties of a relevant quantity that we want to store together with the dataset. More...
 
Double_t fStartTime = REST_StringHelper::StringToTimeStamp(fFilterEndTime)
 TimeStamp for the start time of the first file. More...
 
Double_t fTotalDuration = 0
 The total integrated run time of selected files. More...
 
TChain * fTree = nullptr
 A pointer to the generated tree. More...
 

Additional Inherited Members

Constructor & Destructor Documentation

◆ TRestDataSet() [1/2]

TRestDataSet::TRestDataSet ( )

Default constructor.

Definition at line 268 of file TRestDataSet.cxx.

◆ TRestDataSet() [2/2]

TRestDataSet::TRestDataSet ( const char *  cfgFileName,
const std::string &  name = "" 
)

Constructor loading data from a config file.

If no configuration path is defined using TRestMetadata::SetConfigFilePath the path to the config file must be specified using full path, absolute or relative.

The default behaviour is that the config file must be specified with full path, absolute or relative.

Parameters
cfgFileNameA const char* giving the path to an RML file.
nameThe name of the specific metadata. It will be used to find the corresponding TRestAxionMagneticField section inside the RML.

Definition at line 284 of file TRestDataSet.cxx.

◆ ~TRestDataSet()

TRestDataSet::~TRestDataSet ( )

Default destructor.

Definition at line 293 of file TRestDataSet.cxx.

Member Function Documentation

◆ DefineColumn()

ROOT::RDF::RNode TRestDataSet::DefineColumn ( const std::string &  columnName,
const std::string &  formula 
)

This function will add a new column to the RDataFrame using the same scheme as the usual RDF::Define method, but it will on top of that evaluate the values of any relevant quantities used.

For example, the following code line would create a new column named test replacing the relevant quantity Nsim and the previously existing column probability.

d.Define("test", "Nsim * probability");

Definition at line 542 of file TRestDataSet.cxx.

◆ Export()

void TRestDataSet::Export ( const std::string &  filename)

It will generate an output file with the dataset compilation. Only the selected branches and the files that fulfill the metadata filter conditions will be included.

For the moment we produce two different output files.

  • A plain-text (file recognized with csv or txt extension): It will produce a table with observable values, including a header of the dataset conditions.
  • A root file (recognized through its root extension): It will write to disk an Snapshot of the current dataset, i.e. in standard TTree format, together with a copy of the TRestDataSet instance that contains the conditions used to generate the dataset.

Definition at line 776 of file TRestDataSet.cxx.

◆ FileSelection()

std::vector< std::string > TRestDataSet::FileSelection ( )
protectedvirtual

Function to determine the filenames that satisfy the dataset conditions.

Definition at line 378 of file TRestDataSet.cxx.

◆ GenerateDataSet()

void TRestDataSet::GenerateDataSet ( )

This function generates the data frame with the filelist and column names (or observables) that have been defined by the user.

Definition at line 305 of file TRestDataSet.cxx.

◆ GetCut()

auto TRestDataSet::GetCut ( ) const
inline

Definition at line 164 of file TRestDataSet.h.

◆ GetDataFrame()

ROOT::RDF::RNode TRestDataSet::GetDataFrame ( ) const
inline

Gives access to the RDataFrame.

Definition at line 121 of file TRestDataSet.h.

◆ GetEndTime()

auto TRestDataSet::GetEndTime ( ) const
inline

Definition at line 153 of file TRestDataSet.h.

◆ GetFilePattern()

auto TRestDataSet::GetFilePattern ( ) const
inline

Definition at line 154 of file TRestDataSet.h.

◆ GetFileSelection() [1/2]

std::vector< std::string > TRestDataSet::GetFileSelection ( )
inline

It returns a list of the files that have been finally selected.

Definition at line 145 of file TRestDataSet.h.

◆ GetFileSelection() [2/2]

auto TRestDataSet::GetFileSelection ( ) const
inline

Definition at line 156 of file TRestDataSet.h.

◆ GetFilterContains()

auto TRestDataSet::GetFilterContains ( ) const
inline

Definition at line 159 of file TRestDataSet.h.

◆ GetFilterEndTime()

auto TRestDataSet::GetFilterEndTime ( ) const
inline

Definition at line 151 of file TRestDataSet.h.

◆ GetFilterEqualsTo()

auto TRestDataSet::GetFilterEqualsTo ( ) const
inline

Definition at line 162 of file TRestDataSet.h.

◆ GetFilterGreaterThan()

auto TRestDataSet::GetFilterGreaterThan ( ) const
inline

Definition at line 160 of file TRestDataSet.h.

◆ GetFilterLowerThan()

auto TRestDataSet::GetFilterLowerThan ( ) const
inline

Definition at line 161 of file TRestDataSet.h.

◆ GetFilterMetadata()

auto TRestDataSet::GetFilterMetadata ( ) const
inline

Definition at line 158 of file TRestDataSet.h.

◆ GetFilterStartTime()

auto TRestDataSet::GetFilterStartTime ( ) const
inline

Definition at line 150 of file TRestDataSet.h.

◆ GetNumberOfBranches()

size_t TRestDataSet::GetNumberOfBranches ( )
inline

Number of variables (or observables)

Definition at line 142 of file TRestDataSet.h.

◆ GetNumberOfColumns()

size_t TRestDataSet::GetNumberOfColumns ( )
inline

Number of variables (or observables)

Definition at line 139 of file TRestDataSet.h.

◆ GetObservablesList()

auto TRestDataSet::GetObservablesList ( ) const
inline

Definition at line 155 of file TRestDataSet.h.

◆ GetProcessObservablesList()

auto TRestDataSet::GetProcessObservablesList ( ) const
inline

Definition at line 157 of file TRestDataSet.h.

◆ GetQuantity()

auto TRestDataSet::GetQuantity ( ) const
inline

Definition at line 163 of file TRestDataSet.h.

◆ GetStartTime()

auto TRestDataSet::GetStartTime ( ) const
inline

Definition at line 152 of file TRestDataSet.h.

◆ GetTotalTimeInSeconds()

Double_t TRestDataSet::GetTotalTimeInSeconds ( ) const
inline

It returns the accumulated run time in seconds.

Definition at line 148 of file TRestDataSet.h.

◆ GetTree()

TTree * TRestDataSet::GetTree ( ) const
inline

Gives access to the tree.

Definition at line 129 of file TRestDataSet.h.

◆ Import() [1/2]

void TRestDataSet::Import ( const std::string &  fileName)

This function imports metadata from a root file it import metadata info from the previous dataSet while it opens the analysis tree.

Definition at line 908 of file TRestDataSet.cxx.

◆ Import() [2/2]

void TRestDataSet::Import ( std::vector< std::string >  fileNames)

This function initializes the chained tree and the RDataFrame using as input several root files that should contain TRestDataSet metadata information. The values of the first dataset will be considered to be stored in this new instance.

The metadata member fMergedDataset will be set to true to understand this dataset is the combination of several datasets, and not a pure original one.

Definition at line 952 of file TRestDataSet.cxx.

◆ InitFromConfigFile()

void TRestDataSet::InitFromConfigFile ( )
overrideprivatevirtual

Initialization of specific TRestDataSet members through an RML file.

Reading filters

Reading observables

Reading process observables

Reading relevant quantities

Reading new dataset columns

Reimplemented from TRestMetadata.

Definition at line 647 of file TRestDataSet.cxx.

◆ Initialize()

void TRestDataSet::Initialize ( )
overridevirtual

This function initialize different parameters from the TRestDataSet.

Reimplemented from TRestMetadata.

Definition at line 299 of file TRestDataSet.cxx.

◆ IsMergedDataSet()

auto TRestDataSet::IsMergedDataSet ( ) const
inline

Definition at line 165 of file TRestDataSet.h.

◆ MakeCut()

ROOT::RDF::RNode TRestDataSet::MakeCut ( const TRestCut cut)

This function apply a TRestCut to the dataframe and returns a dataframe with the applied cuts. Note that the cuts are not applied directly to the dataframe on TRestDataSet, to do so you should do fDataSet = MakeCut(fCut);.

Definition at line 492 of file TRestDataSet.cxx.

◆ operator=()

TRestDataSet & TRestDataSet::operator= ( TRestDataSet dS)

Operator to copy TRestDataSet metadata.

Definition at line 880 of file TRestDataSet.cxx.

◆ PrintMetadata()

void TRestDataSet::PrintMetadata ( )
overridevirtual

Prints on screen the information about the metadata members of TRestDataSet.

Reimplemented from TRestMetadata.

Definition at line 557 of file TRestDataSet.cxx.

◆ SetDataFrame()

void TRestDataSet::SetDataFrame ( const ROOT::RDF::RNode &  dS)
inline

Definition at line 126 of file TRestDataSet.h.

◆ SetFilePattern()

void TRestDataSet::SetFilePattern ( const std::string &  pattern)
inline

Definition at line 168 of file TRestDataSet.h.

◆ SetObservablesList()

void TRestDataSet::SetObservablesList ( const std::vector< std::string > &  obsList)
inline

Definition at line 167 of file TRestDataSet.h.

◆ SetQuantity()

void TRestDataSet::SetQuantity ( const std::map< std::string, RelevantQuantity > &  quantity)
inline

Definition at line 169 of file TRestDataSet.h.

Field Documentation

◆ fColumnNameExpressions

std::vector<std::pair<std::string, std::string> > TRestDataSet::fColumnNameExpressions
private

A list of new columns together with its corresponding expressions added to the dataset.

Definition at line 106 of file TRestDataSet.h.

◆ fCut

TRestCut* TRestDataSet::fCut = nullptr
private

Parameter cuts over the selected dataset.

Definition at line 85 of file TRestDataSet.h.

◆ fDataSet

ROOT::RDF::RNode TRestDataSet::fDataSet = ROOT::RDataFrame(0)
private

The resulting RDF::RNode object after initialization.

Definition at line 109 of file TRestDataSet.h.

◆ fEndTime

Double_t TRestDataSet::fEndTime = REST_StringHelper::StringToTimeStamp(fFilterStartTime)
private

TimeStamp for the end time of the last file.

Definition at line 97 of file TRestDataSet.h.

◆ fFilePattern

std::string TRestDataSet::fFilePattern = ""
private

A glob file pattern that must be satisfied by all files.

Definition at line 58 of file TRestDataSet.h.

◆ fFileSelection

std::vector<std::string> TRestDataSet::fFileSelection
private

A list populated by the FileSelection method using the conditions of the dataset.

Definition at line 91 of file TRestDataSet.h.

◆ fFilterContains

std::vector<std::string> TRestDataSet::fFilterContains
private

If not empty it will check if the metadata member contains the string.

Definition at line 70 of file TRestDataSet.h.

◆ fFilterEndTime

std::string TRestDataSet::fFilterEndTime = "3000/12/31"
private

All the selected runs will have an ending date before fEndTime.

Definition at line 55 of file TRestDataSet.h.

◆ fFilterEqualsTo

std::vector<Double_t> TRestDataSet::fFilterEqualsTo
private

If the corresponding element is not empty it will check if the metadata member is equal.

Definition at line 79 of file TRestDataSet.h.

◆ fFilterGreaterThan

std::vector<Double_t> TRestDataSet::fFilterGreaterThan
private

If the corresponding element is not empty it will check if the metadata member is greater.

Definition at line 73 of file TRestDataSet.h.

◆ fFilterLowerThan

std::vector<Double_t> TRestDataSet::fFilterLowerThan
private

If the corresponding element is not empty it will check if the metadata member is lower.

Definition at line 76 of file TRestDataSet.h.

◆ fFilterMetadata

std::vector<std::string> TRestDataSet::fFilterMetadata
private

A list of metadata members where filters will be applied.

Definition at line 67 of file TRestDataSet.h.

◆ fFilterStartTime

std::string TRestDataSet::fFilterStartTime = "2000/01/01"
private

All the selected runs will have a starting date after fStartTime.

Definition at line 52 of file TRestDataSet.h.

◆ fImportedFiles

std::vector<std::string> TRestDataSet::fImportedFiles
private

The list of dataset files imported.

Definition at line 103 of file TRestDataSet.h.

◆ fMergedDataset

Bool_t TRestDataSet::fMergedDataset = false
private

It keeps track if the generated dataset is a pure dataset or a merged one.

Definition at line 100 of file TRestDataSet.h.

◆ fObservablesList

std::vector<std::string> TRestDataSet::fObservablesList
private

It contains a list of the observables that will be added to the final tree or exported file.

Definition at line 61 of file TRestDataSet.h.

◆ fProcessObservablesList

std::vector<std::string> TRestDataSet::fProcessObservablesList
private

It contains a list of the process where all observables should be added.

Definition at line 64 of file TRestDataSet.h.

◆ fQuantity

std::map<std::string, RelevantQuantity> TRestDataSet::fQuantity
private

The properties of a relevant quantity that we want to store together with the dataset.

Definition at line 82 of file TRestDataSet.h.

◆ fStartTime

Double_t TRestDataSet::fStartTime = REST_StringHelper::StringToTimeStamp(fFilterEndTime)
private

TimeStamp for the start time of the first file.

Definition at line 94 of file TRestDataSet.h.

◆ fTotalDuration

Double_t TRestDataSet::fTotalDuration = 0
private

The total integrated run time of selected files.

Definition at line 88 of file TRestDataSet.h.

◆ fTree

TChain* TRestDataSet::fTree = nullptr
private

A pointer to the generated tree.

Definition at line 112 of file TRestDataSet.h.


The documentation for this class was generated from the following files: