Introduction

Overview

Teaching: 10 min
Exercises: min

Questions

How are EIC/ePIC simulation outputs organised?

Objectives

Understand how the simulation output is organised

Find out how to request a new simulation

Discover the tools that are available to browse and access the simulation output

Simulation Campaigns

Simulations of a range of physics processes in the ePIC detector are typically run on a monthly basis by the Production Working Group. Information on simulation campaigns can be found on the Production Working Group pages. This includes details of files produced in previous campaigns.

A list of current requests from the Detector Subsystem Co-ordinators and the Physics Analysis Co-ordinators can be found here.

Campaigns are designated by a standardised format - YY.MM.Ver

YY - Year the campaign ran, e.g. 26 is 2026
MM - Month the campaign ran, e.g. 02 is February
Ver - Version of the campaign, starting from 0. A campaign may have several versions

These are linked to specific software releases following the same format.
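
The YY.MM.Ver format can be handled in a short script, for example to work out which of several campaigns is the most recent. The sketch below is a minimal example rather than an official tool: the names Campaign, parse_campaign and latest_campaign are made up for illustration, two-digit years are assumed to mean 20YY, and tags carrying a suffix (such as the -stable container tags mentioned later) are not handled.

```python
from typing import NamedTuple


class Campaign(NamedTuple):
    year: int      # full year, e.g. 2026
    month: int     # 1-12
    version: int   # campaign version, starting from 0


def parse_campaign(tag: str) -> Campaign:
    """Split a 'YY.MM.Ver' campaign tag into numeric parts."""
    yy, mm, ver = tag.split(".")
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in campaign tag: {tag!r}")
    # Assumption: two-digit years refer to 20YY.
    return Campaign(2000 + int(yy), month, int(ver))


def latest_campaign(tags):
    """Return the most recent tag, comparing numerically rather than as text."""
    return max(tags, key=parse_campaign)


print(parse_campaign("26.02.0"))
print(latest_campaign(["25.12.1", "26.02.0", "25.12.10"]))
```

Comparing the parsed numbers rather than the raw text matters once a version reaches two digits: as text, 25.12.10 would sort before 25.12.2.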

Note that campaigns more than ~6 months old will not directly be accessible using the methods we will explore in this tutorial.

Various types of files are produced as part of the simulation campaign as we will discuss in the next section. The files you may wish to access will differ depending upon your use case. In this tutorial, we will explore a few different common use cases and the types of files you may want in each.

Submitting a New Simulation Request

If you would like to request a dataset that is not yet in production for a future campaign, please follow this process:

Coordinate with your physics or detector working group and the detector subsystem or physics analysis co-ordinators to add your request to the overview spreadsheet and assign a priority.
Generate the Monte-Carlo input for your new request.
- Please follow the pre-processing guidelines when preparing your new input files for submission.
Once your input files are ready, submit a simulation request form.
- If your input is not pre-processed following the pre-processing guidelines, it will not be simulated. Please review these carefully.

Simulation Files Organisation

Within a simulation campaign, there are three broad classes of files that are produced:

EVGEN: The input hepmc3 datasets
- E.g. some files that have been supplied by a physics event generator
FULL: The full GEANT4 output root files (usually only saved for a fraction of runs)
- If running a simulation yourself, this would be your output from processing npsim
RECO: The output root files from the reconstruction
- And again, if running yourself, this would be your output from EICrecon (after you’ve used your awesome new reconstruction algorithm from the later tutorial of course)

Most users and use cases will interact with RECO files, the output of the full simulation and reconstruction chain. We will explore some use cases and how to find the relevant files in each case.

How can I Browse the Simulation Campaign Output and Access Files?

To browse the campaign output and find the files we want, we can use Rucio. Rucio is an open source scientific data management system. It is utilised in other large physics experiments such as ATLAS.

Wait, I read I should use XrootD to find and access files?

You may find reference to or instructions on using XrootD to browse and access files. These may still work and indeed, we will use some of these commands later in this tutorial. However, Rucio is now the preferred method for the cases we will examine. The recommended workflow is now:

Find file location with Rucio
Stream or download with XrootD

Why? This change isn’t just to make everybody learn something new, it is also a consequence of the expansion of the volume of ePIC data now available. Previously (before 2026), all simulated data was stored on Jefferson Lab servers. However, data is now spread between multiple sites. This makes finding and accessing it using XrootD more complicated. Rucio can deal with this “issue” in a straightforward way.

You may also find reference to an S3 server. This is now deprecated and cannot be used. If you find such references or instructions to S3 server usage in tutorial material, please raise an issue on the GitHub page for this tutorial flagging that this should be removed.

Key Points

Simulation campaigns run on a regular (monthly) basis

Input requests must be formatted in a specific way and meet certain pre-requisites

Rucio is the primary way to browse and access simulated EIC/ePIC data

Rucio Usage

Overview

Teaching: 15 min
Exercises: 15 min

Questions

How can I use Rucio?

Objectives

Become familiar with aspects of Rucio

Use Rucio tags to find specific types of files

Learn how to download or stream files for further use

Getting Started

We can access and run the Rucio client from within eic-shell. From wherever you have eic-shell:

./eic-shell
rucio whoami

This should print out some information:

email      : eicprod@jlab.org
account    : eicread
account_type : GROUP
...

We can also check the arguments we can supply to rucio, as well as usage info with:

rucio -h

To use Rucio further, we will need to briefly look at how Rucio organises data.

Datasets and DIDs

Typically, we want to analyse data contained within specific files. Files can be grouped together into datasets, which can themselves be grouped into containers. All three refer to “data”. As such, the term “data identifier” or DID is used in Rucio. A DID is just the name of a single file, dataset or container.

In Rucio, all DIDs follow a naming scheme which is composed of two strings - a scope and a name, formatted as:

scope:name

For epic, the scope is always epic, meaning that all of our DIDs look like:

epic:name

The name carries information about the dataset in question, such as the software release used to create the files, the electron and ion beam energies etc.

As an example, consider the DID for the dataset:

epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+

The name here - /RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+, tells us many things about the contents of this dataset. Let’s break this down, examining the component enclosed within each pair of /---/ -

RECO
- This tells us that the DID contains reconstructed output file information
26.02.0
- This tells us that the 26.02.0 software release was used, the February 2026 release (version 0).
epic_craterlake
- This tells us that the epic_craterlake detector configuration was used in the simulation
EXCLUSIVE
- The DID is for a dataset of exclusive physics events
DEMP
- This is the specific exclusive process simulated in the dataset, Deeply Exclusive Meson Production, DEMP
DEMPgen-1.2.4
- The simulation in this dataset is based upon input generated by the DEMPgen-1.2.4 event generator
10x130
- The files in this dataset were simulated with 10 GeV electrons on 130 GeV ions (in this case protons; if the ion species is not specified, it is likely to be an ep simulation)
- The AAxBBB format, where AA is the electron energy and BBB is the ion energy, is typical for quoting beam energies in EIC files. This may be written as AAonBBB too.
q2_10_20
- The files in this dataset are from a simulation input containing events in the 10 to 20 GeV^2 Q2 range
pi+
- Pi+ are generated in this output - this is specific to this DEMP reaction and signifies that it is Deeply Exclusive Pion Production

Warning - Not a filepath!

The name of our DID here looks a lot like a filepath, however it is a flat object and does not have any hierarchy as we will see in the next section.

Other names may not necessarily contain all of the same information, but as a bare minimum, are likely to tell us something about the physics process simulated and beam conditions, as well as which software release was used. This is reflected in the metadata tags assigned as we will see later.
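
To make the breakdown above concrete, the following sketch splits the example DID into its scope and name, then labels the pieces of the name. It is only a reading aid under stated assumptions: the function names and the labels (data_level, release and so on) are chosen here for illustration and are not Rucio fields, and the layout is assumed to match the example DID. Rucio itself still treats the name as a single flat string.

```python
def split_did(did: str):
    """Split 'scope:name' at the first colon."""
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"not a scope:name DID: {did!r}")
    return scope, name


def describe_name(name: str):
    """Label the '/'-separated pieces of a name laid out like the example.

    The labels are illustrative only; they are not defined by Rucio.
    """
    labels = ["data_level", "release", "detector_config", "category",
              "process", "generator", "beam_energies"]
    parts = name.strip("/").split("/")
    info = dict(zip(labels, parts))
    info["extra"] = parts[len(labels):]  # e.g. Q2 range, particle
    # Convert 'AAxBBB' into numbers only when both halves are plain digits.
    electron, _, ion = info.get("beam_energies", "").partition("x")
    if electron.isdigit() and ion.isdigit():
        info["electron_gev"], info["ion_gev"] = int(electron), int(ion)
    return info


did = "epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+"
scope, name = split_did(did)
print(scope)
for key, value in describe_name(name).items():
    print(f"{key}: {value}")
```

Names that carry less information simply produce fewer labelled entries, so the output should be checked rather than trusted blindly.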

Finding DIDs

Now that we know what a DID looks like, how can we find the DID corresponding to the file or dataset that we’re interested in?

Well, we can list DIDs using:

rucio did list scope:name

To begin though, let’s try:

rucio did list --help

As we can see, we can apply filters to our list request if we want. We can also apply wildcards, but we need to be a bit careful due to the warning above. Whilst our DIDs look like a unix file path, they are not. The structure is flat. We can quickly see this if we try:

rucio did list epic:/RECO/\*

We get an enormous number of DIDs returned! This is every reconstruction related DID available to access right now.

Warning - Check the campaign date!

If you encounter any issues when processing the DID listed earlier, it may be due to the software release version. Remember that campaigns older than ~6 months will not be instantly accessible. Try switching to a more recent campaign version.

Working backwards from the full DID we had earlier, we could include the software release, detector configuration, process and generator to narrow down the list of DIDs:

rucio did list epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/\*

We now have a more manageable list and we can see that we have some different beam energy configurations and Q2 ranges available. Note that we can list just the DIDs themselves by adding the --short option to our did list call.

We could also make slightly better use of wildcards in this command too:

rucio did list 'epic:*RECO*26.02.0*DEMP*'

But as above, this does require some knowledge of what our DID looks like to begin with.

Pin for later:

If we used --short as suggested to just get a list of DIDs, we could pipe this output to a file.

Each line would be the full DID for an item which we could potentially make use of.

As we can see, the DIDs we have in our list now are all datasets. We can check the contents of these datasets too. Let’s pick one of our DIDs and examine the content. We can do this via:

rucio did content list scope:name

e.g.

rucio did content list epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+

Again, a lot of output. We see though that this DID is for a dataset of a large number of files. We can again use --short to just get this as a list:

rucio did content list --short epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+

We can check where a specific file in our dataset is stored too:

rucio replica list file --protocols root --pfns --rses isopenaccess epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root

list file comment:

Despite the slightly misleading command above, we can actually just provide a dataset DID here too. If we do so, we will get the location of all files in the dataset in one command, e.g:
rucio replica list file --protocols root --pfns --rses isopenaccess epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+
You could pipe this to a file for later usage. However, note that multiple replicas may exist for a given file, and this command prints all of them as is. You can check if multiple copies exist via:
rucio rule list --did scope:name
e.g.
rucio rule list --did epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+
This will list where the DID is stored and how many copies exist.

The root://dtn-eic.jlab.org at the start of the output tells us that this particular file is stored on JLab servers. As mentioned at the outset, Rucio works across multiple sites easily; however, the methods which we might use to stream files do not. As such, being able to check where our files are stored is a useful feature.
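
If the replica locations have been saved to a file, the hosting site can be read off each line programmatically, and duplicate copies of the same file can be reduced to one. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, the second server in the usage example (other-site.example.org) is invented purely for illustration, and files are assumed to be identifiable by the final part of their path.

```python
import posixpath
from urllib.parse import urlparse


def host_of(pfn: str) -> str:
    """Return the server name from a root:// location."""
    return urlparse(pfn).hostname or ""


def one_replica_per_file(pfns, preferred_host=None):
    """Keep a single location per file name, favouring preferred_host if given."""
    chosen = {}
    for pfn in pfns:
        pfn = pfn.strip()
        if not pfn:
            continue  # skip blank lines
        filename = posixpath.basename(urlparse(pfn).path)
        current = chosen.get(filename)
        if current is None:
            chosen[filename] = pfn
        elif host_of(pfn) == preferred_host and host_of(current) != preferred_host:
            chosen[filename] = pfn  # swap in the copy at the preferred site
    return list(chosen.values())


# Usage example: the second location is a made-up replica of the same file.
locations = [
    "root://other-site.example.org:1094//data/epic/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root",
    "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root",
]
for location in one_replica_per_file(locations, preferred_host="dtn-eic.jlab.org"):
    print(host_of(location), location)
```

Preferring one host is a design choice for this sketch; which site is best to stream from will depend on where you are running.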

So, we can find DIDs, check what they are and what they contain. To get to this point though, we needed some prior knowledge of what the DID looked like, which isn’t necessarily that helpful for finding something. A much easier approach to finding what we need is to use the metadata tags that are assigned to all DIDs from March 2026 onwards.

Metadata Tags

Thanks!

Automatically adding these metadata tags to datasets was enabled due to work by Sakib Rahman (BNL), Anil Panta (JLab) and ePIC Software & Computing. Thanks to their efforts, finding ePIC data using Rucio is more straightforward!

The following tags are available as of March 2026:

software_release
- Software release used in the simulation. Written as a container version tag/simulation campaign naming:
- YY.MM.v
- E.g. 25.06.2-stable -> June 2025 software container, version 2 stable
  - Note, campaign release files will almost always be from a -stable release/container version
is_background_mixed
- True/false depending upon whether sample includes any background mixing
data_level
- Level of simulation data, simulation or reconstruction
geometry_config
- Geometry config tag, e.g. craterlake_18x275, craterlake_5x41_He3
generator
- MC event generator used to generate the simulated data
  - pythia6, pythia8, beagle, djangoh, rapgap, dempgen, sartre, lager, estarlight, eic_sr_geant4, eic_esr_xsuite, sherpa, single_particle, epic, other
requester_pwg
- Defines the physics working group (PWG) that the simulated data relates to, options are:
  - edt (exclusive, diffractive and tagging), inclusive, jets_hf, semi_inclusive, ew_bsm, other
- Can be one or more
- Skipped for SINGLE/BACKGROUNDS
requester_dsc
- Detector subsystem collaboration requester
  - tracking, other
- Set to tracking for background related datasets
electron_beam_energy_gev
- Electron beam energy in GeV
ion_beam_energy_gev
- Ion/nucleus beam energy in GeV
ion_species
- Ion species in the simulation, defaults to p, proton, if not specified
  - p, Au197, Cu63, He3, H2, Ru96
q2_min_gev2
- Minimum Q2 value (GeV^2) in the simulation file, entered as a number.
q2_max_gev2
- Maximum Q2 value (GeV^2) in the simulation file, entered as a number.
gun_particle
- Single particle type
  - e-, e+, proton, neutron, pi+, pi-, pi0, kaon-, kaon+, gamma, mu-
gun_momentum_min_gev
- Minimum gun momentum in GeV
gun_momentum_max_gev
- Maximum gun momentum in GeV
gun_theta_min_deg
- Minimum gun polar angle in degrees
gun_theta_max_deg
- Maximum gun polar angle in degrees
gun_phi_min_deg
- Minimum gun azimuthal angle in degrees, default 0
gun_phi_max_deg
- Maximum gun azimuthal angle in degrees, default 360
gun_distribution
- Type of distribution for particle gun
  - uniform, cos(theta), eta, pseudorapidity, ffbar

Reference Sheet

This information is available segmented out from this tutorial as a reference sheet in the extras section by following the Rucio Metadata Tags link.

Most of the tags in this list are optional and may not be applied to all datasets. However, the following tags are required for all datasets:

software_release
is_background_mixed
data_level
geometry_config
generator

This does not mean you are required to use these, rather that all datasets will have these tags.

Note that tags are entered in lower case, with the exception of ion species.

We can use these tags to filter through the available datasets and identify those of interest to us. For example:

rucio did list --filter 'TAG==*' 'scope:*'

So, as an example, we could list all DIDs with electron beam energies of 10 GeV via:

rucio did list --filter 'electron_beam_energy_gev==10' 'epic:*'

We can also combine tags and filter on several at once, e.g:

rucio did list --filter 'electron_beam_energy_gev==10, ion_beam_energy_gev==250' 'epic:*'

which will return only datasets with 10x250 collisions (10 GeV electrons on 250 GeV ions using the standard ePIC conventions). We can keep adding filters in this manner as we like to really narrow down the DIDs we return with our query.
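
When many filters are combined, it can be less error prone to assemble the command in a script. The sketch below builds the same rucio did list call as above from a list of conditions. The helper names build_filter and did_list_command are hypothetical, and only the command form already shown in this episode is used.

```python
import shlex


def build_filter(conditions):
    """Join (tag, operator, value) triples into one --filter string."""
    return ", ".join(f"{tag}{op}{value}" for tag, op, value in conditions)


def did_list_command(conditions, pattern="epic:*"):
    """Return the argument list for 'rucio did list' with the given filter."""
    return ["rucio", "did", "list", "--filter", build_filter(conditions), pattern]


cmd = did_list_command([
    ("electron_beam_energy_gev", "==", 10),
    ("ion_beam_energy_gev", "==", 250),
])
# shlex.join adds the quoting needed to paste the command into a shell.
print(shlex.join(cmd))
```

Inside eic-shell, the same argument list could be passed to subprocess.run(cmd) without any extra quoting, since each list entry is handed to rucio as a single argument.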

Logical Expressions

Note that in our examples we use == with our filters, but other logical expressions can be used too. E.g. >=, <=, >, < and so on are all valid for tags expecting an integer/number value.

Note that we can also use wildcards in our tag searches. This could be helpful if we don’t know if a particular dataset was run in a specific campaign for example. We could do:

rucio did list --filter 'software_release==26.*, electron_beam_energy_gev==10' 'epic:*'

to just get a list of all DIDs with 10 GeV beam electrons from 2026 software releases for example. However, remember that tags have only been applied from March 2026 onwards.

Exercise:

Using tags, find the DIDs of the latest:

DEMP events in the Q2 range of 3 to 10 for 10 GeV electrons on 250 GeV protons

Print the full DID and check the number of files in the dataset

Hint - Check the example name we looked at when introducing DIDs in a previous section.

Using DIDs - Downloading or Processing Files

So far we’ve seen how we can find DIDs and check some basic info such as what type of data they point to and where that data is stored. We generally want to do a bit more than that though. Typically we want to find data to use it in some way. For our simulation data, this is usually to analyse it!

We can download any DID, whether a container, dataset or file, straightforwardly:

rucio download scope:name

So to download our file from earlier, we just do:

rucio download epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root

By default it will download to our current directory with its original name. In this case, that’s unfortunate because as we noticed earlier, this looks a lot like a UNIX file path. As such, we now have a large number of nested directories to go through before we get to our file!
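
One way to avoid stepping through the nested directories by hand is to search for the downloaded file recursively. This is a minimal sketch, assuming the download landed somewhere below the current directory and that the file kept its .root extension. The function name find_root_files is made up for this example.

```python
from pathlib import Path


def find_root_files(top="."):
    """Return every *.root file below top, sorted by path."""
    return sorted(Path(top).rglob("*.root"))


for path in find_root_files("."):
    print(path)
```

Each printed path can then be opened directly, or the file can be moved somewhere more convenient.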

Warning - Do you need the whole dataset?

Think very carefully before downloading a DID. What is it? If it’s a full dataset, do you really need all of the data?

Generally you will not need a local copy of a full dataset. It’s generally best to only download a small subset of files to test and run.

We can stream files from a full dataset rather than downloading them as we’ll see in a moment.

It might actually be easier to use XrootD to grab our file, as it’s a bit more intuitive. We do need the location we found earlier for this though:

xrdcp root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root ./

Here ./ is our current directory, and this time we’ll just get the actual file! We could also specify a new file name in place of ./ if desired.

Our full path, including location, as used here, will also be useful if we want to stream our files directly in a script. In ROOT, we could just do:

auto f = TFile::Open("root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root");

or using python and uproot:

import uproot
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
root_file = uproot.open(file_path)

or directly with PyROOT:

import ROOT
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
file = ROOT.TFile.Open(file_path, "READ")

Testing File Streaming

We can quickly check the three methods above work.

To test the ROOT approach, we can make a macro called Test.C and add:

void Test(){
  auto f = TFile::Open("root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root");
  auto tree = f->Get<TTree>("events");
  Long64_t nEntries = tree->GetEntries(); // read the number of entries in the tree
  cout << nEntries << " events in tree" << endl;
}

If we run this script with root Test.C, it should stream our file and print the number of entries (the number of events) in the file.

Similarly in Python, we can make Test.py and use Uproot:

import uproot
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
root_file = uproot.open(file_path)
print(root_file['events'].num_entries, "events in this tree.")

or directly using PyROOT:

import ROOT
import XRootD
file_path = "root://dtn-eic.jlab.org:1094//volatile/eic/EPIC//RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x250/q2_3_10/pi+/DEMPgen-1.2.4_10x250_pi+_q2_3_10_ab.0550.eicrecon.edm4eic.root"
file = ROOT.TFile.Open(file_path, "READ")
print((file.Get("events")).GetEntries(), "events in this tree.")

All three approaches should yield the same result.

Key Points

Rucio works with datasets and Data Identifiers (DIDs)

ePIC DIDs may look or be formatted like a nested filepath, but they are flat

Tags can be used to quickly sort and find data of interest

Once you have found the file location with Rucio, you can use XrootD to download or stream the file

Use Cases

Overview

Teaching: 10 min
Exercises: 40 min

Questions

How do users interact with EIC/ePIC data?

Objectives

Explore different use cases for ePIC simulation data and how users work with EIC/ePIC data

Discover how simulation files can be utilised in further analysis

Know how to download files if needed (and when it might be needed)

In this episode, we will explore a few common use cases and how users may want to interact with simulation campaign output in each case. Examples of carrying out some common tasks associated with each use case will be included.

Physics Analyser - Novice

This use case explores a user who is new to analysing ePIC data and wants to look at a specific physics process. They will likely want to find files for that process to pass through their analysis code. Their requirements are likely to include:

Reconstructed output files (RECO)
A general type of physics process (or group of processes)
The latest available files to test
A specific collider (energy and ion species) configuration

They may also want to use only a small subset of data to test and develop their analysis. This use case is one example where downloading a small number of files locally may be beneficial.

To find files that meet their requirements they could utilise the following tags:

software_release
requester_pwg
electron_beam_energy_gev
ion_beam_energy_gev
ion_species
data_level

We can use these tags to filter through the DIDs and find datasets of interest:

rucio did list --filter 'software_release==XXX, requester_pwg==YYY, electron_beam_energy_gev==ZZ, ion_beam_energy_gev==iii, ion_species==jjj' 'epic:*'

Where we can substitute in our chosen values for each in place of XXX, YYY, ZZ, iii and jjj.

Beam Energies:

Whilst we can enter any number for the electron_beam_energy_gev and ion_beam_energy_gev values, there are only certain combinations actually in use. electron_beam_energy_gev is typically 5, 9, 10 or 18 GeV. ion_beam_energy_gev is typically 41, 100, 130, 250 or 275 GeV for protons. For other ion species, 110 and 166 GeV may also be used.

Once we have identified a specific dataset of interest, we can look at the files within it using:

rucio did content list scope:name

and we can get locations of the files within the dataset via:

rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_file
rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_Dataset

as we saw in the last episode. We can get just the location of a specific file OR the location of all files within the dataset, depending upon which we specify. We could download one of these files locally using

xrdcp FILEPATH ./

where FILEPATH is the path to one specific file from the output of one of the rucio commands above.

Exercise:

Using the suggested tags, find the latest available datasets for:

Neutral current (NC) DIS events for 10 GeV electrons colliding with 130 GeV protons

Download one file from this dataset of your choice

Physics Analyser - Experienced

In this use case, we consider an experienced physics analyser that has a well developed analysis script that they want to run on a large number of files, possibly even a full dataset, for a specific physics process they’re interested in. Their requirements are likely to include:

Reconstructed output files (RECO)
A specific physics process (or group of processes)
The latest available files to test
A specific collider (energy and ion species) configuration
Datasets with machine backgrounds embedded in the simulated output files

To find files that meet their requirements they could utilise the following tags:

software_release
requester_pwg
electron_beam_energy_gev
ion_beam_energy_gev
generator
data_level

They may also want to use the q2_min_gev2 and q2_max_gev2 tags, along with the ion_species tag, to narrow down to an even more specific subset of files. They may also want to analyse files with or without background enabled, which the is_background_mixed tag distinguishes.

As they want to process a large number of files, it is unlikely (and not recommended) that they will want to download a large number of files to process them locally. Instead, they may want to stream their files directly in their analysis script. They could do this via:

auto f = TFile::Open("FILEPATH");
auto tree = f->Get<TTree>("events");

or if they’re using python:

import uproot
import XRootD
file_path = "FILEPATH"
root_file = uproot.open(file_path)

As they may wish to process a full dataset, they might want to feed their script a full list of files to stream and run. They could print the full list of files in a dataset:

rucio replica list file --protocols root --pfns --rses isopenaccess scope:name_of_Dataset > FileList

This could then be processed in the script:

void FileListProcess(){
  string line;
  ifstream fileList("FileList"); // List of file locations written by the rucio command above
  int FileCount = 0;
  //TChain *AnalysisChain = new TChain("events"); // We could define a chain to process our files too and add them as we scan over our list
  while(getline(fileList, line)){
    if (FileCount >= 5) break; // Stop loop after 5 files, comment out to read the full list
    if (line.empty()) continue; // Skip blank lines
    TString tmpFile{line};
    auto RootFile = TFile::Open(tmpFile);
    if(!RootFile || RootFile->IsZombie()){ // Check the file could be opened
      cout << "File not found: " << tmpFile << endl;
      continue;
    }
    cout << "Found file - " << line << endl;
    //AnalysisChain->Add(tmpFile); // Add to our chain if we want
    RootFile->Close(); // Close the file again, we only wanted to check it
    FileCount++;
  }
}

or if they’re using python:

import ROOT
import uproot
import XRootD # Not called directly; importing it checks the XRootD Python bindings are installed
import awkward as ak # Only needed for the commented out example at the end

Files=[]

with open('FileList', 'r') as file_list:
    lines_list = file_list.readlines()
    for line in lines_list[:5]: # Read only lines 0:5 - remove [:5] to read all or change 5 to N where N is the number of lines you want
        file_path = line.rstrip() # rstrip to remove trailing white space/new lines
        try:
            with uproot.open(file_path) as root_file: # Separate name so we do not shadow the list file handle
                Files.append(file_path) # Add file path to array
                print("Found file - ", file_path, "and appended to list for processing.")
        except Exception as e:
            print(f"Could not open file: {e}")

# Use the uproot iterate method to process our list of files - See https://uproot.readthedocs.io/en/stable/uproot.behaviors.TBranch.iterate.html
#for chunk in uproot.iterate({f: "events" for f in Files}, expressions=["MCParticles.PDG"]): # Open files in array f and process events tree with branches specified
    # Process each chunk - Do something
    # print(ak.type(chunk))

Note that we have restricted these examples to only print out the first five files in the list we created. We can comment out or change the lines as noted to process the full list (or adjust the cutoff value in the condition to process a different number).

Exercise:

Using the suggested tags, find the latest available dataset for:

Deeply Virtual Compton Scattering (DVCS) events from the EpIC event generator for 10 GeV electrons colliding with 130 GeV protons without background included

Stream one file from this dataset in a script, check the number of events in this file

Print all of the files in this dataset to a text file

Stream five of the files in this dataset in a script, check the total number of events contained in all five files.

Hint - See the example scripts in the last episode for how to get the number of events in a root file in a few different ways.

Detector Designer/Optimiser, Algorithm/Reconstruction Development

In this use case, someone updating the design of a detector in DD4hep, or adjusting a reconstruction algorithm for a detector, may not want full reconstructed data. Instead, they may want more raw, hit level information. They may also want a specific detector configuration for comparison. In terms of physics process, they may not be looking at an actual reaction at all, but a particle gun simulation. To summarise, they may want:

Simulated (FULL) as well as reconstructed output files (RECO)
Particle gun studies with specific single particles
- Specific momentum or angular ranges
Different versions of a dataset to track changes between software releases
Specific geometry files in use for the simulation

Some tags they might use to find their data include:

software_release
data_level
requester_dsc
geometry_config
gun_particle
gun_momentum_min_gev
gun_momentum_max_gev
gun_theta_min_deg
gun_theta_max_deg
gun_phi_min_deg
gun_phi_max_deg
gun_distribution

Exercise:

Using combinations of the suggested tags, find the latest available dataset(s) for:

K- single particle gun simulations

Determine the momentum and angular ranges available for this/these dataset(s)

Do non-reconstructed files exist for this/these dataset(s)?

Conclusion and Comments

That wraps up our introduction to using Rucio and some example use cases and scenarios.

New tags may be added in the future. We welcome any suggestions or changes as we roll out Rucio and it becomes more widely used. Get in touch with suggestions, comments and feedback via stephen.kay@york.ac.uk or on Mattermost.

Remember to consider whether you need full datasets before downloading them and keep an eye on whether your files have multiple open access copies when making file lists.

Also, if you find any nice tricks or develop short scripts (maybe one which makes a file list for the latest version of a dataset based upon inputs?) then feel free to share them too!

Key Points

Files from datasets can be directly streamed in analysis scripts

Files (or whole datasets) can be downloaded locally, but this is usually not needed

EIC Tutorial: File Access
