Introduction
Overview
Teaching: 10 min
Exercises: min
Questions
How are EIC/ePIC simulation outputs organised?
Objectives
Understand how the simulation output is organised
Find out how to request a new simulation
Discover the tools that are available to browse and access the simulation output
Simulation Campaigns
Simulations of a range of physics processes in the ePIC detector are typically run on a monthly basis by the Production Working Group. Information on simulation campaigns can be found on the Production Working Group pages. This includes details of files produced in previous campaigns.
A list of current requests from Detector Subsystem Co-ordinators and the Physics Analysis Co-ordinators can be found here.
Campaigns are designated by a standardised format - YY.MM.Ver
- YY - Year the campaign ran, e.g. 26 is 2026
- MM - Month the campaign ran, e.g. 02 is February
- Ver - Version of the campaign, starting from 0. A campaign in a given month may have more than one version
These are linked to specific software releases following the same format.
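As a minimal sketch of the naming scheme above, a campaign tag can be split into its year, month and version. The `parse_campaign` helper and `Campaign` type are hypothetical names used for illustration only, and the sketch assumes that two-digit years always refer to 20YY.

```python
import re
from typing import NamedTuple


class Campaign(NamedTuple):
    year: int     # full four-digit year, e.g. 2026
    month: int    # 1 to 12
    version: int  # campaign version, starting from 0


def parse_campaign(tag: str) -> Campaign:
    """Split a YY.MM.Ver campaign tag into its numeric parts."""
    match = re.fullmatch(r"(\d{2})\.(\d{2})\.(\d+)", tag)
    if match is None:
        raise ValueError(f"not a YY.MM.Ver campaign tag: {tag!r}")
    year, month, version = (int(part) for part in match.groups())
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {tag!r}")
    # Assumption: two-digit years refer to 20YY.
    return Campaign(2000 + year, month, version)


print(parse_campaign("26.02.0"))
# prints Campaign(year=2026, month=2, version=0)
```

Because `Campaign` is a tuple, campaigns sort chronologically, which can be handy when looking for the latest one in a list.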
Note that campaigns more than ~6 months old will not be directly accessible using the methods we will explore in this tutorial.
Various types of files are produced as part of the simulation campaign as we will discuss in the next section. The files you may wish to access will differ depending upon your use case. In this tutorial, we will explore a few different common use cases and the types of files you may want in each.
If you would like to submit a new request to a future campaign for a dataset that is not in production, please follow this process:
- Coordinate with your physics or detector working group and the detector subsystem or physics analysis co-ordinators to add your request to the overview spreadsheet and assign a priority.
- Generate the Monte-Carlo input for your new request.
- Please follow the pre-processing guidelines when preparing your new input files for submission.
- Once your input files are ready, submit a simulation request form.
- If your input is not pre-processed following the pre-processing guidelines, it will not be simulated. Please review these carefully.
Simulation Files Organisation
Within a simulation campaign, there are three broad classes of files that are produced:
- EVGEN: The input hepmc3 datasets
- E.g. some files that have been supplied by a physics event generator
- FULL: The full GEANT4 output root files (usually only saved for a fraction of runs)
- If running a simulation yourself, this would be your output from processing npsim
- RECO: The output root files from the reconstruction
- And again, if running yourself, this would be your output from EICrecon (after you’ve used your awesome new reconstruction algorithm from the later tutorial of course)
Most users and use cases will interact with RECO files, the output of the full simulation and reconstruction chain. We will explore some use cases and how to find the relevant files in each case.
How can I Browse the Simulation Campaign Output and Access Files?
To browse the campaign output and find the files we want, we can use Rucio. Rucio is an open source scientific data management system. It is utilised in other large physics experiments such as ATLAS.
Wait, I read I should use XRootD to find and access files?
You may find references to, or instructions on, using XRootD to browse and access files. These may still work and indeed, we will use some of these commands later in this tutorial. However, Rucio is now the preferred method for the cases we will examine.
Why? This change isn’t just to make everybody learn something new; it is also a consequence of the growing volume of ePIC data now available. Previously (before 2026), all simulated data was stored on Jefferson Lab servers. However, data is now spread between multiple sites. This makes finding and accessing it using XRootD more complicated. Rucio can deal with this “issue” in a straightforward way.
You may also find reference to an S3 server. This is now deprecated and cannot be used. If you find such references or instructions to S3 server usage in tutorial material, please raise an issue on the GitHub page for this tutorial flagging that this should be removed.
Key Points
Simulation campaigns run on a regular (monthly) basis
Input requests must be formatted in a specific way and meet certain pre-requisites
Rucio is the primary way to browse and access simulated EIC/ePIC data
Rucio Usage
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How can I use Rucio?
Objectives
Become familiar with aspects of Rucio
Use Rucio tags to find specific types of files
Getting Started
We can access and run the Rucio client from within eic-shell. From wherever you have eic-shell:
./eic-shell
rucio whoami
This should print out some information -
email : eicprod@jlab.org
account : eicread
account_type : GROUP
...
We can also check the arguments we can supply to rucio, as well as usage info with:
rucio -h
To use Rucio further, we will need to briefly look at how Rucio organises data.
Datasets and DIDs
Typically, we want to analyse data contained within specific files. Files can be grouped together into datasets, which can themselves be grouped into containers. All three refer to “data”. As such, the term “data identifier” or DID is used to represent any set of files, datasets or containers in Rucio. A DID is just the name of a single file, dataset or container.
In Rucio, all DIDs follow a naming scheme which is composed of two strings - a scope and a name, formatted as -
scope:name
For ePIC, the scope is always epic, meaning that all of our DIDs look like:
epic:name
The name describes the dataset in question, including details such as the software release used to create the files and the electron and ion beam energies.
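A minimal sketch of how a DID string separates into its two parts is shown here. The `split_did` helper and the dataset name `some_dataset_name` are hypothetical and used purely for illustration; only the `scope:name` layout comes from the text above.

```python
def split_did(did: str) -> tuple[str, str]:
    """Return (scope, name) for a DID written as scope:name."""
    # Split on the first colon only, so the name is kept intact.
    scope, sep, name = did.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"expected scope:name, got {did!r}")
    return scope, name


# "some_dataset_name" is a made-up name, not a real ePIC dataset.
scope, name = split_did("epic:some_dataset_name")
print(scope)  # prints epic
print(name)   # prints some_dataset_name
```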
As an example, consider the DID for the dataset:
epic:/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+
The name here - /RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+, tells us many things about the contents of this dataset. Let’s break this down, examining the component enclosed within each pair of /---/ -
- RECO - This tells us that the DID contains reconstructed output file information
- 26.02.0 - This tells us that the 26.02.0 software release was used, the February 2026 release (version 0)
- epic_craterlake - This tells us that the epic_craterlake detector configuration was used in the simulation
- EXCLUSIVE - The DID is for a dataset of exclusive physics events
- DEMP - This is the specific exclusive process simulated in the dataset, Deeply Exclusive Meson Production (DEMP)
- DEMPgen-1.2.4 - The simulation in this dataset is based upon input generated by the DEMPgen-1.2.4 event generator
- 10x130 - The files in this dataset were simulated with 10 GeV electrons on 130 GeV ions (in this case protons; if the ion species is not specified, it is likely to be an ep simulation)
- The AAxBBB format, where AA is the electron energy and BBB is the ion energy, is typical for quoting beam energies in EIC files. This may be written as AAonBBB too.
- q2_10_20 - The files in this dataset are from a simulation input containing events in the Q2 range of 10 to 20 GeV^2
- pi+ - Pi+ are generated in this output. This is specific to this DEMP reaction and signifies that it is Deeply Exclusive Pion Production
Warning - Not a filepath!
The name of our DID here looks a lot like a filepath. However, it is a flat object and does not have any hierarchy, as we will see in the next section.
Other names may not necessarily contain all of the same information, but as a bare minimum, are likely to tell us something about the physics process simulated and beam conditions, as well as which software release was used. This is reflected in the metadata tags assigned as we will see later.
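The breakdown of the example name can be sketched in a few lines of Python. This is only an illustration for reading names of this particular layout: the `describe_name` helper and the labels it assigns (such as `detector_config`) are hypothetical, and, as noted above, other names may not contain the same components. Splitting on `/` is a convenience for human reading; Rucio itself treats the name as a flat string.

```python
def describe_name(name: str) -> dict:
    """Label the '/'-separated parts of a DID name laid out like the DEMP example.

    Assumption: the name has exactly the nine components of that example.
    Other datasets may use a different layout, so this is not a general parser.
    """
    labels = [
        "file_class", "software_release", "detector_config", "physics_group",
        "process", "generator", "beam_energies", "q2_range", "final_state",
    ]
    parts = name.strip("/").split("/")
    if len(parts) != len(labels):
        raise ValueError(f"expected {len(labels)} components, got {len(parts)}")
    info = dict(zip(labels, parts))

    # AAxBBB: electron energy x ion energy, both in GeV.
    electron, ion = info["beam_energies"].split("x")
    info["electron_beam_energy"] = float(electron)
    info["ion_beam_energy"] = float(ion)

    # q2_<min>_<max>: Q2 range in GeV^2.
    _, q2_min, q2_max = info["q2_range"].split("_")
    info["q2_min"] = float(q2_min)
    info["q2_max"] = float(q2_max)
    return info


name = "/RECO/26.02.0/epic_craterlake/EXCLUSIVE/DEMP/DEMPgen-1.2.4/10x130/q2_10_20/pi+"
for key, value in describe_name(name).items():
    print(f"{key}: {value}")
```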
Finding DIDs
Now that we know what a DID looks like, how can we find the DID corresponding to the file or dataset that we’re interested in?
… … …
However, a much easier approach to finding what we need is to use the metadata tags that are assigned to all DIDs from March 2026 onwards.
Metadata Tags
The following tags are available as of March 2026:
- software_release
- Software release used in the simulation. Written as a container version tag/simulation campaign naming:
- vYY.MM.v
- E.g. v25.06.2 -> June 2025 software container, version 2
- physics_process
- Defines the physics working group (PWG) that the simulated data relates to, options are:
- excl_diff_tagging
- inclusive
- jets_hf
- semi_inclusive
- ew_bsm
- other
- Can be one or more
- q2_min
- Minimum Q2 value (GeV^2) in the simulation file, entered as a number.
- Optional tag - Not all simulated files use this
- q2_max
- Maximum Q2 value (GeV^2) in the simulation file, entered as a number.
- Optional tag - Not all simulated files use this
- electron_beam_energy
- Electron beam energy in GeV
- ion_beam_energy
- Ion/nucleus beam energy in GeV
- is_background_mixed
- True/false depending upon whether sample includes any background mixing
- ion_species
- Ion species in the simulation, defaults to p (proton) if not specified
- generator
- MC event generator used to generate the simulated data
- E.g. Pythia8, Herwig etc
As noted on some items in this list, some tags are optional and may not be applied to all datasets. However, the following tags are required for all datasets:
- software_release
- physics_process
- electron_beam_energy
- ion_beam_energy
- is_background_mixed
- ion_species
- generator
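The split between required and optional tags can be expressed as a small check, sketched below. The `check_tags` helper is hypothetical and is not part of Rucio; the tag names are taken from the lists above, while the example values are illustrative only and are not copied from a real dataset.

```python
REQUIRED_TAGS = {
    "software_release", "physics_process", "electron_beam_energy",
    "ion_beam_energy", "is_background_mixed", "ion_species", "generator",
}
OPTIONAL_TAGS = {"q2_min", "q2_max"}


def check_tags(tags: dict) -> list[str]:
    """Return a list of problems found in a dictionary of metadata tags."""
    present = set(tags)
    problems = [f"missing required tag: {tag}"
                for tag in sorted(REQUIRED_TAGS - present)]
    problems += [f"unknown tag: {tag}"
                 for tag in sorted(present - REQUIRED_TAGS - OPTIONAL_TAGS)]
    return problems


# Illustrative values only.
example = {
    "software_release": "v26.02.0",
    "physics_process": "excl_diff_tagging",
    "electron_beam_energy": 10,
    "ion_beam_energy": 130,
    "is_background_mixed": False,
    "ion_species": "p",
    "generator": "DEMPgen",
    "q2_min": 10,
    "q2_max": 20,
}
print(check_tags(example))  # prints []
```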
We can use these tags to filter through the available datasets and identify those of interest to us. For example:
Example command
Exercise: Using tags, find the DIDs of the latest:
- DEMP events in the Q2 range of 3 to 10 GeV^2 for 10 GeV electrons on 250 GeV protons
- Print the full DID and check the number of files in the dataset
Hint - Check the example name we looked at when introducing DIDs in a previous section.
Using DIDs
Info on checking DID info and downloading
Key Points
Rucio works with datasets and Data Identifiers (DIDs)
ePIC DIDs may look or be formatted like a nested filepath, but they are flat
Tags can be used to quickly sort and find data of interest
Use Cases
Overview
Teaching: 10 min
Exercises: 40 min
Questions
How do users interact with EIC/ePIC data?
Objectives
Explore different use cases for ePIC simulation data and how users work with EIC/ePIC data
Discover how simulation files can be utilised in further analysis
Know how to download files if needed (and when it might be needed)
In this episode, we will explore a few common use cases and how users may want to interact with simulation campaign output in each case. Examples of carrying out some common tasks associated with each use case will be included.
Physics Analyser - Novice
This use case considers a user who is new to analysing ePIC data and wants to look at a specific physics process. They will likely want to find a dataset for that process to pass through their analysis code. Their requirements are likely to include:
- Reconstructed output files (RECO)
- A general type of physics process (or group of processes)
- The latest available files to test
- A specific collider (energy and ion species) configuration
They may also want to use only a small subset of the data to test and develop their analysis. This use case is one example where downloading a small number of files locally may be beneficial.
To find files that meet their requirements they could utilise the following tags…
-
- -
We can use these tags to filter through the DIDs and find datasets of interest:
Example command
Once we have identified a specific dataset of interest, we can look at the files within it using:
Example command
as we saw in the last episode. We could download this file locally using
Example command
Exercise: Using the suggested tags, find the latest available datasets for:
- Neutral current (NC) DIS events for 10 GeV electrons colliding with 130 GeV protons
- Download one file from this dataset of your choice
Physics Analyser - Experienced
In this use case, we consider an experienced physics analyser who has a well-developed analysis script that they want to run on a large number of files, possibly even a full dataset, for a specific physics process they’re interested in. Their requirements are likely to include:
- Reconstructed output files (RECO)
- A specific physics process (or group of processes)
- The latest available files to test
- A specific collider (energy and ion species) configuration
- A dataset with machine backgrounds embedded in the simulated output files
To find files that meet their requirements they could utilise the following tags…
-
- -
As they want to process a large number of files, it is unlikely (and not recommended) that they download a large number of files to process them locally. Instead, they may want to stream their files directly in their analysis script. They could do this via
root based streaming example
Full working script
or if they’re using python -
Python based streaming example
Full working script
As they may wish to process a full dataset, they might want to feed their script a full list of files to stream and run. They could print the full list of files in a dataset via -
Example command to pipe dataset list to a file
Note: We have limited this to only pipe 5 files in the dataset to our list. Remove the fragment part of the command to instead print all lines. Alternatively, edit this to be the number of lines that you want.
This could then be processed in the script via -
root based streaming example
Full working script
or if they’re using python -
Python based streaming example
Full working script
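As a rough sketch of the first step in such a script, the text file holding the list can be read into a Python list, optionally keeping only the first few entries. The `read_file_list` helper, the file name `file_list.txt` and the entries written to it are all hypothetical; the sketch assumes the list holds one file entry per line and does not open or stream the files themselves.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import Optional, Union


def read_file_list(list_path: Union[str, Path],
                   max_files: Optional[int] = None) -> list[str]:
    """Read one file entry per line, skipping blank lines.

    max_files=None returns every entry; an integer keeps only the first N.
    """
    with Path(list_path).open() as handle:
        entries = (line.strip() for line in handle if line.strip())
        return list(islice(entries, max_files))


# Demonstration with a temporary list of made-up entries.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "file_list.txt"
    demo.write_text("file_a.root\nfile_b.root\n\nfile_c.root\n")
    print(read_file_list(demo, max_files=2))
    # prints ['file_a.root', 'file_b.root']
```

The returned list can then be looped over in the analysis script, with each entry passed to whichever streaming call the script uses.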
Exercise: Using the suggested tags, find the latest available dataset for:
- Deeply Virtual Compton Scattering (DVCS) events from the EpIC event generator for 10 GeV electrons colliding with 130 GeV protons without background included
- Stream one file from this dataset in a script, check the number of events in this file
- Print all of the files in this dataset to a text file
- Stream five of the files in this dataset in a script, check the total number of events contained in all five files.
Detector Designer/Optimiser
Discussion of use case based upon SIM data
Algorithm/Reconstruction Development
Discussion of use case based upon SIM data and tags - merge with previous?
General Comments
Some general comments and info. Pointers, things to avoid or recommendations etc.
Key Points
Files from datasets can be directly streamed in analysis scripts
Files (or whole datasets) can be downloaded locally, but this is usually not needed