Research Computing (full-day) Advanced Bootcamp - Office of Research Cyberinfrastructure

Tuesday, June 23, 2026
8:30 AM – 4:30 PM
Partnership 3 – Room 233 (in-person only)

The Office of Research Cyberinfrastructure is hosting a one-day “Research Computing Advanced Bootcamp” for users interested in specialized topics in research computing, such as strategies for using system profiling tools, the limitations of pandas for large DataFrames, other high-performance tools for DataFrames, querying large language models via Python APIs, reproducibility practices, and automated plotting techniques. The workshop will include three sessions featuring hands-on exercises, followed by an open discussion and Q&A.

Due to limited seating capacity, preference will be given to graduate students.


Session 1: Handling Large DataFrames in Python

Pandas is the default tool for working with tabular data across many research domains, but as datasets grow into the multi-gigabyte range, it begins to break down, first in memory usage and then in performance. This session walks through that transition end-to-end using a real-world dataset. It examines where pandas struggles at scale (in both memory and CPU) and evaluates how far pandas can be pushed before alternative solutions become necessary.

The session then introduces a newer-generation DataFrame library (e.g., Polars) that is multithreaded by default, uses a columnar memory layout for improved performance on analytical workloads, and can reorganize and optimize sequences of operations prior to execution. It concludes by highlighting how storage format choices and the ability to process data in chunks (rather than loading everything into memory at once) determine whether code remains effective as datasets grow by an order of magnitude, along with a brief discussion of how these concepts extend to multi-machine (e.g., Dask) or GPU-based (e.g., cuDF) workloads.
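As a rough illustration of the chunked-processing idea mentioned above, the sketch below sums one column per group while holding only a few rows in memory at a time. It uses the `chunksize` option of pandas `read_csv`. The column names (`site`, `reading`), the helper `chunked_group_sum`, and the tiny in-memory CSV are hypothetical stand-ins for a large file on disk, not material from the session itself.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for a multi-gigabyte file on disk.
csv_text = "site,reading\n" + "\n".join(
    f"{'A' if i % 2 == 0 else 'B'},{i}" for i in range(10)
)


def chunked_group_sum(source, chunksize):
    """Sum `reading` per `site`, reading `chunksize` rows at a time."""
    totals = {}
    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the whole file at once.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        partial = chunk.groupby("site")["reading"].sum()
        # Merge this chunk's partial sums into the running totals.
        for site, value in partial.items():
            totals[site] = totals.get(site, 0) + int(value)
    return totals


totals = chunked_group_sum(io.StringIO(csv_text), chunksize=3)
print(totals)
```

Peak memory here depends on the chunk size rather than the file size, which is why the same pattern keeps working as the input grows. It only applies directly to aggregations that can be combined from partial results, such as sums and counts.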

Session 2: Python and DataFrames for Sensible Experiment Management

Summary: What is “effective” computational research? At any given time, you may have multiple directions you want to investigate. What kind of tooling can you construct during the exploratory phase that will aid you when it’s time to publish and distribute your findings?

In this workshop, you’ll build a benchmarking framework that runs inference against a shared LLM server to replicate the results of a paper on Zero-shot Chain-of-Thought prompting. Along the way, we will discuss how and when it is appropriate to rearrange and organize our research code.

Topics covered: Querying LLMs via the API in Python, Pandas aggregation, reproducibility, and automated plotting.
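To give a feel for the kind of benchmarking loop described above, here is a minimal sketch that runs each question with and without a chain-of-thought trigger and aggregates accuracy per condition. Everything in it is an assumption made for illustration: `fake_model` is a hypothetical stand-in for a call to the shared LLM server, its answers are hard-coded and are not results, and the trigger phrase "Let's think step by step." is the one commonly associated with zero-shot chain-of-thought prompting, which may differ from what the session uses.

```python
from collections import defaultdict

# Assumed trigger phrase; the paper replicated in the session may use another.
COT_TRIGGER = "Let's think step by step."


def build_prompt(question, use_cot):
    """Format a question, optionally appending the chain-of-thought trigger."""
    suffix = f" {COT_TRIGGER}" if use_cot else ""
    return f"Q: {question}\nA:{suffix}"


def fake_model(prompt):
    """Hypothetical stand-in for the LLM server; answers are hard-coded."""
    return "4" if COT_TRIGGER in prompt else "5"


def run_benchmark(questions, model):
    """Run every question under both conditions and record one row per run."""
    rows = []
    for question, expected in questions:
        for use_cot in (False, True):
            answer = model(build_prompt(question, use_cot))
            rows.append(
                {
                    "question": question,
                    "use_cot": use_cot,
                    "correct": answer == expected,
                }
            )
    return rows


def accuracy_by_condition(rows):
    """Aggregate the fraction of correct answers for each condition."""
    hits = defaultdict(list)
    for row in rows:
        hits[row["use_cot"]].append(row["correct"])
    return {cond: sum(vals) / len(vals) for cond, vals in hits.items()}


questions = [("What is 2 + 2?", "4"), ("What is 1 + 3?", "4")]
rows = run_benchmark(questions, fake_model)
accuracy = accuracy_by_condition(rows)
print(accuracy)
```

Keeping one flat row per run, with the condition stored as a column, means the same records can later be loaded into a DataFrame for aggregation and plotting without restructuring.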

Please note: All the sessions have a hands-on component. To participate in the hands-on exercises during the session, you will need to bring your own computer equipped with a web browser. Prior experience in Python is helpful.

Tentative Agenda for Tuesday, June 23, 2026

Please note that this agenda is subject to updates.

Event	Start	End	Speaker
Sign-in and Breakfast	08:30 AM	08:50 AM	–
Introduction to Office of Research Cyberinfrastructure	08:50 AM	09:00 AM	TBD
Handling Large DataFrames in Python	09:00 AM	10:30 AM	Fahad Khan
Coffee Break	10:30 AM	10:45 AM	–
Handling Large DataFrames in Python (cont.)	10:45 AM	11:45 AM	Fahad Khan
Lunch Break	11:45 AM	12:45 PM	–
Python and DataFrames for Sensible Experiment Management	12:45 PM	02:15 PM	Benjamin Keene
Coffee Break	02:15 PM	02:30 PM	–
Python and DataFrames for Sensible Experiment Management (cont.)	02:30 PM	03:30 PM	Benjamin Keene
TBD	03:30 PM	04:00 PM
Q&A and Discussion	04:00 PM	04:30 PM	–