This article is the second in a series of posts where I share my journey exploring proteomics, cancer biology, and AI-driven solutions for oncology. In the upcoming posts, I will expand on the data foundations of the project, the machine learning methods we are testing, and the clinical implications of this research. The goal of the series is to make complex concepts accessible while also showing the step-by-step progress of building an applied research project from biology to artificial intelligence. Stay tuned if you are curious about how molecular data, medical images, and deep learning can come together to address one of the most fundamental challenges in cancer research.

In Post 1 we asked: can H&E histology carry a footprint of telomerase activity? Here we operationalize that idea: assemble images and RNA for the same patients, create a robust telomerase-high label, and produce clean tensors for modeling.

Several public resources exist (e.g., TCGA = The Cancer Genome Atlas; CPTAC = Clinical Proteomic Tumor Analysis Consortium). We use TCGA because it provides, at population scale, paired whole-slide images (WSIs) and RNA-seq, harmonized pipelines, rich metadata, and open Google Cloud Storage (GCS) access via ISB-CGC (Institute for Systems Biology - Cancer Genomics Cloud, an NCI-funded platform). That makes it ideal for reproducible extract-transform-load (ETL) work.

TCGA assigns every specimen a structured barcode. An example is shown in Figure 1. The components we care about first are the ones that identify the case (patient) and the sample.

Using the structured barcode, we can join data reliably. For labels computed at the patient level, we attach RNA to slides via the case barcode (TCGA-XX-YYYY). When we need stricter pairing at the sample level, we use the sample barcode (first five components). Following this procedure, more than 10k samples were paired.

BigQuery is Google Cloud’s serverless, scalable data warehouse, where you can run fast SQL analytics on massive datasets without managing infrastructure.
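Before turning to the tables, the barcode-based pairing described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for this project: the helper names (case_barcode, sample_barcode, pair_by_key) are hypothetical, and the barcodes are illustrative examples in the TCGA format. It relies on the fact that the fourth hyphen-separated field packs the sample type (two digits) and the vial (one letter), so the first four fields cover the first five components.

```python
from collections import defaultdict
from typing import Callable, List, Tuple


def _fields(barcode: str) -> List[str]:
    """Split a TCGA barcode into its hyphen-separated fields."""
    fields = barcode.strip().upper().split("-")
    if len(fields) < 3 or fields[0] != "TCGA":
        raise ValueError(f"not a TCGA barcode: {barcode!r}")
    return fields


def case_barcode(barcode: str) -> str:
    """Project, tissue source site, participant: TCGA-XX-YYYY."""
    return "-".join(_fields(barcode)[:3])


def sample_barcode(barcode: str) -> str:
    """Case barcode plus the sample-type-and-vial field, e.g. TCGA-XX-YYYY-01A."""
    fields = _fields(barcode)
    if len(fields) < 4:
        raise ValueError(f"barcode has no sample field: {barcode!r}")
    return "-".join(fields[:4])


def pair_by_key(
    slides: List[str], rna: List[str], key: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Inner-join slide and RNA barcodes on key(barcode)."""
    rna_index = defaultdict(list)
    for r in rna:
        rna_index[key(r)].append(r)
    return [(s, r) for s in slides for r in rna_index.get(key(s), [])]


# Illustrative barcodes only (assumed for this sketch).
slides = [
    "TCGA-02-0001-01C-01-TS1",
    "TCGA-02-0001-01Z-00-DX1",
    "TCGA-02-0003-01A-01-BS1",
]
rna = [
    "TCGA-02-0001-01C-01R-0177-01",
    "TCGA-02-0003-01B-01R-0177-01",
]

case_pairs = pair_by_key(slides, rna, case_barcode)      # patient-level join
sample_pairs = pair_by_key(slides, rna, sample_barcode)  # stricter sample-level join
```

On this toy input the case-level join keeps every slide, while the sample-level join keeps only the slide whose sample type and vial match the RNA aliquot. That is the trade-off between the two pairing levels: more pairs versus tighter correspondence between the tissue imaged and the tissue sequenced.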
Through ISB-CGC, BigQuery hosts the TCGA tables we need:

RNA-seq: isb-cgc-bq.TCGA_versioned.RNAseq_hg38_gdc_r42
Slides: isb-cgc-bq.TCGA.slide_images_gdc_current

If you like pandas-style queries, ibis lets you express queries as Python expressions that compile to SQL, and read the results directly into DataFrames.

Finally, across cohorts we compute the 75th percentile (Q3) of TERT expression and define the labels by thresholding at that value. In plain terms: a slide is considered telomerase-high if its TERT reading falls in the top quartile.

Whole-slide images (WSIs) are huge, high-resolution images. To get local representations, we (1) detect tissue, (2) tile it into smaller patches, and (3) compute patch embeddings with a vision transformer. The TRIDENT library can run this end-to-end in one command.

Other patch encoders besides uni_v2 are available (see the TRIDENT repo). The patch_size and embedding dimensionality depend on the chosen encoder, so pick accordingly. Some WSIs contain pen marks or artifacts; --remove_artifacts helps filter them.

Embeddings are written under path_to_outputs as one .h5 file per slide, containing an array of shape (num_patches, embedding_dim) plus patch coordinates. With uni_v2, the embedding dimension is 1536; the number of patches varies per slide. Computing embeddings is time-consuming; an alternative is the precomputed UNI2-h features on Hugging Face. If you use those, pairing with RNA-seq requires two merges.

At this point, each WSI is represented by many patch embeddings, while TERT gives one label per slide. Intuitively, only a subset of patches drives the telomerase signal.

We conclude with a simple baseline: compute a slide embedding as the mean of its patch embeddings, then feed it to a small MLP (the BaselineClassifier).

This “mean-pool + MLP” setup is clean and reproducible, and it confirms that an image-only signal correlates with telomerase (see Figure 4).
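To make the mean-pool + MLP idea concrete, here is a minimal NumPy sketch of the forward pass only. It is not the BaselineClassifier used in this post: the class name TinyMLP, the hidden size of 64, the random untrained weights, and the synthetic patch arrays are all assumptions for illustration. A real model would be trained with a deep learning framework on embeddings loaded from the .h5 files.

```python
import numpy as np


def mean_pool(patch_embeddings: np.ndarray) -> np.ndarray:
    """Average a (num_patches, embedding_dim) array into one (embedding_dim,) vector."""
    if patch_embeddings.ndim != 2 or patch_embeddings.shape[0] == 0:
        raise ValueError("expected a non-empty (num_patches, embedding_dim) array")
    return patch_embeddings.mean(axis=0)


class TinyMLP:
    """Hypothetical one-hidden-layer MLP: forward pass only, random weights."""

    def __init__(self, in_dim: int, hidden_dim: int = 64, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), size=(in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden_dim), size=(hidden_dim, 1))
        self.b2 = np.zeros(1)

    def predict_proba(self, slide_embeddings: np.ndarray) -> np.ndarray:
        """Map (num_slides, in_dim) slide vectors to one probability per slide."""
        hidden = np.maximum(slide_embeddings @ self.w1 + self.b1, 0.0)  # ReLU
        logits = (hidden @ self.w2 + self.b2).squeeze(-1)
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid


# Synthetic stand-ins for three slides with different patch counts
# (1536 is the uni_v2 embedding dimension mentioned above).
rng = np.random.default_rng(42)
slides = [rng.normal(size=(n, 1536)) for n in (120, 87, 301)]

slide_vectors = np.stack([mean_pool(s) for s in slides])       # shape (3, 1536)
probs = TinyMLP(in_dim=1536).predict_proba(slide_vectors)      # shape (3,)
```

Because mean pooling collapses any number of patches into a single 1536-dimensional vector, slides with different patch counts can be stacked into one batch. The cost is that every patch contributes equally to the slide vector, whether or not it carries telomerase-related signal.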
In Post 3, we’ll replace mean pooling with Attention-Based Multiple Instance Learning (ABMIL) to generate patch-level heatmaps that localize the signal on WSIs.

While these global baselines already capture meaningful signal, they treat each slide as a single object. In the next post, we relax this assumption and ask a more refined question: can we learn where in the tissue telomerase-associated morphology resides by reasoning at the patch level?

© 2026 Barnacle Labs Ltd.