Setgraph, Apple Health, TabPFN, and the day my lifting log got serious
I pulled my Setgraph history and Apple Health context into one table, rebuilt 1,548 exercise-sessions, benchmarked six tabular models, and learned three things fast: completed session quality was more explainable than I expected, cardio 24-48 hours before lifting looked surprisingly good, and personal-record windows were narrower than normal training defaults.
On this page
- The project, minus the incense
- The first punchline: later sessions were more explainable than I expected
- The part that surprised me
- The lift-specific playbooks were better than the global averages
- What got killed, what stayed true
- Feature engineering mattered more than hero worship
- Caveats, because grown-ups are allowed to have them
- What I would do next
- References
Based on personal experiments with my own training data. This is an n=1 retrospective analysis, not causal evidence or coaching advice. Any opinions here are my own and do not represent my employer.
I wanted to know if a good lifting day was predictable before it happened.
I joined Setgraph with Apple Health, rebuilt the sessions into a usable feature table, and tested the whole thing on future sessions only. The short version: yes, session quality was predictable, and a few of the strongest signals were not the ones gym folklore would have picked.
- Setgraph sets: 6,290. 93 exercises logged from 2023-04-28 through 2026-03-16.
- Health activity rows: 1,773. The current pass also folds in bodyweight, sleep, HRV, effort, and heart-rate context.
- Modeled exercise-sessions: 1,548, of which 1,455 had prior history, which made them usable for the holdout test.
- Source A: Setgraph export. Reps, load, timestamps, rest gaps, and the full set-by-set trail.
- Source B: Apple Health export. Cardio, sleep, HRV, steps, bodyweight, and Watch signals stitched into the recovery context.
- Inference engine: TabPFN. A pre-trained tabular transformer sitting on top of the engineered feature stack, tested on future sessions only.

The headlines:

- TabPFN won the six-model field at R² = 0.9217.
- The table ended up with 49 model columns.
- Cardio 24-48h before lifts looked strongest.
- Everyday defaults and PR buckets are not the same thing.
Quick glossary
- PR (personal record): your best result to date for that lift or exercise variation.
- R² (R-squared): how much of the session-to-session variation the model explains. Closer to 1 is better.
- MAE (mean absolute error): the average miss between the model's prediction and the actual session result. Lower is better.
- RMSE (root mean squared error): like MAE, but bigger misses count extra. Lower is better.
- HRV (heart rate variability): variation in the time between heartbeats. Higher values often line up with better recovery and lower fatigue.
- RHR (resting heart rate): your baseline heart rate at rest. A spike above normal can hint at stress, illness, or poor recovery.
- RPE / RIR (rate of perceived exertion / reps in reserve): two ways lifters describe difficulty: how hard a set felt, or how many reps were left in the tank.
- Holdout split (chronological holdout): the future-only test set. Here, the last 20% of each exercise history was held back for evaluation.
- TabICL (in-context tabular learner): a foundation-model-style tabular predictor that finished second in this run.
- TabPFN (tabular foundation model): a pre-trained model for tabular data from Prior Labs. It was the strongest model on this holdout split.
I wanted a blunt answer: how much of a lifting session's quality can I actually anticipate from my log and recovery context?
More broadly, I wanted to know what I could actually tweak in future plans while keeping the whole thing honest, even if this started as a local n=1 experiment with my own data.
The inputs were ordinary enough: a Setgraph export for the lifting history, an Apple Health export for the surrounding cardio, and a lot of feature work to turn both into a table a model could actually use.
By the current pass, that table was no longer just lifting plus cardio. It also carried interpolated bodyweight, sleep stages, HRV, resting heart rate, respiratory rate, steps, and Apple Watch effort and heart-rate signals.
The answer was: more than I expected, with an important catch. On a chronological holdout that forced the models to score later sessions, the best setup explained 92.17% of the variance in the performance_vs_best_clipped session target.
That was not a pure before-the-workout forecast. The feature table mixed signals available before the session with signals only revealed during the session itself. Even so, the result was not model magic. It came from rebuilt sessions, explicit rest windows, warmups separated from work, fatigue estimated from first-versus-last work sets, 24h/48h cardio lookbacks, and the added recovery context from bodyweight and Apple Health. Once the table was clean, TabPFN finished first in an expanded six-model field, with TabICL also landing ahead of every tree baseline.
It is also worth being honest about the subject. This was a dad-of-two training log from a stretch where recovery was not always textbook, which is exactly why the added bodyweight and readiness context mattered. Some of the flatter-looking sessions make more sense once you stop pretending every week happened under identical conditions.
The project, minus the incense
At the raw-file level this was:
- Setgraph_Set_Export_2026-03-18.csv with 6,290 sets
- Setgraph_Sessions_Export_2026-03-18.csv with 1,773 Apple Health activity rows
- A derived table of 1,548 exercise-sessions spanning 2023-04-28 to 2026-03-16
- 1,455 modeled sessions with prior history and a 298-row chronological holdout
- A 49-column model feature set spanning lift structure, cardio timing, bodyweight, sleep, HRV/RHR, steps, and Watch effort / heart-rate context
Some of that context existed before training started. Some of it only existed once the session was underway or already finished. That distinction matters if you care about true pre-session prediction rather than post hoc explanation.
The target was pragmatic: how close each exercise-session came to the best prior performance for that exercise. A value of 1.0 means "matched the lifetime best to date." Values above 1.0 are personal-record (PR) territory. Values below 1.0 mean the day landed somewhere under the existing ceiling.
That target is not perfect, but it is useful. It asks the question I actually care about when I look at a session: was this close to my best recent version of this lift or not?
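A minimal sketch of that target, assuming a per-session frame with illustrative `exercise`, `date`, and `performance` columns; the clip threshold is my guess, not the project's actual value:

```python
import pandas as pd

def add_target(sessions: pd.DataFrame) -> pd.DataFrame:
    """Compute performance vs. best-so-far per exercise, clipped.

    Assumes one row per exercise-session with a numeric `performance`
    column (e.g. top-set estimated 1RM). Column names are illustrative.
    """
    sessions = sessions.sort_values(["exercise", "date"]).copy()
    # Best performance *before* this session; shift(1) avoids leaking
    # today's result into its own target.
    prior_best = (
        sessions.groupby("exercise")["performance"]
        .transform(lambda s: s.shift(1).cummax())
    )
    ratio = sessions["performance"] / prior_best
    # Clip extreme ratios so early noisy history does not dominate.
    sessions["performance_vs_best_clipped"] = ratio.clip(upper=1.25)
    return sessions
```

The first exposure of each exercise has no prior best, which is why only the 1,455 sessions with history were usable for modeling.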
Pipeline sketch
What the project actually did
Step 01
Export the raw log
Every set became a row: exercise, timestamp, reps, load, and enough history to rebuild a training day.
Step 02
Layer recovery context
Cardio timing came first, then the table widened to bodyweight, sleep, HRV, RHR, steps, and Watch effort / heart-rate signals.
Step 03
Build feature columns
The script promoted the raw log into 49 model columns: spacing, work-rest, session volume, warmups, fatigue, sleep, bodyweight, and recovery signals.
Step 04
Score the future
A six-model field had to predict held-out future sessions rather than shuffled history.
The interesting structural choice was the split. Instead of shuffling rows randomly, the project held out the last 20% of each exercise history. That matters. Random shuffles flatter models by letting them peek around the future. This setup forced every model to prove itself on later data.
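That per-exercise chronological split is a few lines of pandas. A sketch assuming illustrative `exercise` and `date` columns, not the project's actual schema:

```python
import pandas as pd

def chronological_holdout(df: pd.DataFrame, frac: float = 0.2):
    """Hold out the last `frac` of each exercise's history, no shuffling."""
    df = df.sort_values(["exercise", "date"])
    # Position of each session within its exercise history...
    pos = df.groupby("exercise").cumcount()
    # ...and the length of that history, broadcast back to every row.
    n = df.groupby("exercise")["date"].transform("size")
    # The final 20% (by position) of each exercise becomes the test set.
    is_test = pos >= (1 - frac) * n
    return df[~is_test], df[is_test]
```

Because the cut is per exercise, every lift contributes both training history and future test sessions, and no model ever scores a session older than the ones it learned from.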
The first punchline: later sessions were more explainable than I expected
I like starting tabular work with a ladder instead of a leap.
- Use a linear model to see whether the obvious signal is real.
- Use a strong tree model to check whether non-linearity matters.
- Use a foundation model to see whether there is structure the first two are still leaving on the table.
That is exactly what happened here.
For reference, the model field mixed ridge regression, random forests, the benchmark boosted-tree families behind XGBoost and CatBoost, and the newer tabular foundation-model papers that made TabPFN and TabICL worth testing in the first place (Refs 1-6).
Chronological holdout
Six models, one target, one clear winner
Higher R² is better. Lower MAE and RMSE are better.
| Model | R² | MAE | RMSE | Read |
|---|---|---|---|---|
| Ridge | 0.7220 | 0.0699 | 0.0937 | Useful sanity-check baseline: strong enough to prove there was real signal, but too linear to catch the messy interactions. |
| XGBoost | 0.7664 | 0.0483 | 0.0859 | Reasonable benchmark, but not especially comfortable on this split. |
| CatBoost | 0.8564 | 0.0468 | 0.0673 | Solid tree baseline: handled the richer feature mix better than XGBoost, but still trailed the strongest entries. |
| Random Forest | 0.8726 | 0.0450 | 0.0634 | Best tree model in the field: a huge jump over Ridge, then mostly flat once the obvious structure was learned. |
| TabICL | 0.9092 | 0.0369 | 0.0535 | Strong runner-up, ahead of every tree; made the foundation-model story harder to dismiss as a one-off. |
| TabPFN (best) | 0.9217 | 0.0309 | 0.0497 | The clean winner once the richer recovery context and expanded field were in place; won every metric on the same holdout. |
- MAE: -31.3%. TabPFN cut average absolute error versus the forest.
- RMSE: -21.6%. Big misses dropped too, not just the average miss.
- R²: +4.9 pts. Same target, same split, materially better fit.
Model ladder
The scoreboard was not subtle
The detailed metric table above already carries the full numbers. This section is just the shape of the race: a sane baseline, stronger trees, then two foundation-model entries on the exact same chronological holdout.
| Model | Rank | R² gain over previous rung | MAE |
|---|---|---|---|
| Ridge | 6 | baseline | 0.0699 |
| XGBoost | 5 | +0.0444 | 0.0483 |
| CatBoost | 4 | +0.0900 | 0.0468 |
| Random Forest | 3 | +0.0162 | 0.0450 |
| TabICL | 2 | +0.0366 | 0.0369 |
| TabPFN | 1 | +0.0125 | 0.0309 |
The useful visual read is compact: trees found most of the obvious structure, then the two foundation-model entries stepped past the whole classical field on the exact same setup.
The jump from Ridge to Random Forest was the big "okay, there is real non-linear structure in here" moment.
The current version of the comparison made the result harder to wave away. TabICL reached 0.9092 R², CatBoost 0.8564, and XGBoost 0.7664 on the same split, so TabPFN was not winning against a strawman field.
The jump from Random Forest to TabPFN was still the more interesting one. That was not a hyperparameter trick. It was a different way of seeing the table.
Prior Labs describes TabPFN as a tabular foundation model, and their docs frame it as a pre-trained transformer that does not need dataset-specific training in the usual sense. Instead, it applies learned priors to a new table in-context and produces predictions in a single forward pass. That description still matches what this project felt like in practice: once the obvious engineered features were present, TabPFN kept converting extra context into real gains after the tree models had mostly flattened out. TabICL finishing second only made that pattern more interesting (Refs 5-6). See the Prior Labs overview for the product-level explanation.
The diminishing-returns curve is the real lesson:
- Better algorithm choice still got the first huge lift.
- The expanded six-model field did not dethrone TabPFN; it confirmed the win.
- The richer feature set widened the context more than it changed the headline.
- The next serious gains probably require deliberate new logging like compliance, protein, and subjective RPE/RIR, not more cleverness squeezed out of the same exports.
The part that surprised me
The global charts were useful, but the project really got fun once the weird findings started showing up.
Findings deck
The signals that kept surviving contact with the data
Recovery
Rest windows mattered more than gym folklore wants to admit. Peak bucket: 11-14 days (91.2% median, 21.3% PR rate, n=207). That was the strongest global bucket, although the everyday sweet spot for big lifts was usually tighter per exercise.

Intra-session rest
Heavy sets liked patience, but not a full weather report between efforts. Peak bucket: 2.5-3.5 min (89.7% median, 17.0% PR rate, n=470). 3.5-4.5 minutes nearly matched it on median performance, so the real message is to stop rushing serious work sets.

Volume
High-quality sessions tended to look like actual sessions, not drive-bys. Peak bucket: 7-8 sets (92.5% median, 9.7% PR rate, n=93). The 9+ bucket was even higher on median but tiny. Seven or eight total sets looks like the safer global read.

Cardio interference
The weirdest winner was cardio 24-48 hours before lifting. Peak bucket: 24-48h (94.3% median, 22.6% PR rate, n=53). Most of those windows were anchored by running, which is a fun way to annoy anyone still yelling "cardio kills gains."

Warmups
Warmups helped a little. Warmup maximalism did not earn a free pass. Peak bucket: 1-2 warmups (90.3% median, 10.7% PR rate, n=225). The 5+ warmup bucket looked amazing, but only nine sessions lived there. That is a curiosity, not a prescription.

Fatigue profile: potentiation lifts
Squat -4.1%, Deadlift -4.0%. Negative fatigue-drop means the last work set was stronger than the first. Squat and deadlift tended to wake up as the session went on.

Fatigue profile: fade by the last set
Bench +3.3%, Press +2.2%, Pull-Up +2.4%. Bench, press, incline dumbbell press, and pull-ups usually bled a little output across work sets instead of ramping upward.
The headline signals looked like this:
- Recovery: 11-14 days between exposures was the strongest global bucket at 91.2% median performance and a 21.3% PR rate.
- Work-set rest: 2.5-3.5 minutes led on the composite score, with 3.5-4.5 minutes nearly identical on median performance.
- Volume: 7-8 total sets was the strongest safe global set-count bucket.
- Warmups: 1-2 warmup sets beat zero warmups with a much healthier sample than the flashy 5+ bucket.
- Cardio timing: 24-48h before lifting was the strongest cardio bucket in the entire joined dataset.
That last one deserved its own panel.
Cardio timing
The 24-48 hour cardio window got all the swagger
The useful framing here is not "cardio is magic." It is "cardio was not obviously destructive in this log, and the best historical window sat a day or two before the lift." Most of the matched cardio sessions were runs.
- No cardio before (the baseline): 88.0% median, 13.4% PR rate.
- Cardio 24-48h before (best bucket): 94.3% median, 22.6% PR rate. The strongest historical mix in this dataset.
- Cardio <24h before (same-day orbit): 90.4% median, 14.6% PR rate. Still fine, just not as strong as the 24-48h window.
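For anyone rebuilding this kind of lookback, the bucket is a backward as-of merge between the lifting and cardio timelines. A sketch assuming illustrative `start` columns rather than the project's real schema:

```python
import pandas as pd

def cardio_window(lifts: pd.DataFrame, cardio: pd.DataFrame) -> pd.DataFrame:
    """Bucket each lifting session by hours since the most recent cardio."""
    lifts = lifts.sort_values("start")
    cardio = cardio.sort_values("start").rename(columns={"start": "cardio_start"})
    # For each lift, find the latest cardio session at or before it.
    joined = pd.merge_asof(
        lifts, cardio[["cardio_start"]],
        left_on="start", right_on="cardio_start", direction="backward",
    )
    hours = (joined["start"] - joined["cardio_start"]).dt.total_seconds() / 3600
    # Lifts with no prior cardio fall out of the bins as NaN -> "none".
    joined["cardio_bucket"] = (
        pd.cut(hours, bins=[0, 24, 48, float("inf")],
               labels=["<24h", "24-48h", ">48h"])
        .cat.add_categories("none").fillna("none")
    )
    return joined
```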
The cardio result is the one most likely to trigger arguments in a group chat, so it is worth saying carefully.
This project did not prove that cardio causes better lifting performance.
It did show that in this log, the strongest historical lift sessions were more likely to show up when cardio had happened 24-48 hours earlier rather than not at all or inside the previous 24 hours. That bucket landed at 94.3% median performance and a 22.6% PR rate, versus 88.0% and 13.4% for sessions with no recent cardio. Most of the joined cardio windows were tied to running rather than cycling, swimming, or rowing.
That finding has a few plausible explanations:
- light-to-moderate cardio might have acted like active recovery
- good schedule discipline may have clustered cardio and lifting around generally better weeks
- the cardio window may simply be standing in for other unlogged variables like sleep, food, or readiness
All three are believable. The useful part is not pretending to know which one is true. The useful part is admitting the old "cardio kills gains" shortcut did not survive this particular dataset.
That also fits the broader concurrent-training literature: cardio is not automatically a strength killer. In real terms, an easy run a day or two before lifting is a very different thing from piling hard endurance work onto the same window as a heavy lifting session, and the outcome depends on how much cardio you do, how hard it is, and when it lands (Ref 7).
The lift-specific playbooks were better than the global averages
Global patterns are great for orientation, but they are still averages. The more useful output was the per-exercise layer, because it split each lift into two different ideas:
- the strong-session range, which is the safer everyday default
- the best PR bucket, which is the narrower pocket where PRs were most likely to happen
Those are not the same thing, and that difference might be the cleanest programming lesson in the whole project.
Exercise atlas
The main lifts turned into actual playbooks
Bench Press (138 sessions / 6 PRs)
Default range: 5-8d between exposures, 3.4-4.2 min rest, 6-8 total sets, 3-6 top-set reps, 100-109 kg.
PR bucket: 3-4 days, 3.5-4.5 min rest, 7-8 sets, and a 1-3 rep top set.
Classic example of why the PR bucket is not the everyday plan. The best spikes were narrower than the broader strong-session range.

Squat (107 sessions / 3 PRs)
Default range: 5-7d between exposures, 5.0-5.6 min rest, 5-6 sets, 6-15 top-set reps, 130-150 kg.
PR bucket: 11-14 days, 4.5-6 min rest, 7-8 sets, 13-15 reps.
The global recovery winner shows up hard here, but the day-to-day squat story is still tighter than the peaking window.

Deadlift (50 sessions / 6 PRs)
Default range: 6-12d between exposures, 4.0-5.0 min rest, 6-7 sets, 6-8 reps, 120-134 kg.
PR bucket: 5-7 days, 3.5-4.5 min rest, 7-8 sets, 7-9 reps.
One of the cleaner signals in the file. The strong-session range and the PR bucket mostly point in the same direction.

Military Press (81 sessions / 2 PRs)
Default range: 7-12d between exposures, 3.2-3.7 min rest, 5 sets, 7-10 top-set reps, 55-60 kg.
PR bucket: 5-7 days, 3.5-4.5 min rest, 5-6 sets, 4-6 reps.
The happiest recent trend line in the spotlight lifts, and another case where tighter peaking conditions make sense near the ceiling.

Incline Dumbbell Press (80 sessions / 6 PRs)
Default range: 6-8d between exposures, 3.5-4.3 min rest, 5-6 sets, 8-12 reps, 34-38 kg.
PR bucket: 8-10 days, 2.5-3.5 min rest, 5-6 sets, 7-9 reps.
Quietly one of the most reliable lifts in the project: enough history, enough PRs, and a calm default range.

Pull-Up (123 sessions / 3 PRs)
Default range: 2-8d between exposures, 3.1-4.0 min rest, 3-4 sets, 5-8 reps, 11-15 kg.
PR bucket: <=2 days, 6+ min rest, 1-2 sets, 4-6 reps.
The PR bucket looks almost rude compared with the default range. That is another reminder to separate peaking from sustainable programming.
Bench and squat made that distinction obvious.
Bench had a broader strong-session range of 5-8 days, 6-8 sets, and 3-6 top-set reps, but its PR bucket tightened down to 3-4 days, 7-8 sets, and a 1-3 rep top set.
Squat did something similar from the other direction. The broader strong-session range lived at 5-7 days, while the best PR bucket stretched out to 11-14 days with 7-8 sets and 13-15 reps.
That is exactly the kind of subtlety that gets lost when people flatten training advice into one rule per lift. PR conditions are often narrower and stranger than sustainable training defaults.
Deadlift was the cleanest of the spotlight lifts because the strong-session range and the PR bucket mostly pointed in the same direction. Military press had the nicest recent trend line. Pull-ups looked good as a reminder that a narrow PR bucket can be very different from a sane weekly setup.
What got killed, what stayed true
The fun part was watching gym stories run into actual evidence.
Not every intuition got wrecked, but a few of them definitely left with a limp.
Did the fancy model actually matter?
In the expanded six-model field, TabPFN finished at 0.9217 R² versus 0.9092 for TabICL and 0.8726 for Random Forest, while cutting MAE by roughly 31.3% versus the forest. This was not cosmetics.
Did cardio hurt lifting?
The strongest cardio bucket was 24-48 hours before lifting, not "none." That does not prove causality, but it absolutely fails to support the lazy default story.
Should the best PR bucket become the whole program?
Bench, squat, and pull-up all showed a meaningful gap between the broader strong-session range and the narrower PR bucket. Peak conditions are not the same thing as default conditions.
Did massive warmups earn a gold star?
Five or more warmup sets looked incredible on paper, but the bucket only had n=9. The believable read is that 1-2 warmup sets beat zero, not that everyone should suddenly perform a small play before each top set.
Feature engineering mattered more than hero worship
The project is a nice advertisement for TabPFN, but the more transferable lesson is about feature work and clean framing.
The model never saw "Bench Press, 100 kg, 5 reps" as a raw row and spontaneously became insightful. It saw a mix of signals from three different moments:
- Before the session: days since the last exposure, exposure count, cardio proximity and duration, bodyweight, sleep stages, resting heart rate, HRV, respiratory rate, and steps
- During the session: average work-set rest, number of sets and work sets, top-set reps, warmup count and warmup ratio, exercise order in the session, and Apple Watch effort / heart-rate signals during the session
- Relative to prior history: intensity versus the prior best, performance versus the prior session, and fatigue drop from first to last work set
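One of those "relative to prior history" columns, the fatigue drop, is almost a one-liner once warmups are separated from work sets. A sketch with an assumed sign convention matching the percentages quoted earlier:

```python
def fatigue_drop(work_set_loads: list[float]) -> float:
    """Fractional drop from first to last work set.

    Positive = output faded across the session (the bench/press pattern);
    negative = the last set was stronger (the squat/deadlift
    'potentiation' pattern). Input is assumed warmup-free.
    """
    first, last = work_set_loads[0], work_set_loads[-1]
    return (first - last) / first
```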
That is the real craft. If the table is badly expressed, the model ceiling is low no matter how fancy the model sounds. It is also why the headline number should be read as "completed session quality was highly explainable on later holdout data," not "I built a pure pre-workout oracle."
Random Forest made that painfully clear. It extracted the big obvious wins and then mostly flattened out. TabPFN and TabICL were the only models that kept turning the richer contextual columns into clear extra signal, with TabPFN finishing best overall.
Caveats, because grown-ups are allowed to have them
Keep your head on straight
Great fit is not the same thing as clean causality
The target is historical performance versus your own best, not a lab-grade readiness measure. Small buckets still exist. Sleep, bodyweight, and Watch effort helped, but subjective RPE/RIR, soreness, nutrition, and compliance tracking still did not. This is a very strong personal decision-support system, not a universal theory of strength training.
There were a few specific caveats worth keeping visible:
- This is all correlational. Better historical association is not the same thing as mechanism.
- Some of the strongest model scores rely on features observed during the session, so they should not be read as pure pre-workout forecasts.
- The target naturally favors the lifter's personal history, so experience level leaks into the prediction in useful but slightly flattering ways.
- Some of the most exciting buckets are still small.
- Subjective RPE/RIR, soreness, protein intake, and explicit compliance tracking were not in the table.
- This is deeply useful as an n=1 decision system and much less useful as a universal rulebook.
What I would do next
If I were continuing this experiment tomorrow, I still would not spend the day trying to squeeze another half-point out of model tuning. The current pass already has six models and a much richer feature table than the original draft.
The next missing layer is less about passive context and more about deliberate logging:
- Subjective RPE or RIR so objective Watch effort can be compared with how the session actually felt.
- Compliance tracking so the next version can compare recommended spacing with actual spacing instead of guessing.
- Protein and nutrition logging so recovery is not inferred entirely from physiology and schedule.
- Soreness or readiness notes so the model can separate a bad plan from a bad day.
- A deliberate 8-12 week block that follows the default ranges, because then the next export becomes a proper follow-up experiment instead of another pile of mixed behavior.
That is the part I liked most about this day of messing with the project: it stopped feeling like retroactive analysis and started feeling like a loop.
Export. Join. Model. Learn. Train differently. Export again.
That is a much better use of a lifting log than letting it sit there as a digital notebook full of vibes.
References
- Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55-67.
- Breiman L. Random Forests. Machine Learning. 2001;45:5-32.
- Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv. 2016. doi:10.48550/arXiv.1603.02754.
- Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
- Hollmann N, Müller S, Purucker L, et al. Accurate predictions on small data with a tabular foundation model. Nature. 2025;637:319-326.
- Qu J, Holzmüller D, Varoquaux G, Le Morvan M. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. arXiv. 2025. doi:10.48550/arXiv.2502.05564.
- Wilson JM, Marin PJ, Rhea MR, Wilson SMC, Loenneke JP, Anderson JC. Concurrent training: a meta-analysis examining interference of aerobic and resistance exercises. J Strength Cond Res. 2012;26(8):2293-2307.