Setgraph, Apple Health, TabPFN, and the day my lifting log got serious
I pulled my Setgraph history and Apple Health context into one table, rebuilt 1,548 exercise-sessions, benchmarked six tabular models, and learned three things fast: completed session quality was more explainable than I expected, cardio 24-48 hours before lifting looked surprisingly good, and personal-record windows were narrower than normal training defaults.
On this page
- The project, minus the incense
- The first punchline: later sessions were more explainable than I expected
- The part that surprised me
- The lift-specific playbooks were better than the global averages
- What got killed, what stayed true
- Feature engineering mattered more than hero worship
- Caveats, because grown-ups are allowed to have them
- What I would do next
- References
Based on personal experiments with my own training data. This is an n=1 retrospective analysis, not causal evidence or coaching advice. Any opinions here are my own and do not represent my employer.
I wanted to know if a good lifting day was predictable before it happened.
I joined Setgraph with Apple Health, rebuilt the sessions into a usable feature table, and tested the whole thing on future sessions only. The short version: yes, session quality was predictable, and a few of the strongest signals were not the ones gym folklore would have picked.
- Setgraph sets: 6,290. 93 exercises logged from 2023-04-28 through 2026-03-16.
- Health activity rows: 1,773. The current pass also folds in bodyweight, sleep, HRV, effort, and heart-rate context.
- Modeled exercise-sessions: 1,548, of which 1,455 had prior history, which made them usable for the holdout test.
- Source A: Setgraph export. Reps, load, timestamps, rest gaps, and the full set-by-set trail.
- Source B: Apple Health export. Cardio, sleep, HRV, steps, bodyweight, and Watch signals stitched into the recovery context.
- Inference engine: TabPFN. A pre-trained tabular transformer sitting on top of the engineered feature stack, tested on future sessions only.

The headlines:

- TabPFN won the six-model field at R² = 0.9217.
- The table ended up with 49 model columns.
- Cardio 24-48h before lifts looked strongest.
- Everyday defaults and PR buckets are not the same thing.
Quick glossary
- PR (personal record): your best result to date for that lift or exercise variation.
- R² (R-squared): how much of the session-to-session variation the model explains. Closer to 1 is better.
- MAE (mean absolute error): the average miss between the model's prediction and the actual session result. Lower is better.
- RMSE (root mean squared error): like MAE, but bigger misses count extra. Lower is better.
- HRV (heart rate variability): variation in the time between heartbeats. Higher values often line up with better recovery and lower fatigue.
- RHR (resting heart rate): your baseline heart rate at rest. A spike above normal can hint at stress, illness, or poor recovery.
- RPE / RIR (rate of perceived exertion / reps in reserve): two ways lifters describe difficulty: how hard a set felt, or how many reps were left in the tank.
- Holdout split (chronological holdout): the future-only test set. Here, the last 20% of each exercise history was held back for evaluation.
- TabICL (in-context tabular learner): a foundation-model-style tabular predictor that finished second in this run.
- TabPFN (tabular foundation model): a pre-trained model for tabular data from Prior Labs. It was the strongest model on this holdout split.
I wanted a blunt answer: how much of a lifting session's quality can I actually anticipate from my log and recovery context?
More broadly, I wanted to know what I could actually tweak in future plans while keeping the whole thing honest, even if this started as a local n=1 experiment with my own data.
The inputs were ordinary enough: a Setgraph export for the lifting history, an Apple Health export for the surrounding cardio, and a lot of feature work to turn both into a table a model could actually use.
By the current pass, that table was no longer just lifting plus cardio. It also carried interpolated bodyweight, sleep stages, HRV, resting heart rate, respiratory rate, steps, and Apple Watch effort and heart-rate signals.
The answer was: more than I expected, with an important catch. On a chronological holdout that forced the models to score later sessions, the best setup explained 92.17% of the variance in the performance_vs_best_clipped session target.
That was not a pure before-the-workout forecast. The feature table mixed signals available before the session with signals only revealed during the session itself. Even so, the result was not model magic. It came from rebuilt sessions, explicit rest windows, warmups separated from work, fatigue estimated from first-versus-last work sets, 24h/48h cardio lookbacks, and the added recovery context from bodyweight and Apple Health. Once the table was clean, TabPFN finished first in an expanded six-model field, with TabICL also landing ahead of every tree baseline.
It is also worth being honest about the subject. This was a dad-of-two training log from a stretch where recovery was not always textbook, which is exactly why the added bodyweight and readiness context mattered. Some of the flatter-looking sessions make more sense once you stop pretending every week happened under identical conditions.
The project, minus the incense
At the raw-file level this was:
- Setgraph_Set_Export_2026-03-18.csv with 6,290 sets
- Setgraph_Sessions_Export_2026-03-18.csv with 1,773 Apple Health activity rows
- A derived table of 1,548 exercise-sessions spanning 2023-04-28 to 2026-03-16
- 1,455 modeled sessions with prior history and a 298-row chronological holdout
- A 49-column model feature set spanning lift structure, cardio timing, bodyweight, sleep, HRV/RHR, steps, and Watch effort / heart-rate context
Some of that context existed before training started. Some of it only existed once the session was underway or already finished. That distinction matters if you care about true pre-session prediction rather than post hoc explanation.
The target was pragmatic: how close each exercise-session came to the best prior performance for that exercise. A value of 1.0 means "matched the lifetime best to date." Values above 1.0 are personal-record (PR) territory. Values below 1.0 mean the day landed somewhere under the existing ceiling.
That target is not perfect, but it is useful. It asks the question I actually care about when I look at a session: was this close to my best recent version of this lift or not?
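A minimal sketch of that target, assuming a per-session frame with illustrative `exercise`, `date`, and `performance` columns; the clip threshold is my guess, not the project's actual value:

```python
import pandas as pd

def add_target(sessions: pd.DataFrame) -> pd.DataFrame:
    """Compute performance vs. best-so-far per exercise, clipped.

    Assumes one row per exercise-session with a numeric `performance`
    column (e.g. top-set estimated 1RM). Column names are illustrative.
    """
    sessions = sessions.sort_values(["exercise", "date"]).copy()
    # Best performance *before* this session; shift(1) avoids leaking
    # today's result into its own target.
    prior_best = (
        sessions.groupby("exercise")["performance"]
        .transform(lambda s: s.shift(1).cummax())
    )
    ratio = sessions["performance"] / prior_best
    # Clip extreme ratios so early noisy history does not dominate.
    sessions["performance_vs_best_clipped"] = ratio.clip(upper=1.25)
    return sessions
```

The first exposure of each exercise has no prior best, which is why only the 1,455 sessions with history were usable for modeling.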
Pipeline sketch
What the project actually did
Step 01
Export the raw log
Every set became a row: exercise, timestamp, reps, load, and enough history to rebuild a training day.
Step 02
Layer recovery context
Cardio timing came first, then the table widened to bodyweight, sleep, HRV, RHR, steps, and Watch effort / heart-rate signals.
Step 03
Build feature columns
The script promoted the raw log into 49 model columns: spacing, work-rest, session volume, warmups, fatigue, sleep, bodyweight, and recovery signals.
Step 04
Score the future
A six-model field had to predict held-out future sessions rather than shuffled history.
The interesting structural choice was the split. Instead of shuffling rows randomly, the project held out the last 20% of each exercise history. That matters. Random shuffles flatter models by letting them peek around the future. This setup forced every model to prove itself on later data.
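That per-exercise chronological split is a few lines of pandas. A sketch assuming illustrative `exercise` and `date` columns, not the project's actual schema:

```python
import pandas as pd

def chronological_holdout(df: pd.DataFrame, frac: float = 0.2):
    """Hold out the last `frac` of each exercise's history, no shuffling."""
    df = df.sort_values(["exercise", "date"])
    # Position of each session within its exercise history...
    pos = df.groupby("exercise").cumcount()
    # ...and the length of that history, broadcast back to every row.
    n = df.groupby("exercise")["date"].transform("size")
    # The final 20% (by position) of each exercise becomes the test set.
    is_test = pos >= (1 - frac) * n
    return df[~is_test], df[is_test]
```

Because the cut is per exercise, every lift contributes both training history and future test sessions, and no model ever scores a session older than the ones it learned from.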
The first punchline: later sessions were more explainable than I expected
I like starting tabular work with a ladder instead of a leap.
- Use a linear model to see whether the obvious signal is real.
- Use a strong tree model to check whether non-linearity matters.
- Use a foundation model to see whether there is structure the first two are still leaving on the table.
That is exactly what happened here.
For reference, the model field mixed ridge regression, random forests, the benchmark boosted-tree families behind XGBoost and CatBoost, and the newer tabular foundation-model papers that made TabPFN and TabICL worth testing in the first place (Refs 1-6).
Chronological holdout
Six models, one target, one clear winner
Higher R² is better. Lower MAE and RMSE are better.
| Model | R² | MAE | RMSE | Read |
|---|---|---|---|---|
| Ridge | 0.7220 | 0.0699 | 0.0937 | Useful sanity-check baseline: strong enough to prove there was real signal, but too linear to catch the messy interactions. |
| XGBoost | 0.7664 | 0.0483 | 0.0859 | Reasonable benchmark, but not especially comfortable on this split. |
| CatBoost | 0.8564 | 0.0468 | 0.0673 | Solid tree baseline: handled the richer feature mix better than XGBoost, but still trailed the strongest entries. |
| Random Forest | 0.8726 | 0.0450 | 0.0634 | Best tree model in the field: a huge jump over Ridge, then mostly flat once the obvious structure was learned. |
| TabICL | 0.9092 | 0.0369 | 0.0535 | Strong runner-up, ahead of every tree; made the foundation-model story harder to dismiss as a one-off. |
| TabPFN (best) | 0.9217 | 0.0309 | 0.0497 | The clean winner once the richer recovery context and expanded field were in place; won every metric on the same holdout. |
- MAE: -31.3%. TabPFN cut average absolute error versus the forest.
- RMSE: -21.6%. Big misses dropped too, not just the average miss.
- R²: +4.9 pts. Same target, same split, materially better fit.
Model ladder
The scoreboard was not subtle
The detailed metric table above already carries the full numbers. This section is just the shape of the race: a sane baseline, stronger trees, then two foundation-model entries on the exact same chronological holdout.
| Model | Rank | R² gain over previous rung | MAE |
|---|---|---|---|
| Ridge | 6 | baseline | 0.0699 |
| XGBoost | 5 | +0.0444 | 0.0483 |
| CatBoost | 4 | +0.0900 | 0.0468 |
| Random Forest | 3 | +0.0162 | 0.0450 |
| TabICL | 2 | +0.0366 | 0.0369 |
| TabPFN | 1 | +0.0125 | 0.0309 |
The useful visual read is compact: trees found most of the obvious structure, then the two foundation-model entries stepped past the whole classical field on the exact same setup.
The jump from Ridge to Random Forest was the big "okay, there is real non-linear structure in here" moment.
The current version of the comparison made the result harder to wave away. TabICL reached 0.9092 R², CatBoost 0.8564, and XGBoost 0.7664 on the same split, so TabPFN was not winning against a strawman field.
The jump from Random Forest to TabPFN was still the more interesting one. That was not a hyperparameter trick. It was a different way of seeing the table.
Prior Labs describes TabPFN as a tabular foundation model, and their docs frame it as a pre-trained transformer that does not need dataset-specific training in the usual sense. Instead, it applies learned priors to a new table in-context and produces predictions in a single forward pass. That description still matches what this project felt like in practice: once the obvious engineered features were present, TabPFN kept converting extra context into real gains after the tree models had mostly flattened out. TabICL finishing second only made that pattern more interesting (Refs 5-6). See the Prior Labs overview for the product-level explanation.
The diminishing-returns curve is the real lesson:
- Better algorithm choice still got the first huge lift.
- The expanded six-model field did not dethrone TabPFN; it confirmed the win.
- The richer feature set widened the context more than it changed the headline.
- The next serious gains probably require deliberate new logging like compliance, protein, and subjective RPE/RIR, not more cleverness squeezed out of the same exports.
The part that surprised me
The global charts were useful, but the project really got fun once the weird findings started showing up.
Findings deck
The signals that kept surviving contact with the data
Recovery
Rest windows mattered more than gym folklore wants to admit. Peak bucket: 11-14 days (91.2% median, 21.3% PR rate, n=207). That was the strongest global bucket, although the everyday sweet spot for big lifts was usually tighter per exercise.

Intra-session rest
Heavy sets liked patience, but not a full weather report between efforts. Peak bucket: 2.5-3.5 min (89.7% median, 17.0% PR rate, n=470). 3.5-4.5 minutes nearly matched it on median performance, so the real message is to stop rushing serious work sets.

Volume
High-quality sessions tended to look like actual sessions, not drive-bys. Peak bucket: 7-8 sets (92.5% median, 9.7% PR rate, n=93). The 9+ bucket was even higher on median but tiny. Seven or eight total sets looks like the safer global read.

Cardio interference
The weirdest winner was cardio 24-48 hours before lifting. Peak bucket: 24-48h (94.3% median, 22.6% PR rate, n=53). Most of those windows were anchored by running, which is a fun way to annoy anyone still yelling "cardio kills gains."

Warmups
Warmups helped a little. Warmup maximalism did not earn a free pass. Peak bucket: 1-2 warmups (90.3% median, 10.7% PR rate, n=225). The 5+ warmup bucket looked amazing, but only nine sessions lived there. That is a curiosity, not a prescription.

Fatigue profile: potentiation lifts
Squat -4.1%, Deadlift -4.0%. Negative fatigue-drop means the last work set was stronger than the first. Squat and deadlift tended to wake up as the session went on.

Fatigue profile: fade by the last set
Bench +3.3%, Press +2.2%, Pull-Up +2.4%. Bench, press, incline dumbbell press, and pull-ups usually bled a little output across work sets instead of ramping upward.
The headline signals looked like this:
- Recovery: 11-14 days between exposures was the strongest global bucket at 91.2% median performance and a 21.3% PR rate.
- Work-set rest: 2.5-3.5 minutes led on the composite score, with 3.5-4.5 minutes nearly identical on median performance.
- Volume: 7-8 total sets was the strongest safe global set-count bucket.
- Warmups: 1-2 warmup sets beat zero warmups with a much healthier sample than the flashy 5+ bucket.
- Cardio timing: 24-48h before lifting was the strongest cardio bucket in the entire joined dataset.
That last one deserved its own panel.
Cardio timing
The 24-48 hour cardio window got all the swagger
The useful framing here is not "cardio is magic." It is "cardio was not obviously destructive in this log, and the best historical window sat a day or two before the lift." Most of the matched cardio sessions were runs.
- No cardio before (the baseline): 88.0% median, 13.4% PR rate.
- Cardio 24-48h before (best bucket): 94.3% median, 22.6% PR rate. The strongest historical mix in this dataset.
- Cardio <24h before (same-day orbit): 90.4% median, 14.6% PR rate. Still fine, just not as strong as the 24-48h window.
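For anyone rebuilding this kind of lookback, the bucket is a backward as-of merge between the lifting and cardio timelines. A sketch assuming illustrative `start` columns rather than the project's real schema:

```python
import pandas as pd

def cardio_window(lifts: pd.DataFrame, cardio: pd.DataFrame) -> pd.DataFrame:
    """Bucket each lifting session by hours since the most recent cardio."""
    lifts = lifts.sort_values("start")
    cardio = cardio.sort_values("start").rename(columns={"start": "cardio_start"})
    # For each lift, find the latest cardio session at or before it.
    joined = pd.merge_asof(
        lifts, cardio[["cardio_start"]],
        left_on="start", right_on="cardio_start", direction="backward",
    )
    hours = (joined["start"] - joined["cardio_start"]).dt.total_seconds() / 3600
    # Lifts with no prior cardio fall out of the bins as NaN -> "none".
    joined["cardio_bucket"] = (
        pd.cut(hours, bins=[0, 24, 48, float("inf")],
               labels=["<24h", "24-48h", ">48h"])
        .cat.add_categories("none").fillna("none")
    )
    return joined
```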
The cardio result is the one most likely to trigger arguments in a group chat, so it is worth saying carefully.
This project did not prove that cardio causes better lifting performance.
It did show that in this log, the strongest historical lift sessions were more likely to show up when cardio had happened 24-48 hours earlier rather than not at all or inside the previous 24 hours. That bucket landed at 94.3% median performance and a 22.6% PR rate, versus 88.0% and 13.4% for sessions with no recent cardio. Most of the joined cardio windows were tied to running rather than cycling, swimming, or rowing.
That finding has a few plausible explanations:
- light-to-moderate cardio might have acted like active recovery
- good schedule discipline may have clustered cardio and lifting around generally better weeks
- the cardio window may simply be standing in for other unlogged variables like sleep, food, or readiness
All three are believable. The useful part is not pretending to know which one is true. The useful part is admitting the old "cardio kills gains" shortcut did not survive this particular dataset.
That also fits the broader concurrent-training literature: cardio is not automatically a strength killer. In real terms, an easy run a day or two before lifting is a very different thing from piling hard endurance work onto the same window as a heavy lifting session, and the outcome depends on how much cardio you do, how hard it is, and when it lands (Ref 7).
The lift-specific playbooks were better than the global averages
Global patterns are great for orientation, but they are still averages. The more useful output was the per-exercise layer, because it split each lift into two different ideas:
- the strong-session range, which is the safer everyday default
- the best PR bucket, which is the narrower pocket where PRs were most likely to happen
Those are not the same thing, and that difference might be the cleanest programming lesson in the whole project.
Exercise atlas
The main lifts turned into actual playbooks
Bench Press (138 sessions / 6 PRs)
Default range: 5-8d between exposures, 3.4-4.2 min rest, 6-8 total sets, 3-6 top-set reps, 100-109 kg.
PR bucket: 3-4 days, 3.5-4.5 min rest, 7-8 sets, and a 1-3 rep top set.
Classic example of why the PR bucket is not the everyday plan. The best spikes were narrower than the broader strong-session range.

Squat (107 sessions / 3 PRs)
Default range: 5-7d between exposures, 5.0-5.6 min rest, 5-6 sets, 6-15 top-set reps, 130-150 kg.
PR bucket: 11-14 days, 4.5-6 min rest, 7-8 sets, 13-15 reps.
The global recovery winner shows up hard here, but the day-to-day squat story is still tighter than the peaking window.

Deadlift (50 sessions / 6 PRs)
Default range: 6-12d between exposures, 4.0-5.0 min rest, 6-7 sets, 6-8 reps, 120-134 kg.
PR bucket: 5-7 days, 3.5-4.5 min rest, 7-8 sets, 7-9 reps.
One of the cleaner signals in the file. The strong-session range and the PR bucket mostly point in the same direction.

Military Press (81 sessions / 2 PRs)
Default range: 7-12d between exposures, 3.2-3.7 min rest, 5 sets, 7-10 top-set reps, 55-60 kg.
PR bucket: 5-7 days, 3.5-4.5 min rest, 5-6 sets, 4-6 reps.
The happiest recent trend line in the spotlight lifts, and another case where tighter peaking conditions make sense near the ceiling.

Incline Dumbbell Press (80 sessions / 6 PRs)
Default range: 6-8d between exposures, 3.5-4.3 min rest, 5-6 sets, 8-12 reps, 34-38 kg.
PR bucket: 8-10 days, 2.5-3.5 min rest, 5-6 sets, 7-9 reps.
Quietly one of the most reliable lifts in the project: enough history, enough PRs, and a calm default range.

Pull-Up (123 sessions / 3 PRs)
Default range: 2-8d between exposures, 3.1-4.0 min rest, 3-4 sets, 5-8 reps, 11-15 kg.
PR bucket: <=2 days, 6+ min rest, 1-2 sets, 4-6 reps.
The PR bucket looks almost rude compared with the default range. That is another reminder to separate peaking from sustainable programming.
Bench and squat made that distinction obvious.
Bench had a broader strong-session range of 5-8 days, 6-8 sets, and 3-6 top-set reps, but its PR bucket tightened down to 3-4 days, 7-8 sets, and a 1-3 rep top set.
Squat did something similar from the other direction. The broader strong-session range lived at 5-7 days, while the best PR bucket stretched out to 11-14 days with 7-8 sets and 13-15 reps.
That is exactly the kind of subtlety that gets lost when people flatten training advice into one rule per lift. PR conditions are often narrower and stranger than sustainable training defaults.
Deadlift was the cleanest of the spotlight lifts because the strong-session range and the PR bucket mostly pointed in the same direction. Military press had the nicest recent trend line. Pull-ups looked good as a reminder that a narrow PR bucket can be very different from a sane weekly setup.
What got killed, what stayed true
The fun part was watching gym stories run into actual evidence.
Not every intuition got wrecked, but a few of them definitely left with a limp.
Did the fancy model actually matter?
In the expanded six-model field, TabPFN finished at 0.9217 R² versus 0.9092 for TabICL and 0.8726 for Random Forest, while cutting MAE by roughly 31.3% versus the forest. This was not cosmetics.
Did cardio hurt lifting?
The strongest cardio bucket was 24-48 hours before lifting, not "none." That does not prove causality, but it absolutely fails to support the lazy default story.
Should the best PR bucket become the whole program?
Bench, squat, and pull-up all showed a meaningful gap between the broader strong-session range and the narrower PR bucket. Peak conditions are not the same thing as default conditions.
Did massive warmups earn a gold star?
Five or more warmup sets looked incredible on paper, but the bucket only had n=9. The believable read is that 1-2 warmup sets beat zero, not that everyone should suddenly perform a small play before each top set.
Feature engineering mattered more than hero worship
The project is a nice advertisement for TabPFN, but the more transferable lesson is about feature work and clean framing.
The model never saw "Bench Press, 100 kg, 5 reps" as a raw row and spontaneously became insightful. It saw a mix of signals from three different moments:
- Before the session: days since the last exposure, exposure count, cardio proximity and duration, bodyweight, sleep stages, resting heart rate, HRV, respiratory rate, and steps
- During the session: average work-set rest, number of sets and work sets, top-set reps, warmup count and warmup ratio, exercise order in the session, and Apple Watch effort / heart-rate signals during the session
- Relative to prior history: intensity versus the prior best, performance versus the prior session, and fatigue drop from first to last work set
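One of those "relative to prior history" columns, the fatigue drop, is almost a one-liner once warmups are separated from work sets. A sketch with an assumed sign convention matching the percentages quoted earlier:

```python
def fatigue_drop(work_set_loads: list[float]) -> float:
    """Fractional drop from first to last work set.

    Positive = output faded across the session (the bench/press pattern);
    negative = the last set was stronger (the squat/deadlift
    'potentiation' pattern). Input is assumed warmup-free.
    """
    first, last = work_set_loads[0], work_set_loads[-1]
    return (first - last) / first
```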
That is the real craft. If the table is badly expressed, the model ceiling is low no matter how fancy the model sounds. It is also why the headline number should be read as "completed session quality was highly explainable on later holdout data," not "I built a pure pre-workout oracle."
Random Forest made that painfully clear. It extracted the big obvious wins and then mostly flattened out. TabPFN and TabICL were the only models that kept turning the richer contextual columns into clear extra signal, with TabPFN finishing best overall.
Caveats, because grown-ups are allowed to have them
Keep your head on straight
Great fit is not the same thing as clean causality
The target is historical performance versus your own best, not a lab-grade readiness measure. Small buckets still exist. Sleep, bodyweight, and Watch effort helped, but subjective RPE/RIR, soreness, nutrition, and compliance tracking still did not. This is a very strong personal decision-support system, not a universal theory of strength training.
There were a few specific caveats worth keeping visible:
- This is all correlational. Better historical association is not the same thing as mechanism.
- Some of the strongest model scores rely on features observed during the session, so they should not be read as pure pre-workout forecasts.
- The target naturally favors the lifter's personal history, so experience level leaks into the prediction in useful but slightly flattering ways.
- Some of the most exciting buckets are still small.
- Subjective RPE/RIR, soreness, protein intake, and explicit compliance tracking were not in the table.
- This is deeply useful as an n=1 decision system and much less useful as a universal rulebook.
What I would do next
If I were continuing this experiment tomorrow, I still would not spend the day trying to squeeze another half-point out of model tuning. The current pass already has six models and a much richer feature table than the original draft.
The next missing layer is less about passive context and more about deliberate logging:
- Subjective RPE or RIR so objective Watch effort can be compared with how the session actually felt.
- Compliance tracking so the next version can compare recommended spacing with actual spacing instead of guessing.
- Protein and nutrition logging so recovery is not inferred entirely from physiology and schedule.
- Soreness or readiness notes so the model can separate a bad plan from a bad day.
- A deliberate 8-12 week block that follows the default ranges, because then the next export becomes a proper follow-up experiment instead of another pile of mixed behavior.
That is the part I liked most about this day of messing with the project: it stopped feeling like retroactive analysis and started feeling like a loop.
Export. Join. Model. Learn. Train differently. Export again.
That is a much better use of a lifting log than letting it sit there as a digital notebook full of vibes.
References
- Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55-67.
- Breiman L. Random Forests. Machine Learning. 2001;45:5-32.
- Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. arXiv. 2016. doi:10.48550/arXiv.1603.02754.
- Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
- Hollmann N, Müller S, Purucker L, et al. Accurate predictions on small data with a tabular foundation model. Nature. 2025;637:319-326.
- Qu J, Holzmüller D, Varoquaux G, Le Morvan M. TabICL: A Tabular Foundation Model for In-Context Learning on Large Data. arXiv. 2025. doi:10.48550/arXiv.2502.05564.
- Wilson JM, Marin PJ, Rhea MR, Wilson SMC, Loenneke JP, Anderson JC. Concurrent training: a meta-analysis examining interference of aerobic and resistance exercises. J Strength Cond Res. 2012;26(8):2293-2307.