A Tuesday evening Zwift ride. You finish, save the ride. Garmin uploads the FIT file to Garmin Connect. Zwift uploads its own FIT to Zwift's platform. Strava gets notified by both and creates a single activity record, then pacelore imports it via Strava OAuth. Three platforms, three representations of the same 90 minutes. Only one should exist in pacelore.

Why it's harder than it sounds

The obvious dedup key, the activity ID, doesn't work across platforms. Garmin assigns its own numeric ID. Zwift assigns a UUID. Strava assigns yet another numeric ID. There's no shared external identifier. The Garmin FIT file contains a local_activity_id field, but Strava doesn't expose it in its API response. Zwift's FIT has its own session UUID in the file header, but Strava doesn't pass that through either.

Sport type is another hazard. Garmin records the activity as cycling. Zwift categorizes it as virtual_ride. Strava translates Garmin's version to ride and Zwift's version to virtual_ride. When pacelore normalizes sport types, both become cycling, but a match that compares the raw sport field before normalization sees two different values and misses the duplicate.
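A minimal sketch of why the order matters, assuming a simple lookup table. The table name, the function name, and the fallback behavior are illustrative and not pacelore's actual normalizer; only the labels mentioned above are mapped.

```python
# Hypothetical mapping from raw platform sport labels to a normalized sport.
# Only the labels mentioned in the text are included.
RAW_TO_NORMALIZED = {
    "cycling": "cycling",       # Garmin
    "ride": "cycling",          # Strava's label for the Garmin-originated copy
    "virtual_ride": "cycling",  # Zwift, and Strava's Zwift-originated copy
}


def normalize_sport(raw: str) -> str:
    """Return the normalized sport, falling back to the lowercased raw label."""
    key = raw.strip().lower()
    return RAW_TO_NORMALIZED.get(key, key)


# Comparing raw labels misses the match; comparing normalized labels finds it.
raw_match = "ride" == "virtual_ride"
normalized_match = normalize_sport("ride") == normalize_sport("virtual_ride")
```

Here raw_match is False and normalized_match is True, which is the divergence described above.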

The two-pass strategy

Pass 1: exact external ID match

Every activity in D1 has external_source and external_id columns covered by a composite unique constraint. Before inserting any activity, check:

SELECT id FROM activities
WHERE external_source = ? AND external_id = ?
LIMIT 1;

If a row exists, skip the insert. This handles the case of re-importing from the same source — the Strava backfill being re-run, or a FIT file being uploaded twice. It's the fast path and handles most re-import scenarios.
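To try Pass 1 locally, here is a rough sketch using Python's built-in sqlite3 module as a stand-in for D1. The schema is an assumption that models only the columns named in this post, and find_by_external_id is a hypothetical helper name.

```python
import sqlite3

# Assumed minimal schema; only the columns named in the text are modeled.
SCHEMA = """
CREATE TABLE activities (
    id              INTEGER PRIMARY KEY,
    athlete_id      INTEGER NOT NULL,
    sport           TEXT    NOT NULL,
    start_time_unix INTEGER NOT NULL,
    external_source TEXT    NOT NULL,
    external_id     TEXT    NOT NULL,
    UNIQUE (external_source, external_id)
);
"""


def find_by_external_id(conn, source, external_id):
    """Pass 1: return the existing row id for this source/ID pair, or None."""
    row = conn.execute(
        "SELECT id FROM activities "
        "WHERE external_source = ? AND external_id = ? LIMIT 1",
        (source, external_id),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO activities "
    "(athlete_id, sport, start_time_unix, external_source, external_id) "
    "VALUES (?, ?, ?, ?, ?)",
    (1, "cycling", 1_700_000_000, "strava", "123456"),
)

existing = find_by_external_id(conn, "strava", "123456")  # already imported
missing = find_by_external_id(conn, "garmin", "123456")   # same ID, other source
```

The lookup is scoped to the source, so the same numeric ID arriving from a different platform is not treated as a match. The unique constraint acts as a backstop if an insert ever skips the check.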

Pass 2: time window + sport match

For activities with no external ID match in Pass 1, run a second query before inserting:

SELECT id FROM activities
WHERE athlete_id = ?
  AND sport = ?
  AND ABS(start_time_unix - ?) < 300
LIMIT 1;

The threshold is 300 seconds (5 minutes). If an activity with the same sport started within 5 minutes of the new activity, it's almost certainly the same effort. The new activity is dropped and the existing one is kept.

The 5-minute window is intentionally conservative. Because the comparison is between start times, two genuinely different activities of the same sport would collide only if the athlete finished the first in under 5 minutes and immediately started another, which is uncommon enough that the false-positive rate is acceptable. The sport check is applied after normalization, so virtual_ride and cycling both become cycling before the comparison.
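The Pass 2 query can be exercised the same way. This is a sketch against an in-memory SQLite table with an assumed, stripped-down schema; find_time_window_match and WINDOW_SECONDS are illustrative names, and the sport passed in is assumed to be normalized already.

```python
import sqlite3

WINDOW_SECONDS = 300  # the 5-minute threshold from the text


def find_time_window_match(conn, athlete_id, normalized_sport, start_time_unix):
    """Pass 2: return the id of a same-sport activity starting within the window."""
    row = conn.execute(
        "SELECT id FROM activities "
        "WHERE athlete_id = ? AND sport = ? "
        "AND ABS(start_time_unix - ?) < ? LIMIT 1",
        (athlete_id, normalized_sport, start_time_unix, WINDOW_SECONDS),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activities ("
    "id INTEGER PRIMARY KEY, athlete_id INTEGER, "
    "sport TEXT, start_time_unix INTEGER)"
)
conn.execute(
    "INSERT INTO activities (athlete_id, sport, start_time_unix) VALUES (?, ?, ?)",
    (1, "cycling", 1_700_000_000),
)

near = find_time_window_match(conn, 1, "cycling", 1_700_000_045)   # 45 s later
edge = find_time_window_match(conn, 1, "cycling", 1_700_000_300)   # exactly 300 s
other = find_time_window_match(conn, 1, "running", 1_700_000_045)  # different sport
```

The comparison is strict, so a start exactly 300 seconds away is not a match. Only the 45-second case returns the existing row.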

What we found in practice

The first real test account had been importing from Strava, uploading FIT files manually, and had previously connected a Zwift account. Running the two-pass dedup on the full activity history removed 71 duplicate activities. The breakdown:

  • 41 duplicates were caught by Pass 1 (same Strava ID re-imported from two different OAuth sessions)
  • 30 duplicates were caught by Pass 2 (Zwift activity appearing as both a direct FIT upload and a Strava-imported activity)

The 30 Pass-2 catches were all Zwift rides. The start time divergence within those pairs was typically under 60 seconds — Garmin and Zwift record slightly different start times because one starts the timer on button press and the other starts on first sensor movement.

Which copy to keep

When Pass 2 detects a duplicate, the question is which copy to keep. The existing record is always kept — it was already written, may have segment efforts computed against it, and may have been referenced in club feed events. The incoming duplicate is dropped.

There's a data quality argument for keeping the more data-rich copy instead: a direct FIT upload has native power data and cadence; the Strava-synthesized version may have gaps. This is a planned improvement — when a duplicate is detected, compare the sample density of both copies and merge the richer fields if they differ.
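Since this is not built yet, the following is only one possible shape for the comparison. It assumes each copy is available as a list of per-sample dicts with optional power and cadence keys, and it defines density as the fraction of samples carrying a value; both are assumptions made for illustration.

```python
def field_density(samples, field):
    """Fraction of samples carrying a non-None value for the field."""
    if not samples:
        return 0.0
    present = sum(1 for s in samples if s.get(field) is not None)
    return present / len(samples)


def richer_source(existing, incoming, field):
    """Pick the denser copy for one field; ties keep the existing copy."""
    if field_density(incoming, field) > field_density(existing, field):
        return "incoming"
    return "existing"


# Hypothetical sample streams: the Strava copy has gaps, the FIT copy does not.
strava_copy = [{"power": 200, "cadence": None}, {"power": None, "cadence": None}]
fit_copy = [{"power": 200, "cadence": 90}, {"power": 205, "cadence": 91}]

power_choice = richer_source(strava_copy, fit_copy, "power")
tie_choice = richer_source(fit_copy, fit_copy, "cadence")
```

Keeping the existing copy on ties preserves the current rule that the already-written record wins unless the incoming one is clearly richer.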

Cross-sport false positives

One failure mode: a brick workout (bike immediately followed by run). Pass 2 compares start times, so the two legs collide only when the second leg starts within 5 minutes of the first leg's start, which means a short first leg plus the transition. Even then, Pass 2 would flag the pair only if both legs had the same sport, and bike and run differ, so the sport check protects against this specific case. Bike/bike bricks (rare, but they exist in triathlon training) would still be deduplicated incorrectly. The fix is to add a duration check: if the durations differ by more than 20%, they're not the same activity regardless of start time proximity.
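A sketch of that duration guard follows. The post doesn't say what the 20% is measured against, so this version assumes it is relative to the longer of the two durations; the function name and the handling of missing durations are also assumptions.

```python
def durations_compatible(a_seconds, b_seconds, tolerance=0.20):
    """True when the durations differ by at most `tolerance` of the longer one."""
    longer = max(a_seconds, b_seconds)
    if longer <= 0:
        # No usable duration data; leave the decision to the time window.
        return True
    return abs(a_seconds - b_seconds) / longer <= tolerance


# Same 90-minute ride recorded twice, 20 seconds apart in length.
same_ride = durations_compatible(5400, 5380)
# A 4-minute leg followed by a 60-minute leg of the same sport.
short_then_long = durations_compatible(240, 3600)
```

With this guard, a Pass 2 hit would count as a duplicate only when durations_compatible also returns True, so the short-leg-then-long-leg pair is kept as two activities.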