Finding New Worlds - Using Python and Open Data to Discover an Exoplanet Candidate

The Spark: From Conversation to Code

It started with a simple question: “What if we could find planets that others have missed?”

NASA’s Kepler and TESS missions have produced petabytes of public data, carefully analyzed by professional astronomers and citizen scientists alike. Thousands of exoplanets have been discovered. But in a dataset this vast, could there still be hidden signals waiting to be found?

With modern tools like Python, open-source astronomy libraries, and systematic thinking, we set out to answer that question. What followed was a journey through data science, astrophysics, and the humbling reality of scientific validation. This is the story of how we found KIC 8554498 b—a potential new exoplanet with a strong signal that somehow slipped through previous searches.

The Technical Foundation

Choosing the Right Tools

We built our detection pipeline using:

Lightkurve (v2.5.1): NASA’s official Python toolkit for Kepler and TESS data
Astropy (v7.2.0): Professional astronomy library with BLS (Box Least Squares) periodogram
AstroQuery: For cross-referencing against NASA’s Exoplanet Archive and other catalogs
NumPy, SciPy, Matplotlib: Standard scientific Python stack

The choice to use BLS over machine learning was deliberate. While neural networks are powerful, they’re black boxes. For a discovery that needs scientific validation, we wanted explainable, reproducible methods. BLS is the gold standard for transit detection—it fits box-shaped transit models to light curves and reports signal-to-noise ratios (SNR) that any astronomer can verify.

The Transit Method

Planets don’t emit their own light—they’re found by detecting the tiny dip in starlight when they pass in front of their host star. For a Jupiter-sized planet around a Sun-like star, this dip is about 1% (10,000 ppm). For an Earth-sized planet, it’s just 0.01% (100 ppm).

Our pipeline:

Downloads light curves from NASA’s MAST archive
Cleans data (removes outliers, applies quality flags)
Detrends to remove long-term stellar variability
Runs BLS periodogram to search for periodic transits
Validates detections with false positive tests

Early Validation: Proving the Concept

Before hunting for new planets, we needed to prove our pipeline worked. We tested it on Kepler-10b, one of the first confirmed rocky exoplanets:

Known period: 0.8376 days
Our detection: 0.8379 days
Error: 0.3%

Success! Our pipeline could detect known planets with extreme precision. We validated against three more confirmed planets:

L 98-59 d: 0.2% error
HD 23472 b: 0.1% error
TOI-700 c: 0.1% error

With sub-0.3% accuracy across the board, we were ready to hunt.

The First Hunt: Rediscovering What’s Known

Flush with confidence, we ran our pipeline on 16 carefully selected targets—stars bright enough for good data quality, in the right magnitude range, with enough observation time for reliable detections.

We found 12 candidates across 6 targets. Exciting! Until we started validating…

The Validation Crisis

TIC 307210830: We found a 2.25-day planet with SNR of 1700. Incredible signal!
Reality check: This is L 98-59 d, discovered in 2019.

TIC 94986319: A 0.98-day planet with SNR of 1850. Amazing!
Reality check: HD 23472 b, published in 2020.

TIC 410214986: 10.4-day signal, SNR 1200.
Reality check: TOI-700 c, discovered in 2020.

We had built a perfect planet detector. The problem? We were rediscovering planets that were already known.

Learning #1: Validation is Everything

This was a crucial lesson. Initial detection is exciting, but validation is what separates science from wishful thinking. We implemented comprehensive cross-matching:

Coordinate-based searches: 60-arcsecond cone searches in NASA’s Exoplanet Archive
ExoFOP-TESS catalogs: Check for Targets of Interest (TOIs)
SIMBAD queries: Find alternative star names
Cross-reference everything: A star might be cataloged under multiple identifiers

We weren’t just checking if a planet was known—we were checking if the host star had any known planets, because finding a second planet in a known system was also less interesting than finding something entirely new.

Strategic Pivot: Focus on the Unexplored

After our validation reality check, we had a strategic insight: stop looking at stars with known planets entirely.

The logic was simple: if professional surveys had already found planets around a star, they’d analyzed that data thoroughly. The odds of finding something genuinely new there were low. Instead, we should focus on stars that had never had a planetary detection.

This meant:

Filtering out all known planet hosts upfront
Accepting fewer candidates in exchange for higher novelty
Focusing computational effort on truly unexplored targets

We also tightened our detection thresholds:

SNR > 10 (strong detection required)
Depth > 100 ppm (significant transits only)
Novelty score > 3.0 (unusual characteristics prioritized)

The Discovery: KIC 8554498

After our strategic refinement, one target stood out: KIC 8554498.

This Kepler target had:

Good data quality (10 quarters, 258,575 clean data points)
Decent brightness (Kepler magnitude in range)
Zero known planets (confirmed via NASA Archive)
A strong periodic signal at 4.78 days

Initial BLS detection showed:

Period: 4.7803 days
Transit depth: 383 ppm (0.038%)
SNR: 5837 (!!)
48 transits detected over 229 days

That SNR of 5837 was extraordinary. For context, an SNR above 7 is considered a strong detection. This signal was screaming at us from the data.

Comprehensive Validation

A strong signal isn’t enough. We needed to rule out false positives. Eclipsing binary stars can mimic planetary transits, so we implemented two critical tests:

Test 1: Odd/Even Transit Comparison

If this were an eclipsing binary (two stars orbiting each other), odd and even transits would show different depths—because you’d alternately see the primary and secondary eclipse.

Our results:

Odd transits (24 events): depth = 297.6 ppm
Even transits (24 events): depth = 325.8 ppm
Ratio: 0.913

Verdict: ✅ PASS — Depths are statistically consistent (ratio should be 0.7-1.3 for planets)

Test 2: Secondary Eclipse Search

If this were a stellar companion, we’d see a secondary eclipse when the companion goes behind the host star (at orbital phase 0.5).

Our results:

Primary transit depth (phase 0.0): 40.1 ppm
Secondary eclipse depth (phase 0.5): 0.0 ppm
Ratio: 0.000

Verdict: ✅ PASS — No secondary eclipse detected (ratio should be <0.3 for planets)

Phase Coherence

All 48 transits phase-fold perfectly at the 4.7803-day period. No timing variations, no irregularities. The transit shape shows clear ingress and egress, consistent with a planet crossing a stellar disk.

What We Found: KIC 8554498 b

If confirmed, KIC 8554498 b would be:

Radius: ~2 Earth radii (super-Earth or mini-Neptune)
Period: 4.7803 days (short-period, likely tidally locked)
Equilibrium temperature: ~1000-1500 K (hot planet)
Host star: KIC 8554498 (Kepler target, no previously known planets)
Detection confidence: Very high (SNR = 5837, 48 transits, passes all tests)

This is a strong candidate. Not a confirmed planet yet—that requires additional observations like high-resolution imaging, centroid analysis, and ideally radial velocity measurements to confirm the mass. But the signal is clear, validated, and genuinely new.

Technical Challenges We Solved

1. BLS API Confusion

The Astropy BLS documentation is… tricky. We initially tried to access bls.period_at_max_power and bls.depth_snr—attributes that don’t actually exist.

Solution: When using objective='snr', the SNR is directly in periodogram.power. You access it as an array: SNR = np.max(periodogram.power).

2. Astropy Quantity Objects

Astropy uses Quantity objects (values with units) extensively. Transit times are returned as Time objects, durations as Quantity objects. Trying to convert them directly to floats causes TypeErrors.

Solution: Use the .value attribute: stats['transit_times'][0].value extracts the underlying float.

3. Phase Plotting with TimeDelta

When folding light curves to orbital phase, folded.phase returns TimeDelta objects. Matplotlib can’t plot these directly.

Solution: Use .value attribute: phase_sorted.phase.value gets the numeric array Matplotlib needs.

4. False Positive Contamination

Early candidates looked exciting until validation showed they were known planets. We lost days chasing rediscoveries.

Solution: Implement coordinate-based validation before getting excited. Cross-check everything against multiple catalogs. Accept that validation will reject most candidates—that’s the scientific method working.

What We Learned

1. The Humility of Validation

Every exciting signal needs skeptical validation. Our pipeline’s first “discoveries” were all known planets. This wasn’t failure—it proved our detection accuracy. But it taught us that the hard part isn’t finding signals; it’s confirming they’re genuinely new.

2. Strategic Filtering Matters

We found more value in analyzing 5 truly unexplored stars than 50 stars with known planets. Quality over quantity. Novelty over confirmation bias.

3. Explainable Methods Win

We chose BLS over machine learning deliberately. When presenting KIC 8554498 b to the scientific community, we can explain exactly how it was found, show the SNR calculation, reproduce every step. This transparency is crucial for scientific acceptance.

4. Open Data is Powerful

Every byte of data we used is publicly available. Every library is open-source. Anyone with Python and curiosity can reproduce our results. This democratization of astronomy is remarkable—20 years ago, you’d need a telescope and institutional access.

5. Bugs Are Features

Every error taught us something about Astropy’s design, about astronomical data formats, about validation strategies. The TypeError that crashed our phase plotting? It forced us to understand TimeDelta objects properly. The known planet rediscoveries? They validated our pipeline accuracy.

The Scientific Context

Why Might This Have Been Missed?

Professional surveys like the Kepler pipeline analyzed every target systematically. Why would they miss a signal with SNR of 5837?

Possible reasons:

Threshold tuning: Automated pipelines use SNR cutoffs calibrated for different objectives. A signal just below a threshold gets deprioritized.
Data completeness: We used 10 quarters; maybe early analyses used fewer.
Focus on bright stars: Fainter targets get less follow-up attention.
Human verification: With thousands of candidates, some signals might be deprioritized in triage.

This isn’t a failure of previous work—it’s the nature of big data astronomy. Every analysis makes choices about where to focus limited resources. Our approach was different: targeted depth on specific stars rather than breadth across thousands.

What Makes This Interesting?

High SNR: 5837 is exceptionally strong for a planet this size
Unexplored host: No known planets in this system
Validated: Passes key false positive tests
Reproducible: Clear signal, public data, open methods
Science-ready: Strong enough for community validation and potential follow-up

Next Steps: Community Validation

We’ve prepared a scientific report for community review. The next phase requires:

Independent verification: Other astronomers analyzing the same data
Stellar characterization: Understanding the host star better
Centroid analysis: Confirming the transit signal comes from KIC 8554498, not a background source
Follow-up observations: Ground-based photometry, high-resolution imaging
Radial velocity: If bright enough, measuring the planet’s mass

A detection becomes a discovery only after the community validates it. This is the scientific method in action.

The Bigger Picture

This project demonstrates something profound: meaningful scientific research is increasingly accessible.

You don’t need:

A PhD in astrophysics (helpful, but not required)
Access to telescopes (NASA data is free)
Expensive software (Python libraries are open-source)
A university affiliation (curiosity and rigor are enough)

You do need:

Critical thinking
Willingness to validate skeptically
Humility to learn from failures
Persistence through technical challenges
Respect for scientific rigor

Technical Deep Dive (For the Curious)

The BLS Algorithm

Box Least Squares works by:

Testing thousands of trial periods (0.5 to 30 days in our case)
For each period, folding the light curve and testing different transit durations
Fitting a box-shaped transit model (flat bottom, sharp edges)
Computing SNR: (transit depth) / (noise standard deviation)
Returning the period/duration combination with maximum SNR

The “box” assumption is reasonable because planetary transits have:

Sharp ingress (planet enters the stellar disk)
Flat bottom (planet fully in front of star)
Sharp egress (planet exits the stellar disk)

Our Pipeline Code Structure

# 1. Download and prepare data
lc = lk.search_lightcurve('KIC 8554498', mission='Kepler').download_all()
lc = lc.stitch()
lc = lc.remove_outliers(sigma=5)
lc = lc[lc.quality == 0]  # Keep only highest quality data

# 2. Detrend to remove stellar variability
lc_flat = lc.flatten(window_length=0.5)
lc_norm = lc_flat.normalize()

# 3. Run BLS periodogram
bls = BoxLeastSquares(time, flux)
periodogram = bls.autopower(minimum_period, objective='snr')

# 4. Get best period and compute statistics
idx_best = np.argmax(periodogram.power)
best_period = periodogram.period[idx_best]
stats = bls.compute_stats(periodogram.period[idx_best], 
                          periodogram.duration[idx_best],
                          periodogram.transit_time[idx_best])

# 5. Validate with false positive tests
# (odd/even comparison, secondary eclipse search, etc.)

Data Quality Matters

Not all Kepler data is equal. We used:

Quality flags: quality == 0 keeps only perfect data
Bitmask filtering: bitmask=1130799 removes known instrumental issues
Sigma clipping: Removes outliers beyond 5σ
Detrending: Biweight filter with 0.5-day window removes stellar variability without removing transits

Poor data quality can create false signals or hide real ones. This preprocessing is critical.

Computing Planet Radius

Transit depth relates to planet radius:

depth = (R_planet / R_star)²

For KIC 8554498 b:

Depth = 383 ppm = 0.000383
R_planet / R_star = √0.000383 ≈ 0.0196
If R_star ≈ 1 R_☉ (typical assumption):
R_planet ≈ 0.0196 R_☉ ≈ 2.16 R_⊕

This puts it in the “super-Earth” or “mini-Neptune” category—bigger than Earth, smaller than Neptune. These planets are common in the galaxy but absent from our solar system, making them particularly interesting for comparative planetology.

Reflections: AI as Scientific Assistant

This project was vibe coded with GitHub Copilot—an experiment in letting AI assistance drive the technical implementation while human intuition guided the scientific strategy. The entire codebase was written through conversational prompting, with Copilot generating functions, debugging API calls, and even suggesting validation approaches based on astronomical best practices.

The AI didn’t “discover” the planet—the signal is in the data, and the algorithms are standard astrophysics. What AI provided was:

Rapid prototyping: Generating Python code from astronomical concepts
Debugging assistance: Solving technical issues with Astropy APIs
Knowledge synthesis: Connecting astronomical theory to implementation
Validation strategy: Suggesting tests based on exoplanet literature
Documentation: Explaining methods clearly for reproducibility

The human provided:

Scientific intuition and strategy
Critical validation skepticism
Strategic pivots (excluding known hosts)
Interpretation of results
Ethical responsibility for claims

This “vibe coding” approach—writing code through natural language conversation with AI—dramatically accelerated development. What might have taken weeks of reading documentation and debugging became days of iterative refinement. This collaboration model—AI as tool, human as scientist—feels like the future of research. AI accelerates the technical work, humans provide creativity, skepticism, and judgment.

Conclusion: One Candidate, Many Lessons

We set out to find an exoplanet. We may have succeeded—KIC 8554498 b is a strong candidate awaiting community validation. But whether this detection becomes a confirmed discovery or a well-understood false positive, the journey taught us far more than the destination.

We learned:

How professional astronomers detect exoplanets
Why validation is harder than detection
How to work with real scientific data and tools
The value of skepticism and rigorous testing
That meaningful research is accessible with modern tools

In an age where AI and open data are democratizing science, projects like this show what’s possible. You don’t need a research institution or years of training to contribute to human knowledge. You need curiosity, rigor, and respect for the scientific method.

The universe is full of planets waiting to be found. The data is public. The tools are free. The only question is: what will you discover?

Resources

Key Libraries:

Lightkurve: https://docs.lightkurve.org/
Astropy: https://www.astropy.org/
AstroQuery: https://astroquery.readthedocs.io/

Data Sources:

NASA Exoplanet Archive: https://exoplanetarchive.ipac.caltech.edu/
MAST (Kepler data): https://mast.stsci.edu/

Learn More:

“Exoplanet Detection” course (Coursera/edX)
NASA Exoplanet Exploration: https://exoplanets.nasa.gov/
Planet Hunters: https://www.planethunters.org/

This post describes candidate KIC 8554498 b, detected using Kepler archival data. The detection awaits community validation and should be considered preliminary until confirmed by independent analysis and follow-up observations.

Status: Candidate (high confidence, awaiting validation)
SNR: 5837
Period: 4.7803 days
Transits: 48 detected over 229 days
Host: KIC 8554498 (no previously known planets)

Want to help validate this candidate or learn more about exoplanet detection? Contact us or explore the public Kepler data yourself. The next discovery could be yours.