A Python-based system for analyzing, matching, and finding similar geographic addresses. It combines text similarity (Jaccard and sequence matching) with Haversine distance between coordinates to rank the most relevant matches for a given address query.
- Flexible Address Input:
  - Full address with coordinates: latitude,longitude|street_address|locality|region|postcode
  - Street address only: street_number street_name
  - Partial address with components: street_address|locality|region|postcode
  - Comma-separated address: street_number street_name, locality
- Smart Address Parsing:
  - Automatic extraction of address components
  - Support for unit/apartment numbers
  - Coordinate detection and normalization
  - Handling of various address formats
- Intelligent Matching:
  - Geographic distance-based matching using coordinates
  - Text similarity using combined Jaccard and sequence matching
  - Component-wise comparison with weighted scoring
  - Duplicate removal in search results
- Address Components:
  - Coordinates (latitude, longitude)
  - House number
  - Unit/apartment number
  - Street name
  - Locality
  - Region
  - Postcode
Installation:
- Create a virtual environment:
  python3 -m venv venv
  source venv/bin/activate  # On Unix/macOS
- Install dependencies:
  pip install -r requirements.txt
Run the interactive CLI:
python3 address_cli.py
Available commands:
- :help - Show help message
- :quit - Exit the program
- :ex - Show example addresses
Example addresses:
- Full address with coordinates: 55.853729,-4.254518|40 Carlton Pl|Glasgow|Lanarkshire|G5 9TS
- Street address only: 40 Carlton Pl
- Address with locality: 40 Carlton Pl, Glasgow
- Address with unit: Unit 7, 23 Westminster Terrace
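The example inputs above could be parsed along these lines. This is a minimal sketch, not the code in address_analyzer.py: the function name `parse_query`, the regular expressions, and the unit keywords (unit, apt, apartment, flat) are assumptions made for illustration.

```python
import re

# Hypothetical field names, mirroring the address components listed in this README.
FIELDS = ("latitude", "longitude", "unit", "house_number",
          "street_name", "locality", "region", "postcode")

# "lat,lon" as two signed decimals.
COORD_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$")
# Leading "Unit 7," style prefix (keyword list is an assumption).
UNIT_RE = re.compile(r"^\s*(?:unit|apt|apartment|flat)\s+(\w+)\s*,?\s*", re.IGNORECASE)
# Leading house number such as "40" or "23A", followed by the street name.
HOUSE_RE = re.compile(r"^\s*(\d+\w*)\s+(.*)$")


def parse_query(text):
    """Parse one query string into a dict of address components."""
    result = dict.fromkeys(FIELDS)
    parts = [p.strip() for p in text.strip().split("|")]

    # Coordinates are only recognised as the first pipe-separated field.
    coords = COORD_RE.match(parts[0])
    if coords and len(parts) > 1:
        result["latitude"] = float(coords.group(1))
        result["longitude"] = float(coords.group(2))
        parts = parts[1:]

    street = parts[0]
    unit = UNIT_RE.match(street)
    if unit:
        result["unit"] = unit.group(1)
        street = street[unit.end():]

    if len(parts) == 1:
        # No pipes: treat anything after the first comma as the locality.
        street, _, locality = street.partition(",")
        result["locality"] = locality.strip() or None
    else:
        for name, value in zip(("locality", "region", "postcode"), parts[1:]):
            result[name] = value or None

    house = HOUSE_RE.match(street)
    if house:
        result["house_number"] = house.group(1)
        result["street_name"] = house.group(2).strip()
    else:
        result["street_name"] = street.strip() or None
    return result
```

Calling `parse_query("Unit 7, 23 Westminster Terrace")` with this sketch yields unit `"7"`, house number `"23"`, and street name `"Westminster Terrace"`, with the remaining fields left as `None`.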
The system expects training data in the following format:
latitude,longitude|street_address|locality|region|postcode
Datasets should be stored in:
- data/train.txt - Training dataset
- data/val.txt - Validation dataset
- data/test.txt - Test dataset
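As an illustration of the record format, the following sketch reads pipe-separated lines into dictionaries. The function name `parse_records` and the choice to raise on malformed lines are assumptions; the actual loader in address_analyzer.py may behave differently.

```python
def parse_records(lines):
    """Yield one dict per 'lat,lon|street|locality|region|postcode' line."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("|")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_number}: expected 5 fields, got {len(fields)}"
            )
        coords, street, locality, region, postcode = (f.strip() for f in fields)
        lat_text, lon_text = coords.split(",")
        yield {
            "latitude": float(lat_text),
            "longitude": float(lon_text),
            "street_address": street,
            "locality": locality,
            "region": region,
            "postcode": postcode,
        }
```

A file can be passed directly, since file objects iterate over lines: `with open("data/train.txt", encoding="utf-8") as handle: records = list(parse_records(handle))`.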
Address matching uses a weighted score across components:
- Coordinates: 15%
- House number: 20%
- Unit/apartment: 10%
- Street name: 30%
- Locality: 15%
- Region: 5%
- Postcode: 5%
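The weights above can be combined into a single score as sketched below. The weights come from this README; the function name `weighted_score`, the component keys, and the treatment of missing components (contributing zero unless `renormalize` is set) are assumptions, since the README does not say how partial queries are scored.

```python
# Weights as listed in this README (they sum to 100%).
WEIGHTS = {
    "coordinates": 0.15,
    "house_number": 0.20,
    "unit": 0.10,
    "street_name": 0.30,
    "locality": 0.15,
    "region": 0.05,
    "postcode": 0.05,
}


def weighted_score(component_scores, renormalize=False):
    """Combine per-component similarities (each in [0, 1]) into one score.

    Components absent from `component_scores` contribute nothing. With
    renormalize=True the result is divided by the total weight of the
    components that were supplied, so a street-only query can still
    reach 1.0 (hypothetical option, not confirmed by the source).
    """
    total = 0.0
    weight_used = 0.0
    for name, weight in WEIGHTS.items():
        if name in component_scores:
            total += weight * component_scores[name]
            weight_used += weight
    if renormalize and weight_used > 0:
        return total / weight_used
    return total
```

For example, a perfect match on house number and street name alone scores about 0.5 without renormalization, because those two components carry 20% and 30% of the weight.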
Distance calculations for coordinates use the Haversine formula to account for Earth's curvature.
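For reference, a standard Haversine implementation looks like the sketch below. It uses the standard-library `math` module to stay dependency-free (the project itself lists NumPy), and the function name `haversine_km` and the mean Earth radius of 6371 km are choices made here, not values taken from the source.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to guard against rounding pushing `a` slightly above 1.
    a = min(1.0, a)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

With this radius, one degree of latitude corresponds to roughly 111.2 km.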
Project structure:
ngram/
├── data/
│ ├── train.txt
│ ├── val.txt
│ └── test.txt
├── address_analyzer.py # Core address parsing and matching logic
├── address_cli.py # Interactive command-line interface
├── requirements.txt # Project dependencies
└── README.md # This file
Requirements:
- Python 3.6+
- NumPy - For numerical operations and distance calculations
- re (built-in) - Regular expressions for address parsing
- difflib (built-in) - For text similarity calculations
Implementation notes:
- Coordinates are normalized to 3 decimal places for consistency
- Address matching is case-insensitive
- Duplicates are removed from search results based on core address components
- Geographic distance is normalized using a 5 km scale factor
- Text similarity combines word-based (Jaccard) and character-based (sequence) matching
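The notes above can be sketched in code as follows. Several details are assumptions because the README does not specify them: the equal 50/50 blend of Jaccard and sequence similarity, the linear falloff used to normalize distance against the 5 km scale, and the fields used as the duplicate key. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize_coordinate(value):
    """Round a latitude or longitude to 3 decimal places."""
    return round(value, 3)


def jaccard(a, b):
    """Word-level Jaccard similarity, case-insensitive."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty strings are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def text_similarity(a, b, jaccard_weight=0.5):
    """Blend word-based and character-based similarity (weight is assumed)."""
    sequence = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return jaccard_weight * jaccard(a, b) + (1 - jaccard_weight) * sequence


def distance_score(distance_km, scale_km=5.0):
    """Map a distance to [0, 1]: 1.0 at 0 km, 0.0 at scale_km or beyond.

    The linear shape is an assumption; an exponential decay would be an
    equally plausible reading of "5 km scale factor".
    """
    return max(0.0, 1.0 - distance_km / scale_km)


def dedupe(results, key_fields=("house_number", "unit", "street_name", "postcode")):
    """Keep the first record for each case-insensitive key (fields assumed)."""
    seen = set()
    unique = []
    for record in results:
        key = tuple(str(record.get(field) or "").lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Because both similarity measures lowercase their inputs, `text_similarity("Carlton Pl", "carlton pl")` evaluates to 1.0 in this sketch.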