LOF docs
jameswillis committed Oct 15, 2024
1 parent 08c8515 commit 66ac16e
Showing 3 changed files with 86 additions and 6 deletions.
21 changes: 20 additions & 1 deletion docs/api/stats/sql.md
@@ -9,7 +9,7 @@ complete set of geospatial analysis tools.

## Using DBSCAN

- The DBSCAN function is provided at `org.apache.sedona.stats.DBSCAN.dbscan` in scala/java and `sedona.stats.dbscan.dbscan` in python.
+ The DBSCAN function is provided at `org.apache.sedona.stats.clustering.DBSCAN.dbscan` in scala/java and `sedona.stats.clustering.dbscan.dbscan` in python.

The function annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.
The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
@@ -29,3 +29,22 @@ names in parentheses are python variable names
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false

The output is the input DataFrame with the cluster label added to each row. Outliers will have a cluster value of -1 if included.

## Using Local Outlier Factor (LOF)

The LOF function is provided at `org.apache.sedona.stats.outlierDetection.LocalOutlierFactor.localOutlierFactor` in scala/java and `sedona.stats.outlier_detection.local_outlier_factor.local_outlier_factor` in python.

The function annotates a dataframe with a column containing the local outlier factor for each data record.
The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
geometry column is present, it will be used automatically. If several are present, the one named
'geometry' will be used. If several are present and none is named 'geometry', the
column name must be provided.
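
The column selection rules above can be sketched in plain Python. This is an illustrative helper, not Sedona's code: the name `resolve_geometry_column`, the name-to-type dictionary standing in for a dataframe schema, and the validation of an explicitly passed name are all assumptions made for the example.

```python
from typing import Dict, Optional


def resolve_geometry_column(schema: Dict[str, str], geometry: Optional[str] = None) -> str:
    """Pick the geometry column following the rules described above.

    `schema` maps column name -> type name. Hypothetical helper for
    illustration only; Sedona's own resolution logic may differ.
    """
    candidates = [name for name, type_name in schema.items() if type_name == "GeometryType"]
    if geometry is not None:
        # Assumption: an explicitly named column must itself be a geometry column.
        if geometry not in candidates:
            raise ValueError(f"{geometry!r} is not a geometry column")
        return geometry
    if not candidates:
        raise ValueError("no geometry column found")
    if len(candidates) == 1:
        # A single geometry column is used automatically.
        return candidates[0]
    if "geometry" in candidates:
        # Several geometry columns: prefer the one named 'geometry'.
        return "geometry"
    raise ValueError("multiple geometry columns; pass the column name explicitly")
```

With a schema such as `{"id": "LongType", "geom": "GeometryType"}` the helper picks `geom`; with two geometry columns and no column named 'geometry' it raises unless a name is supplied.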


### Parameters
Names in parentheses are Python variable names.

- dataframe - dataframe containing the point geometries
- k - number of nearest neighbors that will be considered for the LOF calculation
- geometry - name of the geometry column
- handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
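
To make the role of `k` concrete, here is a minimal brute-force sketch of the textbook LOF computation in plain Python. It is not Sedona's implementation: the function name `local_outlier_factors`, the tuple-based points, and the cartesian distance are assumptions, and the sketch takes exactly `k` neighbors per point without any tie handling. It relies on the points being unique, as required above, so that no distance sum is zero.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def local_outlier_factors(points: Sequence[Point], k: int) -> List[float]:
    """Textbook LOF for unique 2D points (illustrative, O(n^2 log n))."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must be between 1 and len(points) - 1")

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = ranked[:k]
        neighbors.append([j for _, j in nearest])
        k_distance.append(nearest[-1][0])

    def reach_dist(i: int, j: int) -> float:
        # Reachability distance of point i from its neighbor j.
        return max(k_distance[j], math.dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = [k / sum(reach_dist(i, j) for j in neighbors[i]) for i in range(n)]

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Four corners of a unit square plus one distant point.
scores = local_outlier_factors([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], k=2)
```

In this toy input the four corners have the same density as their neighbors and score 1, while the distant point scores far above 1. A larger `k` compares each point against a wider neighborhood.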
67 changes: 64 additions & 3 deletions docs/tutorial/sql.md
@@ -750,23 +750,23 @@ The first parameter is the dataframe, the next two are the epsilon and min_point
=== "Scala"

```scala
- import org.apache.sedona.stats.DBSCAN.dbscan
+ import org.apache.sedona.stats.clustering.DBSCAN.dbscan

dbscan(df, 0.1, 5).show()
```

=== "Java"

```java
- import org.apache.sedona.stats.DBSCAN;
+ import org.apache.sedona.stats.clustering.DBSCAN;

DBSCAN.dbscan(df, 0.1, 5).show();
```

=== "Python"

```python
- from sedona.stats.dbscan import dbscan
+ from sedona.stats.clustering.dbscan import dbscan

dbscan(df, 0.1, 5).show()
```
@@ -793,6 +793,67 @@ The output will look like this:
+----------------+---+------+-------+
```

## Calculate the Local Outlier Factor (LOF)

Sedona provides an implementation of the [Local Outlier Factor](https://en.wikipedia.org/wiki/Local_outlier_factor) algorithm to identify anomalous data.

The algorithm is available as a function in Scala, Java, and Python that is called on a spatial dataframe. The returned dataframe has an additional column containing the local outlier factor.

The first parameter is the dataframe; the second is the number of nearest neighbors to use in calculating the score.

=== "Scala"

```scala
import org.apache.sedona.stats.outlierDetection.LocalOutlierFactor.localOutlierFactor

localOutlierFactor(df, 20).show()
```

=== "Java"

```java
import org.apache.sedona.stats.outlierDetection.LocalOutlierFactor;

LocalOutlierFactor.localOutlierFactor(df, 20).show();
```

=== "Python"

```python
from sedona.stats.outlier_detection.local_outlier_factor import local_outlier_factor

local_outlier_factor(df, 20).show()
```

The output will look like this:

```
+--------------------+------------------+
| geometry| lof|
+--------------------+------------------+
|POINT (-2.0231305...| 0.952098153363662|
|POINT (-2.0346944...|0.9975325496668104|
|POINT (-2.2040074...|1.0825843906411081|
|POINT (1.61573501...|1.7367129352162634|
|POINT (-2.1176324...|1.5714144683150393|
|POINT (-2.2349759...|0.9167275845938276|
|POINT (1.65470192...| 1.046231536764447|
|POINT (0.62624112...|1.1988700676990034|
|POINT (2.01746261...|1.1060219481067417|
|POINT (-2.0483857...|1.0775553430145446|
|POINT (2.43969463...|1.1129132178576646|
|POINT (-2.2425480...| 1.104108012697006|
|POINT (-2.7859235...| 2.86371824574529|
|POINT (-1.9738858...|1.0398822680356794|
|POINT (2.00153403...| 0.927409656346015|
|POINT (2.06422812...|0.9222203762264445|
|POINT (-1.7533819...|1.0273650471626696|
|POINT (-2.2030766...| 0.964744555830738|
|POINT (-1.8509857...|1.0375927869698574|
|POINT (2.10849080...|1.0753419197322656|
+--------------------+------------------+
```
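
Scores close to 1 indicate points whose local density is similar to that of their neighbors, while noticeably larger scores suggest outliers. As a rough sketch of how the column might be used, the snippet below applies a cutoff to the `lof` values copied from the sample output. The threshold of 1.5 and the variable names are assumptions chosen for illustration; a suitable cutoff depends on the data.

```python
# lof values copied from the sample output above, in row order
lof_scores = [
    0.952098153363662, 0.9975325496668104, 1.0825843906411081, 1.7367129352162634,
    1.5714144683150393, 0.9167275845938276, 1.046231536764447, 1.1988700676990034,
    1.1060219481067417, 1.0775553430145446, 1.1129132178576646, 1.104108012697006,
    2.86371824574529, 1.0398822680356794, 0.927409656346015, 0.9222203762264445,
    1.0273650471626696, 0.964744555830738, 1.0375927869698574, 1.0753419197322656,
]

THRESHOLD = 1.5  # hypothetical cutoff, not a Sedona default

# Keep (row index, score) pairs above the cutoff, highest score first.
flagged = sorted(
    ((i, s) for i, s in enumerate(lof_scores) if s > THRESHOLD),
    key=lambda pair: pair[1],
    reverse=True,
)

for row, score in flagged:
    print(f"row {row}: lof={score:.3f}")
```

On a Spark dataframe the same cut could be expressed with the standard `DataFrame.filter`, for example `local_outlier_factor(df, 20).filter("lof > 1.5")`, assuming the output column is named `lof` as in the sample output.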

## Run spatial queries

After creating a Geometry type column, you are able to run spatial queries.
@@ -35,7 +35,7 @@ object LocalOutlierFactor {
* column name must be provided.
*
* @param dataframe
- * apache sedona idDataframe containing the point geometries
+ * dataframe containing the point geometries
* @param k
* number of nearest neighbors that will be considered for the LOF calculation
* @param geometry
@@ -46,7 +46,7 @@ object LocalOutlierFactor {
* whether to use a cartesian or spheroidal distance calculation. Default is false
*
* @return
- * A PySpark DataFrame containing the lof for each row
+ * A DataFrame containing the lof for each row
*/
def localOutlierFactor(
dataframe: DataFrame,