LOF docs

apache · Oct 15, 2024 · 66ac16e
1 parent 08c8515
Showing 3 changed files with 86 additions and 6 deletions.
diff --git a/docs/api/stats/sql.md b/docs/api/stats/sql.md
@@ -9,7 +9,7 @@ complete set of geospatial analysis tools.
 
 ## Using DBSCAN
 
-The DBSCAN function is provided at `org.apache.sedona.stats.DBSCAN.dbscan` in scala/java and `sedona.stats.dbscan.dbscan` in python.
+The DBSCAN function is provided at `org.apache.sedona.stats.clustering.DBSCAN.dbscan` in scala/java and `sedona.stats.clustering.dbscan.dbscan` in python.
 
 The function annotates a dataframe with a cluster label for each data record using the DBSCAN algorithm.
 The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
@@ -29,3 +29,22 @@ names in parentheses are python variable names
 - useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
 
 The output is the input DataFrame with the cluster label added to each row. Outlier will have a cluster value of -1 if included.
+
+## Using Local Outlier Factor (LOF)
+
+The LOF function is provided at `org.apache.sedona.stats.outlierDetection.LocalOutlierFactor.localOutlierFactor` in scala/java and `sedona.stats.outlier_detection.local_outlier_factor.local_outlier_factor` in python.
+
+The function annotates a dataframe with a column containing the local outlier factor for each data record.
+The dataframe should contain at least one `GeometryType` column. Rows must be unique. If one
+geometry column is present it will be used automatically. If two are present, the one named
+'geometry' will be used. If more than one are present and neither is named 'geometry', the
+column name must be provided.
+
+
+### Parameters
+names in parentheses are python variable names
+- dataframe - dataframe containing the point geometries
+- k - number of nearest neighbors that will be considered for the LOF calculation
+- geometry - name of the geometry column
+- handleTies (handle_ties) - whether to handle ties in the k-distance calculation. Default is false
+- useSpheroid (use_spheroid) - whether to use a cartesian or spheroidal distance calculation. Default is false
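
To make the `k` parameter and the meaning of the returned score more concrete, here is a minimal pure-Python sketch of the textbook LOF computation on a handful of 2D points. It is an illustration only and is not Sedona's implementation: the function name `local_outlier_factor_sketch` is hypothetical, it uses plain Euclidean distance, runs in O(n^2), assumes unique points, and simply truncates to the first `k` neighbors without any tie handling.

```python
import math


def local_outlier_factor_sketch(points, k):
    """Textbook LOF for a small list of unique (x, y) tuples (illustration only)."""
    # For every point, find its k nearest neighbors as (distance, index) pairs.
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        neighbors.append(dists[:k])

    # k-distance: distance from each point to its k-th nearest neighbor.
    k_distance = [nbrs[-1][0] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reachability distance to neighbor j is max(k_distance[j], d(i, j)).
    lrd = []
    for nbrs in neighbors:
        reach = [max(k_distance[j], d) for d, j in nbrs]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' densities to the point's own density.
    return [
        sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
        for i, nbrs in enumerate(neighbors)
    ]


# Four corners of a unit square plus one far-away point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
scores = local_outlier_factor_sketch(points, k=2)
```

In this sketch the four corners score about 1 (their density matches that of their neighbors), while the distant point scores well above 1. Scores in the `lof` column can be read the same way: values near 1 indicate inliers and larger values indicate likely outliers.
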
diff --git a/docs/tutorial/sql.md b/docs/tutorial/sql.md
@@ -750,23 +750,23 @@ The first parameter is the dataframe, the next two are the epsilon and min_point
 === "Scala"
 
 	```scala
-	import org.apache.sedona.stats.DBSCAN.dbscan
+	import org.apache.sedona.stats.clustering.DBSCAN.dbscan
 
 	dbscan(df, 0.1, 5).show()
 	```
 
 === "Java"
 
 	```java
-	import org.apache.sedona.stats.DBSCAN;
+	import org.apache.sedona.stats.clustering.DBSCAN;
 
 	DBSCAN.dbscan(df, 0.1, 5).show();
 	```
 
 === "Python"
 
 	```python
-	from sedona.stats.dbscan import dbscan
+	from sedona.stats.clustering.dbscan import dbscan
 
 	dbscan(df, 0.1, 5).show()
 	```
@@ -793,6 +793,67 @@ The output will look like this:
 +----------------+---+------+-------+
 ```
 
+## Calculate the Local Outlier Factor (LOF)
+
+Sedona provides an implementation of the [Local Outlier Factor](https://en.wikipedia.org/wiki/Local_outlier_factor) algorithm to identify anomalous data.
+
+The algorithm is available as a function in Scala, Java, and Python that is called on a spatial dataframe. The returned dataframe has an additional column containing the local outlier factor.
+
+The first parameter is the dataframe; the second is the number of nearest neighbors to use when calculating the score.
+
+=== "Scala"
+
+	```scala
+	import org.apache.sedona.stats.outlierDetection.LocalOutlierFactor.localOutlierFactor
+
+	localOutlierFactor(df, 20).show()
+	```
+
+=== "Java"
+
+	```java
+	import org.apache.sedona.stats.outlierDetection.LocalOutlierFactor;
+
+	LocalOutlierFactor.localOutlierFactor(df, 20).show();
+	```
+
+=== "Python"
+
+	```python
+	from sedona.stats.outlier_detection.local_outlier_factor import local_outlier_factor
+
+	local_outlier_factor(df, 20).show()
+	```
+
+The output will look like this:
+
+```
++--------------------+------------------+
+|            geometry|               lof|
++--------------------+------------------+
+|POINT (-2.0231305...| 0.952098153363662|
+|POINT (-2.0346944...|0.9975325496668104|
+|POINT (-2.2040074...|1.0825843906411081|
+|POINT (1.61573501...|1.7367129352162634|
+|POINT (-2.1176324...|1.5714144683150393|
+|POINT (-2.2349759...|0.9167275845938276|
+|POINT (1.65470192...| 1.046231536764447|
+|POINT (0.62624112...|1.1988700676990034|
+|POINT (2.01746261...|1.1060219481067417|
+|POINT (-2.0483857...|1.0775553430145446|
+|POINT (2.43969463...|1.1129132178576646|
+|POINT (-2.2425480...| 1.104108012697006|
+|POINT (-2.7859235...|  2.86371824574529|
+|POINT (-1.9738858...|1.0398822680356794|
+|POINT (2.00153403...| 0.927409656346015|
+|POINT (2.06422812...|0.9222203762264445|
+|POINT (-1.7533819...|1.0273650471626696|
+|POINT (-2.2030766...| 0.964744555830738|
+|POINT (-1.8509857...|1.0375927869698574|
+|POINT (2.10849080...|1.0753419197322656|
++--------------------+------------------+
+```
+
 ## Run spatial queries
 
 After creating a Geometry type column, you are able to run spatial queries.

diff --git a/...k/common/src/main/scala/org/apache/sedona/stats/outlierDetection/LocalOutlierFactor.scala b/...k/common/src/main/scala/org/apache/sedona/stats/outlierDetection/LocalOutlierFactor.scala
@@ -35,7 +35,7 @@ object LocalOutlierFactor {
    * column name must be provided.
    *
    * @param dataframe
-   *   apache sedona idDataframe containing the point geometries
+   *   dataframe containing the point geometries
    * @param k
    *   number of nearest neighbors that will be considered for the LOF calculation
    * @param geometry
@@ -46,7 +46,7 @@ object LocalOutlierFactor {
    *   whether to use a cartesian or spheroidal distance calculation. Default is false
    *
    * @return
-   *   A PySpark DataFrame containing the lof for each row
+   *   A DataFrame containing the lof for each row
    */
   def localOutlierFactor(
       dataframe: DataFrame,