
Research Project: Improving Pre-harvest Produce Safety through Reduction of Pathogen Levels in Agricultural Environments and Development and Validation of Farm-Scale Microbial Quality Model for Irrigation Water Sources

Location: Environmental Microbial & Food Safety Laboratory

Title: Estimating concentrations of Escherichia coli across a farm pond from the sUAS-based RGB imagery and water quality variables with machine learning techniques

Author
HONG, SEOKMIN - Ulsan National Institute Of Science And Technology (UNIST)
Morgan, Billie
Stocker (ctr), Matthew
SMITH, JACLYN - Orise Fellow
Kim, Moon
CHO, KYUNG HWA - Ulsan National Institute Of Science And Technology (UNIST)
Pachepsky, Yakov

Submitted to: Water Research
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 5/30/2024
Publication Date: 6/3/2024
Citation: Hong, S., Morgan, B.J., Stocker, M.D., Smith, J.E., Kim, M.S., Cho, K., Pachepsky, Y.A. 2024. Estimating concentrations of Escherichia coli across a farm pond from the sUAS-based RGB imagery and water quality variables with machine learning techniques. Water Research. 260: Article e121861. https://doi.org/10.1016/j.watres.2024.121861.
DOI: https://doi.org/10.1016/j.watres.2024.121861

Interpretive Summary: Concentrations of the bacterium Escherichia coli (E. coli) in irrigation water are crucial for evaluating drinking water quality and public health. In irrigation ponds, those concentrations vary in space and time as E. coli survival conditions vary. Survival conditions manifest themselves in water quality parameters. Our hypothesis was that E. coli survival conditions can also manifest themselves in differences in the color of water viewed in the visible and infrared ranges of the spectrum. We tested this hypothesis by applying machine learning to predict E. coli levels from water quality parameters, with and without images taken by drone-mounted GoPro cameras. The machine learning models offered an efficient means to predict E. coli patterns in irrigation ponds. Water quality professionals can use the results of this work to design water sampling strategies for efficient microbial water quality monitoring.

Technical Abstract: Achieving rapid and efficient quantification of E. coli concentrations is a significant goal in the field of microbial water quality. Remote sensing techniques and machine learning algorithms have therefore been used to detect and estimate E. coli. These approaches are nonetheless challenged by the limited number of available samples and by imbalances in water quality datasets. The objective of this study was to estimate E. coli concentrations within an irrigation pond in Maryland during the summer season using demosaiced RGB imagery in the visible and infrared spectral ranges and a set of 14 water quality parameters. Four machine learning models, Random Forest (RF), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGB), and K-nearest Neighbor (KNN), were employed under three scenarios: using only water quality parameters, using both water quality and sUAS-based RGB data, and using only RGB data. Two data splitting methods were used to select training and test datasets: traditional random data splitting (ordinary data splitting) and quantile data splitting, which kept the splitting ratio constant in each decile of the E. coli concentration distribution. Quantile data splitting resulted in better model performance metrics and smaller differences between those metrics over the training and test datasets. The RF, GBM, and XGB models trained with quantile data splitting after hyperparameter optimization had R² values above 0.847 over the training dataset and above 0.689 over the test dataset. Combining water quality and imagery data led to larger R² values, exceeding 0.896 for the test dataset. Shapley additive explanations (SHAP), which help to visualize the variable importance of machine learning models, revealed that the visible blue spectrum intensity, along with air temperature, was the most influential parameter in the RF model. Demosaiced RGB imagery served as a useful predictor of E. coli concentration across the studied irrigation pond.
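
The quantile data splitting described in the abstract can be illustrated with a short sketch. This is not the authors' implementation; it only shows one plausible way to keep the train/test ratio constant within each decile of the target distribution. The function name `quantile_split`, the 20% test fraction, the random seeds, and the synthetic values are all assumptions made for illustration.

```python
import random


def quantile_split(values, test_fraction=0.2, n_bins=10, seed=0):
    """Split sample indices so each quantile bin has the same test fraction.

    Hypothetical helper: `values` stands in for the target variable
    (for example, log-transformed E. coli concentrations).
    """
    rng = random.Random(seed)
    # Rank sample indices by their target value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    train_idx, test_idx = [], []
    for b in range(n_bins):
        # Consecutive slices of the ranking form the bins (deciles if n_bins=10).
        bin_idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        rng.shuffle(bin_idx)
        # Take the same share of every bin for the test set.
        n_test = round(len(bin_idx) * test_fraction)
        test_idx.extend(bin_idx[:n_test])
        train_idx.extend(bin_idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic values for illustration only; these are not data from the study.
_rng = random.Random(42)
log_conc = [_rng.gauss(1.5, 0.6) for _ in range(100)]

train_idx, test_idx = quantile_split(log_conc, test_fraction=0.2)
print(len(train_idx), len(test_idx))  # 80 20
```

Unlike a single random draw over all samples, this approach guarantees that both the low and the high ends of the concentration distribution are represented in the test set. That is one possible reason why training and test metrics would differ less under such a split.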