Research Project: Design and Implementation of Monitoring and Modeling Methods to Evaluate Microbial Quality of Surface Water Sources Used for Irrigation

Location: Environmental Microbial & Food Safety Laboratory

Title: Prediction of E. coli concentrations in agricultural pond waters: application and comparison of machine learning algorithms

Author
Stocker, Matthew - ORISE Fellow
Pachepsky, Yakov
Hill, Robert - University of Maryland

Submitted to: Frontiers in Artificial Intelligence
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 12/13/2022
Publication Date: 1/11/2022
Citation: Stocker, M., Pachepsky, Y.A., Hill, R. 2022. Prediction of E. coli concentrations in agricultural pond waters: application and comparison of machine learning algorithms. Frontiers in Artificial Intelligence. https://doi.org/10.3389/frai.2021.768650.
DOI: https://doi.org/10.3389/frai.2021.768650

Interpretive Summary: The microbial quality of irrigation water is a public health concern in produce growing. The bacterium E. coli is the widely accepted indicator of microbial contamination, and monitoring it is mandated. E. coli concentrations vary across irrigation ponds. Predicting this variation from more easily obtained water quality parameters is attractive, but the complex relationships between E. coli and other water quality parameters make such predictions difficult. Machine learning algorithms have been shown to capture complex relationships between parameters in natural systems. We applied several popular algorithms to describe relationships between E. coli and various water quality parameters in two intensively monitored irrigation ponds in Maryland. The random forest algorithm performed best, and the accuracy of E. coli predictions based on inexpensive sensor measurements did not improve when more tedious and expensive measurements were added. These results will be useful to agricultural enterprises that irrigate produce with pond water, because microbial water quality monitoring can be improved with readily available water sensing technology.

Technical Abstract: The microbial quality of irrigation water is an important issue because contaminated waters have been linked to several foodborne outbreaks. To expedite microbial water quality determinations, many researchers have turned to estimating concentrations of the microbial contamination indicator E. coli from the concentrations of chemical water quality parameters. However, these relationships were mostly non-linear and changed above or below certain thresholds. Machine learning (ML) algorithms have been shown to make accurate predictions in datasets with complex relationships. The purpose of this work was to evaluate several ML models for the prediction of E. coli in agricultural pond waters. Two ponds in Maryland were monitored from 2016 to 2018 during the irrigation season. E. coli concentrations and 12 other water quality parameters were measured in water samples. The resulting datasets were used to predict E. coli using stochastic gradient boosting machines, random forest, support vector machines, and k-nearest neighbor algorithms. In almost all cases, the random forest model provided the lowest RMSE for predicted E. coli concentrations in both ponds, in individual years and over consecutive years. The RMSE values for the random forest model using the 3-year dataset were 0.334 and 0.381 for Pond 1 and Pond 2, respectively. For individual years, RMSE ranged from 0.244 to 0.346 at Pond 1 and from 0.304 to 0.418 at Pond 2. All RMSE values are in log10 E. coli concentrations. In most cases, there was no significant difference (P > 0.05) between the RMSE of the random forest and those of the other ML models when the RMSE values were treated as statistics derived from 10-fold cross-validation performed with five repeats. The important E. coli predictors were turbidity, dissolved organic matter content, specific conductance, chlorophyll concentration, and temperature. Model predictive performance did not differ significantly when 5 predictors were used versus 8 or 12, indicating that more tedious and costly measurements bring no substantial improvement in ML predictive accuracy.
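
The evaluation scheme named in the abstract (RMSE on log10 E. coli concentrations, collected over 10-fold cross-validation with five repeats) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the choice of k = 5 neighbors, the random seeds, and the synthetic two-predictor dataset are all assumptions made for illustration, and a plain k-nearest neighbor regressor stands in for the four algorithms compared in the study.

```python
import math
import random


def rmse_log10(observed, predicted):
    """RMSE between observed and predicted values already expressed in log10 units."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def repeated_kfold_indices(n_samples, n_splits=10, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each repeat reshuffles and re-partitions the data."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_splits):
            test_idx = order[fold::n_splits]  # every n_splits-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, test_idx


def knn_predict(train_X, train_y, x, k=5):
    """Average the targets of the k training rows closest to x (Euclidean distance)."""
    nearest = sorted((math.dist(row, x), y) for row, y in zip(train_X, train_y))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validated_rmse(X, y, k=5, n_splits=10, n_repeats=5, seed=0):
    """Return one RMSE per held-out fold: n_splits * n_repeats values in total."""
    scores = []
    for train_idx, test_idx in repeated_kfold_indices(len(y), n_splits, n_repeats, seed):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        preds = [knn_predict(train_X, train_y, X[i], k) for i in test_idx]
        scores.append(rmse_log10([y[i] for i in test_idx], preds))
    return scores


# Synthetic stand-in data (NOT the study's measurements): two predictors already
# scaled to [0, 1] and a made-up log10 response with Gaussian noise. Real sensor
# data in mixed units would need standardizing before a distance-based model.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(100)]
y = [1.0 + 2.0 * a + 0.5 * b + data_rng.gauss(0, 0.1) for a, b in X]

scores = cross_validated_rmse(X, y)
mean_rmse = sum(scores) / len(scores)
print(f"{len(scores)} fold RMSEs, mean = {mean_rmse:.3f} log10 units")
```

Each model evaluated this way yields 50 fold-level RMSE values, which is the kind of sample the abstract describes treating as statistics when comparing models. The specific significance test used in the study is not stated here, so none is sketched.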