Intelligent clustering and classification of the gas-crude oil separation process using k-means and random forest: a data science based approach

Authors:

Rubén Darío Vega Mejía1, Natali Lisbeth Campos Rodríguez1, Omar José Sánchez Roca1, Cristhian Ronceros Morales2

1University of Oriente, School of Engineering and Applied Sciences (EICA), Department of Petroleum Engineering, Maturin, Venezuela
2Technological University of Peru, Department of Systems Engineering and Computer Science, Ica, Peru

Received: 25 August 2025
Revised: 13 November 2025
Accepted: 27 November 2025
Published: 15 December 2025

Abstract:

Applying a data science methodology, this study examined the separation of gas and crude oil at the QE2 compressor in the state of Monagas, Venezuela. A descriptive statistical analysis identified and removed the 2.22% of the data that was anomalous, revealing trimodal behavior for crude oil and bimodal behavior for gas. Both variables were skewed and had a coefficient of variation greater than 20%, and the coefficient of determination (R²) for the gas-crude ratio was 0.7645. The K-means algorithm was then applied and found four well-formed clusters. However, the Kruskal-Wallis test found no statistically significant differences between them, suggesting that the variability is due to different operating rules, crude types, or process errors rather than to clearly differentiated groups. Finally, a Random Forest model with one hundred trees was developed. The most significant tree achieved an accuracy of 0.9929. Its root node, with an initial Gini value of 0.725 (moderate impurity), was split into two branches. The branch with a crude oil value ≤ 1.15 thousand barrels of crude oil per day (MBNPD) showed superior performance, with a Gini value of 0.01, indicating near-perfect purity. This branch therefore classifies with high accuracy.
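The Gini values and the coefficient of variation quoted in the abstract can be made concrete with a minimal Python sketch. It is an illustration only, not the authors' implementation: the function names, the sample values, and the cluster labels are hypothetical, and only the 1.15 MBNPD threshold is taken from the abstract.

```python
from collections import Counter
from statistics import mean, stdev


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_k^2) over class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_split_gini(feature, labels, threshold):
    """Size-weighted Gini impurity of the two children of a threshold split."""
    left = [y for x, y in zip(feature, labels) if x <= threshold]
    right = [y for x, y in zip(feature, labels) if x > threshold]
    n = len(labels)
    return (len(left) * gini_impurity(left)
            + len(right) * gini_impurity(right)) / n


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return 100.0 * stdev(values) / mean(values)


# Hypothetical node contents (not data from the study).
balanced = [0, 1, 2, 3] * 25      # four equally represented clusters
nearly_pure = [0] * 199 + [1]     # one stray label among 200 samples

print(gini_impurity(balanced))               # 0.75, the maximum for 4 classes
print(round(gini_impurity(nearly_pure), 3))  # 0.01

# Hypothetical crude oil rates (MBNPD) with hypothetical cluster labels.
crude = [0.9, 1.0, 1.1, 1.15, 1.3, 1.6, 2.0, 2.4]
clusters = [0, 0, 0, 0, 1, 2, 3, 1]
print(weighted_split_gini(crude, clusters, threshold=1.15))  # 0.3125

# A coefficient of variation above 20% signals high relative dispersion.
print(coefficient_of_variation([10, 12, 8, 15, 5]) > 20)     # True
```

In this toy split, the branch at or below the threshold holds a single cluster and so has zero impurity, which mirrors the near-pure branch reported in the abstract. The remaining impurity comes entirely from the other branch.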

Keywords:

Artificial intelligence, Machine learning, Descriptive statistics, Oil production, Natural gas, Gini value

© 2025 by the authors. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)

Volume 10
Number 4
December 2025

How to Cite

R.D. Vega Mejía, N.L. Campos Rodríguez, O.J. Sánchez Roca, C. Ronceros Morales, Intelligent Clustering and Classification of the Gas-Crude Oil Separation Process Using K-Means and Random Forest: A Data Science Based Approach. Applied Engineering Letters, 10(4), 2025: 222-233.
https://doi.org/10.46793/aeletters.2025.10.4.4

More Citation Formats

Vega Mejía, R.D., Campos Rodríguez, N.L., Sánchez Roca, O.J., & Ronceros Morales, C. (2025). Intelligent Clustering and Classification of the Gas-Crude Oil Separation Process Using K-Means and Random Forest: A Data Science Based Approach. Applied Engineering Letters, 10(4), 222-233.
https://doi.org/10.46793/aeletters.2025.10.4.4

Vega Mejía, Rubén Darío, et al. “Intelligent Clustering and Classification of the Gas-Crude Oil Separation Process Using K-Means and Random Forest: A Data Science Based Approach.” Applied Engineering Letters, vol. 10, no. 4, 2025, pp. 222-233. https://doi.org/10.46793/aeletters.2025.10.4.4

Vega Mejía, Rubén Darío, Natali Lisbeth Campos Rodríguez, Omar José Sánchez Roca, and Cristhian Ronceros Morales. 2025. “Intelligent Clustering and Classification of the Gas-Crude Oil Separation Process Using K-Means and Random Forest: A Data Science Based Approach.” Applied Engineering Letters, 10 (4): 222-233. https://doi.org/10.46793/aeletters.2025.10.4.4

Vega Mejía, R.D., Campos Rodríguez, N.L., Sánchez Roca, O.J. and Ronceros Morales, C. (2025). Intelligent Clustering and Classification of the Gas-Crude Oil Separation Process Using K-Means and Random Forest: A Data Science Based Approach. Applied Engineering Letters, 10(4), pp. 222-233.
doi: 10.46793/aeletters.2025.10.4.4.