Projekt i implementacja modelu do oceny jakości kodu generowanego przez duże modele językowe

Poręba, Konrad

master

Alternative title

Design and Implementation of a Benchmarking Framework for Assessing Quality of Code generated by Large Language Models

Author

Poręba Konrad

Reviewer

Roman Adam

Mnich Michał

Advisor

Roman Adam

Date of defence

2025-07-14

Keywords in Polish

duże modele językowe, generowanie kodu, jakość kodu, metryki jakości, benchmark, analiza statyczna, testy jednostkowe, inżynieria oprogramowania, ewaluacja modeli, automatyczne programowanie

Keywords in English

large language models, code generation, code quality, quality metrics, benchmark, static analysis, unit testing, software engineering, model evaluation, automated programming

Language

Polish

Abstract in Polish

Wraz z dynamicznym rozwojem dużych modeli językowych (LLM – ang. Large Language Model) rośnie ich popularność w zastosowaniach związanych z generowaniem kodu oprogramowania. Mimo imponujących możliwości tych modeli, w porównaniu do wcześniejszych generacji modeli uczenia maszynowego i głębokiego, kwestia jakości generowanego przez nie kodu nie została jeszcze dostatecznie zbadana. Celem niniejszej pracy jest przegląd obecnie dostępnych benchmarków służących do oceny możliwości LLM-ów w generowaniu kodu, identyfikacja ich braków oraz zaprojektowanie i implementacja własnej propozycji modelu do benchmarkowania LLM-ów pod kątem jakości generowanego kodu. Opracowany system powinien umożliwiać definiowanie praktycznych wymagań aplikacji, które składają się z wielu plików, modułów i pakietów, połączonych odpowiednią strukturą zależności. Wymagania te powinny odzwierciedlać typowe zadania oraz format wymagań, jakie otrzymują inżynierowie oprogramowania przed rozpoczęciem implementacji. Dla każdego z proponowanych zadań, na podstawie wcześniej zdefiniowanych wymagań, należy przygotować zestaw testów funkcjonalnych i niefunkcjonalnych, napisanych przez człowieka, oraz zdefiniować metryki, jakie będą zbierane podczas statycznej analizy wygenerowanego kodu. Praca kończy się wykorzystaniem stworzonego benchmarku do zbadania charakterystyk jakości kodu aplikacji oraz testów jednostkowych generowanego przez kilka najpopularniejszych obecnie dużych modeli językowych, analizą uzyskanych wyników oraz przedstawieniem wniosków i propozycji dalszego rozwoju opisanego rozwiązania.

Abstract in English

With the rapid development of large language models (LLMs), their popularity in applications related to code generation is steadily increasing. Despite the impressive capabilities of these models compared to earlier generations of machine learning and deep learning models, the quality of the code they generate has not yet been sufficiently explored. The aim of this thesis is to review the currently available benchmarks for evaluating the code generation capabilities of LLMs, identify their shortcomings, and design and implement an original benchmarking framework for assessing the quality of LLM-generated code. The developed system should enable the definition of practical requirements for applications that consist of multiple files, modules, and packages connected by an appropriate dependency structure. These requirements should reflect the typical tasks and requirement formats that software engineers receive before beginning implementation. For each proposed task, a set of human-written functional and non-functional tests should be prepared based on the previously defined requirements, along with definitions of the metrics to be collected during static analysis of the generated application and unit test code. The thesis concludes by using the developed benchmark to examine the quality characteristics of application code and unit tests generated by several of the most popular large language models currently available, followed by an analysis of the results and a presentation of conclusions and proposals for further development of the described solution.

dc.abstract.en	With the rapid development of large language models (LLMs), their popularity in applications related to code generation is steadily increasing. Despite the impressive capabilities of these models compared to earlier generations of machine learning and deep learning models, the quality of the code they generate has not yet been sufficiently explored. The aim of this thesis is to review the currently available benchmarks for evaluating the code generation capabilities of LLMs, identify their shortcomings, and design and implement an original benchmarking framework for assessing the quality of LLM-generated code. The developed system should enable the definition of practical requirements for applications that consist of multiple files, modules, and packages connected by an appropriate dependency structure. These requirements should reflect the typical tasks and requirement formats that software engineers receive before beginning implementation. For each proposed task, a set of human-written functional and non-functional tests should be prepared based on the previously defined requirements, along with definitions of the metrics to be collected during static analysis of the generated application and unit test code. The thesis concludes by using the developed benchmark to examine the quality characteristics of application code and unit tests generated by several of the most popular large language models currently available, followed by an analysis of the results and a presentation of conclusions and proposals for further development of the described solution.	pl
dc.abstract.pl	Wraz z dynamicznym rozwojem dużych modeli językowych (LLM – ang. Large Language Model) rośnie ich popularność w zastosowaniach związanych z generowaniem kodu oprogramowania. Mimo imponujących możliwości tych modeli, w porównaniu do wcześniejszych generacji modeli uczenia maszynowego i głębokiego, kwestia jakości generowanego przez nie kodu nie została jeszcze dostatecznie zbadana. Celem niniejszej pracy jest przegląd obecnie dostępnych benchmarków służących do oceny możliwości LLM-ów w generowaniu kodu, identyfikacja ich braków oraz zaprojektowanie i implementacja własnej propozycji modelu do benchmarkowania LLM-ów pod kątem jakości generowanego kodu. Opracowany system powinien umożliwiać definiowanie praktycznych wymagań aplikacji, które składają się z wielu plików, modułów i pakietów, połączonych odpowiednią strukturą zależności. Wymagania te powinny odzwierciedlać typowe zadania oraz format wymagań, jakie otrzymują inżynierowie oprogramowania przed rozpoczęciem implementacji. Dla każdego z proponowanych zadań, na podstawie wcześniej zdefiniowanych wymagań, należy przygotować zestaw testów funkcjonalnych i niefunkcjonalnych, napisanych przez człowieka, oraz zdefiniować metryki, jakie będą zbierane podczas statycznej analizy wygenerowanego kodu. Praca kończy się wykorzystaniem stworzonego benchmarku do zbadania charakterystyk jakości kodu aplikacji oraz testów jednostkowych generowanego przez kilka najpopularniejszych obecnie dużych modeli językowych, analizą uzyskanych wyników oraz przedstawieniem wniosków i propozycji dalszego rozwoju opisanego rozwiązania.	pl
dc.affiliation	Wydział Matematyki i Informatyki	pl
dc.area	obszar nauk ścisłych	pl
dc.contributor.advisor	Roman, Adam - 142015	pl
dc.contributor.author	Poręba, Konrad - USOS309300	pl
dc.contributor.departmentbycode	UJK/WMI2	pl
dc.contributor.reviewer	Roman, Adam - 142015	pl
dc.contributor.reviewer	Mnich, Michał - 152762	pl
dc.date.accessioned	2025-07-15T22:56:29Z
dc.date.available	2025-07-15T22:56:29Z
dc.date.createdat	2025-07-15T22:56:29Z	en
dc.date.submitted	2025-07-14	pl
dc.fieldofstudy	informatyka	pl
dc.identifier.apd	diploma-174646-309300	pl
dc.identifier.uri	https://ruj.uj.edu.pl/handle/item/557370
dc.language	pol	pl
dc.subject.en	large language models, code generation, code quality, quality metrics, benchmark, static analysis, unit testing, software engineering, model evaluation, automated programming	pl
dc.subject.pl	duże modele językowe, generowanie kodu, jakość kodu, metryki jakości, benchmark, analiza statyczna, testy jednostkowe, inżynieria oprogramowania, ewaluacja modeli, automatyczne programowanie	pl
dc.title	Projekt i implementacja modelu do oceny jakości kodu generowanego przez duże modele językowe	pl
dc.title.alternative	Design and Implementation of a Benchmarking Framework for Assessing Quality of Code generated by Large Language Models	pl
dc.type	master	pl
dspace.entity.type	Publication

