Phase 1: Study of available AI equipment, techniques, and architectures. Identification of limitations, constraints, and opportunities for improving the technology to be adopted, in relation to the specific project requirements. Initiation of the database collection process.

D1.1. The deliverable documented software module requirements, datasets, and performance metrics, aligning with project goals.

D1.2. The deliverable comprehensively evaluated state-of-the-art performance in incomplete-sentence detection; satire, sarcasm, and irony detection; author attribution; toponym geolocation; numeral classification; multi-domain paraphrasing; and readability analysis, in support of the project objectives.

D1.3. The deliverable analyzed data protection regulations, identified software risks, proposed mitigation strategies, and evaluated security measures, offering a comprehensive guide for safe practices and compliance with standards.

D1.4. The activity explored NLP solutions for the various software modules involved, tailored specifically to the Romanian language. It clarified requirements and optimized approaches, laying a solid foundation for developing robust, scalable systems in the next project phase.

D1.5. The activity reviewed literature and databases on automatic lip reading, assessed resource limitations, and outlined requirements for enhanced datasets to advance visual speech recognition systems.

D1.6. The activity encompassed designing the user interface and back-end components and planning the software requirements for the AI-driven modules, producing a functional prototype of the RoNLP system.

D1.7. The activity analyzed annotated databases for various software modules, evaluated existing resources, identified gaps, and proposed directions for creating new datasets to support advanced applications and future development.

Phase 2: Continued Data Collection. Design, Development, and Implementation of Innovative Algorithms and Software Modules to Be Integrated into the Final Solution

The activities corresponding to subcomponents A.2.1–A.2.15 were carried out and completed. These activities consisted of expanding and consolidating the annotated datasets required for developing the AI capabilities and of developing the software modules, and marked the start of work that will be finalized in Phase 3 of the project. The following deliverables have been completed:

D2.1. The activity produced the RoVSR corpus, an annotated Romanian audio–video dataset for lip-reading. Source materials were collected, standardized, preprocessed, synchronized, and transcribed, with manual verification on a test subset. The final dataset includes over 100 hours of high-quality, demographically diverse recordings, fully meeting project requirements and ready for developing visual speech recognition systems.

D2.2. The activity expanded and improved the Romanian authorship-recognition corpus. The team performed advanced cleaning, metadata extraction, and manual verification for accuracy and consistency. The final standardized and validated dataset provides a reliable basis for developing and evaluating authorship identification algorithms for the final software solution.

D2.3. The activity produced large, high-quality Romanian datasets annotated for satire, sarcasm, and irony. The final standardized corpus includes tens of thousands of positive and neutral examples across the three categories, exceeding project requirements and providing a robust linguistic resource for developing and evaluating AI models for figurative language detection.

D2.4. This activity produced a standardized annotated dataset for resolving ambiguous Romanian toponyms extracted via NER. After collecting the corpus and automatically identifying location entities, annotators manually validated ambiguous cases across five defined ambiguity types. The final JSON-formatted dataset meets all quality criteria and ensures full compatibility with machine-learning workflows.
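To illustrate the kind of record such a JSON-formatted dataset might contain, the sketch below shows one annotated ambiguous toponym. The field names, ambiguity label, and coordinates are illustrative assumptions, not the project's actual schema.

```python
import json

# Hypothetical record for one manually validated ambiguous toponym.
# "Roman" is both a Romanian city and a common noun, so it is a
# natural example of an ambiguity that annotators must resolve.
record = {
    "text": "Conferinta a avut loc la Roman, in judetul Neamt.",
    "toponym": "Roman",
    "span": [25, 30],                      # character offsets in "text"
    "ambiguity_type": "toponym-vs-common-noun",  # one of the defined types
    "resolved_location": {"name": "Roman, Neamt", "lat": 46.93, "lon": 26.93},
    "validated": True,                     # confirmed by a human annotator
}

# One JSON object per line keeps the dataset streamable for ML pipelines.
line = json.dumps(record, ensure_ascii=False)
```

Storing one object per line (JSON Lines) is a common choice for machine-learning workflows, since records can be streamed without loading the whole corpus.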

D2.5. The activity collected diverse Romanian audio recordings—including specialized jargon and regionalisms—automatically transcribed them, and validated them manually. The resulting JSON-formatted dataset meets ASR integration needs, and the final evaluation informed future improvements and infrastructure plans for expanding the audio corpus.

D2.6. The activity delivered a complete, scalable text-analysis pipeline integrating ingestion, preprocessing, segmentation, vectorization, semantic search, lexical analysis, and interpretation. The system identifies truncated text segments and produces standardized, auditable reports ready for integration into the project infrastructure.
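A pipeline of this shape can be sketched as a sequence of composable stages. The stage names below mirror the ones listed above, but the implementations are deliberately minimal placeholders, not the delivered system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Minimal sketch of a staged text-analysis pipeline: each stage takes a
# document dict and returns it enriched with new fields.
@dataclass
class Pipeline:
    stages: List[Callable[[Dict], Dict]] = field(default_factory=list)

    def add(self, stage: Callable[[Dict], Dict]) -> "Pipeline":
        self.stages.append(stage)
        return self

    def run(self, doc: Dict) -> Dict:
        for stage in self.stages:
            doc = stage(doc)
        return doc

def preprocess(doc: Dict) -> Dict:
    # Normalize whitespace before segmentation.
    doc["text"] = " ".join(doc["text"].split())
    return doc

def segment(doc: Dict) -> Dict:
    doc["segments"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def flag_truncated(doc: Dict) -> Dict:
    # Toy heuristic: a document whose raw text lacks final punctuation
    # is flagged as a truncation candidate for the report.
    doc["truncated"] = not doc["text"].rstrip().endswith(".")
    return doc

report = Pipeline().add(preprocess).add(segment).add(flag_truncated).run(
    {"text": "Prima propozitie. A doua propozitie"}
)
```

Keeping each stage a pure function over a document dict makes the pipeline easy to audit, since every stage's output can be logged and inspected independently.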

D2.7. The activity delivered an AI module that automatically geolocates NER-extracted toponyms through normalization, multi-criteria disambiguation, and geospatial resolution. The system produces stable, structured outputs with coordinates and confidence scores, forming a robust basis for future contextual geolocation features in the project infrastructure.
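One way such multi-criteria disambiguation can work is by scoring each candidate location against several criteria and normalizing the scores into a confidence value. The candidate gazetteer, weights, and criteria below are made up for illustration and do not reflect the module's actual logic.

```python
# Toy gazetteer: each ambiguous toponym maps to its candidate locations.
CANDIDATES = {
    "Roman": [
        {"name": "Roman, Neamt, RO", "lat": 46.93, "lon": 26.93,
         "population": 69000, "country": "RO"},
        {"name": "Roman, BG", "lat": 43.15, "lon": 23.92,
         "population": 3000, "country": "BG"},
    ],
}

def geolocate(toponym: str, context_country: str = "RO"):
    """Rank candidates by a population prior plus a country-context bonus,
    and report the winner with a normalized confidence score."""
    scored = []
    for cand in CANDIDATES.get(toponym, []):
        score = cand["population"] / 100_000          # criterion 1: size prior
        if cand["country"] == context_country:
            score += 1.0                              # criterion 2: document context
        scored.append((score, cand))
    if not scored:
        return None
    total = sum(s for s, _ in scored)
    best_score, best = max(scored, key=lambda t: t[0])
    return {"name": best["name"], "lat": best["lat"], "lon": best["lon"],
            "confidence": round(best_score / total, 3)}
```

Normalizing against the sum of all candidate scores gives a confidence value that drops when several candidates are nearly tied, which is useful for flagging cases that need review.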

D2.8. The activity produced a Romanian-tailored module for transcribing and classifying numerals, capable of handling general numbers, monetary values, dates, and standardized codes (IBAN, ISBN, CUI, VIN). The system integrates Romanian-specific linguistic resources to achieve high accuracy in processing numeric information in text.
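As a sketch of how such numeral classification might be organized, the snippet below routes tokens by pattern and validates IBAN candidates with the standard ISO 13616 mod-97 check. The patterns are illustrative and far simpler than the module's full rule set.

```python
import re

def valid_iban(iban: str) -> bool:
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to 10..35, and require the result to be 1 mod 97."""
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

def classify_numeral(token: str) -> str:
    """Rough pattern-based classification of a numeric token."""
    if re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", token):
        return "IBAN" if valid_iban(token) else "code"
    if re.fullmatch(r"\d{1,2}\.\d{1,2}\.\d{4}", token):      # e.g. 12.05.2024
        return "date"
    if re.fullmatch(r"\d+([.,]\d+)?\s*(lei|RON|EUR)", token):
        return "monetary"
    if re.fullmatch(r"\d+([.,]\d+)?", token):
        return "number"
    return "other"
```

The mod-97 step matters because IBAN-shaped strings that fail the checksum should fall back to a generic "code" class rather than being mislabeled as bank accounts.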

D2.9. The activity produced an AI module for detecting satire, sarcasm, and irony in Romanian. A large corpus of manually annotated texts was used to train several machine-learning classifiers, all of which exceeded the required performance thresholds on independent test sets. These models were integrated into a containerized software module and validated in an on-premise environment to ensure stability and interoperability. The resulting system provides a robust, fully operational solution for detecting rhetorical figures across media and digital-content analysis workflows.
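Checking classifiers against a performance threshold on a held-out test set typically reduces to computing per-class precision, recall, and F1. The sketch below shows that computation on toy labels; the label names and threshold are assumptions for illustration.

```python
def f1_score(gold, pred, positive="sarcasm"):
    """Per-class F1 for one rhetorical-figure label on a test set."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy evaluation: two of three sarcasm predictions recovered.
gold = ["sarcasm", "neutral", "sarcasm", "neutral"]
pred = ["sarcasm", "neutral", "neutral", "neutral"]
score = f1_score(gold, pred)
```

In a validation workflow the computed score would simply be compared against the agreed threshold (e.g. `score >= 0.80`) before the model is accepted into the containerized module.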

D2.10. The activity delivered an authorship-identification module that extracts linguistic embeddings, compares them through several scoring methods, and generates confidence-ranked authorship predictions. After defining requirements, implementing the architecture, and validating performance on test sets, the module is fully documented and ready for integration into later project stages.
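The scoring step described above, comparing a query document's embedding against per-author reference embeddings and ranking authors by similarity, can be sketched as follows. Cosine similarity is one plausible scoring method; the vectors here are toy values, not real linguistic embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_authors(query_vec, author_vecs):
    """Return (author, score) pairs sorted by descending similarity;
    the scores serve as confidence estimates for the prediction."""
    scored = [(author, cosine(query_vec, vec))
              for author, vec in author_vecs.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)

# Toy query embedding compared against two candidate authors.
ranking = rank_authors(
    [0.9, 0.1, 0.3],
    {"author_A": [0.8, 0.2, 0.4], "author_B": [0.1, 0.9, 0.2]},
)
```

Returning the full ranked list rather than only the top author lets downstream consumers apply their own confidence cutoffs, which matches the confidence-ranked output described above.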