
\section{Introduction}
\label{sec:intro}

\IEEEPARstart{D}{igital} imaging technologies enable capturing the real world and recreating it at other times or in other places. This has been extended to 3D scenes with the help of computer graphics and photogrammetry techniques. We can now capture 3D objects and scenes using purely RGB cameras~\cite{pages2018affordable} or RGB cameras with additional sensors~\cite{collet2015high}. Two representations are commonly used for 3D models: coloured point clouds and textured 3D meshes~\cite{zerman2020vsensevvdb2}.
In this paper, we focus on point clouds, which are crucial for many applications such as augmented and virtual reality.

Point clouds are typically captured via camera arrays, LiDAR sensors, and cameras.
The resulting volume of data is extremely large and compression becomes essential for transmission and storage.
However, evaluating compression algorithms requires assessing the quality of point clouds distorted by compression relative to the original point clouds.
While research into this area has been recently expanding~\cite{alexiou2023subjective}, there are still many open questions and problems.
Specifically, we aim to enable the evaluation of learning-based compression methods and the creation of learning-based quality metrics by providing a subjective dataset of sufficient scale, consistency, and diversity.
Until now, existing datasets have been too small, inconsistent (with respect to content normalization, rendering conditions, etc.), or insufficiently diverse (in terms of semantics or the number of source contents) for research into learning-based approaches.

The contributions of this work are threefold and can be listed as follows:
\begin{itemize}
    \item We present, and make publicly available (partly, during the ICIP 2023 Point Cloud Visual Quality Assessment Grand Challenge), a broad point cloud quality assessment database comprising 75 unique contents that are semantically meaningful for a telepresence scenario, described in Section~\ref{sec:database},
    \item We compare the performances of various state-of-the-art methods for point cloud compression (Section~\ref{sec:subjective}), and
    \item We provide a complete benchmark of the state-of-the-art point cloud quality metrics, including both point-based and rendering-based assessment (Section~\ref{sec:objective}).
\end{itemize}

The created BASICS database is made publicly available under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0) license to support further research in the field. During the ICIP 2023 Grand Challenge, part of the dataset will be accessible to the grand challenge participants via CodaLab registration (see the grand challenge website~\cite{codalabChallengeSite}).




\section{Why is There a Need for Another Database?}
\label{sec:motivation}


In this section, we delve deeper into the following question: ``Why is there a need for another point cloud quality dataset?''
First, it is important to note that learning-based point cloud compression
has been a strong focus of research in the last few years.
It is thus important to accurately assess the quality of point clouds distorted by such learning-based compression methods.
In addition, learning-based point cloud quality assessment
approaches have also been explored recently.
However, existing datasets lack several qualities needed to enable the evaluation of learning-based methods and research into learning-based quality assessment approaches.
 






The existing datasets lack diversity. Specifically, they exhibit one or more of the following issues:
\begin{itemize}
    \item Use of the same point clouds across datasets.
    \item Lack of variety in terms of geometric complexity and semantic categories.
    \item Use of the same compression algorithms; consequently, metrics overperform on these distortions while failing on novel distortions (e.g., learning-based coding distortions).
\end{itemize}
This lack of diversity, combined with the small scale of the existing datasets, makes them unsuitable for learning-based quality assessment.
In addition, the lack of learning-based compression approaches in these datasets leads to novel metrics failing for these methods.

The existing datasets also exhibit critical issues relating to data availability:
\begin{itemize}
    \item Source point clouds unavailable due to copyright issues (Table~\ref{tab:public_availability}).
    \item Unavailable raw scores, standard deviations, and/or confidence intervals (Table~\ref{tab:public_availability}).
\end{itemize}
The missing data impedes research into more sophisticated metric development methods.

In addition, the existing datasets lack consistency.
Specifically:
\begin{itemize}
    \item Point clouds are not normalized, which impedes analysis and rendering. In this dataset, all point clouds are voxelized with 10-bit quantization. As a result, all coordinates are integers ranging between 0 and 1023.
    \item Point cloud rendering is inconsistent, which distorts results depending on the compression method. In this dataset, all points are rendered using cubes spanning the volume of their associated voxel. This has an especially noticeable effect for octree compression (GPCC), as the resulting rendering is watertight at all compression levels.
    \item Rendering is unstable in most other datasets. In this dataset, rendering voxelized point clouds with cubes guarantees flicker-free rendering, as the volumes are non-overlapping.
\end{itemize}
The lack of consistency in existing datasets makes them impractical for analysis and research.


\input{tables/RelatedWorkOverview.tex}

\input{tables/public_availability.tex}



In summary, a new dataset is required as existing datasets are lacking in one or more aspects,
namely diversity, scale, and consistency.
We propose a new dataset that fulfills all of these requirements, which are essential for learning-based use cases.
Specifically, this dataset enables better research into learning-based quality assessment.
In addition, it provides crucial insight into the behavior of point cloud quality metrics when applied to learning-based compression algorithms.


In particular, the proposed BASICS database aims to provide a foundation for research that supports telepresence applications, in terms of compression and quality assessment.





\section{The BASICS Database}
\label{sec:database}

Imagine that you are a researcher or a developer who wishes to develop a quality metric using learning-based approaches. Or, you might be seeking to validate your quality metric. In order to provide means for both of these scenarios, and to remedy the limitations discussed in the previous section, we generate the BASICS database. In the following, we describe various stages of the database generation procedure. 


\subsection{Material selection}
\label{subsec:database_materials}

Our motivation in this work, as previously mentioned, is to develop a semantically diverse point cloud quality assessment database for a telepresence application in which a real scene is captured, compressed, transmitted, and displayed using point clouds. In almost all applications of telepresence, humans are the main subject of the communication. Therefore, it is important to capture the point clouds that represent humans. Additionally, pets or other animals can be part of the scene in some applications. Inanimate objects will be in the scene, and they can also be a topic of work meetings (e.g., designing objects) or education applications (e.g., museum artifacts). Lastly, buildings or background landscape can also be part of the contents in telepresence applications. 

Considering the above discussion, we identify fundamental building blocks and corresponding semantic categories for point cloud contents to include in the database. Three main categories are selected for prospective 3D models for the reference content: \textit{(i)} humans \& animals, \textit{(ii)} inanimate objects, and \textit{(iii)} buildings \& landscapes. Screenshots of sample models from each category are shown in Fig.~\ref{fig:DBcategories}.

\input{figures/DBcategories}

For the data collection and material selection, we aimed to collect as many publicly available and redistributable point clouds as possible. However, many data sources restrict redistribution. Therefore, we acquired the 3D models from two sources: collaborator studios (i.e., V-SENSE studio and XD Productions) and an online repository for 3D model sharing called SketchFab\footnote{https://www.sketchfab.com/}. Even in this case, few point cloud sources were available. Therefore, we gathered 3D meshes and generated point clouds by sampling the mesh surface (cf. Section~\ref{subsec:database_preprocess}).

In total, 104 models were handpicked by three authors of this paper considering the semantic categories described above. After eliminating very similar materials, models with non-ideal characteristics (e.g., highly reflective material, imperfect texturing, etc.), and the least relevant semantic categories, the total number of models was reduced to 75.




\subsection{Pre-processing}
\label{subsec:database_preprocess}

To exclude other possible sources of distortion, the collected 3D models needed to be pre-processed and converted into point clouds before any further processing, as these models were in different formats.

Models that were already in point cloud PLY format did not need too much attention, except for voxelization. 3D meshes, on the other hand, needed to go through several steps. These steps are further discussed below. 

\subsubsection{Making 3D meshes uniform}
Among the collected 3D mesh models, some were in OBJ format and others were in FBX format. Using Blender\footnote{https://www.blender.org/} and Meshlab\footnote{https://www.meshlab.net/}, all 3D meshes were converted into OBJ format.

\subsubsection{Cleaning 3D meshes}
Some of the 3D meshes had parts with transparent or reflective properties (e.g., glasses in some models). Other meshes had parts whose reconstruction was incomplete (e.g., trees, some of the building fa\c{c}ades), which would decrease the users' quality of experience and introduce other sources of distraction and distortion. To avoid such effects, these parts were removed or cleaned in Blender.

Following this, the mesh files were unified into a single OBJ file, so that the sampling process in the pipeline could be done with ease. Next, the material properties (which are described in the .mtl files) were checked to eliminate any other reflective properties of the materials, which could not be reproduced correctly in the point cloud format. After all these operations, the 3D meshes were ready for the point cloud sampling step.

\subsubsection{Sampling point clouds}
Using CloudCompare\footnote{https://www.cloudcompare.org/main.html}, point clouds were sampled from the 3D meshes' surfaces using ... command. During this operation more than 5 million points were sampled on the surfaces of the said meshes. The sampling operation extracted the location, color, and normal attributes for each point in the PCs. At the end of this stage, all 3D models were in or converted to point cloud format.

\subsubsection{Point cloud voxelization}

We perform point cloud voxelization using 10-bit quantization.
That is, the spatial coordinates are normalized such that they are integers between 0 and 1023.
This has two main advantages. First, the coordinates lie in a range that is predictable for both point cloud processing and rendering. Second, voxelized coordinates can be combined with cube-based rendering to improve the stability, predictability, and quality of renderings.
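
The normalization described above can be sketched as follows. This is only a minimal illustration, not the exact procedure used to build the database: the function name \texttt{voxelize}, the uniform scaling by the largest bounding-box extent, the rounding to the nearest integer, and the merging of duplicate voxels are all assumptions, and attributes such as color are ignored.

```python
def voxelize(points, bits=10):
    """Quantize 3D points to integer voxel coordinates in [0, 2**bits - 1].

    Hypothetical helper: scaling and rounding choices are assumptions.
    """
    if not points:
        return []
    max_coord = (1 << bits) - 1  # 1023 for 10-bit quantization
    # Per-axis minimum and the largest bounding-box extent over all axes.
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    extent = max(hi - lo for lo, hi in zip(mins, maxs))
    # A single uniform scale preserves the aspect ratio of the model.
    scale = max_coord / extent if extent > 0 else 0.0
    voxels = set()
    for p in points:
        voxels.add(tuple(
            min(max_coord, int(round((p[a] - mins[a]) * scale)))
            for a in range(3)
        ))
    # Points that fall into the same voxel are merged into one.
    return sorted(voxels)
```

With these assumptions, a model whose largest extent is 4 units maps the coordinate 4 to 1023 and the coordinate 0 to 0, and all outputs are integers in the 10-bit range.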






\subsection{Compression}
\label{subsec:database_compression}

As mentioned above, the main goal for the BASICS database is to provide a foundation for further point cloud compression and point cloud quality assessment research, especially for the telepresence use case. In this use case, capture and display need to be real-time. Therefore, compression is of great importance.

For preparing the processed point clouds (PPC), the following compression methods are selected: the octree-based compression method MPEG GPCC~\cite{graziosi2020overview}\footnote{github.com/MPEGGroup/mpeg-pcc-tmc13/releases/tag/release-v14.0}, the video-based compression method MPEG VPCC~\cite{graziosi2020overview}\footnote{github.com/MPEGGroup/mpeg-pcc-tmc2/releases/tag/release-v15.0}, and a learning-based compression method GeoCNN~\cite{quach2020improved}\footnote{\url{github.com/mauriceqch/pcc_geo_cnn_v2}}. GPCC and VPCC are selected as they are the MPEG standardization efforts, which concentrate the expert knowledge of the standardization field. GeoCNN was selected to represent more recent learning-based algorithms, as it was one of the best performing learning-based compression methods at the time of dataset preparation.


Geometry-based point cloud compression (GPCC) focuses on encoding the point cloud directly in the 3D space~\cite{graziosi2020overview} using octree or trisoup (triangle soup) methods. The attributes (such as color) can be coded using either the Region Adaptive Hierarchical Transform (RAHT) or the Predicting/Lifting (PredLift) transform. Video-based point cloud compression (VPCC) instead uses a different approach and is focused on compression of dynamic point clouds (i.e., point clouds changing in time). VPCC projects the point cloud content onto a depth map and a texture map and uses a state-of-the-art video encoder (e.g., HEVC) to compress the PCs. GeoCNN~\cite{quach2020improved} compresses voxelized point clouds by first performing block partitioning.
Then, each block is passed to a variational autoencoder~\cite{quach2019learning} where the encoder transforms the input binary occupancy voxel grid to a latent space.
The latent space is then quantized and entropy coded using a learned entropy model.
After entropy decoding the bitstream, the latent space is transformed back to a voxel grid containing predicted occupancy probabilities.
The probabilities are then thresholded to binary values which yields the decoded block.
Combining the decoded blocks yields the entire decompressed point cloud.
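
The last two decoding steps can be illustrated with a short sketch. It assumes a fixed threshold of 0.5 and a nested-list layout for the predicted probabilities; both are illustrative choices rather than the actual settings of GeoCNN, and the names \texttt{decode\_block} and \texttt{merge\_blocks} are hypothetical.

```python
def decode_block(probabilities, block_origin, threshold=0.5):
    """Turn predicted occupancy probabilities of one block into points.

    probabilities: nested list indexed as [x][y][z] with values in [0, 1].
    block_origin: integer (x, y, z) offset of the block in the voxel grid.
    The fixed threshold is an assumption made for illustration only.
    """
    ox, oy, oz = block_origin
    points = []
    for x, plane in enumerate(probabilities):
        for y, row in enumerate(plane):
            for z, p in enumerate(row):
                if p >= threshold:  # voxel is considered occupied
                    points.append((ox + x, oy + y, oz + z))
    return points


def merge_blocks(decoded_blocks):
    """Concatenate the points of all decoded blocks into one point cloud."""
    merged = []
    for block_points in decoded_blocks:
        merged.extend(block_points)
    return merged
```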

In this work, we used GPCC-Octree-RAHT, GPCC-Octree-Predlift, VPCC, and GeoCNN for compression of the PCs. The details regarding the compression parameters will be provided after the ICIP 2023 Grand Challenge is completed.








\section{Subjective quality assessment}
\label{sec:subj_experiment}

We conducted a large-scale subjective experiment on the Prolific~\cite{prolific} crowdsourcing platform with more than 3000
participants. More than 1200
stimuli from 75 original point clouds were generated for the experiment, wherein each processed point cloud (PPC) was evaluated by around 60 unique participants on average.
This section describes the details of the crowdsourcing study.

Subjective quality assessment of point cloud content can be categorized into two paradigms: interactive and passive~\cite{alexiou2023subjective}. In the interactive paradigm, observers are free to inspect the point cloud from any point of view without restriction, often through an augmented reality or virtual reality application. In the passive approach, point clouds are rendered along a predefined camera trajectory and shown as a traditional video. Although both paradigms have their own advantages and disadvantages, there is no statistically significant difference between the subjective opinions collected with each~\cite{vioal2022interactiveVSpassive}. In order to minimize the variance between observer opinions and allow more practical data collection through crowdsourcing, we adopted the passive approach~\cite{nehme2021crowdsourcingreliabilityMeshes}.





\subsection{Methodology}
\label{subsec:subjective_methodology}

Several methodologies can be found in the literature and in recommendations for the subjective quality assessment of traditional images and video sequences~\cite{itu2019recBT500}. Commonly used methodologies include, but are not limited to, Absolute Category Rating (ACR), Double Stimulus Impairment Scale (DSIS), and Two-Alternative Forced Choice (2AFC) or Pairwise Comparison (PC). Several studies compared the accuracy and reliability of each methodology for varying multimedia content. For traditional images and videos, Mantiuk et al. report that the PC methodology tends to be more accurate due to its straightforward experiment procedure, and that there is no statistically significant difference between the ACR and DSIS methodologies~\cite{mantiuk2012SubjMethodComp2D}. However, despite the simplicity of the task, the PC methodology may require far more comparisons, since the number of possible pairs grows quadratically with the number of test conditions, and may become impractical with a high number of test conditions~\cite{zerman2018pcCrosscontent}. On the other hand, the recent study by Nehme et al.~\cite{nehme2019SubjMethodComparison} suggests that the DSIS method is more accurate than ACR for 3D graphical content. It is suggested that, in ACR experiments, participants who are unfamiliar with the pristine models are not able to discriminate all types of distortions. The DSIS methodology leads to a more accurate evaluation by presenting the reference and the distorted model prior to rating. Therefore, we utilized the DSIS methodology with a side-by-side presentation as suggested by Nehme et al.~\cite{nehme2019SubjMethodComparison}.




\subsection{Generating Visual Stimuli}
\label{subsec:subjective_render}


In a voxelized point cloud, points have a one-to-one mapping to a voxel as part of a voxel grid.
Building on this property, each point is rendered as a cube spanning the volume of its voxel.
This is different from common ``point'' (OpenGL point primitive) based rendering, which renders points as screen-aligned squares of a given window-space size.
The main issue is that the size is given in window space; thus, when zooming in, the points become smaller relative to the object and the point cloud appears sparser.
In addition, point-based rendering causes flicker artifacts due to spatial overlaps, especially during perspective changes.
Cube-based rendering corrects these issues and enables watertight rendering from all perspectives.

Moreover, cube-based rendering has a particular relationship with octree-based compression.
A typical octree compression algorithm
represents the point cloud using an octree, providing a natural decomposition into levels of detail.
Specifically, at each octree subdivision, each occupied voxel is subdivided into eight equal voxels and their occupancies are then encoded.
Typically, a desired level of detail is selected and each occupied voxel at this level of detail is transformed into a point.
However, this neglects that each point actually corresponds to a volume.
Using this property, rendering each point as a cube spanning this volume greatly improves rendering results for octree methods.
In particular, a watertight voxel point cloud remains watertight regardless of the level of detail.
Compared to the previous approach, point clouds look visually ``blockier'' rather than sparser, which preserves the visual continuity of the rendered objects.

In practice, we specify cube sizes.
For octree-based methods, the size of the cube is defined based on the number of removed octree levels $n_r$.
With $n_l$-bit quantization, the octree has at most $n_l$ levels, so $0 \le n_r \le n_l$.
Thus, we specify the size of the cube as $2^{n_r}$: that is, a size of 1 when lossless, a size of 2 when removing one octree level, a size of 4 when removing two octree levels, etc. For other methods, the cube size is determined empirically and manually by a subset of the authors in a pilot test to ensure that the output looks watertight.
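
The sizes listed above (1, 2, 4, and so on) double with every removed octree level, while the number of voxels per axis halves. A minimal sketch of this relationship is given below; the helper names are hypothetical, and the 10-bit quantization of this dataset is used as the default.

```python
def cube_size(n_removed, n_levels=10):
    """Edge length of the rendered cube after removing octree levels.

    n_removed: number of removed octree levels (0 means lossless geometry).
    n_levels: quantization bit depth, i.e. the maximum number of levels.
    """
    if not 0 <= n_removed <= n_levels:
        raise ValueError("n_removed must lie between 0 and n_levels")
    return 2 ** n_removed


def grid_resolution(n_removed, n_levels=10):
    """Number of voxels per axis that remain at the selected level."""
    if not 0 <= n_removed <= n_levels:
        raise ValueError("n_removed must lie between 0 and n_levels")
    return 2 ** (n_levels - n_removed)
```

The product of the two values is always $2^{n_l}$, which is why cubes of this size tile the original voxel grid without gaps or overlaps.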


A helix-like rendering trajectory is used, as visualized in Figure~\ref{fig:rendering_trajectory}. The front direction of each point cloud was assigned manually, and the rendering trajectory always starts from the assigned front side. A small overlap is adopted between the start and end points of the trajectory to ensure that the front side of the point cloud is seen either at the beginning or at the end of the rendered video. Certain point clouds are unnatural to observe from lower angles (e.g., landscapes, buildings). Therefore, after a pilot test, each point cloud was manually assigned to one of the following categories: low, mid, high. This category determines the starting elevation of the rendering trajectory. While moving along the rendering trajectory, the camera is always directed towards the point cloud center.
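
A camera path of this kind could be generated as in the sketch below. All numerical choices are assumptions made for illustration: a constant camera distance, a linear change of elevation, a 10-degree overlap, a y-up coordinate system with the front side facing the positive z axis, and 300 frames (10 seconds at 30 fps). The function name \texttt{helix\_trajectory} is hypothetical and the actual trajectory parameters are not specified here.

```python
import math


def helix_trajectory(center, radius, start_elev_deg, end_elev_deg,
                     n_frames=300, overlap_deg=10.0):
    """Return (camera_position, look_at) pairs along a helix-like path.

    The camera starts in front of the object (azimuth 0), turns once around
    it plus a small overlap, and changes elevation linearly on the way.
    """
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    total_azimuth = math.radians(360.0 + overlap_deg)
    frames = []
    for i in range(n_frames):
        t = i / (n_frames - 1)  # normalized progress in [0, 1]
        azimuth = t * total_azimuth
        elev = math.radians(
            start_elev_deg + t * (end_elev_deg - start_elev_deg))
        # Spherical to Cartesian conversion around the point cloud center.
        x = center[0] + radius * math.cos(elev) * math.sin(azimuth)
        y = center[1] + radius * math.sin(elev)
        z = center[2] + radius * math.cos(elev) * math.cos(azimuth)
        # The camera always looks at the point cloud center.
        frames.append(((x, y, z), tuple(center)))
    return frames
```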

\begin{figure}[!t]
    \centering
    \includegraphics[width=\columnwidth]{figures/rendering_trajectories.png}
    \caption{Visualization of the rendering trajectory from top and front views.}
    \label{fig:rendering_trajectory}
\end{figure}








\subsection{Test Procedure}
\label{subsec:subjective_procedure}


Earlier studies suggest that crowdsourcing experiments can be as accurate as laboratory experiments for various QoE tasks and with different experiment designs~\cite{goswami2021crowdsourcingreliabilityTMO, nehme2021crowdsourcingreliabilityMeshes}. To benefit from the wide participant pool and faster data collection, we utilized the Prolific~\cite{prolific} crowdsourcing platform to recruit participants and to conduct the subjective experiment. On Prolific, the participants are clearly informed that they are being recruited as part of a research study, and the requirements for the experiments are well balanced to benefit both sides: researchers and participants~\cite{palan2018prolificmainpaper}.

\begin{figure}[!t]
    \centering
    \includegraphics[width=\columnwidth]{figures/sample_test_screen.png}
    \caption{Sample screenshots from the experiment. Rendered point cloud videos were shown side-by-side (above), and each stimulus was followed by a voting screen (below).}
    \label{fig:sample_test_screen}
\end{figure}


\textbf{Test sessions \& Duration:} Due to the lack of supervision of participants during the experiment, the number of stimuli and the duration of the test in crowdsourcing settings should be kept much lower than in laboratory experiments. To this end, we split the experiment into 60 sessions, each containing 25 stimuli and 2 dummies. One dummy from the most highly compressed stimuli and one dummy from the least compressed stimuli were selected to be shown to every participant to create expectations about the range of distortions. Dummy stimuli were the same for every participant, and participants were not informed that these stimuli were shown for training purposes. In total, every participant rated 27 stimuli, each a 10-second video rendering. With unlimited voting time after each stimulus, test sessions lasted around 5 minutes 30 seconds on average. A sample screenshot from the experiment is shown in Figure~\ref{fig:sample_test_screen}.

\textbf{Participants \& Requirements:} We recruited 60 participants (50\% female, 50\% male) on average per session, for more than 3000
participants in total. Every participant was compensated for their time, and the age of the participants ranged from 18 to 70. Moreover, to ensure that all stimuli were shown as intended, participants were restricted to using selected browsers in full-screen mode at 1080p resolution. In addition, participants were required to have completed at least 200 submissions with a 100\% approval rate on Prolific.




\section{Subjective Experiment Results}
\label{sec:subjective}

This section presents the results of the analyses of the subjective quality scores. Section~\ref{subsec:content_ambiguity} investigates the content ambiguity for each source point cloud. A comparison of different methods to acquire MOS from individual observer opinions is presented in Section~\ref{subsec:mos_analysis}. Finally, the performance of the compression algorithms is analyzed in Section~\ref{subsec:compression_algorithm_performance}.


\begin{figure*}[!t]
    \centering
    \includegraphics[width=\linewidth]{figures/br_mos_plots2.png}
    \caption{Bit per point vs MOS plots for each SRC. Each point represents a PPC acquired from the SRC indicated at the title of each plot. Vertical axis of each plot is aligned to [1, 5] range and indicates the MOS of PPCs. Horizontal axis represents the bit per point for each PPC and the bit per point ranges are not necessarily the same for all SRCs. Since GeoCNN compresses only the geometry information, it is excluded from the analysis. }
    \label{fig:mos_comparisonPlots}
\end{figure*}



\subsection{Content Ambiguity}
\label{subsec:content_ambiguity}

Some contents can be more difficult to evaluate than others. Depending on the QoE scenario, various factors can lead to more or less ambiguous contents. In order to estimate the content ambiguity of the source point clouds in the dataset, we used the Netflix Sureal package~\cite{li2017netflixsureal}. We observe that the ambiguity of the point clouds is correlated with the visual quality of the source point cloud. In other words, fewer artefacts (due to acquisition, processing, etc.) in the source content lead to easier evaluation of the compression distortions. This phenomenon is in fact well known in the image and video quality domain.


To further explore the content ambiguity, we analyzed the correlation between the number of points and content ambiguity for each semantic category in the dataset.
We observe a linear correlation between the number of points and content ambiguity for the point clouds in the \textit{humans \& animals} category. However, the same conclusion cannot be drawn for the \textit{buildings \& landscapes} and \textit{inanimate objects} categories. This can be explained by the similar geometric complexity and physical size of the source point clouds in the \textit{humans \& animals} category. Due to the low variance in geometric complexity and physical size, the number of points often dictates the visual quality of the source point clouds. Therefore, we observe a greater correlation between content ambiguity and the number of points in this category. In contrast, the variance of geometric complexity and physical size is much higher in the \textit{buildings \& landscapes} and \textit{inanimate objects} categories. Therefore, the number of points alone cannot dictate the visual quality.
More details on content ambiguity will be provided after the ICIP 2023 Grand Challenge is concluded.






\subsection{Mean Opinion Scores}
\label{subsec:mos_analysis}


In order to analyze the validity of the collected subjective opinion scores and provide reliable MOS, we compared three methodologies to estimate MOS. First, Raw MOS is calculated by simply averaging all opinions for each PPC, without any observer screening. Second, BT500 MOS is calculated by following the recommendations in ITU-R BT.500-14 A1-2.3.1~\cite{itu2019recBT500}. Observer screening is applied once before calculating the MOS. Among the more than 3000
observers who participated in the subjective experiment, 47 were found to be outliers and omitted from the BT500 MOS calculation. As the last method, Netflix Sureal~\cite{li2017netflixsureal} was used to estimate the MOS by taking subject inconsistency and biases into account. All participants' opinions were included in the Sureal MOS estimation. Figure~\ref{fig:mos_comparison} presents the results as scatter plots between each pair of methodologies, together with the Pearson and Spearman rank-order correlation coefficients. The results clearly indicate that there is no significant difference between the three methodologies. This further confirms the validity of the Prolific participant pool and the experiment design. In the public repository of the dataset, we provide the raw opinion scores and the MOS acquired by following the ITU-R BT.500-14 recommendations.



\begin{figure}[!t]
    \centering
    \includegraphics[width=\columnwidth]{figures/mos_comparison.png}
    \caption{Comparison of MOS calculated with three different methodologies. Each plot compares a pair of the three methods. Higher MOS indicates higher visual quality. Raw represents the simple mean over the collected opinion scores without outlier detection. BT500 represents the MOS values after the BT.500-14~\cite{itu2019recBT500} observer screening step, and Sureal is the MOS estimated with Netflix Sureal~\cite{li2017netflixsureal}. Pearson and Spearman rank-order correlations are reported on each plot.}
    \label{fig:mos_comparison}
\end{figure}

Based on the MOS acquired by ITU-R BT.500-14 recommendations, the distribution of MOS for each SRC was checked, and it was observed that, for a given SRC, MOS distributions cover the whole quality range with few exceptions.
Note that the experiment was conducted with the DSIS methodology. Consequently, the acquired MOS are content-aware and equivalent to DMOS in an ACR-HR experiment methodology.




\subsection{Performance of compression algorithms}
\label{subsec:compression_algorithm_performance}

For completeness, the performance of the compression algorithms was also analyzed. For this analysis, rate-distortion curves have been plotted, as shown in Figure~\ref{fig:mos_comparisonPlots}. Bits per point is used for the rate, which is shown on the x-axes of the plots, and mean opinion scores are shown on the y-axes. To make comparisons easier, the bitrate axes were stretched to show the full extent of the bitrate range of each plot. This means that the plots are only meaningful within themselves and should not be compared to other plots.

The results show that, in general, VPCC performs better than GPCC, which supports the findings of earlier studies~\cite{zerman2020textured}. Between GPCC-RAHT and GPCC-Predlift, there seems to be no significant difference considering all the different SRC contents, even though GPCC-RAHT seems to yield slightly lower bitrates at the higher rate points. As attentive readers may notice, there is a slight decrease in the subjective quality for the highest bitrate of p45, which is caused by a hole in the final rendering, the cause of which is unknown.







\section{Objective Quality Assessment}
\label{sec:objective}

Point cloud objective quality metrics can be grouped into three categories according to the type of input they take: \textit{(i)} image-based, \textit{(ii)} color-based, and \textit{(iii)} geometry-based. Image-based metrics take the rendered point cloud image or image sequences as input and assess the quality of the point clouds. Geometry-based metrics rely only on the geometry information (e.g., the location in 3D space) stored at each point in the point cloud, ignoring the color attribute. Color-based metrics utilize both the geometry and the color information of each point to assess the point cloud quality. Moreover, each metric can be assigned to one of three classes based on the availability of reference point cloud information: full-reference (FR), reduced-reference (RR), and no-reference (NR). FR metrics access all information from the reference point cloud in addition to the distorted point cloud. RR metrics can access only partial information (features) from the reference point cloud. NR metrics assess the quality of the point cloud without any access to the reference point cloud.

In this section, we benchmark 14 image-based, 9 color-based, and 17 geometry-based metrics from the literature. The selected metrics are introduced in Section~\ref{subsec:selected_metrics}. The methodologies adopted to evaluate metric performance are introduced in Section~\ref{subsec:evaluation_criteria}. The results of these analyses are presented in Section~\ref{subsec:corr_analysis_results} and Section~\ref{subsec:krasulas_method_results}.


\subsection{Selected Metrics}
\label{subsec:selected_metrics}

For all image-based metrics, average pooling over 30 fps video renderings has been used to predict the final quality, as recommended in~\cite{ak2021temporalsampling}. Image-based metrics include simple measures such as MSE, PSNR, and SSIM~\cite{wang2004ssim}, and 11 other more sophisticated metrics. Feature similarity index (FSIM~\cite{zhang2011fsim}) and its color-dependent variant FSIMc~\cite{zhang2011fsim} are full-reference metrics that rely on phase congruency and gradient magnitude to quantify image quality locally and use phase congruency as a weighting function to obtain a single quality score. Gradient magnitude similarity deviation (GMSD~\cite{xue2014gmsd}) is another full-reference metric that utilizes the pixel-wise gradient magnitude similarity to predict the image quality. D-JNDQ~\cite{ak2022djndq} is a learning-based full-reference metric that is trained on the first just noticeable difference (JND) points of JPEG compression artefacts. It combines a white-box optical and retinal pathway model with a Siamese neural network to predict image quality. MW-PSNR~\cite{stankovic2015mwpsnrFR, stankovic2016mwpsnrRR} is based on morphological wavelet decomposition and the MSE of the wavelet sub-bands. Both the full-reference (MW-PSNR-FR) and reduced-reference (MW-PSNR-RR) versions are included in the evaluation.

The geometry-based metrics disregard the color information. The point-to-point~\cite{mekuria2016point2point} and point-to-plane~\cite{tian2017evaluation} metrics are the most commonly used in MPEG standardization activities. They measure, respectively, the distance between nearest points in the two point sets and the distance between a point and its nearest neighbor in the second set projected along the normal direction. Plane-to-plane~\cite{alexiou2018angular} measures the angular difference between the normals of the closest points. PCQM~\cite{meynet2020pcqm} focuses on finding the curvature of a predicted surface from a set of closest points with normals and estimates quality based on this curvature estimate. PointSSIM~\cite{evangelos2020pointssim} computes features from attributes such as geometry, normals, curvature, and colors. Once the features are calculated, they are fed into a similarity function and pooled together. As geometry-based attributes were found to be the second best (after the color-based attributes), PointSSIM Geom- metrics were included in this analysis.
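
For illustration, the three distance-based geometry metrics above can be sketched as follows. The notation ($A$, $B$, $b(a)$, $\mathbf{n}$) and the mean-squared pooling are assumptions made here, not necessarily the exact formulation of the cited implementations. Let $A$ and $B$ be the two point clouds, $b(a) \in B$ the nearest neighbor of $a \in A$, and $\mathbf{n}_{p}$ the normal associated with point $p$ (of unit length in the second expression). Then
\[
e_{\mathrm{point}}(A,B) = \frac{1}{|A|}\sum_{a \in A} \left\| a - b(a) \right\|_2^2,
\qquad
e_{\mathrm{plane}}(A,B) = \frac{1}{|A|}\sum_{a \in A} \left( \left(a - b(a)\right) \cdot \mathbf{n}_{b(a)} \right)^2 .
\]
A symmetric score is commonly obtained by taking the maximum of $e(A,B)$ and $e(B,A)$. For the plane-to-plane case, the unsigned angle between the normals of a point and its nearest neighbor is
\[
\theta(a) = \arccos\left( \frac{\left| \mathbf{n}_{a} \cdot \mathbf{n}_{b(a)} \right|}{\left\| \mathbf{n}_{a} \right\|_2 \left\| \mathbf{n}_{b(a)} \right\|_2} \right),
\]
so that $\theta(a) \in [0, \pi/2]$ and smaller angles indicate better agreement of the local surface orientation.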

The color-based metrics differ in nature, and two families were considered. The metrics named ``Color $<$channel name$>$ $<$pooling$>$'' measure the color difference between the closest points and disregard any geometrical information. PointSSIM~\cite{evangelos2020pointssim}, on the other hand, has two different modes and can take either a set distance or k-nearest neighbors. As color-based attributes were found to be the highest performing among all types of attributes, PointSSIM Color- metrics were included in this analysis.



\subsection{Evaluation Criteria}
\label{subsec:evaluation_criteria}

The whole dataset was used for the evaluation. For learning-based objective quality metrics, no training or fine-tuning was applied prior to evaluation. The performance of the objective quality metrics was evaluated with two main methods. The first analysis relies on traditional measures of the correlation between metric predictions and MOS. The second analysis relies on the statistical significance of the differences between pairs of stimuli. It measures metric performance based on the capability of identifying significantly different pairs.

\input{tables/MetricCatCorr.tex}

\textbf{Correlation measures:} Pearson's linear correlation coefficient (PLCC) measures the prediction accuracy of the objective metrics, whereas Spearman's rank-order correlation coefficient (SROCC) measures the strength of prediction monotonicity~\cite{itu2019recP1401}. Following the recommendations~\cite{itu2019recBT500, itu2019recP1401}, a 4-parameter polynomial function was fitted prior to evaluation. For both PLCC and SROCC, the reported values are in the range [0, 1] and higher values indicate a better correlation.
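
For reference, the two correlation measures can be written as follows, in notation assumed here for illustration. Let $y_i$ be the MOS of stimulus $i$, $x_i$ the corresponding metric prediction after fitting, $\bar{x}$ and $\bar{y}$ their means over the $N$ stimuli, and $d_i$ the difference between the ranks of $x_i$ and $y_i$. Then
\[
\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}},
\qquad
\mathrm{SROCC} = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N\left(N^2 - 1\right)},
\]
where the closed form for SROCC holds when there are no tied ranks. One plausible reading of the 4-parameter polynomial fit is a third-order polynomial mapping of the raw metric score $s_i$,
\[
x_i = \beta_0 + \beta_1 s_i + \beta_2 s_i^2 + \beta_3 s_i^3,
\]
with coefficients $\beta_0,\dots,\beta_3$ estimated by least squares against the MOS; the exact functional form is an assumption here. Any monotonic mapping of this kind leaves the magnitude of SROCC unchanged, so the fit mainly affects PLCC.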

\begin{figure}
    \centering
    \includegraphics[width=\columnwidth]{figures/krasula_ideal_distributions.png}
    \caption{Ideal distributions of metric score differences for the ``Different vs Similar'' and ``Better vs Worse'' analyses. A greater metric score difference is expected for different pairs in the ``Different vs Similar'' analysis. For the ``Better vs Worse'' analysis, metric score differences are expected to be positive for better pairs and negative for worse pairs.}
    \label{fig:ideal_distributions}
\end{figure}

\textbf{Krasula's method:} This method evaluates objective metric performance in two stages, namely ``Different vs Similar'' and ``Better vs Worse''. For the ``Different vs Similar'' analysis, pairs of PPCs from the dataset are split into two categories: pairs with (\textit{i.e., different}) and without (\textit{i.e., similar}) statistically significant differences. For a given pair of PPCs, one-way ANOVA followed by Tukey's honestly significant difference test~\cite{tukey1949honestsignftest} is used to determine the statistical significance of the difference. Krasula's method assumes that the absolute difference of metric predictions should be larger for different pairs than for similar pairs. To quantify the performance, Receiver Operating Characteristic (ROC) analysis is used and the performance of each metric is expressed as the Area Under the ROC Curve (AUC). The second stage uses the pairs that are identified as \textit{different} in the first stage. In the ``Better vs Worse'' stage, the aim is to measure how well a metric identifies the better PPC in pairs with a statistically significant difference. Performance in the ``Better vs Worse'' analysis can then be expressed as the correct classification percentage as well as AUC values, similar to the first stage.

Figure~\ref{fig:ideal_distributions} depicts the ideal distributions of the metric score differences for each stage of the analysis. In the ``Different vs Similar'' analysis, we expect higher metric score differences for ``Different'' pairs and lower ones for ``Similar'' pairs. In the ``Better vs Worse'' analysis, after fixing the order of the pairs as ``Better'' or ``Worse'', we expect positive metric score differences for ``Better'' pairs and negative ones for ``Worse'' pairs.
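
The quantities used in the two stages can be sketched as follows; the notation ($\Delta_p$, $\mathcal{D}$, $\mathcal{S}$, $\Delta^{\mathrm{MOS}}_p$) is introduced here for illustration and is not taken from the original method description. Let $\Delta_p$ be the metric score difference of pair $p$, and let $\mathcal{D}$ and $\mathcal{S}$ denote the sets of ``Different'' and ``Similar'' pairs. By the standard rank-based interpretation of the ROC curve, the first-stage AUC equals the probability that a randomly drawn ``Different'' pair has a larger absolute score difference than a randomly drawn ``Similar'' pair:
\[
\mathrm{AUC}_{\mathrm{DS}} = \frac{1}{|\mathcal{D}|\,|\mathcal{S}|} \sum_{p \in \mathcal{D}} \sum_{q \in \mathcal{S}} \left[ \mathbf{1}\left( |\Delta_p| > |\Delta_q| \right) + \tfrac{1}{2}\,\mathbf{1}\left( |\Delta_p| = |\Delta_q| \right) \right].
\]
Assuming that a higher metric score indicates higher quality, the correct classification rate of the second stage can be written as the fraction of ``Different'' pairs for which the sign of the metric score difference agrees with the sign of the MOS difference $\Delta^{\mathrm{MOS}}_p$:
\[
\mathrm{CC} = \frac{1}{|\mathcal{D}|} \sum_{p \in \mathcal{D}} \mathbf{1}\left( \operatorname{sign}(\Delta_p) = \operatorname{sign}\left(\Delta^{\mathrm{MOS}}_p\right) \right).
\]
An AUC of 0.5 corresponds to chance level in both stages, and a value of 1 to perfect separation.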

\subsection{Correlation Analysis Results}
\label{subsec:corr_analysis_results}



Table~\ref{tab:metric_cat_corr} presents the PLCC and SROCC of each metric. Metrics are grouped into the three categories discussed in Section~\ref{subsec:selected_metrics}. The first two columns present the metrics' PLCC and SROCC scores on the whole dataset. Metric performance was also evaluated for each compression algorithm individually, and these results are presented in the following columns, as indicated in the table header.

PCQM~\cite{meynet2020pcqm} performs best among the selected metrics on the whole dataset, despite its poor performance in predicting GeoCNN compression distortions. Among color-based metrics, we notice a similar drop in accuracy for GeoCNN compression distortions. PointSSIM variants perform relatively better than other metrics in this category.

Simple image-based metrics (\textit{e.g.}, MSE, PSNR, SSIM, MS-SSIM) have low accuracy across all compression categories and consequently on the whole dataset. VMAF shows the best performance among image-based metrics on the whole dataset. We also observe that image-based metrics generally lack accuracy on VPCC compression distortions.

To sum up, PCQM~\cite{meynet2020pcqm} performs best in predicting GPCC-Predlift, GPCC-Raht and VPCC compression distortions, whereas D-JNDQ~\cite{ak2022djndq} provides the highest accuracy on GeoCNN compression distortions.



\subsection{Results of the Analysis by Krasula's Method}
\label{subsec:krasulas_method_results}


Prior to the analysis, we pre-process the subjective scores as described in Krasula's method~\cite{krasula2016metriceval}. First, 20 PPCs were paired within each source point cloud (SRC), generating $(20\times(20-1)/2)$ pairs per SRC. In total, we end up with 14143 pairs. Thanks to the high number of PPCs in the dataset, we kept the analysis within each SRC. Afterwards, a one-way ANOVA test is applied to the individual scores collected for each stimulus in each pair, followed by Tukey's honestly significant difference test. Among the total of 14143 pairs, 5019 were identified as ``Similar'', whereas 9124 contain a statistically significant difference between the two PPCs and were thus identified as ``Different''. We split the ``Different'' pairs into two roughly equal-sized groups, ``Better'' and ``Worse'', depending on the order of the pair. There are 4075 ``Better'' and 5049 ``Worse'' pairs. We report the results of the analysis for the 10 top-performing metrics among the initial 40. The rest of the results can be found in the GitHub repository, which will be added after the ICIP 2023 Grand Challenge is concluded.
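
As a quick check of the pair counts above, the number of unordered pairs among 20 PPCs is
\[
\binom{20}{2} = \frac{20 \times (20-1)}{2} = 190 .
\]
If all 75 SRCs contributed exactly 20 PPCs, this would give $75 \times 190 = 14250$ pairs. The reported total of 14143 is slightly lower, which would be consistent with a few SRCs having fewer than 20 PPCs; this explanation is an inference and is not stated explicitly in the text. The reported splits are internally consistent:
\[
5019 + 9124 = 14143, \qquad 4075 + 5049 = 9124 ,
\]
so that roughly $64.5\%$ of all pairs are ``Different'' and roughly $35.5\%$ are ``Similar'', while the ``Different'' pairs divide into approximately $44.7\%$ ``Better'' and $55.3\%$ ``Worse''.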




\subsubsection{Different vs Similar Analysis}

\begin{figure*}
    \centering
    \includegraphics[width=\textwidth]{figures/krasula_different_similar.pdf}
    \caption{Metric score differences for pairs categorized as ``Different'' and ``Similar''. Metric score differences are normalized individually for each metric between its minimum and maximum values. The height of the bars denotes the number of occurrences, and the vertical axis of each plot ranges over [0, 1500]. Metric names are indicated at the top of each plot. Area under the curve (AUC) values are reported below each metric name.}
    \label{fig:different_similar_distributions}
\end{figure*}

Figure~\ref{fig:different_similar_distributions} presents the results of the analysis as histograms of metric score differences for ``Different'' and ``Similar'' pairs. We expect better performing metrics to provide metric score distributions similar to the ideal case depicted in Figure~\ref{fig:ideal_distributions}. Additionally, the performance of each metric is quantified with AUC values, reported under each metric name. In line with the results of the correlation analysis, we observe better performance from PCQM, which provides a higher AUC value and a distribution very similar to the ideal case. Statistical significance tests on this task also reveal that PCQM performs significantly better than all other metrics except ADM2 in the ``Different vs Similar'' task.



\subsubsection{Better vs Worse Analysis}

\begin{figure*}[!t]
    \centering
    \includegraphics[width=\textwidth]{figures/krasula_better_worse.pdf}
    \caption{Metric score differences for pairs categorized as ``Better'' and ``Worse''. Metric score differences are normalized individually for each metric between its minimum and maximum values. The height of the bars denotes the number of occurrences, and the vertical axis of each plot ranges over [0, 800]. Metric names are indicated at the top left corner of each plot. Area under the curve (AUC) and correct classification percentages (CC) are reported in the top right corner of each plot.}
    \label{fig:better_worse_distributions}
\end{figure*}

Similar to the previous stage, Figure~\ref{fig:better_worse_distributions} presents the results as histograms of metric score differences and quantifies the performance of each metric with AUC and CC values. We observe that most metrics perform relatively well in telling ``Better'' and ``Worse'' pairs apart. PCQM performs significantly better than all other metrics in this task.













\section{Conclusion}
\label{sec:conclusion}



We conducted a large-scale crowdsourcing study on point cloud compression quality assessment. To the best of our knowledge, this is the largest publicly available point cloud quality assessment dataset, containing 75 source point clouds, each compressed with 4 different point cloud coding algorithms, resulting in nearly 1500 processed point clouds. More than 3500 naive observers participated in the experiment.


Although most point cloud objective quality metrics accurately predict GPCC distortions, VPCC distortions still pose a challenge to the majority of the metrics. Moreover, most point cloud quality metrics fail to assess the quality of point clouds compressed with GeoCNN. There is clear room for improvement in assessing the distortions introduced by learning-based coding algorithms.

We expect that the created database will provide a solid foundation for further research on point cloud compression and point cloud quality assessment. This will in turn enable a better telepresence experience across different applications and platforms.

Point cloud compression and quality assessment are both active research topics, and the number of methods is growing rapidly. In the future, this database can be extended with more learning-based approaches to identify the challenges and limitations of ongoing research and development on compression and quality estimation.




\section*{Acknowledgments}

This work has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Grant Agreement No. 765911 (RealVision) and from Science Foundation Ireland (SFI) under the Grant Number 15/RP/27760.



{



