Three ways to deduplicate vector in STL

Original link: https://ifmet.cn/posts/be0b5490/

Brief description: Deduplicate the elements in std::vector , where the elements are custom structure types. Provide three ideas, with detailed examples and analysis

C++ unique function causes vector to change

c++ vector deduplication

custom structure

[TOC]

This article was originally published on ” Xie Zang’s Small Station “, and is reproduced here simultaneously.

background

In the business, it is necessary to deduplicate the samples passed into Vector ; this article uses the C++ STL library throughout, and does not call the Qt package. The business abstracts a concrete example.

Build Environment: ? win10 21H2 ? Visual Studio 2019

De-heavy thinking

The search found that std::vector is divided into two types, one basic data type and the other custom structure type.

  1. basic data type

    SO offers three solutions, the second is recommended. See What’s the most efficient way to erase duplicates and sort a vector?

  2. custom struct type

    In actual projects, this kind of thing we encounter more often is what is needed. The rest of the text focuses on this.

solution

For custom structure types, when using STL’s set , sort , and unique , it is necessary to customize their strictly weakly ordered comparison function or equality function (represented by _Pr here).

There are three implementations of the comparison function _Pr :

  1. Overloading the < function in a custom structure is simple, see Code Mode 1. slightly
  2. Define a comparison function , the most commonly used (also defined by Lambda);
  3. Using the function object is also simple, it is a variant of 2, and a layer of class or struct is placed outside the custom function.

PS: Because of the structure in the actual project, there is no and cannot overload operator< , so 1 is not available. Then decided to use the 2 way.

And Type 2 recommends using the conventional way of declaring and defining functions, rather than using the Lambda way of defining; the reason is that the latter will fail in some algorithms, because .

“One” vector, sort + unique

Idea: first use sort to sort, make the duplicate items adjacent, and then use the unique attribute to deduplicate. Note the unique deduplication feature.

 // 核心代码sort ( vec . begin ( ) , vec . end ( ) , cmpSort ) ; auto ite = std :: unique ( vec . begin ( ) , vec . end ( ) , cmpUnique ) ; vec . erase ( ite , vec . end ( ) ) ;

But there are plenty of articles in there, so here’s a full example:

 # include <iostream> # include <string> # include <algorithm> # include <vector> # include <set> using namespace std ; struct MyData { wstring name ; wstring md5 ; } ; // std::sort 的自定义比较函数bool cmpSort ( const MyData & d1 , const MyData & d2 ) { return d1 . md5 < d2 . md5 ; } ; // std::unique 的自定义比较函数bool cmpUnique ( const MyData & d1 , const MyData & d2 ) { return d1 . md5 == d2 . md5 ; } ; int main ( ) { vector < MyData > vec = { { L "a1.exe" , L "B9204B6362A65C248950D5A41F341DCC" } , { L "a2.exe" , L "B9204B6362A65C248950D5A41F341DCC" } , { L "b1.exe" , L "C96D1AA5D314954417C28A645ED72665" } , { L "b2.exe" , L "C96D1AA5D314954417C28A645ED72665" } } ; wcout . imbue ( locale ( "" , LC_CTYPE ) ) ; wcout << L "----------调整前--------" << endl ; for ( const auto & it : vec ) wcout << L "[it.name]" << it . name << L "[it.md5]" << it . md5 << endl ; sort ( vec . begin ( ) , vec . end ( ) , cmpSort ) ; auto ite = std :: unique ( vec . begin ( ) , vec . end ( ) , cmpUnique ) ; vec . erase ( ite , vec . end ( ) ) ; wcout << L "----------调整后--------" << endl ; for ( const auto & it : vec ) wcout << L "[it.name]" << it . name << L "[it.md5]" << it . md5 << endl ; } 

Run and see the results, it seems to be perfectly as expected? ? ? Is it really so?

20220813131959.png

In fact, it will cause a memory leak, because it is caused by the custom cmpUnique() function used, which is not really equal to the two structure objects, and stackoverflow.com can also prove it [1]

20220813133846.png

Try to modify cmpUnique() as follows, there will be no more memory leaks, but this function is also invalid, and it does not have the effect of deduplication, you can debug and check the effect yourself

 bool cmpUnique ( const MyData & d1 , const MyData & d2 ) { //return d1.md5 == d2.md5; return d1 . md5 == d2 . md5 && d1 . name == d2 . name ; } ;

The actual situation is much more complicated: the actual MyData structure is the Bundle structure in Microsoft’s COM component, and the number and content of property items of each ** Bundle object** may be different, with only a small overlap. The requirement is to only need to compare whether the contents of two or three attributes are completely consistent to determine whether the two objects are the same duplicate file. It also does not provide an interface to directly judge the equality of two Bundle objects . It’s a hell of a lot, Help, Help, Help! ! !

Based on the above situation, it is decided to abandon this method and use “Method 2” instead, and finally achieve the purpose

“Second” vector + set (manual assignment)

Idea: It is very simple, first import the vector data into the set, use its singleness to remove duplication, and then assign the data to the vector, a perfect idea, and there will be no memory leak in the first method. The difficulty is to sort and assign a custom type to the object that creates the std::set. Refer to Appendix [2] Method 3 for the solution to its difficulties

Among them, the custom sorting function cmpSort() of std::set is divided into whether it is defined outside the class or inside the class. If the former is even simple, if the latter, it must be defined as a static member function .

cmpSort() is defined outside the Class

 // 核心代码set < MyData , decltype ( cmpSort ) * > s ( & cmpSort ) ; // 此种方式比较少见,且第二个& 若想省略亦可,则编译器亦会自动添加for ( unsigned i = 0 ; i < vec . size ( ) ; ++ i ) s . insert ( vec [ i ] ) ; vec . assign ( s . begin ( ) , s . end ( ) ) ; 

cmpSort() is defined as a Class member variable

Attachment: Complete source code DeDuplication

 // 核心代码set < MyData , decltype ( DeDuplication :: cmpSort ) * > s ( & DeDuplication :: cmpSort ) ; // 本行定义为重点for ( unsigned i = 0 ; i < m_vec . size ( ) ; ++ i ) s . insert ( m_vec [ i ] ) ; m_vec . assign ( s . begin ( ) , s . end ( ) ) ;

Among them, the cmpSort() of Class DeDuplication must be of static type, which is defined as follows.

 # pragma once # include <iostream> # include <string> # include <algorithm> # include <vector> # include <set> using namespace std ; struct MyData { wstring name ; wstring md5 ; } ; class DeDuplication { private : static bool cmpSort ( const MyData & d1 , const MyData & d2 ) { // ***** 注意,必须是静态***** // Simple example return d1 . md5 < d2 . md5 ; // Complex sample //if (d1.md5 != d2.md5) // return d1.md5 < d2.md5; //else // return d1.name < d2.name; } ; public : void finally ( ) { wcout . imbue ( locale ( "" , LC_CTYPE ) ) ; wcout << L "----------调整前--------" << endl ; for ( const auto & it : m_vec ) wcout << L "[it.name]" << it . name << L "[it.md5]" << it . md5 << endl ; set < MyData , decltype ( DeDuplication :: cmpSort ) * > s ( & DeDuplication :: cmpSort ) ; for ( unsigned i = 0 ; i < m_vec . size ( ) ; ++ i ) s . insert ( m_vec [ i ] ) ; m_vec . assign ( s . begin ( ) , s . end ( ) ) ; wcout << L "----------调整后--------" << endl ; for ( const auto & it : m_vec ) wcout << L "[it.name]" << it . name << L "[it.md5]" << it . md5 << endl ; } ; private : vector < MyData > m_vec = { { L "a1.exe" , L "B9204B6362A65C248950D5A41F341DCC" } , { L "a2.exe" , L "B9204B6362A65C248950D5A41F341DCC" } , { L "b1.exe" , L "C96D1AA5D314954417C28A645ED72665" } , { L "b2.exe" , L "C96D1AA5D314954417C28A645ED72665" } } ; } ; 

If it is a common type, it will cause the std::set object to compile, but the error is as follows, and the meaning is referred to devblogs.microsoft.com [3] .

 Error ( active ) E0289no instance of constructor "std::set<_Kty, _Pr, _Alloc>::set [with _Kty=MyData, _Pr=bool (DeDuplication::**)(const MyData &d1, const MyData &d2), _Alloc=std::allocator<MyData>]" matches the argument listDeDuplication 

“Three” vector + set (constructor)

Its performance is not as good as “method 2”, omitted, please refer to the third type of Ref [1] .

Summarize

Finally, combined with the actual situation, it is most appropriate to use the method of defining static in the class of the second method to solve this problem. When there is time, we will continue to study the next method, and see if the custom cmpUnique() modification meets the requirements of strict weak order, whether there is no memory leak, and it can perfectly meet the requirements. And attach the complete source code

  1. For the out-of-class definitions of methods 1 and 2, see the project Unique for the complete source code
  2. For the in-class definition of method 2, see the project DeDuplication of the complete source code

Serial address

QtExamples 【DeDuplication】

Welcome to star ⭐ and fork ? this series of C++ / QT / DTK learning, with a list of learning from shallow to deep.

Ref

[1] : eliminating duplicates from vector of custom object – std:unique causes crash on free

[2] : C++ set custom sorting

[3] : Error C3867: non-standard syntax; use ‘&’ to create a pointer to member: What it means and how to fix it

This article is reproduced from: https://ifmet.cn/posts/be0b5490/
This site is for inclusion only, and the copyright belongs to the original author.

Leave a Comment