A script to quickly find duplicate files in a project

As a project accumulates more and more files, the generated APK keeps growing. Whether some of those files are duplicates is worth checking, since removing them can reduce the APK size.

Searching by hand is not realistic because it is far too time-consuming, so we use a script to check the project for duplicate files.

Script ideas

  • Compute the MD5 checksum of each file's content
  • Keep a dictionary whose keys are MD5 values and whose values are lists of the file paths with that checksum
  • If a file path list in the dictionary contains more than one path, those files are duplicates
  • From the number of duplicate files and their sizes, calculate the total amount of space that can be saved
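The steps above can be sketched in a few lines of Ruby. This is a minimal illustration rather than the script itself: the helper names (`group_by_md5`, `duplicate_groups`, `bytes_saved`) and the demo file names are hypothetical, chosen only for this example.

```ruby
require 'digest/md5'
require 'tmpdir'

# Build the dictionary: { md5 => [paths with that content] }.
def group_by_md5(paths)
  paths.group_by { |path| Digest::MD5.hexdigest(File.binread(path)) }
end

# A path list with more than one entry means duplicate files.
def duplicate_groups(paths)
  group_by_md5(paths).values.select { |group| group.size > 1 }
end

# One copy of each group has to stay, so n identical files save (n - 1) * size bytes.
def bytes_saved(groups)
  groups.sum { |group| File.size(group.first) * (group.size - 1) }
end

# Demo on throwaway files (hypothetical names and contents).
Dir.mktmpdir do |dir|
  contents = { 'a.txt' => 'same', 'b.txt' => 'same', 'c.txt' => 'other' }
  paths = contents.map do |name, content|
    path = File.join(dir, name)
    File.write(path, content)
    path
  end
  groups = duplicate_groups(paths)
  puts "duplicate groups: #{groups.size}, bytes saved: #{bytes_saved(groups)}"
end
```

In the demo, `a.txt` and `b.txt` share the same 4-byte content, so there is one duplicate group and 4 bytes could be saved.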

Script content

#!/usr/bin/env ruby
# encoding: utf-8

require 'find'
require 'digest/md5'

# Usually the path of the project
targetDirToSearch = ARGV[0]

# md5 value => list of file paths with that content
$hashesFiles = {}
$sizeCanBeSaved = 0

def getFileMd5Checksum(file)
  return Digest::MD5.hexdigest(File.read(file))
end

# Only regular files outside .git/, .gradle/ and .idea/ are checked
def shouldCheckThisFile(f)
  isFile = File.file?(f)
  isGitFile = f.include? ".git/"
  isGradleFile = f.include? ".gradle/"
  isIdeFile = f.include? ".idea/"
  return isFile && !isGitFile && !isGradleFile && !isIdeFile
end

def getFilesByMd5(md5Value)
  existingFiles = $hashesFiles[md5Value]
  if (existingFiles == nil)
    existingFiles = []
  end
  return existingFiles
end

def recordFile(f)
  md5 = getFileMd5Checksum(f)
  $hashesFiles[md5] = getFilesByMd5(md5).push(f)
end

# Print each group of duplicates, smallest files first, and add up the savings
def printHashesFiles()
  $hashesFiles.values.select { |array| array.size > 1 }
    .sort_by { |files| File.size(files[0]) }
    .each { |array|
      fileSize = File.size(array[0])
      puts "Duplicated files size = #{format_mb(fileSize)}"
      array.each { |f| puts f }
      $sizeCanBeSaved += fileSize * (array.size - 1)
      puts ""
    }
end

# Format a byte count with a readable unit
def format_mb(size)
  conv = ['b', 'kb', 'mb', 'gb', 'tb', 'pb', 'eb']
  scale = 1024

  ndx = 1
  if (size < 2 * (scale ** ndx)) then
    return "#{(size)} #{conv[ndx - 1]}"
  end
  size = size.to_f
  [2, 3, 4, 5, 6, 7].each do |ndx|
    if (size < 2 * (scale ** ndx)) then
      return "#{'%.3f' % (size / (scale ** (ndx - 1)))} #{conv[ndx - 1]}"
    end
  end
  ndx = 7
  return "#{'%.3f' % (size / (scale ** (ndx - 1)))} #{conv[ndx - 1]}"
end

def getFileSize(f)
  return format_mb(File.size(f))
end

def start(dirToSearch)
  Find.find(dirToSearch)
    .select { |f| shouldCheckThisFile(f) }
    .each { |f|
      puts "Checking file #{f}"
      recordFile(f)
    }
  printHashesFiles()
  puts "Size can be saved #{format_mb($sizeCanBeSaved)}"
end

start(targetDirToSearch)
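The script hashes each file by reading it completely into memory with `File.read`. For projects with very large files, a possible variation is to let Ruby's `Digest` library read the file in chunks through `Digest::MD5.file`. The sketch below only illustrates that the two approaches give the same checksum; the helper names `streaming_md5` and `whole_file_md5` are hypothetical and not part of the original script.

```ruby
require 'digest/md5'
require 'tempfile'

# Hash a file without loading all of it into memory at once.
def streaming_md5(path)
  Digest::MD5.file(path).hexdigest
end

# Read-everything approach, comparable to what the script does.
def whole_file_md5(path)
  Digest::MD5.hexdigest(File.binread(path))
end

# Demo on a temporary file: both approaches agree on the checksum.
Tempfile.create('sample') do |f|
  f.write('hello' * 1000)
  f.flush
  puts streaming_md5(f.path) == whole_file_md5(f.path) # prints true
end
```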

Results

MacBook-Pro-8:~/Documents/AndroidProjects/EasyHybridApp (master|✚3) % findDuplicatedFiles.rb ./
Checking file ./.gitignore
Checking file ./EasyHybridApp.iml
Checking file ./app/.gitignore
Checking file ./app/app.iml
Checking file ./app/build.gradle
Checking file ./app/proguard-rules.pro
Checking file ./app/src/androidTest/java/com/droidyue/easyhybridapp/ExampleInstrumentedTest.kt
Checking file ./app/src/main/AndroidManifest.xml
Checking file ./app/src/main/java/com/droidyue/easyhybridapp/AppInfo.kt
Checking file ./app/src/main/res/layout/activity_main.xml
Checking file ./app/src/main/res/values/colors.xml
Checking file ./app/src/main/res/values/strings.xml
Checking file ./app/src/main/res/values/styles.xml
Checking file ./app/src/main/res/xml/network_security_config.xml
Checking file ./app/src/test/java/com/droidyue/easyhybridapp/ExampleUnitTest.kt
Checking file ./build.gradle
Checking file ./common/.gitignore
Checking file ./common/build.gradle
Checking file ./common/common.iml
Checking file ./common/consumer-rules.pro
Checking file ./common/proguard-rules.pro
Checking file ./common/src/androidTest/java/com/droidyue/common/ExampleInstrumentedTest.kt
Checking file ./common/src/main/AndroidManifest.xml
Checking file ./common/src/main/java/com/droidyue/common/ClassExt.kt
Checking file ./common/src/main/java/com/droidyue/common/ConfirmDialogExt.kt
Checking file ./common/src/main/java/com/droidyue/common/ContextExt.kt
Checking file ./webview/src/androidTest/java/com/droidyue/webview/ExampleInstrumentedTest.kt
Checking file ./webview/src/main/java/com/droidyue/webview/webviewclient/PageRequestWebViewClient.kt
Checking file ./webview/src/main/java/com/droidyue/webview/webviewclient/WhitelistLaunchingIntentWebViewClient.kt
Checking file ./webview/src/main/res/layout/activity_webview.xml
Checking file ./webview/src/main/res/raw/deeplink_whitelist.json
Checking file ./webview/src/main/res/values/strings.xml
Checking file ./webview/src/test/java/com/droidyue/webview/ExampleUnitTest.kt
Checking file ./webview/webview.iml
Duplicated files size = 0 b
./common/consumer-rules.pro
./webview/consumer-rules.pro

Duplicated files size = 7 b
./app/.gitignore
./common/.gitignore
./webview/.gitignore

Duplicated files size = 272 b
./app/src/main/res/mipmap-anydpi-v26/ic_launcher.xml
./app/src/main/res/mipmap-anydpi-v26/ic_launcher_round.xml

Duplicated files size = 751 b
./app/proguard-rules.pro
./common/proguard-rules.pro
./webview/proguard-rules.pro

Size can be saved 1788 b
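The final figure can be checked by hand: each group keeps one copy, so a group of n identical files saves (n - 1) times the file size. The short sketch below redoes that sum with the sizes and copy counts taken from the output above; the constant and method names are hypothetical.

```ruby
# [file size in bytes, number of identical copies], taken from the output above.
DUPLICATE_GROUPS = [[0, 2], [7, 3], [272, 2], [751, 3]].freeze

# Each group keeps one copy, so it saves size * (copies - 1) bytes.
def total_saving(groups)
  groups.sum { |size, copies| size * (copies - 1) }
end

puts total_saving(DUPLICATE_GROUPS) # 0 + 14 + 272 + 1502, prints 1788
```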

Running advice

  • Run a clean before executing the script, such as ./gradlew clean for Android projects, so that generated build output is not scanned
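If running a clean first is inconvenient, a possible alternative is to skip build output while walking the tree. The sketch below uses `Find.prune` from Ruby's standard library to avoid descending into selected directories. It is not part of the original script: the method name `files_to_check` is hypothetical, and treating a directory named `build` as build output is an assumption that may not hold for every project.

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Directory names to skip entirely. 'build' is an assumption about where
# build output lives; the other three mirror the script's own exclusions.
SKIPPED_DIRS = %w[.git .gradle .idea build].freeze

# Collect regular files under root without descending into skipped directories.
def files_to_check(root)
  files = []
  Find.find(root) do |path|
    if File.directory?(path)
      Find.prune if SKIPPED_DIRS.include?(File.basename(path))
    elsif File.file?(path)
      files << path
    end
  end
  files
end

# Demo on a throwaway tree (hypothetical layout).
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'app', 'src'))
  FileUtils.mkdir_p(File.join(root, 'app', 'build'))
  File.write(File.join(root, 'app', 'src', 'Main.kt'), 'fun main() {}')
  File.write(File.join(root, 'app', 'build', 'output.txt'), 'generated')
  puts files_to_check(root).map { |f| f.sub("#{root}/", '') }
end
```

In the demo only `app/src/Main.kt` is listed, because the walk never enters `app/build`.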


This article is reprinted from https://droidyue.com/blog/2022/05/15/how-to-find-duplicated-file-via-one-script/
This site is for inclusion only, and the copyright belongs to the original author.
