1 delete orphan files#35
Conversation
|
|
||
| // Remove orphans. | ||
| OrphanFileScanner orphanFileScanner = new OrphanFileScanner(table); | ||
| orphanFileScanner.removeOrphanedFiles(olderThanMillis, false); |
There was a problem hiding this comment.
we may want a shorter ttl here
| import software.amazon.awssdk.services.s3.model.ListObjectsV2Response; | ||
|
|
||
| public class OrphanFileScanner { | ||
| private static final Logger LOG = LoggerFactory.getLogger(OrphanFileScanner.class); |
There was a problem hiding this comment.
called logger in the rest of the codebase
| String bucket = location.replace("s3://", "").split("/")[0]; | ||
| String prefix = location.replace("s3://" + bucket + "/", ""); | ||
|
|
||
| S3Client s3 = S3.newClient(true); |
There was a problem hiding this comment.
com.altinity.ice.cli.internal.* shouldn't be used here; true indicates "anonymous" access -> won't work;
can we use table io here (so that impl would be not tied to s3)?
| deletedFiles.forEach(f -> LOG.info("Deleted: {}", f)); | ||
| } | ||
|
|
||
| executor.shutdownNow(); |
There was a problem hiding this comment.
we already called shutdown on line 120
| return allFiles; | ||
| } | ||
|
|
||
| public void removeOrphanedFiles(long olderThanMillis, boolean dryRun) throws IOException { |
There was a problem hiding this comment.
based on current impl this method should be executed only when warehouse is s3 (not s3 tables, etc)
|
|
||
| // Set<String> orphanedFiles = new HashSet<>(); | ||
|
|
||
| allFiles.remove(knownFiles); |
There was a problem hiding this comment.
isn't this supposed to be removeAll?
| } | ||
| } | ||
|
|
||
| return knownFiles; |
There was a problem hiding this comment.
The result looks incomplete. Inserting a single file into empty table yields
./metadata/00000-f62c927f-3965-42e9-8f3a-57809ca49f55.metadata.json
./metadata/4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1-m0.avro
./metadata/snap-7186194791324723249-1-4b00c8c2-6d25-4ba5-bf27-989fbdb9daa1.avro
./metadata/00001-07fe7fa7-8fe1-4666-bd7b-7c88b9c861fe.metadata.json
./data/pickup_datetime_month=2009-01/1750715797525-2d861d7a4e5196bf04e5c3be506b0152f4170cf4e941424c8245b46903c3e252-part.parquet
i.e. we're going to nuke a bunch of critical files and break the catalog...
| knownFiles.add(dataFile.path().toString()); | ||
| } | ||
| } catch (Exception e) { | ||
| logger.error("Error getting list of data files", e); |
There was a problem hiding this comment.
network error here leads to us deleting files from catalog that are not supposed to be deleted
…eve snapshot metadata JSON files for orphan file deletion.
|
Tested by dropping a new file |
| return; | ||
| } | ||
|
|
||
| if (dryRun) { |
There was a problem hiding this comment.
dryRun should probably a config variable, so they can try before toggling it?
|
|
||
| if (orphanFileExpirationDays == 0) { | ||
| logger.info( | ||
| "Skipping orphan file removal for table {} since orphanFileExpirationDays config was not set", |
There was a problem hiding this comment.
maybe this can be avoided for every table if we have a separate iteration, not sure if this is the best option.
|
we should probably default to
to match https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-table-buckets-maintenance.html |
Setting default to 10 days for |
| FileIO io = table.io(); | ||
|
|
||
| TableOperations ops = ((BaseTable) table).operations(); | ||
| TableMetadata meta = ops.current(); | ||
|
|
||
| String currentMetadataFile = meta.metadataFileLocation(); | ||
| // Current metadata json file | ||
| knownFiles.add(currentMetadataFile); | ||
|
|
||
| // All the previous metadata JSON(is there a chance there might be | ||
| // json files that are not physically present?. | ||
| for (TableMetadata.MetadataLogEntry previousFile : meta.previousFiles()) { | ||
| knownFiles.add(previousFile.file()); | ||
| } |
There was a problem hiding this comment.
same files are added on each loop iteration
There was a problem hiding this comment.
you have a single snapshot; add another one to repro. but take a look at the selected code - it's fairly obvious the code is not connected to the outer for loop
| for (DataFile dataFile : files) { | ||
| knownFiles.add(dataFile.path().toString()); | ||
| } | ||
| } catch (Exception e) { |
| // All the previous metadata JSON(is there a chance there might be | ||
| // json files that are not physically present?. | ||
| for (TableMetadata.MetadataLogEntry previousFile : meta.previousFiles()) { | ||
| knownFiles.add(previousFile.file()); |
There was a problem hiding this comment.
the result is missing statistics & partition statistics files
|
|
||
| // All the previous metadata JSON(is there a chance there might be | ||
| // json files that are not physically present?. | ||
| for (TableMetadata.MetadataLogEntry previousFile : meta.previousFiles()) { |
There was a problem hiding this comment.
files referenced by previous manifests are also missing
|
A version of this with #35 (review) addressed was included in 0.5.0. |


closes:#1