How To DIY Site Search Analytics Using Athena – Part 2

This article continues the work on our analytics application from Part 1. You will need to read Part 1 to understand the content of this post. “How To DIY Site search analytics — Part 2” will add the following features to our application:

How-To Site-Search Analytics follow these steps

  1. Upload CSV files containing our E-Commerce KPIs in a way easily readable by humans.
  2. Convert the CSV files into Apache Parquet format for optimal storage and query performance.
  3. Upload Parquet files to AWS S3, optimized for partitioned data in Athena.
  4. Make Athena aware of newly uploaded data.

CSV upload to begin your DIY for Site Search Analytics

First, create a new Spring-Service that manages all file operations. Open your favorite IDE where you previously imported the application in part 1 and create a new class called FileService in the package com.example.searchinsightsdemo.service and paste in the following code:

				
					@Service
public class FileService {

    private final Path uploadLocation;

    public FileService(ApplicationProperties properties) {
        this.uploadLocation = Paths.get(properties.getStorageConfiguration().getUploadDir());
    }

    public Path store(MultipartFile file) {
        String filename = StringUtils.cleanPath(file.getOriginalFilename());
        try {
            if (file.isEmpty()) {
                throw new StorageException("Failed to store empty file " + filename);
            }
            if (filename.contains("..")) {
                // This is a security check, should practically not happen as
                // cleanPath is handling that ...
                throw new StorageException("Cannot store file with relative path outside current directory " + filename);
            }
            try (InputStream inputStream = file.getInputStream()) {
                Path filePath = this.uploadLocation.resolve(filename);
                Files.copy(inputStream, filePath, StandardCopyOption.REPLACE_EXISTING);
                return filePath;
            }
        }
        catch (IOException e) {
            throw new StorageException("Failed to store file " + filename, e);
        }
    }

    public Resource loadAsResource(String filename) {
        try {
            Path file = load(filename);
            Resource resource = new UrlResource(file.toUri());
            if (resource.exists() || resource.isReadable()) {
                return resource;
            }
            else {
                throw new StorageFileNotFoundException("Could not read file: " + filename);

            }
        }
        catch (MalformedURLException e) {
            throw new StorageFileNotFoundException("Could not read file: " + filename, e);
        }
    }

    public Path load(String filename) {
        return uploadLocation.resolve(filename);
    }

    public Stream<Path> loadAll() {
        try {
            return Files.walk(this.uploadLocation, 1)
                    .filter(path -> !path.equals(this.uploadLocation))
                    .map(this.uploadLocation::relativize);
        }
        catch (IOException e) {
            throw new StorageException("Failed to read stored files", e);
        }

    }

    public void init() {
        try {
            Files.createDirectories(uploadLocation);
        }
        catch (IOException e) {
            throw new StorageException("Could not initialize storage", e);
        }
    }
				
			

That’s quite a lot of code. Let’s summarize the purpose of each relevant method and how it helps To DIY Site Search Analytics:

  • store: Accepts a MultipartFile passed by a Spring Controller and stores the file content on disk. Always pay extra attention to security vulnerabilities when dealing with file uploads. In this example, we use Spring’s StringUtils.cleanPath to guard against relative paths, to prevent someone from navigating up our file system. In a real-world scenario, this would not be enough. You’ll want to add more checks for proper file extensions and the like.
  • loadAsResource: Returns the content of a previously uploaded file as a Spring Resource.
  • loadAll: Returns the names of all previously uploaded files.

To not unnecessarily inflate the article, I will refrain from detailing either the configuration of the upload directory or the custom exceptions. As a result, please review the packages com.example.searchinsightsdemo.config, com.example.searchinsightsdemo.service and the small change necessary in the class SearchInsightsDemoApplication to ensure proper setup.

Now, let’s have a look at the Spring Controller. Using the newly created service, create a Class FileController in the package com.example.searchinsightsdemo.rest and paste in the following code:

				
					@RestController
@RequestMapping("/csv")
public class FileController {

    private final FileService fileService;

    public FileController(FileService fileService) {
        this.fileService = fileService;
    }

    @PostMapping("/upload")
    public ResponseEntity<String> upload(@RequestParam("file") MultipartFile file) throws Exception {
        Path path = fileService.store(file);
        return ResponseEntity.ok(MvcUriComponentsBuilder.fromMethodName(FileController.class, "serveFile", path.getFileName().toString()).build().toString());
    }

    @GetMapping("/uploads")
    public ResponseEntity<List<String>> listUploadedFiles() throws IOException {
        return ResponseEntity
                .ok(fileService.loadAll()
                        .map(path -> MvcUriComponentsBuilder.fromMethodName(FileController.class, "serveFile", path.getFileName().toString()).build().toString())
                        .collect(Collectors.toList()));
    }

    @GetMapping("/uploads/{filename:.+}")
    @ResponseBody
    public ResponseEntity<Resource> serveFile(@PathVariable String filename) {
        Resource file = fileService.loadAsResource(filename);
        return ResponseEntity.ok()
                .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=\"" + file.getFilename() + "\"").body(file);
    }

    @ExceptionHandler(StorageFileNotFoundException.class)
    public ResponseEntity<?> handleStorageFileNotFound(StorageFileNotFoundException exc) {
        return ResponseEntity.notFound().build();
    }
}
				
			

Nothing special. We just provided request mappings to;

  1. Upload a file
  2. List all uploaded files
  3. Serve the content of a file

This will ensure the appropriate use of the service methods. Time to test the new functionality, start the Spring Boot application and run the following commands against it:

				
					# Upload a file:
curl -s http://localhost:8080/csv/upload -F file=@/path_to_sample_application/sample_data.csv

# List all uploaded files
curl -s http://localhost:8080/csv/uploads

# Serve the content of a file
curl -s http://localhost:8080/csv/uploads/sample_data.csv
				
			

The sampledata.csv file can be found within the project directory. However, you can also use any other file.