
How to read a large file in Java

The problem with reading a large file

A large file is any plain text or binary file that is too big to fit in JVM memory at once. For example, if a Java application is allocated 256 MB of memory and it tries to load a file whose size is close to or larger than that, it may throw an OutOfMemoryError.
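For illustration, the kind of code that triggers this problem is a whole-file read such as the sketch below (using java.nio.file.Files; the path is the sample file used later in this post), which materializes the entire file in memory at once.
    //DO NOT do this for large files: the whole file must fit in the heap
    byte[] all = Files.readAllBytes(Paths.get("/Users/Downloads/sample.csv"));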

Points to be remembered

  • Never read the whole file at once.
  • Read the file line by line or in chunks, e.g. a few lines at a time from a text file or a few bytes at a time from a binary file.
  • Do not keep the whole data in memory, e.g. do not read all lines and hold them as strings.
Java offers many ways to read a file in chunks or line by line, such as BufferedReader, Scanner, BufferedInputStream and the NIO API. We will use the NIO API to read the file and Java streams to process it. We will also see how to spread the processing across multiple threads to make it faster.
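For comparison, a plain line-by-line read with BufferedReader (one of the alternatives mentioned above) would look like the sketch below, using the sample path from later in this post; only the current line is held in memory at any time.
    try (BufferedReader reader = new BufferedReader(new FileReader("/Users/Downloads/sample.csv"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            //process one line at a time, e.g. update counters
        }
    }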

CSV file

In this example I am going to read a CSV file that is around 500 MB in size. A sample is given below.
"year_month","month_of_release","passenger_type","direction","citizenship","visa","country_of_residence","estimate","standard_error","status"
"2018-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",1,0,"Provisional"
"2019-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",0,0,"Provisional"

File reading and counting year wise

We will read the CSV file and count the records per year using the first column. We will do it in two different ways: synchronously, and asynchronously using CompletableFuture. The code is shown in the next sections.

Instance Variables

    private final long mb = 1024*1024;//bytes per megabyte, used to convert memory figures
    private final String file = "/Users/Downloads/sample.csv";//path to the ~500 MB sample CSV

Common Methods

The method below collects the per-year counts in a map. The year key is taken from the first column of each line (characters 1 to 4, skipping the opening quote).
    public void yearCount(String line, Map<String, Integer> countMap){
        //extract the year from the first column, e.g. "2018-06" -> "2018"
        String key = line.substring(1, 5);
        if(countMap.containsKey(key)) {
            countMap.put(key, countMap.get(key)+1);
        } else {
            countMap.put(key, 1);
        }
    }
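As a side note, since Java 8 the same check-and-increment can be written in a single line with Map.merge, which behaves identically here:
        countMap.merge(key, 1, Integer::sum);//inserts 1 if the key is absent, otherwise adds 1 to the existing count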

The method below is annotated with @EventListener so that it is invoked automatically once the application is up and ready. It also measures memory consumption and execution time.
    @EventListener(ApplicationReadyEvent.class)
    public void testLargeFile() throws Exception{
        long premem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();
        long start = System.currentTimeMillis();

        System.out.println("Used memory pre run (MB): "+(premem/mb));
        //UNCOMMENT ONE OF THE TWO LINES BELOW AT A TIME TO TEST
        //THE DESIRED FUNCTIONALITY
//        System.out.println("Year count: "+simpleYearCount(file));//process file synchronously and print details
//        System.out.println("Year count: "+asyncYearCount(file));//process file asynchronously and print details
        
        long postmem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();

        System.out.println("Used memory post run (MB): "+(postmem/mb));
        System.out.println("Memory consumed (MB): "+(postmem-premem)/mb);
        System.out.println("Time taken in MS: "+(System.currentTimeMillis()-start));
    }

Synchronous processing 

Below is the code that reads the file using the NIO API and calculates the year count synchronously. The code is short and simple.
    public Map<String, Integer> simpleYearCount(String file) throws IOException {
        Map<String, Integer> yearCountMap = new HashMap<>();

        //try-with-resources ensures the underlying file handle is closed
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            lines.skip(1)//skip the header line
                 .forEach(s -> yearCount(s, yearCountMap));
        }

        return yearCountMap;
    }

Output

Used memory pre run (MB): 41
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 304
Memory consumed (MB): 262
Time taken in MS: 1971
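As a side note, the same per-year count can be expressed entirely with stream collectors. The minimal sketch below uses Collectors.groupingBy with Collectors.counting and the same substring key as yearCount; note that it yields a Map<String, Long> rather than a Map<String, Integer>.
    try (Stream<String> lines = Files.lines(Paths.get(file))) {
        Map<String, Long> yearCountMap = lines.skip(1)//skip the header line
            .collect(Collectors.groupingBy(
                line -> line.substring(1, 5),//same year key as yearCount()
                Collectors.counting()));
    }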

Asynchronous processing 

Here we are going to read the file using the NIO API and then process it asynchronously using CompletableFuture. For example, we read 10,000 lines, hand them off for asynchronous processing, then read the next 10,000, and so on. See the code below.
    public Map<String, Integer> asyncYearCount(String file) throws IOException, InterruptedException, ExecutionException {
        List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();

        List<String> items = new ArrayList<>();
        //try-with-resources ensures the underlying file handle is closed
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            lines.skip(1)//skip the header line
            .forEach(line->{
                items.add(line);
                if(items.size()==10000) {
                    //submit an async task for each batch of 10000 rows
                    futures.add(CompletableFuture.supplyAsync(yearCountSupplier(new ArrayList<>(items), new HashMap<>())));
                    items.clear();
                }
            });
        }
        if(items.size()>0) {
            //submit an async task for the remaining rows
            futures.add(CompletableFuture.supplyAsync(yearCountSupplier(items, new HashMap<>())));
        }
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]))
            .thenApply(v->{
                //join all tasks to collect their results once every task has completed
                return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
            })
            .thenApply(maps->{
                //merge the partial results of all the tasks
                Map<String, Integer> yearCountMap = new HashMap<>();
                maps.forEach(map->map.forEach((key, val)->yearCountMap.merge(key, val, Integer::sum)));
                return yearCountMap;
            })
            .get();
    }
    //Supplier that counts the years in the given rows
    public Supplier<Map<String, Integer>> yearCountSupplier(List<String> items, Map<String, Integer> map){
        return ()->{
            items.forEach((line)->{
                yearCount(line,map);
            });
            return map;
        };
    }

Output

Used memory pre run (MB): 120
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 262
Memory consumed (MB): 142
Time taken in MS: 1549
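By default, CompletableFuture.supplyAsync runs its tasks on ForkJoinPool.commonPool(). If you want to control the number of worker threads, the overload supplyAsync(Supplier, Executor) accepts your own executor. A minimal sketch with a hypothetical fixed pool of 4 threads:
    ExecutorService pool = Executors.newFixedThreadPool(4);//hypothetical pool size
    futures.add(CompletableFuture.supplyAsync(yearCountSupplier(new ArrayList<>(items), new HashMap<>()), pool));
    //shut the pool down once all futures have completed
    pool.shutdown();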

Conclusion

We have now seen how to read and process a huge file, both synchronously and asynchronously. Comparing the outputs of the two runs shows that async execution is faster, although it may consume more memory because multiple threads are processing data at the same time. Async execution pays off most with heavier files, where the difference becomes significant.
If you have little memory to spare, I would suggest synchronous execution; otherwise use async execution for better performance. You can still use async execution with limited memory, but the benefit may be small due to the small chunks and the number of threads involved.


It's very common to have the DTO class for a given entity in any application. When persisting data, we use entity objects and when we need to provide the data to end user/application we use DTO class. Due to this we may need to have similar properties on DTO class as we have in our Entity class and to share the data we populate DTO objects using entity objects. To do this we may need to call getter on entity and then setter on DTO for the same data which increases number of code line. Also if number of DTOs are high then we need to write lot of code to just get and set the values or vice-versa. To overcome this problem we are going to use Jackson API and will see how to do it with minimal code only. Maven dependency <dependency> <groupId>com.fasterxml.jackson.core</groupId> <artifactId>jackson-databind</artifactId> <version>2.9.9</version> </dependency> Entity class Below is