
How to read a large file in Java

Problem with reading a large file

A large file is any plain text or binary file that is too big to fit in JVM memory at once. For example, if a Java application is allocated 256 MB of memory and tries to load a file whose size is close to or more than that, it may throw an OutOfMemoryError.

Points to be remembered

  • Never read the whole file at once.
  • Read the file line by line or in chunks, for example a few lines at a time from a text file or a few bytes at a time from a binary file.
  • Do not keep all of the data in memory, for example by reading every line and holding them all as strings.

Java has many ways to read a file in chunks or line by line, such as BufferedReader, Scanner, BufferedInputStream and the NIO API. We will use the NIO API to read the file and Java streams to process it. We will also see how to spread the processing across multiple threads to make it faster.
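As a rough illustration of the points above, here is a minimal sketch of both styles of chunked reading. The class name ChunkedReadSketch, the method names and the 8 KB buffer size are assumptions made for this example and are not part of the original code.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class ChunkedReadSketch {

    // Text file: only the current line is held in memory at any time.
    static long countLines(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // Binary file: the same 8 KB buffer (an assumed size) is reused for every chunk.
    static long countBytes(Path file) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sample", ".txt");
        Files.write(file, "a\nb\nc\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(countLines(file)); // prints 3
        System.out.println(countBytes(file)); // prints 6
        Files.delete(file);
    }
}
```

In both methods memory use stays roughly constant regardless of file size, because nothing read from the file is retained after it has been counted.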

CSV file

In this example I am going to read a CSV file of around 500 MB. A sample is given below.
"2018-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",1,0,"Provisional"
"2019-06","2019-08","Long-term migrant","Arrivals","non-NZ","Student","Andorra",0,0,"Provisional"

File reading and counting year wise

We will read the CSV file and count the rows per year, taking the year from the first column. We will do it in two different ways: synchronously, and asynchronously using CompletableFuture. The code follows in the next sections.

Instance Variables

    private final long mb = 1024*1024;
    private final String file = "/Users/Downloads/sample.csv";

Common Methods

The method below collects the year-wise count in a map.
    public void yearCount(String line, Map<String, Integer> countMap){
        //each line starts with a quote, so characters 1 to 4 hold the year
        String key = line.substring(1, 5);
        if(countMap.containsKey(key)) {
            countMap.put(key, countMap.get(key)+1);
        } else {
            countMap.put(key, 1);
        }
    }
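The same counting step can be written more compactly with Map.merge. This is only an alternative sketch, not the author's method; the class and method names are made up for the example, and the sample lines are shortened versions of the CSV rows shown earlier.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class, for illustration only.
class YearKeySketch {

    // Each CSV line starts with a double quote, so the year is at indexes 1..4.
    static void countYear(String line, Map<String, Integer> countMap) {
        String key = line.substring(1, 5);
        // merge inserts 1 for a new key, otherwise adds 1 to the current value
        countMap.merge(key, 1, Integer::sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        countYear("\"2018-06\",\"2019-08\",\"Long-term migrant\"", counts);
        countYear("\"2019-06\",\"2019-08\",\"Long-term migrant\"", counts);
        System.out.println(counts); // prints {2018=1, 2019=1}
    }
}
```

Note that substring(1, 5) assumes every data line is at least five characters long and starts with a quoted year; a blank or malformed line would throw StringIndexOutOfBoundsException.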

I have annotated the method below with EventListener so that it is invoked automatically once the application is up and ready. It also measures memory consumption and execution time.
    //ApplicationReadyEvent is assumed here as the "up and ready" event
    @EventListener(ApplicationReadyEvent.class)
    public void testLargeFile() throws Exception{
        long premem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();
        long start = System.currentTimeMillis();

        System.out.println("Used memory pre run (MB): "+(premem/mb));
//        System.out.println("Year count: "+simpleYearCount(file));//process file synchronously and print details
//        System.out.println("Year count: "+asyncYearCount(file));//process file asynchronously and print details
        long postmem = Runtime.getRuntime().totalMemory()-Runtime.getRuntime().freeMemory();

        System.out.println("Used memory post run (MB): "+(postmem/mb));
        System.out.println("Memory consumed (MB): "+(postmem-premem)/mb);
        System.out.println("Time taken in MS: "+(System.currentTimeMillis()-start));
    }
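The measurement logic can also be pulled into a small reusable helper. The following is a sketch under assumptions: the class MeasureSketch and its methods are hypothetical, and the numbers it reports are approximate because the garbage collector may run at any point during the task.

```java
// Hypothetical helper class, for illustration only.
class MeasureSketch {
    static final long MB = 1024 * 1024;

    // Heap currently in use: total heap claimed by the JVM minus the free part of it.
    static long usedMemoryBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Runs the task and returns {elapsed milliseconds, change in used heap in MB}.
    static long[] measure(Runnable task) {
        long memBefore = usedMemoryBytes();
        long start = System.nanoTime();
        task.run();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        long memDeltaMb = (usedMemoryBytes() - memBefore) / MB;
        return new long[] {elapsedMs, memDeltaMb};
    }

    public static void main(String[] args) {
        long[] result = measure(() -> {
            // stand-in workload; replace with the file processing call
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("Time taken in MS: " + result[0]);
        System.out.println("Memory consumed (MB): " + result[1]);
    }
}
```

The memory delta can even be negative if a collection happens while the task is running, so it is best read as an indication rather than an exact figure.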

Synchronous processing 

Below is the code that reads the file using the NIO API and calculates the year count synchronously. The code is short and simple.
    public Map<String, Integer> simpleYearCount(String file) throws IOException {
        Map<String, Integer> yearCountMap = new HashMap<>();
        //Files.lines reads lazily; try-with-resources closes the file when done
        try (Stream<String> lines = Files.lines(Paths.get(file))) {
            lines.skip(1)//skip first line
                .forEach(s -> {
                    yearCount(s, yearCountMap);
                });
        }
        return yearCountMap;
    }


Used memory pre run (MB): 41
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 304
Memory consumed (MB): 262
Time taken in MS: 1971
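For comparison, the same synchronous count can be expressed with a stream collector instead of a hand-maintained map. This is a self-contained sketch, not the code that produced the output above: the class name, the tiny temporary CSV file and its header row are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class, for illustration only.
class StreamYearCountSketch {

    // Streams the file lazily and groups lines by the year in the first column.
    static Map<String, Long> yearCount(Path csv) throws IOException {
        try (Stream<String> lines = Files.lines(csv)) {
            return lines.skip(1) // skip the header line
                    .collect(Collectors.groupingBy(
                            line -> line.substring(1, 5),
                            TreeMap::new,
                            Collectors.counting()));
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("sample", ".csv");
        Files.write(csv, Arrays.asList(
                "\"year_month\",\"status\"",      // made-up header row
                "\"2018-06\",\"Provisional\"",
                "\"2019-06\",\"Provisional\"",
                "\"2018-07\",\"Provisional\""), StandardCharsets.UTF_8);
        System.out.println(yearCount(csv)); // prints {2018=2, 2019=1}
        Files.delete(csv);
    }
}
```

Only the map of counts is retained, so memory use depends on the number of distinct years rather than on the size of the file.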

Asynchronous processing 

Here we read the file using the NIO API and process it asynchronously using CompletableFuture: we read 10000 lines and hand them off for asynchronous processing, then the next 10000, and so on. See the code below.
    public Map<String, Integer> asyncYearCount(String file) throws IOException, InterruptedException, ExecutionException {
        try {
            List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();
            List<String> items = new ArrayList<>();
            try (Stream<String> lines = Files.lines(Paths.get(file))) {
                lines.skip(1)//skip first line
                    .forEach(s -> {
                        items.add(s);
                        if(items.size()%10000==0) {
                            //add completable task for each of 10000 rows
                            futures.add(CompletableFuture.supplyAsync(yearCountSupplier(new ArrayList<>(items), new HashMap<>())));
                            //start collecting the next chunk
                            items.clear();
                        }
                    });
            }
            if(items.size()>0) {
                //add completable task for remaining rows
                futures.add(CompletableFuture.supplyAsync(yearCountSupplier(items, new HashMap<>())));
            }
            return CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]))
                .thenApply(v -> {
                    //join all task to collect result after all tasks completed
                    Map<String, Integer> yearCountMap = new HashMap<>();
                    futures.stream().map(CompletableFuture::join).forEach(map -> {
                        //merge the result of all the tasks
                        map.forEach((key, val)->{
                            if(yearCountMap.containsKey(key)) {
                                yearCountMap.put(key, yearCountMap.get(key)+val);
                            } else {
                                yearCountMap.put(key, val);
                            }
                        });
                    });
                    return yearCountMap;
                }).get();
        } catch (IOException e) {
            e.printStackTrace();//report the failure and fall through to an empty result
        }
        return new HashMap<>();
    }

    //Supplier method to count the year in given rows
    public Supplier<Map<String, Integer>> yearCountSupplier(List<String> items, Map<String, Integer> map){
        return ()->{
            items.forEach(line -> yearCount(line, map));
            return map;
        };
    }


Used memory pre run (MB): 120
Year count: {2019=1178560, 2018=2775136, 2017=559632, 2016=250144, 2015=248192, 2014=144656}
Used memory post run (MB): 262
Memory consumed (MB): 142
Time taken in MS: 1549


We have now seen how to read and process a huge file, both synchronously and asynchronously. Comparing the two outputs, the async run was faster (1549 ms against 1971 ms). In these runs the measured memory consumption was also lower for the async version (142 MB against 262 MB), but each figure comes from a single run and depends on when the garbage collector happens to run, so treat them as rough. In general async execution may use more memory, because multiple threads are processing chunks at the same time. Async execution is likely to be more useful with bigger files, where the difference becomes significant.
If memory is tight, I would suggest synchronous execution; otherwise use async execution for better performance. You can still use async execution with less memory, but it may not be as beneficial because of the small chunks and the large number of tasks.
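One way to keep the number of worker threads under control is to pass an explicit executor to CompletableFuture.supplyAsync (without one, tasks run on the common ForkJoinPool). The sketch below is an illustration under assumptions, not the code measured above: the class name, the chunkSize and threads parameters, and the sample data are all made up.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

// Hypothetical class, for illustration only.
class BoundedAsyncYearCountSketch {

    // Counts the years in one chunk; runs on a pool thread.
    static Map<String, Integer> countChunk(List<String> chunk) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : chunk) {
            counts.merge(line.substring(1, 5), 1, Integer::sum);
        }
        return counts;
    }

    static Map<String, Integer> yearCount(Path csv, int chunkSize, int threads) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (Stream<String> lines = Files.lines(csv)) {
            List<CompletableFuture<Map<String, Integer>>> futures = new ArrayList<>();
            List<String> chunk = new ArrayList<>(chunkSize);
            Iterator<String> it = lines.skip(1).iterator(); // skip the header line
            while (it.hasNext()) {
                chunk.add(it.next());
                if (chunk.size() == chunkSize) {
                    List<String> full = chunk; // hand the full chunk to a task
                    futures.add(CompletableFuture.supplyAsync(() -> countChunk(full), pool));
                    chunk = new ArrayList<>(chunkSize);
                }
            }
            if (!chunk.isEmpty()) {
                List<String> rest = chunk;
                futures.add(CompletableFuture.supplyAsync(() -> countChunk(rest), pool));
            }
            // join waits for each task, then its partial counts are merged in
            Map<String, Integer> total = new TreeMap<>();
            for (CompletableFuture<Map<String, Integer>> future : futures) {
                future.join().forEach((year, count) -> total.merge(year, count, Integer::sum));
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Files.createTempFile("sample", ".csv");
        Files.write(csv, Arrays.asList(
                "\"year_month\",\"status\"",      // made-up header row
                "\"2018-06\",\"Provisional\"",
                "\"2019-06\",\"Provisional\"",
                "\"2018-07\",\"Provisional\""), StandardCharsets.UTF_8);
        System.out.println(yearCount(csv, 2, 2)); // prints {2018=2, 2019=1}
        Files.delete(csv);
    }
}
```

A fixed pool limits the threads, but not the number of queued chunks: if reading is much faster than processing, submitted chunks still pile up in memory, so the chunk size and the workload both matter when memory is tight.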

